GRETIL: Past and Present

Jul 6

For quite a few years now, GRETIL has been the go-to place for simple Sanskrit e-texts of all sorts. Its advantages are:

academic-level quality;
minimal encoding requirements (close to plain-text); and
breadth of coverage (all genres and high quantity).

Its main disadvantages are:

frequent lack of metadata (e.g., source editions);
formatting that is sometimes too messy and variable;
frequently insufficient structure; and
insufficient versioning to track changes.

And worst of all, sadly, as of 2020, GRETIL is stalled and no longer being updated (see its “Update History”). The website is still up and fully functional, but the situation is unfortunate nonetheless, because it is no longer possible to add or modify anything. Let’s review the project’s basic history (“past”) and statistics (“present”) before talking about what to do next (“future”).

GRETIL’s Past

GRETIL was started by Reinhold Grünendahl at Göttingen as a “register”, not a repository, as indicated in its name (see also the small Wikipedia entry for more references). That is, rather than storing e-texts on its own server, it focused on simply linking out to such documents on the web, wherever else they might be: TITUS, Sansknet, Sanskrit Documents, Indology E-Texts, Gaudiya Grantha Mandira, Digital Sanskrit Buddhist Canon, etc. etc., not to mention numerous personal or departmental websites. Naturally, there was no unifying format or mission for these various documents. In this form, GRETIL was easy to “add to”, in that this simply entailed GRETIL’s becoming “aware” of the new document, so to speak.

Over time, however, due to its expanding popularity, GRETIL shifted to being the first place the author of a new e-text might want to put it so that it would have the widest audience. This is undoubtedly related to the fact that any given repository (especially a less popular one) has the unfortunate tendency to disappear after a while. That meant that GRETIL needed to embrace a new identity of being a repository in its own right. Its popular mass-download feature allows users to quickly get ahold of everything for use on their own machine. But this requires GRETIL to have downloaded copies of all the “registered” files on its own server. Naturally this creates the possibility that source files can be updated on their own servers in their respective corners of the internet without GRETIL being notified. And this creates a further incentive for GRETIL to be the actual holder of the primary source file. As a measure of the extent to which the various erstwhile sources are largely irrelevant now, an automated check of the external links on the GRETIL site reveals that almost exactly half of these links (384/772, or about 49.7%) are broken (the Sansknet files being the most obvious case for Sanskrit users), and this is bothering basically no one. In summary, over time, the GRETIL copies, including the copies downloaded to people’s individual machines at various times, became the primary form of this data in the world. And despite the activity of several other interesting and ambitious projects like DCS and SARIT, GRETIL has remained the dominant all-purpose source for academic Sanskrit e-texts.
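For readers curious what an automated link check of this kind might look like, here is a minimal Python sketch using only the standard library. It is not the script behind the 384/772 figure; the function names (`external_links`, `is_alive`, `count_broken`) and the "any response below HTTP 400 counts as alive" rule are assumptions made for illustration.

```python
import http.client
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def external_links(html, own_host):
    """Return unique absolute http(s) links that point away from own_host."""
    collector = LinkCollector()
    collector.feed(html)
    unique = []
    for link in collector.links:
        parsed = urlparse(link)
        is_web = parsed.scheme in ("http", "https")
        if is_web and parsed.netloc != own_host and link not in unique:
            unique.append(link)
    return unique


def is_alive(url, timeout=10):
    """Assumed rule: a response with status < 400 is alive, anything else is broken."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, DNS failures, timeouts, and malformed URLs.
        return False


def count_broken(urls, check=is_alive):
    """Return (broken, total) for the given URLs."""
    broken = sum(1 for url in urls if not check(url))
    return broken, len(urls)
```

The `check` parameter is there so that the tallying can be tried out without touching the network. Some servers reject HEAD requests, so a real run would probably want a GET fallback before declaring a link broken.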

Then, as mentioned, in 2020, the collection stopped in its tracks. Around this same time, perhaps as part of an effort to jumpstart it, Maximillian Mehner, a graduate student working with Grünendahl at Göttingen, made an effort to get GRETIL on GitHub, specifically with TEI-XML as the primary encoding and HTML and TXT as secondary transformations. This could have brought numerous benefits, like versioning and a standard for including metadata, but the effort was unfortunately short-lived. The repo has not changed since August 2021, and most (if not all) users still resort to the static HTML transforms on the website.

For more context, let’s quickly review the various alternatives to GRETIL over the years. The closest overall for openness and accessibility is probably Sanskrit Documents, a useful collection especially for devotional texts, but not generally associated with scientific rigor. On the other end of the spectrum, SARIT — whose repo on GitHub (containing 83 texts) is more complete than the now-basically-obsolete web-app (with only 60 texts) — is so rigorous, including in its use of TEI-XML, that most feel discouraged from contributing to it and aren’t able to use the text data directly. Similarly, the Digital Corpus of Sanskrit is impressive in many ways, especially for its analyzed and openly available data, but it is not set up to accept new contributions. From somewhat longer ago, TITUS and Sanskrit Library used to be bigger players, and they still have their audiences, but their decidedly web-1.0 appearances and limited interoperability with other projects are significant obstacles to their continued use. Indology.info no longer maintains its own e-text collection as it once did (Jan 2021), but it does still remain a venerable place for listing other such resources. And last but not least, there are various other substantial and important specialty collections: Muktabodha (for Śaiva content), Digital Sanskrit Buddhist Canon (largely absorbed into GRETIL, actually), UT Austin’s Resource Library for Dharmaśāstra Studies, Andrew Ollett’s Prakrit Digital Text Project, and quite a few more. But nothing has ever quite matched GRETIL’s balance of breadth, overall quality, and simplicity of access.

Given its central position, it is primarily GRETIL that in recent years has served as a data source for numerous corpus-scale computational projects. Many of these projects are not formally connected to academia, and they often take the form of reading environments that offer large amounts of text augmented with additional functionalities like sentence analysis, dictionary entries, translation, and so on. Notable examples include BuddhaNexus (now on its way to being reborn as Dharmamitra), Ambuda, Sanskrit Sahitya, Wisdom Library, and more. Some of these projects are making changes/improvements to the text data as they go; see for example Sanskrit Sahitya’s note here and the substantial digitization effort of Ambuda here. Similarly, Vishvas Vasuki has been maintaining a massive “raw_etext” meta-collection for a few years, and I think this also involves changing content. Even assuming these changes are being made in scientifically principled ways (not a given!), I haven’t yet seen any plans for propagating these changes beyond the GitHub pages of these projects. Unsurprisingly, then, it seems profitable and exciting to use this Sanskrit data, but far fewer seem to want the unglamorous job of taking care of it en masse.

GRETIL’s Present

Now let’s take stock of what exactly GRETIL is today. For this exercise, BuddhaNexus’s usage of GRETIL (see here) provides a useful basis for understanding the latter’s scope and structure. Also note that I’m using file size here as a rough-and-ready proxy for text size, but this proxy becomes less reliable the more XML markup a file carries.

How many works are in the GRETIL Sanskrit corpus? The mass-download contains approximately 1,300 files, if you exclude the TEI forms straightaway and count only the .htm files. Further excluding niche documents with extra analysis (e.g., BHELA-style sandhi splitting, metrical markup, etc.), we’re left with a little over 1,000 files. To arrive at a count of “unique works” represented, we must further exclude alternate versions (e.g., based on different editions) and additional files used to split up large works (e.g., the 18 books of the Mahābhārata). This is far from trivial to work out, especially given the frequently complex nature of “root texts” and “commentaries”. For example, if GRETIL has only one file for a root text + commentary in one case, but two files in another case (i.e., an additional “extracted root text”), we probably should normalize this — I would argue in the direction of separation in cases where separate authorship of these parts is relatively certain. Also, when the individual parts of large works have distinct characters, should they not be counted separately, especially when tiny stotra texts might themselves count as one? So, maybe counting unique works is not actually very useful. That being said, I’ve gone through, looking at all titles and comparing contents, and I come up with a rough count of 700 unique works.

Before controlling for uniqueness, the mass-download is about 350 MB. Controlling for uniqueness as explained above reduces this number by half, to about 175 MB. Again, this rules out multiple competing versions of the same work, e.g., as based on different editions. Clearly, these alternate versions may have scientific value. However, for the sake of comparing with other collections that do not have such redundancy issues, thinking in terms of uniqueness does have its place.
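The raw file count and total size are easy to reproduce on a local copy of the mass-download. The following is a minimal sketch, assuming the archive has been unpacked into some local folder; the function name `summarize_htm` is hypothetical, and megabytes are taken as 1,000,000 bytes. The uniqueness filtering described above is manual work and is not attempted here.

```python
from pathlib import Path


def summarize_htm(root):
    """Return (file_count, total_megabytes) for all .htm files under root."""
    # rglob walks subfolders too; "*.htm" does not match .html or .xml files.
    files = [p for p in Path(root).rglob("*.htm") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes / 1_000_000
```

Calling `summarize_htm` on the unpacked folder gives the "before uniqueness" figures; any further reduction depends on which files one decides to treat as alternate versions or parts of a larger work.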

So, now, using these reduced numbers, let’s compare against other corpora, including average file sizes. (Notes: For Muktabodha, I control for uniqueness in the same way as for GRETIL. For DCS numbers, I exported “text” lines from the GitHub .conllu files plus some basic identifier info and measured that result. For SARIT and its heavy XML markup, in total 132.8 MB, I estimated plain-text equivalents by extrapolating from the case of MBh, which is 36.8 MB on SARIT vs. 15.0 MB on GRETIL.)

Corpus	Total (MB)	Unique works	Avg work (MB)
GRETIL	175	700	0.25
Muktabodha	125	430	0.29
DCS	56.7	257	0.22
SARIT	54.2	83	0.65
UT Austin Dharmaśāstra	31.3	33	0.95
Pramāṇa NLP	18.8	49	0.38
DSBC	9.9	231	0.04
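The arithmetic behind the table can be checked in a few lines of Python. This is only a sketch: the totals and work counts are copied from the table, the helper names are hypothetical, and the SARIT estimate follows the extrapolation described in the notes (scaling the marked-up total by the plain-to-marked-up ratio of the MBh).

```python
# (corpus, total MB, unique works), copied from the table above
CORPORA = [
    ("GRETIL", 175, 700),
    ("Muktabodha", 125, 430),
    ("DCS", 56.7, 257),
    ("SARIT", 54.2, 83),
    ("UT Austin Dharmaśāstra", 31.3, 33),
    ("Pramāṇa NLP", 18.8, 49),
    ("DSBC", 9.9, 231),
]


def average_work_mb(total_mb, works):
    """Average size of one work, rounded to two decimals as in the table."""
    return round(total_mb / works, 2)


def plain_text_estimate(marked_up_mb, sample_marked_up_mb, sample_plain_mb):
    """Scale a marked-up total by the plain/marked-up ratio of one sample work."""
    return marked_up_mb * sample_plain_mb / sample_marked_up_mb


if __name__ == "__main__":
    for name, total, works in CORPORA:
        print(name, average_work_mb(total, works))
    # SARIT: 132.8 MB of TEI-XML, scaled by the MBh ratio (15.0 MB vs. 36.8 MB)
    print(plain_text_estimate(132.8, 36.8, 15.0))
```

With the rounded MBh sizes quoted in the notes, the SARIT estimate comes out at roughly 54.1 MB, close to the 54.2 MB in the table; the small gap presumably reflects rounding in the quoted inputs.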

These numbers show that:

individual works in GRETIL (and Muktabodha and Digital Corpus of Sanskrit) are of medium size compared to other collections;
Muktabodha is overall on the same order of magnitude as GRETIL(!);
individual works in SARIT (even after controlling for XML markup) and in the UT Austin Dharmaśāstra collection are considerably larger than average; and
Digital Sanskrit Buddhist Canon (DSBC) is dominated by quite small texts.

Next, let’s talk about the categorical structure of GRETIL. It’s fairly clear already in the folder structure of the mass download, but further elucidation is provided by BuddhaNexus, which highlights GRETIL’s large amount of Buddhist material by breaking it out into its own new categories (see explanation here). Here’s a percentage breakdown by category, both in terms of size and file counts (this time without the uniqueness normalization described above).

It’s handy to compare a given color-coded section across these two charts. If the file-count slice is larger than the file-size slice, that shows that the average file size is relatively small (e.g., Buddhist Non-Scripture); if the file-count slice is smaller, that shows that the average file size is relatively large (e.g. epic).
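The comparison between the two kinds of share can be sketched in Python as follows. The function name `category_shares` and the input format (one `(category, size_in_bytes)` pair per file) are assumptions made for illustration, and the numbers in the usage note are toy values, not GRETIL data.

```python
from collections import defaultdict


def category_shares(files):
    """Map each category to (percent of file count, percent of total size).

    files: iterable of (category, size_in_bytes) pairs, one per file.
    """
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for category, size in files:
        counts[category] += 1
        sizes[category] += size
    total_count = sum(counts.values())
    total_size = sum(sizes.values())
    if total_count == 0 or total_size == 0:
        return {}
    return {
        category: (
            100 * counts[category] / total_count,
            100 * sizes[category] / total_size,
        )
        for category in counts
    }
```

For instance, with one 9,000-byte file in a category called "epic" and two 500-byte files in a category called "stotra" (made-up values), "epic" gets about a third of the file count but 90% of the size, which is the "smaller count slice, larger size slice" pattern described above for categories with large average files.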

There are blind spots. Much Buddhist- and other pramāṇa-śāstra needs to be sought elsewhere, like in SARIT (hence the creation of Pramāṇa NLP to bring such resources together). For the vast amount of Śaiva literature, Muktabodha is an incredible supplemental resource. For Dharmaśāstra, the UT Austin collection is surely underestimated. And Jaina literature, Jain Quantum notwithstanding, is sadly not represented really anywhere yet in a usable form. All told, though, GRETIL’s general representation of Sanskrit literature as a whole is still unparalleled.

And format-wise, as mentioned, there is no unifying standard for GRETIL data. Beyond the simple HTML structure (metadata above, text below, but all in the <body> element), users are simply expected to expect the unexpected.

GRETIL’s Future…

This part gets more audacious, and the current post is long enough already, so I’ll continue in a subsequent post… 😉

Tyler Neill
