Vātāyana

Vātāyana is a novel system for extracting expert philological insights from a corpus of Sanskrit philosophy texts. For detailed info about it, you can

  1. visit the web app and play around,

  2. flip through Part II of my dissertation, or

  3. watch the second half of my practice defense talk.

In this post, I give the five-minute version of where the project came from, what it is, and where it’s going.

When I first started the work that would develop into Vātāyana, I was looking for the same thing a lot of Sanskrit people had been looking for, for a long time: a better way to search through a large corpus. In particular, my senior peers in the Nyāyabhāṣya project had a long list of philosophy texts they wanted to explore for intertextual connections, but there was a long way to go:

  1. The data was not yet collected in one place, much less standardized or cleaned.

  2. There were no particular NLP techniques in mind for realizing the search idea.

  3. There was no particular design for an interface.

Taking these in turn:

The story of the data collection and cleaning turned out to be that of Pramāṇa NLP (see separate post).

As for NLP techniques for search, it turns out that an entire subfield of computer science, information retrieval 🫠, offers lots of different components that can be combined in different ways for different purposes. In this case, I ended up combining three techniques:

  1. LDA topic modeling was suggested by my officemate Thomas Köntges, who had just made not one but two new topic-modeling tools focused on Ancient Greek and Latin. Once the release of the Sanskrit Sandhi and Compound Splitter (at last! an effective splitter with a Python interface) had basically solved my data-preprocessing problems, I was able to conduct some first legit experiments and see a very promising signal (illustrated in Table 8 of this paper), namely: topic similarity alone was able to identify some parallel passages in my Sanskrit corpus!

  2. TF-IDF, a very standard measure of relative word frequency, was also able to reveal lexical matches between passages. Like LDA topic modeling, it treats each passage as a bag of words, but it emphasizes exact matches of relatively rare words. Comparing passages this way is more computationally expensive than comparing topic distributions, but the results seem to correlate better with the kind of relatively verbatim search that Sanskritists are most often interested in.

  3. Finally, local text alignment (specifically, a variant of Smith-Waterman) was found to be good at emphasizing contiguous runs of matching words, with some room for fuzziness, though it’s even more computationally expensive than TF-IDF. (A toy sketch of this kind of alignment follows this list.)
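For the curious, here is a minimal, self-contained sketch of word-level local alignment in the Smith-Waterman style. To be clear, this is not Vātāyana’s actual implementation, and the scoring parameters (match, mismatch, gap) are hypothetical; it just shows how contiguous runs of matching words earn high scores while gaps and substitutions are tolerated at a penalty:

```python
# Toy word-level Smith-Waterman local alignment.
# Illustrative only: the scoring values are hypothetical, not Vātāyana's.

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between word lists a and b."""
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # restart here: local, not global
                H[i - 1][j - 1] + sub,  # match or substitution
                H[i - 1][j] + gap,      # gap in b
                H[i][j - 1] + gap,      # gap in a
            )
            best = max(best, H[i][j])
    return best

p1 = "pratyakṣam anumānam upamānam śabdaḥ ca".split()
p2 = "pratyakṣam anumānam śabdaḥ iti".split()
print(smith_waterman(p1, p2))  # long contiguous runs dominate the score
```

Since filling the score matrix takes time proportional to the product of the two passage lengths, it’s easy to see why this is the most expensive step of the three.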

I used all three of these techniques by chaining them together into a series of progressively fine-grained filters: only the best topic-similarity results are further compared with TF-IDF, and only the best of those TF-IDF results are compared with local alignment (sketched below). I arrived at this particular algorithm organically as I tried to calibrate the system to find the parallels I already knew were there in my text, based on previously published philological scholarship; I also wanted the system to work efficiently in real time, because I didn’t yet want to rely on a database of pre-computed TF-IDF and alignment scores. (My dissertation ultimately included measuring system performance against just such a benchmark.) It was only later, deep in the write-up phase, when I was reading more widely about similar projects, that I realized how common this sort of setup is for search/recommendation tasks: not only the overall progressive-filtering approach but also the particular component techniques of TF-IDF and alignment.
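Here’s a rough sketch of that cascade in Python, mirroring the shape of the algorithm rather than the actual Vātāyana code: the function name find_parallels, the cutoffs k1 and k2, and the helper topic_vec (assumed to return a document’s topic distribution as a 1×T array from a pre-trained LDA model) are all hypothetical, and smith_waterman is the toy alignment function sketched above.

```python
# Hypothetical sketch of the progressive-filtering cascade (not actual code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_parallels(query, corpus, topic_vec, k1=100, k2=10):
    # Stage 1 (cheap): rank the whole corpus by topic similarity; keep top k1.
    by_topic = sorted(
        corpus,
        key=lambda d: cosine_similarity(topic_vec(query), topic_vec(d))[0, 0],
        reverse=True,
    )[:k1]

    # Stage 2 (pricier): re-rank survivors by TF-IDF similarity; keep top k2.
    tfidf = TfidfVectorizer().fit([query] + by_topic)
    q_vec = tfidf.transform([query])
    by_tfidf = sorted(
        by_topic,
        key=lambda d: cosine_similarity(q_vec, tfidf.transform([d]))[0, 0],
        reverse=True,
    )[:k2]

    # Stage 3 (priciest): score only the few finalists with local alignment.
    return sorted(
        by_tfidf,
        key=lambda d: smith_waterman(query.split(), d.split()),
        reverse=True,
    )
```

The design choice is the usual one in retrieval: spend the cheap computation everywhere and the expensive computation only where the cheap signal says it might pay off.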

As for an interface: in fact, that’s basically all Vātāyana was at first. You see, I actually built an interface before the three-component search system described above existed at all. What was in it, you might ask? Essentially, I was re-implementing my colleague Thomas’s project Metallō, in Python instead of Golang, and with the following additional things in mind:

  1. I wanted to be able to easily exclude certain corpus texts from consideration for topic similarity comparison.

  2. I wanted to be able to move within the “textual space” along two axes: not only across textual boundaries according to topic similarity, as Metallō does, but also, and simply, sequentially back and forth within any given text.

  3. I wanted to include visual aids for understanding topics directly in the same interface: not only word breakdown per topic (there is a nice tool for this called LDAvis, also included in Thomas’s other project ToPān, which helps with model training) but also topic breakdown per document (which I implemented myself; a toy sketch of such a chart follows this list).
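To give a rough idea of what I mean by “topic breakdown per document,” here’s a toy sketch (with made-up topic names and proportions; not Vātāyana’s actual plotting code) that renders one document as a single bar segmented by topic proportion:

```python
# Toy "topic breakdown per document" chart with made-up numbers.
import matplotlib.pyplot as plt

topic_props = {"Topic 3": 0.41, "Topic 7": 0.27, "Topic 1": 0.18, "other": 0.14}

fig, ax = plt.subplots(figsize=(6, 1.5))
left = 0.0
for name, prop in topic_props.items():
    ax.barh(0, prop, left=left, label=name)  # one stacked segment per topic
    left += prop
ax.set_xlim(0, 1)
ax.set_yticks([])
ax.set_xlabel("proportion of document")
ax.legend(ncol=4, fontsize="small")
plt.show()
```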

All these original features still exist in Vātāyana if you know where to look. Respectively: 1. the textPrioritize settings menu; 2. any single side of the docCompare view; and 3. topicVisualizeLDAvis + the left side of docExplore. You could think of this peculiar focus on LDA as the older archeological stratum.

And then, layered on top of that as the intertextuality-detection mission progressed, I expanded Vātāyana to include the search-result list in docExplore’s “similar docs” table and the inter-text highlighting of docCompare. Later (having finally admitted to myself that it was the better idea), I added database support for faster processing and more impressive search performance, which in turn enabled a batch-search mode. (Screenshots below.)

docExplore view, with topic analysis on left and overall search results on right

close-up of docExplore single-document search results

multi-document (aka “whole-book”) batch-search results

Fun fact: database support basically backfired and has crippled the site ever since I implemented it in advance of my presentation in Vienna in April 2023; the site breaks stochastically all the time. It turns out that coordinating PythonAnywhere and MongoDB Atlas was more complicated than I had hoped 😔. But good news! I have a fix in the works, involving a migration to different infrastructure. Look for another update soon!

EDIT (2024 Apr 22): The bug should hopefully be fixed now! The web app and its database are now running on a DigitalOcean Droplet 💧, which should be more stable than before.

Going forward, there are also a few other bugs I know about; it’s mainly a matter of finding the time. In substance, though, I think Vātāyana is really cool as-is. What I’d like to do next, besides maintaining and refactoring the code, is to educate people about it, slowly grow and improve its data (e.g., adding more Upaniṣads) by expanding the underlying Pramāṇa NLP project, and hopefully one day, when I’m more comfortable spinning the entire system up and down, also deploy separate instances of the project focused on other genres. I doubt such a system would work very well across all genres of Sanskrit literature at once, but I could be wrong. We’ll just have to try! Reach out if you’re interested.
