OCR options
EDIT (Jun 25): TL;DR: Google Cloud Vision API is the state of the art, but it’s hard to set up. To make use of it without setting it up yourself, just message me with your PDF, and I’ll run it for you. I’m happy to do this as a favor to the community. If there’s a significant cost incurred, then we can talk reimbursement (max $1.50 per 1,000 pages), but I’ll make that clear ahead of time.
Many students and scholars of Sanskrit need to turn images, most often PDFs, into searchable text. Aside from having teams of humans manually transcribe what they see (a fine option in some circumstances, especially for handwritten manuscripts that technology may not yet be up to), the standard technological solution is OCR, or optical character recognition. In my experience, despite lots of interest and many projects in recent years, Sanskrit OCR has remained confusing for a lot of users, so I thought I would write to explain the top options out there. Read to the end of this post, and you will have a few concrete ways forward. 🚀
Option 1 (spoiler alert! this is the one to beat):
Google Cloud Vision API. When I last seriously looked around in 2018–2019, I found that this was the best option out there for Sanskrit OCR. Now, more than five years later, it apparently still is. Having set it up, you can put PDFs (even large ones!) in a folder, press go, and wait a few minutes for the high-quality OCR results, with no need to specify Sanskrit or whatever other language you’re working with. It’s honestly amazing. I was sad when my Vision setup stopped working a few years ago, but my modest personal needs for OCR did not seem to warrant fixing it.
Recently, though, I came across a very helpful how-to guide that Andrew Ollett kindly put together at prakrit.info/dh/ocr.html. In it, Andrew uses a small but instructive example image of printed Sanskrit text to compare Google Cloud Vision, which achieves 97% accuracy in the exercise, against three other options: Google Drive, SanskritCR, and Tesseract (all actually Google tech in one way or another; see below). He then gives concrete instructions for getting set up and running with Vision yourself, including the Python interface. These instructions are clear, straightforward, and correct (I'm back up and running again, hooray!), but they will probably be difficult for non-programmers to complete. Also, Vision costs $1.50 per 1,000 pages (after the first 1,000 pages in a given month, which are complimentary).
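To give a flavor of what the Python side involves: a Vision call returns a fullTextAnnotation organized as pages > blocks > paragraphs > words > symbols, which you then flatten into plain text yourself. Below is a minimal illustrative sketch of that flattening step only; the sample_response dict is a hand-made stand-in for a real response (which in the actual google-cloud-vision library is a protobuf object, not a dict), and the API call itself is omitted — see Andrew's guide for the full setup.

```python
# Walk a Vision fullTextAnnotation-style structure (pages > blocks >
# paragraphs > words > symbols) and join the symbols into plain text.
# The dict below is a hand-made stand-in, not real API output.

def flatten_annotation(annotation: dict) -> str:
    lines = []
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                words = []
                for word in paragraph.get("words", []):
                    words.append("".join(s["text"] for s in word.get("symbols", [])))
                lines.append(" ".join(words))
    return "\n".join(lines)

sample_response = {
    "pages": [{
        "blocks": [{
            "paragraphs": [{
                "words": [
                    {"symbols": [{"text": "अ"}, {"text": "थ"}]},
                    {"symbols": [{"text": "योगः"}]},
                ]
            }]
        }]
    }]
}

print(flatten_annotation(sample_response))  # अथ योगः
```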
Option 2:
Google Drive, which actually has two interfaces:
The point-and-click, free-to-use “Open with > Google Docs” command in Google Drive (official instructions here). Andrew estimates its accuracy at 93%, and I’ve used this method profitably for years. Two pro tips if you go this route: First, the 2 MB limit means that you generally do need to split up large documents, into differently sized batches depending on the particular file. Second, I’ve found that if you upload a PDF that already has an OCR text layer, Drive will simply give you that pre-existing OCR data back. My preprocessing workflow therefore involves exporting PDFs to JPGs, which drops the other layers of data, and then combining the JPGs into a PDF again 🙄. That way, the software will actually look at the images.
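The batch-splitting step can be automated with a few lines of code. The helper below is hypothetical (not part of any Google tool): it greedily groups page images into batches that each stay under the 2 MB limit, which is why batch sizes end up varying from file to file.

```python
# Greedily group page-image file sizes (in bytes) into batches that each
# stay under Google Drive's 2 MB upload limit. Hypothetical helper; since
# page sizes vary, the batches come out differently sized per document.

LIMIT = 2 * 1024 * 1024  # 2 MB

def batch_pages(sizes: list[int], limit: int = LIMIT) -> list[list[int]]:
    batches, current, total = [], [], 0
    for i, size in enumerate(sizes):
        if current and total + size > limit:
            batches.append(current)  # flush the full batch
            current, total = [], 0
        current.append(i)  # store page indices
        total += size
    if current:
        batches.append(current)
    return batches

# e.g. three ~0.9 MB scans: the first two fit together, the third overflows
print(batch_pages([900_000, 900_000, 900_000]))  # [[0, 1], [2]]
```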
The Drive API, which essentially automates the “Open with > Google Docs” option above, along with uploading, downloading, etc. This is what SanskritCR uses.
As can be seen from Andrew’s comparison (and I can confirm this from my own experiments), the results of Google Cloud Vision and Google Drive are very close but not quite the same. Vision is definitely better, and I believe faster too.
Option 3:
Tesseract. This system was originally developed at HP; Google took over its development in 2006 and still oversees it. Because it is open source, many projects have incorporated it, for example OCRmyPDF. It is also the OCR used by the Internet Archive (see here). Andrew reported at the time of his write-up that “out-of-the-box” results with Tesseract were unsatisfactory (e.g., 78% accuracy on his example), but also that he was using a version potentially older than Tesseract 4. I’ve seen the Internet Archive using 4.1.1 (e.g., in the “Ocr” metadata for this item), and I find Internet Archive OCR results pretty close to unusable for Sanskrit. I’ve also installed Tesseract locally and tried it with the v4 models, which are the latest and best available. Nope. It does not compare to Google whatsoever. The LSTM architecture Tesseract uses apparently just doesn’t hold a candle to whatever proprietary CNN+ system Google is using.
Note that there are interesting attempts to train Tesseract further for Sanskrit (see 2018 here and 2022 here, the latter with an interesting focus on post-correction), and perhaps these will beat Google Vision someday. The appeal is sticking to open source where possible, especially for large-scale projects, where the cost of proprietary software becomes prohibitive. But I would argue that for small- or medium-sized projects, it is actually most cost-effective, in both money and time, to use the best mainstream option at a non-zero cost rather than go to the effort of setting up a more bespoke solution.
Like Andrew, I’ll also leave aside for now the other options out there like Adobe, Abbyy, and so on. I will mention, though, that Oliver Hellwig’s SanskritOCR, which like other items in his “ind.senz” line of software runs on Windows only, does also produce serviceable results (of roughly the same quality as Tesseract, I believe because it is actually a wrapper for some version of it), and it certainly deserves honorable mention for its historical significance; I understand that it was instrumental in digitizing much of the Digital Corpus of Sanskrit, for example. And there are definitely other interesting attempts out there, too, like this one training a CNN-RNN architecture from scratch with TensorFlow. But the focus in this post is on what’s easiest to set up and use for most people, right now, regardless of OS, without an excessive up-front cost.
To sum up the comparison, Google Cloud Vision is clearly the leading option, but it is only accessible (like, at all) if you are comfortable with
setting up and managing a Google Cloud account;
interacting with the API (application programming interface) with a programming language like Python; and
paying indefinitely for its use, assuming you’re OCRing more than 1,000 pages per month.
The first two criteria, in particular, mean that probably 99.5% of Sanskrit folk have not even come close to trying to use it. Yet everyone serious about Sanskrit needs to OCR a book now and then, and the time needed for cleaning up results is a precious resource.
So, if you want to OCR an entire Sanskrit book, or multiple books, your options are, in order of increasing quality of final output per unit of time spent:
1. settle for free but relatively low-quality options like Tesseract (or Abbyy, etc.), which may be easy to use in certain forms, but which will leave you with A LOT of post-processing;
2. commit to using Google Drive, which is also free but does (in my experience at least) involve some additional pre-processing, plus a good amount of post-processing (the difference between 93% and 97% accuracy is actually quite a bit of manual correction);
3. if you have relatively little material, use the very convenient SanskritCR website to OCR one page at a time via the Google Drive API, with the same caveat about post-processing;
4. set up Google Cloud Vision API yourself and pay the modest fee to use it; or
5. get someone who has Google Cloud Vision API already set up to use it for you.
I highly recommend reading Andrew’s post on setting up Vision and giving it a go. If you find it’s too much, though, don’t despair. You can still get Vision-level results. Just drop me a line by email or through the contact form of this website. I’ve got a little pipeline I can run locally, and I’ll process your PDF quickly in exchange for basically just reimbursement of Google’s per-page price (via, e.g., Venmo), plus a nominal markup for my own time (EDIT Jun 25: see the note at the top). I’m not doing anything very different from what Andrew describes in his instructions. I just recognize that for a lot of people the hurdle is too great, and they may end up not using the best-in-class technology, which is a shame. Everyone who wants it should have an easy option for high-quality Sanskrit OCR, and it’s not hard to make this happen. We just need to cooperate more.
Having this kind of technology at your fingertips opens up new possibilities. The Sanskrit Research Institute (sri.auroville.org, led by Martin Gluckman) applied the Google Drive API to its mirror of the 550k-item (31 TB) Digital Library of India (dli.sanskritdictionary.com) and put MySQL-based search in front of it. This gives interesting insight into the contents of these books, but the mixed-language material is not yet perfectly handled by the Drive OCR; based on some experimentation (e.g., "अधिकार" yields only 93 results), I suspect that recall may be quite low in some cases. At a smaller scale, the Jain Quantum project processed the 17k-item Jain eLibrary, again with the Google Drive API, and also made it searchable, this time with a custom combination of fuzzy matching and trigram search. The additional fact that the “book text” is served in full (e.g., in this Sanskrit book here) means that the content is findable by search engines like Google, but only to a certain extent, since the text is not segmented into individual words. These are just two examples of large-scale application of OCR to Sanskritic material.
But whatever the size of your own Sanskrit PDF collection, imagine making that content searchable, too, and with Vision rather than Drive. Say you have a specialty collection of 100 works at 200 pages each. That’s 20,000 pages, or about $30 to OCR with Vision. Pre-processing (mostly cropping pages) takes some time, as does post-processing. But the OCR itself can be done overnight, and for the cost of, what, a nice dinner? Imagine us collectively making a huge leap forward in the amount of material digitized to be machine-searchable, even if only in a preliminary, 95%-accurate way at first. It is closer than we have been thinking.
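The arithmetic is easy to check with a throwaway helper (hypothetical; the $1.50 per 1,000 pages and the 1,000 free monthly pages are the figures quoted earlier in this post). Note that the free tier shaves the 20,000-page estimate slightly below the round $30:

```python
# Estimate Vision OCR cost: $1.50 per 1,000 pages, with the first 1,000
# pages each month free (figures as quoted in this post).

PRICE_PER_1000 = 1.50
FREE_PAGES_PER_MONTH = 1000

def vision_cost(pages: int, months: int = 1) -> float:
    billable = max(0, pages - FREE_PAGES_PER_MONTH * months)
    return billable / 1000 * PRICE_PER_1000

print(vision_cost(20_000))              # 28.5 (one month, free tier applied)
print(20_000 / 1000 * PRICE_PER_1000)   # 30.0 (ignoring the free tier)
```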
There are more components of this overall workflow that can still be improved. Handwritten text recognition (HTR) for manuscripts has recently seen work especially in the Tibetan space (see several presentations at this event in Vienna last year). Software options for this kind of training on bespoke material include not only the relatively well-known Transkribus system but also Kraken, in the lineage of OCRopus. We could also train a relatively simple model for automated post-correction (cf., e.g., the 2022 project cited above). If you’re interested in doing this in combination with Vision, please be in touch with material you’ve corrected from Vision output, and we’ll train something up together.
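As a toy illustration of what even the simplest automated post-correction can look like (this is not the method of the 2022 project mentioned above, which is far more sophisticated): snap each OCR’d token to its nearest neighbor in a trusted word list, if one is close enough.

```python
# Toy post-correction: snap each OCR token to its nearest entry in a
# trusted lexicon when the match is close enough; leave it alone otherwise.
# Real post-correction systems are far more sophisticated than this.
import difflib

LEXICON = ["धर्म", "कर्म", "योग", "मोक्ष"]  # tiny illustrative word list

def correct(tokens: list[str], lexicon: list[str] = LEXICON) -> list[str]:
    out = []
    for tok in tokens:
        # n=1: keep only the single best match above the similarity cutoff
        match = difflib.get_close_matches(tok, lexicon, n=1, cutoff=0.7)
        out.append(match[0] if match else tok)
    return out

# "धर्भ" (a plausible OCR confusion of भ for म) gets snapped to "धर्म"
print(correct(["धर्भ", "योग"]))  # ['धर्म', 'योग']
```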
So someday, it might be possible to use fine-tuned open-source models on appropriately user-friendly platforms, and compensate for imperfections with effective automatic post-processing. Until that day comes, though, for the most part, in Google we trust. 😇