
When we last evaluated the accuracy

Posted: Wed Jul 09, 2025 10:00 am
by asimm22
Tesseract has made a major step forward in the last few years. When we last evaluated it, it was not as good as proprietary OCR, but that has changed: we have run evaluations and it is now just as good, and because of its new architecture it can get even better for our application.

Underlying the new Tesseract is an LSTM engine similar to the one developed for Ocropus2/ocropy, a project led by Tom Breuel (funded by Google, his former German university, and probably others; thank you!). He has continued working on the project even after leaving academia. A machine-learning-based engine also opens the door to GPU-based processing, which is an extra win, and it can be trained on corrected texts so it keeps getting better.
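
As a rough illustration (the file names are placeholders, not our actual pipeline code), the new LSTM engine can be selected explicitly on the Tesseract 4+ command line with --oem 1; a small Python wrapper might look like this:

    import subprocess

    # OCR a single page image with the LSTM engine only (--oem 1),
    # writing hOCR output (page.hocr) with word positions for later use.
    subprocess.run(
        ["tesseract", "page.png", "page", "--oem", "1", "-l", "eng", "hocr"],
        check=True,
    )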


The OCR time on our cluster is approximately the same, but if we add GPUs we should be able to speed up OCR and PDF creation, perhaps by a factor of ten, which would help a great deal since we are processing millions of pages a day.

PDF generation is a balancing act: we want small file sizes, fast rendering in browser PDF implementations, useful functionality (text search, page numbers, cut-and-paste of text), and compliance with the archival (PDF/A) and accessibility (PDF/UA) standards. At the heart of the new PDF generation is the “archive-pdf-tools” Python library, which performs Mixed Raster Content (MRC) compression, creates a hidden text layer using a modified Tesseract PDF renderer that can read hOCR files as input, and ensures the PDFs are compatible with archival standards (we use veraPDF to verify every PDF we generate against the archival PDF standards).
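
To give a feel for how these pieces fit together, here is a hedged sketch of the downstream steps (file names are placeholders, and the recode_pdf flags are quoted from memory of the archive-pdf-tools documentation, so check recode_pdf --help before copying them):

    import subprocess

    # Build an MRC-compressed PDF with a hidden text layer from the page
    # images plus the hOCR produced by Tesseract, using archive-pdf-tools'
    # recode_pdf tool (flag names assumed, see its --help).
    subprocess.run(
        ["recode_pdf", "--from-imagestack", "pages/*.png",
         "--hocr-file", "book.hocr", "--dpi", "300", "-o", "book.pdf"],
        check=True,
    )

    # Verify the result against a PDF/A profile with veraPDF.
    subprocess.run(["verapdf", "--flavour", "2b", "book.pdf"], check=True)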