A single OCR foundation model trained jointly across 13 Indic scripts.
International Institute of Information Technology, Hyderabad
tathagata.ghosh@research.iiit.ac.in · sai.gunda@research.iiit.ac.in · ssandral@gitam.in · ravi.kiran@iiit.ac.in
Optical character recognition (OCR) for handwritten Indic manuscripts is essential for large-scale digitization and computational access to manuscript heritage. However, existing approaches are typically developed for one script at a time and require substantial script-specific customization. This limits scalability and practical deployment across diverse collections. We present UniLipi, a unified multi-script OCR model for handwritten Indic manuscripts trained jointly across 13 Indic scripts within a single framework. UniLipi directly handles realistic manuscript conditions, including extreme variation in line geometry, large variation in line length, and partial interruptions caused by non-textual manuscript entities such as holes, stains, or pictorial illustrations. To operate effectively under ultra low-resource conditions, the model leverages script-aware synthetic manuscript data generation, substantially reducing reliance on large volumes of real annotated data. Beyond historical manuscripts, we show that UniLipi serves as an effective foundational pretrained model: its learned representations enable strong OCR performance for contemporary Indic handwriting and extend to several non-Indic scripts including Tibetan, Italian, Latin, and Chinese. In addition to transcription, UniLipi predicts script identity and per-line native character counts, supporting practical manuscript cataloging workflows.
Indic manuscripts span centuries, regions, and writing substrates. UniLipi targets 13 scripts drawn from five regional traditions: North/Central Indian, Southern Indian, Himalayan/Buddhist, Western Indian, and Eastern Indian.
UniLipi is a unified multi-task OCR model. Given a manuscript line image, it jointly predicts a text sequence in the unified Roman WX representation, a script identity label, and a per-line glyph count. The architecture is a hybrid CNN-Transformer with task-specific heads.
WX transcription, while the latent task tokens feed dedicated heads for
script identification and glyph-count prediction. Training uses a joint OCR + script + count objective.
The model is trained end-to-end with a multi-task loss combining CTC for text recognition, cross-entropy for script classification, and mean absolute error for glyph count:
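$$
\mathcal{L} = \mathcal{L}_{\text{OCR}} + \lambda_{\text{script}} \cdot \mathcal{L}_{\text{script}} + \lambda_{\text{count}} \cdot \mathcal{L}_{\text{count}}
$$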
Joint supervision encourages the encoder to learn richer representations that capture textual content, script identity, and structural length information together, improving transcription accuracy across diverse manuscripts.
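As a rough illustration of how the three terms combine, the sketch below computes the script and count terms in plain Python and adds them to an OCR loss value that is assumed to be computed elsewhere (for example by a CTC implementation). The function names and the default weights `lam_script` and `lam_count` are hypothetical placeholders; the values actually used for UniLipi are not stated here.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample, from raw logits via a stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean_absolute_error(predicted, actual):
    """Mean absolute error between predicted and reference glyph counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def joint_loss(loss_ocr, loss_script, loss_count, lam_script=0.1, lam_count=0.1):
    """Weighted sum of the OCR, script, and count terms.

    The default weights are placeholders, not values from the paper.
    """
    return loss_ocr + lam_script * loss_script + lam_count * loss_count


# Toy usage: the OCR (CTC) loss is passed in as an already-computed scalar.
script_term = cross_entropy([2.0, 0.5, -1.0], target=0)
count_term = mean_absolute_error([41.0, 38.5], [40, 40])
total = joint_loss(3.2, script_term, count_term)
```

Keeping the auxiliary weights as explicit arguments makes it easy to check how sensitive transcription quality is to each auxiliary term.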
Real annotated manuscript lines are extremely scarce for many Indic scripts. We generate large-scale synthetic training lines by lifting appearance and geometric priors from real reference manuscripts, yielding roughly 2.5M synthetic lines per script and 32.5M total across the 13 scripts.
(1) Take a WX text line and transliterate it into the target script.
(2) From a reference manuscript line, extract the writing mask, estimate font size and ink color, then render glyphs sequentially within valid regions using script-specific fonts.
(3) Composite the rendered text onto a diffusion-inpainted clean background to preserve realistic manuscript characteristics.
(4) Output the final synthetic line paired with its WX transcription, script label, and glyph-count supervision.
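The idea of rendering glyphs sequentially within valid regions can be sketched as follows. This is a simplified, one-dimensional illustration rather than the pipeline's actual implementation: the writing mask is assumed to be a single row of booleans (one per pixel column), glyphs are reduced to their widths, and `valid_runs` and `place_glyphs` are hypothetical helper names.

```python
def valid_runs(mask):
    """Return (start, end) column intervals, end exclusive, where mask is truthy."""
    runs, start = [], None
    for x, writable in enumerate(mask):
        if writable and start is None:
            start = x
        elif not writable and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(mask)))
    return runs


def place_glyphs(mask, glyph_widths):
    """Assign x-offsets to glyphs left to right, skipping non-writable gaps.

    Returns offsets for the glyphs that fit; placement stops once a glyph
    fits in neither the current run nor any later one.
    """
    runs = valid_runs(mask)
    offsets = []
    run_idx = 0
    cursor = runs[0][0] if runs else 0
    for width in glyph_widths:
        # Move on to the next writable run while the glyph would overflow.
        while run_idx < len(runs) and cursor + width > runs[run_idx][1]:
            run_idx += 1
            if run_idx < len(runs):
                cursor = runs[run_idx][0]
        if run_idx == len(runs):
            break
        offsets.append(cursor)
        cursor += width
    return offsets


# A 12-column line with a 2-column gap (e.g. a hole) at columns 4 and 5.
mask = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
offsets = place_glyphs(mask, [2, 2, 2, 3, 2])
```

In this toy case the third glyph jumps over the gap and the last glyph is dropped because no writable space remains, so the number of placed glyphs, not the number requested, is what should feed the glyph-count label.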
Overall (weighted) performance of UniLipi across all 13 Indic scripts. CER (%) is UniLipi's character error rate for OCR transcription, Mono CER (%) is the character error rate of script-specific monolingual models, and Count-MAE is the mean absolute error of glyph-count prediction.
| Setting | CER (%) ↓ | Mono CER (%) ↓ | Count-MAE ↓ |
|---|---|---|---|
| UniLipi (Overall, weighted) | 6.9 | 8.6 | 1.1 |
Across the 13 scripts, UniLipi attains an overall 6.9% CER, outperforming dedicated per-script (monolingual) models at 8.6% CER and demonstrating the benefit of unified multi-script training.
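For readers unfamiliar with the metric, the sketch below shows one common way to compute a character error rate and a weighted overall score. It assumes CER is the Levenshtein edit distance divided by the reference length, and that per-script scores are weighted by some per-script quantity such as the number of test lines; the exact weighting used for the table above is not specified here, and the inputs are toy values, not results.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER in percent: total edits over total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * edits / chars


def weighted_mean(values, weights):
    """Weighted average, e.g. per-script CER weighted by test-set size."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy usage with made-up strings and weights.
toy_cer = cer(["abcd"], ["abed"])
toy_overall = weighted_mean([10.0, 20.0], [3, 1])
```

Weighting by test-set size keeps scripts with very few evaluation lines from dominating the overall figure, which matters when per-script data volumes differ widely.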
Ablation study of UniLipi. Each row modifies a single component while keeping all others identical; the final configuration is highlighted.
| Modified component | Default | CER ↓ | Count-MAE ↓ |
|---|---|---|---|
| IAST as vocab | Roman WX | 11.0 | 1.7 |
| Dedicated vocab per script | Roman WX | 14.1 | 1.9 |
| No script token | Auxiliary tokens | 7.1 | 1.3 |
| No count token | Auxiliary tokens | 9.3 | – |
| No script & count token | Auxiliary tokens | 9.7 | – |
| Encoder with 4 layers | 6 layers | 9.3 | 1.5 |
| Encoder with 8 layers | 6 layers | 8.1 | 1.3 |
| $\mathcal{L}_{\text{OCR}}$ as Cross-Entropy | CTC loss | 8.3 | 1.4 |
| UniLipi (Final) | Default design | 6.9 | 1.1 |