CURIO: Curvature-Aligned and Efficient OCR for Low-Resource Historical Manuscripts
Project Video
CURIO straightens curved Sharada lines, trains on curvature-matched synthetic data, and reads centuries-old manuscripts with higher fidelity and lower compute.
Abstract
We present CURIO, an OCR system for low-resource historical manuscripts. CURIO extracts curved lines with scribble/polygon guidance, rectifies them to reduce background, and pairs scarce real data with Sharada-aligned synthetic lines rendered along manuscript-derived trajectories. A lightweight CNN–Transformer with padding-aware null activations, sparse attention, and CTC decoding improves efficiency on long, curved lines. Evaluated on Sharada manuscripts, CURIO delivers state-of-the-art character error rate with the largest gains on high-curvature and long lines, and transfers zero-shot to printed Sharada text.
Why curvature matters
Sharada lines are long, skewed, and densely decorated with diacritics. CURIO rectifies lines before recognition, shrinking background pixels and preserving glyph detail for robust transcription.
Rectifying scarce real data
Line polygons and mid-line scribbles guide a piecewise-linear rectifier, with the line region dilated to protect diacritics. The result is compact line crops with less background, plus aligned polygons and scribbles that are reused for evaluation and synthesis.
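The rectification step can be illustrated with a minimal sketch. It assumes the scribble is a polyline of (x, y) points along the middle of the line and that a fixed half-height (already dilated enough to cover diacritics) is known. The function names, the nearest-neighbour sampling, and the list-of-rows image format are illustrative assumptions, not CURIO's actual implementation.

```python
import math


def sample_polyline(points, step=1.0):
    """Sample (x, y, nx, ny) at equal arc-length steps along a polyline.

    (nx, ny) is the unit normal of the segment the sample falls on.
    Assumes consecutive points are distinct (no zero-length segments).
    """
    seg_lengths = [
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]
    total = sum(seg_lengths)
    samples = []
    seg, seg_start = 0, 0.0
    for i in range(int(total / step) + 1):
        s = i * step
        # Advance to the segment that contains arc length s.
        while seg < len(seg_lengths) - 1 and s > seg_start + seg_lengths[seg]:
            seg_start += seg_lengths[seg]
            seg += 1
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        length = seg_lengths[seg]
        t = (s - seg_start) / length
        tx, ty = (x1 - x0) / length, (y1 - y0) / length
        samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0), -ty, tx))
    return samples


def rectify_line(image, scribble, half_height, step=1.0):
    """Straighten a curved line by sampling pixels along scribble normals.

    image: list of rows (grayscale values). Returns a straight crop with
    2 * half_height + 1 rows and one column per arc-length sample.
    Pixels that fall outside the image are filled with 0.
    """
    h, w = len(image), len(image[0])
    samples = sample_polyline(scribble, step)
    out = []
    for off in range(-half_height, half_height + 1):
        row = []
        for x, y, nx, ny in samples:
            px, py = round(x + off * nx), round(y + off * ny)
            row.append(image[py][px] if 0 <= px < w and 0 <= py < h else 0)
        out.append(row)
    return out
```

With a perfectly horizontal scribble this reduces to a plain crop around the mid-line. With a bent scribble each segment is straightened independently, which is what makes the mapping piecewise linear.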
Curvature-aligned synthetic corpus
Sanskrit text is transliterated to Sharada, length-matched to real lines, rendered straight with font sizes estimated from polygons, then warped along real scribbles. Warped text is composited on inpainted manuscript backgrounds and augmented with document-specific effects.
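The warping step is roughly the inverse of rectification: each column of the straight rendered line is placed along the normal of a real scribble. The sketch below is a simplified forward mapping under assumed names (`point_at`, `warp_along_scribble`) and an assumed list-of-rows image format. It is not the project's renderer, and it skips the compositing and augmentation stages.

```python
import math


def point_at(points, s):
    """Return (x, y, nx, ny) at arc length s along a polyline, or None
    if s lies beyond the end. Assumes consecutive points are distinct."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        length = math.hypot(x1 - x0, y1 - y0)
        if s <= length:
            t = s / length
            return (
                x0 + t * (x1 - x0),
                y0 + t * (y1 - y0),
                -(y1 - y0) / length,  # unit normal, x component
                (x1 - x0) / length,   # unit normal, y component
            )
        s -= length
    return None


def warp_along_scribble(line_img, scribble, canvas_h, canvas_w, background=0):
    """Paste a straight text line onto a canvas so that its vertical
    centre follows the scribble. Column c of line_img is placed at arc
    length c; columns past the end of the scribble are dropped."""
    h, w = len(line_img), len(line_img[0])
    canvas = [[background] * canvas_w for _ in range(canvas_h)]
    for c in range(w):
        hit = point_at(scribble, float(c))
        if hit is None:
            break
        x, y, nx, ny = hit
        for r in range(h):
            off = r - (h - 1) / 2  # signed distance from the centre row
            px, py = round(x + off * nx), round(y + off * ny)
            if 0 <= px < canvas_w and 0 <= py < canvas_h:
                canvas[py][px] = line_img[r][c]
    return canvas
```

Forward mapping of this kind can leave small holes on the outer side of sharp bends. A production version would more likely use inverse mapping with interpolation, which is omitted here to keep the sketch short.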
Efficient CNN–Transformer
Rectified lines are resized to 68 px height and padded to 1800 px width. A ResNet-18 encoder feeds a 4-layer Transformer with sparse attention and padding-aware masking, which keeps padded regions from contributing spurious context. CTC decoding then yields the transcription.
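The input preparation and decoding around the network can be sketched as follows. The 68 px height and 1800 px width come from the description above. The nearest-neighbour resize, the encoder stride of 8, the blank index 0, and all function names are assumptions made for illustration. The encoder and Transformer themselves are not shown.

```python
def resize_and_pad(img, target_h=68, target_w=1800, pad_value=0):
    """Resize a line image (list of rows) to target_h, keeping the aspect
    ratio, then right-pad to target_w. Returns (padded_image, valid_width).
    Lines that would exceed target_w are squeezed horizontally to fit."""
    h, w = len(img), len(img[0])
    new_w = min(target_w, max(1, round(w * target_h / h)))
    out = []
    for y in range(target_h):
        src_y = min(h - 1, y * h // target_h)
        row = [img[src_y][min(w - 1, x * w // new_w)] for x in range(new_w)]
        out.append(row + [pad_value] * (target_w - new_w))
    return out, new_w


def padding_mask(valid_w, target_w=1800, stride=8):
    """Boolean mask over encoder time steps: True marks padded steps that
    attention should ignore. `stride` is a hypothetical horizontal
    downsampling factor of the encoder."""
    n_steps = target_w // stride
    n_valid = min(n_steps, -(-valid_w // stride))  # ceiling division
    return [t >= n_valid for t in range(n_steps)]


def ctc_greedy_decode(frame_labels, blank=0):
    """Best-path CTC decoding: collapse consecutive repeats, drop blanks."""
    out, prev = [], blank
    for label in frame_labels:
        if label != blank and label != prev:
            out.append(label)
        prev = label
    return out
```

For example, a 34 x 100 crop is scaled to 68 x 200 and padded to 68 x 1800. With a stride of 8, the first 25 time steps are valid and the remaining 200 are masked.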
Results
CURIO achieves state-of-the-art character error rate on Sharada manuscripts, with the largest gains on high-curvature and long lines.
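For reference, character error rate is conventionally the edit distance between the predicted and reference transcriptions, divided by the reference length. The sketch below counts Unicode code points. Whether CURIO's evaluation counts code points or grapheme clusters is not stated here, so that choice is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions,
    deletions, and substitutions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hypothesis `hyp` against reference `ref`."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```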