
CURIO straightens curved Sharada lines, trains on curvature-matched synthetic data, and reads centuries-old manuscripts with higher fidelity and lower compute.

Abstract

We present CURIO, an OCR system for low-resource historical manuscripts. CURIO extracts curved lines with scribble/polygon guidance, rectifies them to reduce background, and pairs scarce real data with Sharada-aligned synthetic lines rendered along manuscript-derived trajectories. A lightweight CNN–Transformer with padding-aware null activations, sparse attention, and CTC decoding improves efficiency on long, curved lines. Evaluated on Sharada manuscripts, CURIO delivers state-of-the-art character error rate with the largest gains on high-curvature and long lines, and transfers zero-shot to printed Sharada text.

Sharada manuscript teaser showing curved lines

Why curvature matters

Sharada lines are long, curved or skewed, and densely decorated with diacritics. CURIO rectifies each line before recognition, which reduces the background area in the crop while preserving glyph detail for robust transcription.

Pipeline: page segmentation, scribble extraction, rectification

Rectifying scarce real data

Line polygons and mid-line scribbles guide a piecewise-linear rectifier, with dilation applied to protect diacritics. The result is compact line crops with reduced background, plus aligned polygons and scribbles that are reused for evaluation and synthesis.
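
To make the rectification step concrete, here is a minimal sketch of piecewise-linear straightening along a mid-line scribble. It is an illustration under stated assumptions, not CURIO's implementation: the function names, the nearest-neighbour sampling, and the `half_height` parameter (which would need to include any dilation margin for diacritics) are all hypothetical.

```python
import math

def resample_scribble(points, step=1.0):
    """Walk along a polyline and return (x, y, nx, ny) every `step` pixels of
    arc length, where (nx, ny) is the unit normal of the current segment."""
    samples = []
    carry = 0.0  # offset into the next segment at which the next sample falls
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0:
            continue  # skip repeated points
        tx, ty = (x1 - x0) / seg, (y1 - y0) / seg  # unit tangent
        d = carry
        while d < seg:
            # In image coordinates (y grows downward) the normal (-ty, tx)
            # points below the line for left-to-right scribbles.
            samples.append((x0 + d * tx, y0 + d * ty, -ty, tx))
            d += step
        carry = d - seg
    return samples

def rectify(image, scribble, half_height):
    """Straighten a curved line.

    image: grayscale page as a list of rows.
    scribble: list of (x, y) points along the middle of the text line.
    half_height: rows sampled above and below the scribble (assumed to
                 already include a margin for diacritics).
    Returns a crop of height 2 * half_height + 1, one column per sample.
    """
    h, w = len(image), len(image[0])
    cols = resample_scribble(scribble)
    out = [[0] * len(cols) for _ in range(2 * half_height + 1)]
    for c, (x, y, nx, ny) in enumerate(cols):
        for r in range(-half_height, half_height + 1):
            # Nearest-neighbour lookup along the local normal.
            sx, sy = round(x + r * nx), round(y + r * ny)
            if 0 <= sx < w and 0 <= sy < h:
                out[r + half_height][c] = image[sy][sx]
    return out
```

Each output column corresponds to one arc-length step along the scribble, so the crop width tracks the true line length rather than the width of its bounding box. A production version would likely use bilinear interpolation and smooth the normals at segment joints.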

Curvature-aligned synthetic corpus

Sanskrit text is transliterated to Sharada, length-matched to real lines, rendered straight with font sizes estimated from polygons, then warped along real scribbles. Warped text is composited on inpainted manuscript backgrounds and augmented with document-specific effects.
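
The warping step can be sketched as follows. This is a deliberately simplified illustration, not the authors' renderer: it shifts each column of a straight rendered line vertically so that its centre follows the scribble, and it ignores the local rotation that a full warp would apply. The function names and the `background` parameter are hypothetical.

```python
def scribble_height(scribble, x):
    """Linearly interpolate the scribble's y at column x.
    Assumes scribble points are sorted by increasing x."""
    for (x0, y0), (x1, y1) in zip(scribble, scribble[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0) if x1 != x0 else 0.0
            return y0 + t * (y1 - y0)
    # Outside the scribble's extent: clamp to the nearest endpoint.
    return scribble[0][1] if x < scribble[0][0] else scribble[-1][1]

def warp_along_scribble(straight, scribble, canvas_height, background=0):
    """Place a straight rendered line onto a canvas so that its vertical
    centre follows the scribble (column-wise shift only, no rotation)."""
    h, w = len(straight), len(straight[0])
    canvas = [[background] * w for _ in range(canvas_height)]
    for x in range(w):
        centre = round(scribble_height(scribble, x))
        top = centre - h // 2
        for y in range(h):
            cy = top + y
            if 0 <= cy < canvas_height:  # drop pixels that leave the canvas
                canvas[cy][x] = straight[y][x]
    return canvas
```

Because the scribbles come from real manuscript lines, the synthetic line inherits a realistic curvature profile. Compositing onto an inpainted background and applying augmentations would happen after this step.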

Synthetic pipeline diagram showing curvature-aligned rendering
CNN-Transformer model diagram with sparse attention and CTC

Efficient CNN–Transformer

Rectified lines are resized to 68 px height and padded to 1800 px width. A ResNet-18 encoder feeds a 4-layer Transformer with sparse attention and padding-aware masking. CTC decoding yields transcriptions while avoiding spurious context from padded regions.
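
The sketch below shows one way the padding mask and the decoding step could fit together. It is a hedged illustration: the 68 px and 1800 px targets come from the text above, but the feature `stride` of 8, the blank index of 0, and all function names are assumptions rather than details of CURIO. The decoder shown is plain greedy (best-path) CTC decoding.

```python
import math

TARGET_H, TARGET_W = 68, 1800  # input size stated in the text

def resized_width(h, w):
    """Width after scaling a crop to TARGET_H with its aspect ratio kept,
    capped at TARGET_W."""
    return min(TARGET_W, max(1, round(w * TARGET_H / h)))

def padding_mask(valid_width, stride=8):
    """Return one flag per feature column; True marks columns that lie
    entirely in right-hand padding. `stride` is an assumed downsampling
    factor between input pixels and encoder feature columns."""
    total = math.ceil(TARGET_W / stride)
    valid = math.ceil(valid_width / stride)
    return [i >= valid for i in range(total)]

def ctc_greedy_decode(frame_labels, mask, blank=0):
    """Greedy CTC decoding: collapse repeated labels, then drop blanks.
    Frames flagged as padding are ignored."""
    out, prev = [], blank
    for label, is_pad in zip(frame_labels, mask):
        if is_pad:
            break  # padding is contiguous on the right, so stop here
        if label != blank and label != prev:
            out.append(label)
        prev = label
    return out
```

In an attention layer the same mask would be used to exclude padded columns as keys, so that real columns never attend to padding. For example, a 136 px tall crop that is 2000 px wide resizes to 1000 px, leaving 800 px of padding that the mask removes from both attention and decoding.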

Results

CURIO achieves state-of-the-art character error rate on Sharada manuscripts, with the largest gains on high-curvature and long lines, and transfers zero-shot to printed Sharada text.
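
For reference, character error rate is conventionally the edit distance between the predicted and reference transcriptions divided by the reference length. The sketch below follows that standard definition; it treats each Unicode code point as one character, which is an assumption, since an evaluation of Sharada text could instead count grapheme clusters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the prediction is much longer than the reference.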