Plain-Text OCR Benchmark: The Gap Between On-Prem and API Is Gone

02.06.2026

Christian Kasim Loan

Senior Data Scientist at John Snow Labs

In our first benchmark, we showed that JSL Vision OCR is the #1 grounded OCR model overall, beating every closed-source frontier system on the FUNSD dataset.

This post answers a different question: plain-text OCR.

Many enterprise workflows don’t need bounding boxes. Search indexing, clinical summarization, RAG pipelines, compliance audits, downstream LLM reasoning, all they need is clean, human-readable text in natural reading order.

So how does JSL Vision ( jsl_vision), our 30B-class flagship OCR model, compare to GPT-5.2, Claude 4.5, Gemini 2.5, Grok-4, and the rest of the frontier?

Short answer: it wins. And the gap between the best self-hosted VLM and the best closed-source API is less than 1% absolute CER. On plain-text OCR, accuracy alone is no longer a reason to send your documents to a third-party cloud.

What “markdown OCR” looks like

Input on the left, JSL Vision’s reading-order text output on the right:

No bounding boxes, no JSON fields, no layout metadata, just clean text in the order a human would read it. That’s the unit of work for RAG, search indexing, and clinical summarization.

What we measured

Dataset: FUNSD training pages, 50 pages for closed-source models, 10 pages for open-source VLMs (GPU-constrained, see methodology)
Metric: character-level CER via jiwer.cer() on full-page text
Ground truth: word-level FUNSD annotations reassembled into line-level reading order (top-left → bottom-right), lowercased, whitespace-normalized
Predictions: preprocessed identically (strip markdown tables, collapse whitespace, lowercase) before scoring

Who leads the field

jsl_vision wins outright at 0.148 CER. Grok-4 and Gemini 2.5 Pro are close behind (0.155 and 0.167). The fact that the top-3 spread is < 0.02 absolute CER is itself the headline: OCR has converged.

How all 16 models compare

The self-hosted ↔ cloud gap has effectively closed. JSL Vision (0.148 CER) outperforms the best closed-source API (Grok-4, 0.155) by 4.5% relative, running entirely on your own hardware.

Even setting JSL Vision aside, open-source VLMs now produce OCR quality that’s practically indistinguishable from GPT-5.2, Claude, or Gemini on plain-text tasks. For clinical summarization, search indexing, and RAG pipelines, there’s no accuracy reason left to send documents to a closed-source API.

The data breaks down in a few ways that matter:

GPT-5.2 underperforms on OCR. It landed at 0.271 CER, last of 16 in the leaderboard. This mirrors what we saw in the BBox benchmark (GPT-5.2 scored 0.588 on grounded OCR, essentially random). GPT-5.2 is a capable general-purpose model, but not competitive as a dedicated OCR engine.

Grok-4 is a surprising standout. On flat text, Grok-4 (0.155) ranks #2 overall. This is very different from its BBox performance (0.582, near the bottom). The takeaway: Grok-4 reads well but can’t ground what it reads. If you need spatial coordinates, don’t use it.

Smaller open-source models are competitive. Qwen3-VL-8B (8B parameters, 0.222 CER) outperforms the 235B closed-source Qwen3-VL-235B (0.227) and GPT-5.2 (0.271). A well-tuned sub-10B model can match frontier APIs on OCR.

Cost vs accuracy: where the real gap shows up

If accuracy is a wash, the real comparison is cost. Plotting CER against cost per 1,000 documents:

JSL Vision wins outright on accuracy. For the 30B flagship: $6.76 per 1K docs on H100 (9.7 seconds per page). For the 1B jsl_vision_ocr: $0.56 per 1K docs on T4. Gemini 2.5 Pro costs roughly $6.75 per 1K on this task with worse accuracy and no on-prem option. Grok-4 runs $86.70 per 1K. Llama-4-Maverick is similarly competitive on flat text, but its 400B MoE architecture requires 4× H100s to self-host, making it ~3× more expensive than JSL Vision at production scale. As we’ll see in the JSON-Schema benchmark, it also falls apart on structured extraction tasks.

Document processing speed

For batch ingest and interactive workflows, time-to-result matters as much as accuracy:

JSL Vision at 9.7 seconds per page is mid-pack. Some closed-source models are faster on flat text (Llama at 3.2s, Gemini Flash at 4.8s) but lose on accuracy or break on schema-following. For pure-OCR latency, the 0.4B jsl_vision_ocr model runs at 1.8s per page (see Article 1).

A single H100 at $2.50/hour processes pages at a tiny fraction of frontier-API cost, and that’s before accounting for the operational cost of sending your documents to a third-party service: data-processing agreements, audit trails, latency tails, and the occasional 3 AM page when someone else’s API goes down.

GPU vs CPU: does the hardware gap matter?

Not every deployment has a GPU. We ran all major OS OCR models on both an NVIDIA H100 and a standard CPU instance (c5.2xlarge, 8 vCPUs) using HF transformers at float32. Same 5 FUNSD pages, same prompts.

GPU delivers 2–7× faster throughput per page across model sizes. The advantage grows with parameter count; 7–8B models peak at 7.2×.

CPU is cheaper per document for every model in the benchmark. CPU compute at $0.34/hr is ~7× cheaper than H100 at $2.50/hr, and even a 7× slower CPU page still comes out ahead on cost. At 7–8B, GPU and CPU land within 2% of each other per document. The latency advantage of GPU is real; the cost advantage of CPU is marginal. JSL Vision OCR via vLLM at $1.22/1K docs beats every CPU option on both speed and cost simultaneously.

Built for regulated industries

JSL Vision runs on your own hardware: Docker or Amazon SageMaker, from a single T4 (1B model) to an H100 (30B model), no minimum cluster size. HIPAA, GDPR, and SOC 2 compliance follow from the architecture since there’s no API call, no data egress, and no data-sharing agreement to negotiate with a vendor. Outputs are deterministic: the same document produces the same output every run, every deployment, every environment. Input can be any image-convertible format: PDF, PNG, JPG.

How we ran the benchmark

Dataset: FUNSD. Reading order via line-level reassembly of word-level FUNSD annotations.
Metric: jiwer.cer(gt, pred) at the corpus level, after consistent preprocessing (lowercase, strip markdown separators, collapse whitespace).
Closed-source models: OpenRouter temperature=0, max_tokens=8192.
Open-source models: vLLM v0.17.1, H100, f16 precision.
Sample sizes: Closed-source at 50 pages (cheap via API). Open-source VLMs at 10 pages, extending to 50 requires GPU time we’ll publish in a follow-up. OS numbers are directionally stable.
Outliers: Ministral-14B produced a 34× length-explosion gen-loop on one page (63 k chars vs 1.8 k GT) that inflated its corpus CER to 1.63 with median page CER of 0.28. It is omitted from the leaderboard; the gen-loop behavior is real and worth knowing about for production deployments.

What’s next

This is the second of a three-part series:

Grounded (BBox) OCR on FUNSD
Image → Markdown OCR on FUNSD, you are here
Schema-Constrained JSON OCR on OmniOCR

The accuracy gap between self-hosted and cloud APIs on plain OCR has closed. If your pipeline is still routing documents to a third-party API for flat-text extraction, you’re paying for API risk, data agreements, and latency uncertainty. Not accuracy.

Replacing a third-party OCR API or standing up a document ingestion pipeline? Schedule a demo to see JSL Vision running on your own data.

How useful was this post?

Schedule a demo

Book now

Christian Kasim Loan

Senior Data Scientist at John Snow Labs

Our additional expert:

Christian Kasim Loan is a computer scientist with over 10 years of coding experience who works for John Snow Labs as a Senior Data Scientist where he helps porting the latest and greatest Machine Learning Models to Spark and created the NLU library.