Add tesseract.js for client-side receipt OCR (privacy-first, local image processing) #148

Open

opened 2026-06-01 16:07:42 -07:00 by pyr0ball (Owner) · 0 comments

Summary

tesseract.js (https://github.com/naptha/tesseract.js/) is a WebAssembly port of the Tesseract OCR engine. It runs entirely in the browser with no server round-trip required. Apache 2.0, 38.1k stars, very mature.

Privacy angle

With WASM-based OCR, the user's receipt image never leaves their device. Only the extracted text string is sent to the backend. This is a meaningful privacy improvement over any server-side OCR pipeline and aligns directly with CF's local-inference-first principle.

Proposed integration

Free tier (local):

Vue frontend loads tesseract.js via WASM
User photographs receipt; OCR runs client-side
Extracted text sent to FastAPI backend for parsing/pantry update
No image upload required
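
The free-tier flow above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a proposed final implementation: `OcrEngine`, `ReceiptTextPayload`, `scanReceipt`, `normalizeOcrText`, and the `/api/receipts/text` path are hypothetical names, and the OCR engine and HTTP call are injected so the flow can be exercised without a browser. In the Vue app the engine would wrap a tesseract.js worker (in v5: `createWorker('eng')`, then `worker.recognize(image)` and reading `data.text`).

```typescript
// Hypothetical abstraction over the OCR engine. In the browser this would
// wrap a tesseract.js worker; it is injected here so the flow stays testable.
interface OcrEngine<TImage> {
  recognize(image: TImage): Promise<{ text: string }>;
}

// Hypothetical request body: only text leaves the device, never the image.
interface ReceiptTextPayload {
  text: string;
  source: "client-ocr";
}

// Hypothetical transport: in the app this would be a fetch() POST to FastAPI.
type PostJson = (url: string, body: ReceiptTextPayload) => Promise<void>;

// Trim each line and drop blank ones so the backend parser sees one
// receipt line per text line.
function normalizeOcrText(raw: string): string {
  return raw
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .join("\n");
}

// Run OCR locally, then send only the extracted text to the backend.
async function scanReceipt<TImage>(
  image: TImage,
  ocr: OcrEngine<TImage>,
  post: PostJson,
  endpoint: string = "/api/receipts/text", // assumed path, not an existing route
): Promise<ReceiptTextPayload> {
  const { text } = await ocr.recognize(image);
  const payload: ReceiptTextPayload = {
    text: normalizeOcrText(text),
    source: "client-ocr",
  };
  await post(endpoint, payload);
  return payload;
}
```

Keeping the image inside `scanReceipt` and passing only `payload` to `post` makes the "no image upload" property easy to check in review.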

Paid tier (cloud):

Backend Python OCR (PaddleOCR or EasyOCR) for higher accuracy
Handles complex receipt layouts, thermal printer artifacts, narrow columns
Image is uploaded to the backend, processed server-side, and discarded immediately afterward

Key limitations to plan around

Tesseract accuracy is baseline: it struggles with thermal receipt artifacts, tiny fonts, and column layouts
No PDF support (Scribe.js is the recommended companion for PDFs)
v5 reduced bundle size by 54% for English; the API is asynchronous and worker-based
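
Because of the PDF limitation, the client probably needs to detect PDFs before handing a file to tesseract.js. A minimal sketch, assuming the check is done on the file's leading bytes (PDF files begin with the ASCII bytes `%PDF-`); `isPdf`, `routeUpload`, and the route names are hypothetical:

```typescript
// PDF files start with the ASCII bytes "%PDF-".
const PDF_MAGIC: number[] = [0x25, 0x50, 0x44, 0x46, 0x2d];

// True when the buffer begins with the PDF signature.
function isPdf(bytes: Uint8Array): boolean {
  return (
    bytes.length >= PDF_MAGIC.length &&
    PDF_MAGIC.every((byte, i) => bytes[i] === byte)
  );
}

// Hypothetical route names: where an uploaded file should go next.
type OcrRoute = "tesseract-client" | "pdf-pipeline";

// Send PDFs to a separate path; everything else goes to client-side OCR.
function routeUpload(bytes: Uint8Array): OcrRoute {
  return isPdf(bytes) ? "pdf-pipeline" : "tesseract-client";
}
```

In the browser, the leading bytes can be read with `file.slice(0, 5).arrayBuffer()` without loading the whole file.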

cf-core documents module

tesseract.js serves as the frontend complement to whatever backend OCR cf-core provides. The pattern: client extracts text locally (free tier), backend refines if needed (paid tier).
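
One way to express "backend refines if needed" is a small decision function on the client. This is a sketch only: it assumes the local result carries a page-level confidence score on a 0-100 scale alongside the text, and the threshold of 70, the names `decideNextStep` and `NextStep`, and the free-tier "retake" fallback are all assumptions that are not specified in this issue.

```typescript
type Tier = "free" | "paid";

// Shape of the local OCR result this sketch relies on.
interface LocalOcrResult {
  text: string;
  confidence: number; // assumed 0-100 page-level score
}

// Hypothetical outcomes. The issue does not say what the free tier does
// with a poor scan; asking for a retake is a placeholder.
type NextStep = "use-local-text" | "refine-on-backend" | "ask-user-to-retake";

// Assumed threshold; it would need tuning against real receipts.
const MIN_CONFIDENCE = 70;

// Use local text when it looks usable; otherwise escalate only on paid tier.
function decideNextStep(
  result: LocalOcrResult,
  tier: Tier,
  minConfidence: number = MIN_CONFIDENCE,
): NextStep {
  const looksUsable =
    result.text.trim().length > 0 && result.confidence >= minConfidence;
  if (looksUsable) return "use-local-text";
  return tier === "paid" ? "refine-on-backend" : "ask-user-to-retake";
}
```

With this shape, the image is only uploaded when the user is on the paid tier and local extraction was not good enough, which keeps the privacy default intact.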

Also relevant

harrier, rufous, ibis, bunting: any ND-pipeline product handling scanned government forms or insurance documents could use the same client-side extraction pattern

References

https://github.com/naptha/tesseract.js/
v5 changelog: 54% smaller English bundles, 73% smaller Chinese bundles, ~50% faster first-run
License: Apache 2.0
