Add tesseract.js for client-side receipt OCR (privacy-first, local image processing) #148

Open
opened 2026-06-01 16:07:42 -07:00 by pyr0ball · 0 comments

Summary

tesseract.js (https://github.com/naptha/tesseract.js/) is a WebAssembly port of the Tesseract OCR engine. It runs entirely in the browser with no server round-trip required. It is licensed under Apache 2.0, has 38.1k GitHub stars, and is very mature.

Privacy angle

With WASM-based OCR, the user's receipt image never leaves their device. Only the extracted text string is sent to the backend. This is a meaningful privacy improvement over any server-side OCR pipeline and aligns directly with CF's local-inference-first principle.

Proposed integration

Free tier (local):

  • Vue frontend loads tesseract.js via WASM
  • User photographs receipt; OCR runs client-side
  • Extracted text sent to FastAPI backend for parsing/pantry update
  • No image upload required
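
The free-tier flow above could be sketched roughly as follows. This is a minimal illustration, not the kiwi implementation: `scanReceiptLocally`, `cleanOcrText`, `Recognize`, and `SendText` are hypothetical names, and the OCR call and the backend transport are injected so the sketch stays independent of any particular library version or API route.

```typescript
// Hypothetical names throughout: nothing here is taken from the kiwi codebase.

// Any function that turns an image into raw OCR text. With tesseract.js this
// would wrap a worker's recognize() call; here it is injected by the caller.
type Recognize<Img> = (image: Img) => Promise<string>;

// Sends extracted text to the backend. The transport (for example a POST to
// a FastAPI route) is an assumption left to the caller.
type SendText = (text: string) => Promise<void>;

// Trim every OCR line and drop the empty ones so the backend parser
// receives one candidate receipt line per entry.
function cleanOcrText(raw: string): string[] {
  return raw
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Run OCR on the device, then send only the extracted text.
// The image itself is never passed to sendText.
async function scanReceiptLocally<Img>(
  image: Img,
  recognize: Recognize<Img>,
  sendText: SendText,
): Promise<string[]> {
  const raw = await recognize(image);
  const lines = cleanOcrText(raw);
  await sendText(lines.join("\n"));
  return lines;
}
```

With tesseract.js v5, `recognize` would most likely be built from `createWorker('eng')` and `worker.recognize(image)`, reading the text from the result's `data.text` field. Keeping it injected also makes the flow easy to test without loading the WASM bundle.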

Paid tier (cloud):

  • Backend Python OCR (PaddleOCR or EasyOCR) for higher accuracy
  • Handles complex receipt layouts, thermal printer artifacts, narrow columns
  • Image upload to backend, processed server-side, image discarded immediately after

Key limitations to plan around

  • Tesseract accuracy is baseline — struggles with thermal receipt artifacts, tiny fonts, column layouts
  • No PDF support (Scribe.js is the recommended companion for PDFs)
  • v5 reduced the English bundle size by 54%; the worker API is asynchronous, so OCR calls must be awaited
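
Because the worker API is async and creating a worker is comparatively expensive (it loads the WASM core and language data), one way to plan around it is to create a single worker lazily and reuse it across scans. The sketch below assumes a hypothetical `OcrWorker` interface and `createLazyOcr` helper; the factory that would actually call tesseract.js is injected rather than shown.

```typescript
// Hypothetical interface: a thin wrapper a real tesseract.js worker could be
// adapted to. It is not the library's own type.
interface OcrWorker<Img> {
  recognize(image: Img): Promise<string>;
  terminate(): Promise<void>;
}

// Create the worker on first use and share it between all later scans.
function createLazyOcr<Img>(factory: () => Promise<OcrWorker<Img>>) {
  let workerPromise: Promise<OcrWorker<Img>> | null = null;

  return {
    async recognize(image: Img): Promise<string> {
      // Store the promise (not the worker) so concurrent first calls
      // still trigger only one factory invocation.
      if (workerPromise === null) {
        workerPromise = factory();
      }
      const worker = await workerPromise;
      return worker.recognize(image);
    },

    // Release the worker, for example when the scanning view is closed.
    async dispose(): Promise<void> {
      if (workerPromise === null) return;
      const pending = workerPromise;
      workerPromise = null;
      const worker = await pending;
      await worker.terminate();
    },
  };
}
```

In a Vue frontend, `dispose` would typically be called from a component's unmount hook, though where exactly it belongs depends on how the scanning UI is structured.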

cf-core documents module

tesseract.js serves as the frontend complement to whatever backend OCR cf-core provides. The pattern: client extracts text locally (free tier), backend refines if needed (paid tier).
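
The "backend refines if needed" decision could be sketched as below. This is an illustration under assumptions: `shouldRefineOnBackend`, `LocalOcrResult`, and the threshold of 70 are hypothetical, and the `confidence` field is assumed to mirror the overall 0-100 confidence score that tesseract.js reports alongside the recognized text.

```typescript
type Tier = "free" | "paid";

// Hypothetical shape for the outcome of the client-side OCR pass.
interface LocalOcrResult {
  text: string;
  confidence: number; // assumed 0-100 scale
}

// Hypothetical default; a real threshold would need tuning on actual receipts.
const DEFAULT_MIN_CONFIDENCE = 70;

// Decide whether the image should also go to the backend OCR.
// Free tier never uploads the image, whatever the local result looks like.
function shouldRefineOnBackend(
  result: LocalOcrResult,
  tier: Tier,
  minConfidence: number = DEFAULT_MIN_CONFIDENCE,
): boolean {
  if (tier === "free") return false;
  const hasText = result.text.trim().length > 0;
  return !hasText || result.confidence < minConfidence;
}
```

For example, a paid-tier scan that returns text with confidence 40 would be routed to the backend, while the same result on the free tier stays local and the image is not uploaded.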

Also relevant

  • harrier, rufous, ibis, bunting — any ND-pipeline products handling scanned government forms or insurance documents could use the same client-side extraction pattern

References

  • https://github.com/naptha/tesseract.js/
  • v5 changelog: 54% smaller English bundles, 73% smaller Chinese bundles, ~50% faster first-run
  • License: Apache 2.0
Reference: Circuit-Forge/kiwi#148