Patterns in scanned documents



I’m testing OCR services for invoice scanning and would like to know the best practices for pattern lookup.

Tesseract has an option to add patterns before scanning.

I’m interested in patterns like dates, VAT numbers, VAT percentage(s), and IBANs. My current solution is to apply regexes to the scanned text; the results are satisfactory but far from being 100% correct, so I wouldn’t automate it.

Would it be possible to define those patterns before the scan/OCR and to have that information grouped in the JSON result?

I would be using your service as a PRO subscriber.



Tesseract patterns are simply a sort of “regular expression”, so I think they do not work any better than simply applying regular expressions to the output. (Or do they?)

but far from being correct 100% and that I wouldn’t automate.

How do your regexes fail? The best solution I know of is to optimize and tweak the regexes (for any kind of OCR).
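As an illustration of that kind of tweaking, here is a minimal Python sketch for the VAT percentage case. It assumes the OCR output may or may not put a space before the percent sign and may use either a comma or a dot as the decimal separator. The names `VAT_RATE` and `find_vat_rates` are made up for this example and are not part of any OCR service.

```python
import re

# Hypothetical pattern: one or two digits, an optional decimal part written
# with "." or ",", then optional whitespace before the "%" sign.
# The lookbehind stops the match from starting in the middle of a longer number.
VAT_RATE = re.compile(r"(?<![\d.,])(\d{1,2}(?:[.,]\d{1,2})?)\s*%")

def find_vat_rates(text):
    """Return the VAT percentages found in OCR text as floats."""
    return [float(m.group(1).replace(",", ".")) for m in VAT_RATE.finditer(text)]

print(find_vat_rates("VAT 20 % standard, 5,5% reduced"))
```

The idea is to make the pattern tolerant of the variations the OCR engine actually produces, instead of expecting one exact layout.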


Thanks for responding!

I didn’t know that about Tesseract.

The regex doesn’t fail, but the OCR results are inconsistent: for example, IBANs can come out with space(s) in them, which complicates the regex, and some dates are recognised without the period. Those kinds of details complicate things…

I thought that some regex could be fed to the OCR engine to force it to “watch out” for those patterns…
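For the spacing and missing-period problems, one option is to let the regex accept optional separators and then validate each candidate, so that a loose pattern does not produce false hits. Below is a minimal Python sketch of that idea, not a definitive implementation. It assumes IBANs are printed in upper case and that dates are in day, month, year order with a four-digit year. The names `find_ibans`, `find_dates` and `iban_checksum_ok` are made up for this example. The only hard fact it relies on is the standard IBAN check: move the first four characters to the end, map letters to numbers (A=10 … Z=35), and the result modulo 97 must equal 1.

```python
import re
from datetime import date

# Assumption: upper-case IBANs, with optional single spaces anywhere after
# the country code (OCR may insert or drop them).
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2} ?\d{2}(?: ?[A-Z0-9]){11,30}\b")

# Assumption: DD.MM.YYYY, where each period may be missing or read as a space.
DATE_CANDIDATE = re.compile(r"\b(\d{2})[.\s]?(\d{2})[.\s]?(\d{4})\b")

def iban_checksum_ok(iban):
    """Standard IBAN mod-97 check on a compact (space-free) string."""
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)  # A=10 ... Z=35
    return int(digits) % 97 == 1

def find_ibans(text):
    """Return checksum-valid IBANs from OCR text, with spaces removed."""
    found = []
    for m in IBAN_CANDIDATE.finditer(text):
        groups = m.group(0).split()
        # The greedy match may swallow trailing tokens, so try shorter prefixes.
        for end in range(len(groups), 0, -1):
            compact = "".join(groups[:end])
            if 15 <= len(compact) <= 34 and iban_checksum_ok(compact):
                found.append(compact)
                break
    return found

def find_dates(text):
    """Return real calendar dates; candidates like 99.99.2021 are dropped."""
    found = []
    for m in DATE_CANDIDATE.finditer(text):
        day, month, year = (int(g) for g in m.groups())
        try:
            found.append(date(year, month, day))
        except ValueError:
            pass  # not a real calendar date, probably a misread number
    return found

print(find_ibans("IBAN: DE89 3704 0044 0532 0130 00 Total"))
print(find_dates("Invoice date 12.03.2021, due 15042021"))
```

The checksum is what makes the loose IBAN pattern safe enough to use: a random run of characters passes it only about 1 time in 97. A per-country table of IBAN lengths would tighten this further, and dates have no such checksum, so those still deserve a manual check.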