Patterns in scanned documents

zac977 · April 29, 2019, 9:02am

Hi!

I’m testing OCR services for invoice scanning and would like to know the best practices for pattern lookup.

Tesseract has an option to add patterns before scanning.

I’m interested in patterns like dates, VAT numbers, VAT percentage(s) and IBANs. My current solution is to run regexes over the scanned text. The results are satisfactory but far from 100% correct, so I wouldn’t automate it.

Would it be possible to define those patterns before the scan/OCR and to have that info grouped in the JSON result?

I would be using your service as a PRO subscriber.

Thanks!

ulrich · May 3, 2019, 7:14pm

Tesseract patterns are simply a sort of “regular expression”, so I don’t think they work any better than simply applying regular expressions to the OCR.space output. (Or do they?)

“but far from being correct 100% and that I wouldn’t automate.”

How do your regexes fail? The best solution I know is to optimize and tweak the regex (this applies to any kind of OCR).
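As an illustration of what “tweaking” can mean, here is a minimal Python sketch of a percentage pattern that tolerates typical OCR variations: a comma or a dot as the decimal mark, and stray spaces before the percent sign. The names `PERCENT_RE` and `find_percentages` are hypothetical, and the pattern assumes rates with at most two digits before the decimal mark; it is not tied to any OCR.space feature.

```python
import re

# Hypothetical pattern: 1-2 digits, an optional decimal part written with
# "," or "." (optionally surrounded by a single space), then optional
# whitespace and a percent sign.
PERCENT_RE = re.compile(r"(?<!\d)(\d{1,2})(?:\s?[.,]\s?(\d{1,2}))?\s*%")


def find_percentages(text: str) -> list[float]:
    """Return every percentage found in the OCR text as a float."""
    results = []
    for m in PERCENT_RE.finditer(text):
        whole, frac = m.group(1), m.group(2)
        # Normalise "13,5" and "13.5" to the same numeric value.
        results.append(float(f"{whole}.{frac}" if frac else whole))
    return results


print(find_percentages("VAT 25 % and 13,5% and 5.0 %"))
```

The idea is to make the pattern accept the variants the OCR actually produces and normalise them afterwards, rather than expecting one exact spelling.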

zac977 · May 4, 2019, 5:23pm

Thanks for responding!

I didn’t know that about Tesseract.

The regex itself doesn’t fail, but the OCR results are inconsistent. For example, IBANs can come back with space(s) inserted, which complicates the regular expressions, and some dates are recognised without a period. Details like that complicate things…
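A rough Python sketch of how those two cases could be handled after OCR: allow optional spaces inside the IBAN and confirm candidates with the standard mod-97 checksum, and allow date separators to be missing and confirm candidates with a calendar check. All names here (`IBAN_RE`, `DATE_RE`, `find_ibans`, `find_dates`) are hypothetical, and the date pattern assumes day-month-year order with a four-digit year.

```python
import re
from datetime import date

# Country code, two check digits, then 11-30 alphanumerics; spaces or tabs
# are allowed between characters because OCR may insert them.
IBAN_RE = re.compile(r"\b[A-Z]{2}[ \t]*\d{2}(?:[ \t]*[A-Z0-9]){11,30}\b")

# DD MM YYYY where each separator (".", "/" or "-") may be missing or
# replaced by a space. Assumption: day-month-year order.
DATE_RE = re.compile(r"(?<!\d)(\d{2})\s?[./-]?\s?(\d{2})\s?[./-]?\s?(\d{4})(?!\d)")


def iban_is_valid(iban: str) -> bool:
    """Mod-97 check on an upper-case IBAN with whitespace already removed."""
    rearranged = iban[4:] + iban[:4]
    # Letters map to 10..35 (A=10), digits map to themselves.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text: str) -> list[str]:
    found = []
    for m in IBAN_RE.finditer(text.upper()):
        candidate = re.sub(r"\s+", "", m.group())
        # The match may have absorbed trailing text on the same line, so try
        # the longest prefix first and keep the first one that validates.
        for n in range(min(len(candidate), 34), 14, -1):
            if iban_is_valid(candidate[:n]):
                found.append(candidate[:n])
                break
    return found


def find_dates(text: str) -> list[date]:
    found = []
    for m in DATE_RE.finditer(text):
        day, month, year = (int(g) for g in m.groups())
        try:
            found.append(date(year, month, day))
        except ValueError:
            pass  # e.g. month 34: digits that only look like a date
    return found
```

The checksum and the calendar check act as filters, so the regex can be loose about spacing without flooding the result with false matches. Checking the length expected for each country code would make the IBAN part stricter.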

I thought that some regex could be fed to the OCR to force it to “watch out” for those patterns…