I’m testing OCR services for invoice scanning and would like to know the best practices for pattern lookup.
Tesseract has an option to add patterns before scanning.
I’m interested in patterns like date, VAT #, VAT percentage(s), IBAN #. My current solution is to run regexes over the scanned text. The results are satisfactory but far from 100% correct, so I wouldn’t automate it.
Would it be possible to define those patterns before the scan/OCR and to have that info grouped in the JSON result?
I would be using your service as a PRO subscriber.
Tesseract patterns are simply a sort of “regular expression”, so I don’t think they work any better than applying regular expressions to the OCR.space output. (Or do they?)
“…but far from being correct 100% and that I wouldn’t automate.”
How do your regexes fail? The best solution I know of, for any kind of OCR, is to optimize and tweak the regexes.
The regexes don’t fail, but the OCR results are inconsistent. For example, IBAN numbers may come back with space(s), which complicates the regex, and some dates are recognised without the period. It is that kind of detail that complicates things…
I thought that some regex could be fed to the OCR to force it to “watch out” for those patterns…
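For what it's worth, here is a minimal Python sketch of the "tweak the regex" approach for the cases mentioned above: it tolerates optional spaces inside IBANs and missing or wrong separators in dates, then groups the hits into a JSON-style dict. Everything in it is an assumption for illustration: the function names (`extract_fields`, `iban_is_valid`), the output keys, the day-month-year date order, and the plain-text input, which stands in for whatever text the OCR returns. It does not use any OCR.space or Tesseract feature.

```python
import json
import re
from datetime import date

# IBAN candidate: 2 letters, 2 check digits, then 11-30 alphanumerics,
# each optionally preceded by a single space (OCR may insert or drop spaces).
IBAN_RE = re.compile(r"\b[A-Z]{2} ?\d{2}(?: ?[A-Z0-9]){11,30}\b")

# Date candidate in day-month-year order (an assumption); each separator
# may be a period, space, slash, hyphen, or missing entirely.
DATE_RE = re.compile(r"\b(\d{1,2})[ ./-]?(\d{1,2})[ ./-]?((?:19|20)\d{2})\b")

# VAT percentage such as "19 %", "7%", or "5,5 %".
VAT_RE = re.compile(r"\b(\d{1,2}(?:[.,]\d{1,2})?) ?%")


def iban_is_valid(iban: str) -> bool:
    """Standard IBAN mod-97 check on a compact (space-free) IBAN."""
    rearranged = iban[4:] + iban[:4]
    # Letters map to numbers: A=10 ... Z=35 (same as base-36 digit values).
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def extract_fields(text: str) -> dict:
    """Collect IBANs, dates, and VAT percentages from OCR text."""
    ibans = []
    for match in IBAN_RE.finditer(text):
        compact = match.group().replace(" ", "")
        if iban_is_valid(compact):  # drop candidates that fail the checksum
            ibans.append(compact)

    dates = []
    for day, month, year in DATE_RE.findall(text):
        try:
            dates.append(date(int(year), int(month), int(day)).isoformat())
        except ValueError:
            pass  # not a real calendar date, e.g. day 37

    vat = [value.replace(",", ".") for value in VAT_RE.findall(text)]
    return {"ibans": ibans, "dates": dates, "vat_percentages": vat}


if __name__ == "__main__":
    sample = "Invoice date: 15 03.2021\nIBAN: DE89 3704 0044 0532 0130 00\nVAT 19 %"
    print(json.dumps(extract_fields(sample), indent=2))
```

The checksum and calendar checks are what make the loose patterns usable: the regex is allowed to over-match, and validation throws away the noise. One known limitation of this sketch is that an uppercase word following the IBAN on the same line can be swallowed by the pattern, in which case the checksum fails and the IBAN is dropped rather than reported wrongly.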