Single-Character Recognition, Labels Added to Data, LineText vs. WordText

w0wbagger · May 25, 2023, 3:27am

Hey there. Just started using OCR API and am finding it really great. I have a few questions, tho’.
I’m OCR’ing a FedEx shipping document (I have IsTable = true), and it’s working pretty well, but it often misses a single character in a field (e.g. the “Packaging” field has a single “1” in it, which it often misses). Is there a way to make it see these better? Image quality is excellent, as the document is computer-generated by FedEx, not scanned.

Also, there are places where the label (e.g. “Shipper”) and the actual data (the shipper name) are combined into a single piece of text. Is there a way to help the engine distinguish one from the other? The fonts are different, and there’s a fair amount of space between them. Are these variables that can be tweaked for better recognition?

And finally, what’s the difference between LineText and WordText in the JSON files I’m receiving back? They appear to be identical 100% of the time.

Thanks!
Ian

ocr-api-team · May 25, 2023, 10:34pm

Hi, are you using Engine 1 or Engine 2?

Also, can you please upload a sample image? Then we can test it here.

w0wbagger · May 25, 2023, 10:46pm

Engine 2. Can I upload it to you in a private message?

ocr-api-team · May 26, 2023, 9:20am

Thanks, I looked at the PDF:

there are places where the label (e.g. “Shipper”) and the actual data (the shipper name) are combined into a single piece of text.

When I test with your PDF, this works OK:

"LineText": "Shipper/Expéditeur: GRAINGER 003 INT",

(If it fails, you could search for “:” and split the string there?)
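A minimal sketch of that split in Python, assuming the label always comes before the first colon. The function name `split_label` is only illustrative and not part of the API:

```python
def split_label(line_text, sep=":"):
    """Split 'Label: data' at the first separator.

    Returns (label, data). If the separator is missing, the label is None
    and the whole (stripped) line is returned as data.
    """
    label, found, data = line_text.partition(sep)
    if not found:
        return None, line_text.strip()
    return label.strip(), data.strip()


print(split_label("Shipper/Expéditeur: GRAINGER 003 INT"))
# ('Shipper/Expéditeur', 'GRAINGER 003 INT')
```

Only the first colon is used, so any colon inside the data itself is left untouched.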

misses a single character in a field (e.g. the “Packaging” field has a single “1” in it, which it often misses)

Single-digit OCR is indeed a challenge, and it sometimes fails.

And finally, what’s the difference between LineText and WordText in the JSON files I’m receiving back?

For Engine 2, both are currently the same. In future updates, LineText will still contain the complete line (as it does now), while WordText will contain the individual words and their bounding boxes (as is already the case for Engine 1).
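To illustrate the line/word distinction, here is a rough sketch that walks a response and lists each word with its box. Only LineText and WordText are named in this thread; the other keys (`ParsedResults`, `TextOverlay`, `Lines`, `Words`, `Left`, `Top`, `Width`, `Height`) are an assumed layout, and the sample values are made up, so check them against your own JSON:

```python
def iter_words(response):
    """Yield (line_text, word_text, box) for every word in a response dict.

    box is (left, top, width, height); missing keys come back as None.
    The key names below are an assumed response layout, not confirmed here.
    """
    for result in response.get("ParsedResults", []):
        overlay = result.get("TextOverlay") or {}
        for line in overlay.get("Lines", []):
            for word in line.get("Words", []):
                box = (word.get("Left"), word.get("Top"),
                       word.get("Width"), word.get("Height"))
                yield line.get("LineText"), word.get("WordText"), box


# Invented sample data, shaped like a per-word response.
sample = {"ParsedResults": [{"TextOverlay": {"Lines": [{
    "LineText": "Shipper/Expéditeur: GRAINGER 003 INT",
    "Words": [
        {"WordText": "Shipper/Expéditeur:", "Left": 10, "Top": 20, "Width": 120, "Height": 12},
        {"WordText": "GRAINGER", "Left": 180, "Top": 20, "Width": 70, "Height": 12},
        {"WordText": "003", "Left": 256, "Top": 20, "Width": 24, "Height": 12},
        {"WordText": "INT", "Left": 286, "Top": 20, "Width": 22, "Height": 12},
    ],
}]}}]}

for line_text, word_text, box in iter_words(sample):
    print(word_text, box)
```

While both fields are identical, each line would simply yield one "word" equal to the whole line, so the same loop should keep working before and after the change.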


w0wbagger · May 26, 2023, 9:54pm

I appreciate you looking at this for me. I’m not sure I understand whether the label and the data being combined is expected or not! In other places in the document, the label is separated from the data. I already split the data at the first colon detected when post-processing the JSON, but if the engine misses the colon, I’m going to have a problem. Once you split WordText, I guess I’ll be able to traverse the words to find the proper beginning of the data.
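For the missed-colon case, one possible fallback is to match the start of the line against labels already known to appear on the form. This is only a sketch: `KNOWN_LABELS` and `strip_known_label` are hypothetical names, and the label list would have to come from the actual document layout:

```python
# Hypothetical list of labels expected on the form; fill in from the real document.
KNOWN_LABELS = ["Shipper/Expéditeur", "Packaging"]


def strip_known_label(line_text, labels=KNOWN_LABELS):
    """Return (label, data) if the line starts with a known label.

    Matching is case-insensitive and does not need the colon, so it still
    works when OCR drops it. Returns (None, stripped line) if nothing matches.
    """
    text = line_text.strip()
    # Try longer labels first so a short label cannot shadow a longer one.
    for known in sorted(labels, key=len, reverse=True):
        if text[:len(known)].casefold() == known.casefold():
            data = text[len(known):].lstrip(" :\t").rstrip()
            return known, data
    return None, text


print(strip_known_label("Shipper/Expéditeur GRAINGER 003 INT"))
# ('Shipper/Expéditeur', 'GRAINGER 003 INT')
```

The same call handles lines where the colon was recognized, since any leading colon is stripped from the data.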

As for single-digit OCR, shouldn’t Engine 2 be able to detect them, based on the link you indicated? [5 is now incorporated into 2?] Is there anything I can do to help it find them? Increase contrast tolerance or something?

Thanks for the quick response. One of my clients was looking at Alteryx, but your engine generates results that are almost as good at a fraction of the cost, and I’ve been able to cobble together a batch/post-processing program that does what they need very quickly. We’ll be purchasing a subscription shortly. Does the daily page limitation apply to all subscriptions?