OCR Space Not Creating Correct Searchable PDFs

lukshkumar · September 17, 2020, 9:22am

Dear Team,

Could you please look into the issue described below as soon as possible? I want to buy the PRO PDF paid version of OCR Space, but I am stuck on this issue.

I want to use the OCR Space API commercially, but I am facing a serious issue when I upload a PDF that is already searchable but contains an image. For instance, suppose the company slogan in the PDF is an image while all other content is already searchable text. I want a fully searchable PDF in which the text in that image is also searchable. What happens instead is that OCR Space converts the image to searchable text, but it also treats the existing searchable text as if it were not searchable and writes it over the original again, so that text appears in the PDF twice.

For reference, you can upload any PDF that contains an image as well as some searchable text, pass it through OCR Space, and generate a searchable PDF. Then open the generated PDF and search for any text that was already searchable: you will find twice as many occurrences as there actually are.

Thank you for your cooperation.

admin · September 17, 2020, 1:31pm

Technically the OCR API takes a screenshot of every PDF page and then converts this screenshot to text.

I totally agree with you that for PDFs that already have a text layer it would be best to find the images in a PDF and then only convert the images to text, and merge this back into the existing text layer.

However, because of the way PDFs are structured, that is a very difficult task that we have not solved yet. Actually, I think nobody has solved this task in a reliable way yet. It is not even possible to reliably identify whether a PDF is scanned.

On the other hand, is it a big problem for your use case that the text now appears twice in the document? Maybe we can find a workaround for your specific use case.
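One rough way to approximate the "is this page scanned?" check mentioned above is to look at how much text a parser can extract from each page. This is only a heuristic sketch and, as noted, not reliable. It assumes the page texts were already extracted by some PDF library (not shown), the function name and the threshold of 20 characters are arbitrary choices for illustration, and a page that mixes real text with an image logo would still count as having a text layer.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 0-based indexes of pages whose extracted text is nearly empty.

    page_texts: list of strings, one per page, produced by any PDF parser
    (assumption: extraction happens elsewhere).
    min_chars: arbitrary cut-off below which a page is treated as image-only.
    """
    needs_ocr = []
    for index, text in enumerate(page_texts):
        # Count only letters and digits so whitespace does not hide an empty page.
        visible = sum(1 for ch in text if ch.isalnum())
        if visible < min_chars:
            needs_ocr.append(index)
    return needs_ocr
```

Pages flagged this way could be sent to OCR while the others are left untouched, which avoids a doubled text layer on pages that are fully searchable but does not help with mixed pages.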

lukshkumar · September 18, 2020, 11:30am

Thank you for your prompt response to my query. Okay, so let me explain my use case.

We are using the OCR Space API to convert invoices (either scanned PDFs or searchable PDFs) into searchable PDFs. We then use a PDF parser to read the PDF data and load it into a database. The problem arises when a PDF is mostly searchable already but contains a few images that need to be converted to text: when I pass such a PDF to OCR Space, it returns the correct text for the images but messes up the text that was already searchable.

So, whenever I use my PDF parser to read that data, it returns two occurrences of each word. The problem is that these double occurrences do not follow a pattern. For instance, for the word “EACH” in the PDF, after making it searchable and then parsing it, I would get “EEAACHCH” or “EEAACCHH” and so on.
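The garbled tokens described above look like two copies of the same word interleaved character by character. A minimal sketch, assuming the expected word is known in advance (for example from a list of invoice field labels), checks whether a parsed token is such an interleaving. The function name `is_doubled` is hypothetical and is not part of OCR Space or any PDF parser.

```python
def is_doubled(token: str, word: str) -> bool:
    """Return True if `token` is two copies of `word` interleaved in any order."""
    n = len(word)
    if len(token) != 2 * n:
        return False
    # dp[i][j] is True when token[:i + j] can be split into
    # word[:i] (first copy) and word[:j] (second copy).
    dp = [[False] * (n + 1) for _ in range(n + 1)]
    dp[0][0] = True
    for i in range(n + 1):
        for j in range(n + 1):
            if not dp[i][j]:
                continue
            k = i + j  # index of the next unread character in token
            if i < n and token[k] == word[i]:
                dp[i + 1][j] = True
            if j < n and token[k] == word[j]:
                dp[i][j + 1] = True
    return dp[n][n]
```

With this check, both `EEAACHCH` and `EEAACCHH` are recognised as doubled forms of `EACH`, while a plain `EACH` is not. It only detects the problem for known words; it does not recover the original text in general.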

Therefore, to resolve this issue, I tried the same PDF on some other online platforms that convert PDFs to searchable PDFs (listed below). They did not write the text twice: they did not duplicate the text that was already searchable, and they also converted the images within the PDF to text correctly.

Please have a look at these platforms and see if you can get an idea of how they are doing it. The problem is that they don’t have an API or commercial support, which is why we cannot use them.

Links of platforms:

https://freepdfonline.com/ocrpdf/

Try these platforms and let me know if you can be of any assistance. I would really appreciate your cooperation in this regard.

Thank you so much for your support.

Best Regards,
Luksh Kumar.

admin · September 18, 2020, 3:00pm

This is very interesting information. Can you please attach 1-2 PDFs that you used for testing? Of course, you can also email them to me directly at team AT a9t9.com - in your email just mention this forum post.

lukshkumar · September 18, 2020, 3:40pm

Sure. I tried to upload the PDFs here, but it says new users can’t upload attachments, so I have attached a Google Drive link that has three PDFs. The PDF named “Invoice” is the one containing the logo image with the text “adidas” in it, while all other text is already searchable.

Additionally, I have uploaded the searchable PDF that I got from OCR Space as well as the searchable PDF that I got from freepdfonline.com. Open both PDFs and search for “Product”: the OCR Space PDF shows twice the actual number of occurrences, while the freepdfonline PDF does not have this problem.
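The same comparison can be made on extracted text instead of in a PDF viewer. A minimal sketch, assuming the text of each PDF has already been extracted to a string by some PDF parser (the extraction step is not shown, and both function names are made up for illustration):

```python
import re

def count_word(text: str, word: str) -> int:
    """Count case-insensitive, non-overlapping occurrences of `word` in `text`."""
    return len(re.findall(re.escape(word), text, flags=re.IGNORECASE))

def doubling_ratio(original_text: str, ocr_text: str, word: str):
    """Occurrences in the OCR output divided by occurrences in the original.

    Returns None when the word does not occur in the original text.
    """
    base = count_word(original_text, word)
    if base == 0:
        return None
    return count_word(ocr_text, word) / base
```

A ratio close to 2.0 would suggest that the text layer was written twice. If the parser returns the two layers interleaved character by character, the count can drop to zero instead, so a ratio of 0 is also worth flagging.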

I agree that OCR Space’s recognition of “adidas” is far better than the other tool’s, but it would be awesome if you could resolve this issue of duplicated text.

Link: PDF Invoices - Google Drive

Thank you.

admin · September 18, 2020, 5:44pm

Thanks a lot for the details. We will investigate this further next week. If others can do it, I see no reason why we would not be able to do it.

lukshkumar · September 20, 2020, 4:37pm

Great, I will be looking forward to your response on this issue. Thank you so much for your cooperation and please let me know whenever you guys have something that can help me.

Best Regards,
Luksh Kumar.