OCR Space Not Creating Correct Searchable PDFs

lukshkumar · September 18, 2020, 11:30am

Thank you for your prompt response to my query. Okay, so let me explain my use case.

We use the OCR Space API to convert invoices (either scanned PDFs or already-searchable PDFs) into searchable PDFs. We then use PDF Parser to parse the PDF data and load it into our database. The problem arises with PDFs in which most of the data is already searchable but a few embedded images still need to be converted to text. When I pass such a PDF to OCR Space, it returns the correct text for the images, but it garbles the text that was already searchable.

As a result, whenever I use PDF Parser to read that data, it returns two occurrences of each word. These double occurrences do not follow a fixed pattern. For instance, for the word “EACH” in the PDF, after making the PDF searchable and then parsing it, I get “EEAACHCH” or “EEAACCHH”, and so on.
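To make the symptom concrete, here is a minimal Python sketch of how such tokens could be detected on the parser side. It assumes each garbled token is the original word interleaved with a second copy of itself, which matches both examples above but is only my guess at what is happening. The function name `undouble` is hypothetical and is not part of OCR Space or PDF Parser.

```python
def undouble(token):
    """Return w if `token` is an interleaving of w with a second copy of w,
    otherwise None. Uses backtracking, so it is meant for short tokens only."""
    if len(token) % 2:
        return None  # two copies of a word always give an even length
    half = len(token) // 2

    def search(i, first, matched):
        # first:   characters assigned to the first copy so far
        # matched: how many characters of `first` the second copy has repeated
        if i == len(token):
            return first if matched == len(first) else None
        ch = token[i]
        # Option 1: ch is the second copy repeating the next unmatched character.
        if matched < len(first) and first[matched] == ch:
            result = search(i + 1, first, matched + 1)
            if result is not None:
                return result
        # Option 2: ch extends the first copy.
        if len(first) < half:
            return search(i + 1, first + ch, matched)
        return None

    return search(0, "", 0)


print(undouble("EEAACHCH"))  # EACH
print(undouble("EEAACCHH"))  # EACH
print(undouble("EACH"))      # None
```

This is only a diagnostic aid, not a fix: a genuine word that happens to look doubled (for example "MAMA") would also be collapsed, so the real solution still has to come from not writing the text layer twice.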

To work around this, I tried the same PDF on some other online platforms that convert PDFs to searchable PDFs (mentioned below). They did not write the text twice, which means they did not duplicate the searchable text on top of the existing searchable text, and they still converted the images within the PDF to text correctly.

Please have a look at these platforms and see if you can get an idea of how they handle this correctly. Unfortunately, they do not offer an API or commercial support, which is why we cannot use them.

Links to the platforms:

https://freepdfonline.com/ocrpdf/

Try these platforms and let me know if you can be of any assistance. I would really appreciate your cooperation in this regard.

Thank you so much for your support.

Best Regards,
Luksh Kumar.