Create Searchable PDF - OCR text appended instead of replacing

tiste · November 18, 2019, 9:48am

I am trying to convert my PDF into a new, searchable PDF with better OCR results.

The text extracted from the original PDF is as follows:

Hoofddorp, z6 november zorS
Geachte heer, mevrouw,
Hierbij ontvangt u uw nieuwe tankpas met in deze brief relevante informatie ten
aanzien van uw ¡nternationale MTC/DKV-pas.

With the OCR software, the output is as follows:

Hoofddorp, 26 november 2018
Geachte heer, mevrouw,
Hierbij ontvangt u uw nieuwe tankpas met in deze brief relevante informatie ten
aanzien van uw internationale MTC/DKV-pas.

However, when I parse the converted searchable PDF to text, I get both the original text (the first example) and the OCR’ed text (the second example) appended to it.
I am only interested in the OCR’ed text.

For example, when I copy the content of the PDF into a Word document, I get both outputs: the original text is pasted first and the OCR’ed text is appended after it.
Likewise, a parser such as Tika returns both texts, one appended to the other.

Is there any way to create a searchable PDF in which the OCR text replaces the original text, instead of being appended to it?

Tika output when parsing the original (unconverted) PDF:

{'status': 200, 'content': 'Hoofddorp, z6 november zorS\n\nGeachte heer, mevrouw,\n\nHierbij ontvangt u uw nieuwe tankpas met in deze brief relevante informatie ten\naanzien van uw ¡nternationale MTC/DKV-pas.'}

Current Tika output when parsing the converted PDF:

{'status': 200, 'content': 'Hoofddorp, z6 november zorS\n\nGeachte heer, mevrouw,\n\nHierbij
ontvangt u uw nieuwe tankpas met in deze brief relevante informatie ten\naanzien van uw ¡nternationale MTC/DKV-pas.\n\nlnternationale MTC/DKV-pas.
Hoofddorp, 26 november 2018\n\nGeachte heer, mevrouw, Hierbij ontvangt u uw nieuwe tankpas met in deze brief relevante informatie ten\naanzien van uw internationale MTC/DKV-pas.'}

Requested output:

{'status': 200, 'content': 'Hoofddorp, 26 november 2018\n\nGeachte heer, mevrouw, Hierbij ontvangt u uw nieuwe tankpas met in deze brief relevante informatie ten\naanzien van uw internationale MTC/DKV-pas.'}
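Until the conversion itself replaces the text layer, one possible stopgap on the parsing side is to strip the original text from the combined output. The sketch below assumes the converted PDF's text begins with the original text verbatim and that the OCR text follows it; `strip_original_text` is a hypothetical helper written for this post, not part of Tika or of any OCR tool.

```python
def strip_original_text(combined: str, original: str) -> str:
    """Return `combined` without a leading copy of `original`.

    Assumption: the converted PDF yields the original text first and the
    OCR text after it. If that does not hold, `combined` is returned
    unchanged (apart from surrounding whitespace).
    """
    original = original.strip()
    combined = combined.strip()
    if original and combined.startswith(original):
        # Drop the original text and the whitespace that separated it
        # from the OCR text.
        return combined[len(original):].lstrip()
    return combined


# Shortened strings modelled on the example above.
original = "Hoofddorp, z6 november zorS\n\nGeachte heer, mevrouw,"
combined = original + "\n\nHoofddorp, 26 november 2018\n\nGeachte heer, mevrouw,"
ocr_only = strip_original_text(combined, original)
print(ocr_only)
```

This only hides the symptom: the PDF still carries two text layers, so copy and paste from a viewer would still give both texts.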

admin · November 18, 2019, 10:53pm

It sounds like the PDF you want to OCR already contains the text as actual text. In that case no OCR is required: the PDF is already searchable, and you can extract the text from it directly.
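As a rough illustration of getting and searching the text directly, here is a minimal Python sketch. It assumes the third-party `pypdf` package is installed for the extraction step; `find_in_pages` and `pdf_page_texts` are hypothetical helper names chosen for this example, and the page strings are sample data taken from this thread.

```python
from typing import Iterable, List, Tuple


def find_in_pages(page_texts: Iterable[str], query: str) -> List[Tuple[int, str]]:
    """Return (page_number, line) pairs whose line contains `query`.

    Matching is case-insensitive; page numbers start at 1.
    """
    needle = query.casefold()
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            if needle in line.casefold():
                hits.append((page_number, line.strip()))
    return hits


def pdf_page_texts(path: str) -> List[str]:
    """Extract the text layer of each page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party; imported only when called

    return [page.extract_text() or "" for page in PdfReader(path).pages]


# Sample page texts; for a real file use pdf_page_texts("letter.pdf"),
# where "letter.pdf" is a placeholder path.
pages = [
    "Hoofddorp, 26 november 2018\nGeachte heer, mevrouw,",
    "uw internationale MTC/DKV-pas.",
]
hits = find_in_pages(pages, "mtc/dkv")
print(hits)
```

No OCR is involved here: the search runs over whatever text layer the PDF already has, so its quality depends entirely on that layer.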

OCR is helpful when the text is stored as an image inside the PDF, as is typical for PDFs created from scans.
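One way to tell the two cases apart is to check whether text extraction returns anything at all. The sketch below works on a dict shaped like the Tika output quoted earlier in this thread (`{'status': ..., 'content': ...}`); `needs_ocr` is a hypothetical helper, and the `min_chars` threshold is an arbitrary value chosen for illustration.

```python
def needs_ocr(parsed: dict, min_chars: int = 20) -> bool:
    """Guess whether a parsed PDF still needs OCR.

    `parsed` is a dict with a 'content' key holding the extracted text,
    or None when no text was found. `min_chars` is an arbitrary
    threshold for this sketch.
    """
    content = parsed.get("content") or ""
    # Count only non-whitespace characters, so a page of blank lines
    # does not look like real text.
    visible = sum(1 for ch in content if not ch.isspace())
    return visible < min_chars


scanned = {"status": 200, "content": None}
digital = {
    "status": 200,
    "content": "Hoofddorp, 26 november 2018\n\nGeachte heer, mevrouw,",
}
print(needs_ocr(scanned))
print(needs_ocr(digital))
```

Note that this check only detects a missing text layer. It cannot tell a poor-quality text layer, such as the one in the first post, from a good one.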

Michelp76 · September 6, 2021, 1:32pm

Hello admin,

Would you be kind enough to explain how that is done, i.e. searching a text PDF without using OCR?
I can’t seem to find any documentation about that.

Mojtaba_Ghasemnataj · February 12, 2022, 8:49am

I need Persian OCR.
Does anyone know of a site other than https://www.eboo.ir/?