Converts 1 = i and sometimes adds space

DajAng · July 16, 2020, 11:11pm

Hello,

I am trying to convert a PDF file to a text file. In the generated text, the number “1” is sometimes converted to the letter “i”, and spaces are sometimes inserted between characters.

In the example below, the item name in the PDF file is “WSS101”, but it was converted to “WSSIOI”. For the items below it, “WSS102” was converted to “WSS 1 02”.

Please help.

This is the PDF File: https://4148197.app.netsuite.com/core/media/media.nl?id=2418266&c=4148197&h=d48fb80ecac820966b8e&_xt=.pdf

This is the parsed text:

****** Result for Image/Page 1 ******
cabtec
where design becomes reality
Supplier
Fit Ltd
po BOX 15
WELLSFORD 0940
Account #
173
Cabtec Limited
1034 State Highway 12
Maungaturoto 0583
P 09 431 0022 | E accounts@cabtec.co.nz
Purchase Order
Purchase Order No.
Date
Ship To
Cabtec Limited
1034 State Highway 12
Maungaturoto, 0583
66350
12/06/2020
Terms
20th EOM
PO No. is required on all Invoices
Expected
1 6/06/2020
Ordered By
Heather
Please supply the following goods in good order and condition and charge to our account.
Our order number must appear on all documents relating to this order.
Description
WSSIOI Wardrobe Lift Right
WSS 1 02 Wardrobe Lift Left
WSS 1 03 Spacer
86.015R2 Hingefix Square Drive Drive No.2 6x5/8 (16mm)
Pk. I OOO
SSC Shelf Stud 5mm - Plastic
Special Instructions / Notes:
Qty
1.00
1.00
2.00
5.00
5.00
Unit
ea
ea
ea
pk/ OOO
pk/ OOO
Rate
149.60
1 49.60
10.76
14.74
39.82
Job #
CL6065
CL6065
CL6065
STOCK
STOCK
Subtotal
GST
Total
Amount
1 49.60
1 49.60
21.52
73.70
199.10
$593.52
$89.04
$682.56

DajAng · July 16, 2020, 11:14pm

Before posting, I read other related topics, and one suggestion was to use Engine 2.

I then tried Engine 2, and got the following conversions:

WSS101 = WSSi01
WSS102 = WSSi02
WSS103 = WSST03

admin · July 17, 2020, 9:35am

If OCR needs to be used:

I confirmed your result. Engine 2 works better here, but it is not perfect. The problem is that in this font, “i” and “1” look very similar.

The only quick solution I can think of is to add some post-processing on your side. So if you know that the order numbers are always in the “WSSxxx” format, then add some logic to replace “i” and “T” with “1”.
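
A minimal sketch of such post-processing in Python, assuming every item code is “WSS” followed by exactly three digits. The look-alike table and the function name `fix_item_codes` are illustrative assumptions based only on the misreads seen in this thread, not a complete list:

```python
import re

# Assumed look-alike table: letters the OCR engines produced (or might
# produce) in place of digits. Extend it for your own font.
LOOKALIKES = str.maketrans({"i": "1", "I": "1", "l": "1", "T": "1",
                            "O": "0", "o": "0"})

# "WSS" followed by three digit-like characters, each optionally
# preceded by a stray space (e.g. "WSS 1 02").
ITEM_CODE = re.compile(r"\bWSS((?: ?[0-9iIlTOo]){3})\b")


def fix_item_codes(text):
    """Rewrite misread item codes into the WSSxxx format."""
    def repair(match):
        digits = match.group(1).replace(" ", "").translate(LOOKALIKES)
        return "WSS" + digits
    return ITEM_CODE.sub(repair, text)


print(fix_item_codes("WSSIOI Wardrobe Lift Right"))
print(fix_item_codes("WSS 1 02 Wardrobe Lift Left"))
print(fix_item_codes("WSST03 Spacer"))
```

Because the pattern is anchored on the “WSS” prefix, amounts such as “1 49.60” are left alone and would need their own rule.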

Or you can run the conversion with both Engine 1 and Engine 2, and trigger an alarm for manual processing if the results of the two engines do not match (a confidence test).
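
As a rough sketch of that comparison, assuming you already hold the parsed text from both engines as strings (the function names here are hypothetical). Whitespace is ignored so that a spacing difference such as “WSS 1 02” versus “WSS102” does not raise the alarm on its own:

```python
import re


def normalize(text):
    """Drop all whitespace so only character differences remain."""
    return re.sub(r"\s+", "", text)


def needs_manual_review(engine1_text, engine2_text):
    """Return True when the two OCR results disagree after normalization."""
    return normalize(engine1_text) != normalize(engine2_text)


# Engine 1 and Engine 2 readings of the same item line from this thread.
print(needs_manual_review("WSSIOI Wardrobe Lift Right",
                          "WSSi01 Wardrobe Lift Right"))
```

This only works if both engines return the lines in the same order. If they do not, compare individual fields (for example the item codes) instead of the whole text.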

Or, if you create the invoices yourself, use a different font that is better suited for OCR processing.

But actually no OCR required:

This specific PDF is not a scan of an invoice, but an original PDF that already includes a text layer, so OCR is not required! You can convert the PDF to text with simple command-line PDF text extraction tools, for example:

pdftotext
pdftxtextract
Ghostscript

pdftotext is an open-source command-line utility for converting PDF files to plain text files—i.e. extracting text data from PDF-encapsulated files. It is freely available and included by default with many Linux distributions, and is also available for Windows as part of the Xpdf Windows port. Such text extraction is complicated as PDF files are internally built on page drawing primitives, meaning the boundaries between words and paragraphs often must be inferred based on their position on the page.

pdftotext is part of the Xpdf software suite. Poppler, which is derived from Xpdf, also includes an implementation of pdftotext. On most Linux distributions, pdftotext is included as part of the poppler-utils package.[1]
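
A small sketch of calling pdftotext from Python, assuming the `pdftotext` binary is installed and on the PATH. The helper names and the file name `order.pdf` are placeholders; `-layout` asks pdftotext to keep the physical layout of the page, and `-` as the output file sends the text to standard output:

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the pdftotext command line; "-" writes the text to stdout."""
    command = ["pdftotext"]
    if layout:
        command.append("-layout")  # keep columns roughly as on the page
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(pdftotext_command(pdf_path, layout),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Usage (requires pdftotext and a real file):
# print(extract_text("order.pdf"))
```

Since the text comes straight from the PDF text layer, an item name such as “WSS101” should come out exactly as it is stored in the file, with no look-alike substitutions.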

Test:
Test Customer - Jan_66350.pdf (47.8 KB)

E1:
1966880d-845a-433a-9f8b-92409cf800c9.pdf (138.7 KB)

E2:
7b447a39-2178-4094-9ef9-bc38f243d92f.pdf (138.1 KB)

DajAng · July 19, 2020, 8:58pm

Thank you for taking the time to look into this.

I agree. If we are going to use OCR, I’ll probably just add post-processing. I just wanted to make sure I am not missing any parameters I could use before adding the post-processing logic.

Thanks for the PDF to Text recommendation as well, I’ll look into this!

DajAng · July 21, 2020, 11:50pm

Hello

Do you by any chance know of an API for pdftotext or pdftxtextract that works as well as your OCR API and can be called just as easily, for example from Postman?

I know that this isn’t part of your support anymore, and I’m just trying my luck here.

Thank you!

admin · July 22, 2020, 9:23am

Do you “just” need the text, or also the exact (x,y) coordinates of each word?

DajAng · July 22, 2020, 8:29pm

Having the (x,y) coordinates would be ideal. I already use them with your API, and they are very helpful!
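
For reference, one possible way to get word coordinates without OCR, sketched under the assumption that the Poppler build of pdftotext is installed: its `-bbox` option writes an XHTML document with one `<word>` element per word, carrying `xMin`, `yMin`, `xMax` and `yMax` attributes. The function names below are hypothetical:

```python
import subprocess
import xml.etree.ElementTree as ET


def parse_bbox_xhtml(xhtml):
    """Return (text, x_min, y_min, x_max, y_max) for every <word> element."""
    root = ET.fromstring(xhtml)
    words = []
    for element in root.iter():
        # Tags carry the XHTML namespace: "{http://www.w3.org/1999/xhtml}word"
        if element.tag.rsplit("}", 1)[-1] != "word":
            continue
        box = tuple(float(element.get(key))
                    for key in ("xMin", "yMin", "xMax", "yMax"))
        words.append((element.text or "",) + box)
    return words


def extract_words(pdf_path):
    """Run `pdftotext -bbox` (Poppler) and parse its XHTML output."""
    result = subprocess.run(["pdftotext", "-bbox", pdf_path, "-"],
                            capture_output=True, text=True, check=True)
    return parse_bbox_xhtml(result.stdout)


# Usage (requires Poppler's pdftotext and a real file):
# for word in extract_words("order.pdf"):
#     print(word)
```

This runs locally rather than as a hosted API, so calling it from Postman would need a small web wrapper around it.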