E-mail address might be split across multiple lines

DW11 · February 7, 2023, 11:10am

Hi everybody,
I feed PDFs into the free OCR API and want to extract several values.
Sadly, because of the PDF layout, the e-mail address can be in 1 column or, if the address gets longer, it can be spread over 2 columns.

If it is in 1 column, everything works all right, since my regex can filter it perfectly.
When it is split across 2 columns, the regex can no longer identify it: the first part of the e-mail address appears very early in the OCR output, and only the second part comes after "E-Mail:".

Is there a way to fix this? I tried the different engines, but that didn't work.
I also tried to alter the regex, but I cannot get it to work any better.

ocr-api-team · February 7, 2023, 1:19pm

Hi, did you use the isTable=true parameter?

See also: table OCR

DW11 · February 7, 2023, 4:11pm

Yes, I did, but sadly it did not give the right results. The upper part of the e-mail address is read one line before the “E-Mail:” string.

Honeybuil · December 13, 2023, 1:28am

You could use a more flexible regex pattern that accounts for line breaks, i.e. one that is able to match the address even when it spans multiple lines.
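
A rough sketch of what that could look like in Python (the thread does not say which language is used, so that is an assumption). It also assumes the layout described above: the first fragment sits alone on the line directly before "E-Mail:" and the rest follows the label. The sample inputs, the pattern names and extract_email are made up for illustration, and the address pattern is deliberately simple.

```python
import re

# Optional fragment line directly above the label, then the label and its value.
# Assumption: the fragment is a single token with no spaces.
SPLIT_EMAIL = re.compile(
    r"(?:^[ \t]*(?P<head>[\w.+-]+(?:@[\w.-]*)?)[ \t]*\n)?"
    r"^[ \t]*E-Mail:[ \t]*(?P<tail>[\w.+@-]+)",
    re.MULTILINE,
)

# Simplified check for a complete address (not a full RFC 5322 validator).
FULL_EMAIL = re.compile(r"\w[\w.+-]*@[\w-]+(?:\.[\w-]+)+")


def extract_email(text):
    """Return the address next to 'E-Mail:', joining a fragment from the line above if needed."""
    for match in SPLIT_EMAIL.finditer(text):
        tail = match.group("tail")
        if FULL_EMAIL.fullmatch(tail):
            # Single-line case: the value after the label is already complete.
            return tail
        candidate = (match.group("head") or "") + tail
        if FULL_EMAIL.fullmatch(candidate):
            # Split case: fragment above the label plus the value after it.
            return candidate
    return None


# Hypothetical OCR outputs for the two layouts.
one_line = "Name: Jane Doe\nE-Mail: jane@example.com\nPhone: 123"
two_lines = "Name: Jane Doe\njane.very.long.name@\nE-Mail: example.com\nPhone: 123"

print(extract_email(one_line))
print(extract_email(two_lines))
```

One limitation: if the part after the label happens to look like a complete address on its own, this sketch cannot tell that a fragment is missing and returns only that part.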

Gewainte · December 13, 2023, 11:24am

Dealing with email addresses that span multiple lines can be a bit tricky, especially when using OCR. It’s good to hear that your Regex works when the email is in one column. To handle cases where it’s split into two columns, you might need to modify your approach.
One potential solution is to preprocess the text before applying Regex. You could try merging the lines where the email address is split into two columns into a single line, and then apply your Regex to extract the email address.
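
A minimal sketch of that preprocessing step in Python, under a few assumptions: the label is always "E-Mail:", the missing fragment is a single token on the line directly above it, and the existing regex looks roughly like SINGLE_LINE below. The function name merge_split_email and the sample text are hypothetical.

```python
import re

LABEL = "E-Mail:"
# Simplified check for a complete address (not a full validator).
FULL_EMAIL = re.compile(r"\w[\w.+-]*@[\w-]+(?:\.[\w-]+)+")
# A fragment candidate: one token made of address characters, no spaces.
FRAGMENT = re.compile(r"[\w.+@-]+")
# Stand-in for the existing single-line regex (assumed, not taken from the thread).
SINGLE_LINE = re.compile(r"E-Mail:\s*(\S+@\S+)")


def merge_split_email(text):
    """Pull a fragment from the line above 'E-Mail:' into the label line."""
    merged = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(LABEL):
            value = stripped[len(LABEL):].strip()
            # Only merge when the value after the label is not already complete.
            if merged and not FULL_EMAIL.fullmatch(value):
                previous = merged[-1].strip()
                if FRAGMENT.fullmatch(previous) and FULL_EMAIL.fullmatch(previous + value):
                    merged.pop()  # drop the fragment line
                    line = f"{LABEL} {previous}{value}"
        merged.append(line)
    return "\n".join(merged)


# Hypothetical OCR output with the address split across two lines.
ocr_text = "Name: Jane Doe\njane.very.long.name@\nE-Mail: example.com\nPhone: 123"
cleaned = merge_split_email(ocr_text)
found = SINGLE_LINE.search(cleaned)
print(found.group(1) if found else None)
```

Text where the address is already on one line passes through unchanged, so the same regex can be used for both layouts.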