OCR on Chinese subtitles in a video

Alca · March 25, 2020, 1:51pm

Hello,

I am trying to extract subtitles from two Taiwanese series, “我們與惡的距離: The world between us” and “想見你: Someday or one day” (official translations, not literal). As with most Mandarin series, the subtitles are hardcoded, and I am looking for a way to extract them into an .srt or .csv file to produce workable text. More generally, it would be great to be able to retrieve the subtitles from any series or movie in Mandarin, since they almost always have hardcoded subtitles.

I have seen your demo on YouTube (Copyfish Demo: Copy and Paste Text from Images - YouTube), and it looks pretty close to what I want to do, except that it would need to work on VLC (or any other player). Is there also a way to save the full OCRed text somewhere? (I do not need the subtitle timing in this case, only the text.) Is there a way to apply OCR to the video itself?
Is there also a way to know when the OCR is not completely sure, so that I can come back and check it manually?

This is for a university project: once the text is extracted, I want to do some vocabulary analysis on it using R, not translation.

admin · March 25, 2020, 8:27pm

Hi, how are you currently extracting the text? With Copyfish, the OCR API or with screen scraping?

Alca · March 25, 2020, 11:18pm

Hi,
I was trying to use Copyfish, but I am definitely open to using a different method and investing some time to make it work, if there is a way.

admin · March 27, 2020, 2:42pm

You can use the screen scraping feature of UI.Vision RPA software to automate this task, including saving the text.

The tricky part is to detect when the subtitles change. I am not sure how to detect that reliably…
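
One possible workaround, sketched here under several assumptions rather than as a tested recipe: instead of detecting the exact moment a subtitle changes, grab a frame at a fixed interval (say every half second), OCR the subtitle area of each frame, and then collapse consecutive results that are nearly identical. The sketch below only covers that last step. It assumes the OCR step has already produced a list of (time in seconds, text, confidence) tuples and that the engine reports a confidence score on a 0 to 100 scale (Tesseract, for example, reports per-word confidences). The function name `collapse_subtitles` and the two thresholds are made up for illustration and would need tuning. Low-confidence lines are flagged so they can be checked by hand.

```python
from difflib import SequenceMatcher


def collapse_subtitles(frames, same_ratio=0.8, min_conf=60.0):
    """Merge repeated per-frame OCR results into distinct subtitle lines.

    frames: iterable of (time_sec, text, confidence), in time order.
    same_ratio and min_conf are illustrative thresholds, not tuned values.
    """
    lines = []
    on_screen = False  # True while the last kept line is still being shown

    for time_sec, text, conf in frames:
        text = text.strip()
        if not text:
            # Empty OCR result: treat it as "no subtitle on screen".
            on_screen = False
            continue

        if on_screen:
            similarity = SequenceMatcher(None, lines[-1]["text"], text).ratio()
            if similarity >= same_ratio:
                # Same subtitle as the previous frame: keep the reading
                # the OCR engine was most confident about.
                if conf > lines[-1]["conf"]:
                    lines[-1]["text"] = text
                    lines[-1]["conf"] = conf
                continue

        # A new subtitle has appeared.
        lines.append({"start": time_sec, "text": text, "conf": conf})
        on_screen = True

    # Flag uncertain lines for manual review.
    for line in lines:
        line["needs_review"] = line["conf"] < min_conf
    return lines


if __name__ == "__main__":
    # Made-up sample data standing in for real OCR output.
    sample = [
        (0.0, "我們與惡的距離", 91.0),
        (0.5, "我們與惡的距離", 88.0),
        (1.0, "", 0.0),
        (1.5, "想見你", 42.0),
        (2.0, "想見你", 55.0),
        (2.5, "你好嗎", 93.0),
    ]
    for line in collapse_subtitles(sample):
        mark = " (check manually)" if line["needs_review"] else ""
        print(f'{line["start"]:.1f}s {line["text"]}{mark}')
```

Comparing the OCR text rather than the raw pixels sidesteps the problem of the video changing behind the subtitle, at the cost of running OCR on every sampled frame. Since only the text is needed and not the timing, the result can be written out one line per row to a .csv file.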

franken · January 30, 2024, 4:07pm

There are free web-based solutions for extracting hardcoded subtitles. subtitleextractor.com works well.