OCR on Chinese subtitles in a video


I am trying to extract subtitles from two Taiwanese series, “我們與惡的距離: The World Between Us” and “想見你: Someday or One Day” (official translations, not literal). As with most Mandarin series, the subtitles are hardcoded, and I am looking for a way to extract them into an .srt or CSV file to produce workable text. More generally, it would be great to be able to retrieve the subtitles from any Mandarin series or movie, since these almost always have hardcoded subs.

I have seen your demo on YouTube ( https://www.youtube.com/watch?v=YNGkGWj8lA4 ), and it looks pretty close to what I want to do, except that it would need to work with VLC (or any other player). Also, is there a way to save the full OCRed text somewhere? (I do not need the subtitle timing in this case, only the text.) Is there a way to apply OCR to the video itself?
Is there also a way to know when the OCR is not completely sure, so that I can go back and check it manually?

The purpose is to do some vocabulary analysis in R once the text has been extracted, not translation. This is for a university project.

Hi, how are you currently extracting the text? With Copyfish, the OCR API or with screen scraping?

I was trying to use Copyfish, but I am definitely open to using a different method and investing some time to make it work, if there is a way.

You can use the screen scraping feature of UI.Vision RPA software to automate this task, including saving the text.

The tricky part is to detect when the subtitles change. I am not sure how to detect that reliably…
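
One possible approach, as a rough sketch rather than a tested solution: run OCR on the subtitle region at a fixed interval and treat the subtitle as changed only when the new text differs noticeably from the previous result. The Python below assumes the OCR results are already available as a list of strings, one per captured frame. The function name `dedupe_subtitles` and the 0.8 similarity threshold are illustrative choices, not part of UI.Vision or Copyfish.

```python
from difflib import SequenceMatcher


def dedupe_subtitles(ocr_lines, threshold=0.8):
    """Collapse consecutive near-identical OCR results into one subtitle line.

    ocr_lines: OCR text per captured frame, in playback order (assumed input).
    threshold: similarity ratio (0..1) above which two results are treated
               as the same subtitle; 0.8 is a guess that needs tuning.
    """
    subtitles = []
    previous = ""
    for text in ocr_lines:
        text = text.strip()
        if not text:
            # No subtitle on screen: reset, so that the same line showing up
            # again later is counted as a new subtitle.
            previous = ""
            continue
        similarity = SequenceMatcher(None, previous, text).ratio()
        if similarity < threshold:
            # Text differs enough from the previous frame: new subtitle.
            subtitles.append(text)
        previous = text
    return subtitles


# Example with made-up OCR output; the third frame has one misread character.
frames = ["我們與惡的距離", "我們與惡的距離", "我們與惡的距籬", "", "想見你", "想見你"]
for line in dedupe_subtitles(frames):
    print(line)
```

Comparing with a similarity ratio instead of strict equality means small OCR misreads between frames do not produce duplicate lines. The trade-off is that two genuinely different subtitles that happen to be very similar would be merged, so the threshold has to be checked against real output.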