I have a subtitle.srt file, but its content is not as accurate as it should be. I also have a set of paragraphs containing the accurate text, but they are not time-synchronized.
The inaccuracies can have several causes, including:
- capitalization mismatch,
- extra words or characters,
- missing words or characters,
- missing punctuation
- etc.
Which approach can I use to fix the srt file with the ground truth text? Any algorithm suggestion would be great, independent of the programming language.
I really appreciate any help you can provide.
Example:
subtitle.srt
1
00:00:00,000 --> 00:00:04,320
Heat wave is expect to continue for the next a few
2
00:00:04,320 --> 00:00:07,920
days, and the government`s warning people to take precautions. the heat wave is a reminder of the dangers of climating
3
00:00:07,920 --> 00:00:13,760
change, the need to take action to reduce greenhouse gas emission.
The ground truth text:
The heat wave is expected to continue for the next few days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate change, and the need to take action to reduce greenhouse gas emissions.
This is the expected result, subtitle_corrected.srt:
1
00:00:00,000 --> 00:00:04,320
The heat wave is expected to continue for the next few
2
00:00:04,320 --> 00:00:07,920
days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate
3
00:00:07,920 --> 00:00:13,760
change, and the need to take action to reduce greenhouse gas emissions.
This task is called alignment. It is a common task in fields such as biology (comparing two DNA sequences) and natural language processing (such as the current example with two parallel sources of subtitles).
The task has been studied for many years (going back to the 1970s) and many algorithms have been developed. These algorithms have been implemented in all major programming languages.
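To make the idea concrete, here is a minimal sketch of one of the classic dynamic programming algorithms from that period, Needleman-Wunsch (global alignment), applied to words instead of DNA bases. The function name `needleman_wunsch` and the scores (match 1, mismatch -1, gap -1) are illustrative assumptions, not taken from any particular library.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists.

    Returns a list of (token_from_a, token_from_b) pairs, where None marks a gap.
    The scoring values are illustrative defaults.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] has no counterpart in b
                score[i][j - 1] + gap,      # b[j-1] has no counterpart in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs
```

Aligning the inaccurate words against the correct ones gives, for every correct word, the inaccurate word (and therefore the subtitle block) it corresponds to. Comparison here is exact, so in practice you would probably lowercase the words and strip punctuation before aligning.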
As an example, there is the Python library text_alignment_tool, which implements the dynamic programming algorithms Smith-Waterman and Needleman-Wunsch. The code below shows how to use the Smith-Waterman algorithm (called LocalAlignmentAlgorithm in the library) on the subtitles. This algorithm produces the following kind of alignment using the correct text as query and the inaccurate text as target:

Most of the code is bookkeeping to generate a plain text version of the inaccurate subtitles while keeping track of character positions and timestamps, and reconstructing the subtitle format afterwards.
Output of the code, which shows the plain text version of the inaccurate subtitles, the list with information about each fragment, and the reconstructed subtitles:
This code is in Python, and the text_alignment_tool library has some particularities and is not very well documented (disclaimer: I'm not affiliated with the library in any way). The code serves as a proof of concept, but it may not be the best solution in all situations.

However, as mentioned above, these algorithms are widely available in libraries for many different programming languages, so with the right search terms (alignment, Needleman-Wunsch) you should be able to write something similar that fits your needs.
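As a dependency-free illustration of the same bookkeeping, here is a rough sketch that uses `difflib.SequenceMatcher` from the Python standard library. Note that this is a different matcher from Smith-Waterman or Needleman-Wunsch, and it works on words rather than characters. The helper names (`norm`, `parse_srt`, `correct_srt`) are made up for this example, and the sketch assumes a regular .srt layout with blank lines between blocks and at least one word of text in the file.

```python
import difflib
import re


def norm(word):
    # Compare words case-insensitively and without punctuation.
    return re.sub(r"[^a-z0-9]", "", word.lower())


def parse_srt(text):
    """Return a list of (index_line, timing_line, words), one entry per block."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", text.strip()):
        lines = chunk.splitlines()
        blocks.append((lines[0], lines[1], " ".join(lines[2:]).split()))
    return blocks


def correct_srt(srt_text, truth_text):
    blocks = parse_srt(srt_text)

    # Flatten the subtitle text and remember which block owns each word.
    srt_words, owner = [], []
    for block_no, (_, _, words) in enumerate(blocks):
        srt_words += words
        owner += [block_no] * len(words)

    truth_words = truth_text.split()
    matcher = difflib.SequenceMatcher(
        None,
        [norm(w) for w in srt_words],
        [norm(w) for w in truth_words],
        autojunk=False,
    )

    # Hand every ground-truth word to the block of its aligned subtitle word.
    new_words = [[] for _ in blocks]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for j in range(j1, j2):  # empty for "delete": those words are dropped
            if tag == "insert":
                # No counterpart in the subtitles: attach to the preceding word.
                i = max(i1 - 1, 0)
            else:
                # "equal" maps one to one, "replace" is spread proportionally.
                i = i1 + (j - j1) * (i2 - i1) // (j2 - j1)
            new_words[owner[i]].append(truth_words[j])

    # Rebuild the file with the original indices and timestamps.
    return "\n\n".join(
        f"{index}\n{timing}\n{' '.join(words)}"
        for (index, timing, _), words in zip(blocks, new_words)
    ) + "\n"
```

Calling `correct_srt(srt_text, truth_text)` returns the corrected file as a string. Words that exist only in the correct text are attached to the block of the preceding subtitle word, which is a heuristic: a word missing right at the start of a block will end up at the end of the previous one.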