An algorithm to fix a subtitle.srt file with ground truth paragraph

I have a subtitle.srt file whose text is not as accurate as it should be. I also have a set of paragraphs containing the accurate text, but they are not time-synchronized.

The inaccuracies have several causes, including:

  • capitalization mismatch,
  • extra words or characters,
  • missing words or characters,
  • missing punctuation
  • etc.

Which approach can I use to fix the srt file using the ground truth text? Any algorithm suggestion would be great, independent of the programming language.

I really appreciate any help you can provide.

Example:

subtitle.srt

1
00:00:00,000 --> 00:00:04,320
Heat wave is expect to continue for the next a few

2
00:00:04,320 --> 00:00:07,920
days, and the government`s warning people to take precautions. the heat wave is a reminder of the dangers of climating

3
00:00:07,920 --> 00:00:13,760
change, the need to take action to reduce greenhouse gas emission.

The ground truth text:

The heat wave is expected to continue for the next few days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate change, and the need to take action to reduce greenhouse gas emissions.

The expected result, subtitle_corrected.srt:

1
00:00:00,000 --> 00:00:04,320
The heat wave is expected to continue for the next few

2
00:00:04,320 --> 00:00:07,920
days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate

3
00:00:07,920 --> 00:00:13,760
change, and the need to take action to reduce greenhouse gas emissions.

Answer by Marijn

This task is called alignment. It is common in biology (comparing two DNA sequences) and in natural language processing (as in this example, with two parallel versions of the subtitles), among other fields.

The task has been studied for many years (going back to the 1970s) and many algorithms have been developed. These algorithms have been implemented in all major programming languages.
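To make the idea concrete, here is a minimal sketch of Needleman-Wunsch global alignment in plain Python, independent of any library. The function name needleman_wunsch and the scoring values (match=1, mismatch=-1, gap=-1) are assumptions chosen for illustration; practical implementations usually tune the scores. It accepts any two sequences, so it is shown here on words rather than characters:

```python
def needleman_wunsch(query, target, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences.

    Returns a list of (query_idx, target_idx) pairs; None marks a gap,
    i.e. an element that has no counterpart in the other sequence.
    The scoring values are illustrative assumptions.
    """
    n, m = len(query), len(target)

    def pair_score(i, j):
        # score for aligning query[i - 1] with target[j - 1]
        return match if query[i - 1] == target[j - 1] else mismatch

    # score[i][j] = best score for aligning query[:i] with target[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align both elements
                score[i - 1][j] + gap,                   # query element unmatched
                score[i][j - 1] + gap,                   # target element unmatched
            )

    # trace back from the bottom-right corner to recover one best alignment
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


correct = "the next few days".split()
inaccurate = "the next a few days".split()
print(needleman_wunsch(correct, inaccurate))
# -> [(0, 0), (1, 1), (None, 2), (2, 3), (3, 4)]
```

The extra word "a" in the inaccurate text is paired with None, which is how the alignment reports an insertion. The table costs time and memory proportional to the product of the two lengths, which is fine for subtitle-sized texts.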

As an example, there is the Python library text_alignment_tool, which implements two dynamic programming algorithms: Smith-Waterman (local alignment) and Needleman-Wunsch (global alignment). The code below shows how to use the Smith-Waterman algorithm (called LocalAlignmentAlgorithm in the library) on the subtitles. With the correct text as query and the inaccurate text as target, it produces an alignment like the following excerpt:

syntax: position, character in query > position, character in target
119 T > 114 t
120 h > 115 h
121 e > 116 e
122   > 117  
123 h > 118 h
124 e > 119 e
125 a > 120 a
126 t > 121 t
127   > 122  
128 w > 123 w
129 a > 124 a
130 v > 125 v
131 e > 126 e
[...]
162 o > 157 o
163 f > 158 f
164   > 159  
165 c > 160 c
166 l > 161 l
167 i > 162 i
168 m > 163 m
169 a > 164 a
170 t > 165 t
171 e > 168 g

Most of the code is bookkeeping to generate a plain text version of the inaccurate subtitles while keeping track of character positions and timestamps, and reconstructing the subtitle format afterwards.

# Import the tool and necessary classes
from text_alignment_tool import (
    TextAlignmentTool,
    StringTextLoader,
    LocalAlignmentAlgorithm,
)

subtitle_srt = """1
00:00:00,000 --> 00:00:04,320
Heat wave is expect to continue for the next a few

2
00:00:04,320 --> 00:00:07,920
days, and the government`s warning people to take precautions. the heat wave is a reminder of the dangers of climating

3
00:00:07,920 --> 00:00:13,760
change, the need to take action to reduce greenhouse gas emission."""

correct_text = """The heat wave is expected to continue for the next few days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate change, and the need to take action to reduce greenhouse gas emissions."""

# list with information about each subtitle fragment
fragments_info = []
# keep track of character positions for each sentence in the original subtitles 
current_pos = 0
# collect a list of just the text without the number and timestamp
all_lines = list()

# split on blank lines to get each block of number, timestamp and text
for fragment in subtitle_srt.strip().split("\n\n"):
    # split each block into number, timestamp and text; the text may span
    # several lines, which are joined into a single line
    fragment_nr, timestamp, *text_lines = fragment.splitlines()
    fragment_txt = " ".join(text_lines)
    # add the sentence to the list of sentences
    all_lines.append(fragment_txt)
    # end position (exclusive) of this fragment in the joined text
    newpos = current_pos + len(fragment_txt)
    # add number, timestamp and position to the list with information about fragments
    fragments_info.append({"number": fragment_nr, "timestamp": timestamp, "end_position": newpos})
    # the next fragment starts after the newline that joins the fragments
    current_pos = newpos + 1

# create a multi-line string with only the sentences to use for alignment
target_text = "\n".join(all_lines)

print(target_text)
print("---------------------------")
print(fragments_info)
print("---------------------------")

# load the two text strings for use in the alignment library
query_1 = StringTextLoader(correct_text)
target_1 = StringTextLoader(target_text)
# initialize the alignment for the two texts
aligner_1 = TextAlignmentTool(query_1, target_1)
# select an alignment algorithm
local_alignment_algorithm = LocalAlignmentAlgorithm()
# perform the actual alignment
aligner_1.align_text(local_alignment_algorithm)

# extract character-level alignment positions
alm = aligner_1.collect_all_alignments()
alm_idxs = alm[0][0]

# reconstruct the subtitles using the alignment

# keep track of the fragment number and the position in the correct text
fragment_nr = 0
start_pos = 0
# loop over each aligned character pair
for x in alm_idxs.query_to_target_mapping.alignments:
    # stop once all fragments have been written, to avoid an IndexError
    if fragment_nr >= len(fragments_info):
        break
    # if the position in the original subtitle (=target) is the end of a fragment
    # then write a subtitle block using the position in the correct text (=query)
    if x.target_idx >= fragments_info[fragment_nr]["end_position"] - 1:
        print(fragments_info[fragment_nr]["number"])
        print(fragments_info[fragment_nr]["timestamp"])
        print(correct_text[start_pos:x.query_idx + 1])
        # blank line that separates blocks in the srt format
        print()
        # the next fragment starts after the space that follows this one
        start_pos = x.query_idx + 2
        fragment_nr += 1

Output of the code, which shows the plain text version of the inaccurate subtitles, the list with information about each fragment, and the reconstructed subtitles:

Heat wave is expect to continue for the next a few
days, and the government`s warning people to take precautions. the heat wave is a reminder of the dangers of climating
change, the need to take action to reduce greenhouse gas emission.
---------------------------
[{'number': '1', 'timestamp': '00:00:00,000 --> 00:00:04,320', 'end_position': 50}, {'number': '2', 'timestamp': '00:00:04,320 --> 00:00:07,920', 'end_position': 169}, {'number': '3', 'timestamp': '00:00:07,920 --> 00:00:13,760', 'end_position': 236}]
---------------------------
1
00:00:00,000 --> 00:00:04,320
The heat wave is expected to continue for the next few

2
00:00:04,320 --> 00:00:07,920
days, and the government is warning people to take precautions. The heat wave is a reminder of the dangers of climate

3
00:00:07,920 --> 00:00:13,760
change, and the need to take action to reduce greenhouse gas emissions.

This code is in Python, and the text_alignment_tool library has some quirks and is not well documented (disclaimer: I'm not affiliated with the library in any way). The code serves as a proof of concept and may not be the best solution in every situation. In particular, it assumes that the fragments in the correct text are separated by a single space.

However, as mentioned above, these algorithms are widely available in libraries for many different programming languages, so with the right search terms (alignment, Needleman-Wunsch) you should be able to write something similar that fits your needs.
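As one possible starting point that needs only the Python standard library, the sketch below re-splits the correct text at the original fragment boundaries using difflib.SequenceMatcher on words. Note that difflib uses its own matching heuristic, not Needleman-Wunsch or Smith-Waterman, and the helper name resegment is hypothetical. It assumes the subtitle texts have already been extracted from the srt blocks, and that whitespace in the correct text may be normalized to single spaces.

```python
import difflib


def resegment(fragments, correct_text):
    """Distribute the words of correct_text over the original fragments.

    fragments: list of (inaccurate) subtitle texts, one per srt block.
    Returns a list of corrected texts with the same length.
    Hypothetical helper, sketched for illustration.
    """
    # flatten the inaccurate subtitles to words and remember where each fragment ends
    target_words, boundaries = [], []
    for text in fragments:
        target_words.extend(text.split())
        boundaries.append(len(target_words))  # exclusive end index of this fragment
    correct_words = correct_text.split()

    # compare case-insensitively so that "the" and "The" count as the same word
    matcher = difflib.SequenceMatcher(
        None,
        [w.lower() for w in target_words],
        [w.lower() for w in correct_words],
        autojunk=False,
    )

    # to_correct[i] = index in correct_words that corresponds to target word i
    to_correct = [0] * (len(target_words) + 1)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for i in range(i1, i2):
            # clamp so that deleted or replaced words stay inside the opcode's range
            to_correct[i] = min(j1 + (i - i1), j2)
    to_correct[len(target_words)] = len(correct_words)

    # cut the correct text at the position where each following fragment starts
    pieces, start = [], 0
    for end in boundaries:
        cut = to_correct[end]
        pieces.append(" ".join(correct_words[start:cut]))
        start = cut
    return pieces


fragments = [
    "Heat wave is expect to continue for the next a few",
    "days, and the government`s warning people to take precautions. "
    "the heat wave is a reminder of the dangers of climating",
    "change, the need to take action to reduce greenhouse gas emission.",
]
correct_text = (
    "The heat wave is expected to continue for the next few days, and the "
    "government is warning people to take precautions. The heat wave is a "
    "reminder of the dangers of climate change, and the need to take action "
    "to reduce greenhouse gas emissions."
)
for piece in resegment(fragments, correct_text):
    print(piece)
```

Working on words keeps the comparison small and avoids cutting in the middle of a word. Words that exist only in the correct text just before a boundary end up in the earlier fragment, which may or may not be what you want. The numbers and timestamps of the original blocks can then be written back unchanged next to each returned piece.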