I have been using a script on my Mac for years that was built with ccextractor. Since the app no longer works, I decided to switch to ffmpeg. My goal is to extract subtitles from a video file and have the resulting text in one paragraph without any line breaks. However, I've run into two issues that are beyond my skills:
Line Breaks Issue: My existing awk command doesn't seem to work anymore, and the output contains many line breaks instead of a single paragraph.
HTML Tags Issue: With some files, the extracted lines are embedded with HTML tags, such as <font face="Serif" size="18">Example line</font>.
Here's the code snippet I have, where I replaced the ccextractor line with ffmpeg:
ffmpeg -i "$filename" "${filename}.srt"
# If the file has subtitles, extract them
if [ "$has_subtitles" ]; then
    ccextractor "$filename" -o "${filename}.srt"
    # Use dos2unix to convert the file to the correct format
    dos2unix "${filename}.srt"
    # Remove the timestamps from the .srt file
    awk -v RS= '{
        for (i=5;i<=NF;i++){
            printf "%s%s", (sep ? " " : ""), $i
            sep=1
        }
    }
    END{ print "" }' "${filename}.srt" | pbcopy
    rm "${filename}.srt"
fi
I would greatly appreciate any assistance in resolving these issues. Specifically, I need help modifying the code to remove line breaks and HTML tags from the extracted subtitles, so the output is a single paragraph.
Thanks a lot!
I tried a few awk and sed commands to fix it but nothing works!
EDIT: I made some progress. The HTML tags are gone, but the line breaks are still randomly messed up.
I changed the last line of the awk command to: END{ print "" }' "${filename}.srt" | sed 's/<[^>]*>//g' | tr '\n' ' ' | pbcopy
To convert a multi-line file to a single line, tell GNU AWK that you want a non-default
ORS (output record separator), either an empty string or a space. Consider the following simple example: let file.txt content be
then
gives output
whilst
gives output
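As a minimal sketch of the two ORS settings compared here, assuming a hypothetical three-line input (the helper name join_lines and the sample words are illustrative, not taken from the original answer):

```shell
# join_lines SEP FILE
# Hypothetical helper: print every line of FILE, using SEP as awk's
# output record separator (ORS) instead of the default newline.
join_lines() {
    awk -v ORS="$1" '{ print }' "$2"
}

# Example usage, assuming file.txt holds the lines Able, Baker, Charlie:
#   join_lines ''  file.txt   # ORS is the empty string
#   join_lines ' ' file.txt   # ORS is a single space
```

With that assumed input, the first call prints AbleBakerCharlie and the second prints Able Baker Charlie followed by one trailing space, because awk writes ORS after every record, including the last one.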
Keep in mind that there is no trailing newline; if you wish to have one trailing newline, use one of the following
(tested in GNU Awk 5.1.0)
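One possible way to add the trailing newline, and to apply the same idea to the .srt file from the question, is sketched below. Both function names are hypothetical, and the second function assumes a conventional SRT layout (a numeric cue counter, a timestamp line containing -->, then the text lines); it is not a full SRT parser.

```shell
# join_lines_nl FILE
# Hypothetical helper: join all lines with an empty ORS, then emit
# exactly one newline at the end.
join_lines_nl() {
    awk 'BEGIN { ORS = "" } { print } END { print "\n" }' "$1"
}

# srt_to_paragraph FILE
# Hypothetical helper: reduce an SRT file to a single paragraph.
# Caveat: a subtitle line consisting only of digits is treated as a
# cue counter and dropped.
srt_to_paragraph() {
    tr -d '\r' < "$1" |              # drop carriage returns (CRLF files)
    awk '
        { gsub(/<[^>]*>/, "") }      # strip HTML-like tags such as <font ...>
        /^[0-9]+$/ { next }          # skip cue counters
        /-->/      { next }          # skip timestamp lines
        /^[ \t]*$/ { next }          # skip blank or now-empty lines
        { printf "%s%s", sep, $0; sep = " " }
        END { print "" }             # one trailing newline
    '
}
```

In the script from the question, the result could then be sent to the clipboard with something like srt_to_paragraph "${filename}.srt" | pbcopy, in place of the awk, sed and tr pipeline.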