TL;DR

I wanted transcriptions of the debates and tried to produce them myself. It has been a fun ride, and loads of work. 😅 I've been improving the process along the way, so newer debates are likely better transcribed than the initial ones. I haven't had the time to re-review the older ones, with all this content landing each day.

Disclaimer: I try my best to review each debate's SRT (which never takes me less than 45 min, even on easy ones). It's sometimes very challenging to understand, let alone correct, passages where multiple people are talking at once. Whisper does a good job overall. There were a couple of passages I had to write completely from scratch.

Calendar

5/2

21h SIC: PS - IL
22h RTP3: PAN - Chega

6/2

18h RTP3: PCP - PAN
20h TVI: AD - BE
22h SICN: IL - Chega

7/2

18h CNN: IL - Livre

8/2

18h SICN: BE - Livre

9/2

18h SICN: IL - PAN
21h RTP: PS - Livre
22h CNN: Chega - PCP

10/2

20.30 RTP1 PSD - PCP
21h TVI PS - PAN

11/2

21h SIC PSD - PAN
22h SICN BE - PCP

12/2

21h RTP PSD - Chega

13/2

18h CNN PCP - Livre
22h RTP3 Chega - BE

14/2

18h RTP3 Livre - PAN
21h TVI PS - Chega
22h RTP3 IL - PCP

15/2

18h CNN IL - BE

16/2

20.30 RTP PS - BE
22h SICN Chega - Livre

17/2

20.30 TVI PS - PCP
21h SIC PSD - Livre

18/2

20.45 SIC PSD - IL
21.50 CNN BE - PAN

19/2

21h SIC PS - PSD

20/2

21h RTP3 parties without parliamentary seats

23/2

21h RTP1 parties with parliamentary seats

Process

simpler audio-only grab from podcast (currently used)

PODCAST PROCESS

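if the episode is served as mp3 (strip metadata, then skip the first 35 seconds without re-encoding):
wget "url" -O 1.mp3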
ffmpeg -i 1.mp3 -map 0:a -c:a copy -map_metadata -1 2.mp3
ffmpeg -i 2.mp3 -ss 35 -vcodec copy -acodec copy 3.mp3

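if the episode is served as mp4 (extract the audio stream, then skip the first 20 seconds and re-encode to 128 kbps mp3):
wget "url" -O 1.mp4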
ffmpeg -i 1.mp4 -map 0:a -c:a copy -map_metadata -1 2.aac
ffmpeg -i 2.aac -ss 20 -codec:a libmp3lame -b:a 128k 3.mp3

video stream grab w/ VLC + FFMPEG to extract aac stream and convert to mp3 (initially used)

save m3u8 stream to file on VLC:
vlc open network
first m3u8...
stream output
settings
file ... asd.ts
MPEG TS
video to audio without transcoding: ffmpeg -i vlc-output.ts -vn -acodec copy audio.aac
aac to mp3: ffmpeg -i audio.aac -acodec mp3 audio.mp3

transcribe mp3 to srt

pinokio + whisper webui
large v3
portuguese
toggle off suffix checkbox
supply mp3 file and wait...
get output from app's output folder

WIP audio analysis

#set INFILE 2024-02-05_pan-chega.mp3
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 $INFILE
ffmpeg -i $INFILE -lavfi showspectrumpic=s=3622x512 spectrum.png
ffmpeg -i $INFILE -filter_complex "showwavespic=s=14488x512" -frames:v 1 waveform.png
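The widths in the two commands above look like they are derived from the duration printed by ffprobe: 14488 is exactly 4 × 3622, so 3622 is plausibly the file's length in seconds. Assuming that is the intent, a minimal sketch of a helper that builds the `s=WxH` value could look like this. The name `picSize`, its parameters, and the sample duration string are hypothetical and not part of the repo.

```javascript
// Hypothetical helper (not in the repo): builds the "WxH" size string used by
// showspectrumpic / showwavespic from the duration text printed by ffprobe.
// Assumption: the picture should be a fixed number of pixels wide per second of audio.
function picSize(durationText, pxPerSecond = 1, height = 512) {
  const seconds = Number.parseFloat(durationText);
  if (!Number.isFinite(seconds) || seconds <= 0) {
    throw new Error(`invalid duration: ${durationText}`);
  }
  // Round to whole seconds first so the width stays an integer.
  const width = Math.round(seconds) * pxPerSecond;
  return `${width}x${height}`;
}

// Illustrative duration value, not measured from a real file.
console.log(picSize("3622.034286"));    // 3622x512
console.log(picSize("3622.034286", 4)); // 14488x512
```

Keeping the pixels-per-second ratio explicit would make pictures of different debates comparable at the same horizontal scale.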

navigation key bindings

space - toggle playback
up/down - move to previous/next subtitle
left/right - rewind/fast forward by 15 seconds
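The bindings above can be sketched as a pure function over a small player state. This is only an illustration under stated assumptions: the `navigate` function, the shape of `state`, and the `subs` array of `{ start, end }` objects are hypothetical and may not match how player.mjs is actually written. The key strings are the standard `KeyboardEvent.key` values.

```javascript
// Sketch of the navigation bindings as a pure function (hypothetical, not the repo's code).
// Assumptions: `subs` is an array of { start, end } in seconds, sorted by start;
// `state` is { playing, time, index }; `duration` is the audio length in seconds.
const SEEK_SECONDS = 15;

function navigate(state, key, subs, duration) {
  const clamp = (v, lo, hi) => Math.min(hi, Math.max(lo, v));
  switch (key) {
    case " ": // space toggles playback
      return { ...state, playing: !state.playing };
    case "ArrowUp":
    case "ArrowDown": {
      // Move to the previous/next subtitle and jump to its start time.
      const step = key === "ArrowUp" ? -1 : 1;
      const index = clamp(state.index + step, 0, subs.length - 1);
      return { ...state, index, time: subs[index].start };
    }
    case "ArrowLeft":
    case "ArrowRight": {
      // Seek by 15 seconds; this sketch leaves `index` untouched.
      const delta = key === "ArrowLeft" ? -SEEK_SECONDS : SEEK_SECONDS;
      return { ...state, time: clamp(state.time + delta, 0, duration) };
    }
    default:
      return state;
  }
}
```

In a browser this could be wired with something like `document.addEventListener("keydown", (ev) => { state = navigate(state, ev.key, subs, audio.duration); })`, where `state`, `subs` and `audio` are again placeholders.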

onboarding new debates and editing text and speaker tags

For each new debate (an mp3 file), we expect 2 additional files to be created:

a subtitles file (srt), which initially comes from running whisper over the mp3
a json file listing the speakers and which subtitle indices belong to each speaker

The index.json also needs to be updated to list the name of the new debate (used in the search features of the main page).
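To make the relationship between the two files concrete, here is a minimal sketch that parses SRT text into entries and looks up the speaker for a subtitle index. The SRT layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, text lines, blank line) is the standard one. The shape of the speakers object, a map from speaker name to a list of subtitle indices, is an assumption for illustration only; the actual schema used by the repo is not shown here. `parseSrt`, `srtTimeToSeconds` and `speakerOf` are hypothetical names.

```javascript
// Converts an SRT timestamp ("HH:MM:SS,mmm") to seconds.
function srtTimeToSeconds(t) {
  const m = /^(\d{2,}):(\d{2}):(\d{2})[,.](\d{3})$/.exec(t.trim());
  if (!m) throw new Error(`bad SRT timestamp: ${t}`);
  const [, h, min, s, ms] = m;
  return Number(h) * 3600 + Number(min) * 60 + Number(s) + Number(ms) / 1000;
}

// Parses SRT text into [{ index, start, end, text }], with times in seconds.
function parseSrt(text) {
  return text
    .replace(/\r\n/g, "\n")
    .trim()
    .split(/\n{2,}/) // entries are separated by blank lines
    .map((block) => {
      const [indexLine, timeLine, ...textLines] = block.split("\n");
      const [start, end] = timeLine.split(" --> ").map(srtTimeToSeconds);
      return { index: Number(indexLine), start, end, text: textLines.join("\n") };
    });
}

// ASSUMED speakers shape: { "<speaker name>": [subtitle indices...] }.
// Returns the speaker name for a subtitle index, or undefined if unassigned.
function speakerOf(speakers, subtitleIndex) {
  for (const [name, indices] of Object.entries(speakers)) {
    if (indices.includes(subtitleIndex)) return name;
  }
  return undefined;
}
```

With this shape, an entry whose index is missing from every list would simply have no speaker tag yet.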

When the site is running locally for editing purposes, node server.mjs should also be running. It modifies the debate files on disk according to the operations performed in the front end.

There's a set of key bindings for manipulating SRT and JSON files in tandem:

joins the current subtitle with either its previous or next one
splits the current subtitle by a ratio into 2 new ones
edits the current subtitle's text content
time tweaks the start and end placements for the current subtitle and its neighbors
x deletes the current subtitle
f fills the space between the previous subtitle and the current one with a new subtitle
1 assigns the moderator role to the current subtitle (typically gray)
2 assigns the 1st debater role to the current subtitle (typically cyan)
3 assigns the 2nd debater role to the current subtitle (typically magenta)
§ (the key before 1 on Mac keyboards) clears any speaker role from the current subtitle
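As an illustration of the join and split operations listed above, here is a minimal sketch over an in-memory array of subtitles. It assumes each subtitle is `{ start, end, text }` with times in seconds, and that a split cuts both the time span and the words at the same ratio; the real editor may split differently. `joinWithNext` and `splitByRatio` are hypothetical names, and the sketch does not cover shifting the speaker indices in the JSON file, which a real implementation would also need to do.

```javascript
// Joins subs[i] with the following subtitle and returns a new array.
// Out-of-range positions leave the array unchanged.
function joinWithNext(subs, i) {
  if (i < 0 || i >= subs.length - 1) return subs;
  const a = subs[i];
  const b = subs[i + 1];
  const merged = { start: a.start, end: b.end, text: `${a.text} ${b.text}` };
  return [...subs.slice(0, i), merged, ...subs.slice(i + 2)];
}

// Splits subs[i] into two subtitles; `ratio` (between 0 and 1, exclusive)
// places the cut in time, and the words are divided at roughly the same ratio.
function splitByRatio(subs, i, ratio) {
  const s = subs[i];
  if (!s || !(ratio > 0 && ratio < 1)) return subs;
  const words = s.text.split(/\s+/).filter(Boolean);
  if (words.length < 2) return subs; // nothing sensible to split
  const cutTime = s.start + (s.end - s.start) * ratio;
  // Keep at least one word on each side of the cut.
  const cutWord = Math.min(words.length - 1, Math.max(1, Math.round(words.length * ratio)));
  const first = { start: s.start, end: cutTime, text: words.slice(0, cutWord).join(" ") };
  const second = { start: cutTime, end: s.end, text: words.slice(cutWord).join(" ") };
  return [...subs.slice(0, i), first, second, ...subs.slice(i + 1)];
}
```

Both functions return new arrays instead of mutating their input, which makes it easier to keep the SRT and JSON updates in step or to undo an operation.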

JosePedroDias/leg24
