[YouTube] Add support for extracting auto-translated captions #997

Open
wants to merge 4 commits into base: dev

Conversation

TobiGr
Member

@TobiGr TobiGr commented Dec 5, 2022

  • I carefully read the contribution guidelines and agree to them.
  • I have tested the API against NewPipe.
  • I agree to create a pull request for NewPipe as soon as possible to make it compatible with the changed API.

Extract auto-translated captions for YouTube videos.

API changes 🟢

SubtitlesStream

This adds isAutoTranslated() next to isAutoGenerated() to distinguish auto-generated subtitles, which are produced by speech-to-text, from auto-translated captions, which are produced by Google Translate.
Additionally, getBaseLocale(), getDisplayBaseLanguageName() and getBaseLanguageTag() were added to expose the language from which a caption track was auto-translated.
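
For readers who want to see how a frontend might consume these accessors, here is a minimal sketch. It does not use the extractor itself: `CaptionInfo` is a hypothetical stand-in that only borrows the method names listed above, and the constructor, the `null` base locale for non-translated tracks, the English display names and the label format are all assumptions made for illustration.

```java
import java.util.Locale;

final class CaptionLabelDemo {

    // Hypothetical stand-in for SubtitlesStream; only the accessor names are
    // taken from the PR description, everything else is assumed.
    static final class CaptionInfo {
        private final Locale locale;
        private final Locale baseLocale; // assumed to be null when not auto-translated
        private final boolean autoGenerated;

        CaptionInfo(Locale locale, Locale baseLocale, boolean autoGenerated) {
            this.locale = locale;
            this.baseLocale = baseLocale;
            this.autoGenerated = autoGenerated;
        }

        Locale getLocale() { return locale; }
        boolean isAutoGenerated() { return autoGenerated; }
        boolean isAutoTranslated() { return baseLocale != null; }
        Locale getBaseLocale() { return baseLocale; }

        String getBaseLanguageTag() {
            return baseLocale == null ? null : baseLocale.toLanguageTag();
        }

        // Display names are rendered in English here to keep the sketch deterministic.
        String getDisplayBaseLanguageName() {
            return baseLocale == null ? null : baseLocale.getDisplayName(Locale.ENGLISH);
        }
    }

    // Builds a menu label that tells the three kinds of tracks apart.
    static String label(CaptionInfo c) {
        String name = c.getLocale().getDisplayName(Locale.ENGLISH);
        if (c.isAutoTranslated()) {
            return name + " (auto-translated from " + c.getDisplayBaseLanguageName() + ")";
        }
        return c.isAutoGenerated() ? name + " (auto-generated)" : name;
    }

    public static void main(String[] args) {
        CaptionInfo uploaded = new CaptionInfo(Locale.GERMAN, null, false);
        CaptionInfo translated = new CaptionInfo(Locale.FRENCH, Locale.GERMAN, false);
        System.out.println(label(uploaded));
        System.out.println(label(translated));
    }
}
```

Checking isAutoTranslated() before isAutoGenerated() is a choice of this sketch: a track translated from an auto-generated base would then be labelled as a translation.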

Issues closed by this PR

Closes #977
Based on and addresses TeamNewPipe/NewPipe#8023

@TobiGr TobiGr added the enhancement and youtube (service, https://www.youtube.com/) labels Dec 5, 2022
.build());
if (i == 0 && caption.getBoolean("isTranslatable")
Member

I would not base the extraction on the index, but rather on whether the subtitles are auto-generated:

Suggested change
- if (i == 0 && caption.getBoolean("isTranslatable")
+ if (isAutoGenerated && caption.getBoolean("isTranslatable")

Also, this PR doesn't add support for translating uploaded subtitles. For instance, see https://www.youtube.com/watch?v=_cMxraX_5RE: you can translate from German to French and from English to French, and the two translations differ.

We may need another property in SubtitlesStream for this.

Member Author

I don't understand why we should use isAutoGenerated here. For better quality, it should be !isAutoGenerated: manually added captions should be exact.
I was also wondering whether we should provide the auto-translated captions by default. Extracting the data for ~100 SubtitlesStream objects and generating them takes some time, so I definitely wouldn't recommend doing this for all available languages by default. On the other hand, we could provide a method which does this when needed.
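
A rough sketch of what the "method which does this when needed" option could look like, assuming nothing about the extractor's real classes: `LazyTranslations`, its constructor and the `base->target` strings are hypothetical and stand in for whatever would actually build the translated streams. The point is only that the expensive list is built on the first request and reused afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds the auto-translated entries only on first access.
// Not thread-safe; a real implementation would need to decide how to handle that.
final class LazyTranslations {
    private final List<String> baseTags;   // language tags of the translatable base tracks
    private final List<String> targetTags; // language tags the service can translate into
    private List<String> cached;           // stays null until somebody asks
    private int buildCount;                // counts how often the list was actually built

    LazyTranslations(List<String> baseTags, List<String> targetTags) {
        this.baseTags = baseTags;
        this.targetTags = targetTags;
    }

    List<String> getAutoTranslated() {
        if (cached == null) {
            buildCount++;
            List<String> pairs = new ArrayList<>();
            for (String base : baseTags) {
                for (String target : targetTags) {
                    if (!target.equals(base)) { // translating a track into itself is pointless
                        pairs.add(base + "->" + target);
                    }
                }
            }
            cached = Collections.unmodifiableList(pairs);
        }
        return cached;
    }

    int getBuildCount() { return buildCount; }

    public static void main(String[] args) {
        LazyTranslations lazy = new LazyTranslations(
                Arrays.asList("de", "en"), Arrays.asList("de", "en", "fr"));
        System.out.println(lazy.getBuildCount());    // nothing built yet
        System.out.println(lazy.getAutoTranslated());
        System.out.println(lazy.getAutoTranslated()); // served from the cache
        System.out.println(lazy.getBuildCount());
    }
}
```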

Member Author

@TobiGr TobiGr May 10, 2024

I decided to extract all available subtitles, but made sure to speed up the process. It's up to the frontends to filter the subtitles.


Facni commented May 8, 2024

What happened to this?

TobiGr added 4 commits May 9, 2024 20:54
Faster and ordered: captions provided by the user are at the beginning of the list, auto-translated captions are at the end
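
The ordering described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: `Track` is a hypothetical stand-in for a caption track, and placing auto-generated (but not translated) tracks in the middle is a guess, since the message only says where user-provided and auto-translated captions go.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class CaptionOrderingDemo {

    // Hypothetical stand-in for a caption track.
    static final class Track {
        final String languageTag;
        final boolean autoGenerated;
        final boolean autoTranslated;

        Track(String languageTag, boolean autoGenerated, boolean autoTranslated) {
            this.languageTag = languageTag;
            this.autoGenerated = autoGenerated;
            this.autoTranslated = autoTranslated;
        }

        // 0 = provided by the user, 1 = auto-generated (assumed position), 2 = auto-translated
        int rank() {
            if (autoTranslated) {
                return 2;
            }
            return autoGenerated ? 1 : 0;
        }

        @Override
        public String toString() { return languageTag + "#" + rank(); }
    }

    // Returns a sorted copy; List.sort is stable, so tracks of the same kind
    // keep the relative order in which they were extracted.
    static List<Track> ordered(List<Track> tracks) {
        List<Track> copy = new ArrayList<>(tracks);
        copy.sort(Comparator.comparingInt(Track::rank));
        return copy;
    }

    public static void main(String[] args) {
        List<Track> tracks = Arrays.asList(
                new Track("fr", false, true),
                new Track("en", true, false),
                new Track("de", false, false),
                new Track("en", false, false));
        System.out.println(ordered(tracks));
    }
}
```

A frontend that only wants manually provided captions could filter on the same flags instead of sorting.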
@TobiGr TobiGr force-pushed the feature/youtube-auto-translated-captions branch from efce384 to 9730de2 Compare May 10, 2024 18:16
@TobiGr TobiGr requested a review from AudricV May 10, 2024 18:23
Development

Successfully merging this pull request may close these issues.

Extract automatically translated subtitles
3 participants