[MTVA Archivum] Add new extractor #32589

Open
wants to merge 1 commit into
base: master

aaron-tan
Contributor

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use the Preview tab to see how your pull request will actually look

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

Add new extractor for MTVA Archivum site.

Closes #21430

@dirkf dirkf left a comment

Thanks for your work on this old issue which I expect is really important for HU users!

I've made a few suggestions. Please have a look and check the CI test results too.

'id': 'M3-87720998249999359',
'ext': 'mp4',
'title': 'Kék egér',
'description': 'Kék egér nem sokáig örülhet a napsütésnek, mert egy kölyökkutya azt hiszi, kutyáknak való játék ez a kék valami. A Kék egér tiltakozása ellenére csak akkor engedi el az egeret, amikor az elásott csontja helyét megtalálja a kutya. A menekülő egérke elbotlik egy fél perecben aminek nagyon megörül, de egy erőszakos galamb meghívatja magát a perecre. Némi ellenszolgáltatás, és egy jó tanács fejében az egészet felfalja.',

The test harness will suggest replacing this long test value with a digest ('md5:...digest...'), but a digest doesn't help when the site makes a minor change to the extracted value. I prefer something like:

Suggested change
- 'description': 'Kék egér nem sokáig örülhet a napsütésnek, mert egy kölyökkutya azt hiszi, kutyáknak való játék ez a kék valami. A Kék egér tiltakozása ellenére csak akkor engedi el az egeret, amikor az elásott csontja helyét megtalálja a kutya. A menekülő egérke elbotlik egy fél perecben aminek nagyon megörül, de egy erőszakos galamb meghívatja magát a perecre. Némi ellenszolgáltatás, és egy jó tanács fejében az egészet felfalja.',
+ 'description': r're:Kék egér nem sokáig örülhet a napsütésnek, mert egy kölyökkutya azt hiszi, kutyáknak való játék ez a kék valami\. .+\.$',

Then a future maintainer (especially, I guess, one with Hungarian language skills!) can compare a non-matching extracted value with the expected text to see whether the site has just tweaked the value or changed it completely, or whether a new extraction tactic would recover something like the expected value.
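
To make the effect of the `re:` prefix concrete, here is a minimal, self-contained sketch of how such an expectation can be checked. The helper name `matches_expected` and the shortened sample strings are assumptions for illustration, not the test harness's actual code; the sketch only assumes that the text after `re:` is treated as a pattern matched from the start of the extracted value.

```python
import re


def matches_expected(expected, got):
    """Compare a test expectation with an extracted value (illustrative only)."""
    if isinstance(expected, str) and expected.startswith('re:'):
        # Everything after the 're:' prefix is a pattern matched from the
        # start of the extracted value.
        return re.match(expected[3:], got) is not None
    # Without the prefix, the values must be identical.
    return expected == got


# Shortened, hypothetical expectation in the style suggested above.
EXPECTED = r're:Kék egér nem sokáig örülhet a napsütésnek\. .+\.$'

# A minor edit to the tail of the description still matches ...
tweaked = 'Kék egér nem sokáig örülhet a napsütésnek. Az egészet felfalja.'
# ... while a completely different value does not.
replaced = 'MTVA Archívum'
```

With these values, `matches_expected(EXPECTED, tweaked)` succeeds and `matches_expected(EXPECTED, replaced)` fails, whereas a digest would fail in both cases without showing what the expected text was.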

'id': 'M3-59898941410999595',
'ext': 'mp4',
'title': 'Magyar retro',
'description': 'MTVA Archívum',

If the site has this value for description, maybe omit it, since it's just repeating the site name?

webpage = self._download_webpage(url, video_id)
json = self._download_json('https://archivum.mtva.hu/m3/stream?no_lb=1&target=' + video_id, video_id)
video_url = json['url']
title = self._og_search_title(webpage) or self._html_search_regex(

You have to give the first call a default or it will raise and never try the second one.

Suggested change
- title = self._og_search_title(webpage) or self._html_search_regex(
+ title = self._og_search_title(webpage, default=None) or self._html_search_regex(
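
The point about the default can be illustrated outside the extractor with a small stand-in. Everything in this sketch is hypothetical: `search_regex`, `ExtractorError`, `NO_DEFAULT`, the sample page, and both patterns are simplified substitutes for the real helpers, which are only assumed to raise when no match is found and no default is given.

```python
import re

NO_DEFAULT = object()  # sentinel meaning "no default was supplied"


class ExtractorError(Exception):
    """Stand-in for the error raised when a mandatory search fails."""


def search_regex(pattern, text, name, default=NO_DEFAULT):
    """Return group 1 of the first match; raise unless a default is given."""
    m = re.search(pattern, text)
    if m:
        return m.group(1)
    if default is not NO_DEFAULT:
        return default
    raise ExtractorError('Unable to extract %s' % name)


# Hypothetical page with a <title> element but no og:title meta tag.
PAGE = '<html><head><title>Kék egér</title></head></html>'
OG_TITLE_RE = r'<meta property="og:title" content="([^"]+)"'
TITLE_RE = r'<title>([^<]+)</title>'


def title_without_default(page):
    # The first call raises, so the right-hand side of `or` is never evaluated.
    return (search_regex(OG_TITLE_RE, page, 'og title')
            or search_regex(TITLE_RE, page, 'title'))


def title_with_default(page):
    # default=None makes the first call return None, so `or` tries the fallback.
    return (search_regex(OG_TITLE_RE, page, 'og title', default=None)
            or search_regex(TITLE_RE, page, 'title'))
```

Only the last call in the chain is left without a default, so a page that has neither source of a title still fails loudly.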

Comment on lines +50 to +51
description = self._og_search_description(webpage) or self._html_search_regex(
'<p class=\"active-full-description\">\n.+</p>', webpage, 'description')

description is optional and shouldn't raise. Also, the fallback call needs a group in the RE (how many times have I left it out!) and the RE can be improved, ...

Suggested change
- description = self._og_search_description(webpage) or self._html_search_regex(
-     '<p class=\"active-full-description\">\n.+</p>', webpage, 'description')
+ description = self._og_search_description(webpage) or self._html_search_regex(
+     r'''<p\s[^>]*\bclass=['"]active-full-description\b[^>]*>(.+)</p>''', webpage, 'description',
+     default=None)

... or perhaps use a completely different function.

Suggested change
- description = self._og_search_description(webpage) or self._html_search_regex(
-     '<p class=\"active-full-description\">\n.+</p>', webpage, 'description')
+ description = self._og_search_description(webpage) or get_element_by_class(
+     'active-full-description', webpage)
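
As a standalone illustration of why the fallback pattern needs a capture group, the following sketch applies a variant of the suggested expression with plain `re`. The sample markup is hypothetical and the real page may differ; the `\s*(.+?)\s*` group and the `re.DOTALL` flag are assumptions added here to tolerate the line break that the original pattern expected after the opening tag.

```python
import re

# Hypothetical page fragment; the markup on the real site may differ.
SAMPLE_HTML = '''<div>
<p class="active-full-description">
Kék egér nem sokáig örülhet a napsütésnek.</p>
</div>'''

# Raw string, so \s and \b reach the regex engine unchanged.
DESCRIPTION_RE = r'''<p\s[^>]*\bclass=['"]active-full-description\b[^>]*>\s*(.+?)\s*</p>'''


def find_description(html):
    """Return the captured description text, or None when it is absent."""
    m = re.search(DESCRIPTION_RE, html, flags=re.DOTALL)
    # group(1) is the parenthesised part; without a group there is nothing
    # to return except the whole match, tags included.
    return m.group(1) if m else None
```

Returning `None` instead of raising mirrors the reviewer's point that the description is optional.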

Comment on lines +54 to +56
formats = self._extract_m3u8_formats(
video_url, video_id, 'mp4')
self._sort_formats(formats)

Consider moving this up to just after the title extraction. Then the optional extraction won't be done until we know that we have valid media links.

Also, you may want to supply the parameters ext='mp4', m3u8_id='hls', which for whatever reason aren't the defaults. And try extracting with entry_protocol='m3u8_native' (now the yt-dlp default): if that works ([hlsnative] entries in the log), the site's M3U8 manifests can be downloaded without ffmpeg.
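
The ordering advice can be sketched with plain callables in place of the extractor's methods. The names `extract`, `get_title`, `get_formats`, `get_description`, and `ExtractorError` are hypothetical stand-ins for illustration, not part of the extractor API.

```python
class ExtractorError(Exception):
    """Stand-in for a fatal extraction failure."""


def extract(get_title, get_formats, get_description):
    """Run the mandatory steps first and the optional step last."""
    title = get_title()      # mandatory: may raise ExtractorError
    formats = get_formats()  # mandatory: may raise ExtractorError
    # Reached only once a title and playable formats are known.
    description = get_description()
    return {'title': title, 'formats': formats, 'description': description}


# Usage: record the call order when format extraction fails.
calls = []


def ok_title():
    calls.append('title')
    return 'Kék egér'


def failing_formats():
    calls.append('formats')
    raise ExtractorError('no formats found')


def optional_description():
    calls.append('description')
    return None


try:
    extract(ok_title, failing_formats, optional_description)
    failed = False
except ExtractorError:
    failed = True
```

After this run `calls` holds only the title and formats steps, so no work was spent on optional metadata for a video that cannot be played.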

Successfully merging this pull request may close these issues.

Support for online channel M3 on MTVA Archívum (HU)