Bugfix/ard #32688

f-froehlich · 2024-01-09T20:43:47Z

Please follow the guide below

You will be asked some questions; please read them carefully and answer honestly
Put an x into all the boxes [ ] relevant to your pull request (like this: [x])
Use the Preview tab to see how your pull request will actually look

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

Fixes the ARD extractor, which broke because the API changed.

dirkf

Thanks for your work!

Is there a related issue?

I've made some suggestions that should also help to pass the CI tests.

You might like to review the yt-dlp extractor as updated in yt-dlp/yt-dlp#9037 and include any additional functionality from there, or generally re-align the code.

dirkf · 2024-01-22T12:06:09Z

youtube_dl/extractor/ard.py

+
+    _VALID_URL = r'https://(?:(?:beta|www)\.)?ardmediathek\.de/(((?:[^/]+/)?(?:player|live|video|serie|sendung)/(?:[^/]+/)*(?P<id>Y3JpZDovL[a-zA-Z0-9]+))|(((?P<sender>[a-zA-Z0-9\-]+)([/]))?(?P<name>[a-zA-Z0-9\-]+)))'
+


dirkf · 2024-01-22T12:08:24Z

youtube_dl/extractor/common.py

-    def _match_id(cls, url):
+    def _match_id(cls, url, group_name = 'id'):
        m = cls.__match_valid_url(url)
        assert m
-        return compat_str(m.group('id'))
+        return compat_str(m.group(group_name))


Please revert these changes and use _match_valid_url() in the extractor instead, as suggested in the ard.py comments.

dirkf · 2024-01-22T12:11:37Z

youtube_dl/extractor/ard.py

        video_id = self._match_id(url)
+        video_name = self._match_id(url, group_name='name')
+        sender = self._match_id(url, group_name='sender')


Suggested change

-        video_id = self._match_id(url)
-        video_name = self._match_id(url, group_name='name')
-        sender = self._match_id(url, group_name='sender')
+        video_id, video_name, sender = self._match_valid_url(url).group('id', 'name', 'sender')
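To illustrate why a single match is enough, here is a minimal sketch using only the standard `re` module. It assumes the one-line `_VALID_URL` from the PR and shows that `group()` with several names returns a tuple, with `None` for any group that did not take part in the match. The helper name `split_url` and the sample URLs are hypothetical.

```python
import re

# Pattern copied from the PR's _VALID_URL (one-line form).
_VALID_URL = r'https://(?:(?:beta|www)\.)?ardmediathek\.de/(((?:[^/]+/)?(?:player|live|video|serie|sendung)/(?:[^/]+/)*(?P<id>Y3JpZDovL[a-zA-Z0-9]+))|(((?P<sender>[a-zA-Z0-9\-]+)([/]))?(?P<name>[a-zA-Z0-9\-]+)))'


def split_url(url):
    """Hypothetical helper: return (id, name, sender) from one regex match."""
    m = re.match(_VALID_URL, url)
    if m is None:
        raise ValueError('unsupported URL: ' + url)
    # Passing several group names returns a tuple; unmatched groups are None.
    return m.group('id', 'name', 'sender')


# Sample URLs are made up for illustration.
print(split_url('https://www.ardmediathek.de/video/Y3JpZDovL2Rhc2Vyc3RlLmRl'))
print(split_url('https://www.ardmediathek.de/swr/some-show'))
```

Inside the extractor the same tuple would come from `self._match_valid_url(url)`, so `common.py` would not need the extra `group_name` parameter.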

dirkf · 2024-01-22T12:23:51Z

youtube_dl/extractor/ard.py

+        if '/serie/' in url or '/sendung/' in url:
+            return self._real_extract_serie(video_id)
+        elif 'none' != video_name.lower():
+            return self._real_extract_named_serie(video_name, sender if 'none' != sender.lower() else "ard")
+        else:
+            return self._real_extract_video(video_id)


The URL pattern matches either video_id or video_name, so the dispatch can simply test which group matched. An unmatched group is None, not the string 'none'.

Suggested change

-        if '/serie/' in url or '/sendung/' in url:
-            return self._real_extract_serie(video_id)
-        elif 'none' != video_name.lower():
-            return self._real_extract_named_serie(video_name, sender if 'none' != sender.lower() else "ard")
-        else:
-            return self._real_extract_video(video_id)
+        if video_id is not None:
+            return self._real_extract_serie(video_id)
+        return self._real_extract_named_serie(video_name, sender if sender is not None else 'ard')

If video_name.lower() == 'none' is an actual possibility, add a test for it and raise ExtractorError(..., expected=True) for that case. Alternatively, add a negative lookahead such as `(?!none...)` to the group in the pattern so that it doesn't match.
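The 'none' comparisons in the PR presumably arise because the modified `_match_id()` wraps the group in `compat_str()`, which turns an unmatched `None` into the text 'None'. The sketch below reproduces that effect with plain `str()` and then shows the `None`-based dispatch. The pattern is a simplified, hypothetical stand-in for `_VALID_URL`, and `choose_handler` is an illustrative name, not part of youtube-dl.

```python
import re

# Simplified, hypothetical stand-in for _VALID_URL: a numeric id, or [sender/]name.
PATTERN = re.compile(
    r'/(?:(?P<id>\d+)|(?:(?P<sender>[a-z0-9-]+)/)?(?P<name>[a-z0-9-]+))$')


def choose_handler(path):
    """Return a tuple naming the handler that would run, plus its arguments."""
    video_id, video_name, sender = PATTERN.search(path).group('id', 'name', 'sender')
    if video_id is not None:
        return ('by_id', video_id)
    # Fall back to the default sender when the optional group did not match.
    return ('by_name', video_name, sender if sender is not None else 'ard')


# Stringifying an unmatched group hides the None behind the text 'None'.
unmatched = PATTERN.search('/some-show').group('id')
print(unmatched is None, str(unmatched).lower() == 'none')
print(choose_handler('/12345'))
print(choose_handler('/swr/some-show'))
print(choose_handler('/some-show'))
```

Testing against `None` directly also avoids misclassifying a page whose name really is "none".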

dirkf · 2024-01-22T12:24:56Z

youtube_dl/extractor/ard.py

-            }).encode(), headers={
-                'Content-Type': 'application/json'
-            })['data']['playerPage']
+                f'https://api.ardmediathek.de/page-gateway/pages/ard/item/{video_id}',


For Py2 compatibility, since f-strings are not available there (.format(video_id) would also work, but concatenation is the style used elsewhere):

Suggested change

-                f'https://api.ardmediathek.de/page-gateway/pages/ard/item/{video_id}',
+                'https://api.ardmediathek.de/page-gateway/pages/ard/item/' + video_id,
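As a small illustration of the two Python 2 compatible options mentioned above, the sketch below builds the same URL by concatenation and by `str.format()`. The function names and the sample id are hypothetical.

```python
ITEM_API = 'https://api.ardmediathek.de/page-gateway/pages/ard/item/'


def item_url_concat(video_id):
    # Plain concatenation works on Python 2 and 3.
    return ITEM_API + video_id


def item_url_format(video_id):
    # Explicit positional index ({0}) also works on old Python 2 releases.
    return 'https://api.ardmediathek.de/page-gateway/pages/ard/item/{0}'.format(video_id)


sample_id = 'Y3JpZDovL2V4YW1wbGU'  # made-up id for illustration
print(item_url_concat(sample_id) == item_url_format(sample_id))
```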

dirkf · 2024-01-22T14:26:50Z

youtube_dl/extractor/ard.py

+        page_number = 0
+        page_size = 100
+
+        while True:


Make the loop condition explicit:

Suggested change

-        while True:
+        total = traverse_obj(widgets, ('pagination', 'totalElements', T(int))) or 0
+        while page_number * page_size <= total:

dirkf · 2024-01-22T14:32:28Z

youtube_dl/extractor/ard.py

+            total = widgets['pagination']['totalElements']
+            if (page_number + 1) * page_size > total:
+                break


No longer needed:

Suggested change

-            total = widgets['pagination']['totalElements']
-            if (page_number + 1) * page_size > total:
-                break

dirkf · 2024-01-22T14:36:12Z

youtube_dl/extractor/ard.py

+        widgets = self._download_json(
+                f'https://api.ardmediathek.de/page-gateway/pages/{sender}/editorial/{video_id}',
+                video_id,
+                query={'pageSize': str(10), 'pageNumber': 0}
+        )['widgets']
+
+        for widget in widgets:
+            widget_id = widget['id']


Suggested change

-        widgets = self._download_json(
-                f'https://api.ardmediathek.de/page-gateway/pages/{sender}/editorial/{video_id}',
-                video_id,
-                query={'pageSize': str(10), 'pageNumber': 0}
-        )['widgets']
-
-        for widget in widgets:
-            widget_id = widget['id']
+        widgets = self._download_json(
+            'https://api.ardmediathek.de/page-gateway/pages/{0}/editorial/{1}'.format(sender, video_id),
+            video_id, query={'pageSize': 10, 'pageNumber': 0}
+        )
+
+        for widget_id in traverse_obj(widgets, ('widgets', Ellipsis, 'id')):

dirkf · 2024-01-22T15:01:20Z

youtube_dl/extractor/ard.py

+                for teaser in page_data['teasers']:
+                    if 'EPISODE' == teaser.get('coreAssetType', None) and teaser['type'] not in ['poster'] and ':' not in teaser['id']:
+
+                        item = self._real_extract_video(teaser['id'])
+                        item['webpage_url'] = f"https://www.ardmediathek.de/video/{teaser['id']}"
+                        entries.append(item)
+
+                total = page_data['pagination']['totalElements']
+                if (page_number + 1) * page_size > total:
+                    break


And as before:

Suggested change

-                for teaser in page_data['teasers']:
-                    if 'EPISODE' == teaser.get('coreAssetType', None) and teaser['type'] not in ['poster'] and ':' not in teaser['id']:
-
-                        item = self._real_extract_video(teaser['id'])
-                        item['webpage_url'] = f"https://www.ardmediathek.de/video/{teaser['id']}"
-                        entries.append(item)
-
-                total = page_data['pagination']['totalElements']
-                if (page_number + 1) * page_size > total:
-                    break
+                entries.extend(traverse_obj(page_data, (
+                    'teasers', lambda _, v: 'EPISODE' == v.get('coreAssetType') and v.get('type') != 'poster' and ':' not in v['id'],
+                    'id', T(self._mk_teaser))))
+                total = traverse_obj(page_data, (
+                    'pagination', 'totalElements', T(int))) or 0
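For readers unfamiliar with `traverse_obj`, the filter in this suggestion can be sketched in plain Python. This is not the youtube-dl helper itself, only an equivalent list comprehension over the same keys. `_mk_teaser` is not defined anywhere in the review, so the `mk_teaser` function below is a guess based on the loop body in the PR, and `extract_video` stands in for `self._real_extract_video`.

```python
def select_episode_ids(page_data):
    """Plain-Python equivalent of the teaser filter (illustrative only)."""
    teasers = page_data.get('teasers') or []
    return [
        t['id'] for t in teasers
        if 'id' in t
        and t.get('coreAssetType') == 'EPISODE'
        and t.get('type') != 'poster'
        and ':' not in t['id']
    ]


def mk_teaser(teaser_id, extract_video):
    """Hypothetical helper mirroring the PR's loop body for one teaser id."""
    item = extract_video(teaser_id)
    item['webpage_url'] = 'https://www.ardmediathek.de/video/' + teaser_id
    return item


# Made-up sample data for illustration.
sample_page = {'teasers': [
    {'id': 'abc', 'coreAssetType': 'EPISODE', 'type': 'ondemand'},
    {'id': 'def', 'coreAssetType': 'EPISODE', 'type': 'poster'},
    {'id': 'urn:ard:x', 'coreAssetType': 'EPISODE', 'type': 'ondemand'},
    {'id': 'ghi', 'coreAssetType': 'SEASON', 'type': 'ondemand'},
]}
print(select_episode_ids(sample_page))
```

Unlike the original loop, both this sketch and the `traverse_obj` version tolerate a missing `teasers` key instead of raising `KeyError`.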

dirkf · 2024-01-22T15:10:49Z

youtube_dl/extractor/ard.py

+            while True:
+                page_data = self._download_json(
+                        f'https://api.ardmediathek.de/page-gateway/widgets/{sender}/editorials/{widget_id}',
+                        video_id,
+                        query={'pageSize': page_size, 'pageNumber': page_number}


Make loop condition explicit:

Suggested change

-            while True:
-                page_data = self._download_json(
-                        f'https://api.ardmediathek.de/page-gateway/widgets/{sender}/editorials/{widget_id}',
-                        video_id,
-                        query={'pageSize': page_size, 'pageNumber': page_number}
+            total = 0
+            while page_number * page_size <= total:
+                page_data = self._download_json(
+                    'https://api.ardmediathek.de/page-gateway/widgets/{0}/editorials/{1}'.format(sender, widget_id),
+                    video_id, query={'pageSize': page_size, 'pageNumber': page_number}
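The explicit loop condition can be tried out in isolation with a fake API. The sketch below assumes a response shape with `teasers` and `pagination.totalElements`, as used in the PR. `collect_pages` and `make_fake_api` are hypothetical names, and the page number is assumed to be incremented at the end of each iteration.

```python
def collect_pages(fetch_page, page_size=100):
    """Fetch pages until page_number * page_size exceeds the reported total."""
    items = []
    page_number = 0
    total = 0  # unknown until the first response arrives
    while page_number * page_size <= total:
        page = fetch_page(page_number, page_size)
        items.extend(page.get('teasers') or [])
        total = (page.get('pagination') or {}).get('totalElements') or 0
        page_number += 1
    return items


def make_fake_api(all_items):
    """Build an in-memory stand-in for the paged endpoint; records the calls."""
    calls = []

    def fetch_page(page_number, page_size):
        calls.append(page_number)
        start = page_number * page_size
        return {
            'teasers': all_items[start:start + page_size],
            'pagination': {'totalElements': len(all_items)},
        }

    return fetch_page, calls


fetch, calls = make_fake_api(list(range(250)))
print(len(collect_pages(fetch)), calls)
```

Starting with `total = 0` guarantees that the first page is always requested. In this sketch, a total that is an exact multiple of the page size costs one extra, empty request.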

f-froehlich added 3 commits January 9, 2024 21:37

Fixing ARD download extractor

5c76372

tests

a1466c4

tests

1204472

dirkf requested changes Jan 22, 2024


Suggested change (youtube_dl/extractor/ard.py, _VALID_URL)

-    _VALID_URL = r'https://(?:(?:beta|www)\.)?ardmediathek\.de/(((?:[^/]+/)?(?:player|live|video|serie|sendung)/(?:[^/]+/)*(?P<id>Y3JpZDovL[a-zA-Z0-9]+))|(((?P<sender>[a-zA-Z0-9\-]+)([/]))?(?P<name>[a-zA-Z0-9\-]+)))'
+    _VALID_URL = r'''(?x)
+        https://(?:(?:beta|www)\.)?ardmediathek\.de/
+            (?:
+                (?:[^/#?]+/)?(?:player|live|video|serie|sendung)/(?:[^/#?]+/)*
+                    (?P<id>Y3JpZDovL[a-zA-Z0-9]+)|
+                (?:(?P<sender>[a-zA-Z0-9-]+)/)?(?P<name>[a-zA-Z0-9-]+)
+            )
+        '''
