Add support for shows.acast.com #32747

PierreMesure · 2024-03-17T06:58:56Z

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

I tried to add support for Acast's url shows.acast.com. It's my first PR and I wasn't able to check my regex properly so I hope someone more experienced can review it properly. That being said, I tested with the new URLs and it works 🙂.

While testing with the old URLs to ensure no regression, I noticed that some of them are not active anymore: https://www.acast.com/todayinfocus returns a 404.

I didn't remove it as I wasn't sure I should but I'm happy to do a bit of cleaning if someones greenlights it.

I also saw that they are using a new format with a dollar sign for embeds URL but (not being a regex expert) I failed to add it. Here's an example: https://embed.acast.com/$/611233d8767fdf0012f22cb6/611233e4988b6a001394b5cf?feed=true

dirkf

Thanks for your work!

I've made a couple of suggestions to fix/improve the regexes (should make the CI tests pass).

For tests that don't work because the page is unavailable, you can add 'skip': 'reason why unavailable', to the test case dict with the info_dict key. If that leaves no valid test page for one of the extraction tactics being tested, a test with a valid page that has the same structure has to be added, or it should be documented in comments that the structure is obsolete.

youtube_dl/extractor/acast.py

Co-authored-by: dirkf <fieldhouse@gmx.net>

PierreMesure · 2024-03-29T13:46:30Z

I committed your suggestions and changed the tests as you asked. Let's see whether there are still any errors.

dirkf

These changes, not all related to the current PR, make the download tests pass:

--- PR/youtube-dl/youtube_dl/extractor/acast.py
+++ new/youtube-dl/youtube_dl/extractor/acast.py
@@ -55,16 +55,16 @@
                     '''
     _TESTS = [{
         'url': 'https://www.acast.com/sparpodcast/2.raggarmordet-rosterurdetforflutna',
-        'md5': 'f5598f3ad1e4776fed12ec1407153e4b',
+        'md5': '9904ad5409bbc3b945efc1fe55e5f5d4',
         'info_dict': {
             'id': '2a92b283-1a75-4ad8-8396-499c641de0d9',
             'ext': 'mp3',
             'title': '2. Raggarmordet - Röster ur det förflutna',
-            'description': 'md5:a992ae67f4d98f1c0141598f7bebbf67',
+            'description': r're:(?s)Spår får kontakt med raggarna från 1979\. En titt i arkiven ger oss en gammal förundersökning som någon sparat trots alla bestämmelser om att den bör förstöras\. .{356}\.$',  # md5:a992ae67f4d98f1c0141598f7bebbf67',
             'timestamp': 1477346700,
             'upload_date': '20161024',
             'duration': 2766,
-            'creator': 'Anton Berg & Martin Johnson',
+            'creator': 'Third Ear Studio',
             'series': 'Spår',
             'episode': '2. Raggarmordet - Röster ur det förflutna',
         }
@@ -83,7 +83,7 @@
     }]
 
     def _real_extract(self, url):
-        channel, display_id = re.match(self._VALID_URL, url).groups()
+        channel, display_id = re.match(self._VALID_URL, url).group('channel', 'id')
         episode = self._call_api(
             '%s/episodes/%s' % (channel, display_id),
             display_id, {'showInfo': 'true'})
@@ -106,8 +106,7 @@
         'info_dict': {
             'id': '4efc5294-5385-4847-98bd-519799ce5786',
             'title': 'Today in Focus',
-            'description': 'md5:c09ce28c91002ce4ffce71d6504abaae',
-            'skip': 'Error 404'
+            'description': r're:Hosted by Nosheen Iqbal and Michael Safi, Today in Focus brings you closer to Guardian journalism\. .{994}$',
         },
         'playlist_mincount': 200,
     }, {
@@ -120,7 +119,8 @@
         'url': 'https://play.acast.com/s/getthebeltpod',
         'info_dict': {
             'id': '4caac858-3cda-44e6-b684-50d88b909c18',
-            'title': 'Get The Belt'
+            'title': 'Get The Belt',
+            'description': r'''re:(?s)British Podcast that you probably are not ready for! Join Myles with segments 'such as Myles Measures' that discusses relevant pop culture and society shenanigans,  .{867}$''',
         },
         'playlist_mincount': 138,
     }]
@@ -137,4 +137,4 @@
         for episode in (show.get('episodes') or []):
             entries.append(self._extract_episode(episode, show_info))
         return self.playlist_result(
-            entries, show.get('id'), show.get('title'), show.get('description'))
+            entries, show.get('id'), show.get('title'), clean_html(show.get('description')))

Example:

$ python test/test_download.py TestDownload.test_ACast
[acast] 2.raggarmordet-rosterurdetforflutna: Downloading JSON metadata
[info] Writing video description metadata as JSON to: test_ACast_2a92b283-1a75-4ad8-8396-499c641de0d9.info.json
[debug] Invoking downloader on u'https://sphinx.acast.com/sparpodcast/2.raggarmordet-rosterurdetforflutna/media.mp3?tk=eyJ0ayI6ImRlZmF1bHQiLCJhZHMiOnRydWUsInNwb25zIjp0cnVlLCJzdGF0dXMiOiJwdWJsaWMifQ==&sig=4mClRAB0qh2d8ZixupPWgjm9YxaIJSPcak-ZATQ-H4g'
[download] Destination: test_ACast_2a92b283-1a75-4ad8-8396-499c641de0d9.mp3
[download] 100% of 10.00KiB in 00:00
.
----------------------------------------------------------------------
Ran 1 test in 3.362s

OK
$

dirkf · 2024-04-02T02:03:15Z

For long text fields, the download test proposes a test value expressed as a digest using MD5. Such a test value is accurate and short, but doesn't reveal what changed if the test starts to fail, as happened here.

In the diff I used pattern test values for description. If a test starts to fail, a pattern can be compared with the new value, showing whether there's just been a small change to the value, as when the site starts to append some or different boilerplate, or whether a completely different extraction tactic is needed to get the value correctly.

Add support for shows.acast.com

2594f3c

dirkf requested changes Mar 21, 2024

View reviewed changes

youtube_dl/extractor/acast.py Outdated Show resolved Hide resolved

youtube_dl/extractor/acast.py Outdated Show resolved Hide resolved

PierreMesure and others added 2 commits March 29, 2024 13:25

Apply suggestions from code review

3f45e4c

Co-authored-by: dirkf <fieldhouse@gmx.net>

Edit tests

b91bbb8

PierreMesure requested a review from dirkf March 29, 2024 13:45

dirkf requested changes Apr 2, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for shows.acast.com #32747

Add support for shows.acast.com #32747

PierreMesure commented Mar 17, 2024 •

edited

dirkf left a comment

PierreMesure commented Mar 29, 2024

dirkf left a comment

dirkf commented Apr 2, 2024

Add support for shows.acast.com #32747

Are you sure you want to change the base?

Add support for shows.acast.com #32747

Conversation

PierreMesure commented Mar 17, 2024 • edited

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

What is the purpose of your pull request?

Description of your pull request and other information

dirkf left a comment

Choose a reason for hiding this comment

PierreMesure commented Mar 29, 2024

dirkf left a comment

Choose a reason for hiding this comment

dirkf commented Apr 2, 2024

PierreMesure commented Mar 17, 2024 •

edited