Fix npo support #31976

Open

bartbroere wants to merge 39 commits into ytdl-org:master from fix-npo-support

Changes from 12 commits

Commits (39)
3b31478  Fix support for NPO downloads (bartbroere, Mar 31, 2023)
b4776f2  Import from compat (bartbroere, Mar 31, 2023)
fb2b4e2  Add line comment (bartbroere, Mar 31, 2023)
9e1acb2  Fix flake8 (bartbroere, Mar 31, 2023)
6328978  Accept suggestions on PR; comply with conventions (bartbroere, Apr 3, 2023)
0c7261d  Update npo.py (dirkf, Apr 6, 2023)
c409a8c  Merge branch 'ytdl-org:master' into fix-npo-support (bartbroere, Feb 25, 2024)
f76d58c  Skip a test (bartbroere, Feb 26, 2024)
da3d1f4  Add notes on new npo.nl site (bartbroere, Mar 1, 2024)
5773681  Fix token URL (bartbroere, Mar 1, 2024)
29724e7  Delete all broken extractors (bartbroere, Mar 1, 2024)
21eb451  Convert the description into code (bartbroere, Mar 1, 2024)
0dc7d95  Comply with coding conventions a bit more (bartbroere, Mar 1, 2024)
fb7b717  Speculate about other ways of getting productId (bartbroere, Mar 1, 2024)
f9e59b0  Add the possibility to add 'hls' later (bartbroere, Mar 1, 2024)
8b1a7d9  Use provided util (bartbroere, Mar 1, 2024)
34b5b20  Refactor into reusable method (bartbroere, Mar 3, 2024)
4fc4238  Fix lint (bartbroere, Mar 5, 2024)
28ba01f  Add Ongehoord Nederland and test URL for BNNVARA (bartbroere, Mar 5, 2024)
eb6e396  First version of a VPRO regex (bartbroere, Mar 5, 2024)
d36d50f  Re-add Zapp (bartbroere, Mar 5, 2024)
d426a92  Encoding suggestion from PR (bartbroere, Mar 5, 2024)
3b3d73c  Use program-detail endpoint and remove a test (bartbroere, Mar 6, 2024)
4b24e5f  Re-add SchoolTV (bartbroere, Mar 6, 2024)
681b390  Fix flake8 and better error reporting (bartbroere, Mar 6, 2024)
159f825  Add scaffolding for last few extractors and change order so the PR di… (bartbroere, Mar 6, 2024)
0cbcd1a  Make diff better (bartbroere, Mar 6, 2024)
0ab79c3  Reusable code for two NTR sites (bartbroere, Mar 7, 2024)
c08f29f  Update unit tests (bartbroere, Mar 10, 2024)
28624cf  Work work (bartbroere, Mar 10, 2024)
1ca4e68  Add an MD5 (bartbroere, Mar 10, 2024)
4398f68  Fix zapp extractor (bartbroere, Mar 11, 2024)
58d7a00  Resolve some of the pull request feedback (bartbroere, Mar 11, 2024)
d4250c8  Merge branch 'ytdl-org:master' into fix-npo-support (bartbroere, Mar 12, 2024)
ad64f37  Improve regex (bartbroere, Mar 14, 2024)
bc86c5f  Make regex more specific and remove redundant .* (bartbroere, Mar 14, 2024)
4c90b2f  Adhere to code style (bartbroere, Mar 14, 2024)
007bbea  Remove afspelen and trailing slashes with one regex (bartbroere, Mar 14, 2024)
a60972e  Fix indent from suggestion (bartbroere, Mar 15, 2024)
11 changes: 10 additions & 1 deletion youtube_dl/extractor/extractors.py
@@ -847,7 +847,16 @@
     NownessSeriesIE,
 )
 from .noz import NozIE
-from .npo import NPOIE
+from .npo import (
+    AndereTijdenIE,
+    BNNVaraIE,
+    NPOIE,
+    ONIE,
+    SchoolTVIE,
+    HetKlokhuisIE,
+    VPROIE,
+    WNLIE,
+)
 from .npr import NprIE
 from .nrk import (
     NRKIE,
268 changes: 211 additions & 57 deletions youtube_dl/extractor/npo.py
@@ -1,43 +1,21 @@
# coding: utf-8
from __future__ import unicode_literals

import json
import re

from .common import InfoExtractor
from ..utils import (
ExtractorError,
)
from ..utils import ExtractorError


class NPOIE(InfoExtractor):
IE_NAME = 'npo'
IE_DESC = 'npo.nl'
_VALID_URL = r'''(?x)
(?:
npo:|
https?://
(?:www\.)?
(?:
npo\.nl/(?:[^/]+/)*
)
)
(?P<id>[^/?#]+)
'''
_VALID_URL = r'https?://(?:www\.)?npo\.nl/.*'

_TESTS = [{
'url': 'https://npo.nl/start/serie/zembla/seizoen-2015/wie-is-de-mol-2/',
# TODO fill in other test attributes
}, {
'url': 'http://www.npo.nl/de-mega-mike-mega-thomas-show/27-02-2009/VARA_101191800',
'md5': 'da50a5787dbfc1603c4ad80f31c5120b',
'info_dict': {
'id': 'VARA_101191800',
'ext': 'm4v',
'title': 'De Mega Mike & Mega Thomas show: The best of.',
'description': 'md5:3b74c97fc9d6901d5a665aac0e5400f4',
'upload_date': '20090227',
'duration': 2400,
},
'skip': 'Video gone',
}, {
'url': 'https://npo.nl/start/serie/vpro-tegenlicht/seizoen-11/zwart-geld-de-toekomst-komt-uit-afrika',
'md5': 'f8065e4e5a7824068ed3c7e783178f2c',
@@ -67,45 +45,49 @@ def _real_extract(self, url):
url = url[:-10]
url = url.rstrip('/')
slug = url.split('/')[-1]
page = self._download_webpage(url, slug, 'Finding productId using slug: %s' % slug)
# TODO find out what proper HTML parsing utilities are available in youtube-dl
next_data = page.split('<script id="__NEXT_DATA__" type="application/json">')[1].split('</script>')[0]
# TODO The data in this script tag feels like GraphQL, so there might be an easier way
# to get the product id, maybe using a GraphQL endpoint
next_data = json.loads(next_data)
product_id, title, description, thumbnail = None, None, None, None
for query in next_data['props']['pageProps']['dehydratedState']['queries']:
if isinstance(query['state']['data'], list):
for entry in query['state']['data']:
if entry['slug'] == slug:
product_id = entry.get('productId')
title = entry.get('title')
synopsis = entry.get('synopsis', {})
description = (
synopsis.get('long')
or synopsis.get('short')
or synopsis.get('brief')
)
thumbnails = entry.get('images')
for thumbnail_entry in thumbnails:
if 'url' in thumbnail_entry:
thumbnail = thumbnail_entry.get('url')

program_metadata = self._download_json('https://npo.nl/start/api/domain/program-detail',
slug,
query={'slug': slug})
product_id = program_metadata.get('productId')
images = program_metadata.get('images')
thumbnail = None
for image in images:
thumbnail = image.get('url')
break
title = program_metadata.get('title')
descriptions = program_metadata.get('description', {})
description = descriptions.get('long') or descriptions.get('short') or descriptions.get('brief')
duration = program_metadata.get('durationInSeconds')

if not product_id:
raise ExtractorError('No productId found for slug: %s' % slug)

token = self._get_token(product_id)
formats = self._download_by_product_id(product_id, slug, url)

return {
'id': slug,
'formats': formats,
'title': title or slug,
'description': description or title or slug,
'thumbnail': thumbnail,
'duration': duration,
}

def _download_by_product_id(self, product_id, slug, url=None):
token = self._get_token(product_id)
formats = []
for profile in (
'dash',
# 'hls', # TODO test what needs to change for 'hls' support
# 'hls' is available too, but implementing it doesn't add much
# As far as I know 'dash' is always available
):
stream_link = self._download_json(
'https://prod.npoplayer.nl/stream-link', video_id=slug,
data=json.dumps({
'profileName': profile,
'drmType': 'widevine',
Contributor comment: Naïvely, I wonder whether any streams returned with this request are not WV-encrypted, and what happens if other or no values are passed.

'referrerUrl': url,
'referrerUrl': url or '',
}).encode('utf8'),
headers={
'Authorization': token,
@@ -114,12 +96,184 @@ def _real_extract(self, url):
)
stream_url = stream_link.get('stream', {}).get('streamURL')
Contributor comment (suggested change):

-            stream_url = stream_link.get('stream', {}).get('streamURL')
+            stream_url = traverse_obj(stream_link, ('stream', 'streamURL'))

formats.extend(self._extract_mpd_formats(stream_url, slug, mpd_id='dash', fatal=False))
return formats


class BNNVaraIE(NPOIE):
IE_NAME = 'bnnvara'
IE_DESC = 'bnnvara.nl'
_VALID_URL = r'https?://(?:www\.)?bnnvara\.nl/videos/[0-9]*'
_TESTS = [{
'url': 'https://www.bnnvara.nl/videos/27455',
# TODO fill in other test attributes
}]

def _real_extract(self, url):
url = url.rstrip('/')
video_id = url.split('/')[-1]

media = self._download_json('https://api.bnnvara.nl/bff/graphql',
video_id,
data=json.dumps(
{
'operationName': 'getMedia',
'variables': {
'id': video_id,
'hasAdConsent': False,
'atInternetId': 70
},
'query': 'query getMedia($id: ID!, $mediaUrl: String, $hasAdConsent: Boolean!, $atInternetId: Int) {\n player(\n id: $id\n mediaUrl: $mediaUrl\n hasAdConsent: $hasAdConsent\n atInternetId: $atInternetId\n ) {\n ... on PlayerSucces {\n brand {\n name\n slug\n broadcastsEnabled\n __typename\n }\n title\n programTitle\n pomsProductId\n broadcasters {\n name\n __typename\n }\n duration\n classifications {\n title\n imageUrl\n type\n __typename\n }\n image {\n title\n url\n __typename\n }\n cta {\n title\n url\n __typename\n }\n genres {\n name\n __typename\n }\n subtitles {\n url\n language\n __typename\n }\n sources {\n name\n url\n ratio\n __typename\n }\n type\n token\n __typename\n }\n ... on PlayerError {\n error\n __typename\n }\n __typename\n }\n}'
Contributor comment: To make this less ridiculous, it could be assigned to a variable at the start of the routine using this sort of formatting:

        GQL_QUERY = (
            'query getMedia($id: ID!, $mediaUrl: String, $hasAdConsent: Boolean!, $atInternetId: Int) {\n  '
            'player(\n    id: $id\n    mediaUrl: $mediaUrl\n    hasAdConsent: $hasAdConsent\n    '
            'atInternetId: $atInternetId\n  ) {\n    ... on PlayerSucces {\n      '
            'brand {\n        name\n        slug\n        broadcastsEnabled\n        '
            '__typename\n      }\n      title\n      programTitle\n      pomsProductId\n      '
            ...
            '... on PlayerError {\n      error\n      __typename\n    }\n    __typename\n  }\n}')

}).encode('utf8'),
headers={
'Content-Type': 'application/json',
})
product_id = media.get('data', {}).get('player', {}).get('pomsProductId')

formats = self._download_by_product_id(product_id, video_id)
Contributor comment (suggested change):

-        product_id = media.get('data', {}).get('player', {}).get('pomsProductId')
-        formats = self._download_by_product_id(product_id, video_id)
+        product_id = traverse_obj(media, ('data', 'player', 'pomsProductId'))
+        formats = self._download_by_product_id(product_id, video_id) if product_id else []
+        self._sort_formats(formats)


return {
Contributor comment: This could be re-worked like the previous return, with the merge_dicts(traverse_obj(metadata, dict_construction), dict_of_known_vars) pattern.

'id': slug,
'id': product_id,
'title': media.get('data', {}).get('player', {}).get('title'),
'formats': formats,
'thumbnail': media.get('data', {}).get('player', {}).get('image').get('url'),
}
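
A minimal sketch of the merge_dicts/traverse_obj pattern the reviewer refers to (a hypothetical rewrite, not code from the PR; it assumes merge_dicts and traverse_obj are imported from ..utils and that traverse_obj supports the dict-construction form the reviewer mentions):

        # Hedged sketch of the reviewer's pattern, not part of the PR.
        # merge_dicts() keeps the first non-None value per key, so the traversed
        # metadata is combined with the values already computed above.
        return merge_dicts(traverse_obj(media, {
            'title': ('data', 'player', 'title'),
            'thumbnail': ('data', 'player', 'image', 'url'),
        }) or {}, {
            'id': product_id,
            'formats': formats,
        })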


class ONIE(NPOIE):
IE_NAME = 'on'
IE_DESC = 'ongehoordnederland.tv'
_VALID_URL = r'https?://(?:www\.)?ongehoordnederland.tv/.*'
_TESTS = [{
'url': 'https://ongehoordnederland.tv/2024/03/01/korte-clips/heeft-preppen-zin-betwijfel-dat-je-daar-echt-iets-aan-zult-hebben-bij-oorlog-lydia-daniel/',
# TODO fill in other test attributes
}]

def _real_extract(self, url):
video_id = url.rstrip('/').split('/')[-1]
page, _ = self._download_webpage_handle(url, video_id)
Contributor comment: If not using the returned urlhandle to track redirection or errors:

Suggested change:

-        page, _ = self._download_webpage_handle(url, video_id)
+        page = self._download_webpage(url, video_id)

results = re.findall("page: '(.+)'", page)
formats = []
for result in results:
formats.extend(self._download_by_product_id(result, video_id))

if not formats:
raise ExtractorError('Could not find a POMS product id in the provided URL, '
'perhaps because all stream URLs are DRM protected.')
Comment on lines +169 to +171 (dirkf, Contributor, Mar 7, 2024):

self._sort_formats(...) should be called, and will raise if there aren't any. If there is a way to identify that a stream has DRM, and given that, unlike yt-dlp, we're going to skip DRM formats, one could, e.g., return a tuple (formats, number_of_formats_seen) and then compare that total against len(formats). If not formats and the values differ, self.report_drm() can be called.
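
A minimal sketch of the pattern described above (hypothetical, not code from the PR; it assumes _download_by_product_id() is reworked to return (formats, number_of_formats_seen), and that a report_drm() helper with a yt-dlp-like signature is available, as the reviewer implies):

        # Hedged sketch of the reviewer's suggestion, not part of the PR.
        formats = []
        formats_seen = 0
        for result in results:
            fmts, seen = self._download_by_product_id(result, video_id)
            formats.extend(fmts)
            formats_seen += seen

        if not formats and formats_seen:
            # everything that was seen is DRM-protected and was skipped
            self.report_drm(video_id)
        self._sort_formats(formats)  # raises if no formats were found at all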


return {
'id': video_id,
'title': video_id,
'formats': formats,
'title': title or slug,
'description': description,
'thumbnail': thumbnail,
# TODO fill in other metadata that's available
}


class ZAPPIE(NPOIE):
IE_NAME = 'zapp'
IE_DESC = 'zapp.nl'
_VALID_URL = r'https?://(?:www\.)?zapp.nl/.*'

_TESTS = [{
'url': 'https://www.zapp.nl/programmas/zappsport/gemist/AT_300003973',
# TODO fill in other test attributes
}]

def _real_extract(self, url):
video_id = url.rstrip('/').split('/')[-1]

formats = self._download_by_product_id(url, video_id)

return {
'id': video_id,
'title': video_id,
'formats': formats,
}


class SchoolTVIE(NPOIE):
IE_NAME = 'schooltv'
IE_DESC = 'schooltv.nl'
_VALID_URL = r'https?://(?:www\.)?schooltv.nl/item/.*'

_TESTS = [{
'url': 'https://schooltv.nl/item/zapp-music-challenge-2015-zapp-music-challenge-2015',
# TODO fill in other test attributes
}]

def _real_extract(self, url):
video_id = url.rstrip('/').split('/')[-1]

# TODO Find out how we could obtain this automatically
# Otherwise this extractor might break each time SchoolTV deploys a new release
build_id = 'b7eHUzAVO7wHXCopYxQhV'
Comment on lines +224 to +226 (rvsit, Mar 6, 2024):

I think the only way is to load a random page, JSON-parse the __NEXT_DATA__ part and get the buildId prop from there. But then we might as well have that 'random page' be the actual video page and skip the /_next/data/ download part, as that object already contains the poms_mid. It is not great, but I think the only stable option is parsing the __NEXT_DATA__ part for the poms_mid, like we initially did for the NPO Start web UI. I have worked with Next.js for quite a while and they have never changed the __NEXT_DATA__ part as far as I know, so it should be relatively safe.

Contributor comment: There is the _search_nextjs_data() method if needed.

Author comment (in reply): Thanks! I'll look into that and use it if I can make it work.

Contributor comment: In fact, all the information that you might have got with the JSON API is in the Next.js hydration JSON in the page, including the build ID that is no longer of interest. I have this working but will update once other issues have been cleared.
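
A minimal sketch of the __NEXT_DATA__ approach discussed above (hypothetical, not code from the PR): fetch the item page itself, parse the Next.js hydration JSON with _search_nextjs_data(), and pull the poms_mid out of it. The exact path to poms_mid inside the hydration JSON, and the use of traverse_obj, are assumptions here.

        # Hedged sketch of the discussion above, not part of the PR.
        webpage = self._download_webpage(url, video_id)
        next_data = self._search_nextjs_data(webpage, video_id)
        # the location of poms_mid inside the hydration JSON is an assumption
        poms_mid = traverse_obj(next_data, ('props', 'pageProps', 'data', 'poms_mid'))
        if not poms_mid:
            raise ExtractorError('Could not find poms_mid in __NEXT_DATA__ for %s' % video_id)
        formats = self._download_by_product_id(poms_mid, video_id)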


metadata_url = 'https://schooltv.nl/_next/data/' \
+ build_id \
+ '/item/' \
+ video_id + '.json'

metadata = self._download_json(metadata_url,
video_id).get('pageProps', {}).get('data', {})

formats = self._download_by_product_id(metadata.get('poms_mid'), video_id)

if not formats:
raise ExtractorError('Could not find a POMS product id in the provided URL, '
'perhaps because all stream URLs are DRM protected.')

return {
'id': video_id,
'title': metadata.get('title', '') + ' - ' + metadata.get('subtitle', ''),
Contributor comment: Or use utils.join_nonempty() for this effect:

  • 'title', 'subtitle' -> 'title - subtitle'
  • 'title', '' -> 'title'
  • '', 'subtitle' -> 'subtitle'
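
A minimal sketch of that suggestion (hypothetical; it assumes join_nonempty from ..utils accepts a delim keyword, as its yt-dlp counterpart does):

            # Hedged sketch, not part of the PR: drops whichever part is empty or missing.
            'title': join_nonempty(metadata.get('title'), metadata.get('subtitle'), delim=' - '),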

'description': metadata.get('description') or metadata.get('short_description'),
'formats': formats,
}


class HetKlokhuisIE(NPOIE):
...

def _real_extract(self, url):
...


class VPROIE(NPOIE):
IE_NAME = 'vpro'
IE_DESC = 'vpro.nl'
_VALID_URL = r'https?://(?:www\.)?vpro.nl/.*'
Contributor comment: Is this a tight enough pattern?

_TESTS = [{
'url': 'https://www.vpro.nl/programmas/tegenlicht/kijk/afleveringen/2015-2016/offline-als-luxe.html',
# TODO fill in other test attributes
}]

def _real_extract(self, url):
video_id = url.rstrip('/').split('/')[-1]
page, _ = self._download_webpage_handle(url, video_id)
results = re.findall(r'data-media-id="(.+_.+)"\s', page)
formats = []
for result in results:
formats.extend(self._download_by_product_id(result, video_id))
break # TODO find a better solution, VPRO pages can have multiple videos embedded
Contributor comment: May this embedding occur in other pages (not vpro.nl)? Are the second and subsequent videos related (clips, trailers, etc.), or is the case more like a series page with various episodes? In the first case maybe skip the subsidiary videos; in the second, normally return a playlist result whose entries are either the url_result()s of episode URLs constructed for each video, or info_dicts extracted from the page.

Contributor comment: Looking at the test video page, there is apparently a content video and a teaser video. The former can be detected because it's inside (preceded by) <div class=grid>. As far as I can see, the other pages that might list multiple videos are playlist pages like https://www.vpro.nl/programmas/tegenlicht/kijk/afleveringen.html or https://www.vpro.nl/programmas/tegenlicht/categorieen/wereld.html that don't include data-media-ids but just have links to programme episodes. Counterexamples welcome.
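
A minimal sketch of the playlist option described above (hypothetical, not code from the PR): when a page carries more than one data-media-id, each one becomes its own playlist entry instead of being dropped by the break.

        # Hedged sketch of the reviewer's playlist suggestion, not part of the PR.
        if len(results) > 1:
            entries = [{
                'id': media_id,
                'title': media_id,
                'formats': self._download_by_product_id(media_id, video_id),
            } for media_id in results]
            return self.playlist_result(entries, playlist_id=video_id, playlist_title=video_id)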


if not formats:
raise ExtractorError('Could not find a POMS product id in the provided URL, '
'perhaps because all stream URLs are DRM protected.')

return {
'id': video_id,
'title': video_id,
'formats': formats,
}


class WNLIE(NPOIE):
...

def _real_extract(self, url):
...


class AndereTijdenIE(NPOIE):
...

def _real_extract(self, url):
...