[utils] Small fixes to utils, make tests pass in Py2 #29845

dirkf · 2021-08-22T12:16:27Z

Please follow the guide below

Before submitting a pull request make sure you have:

Searched the bugtracker for similar pull requests
Read adding new extractor tutorial
Read youtube-dl coding conventions and adjusted the code to meet them
Covered the code with tests (note that PRs without tests will be REJECTED)
Checked the code with flake8

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

I am the original author of this code and I am willing to release it under Unlicense
I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

Bug fix
Improvement
New extractor
New feature

Description of your pull request and other information

Tweaks:

use enumerate() instead of zip(..., itertools.count());
pass encodeFilename() test in Py2 by encoding with 'backslashreplace' and decoding with 'unicode_escape' when filesystem encoding isn't Unicode;
ditto for shell_quote();
consequent changes in compat_setenv(), compat_getenv();
make compat_expanduser() test pass;
consistently test for OS ('nt', 'ce') (ha!);
fix error in urlhandle_detect_ext() where non-ASCII character couldn't be coerced to Unicode in Py2 (resolves Soundcloud 'ascii' codec can't decode byte ... #29417);
fix issue where extract_timezone() could be confused by a date-time string ending in a 4-digit year (check was added, then apparently removed with unified_timestamp()), add some more date formats and tests (resolves [utils] extract_timezone() removes trailing year from time+date-type date-time string #29948)
following the discussion below, add support to urlhandle_detect_ext() for all the filename parameter syntaxes specified in RFC6266
avoid truncating a supplied cookie file, resolves Local cookie file erased after ENOSPC error #30082
fix get_elements_by_classname() matching elements with classname (eg) plist-info when searching for class info, as noted here.
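
To illustrate the last item, here is a minimal sketch of why a word-boundary check can report `info` inside `plist-info`, and how comparing whole whitespace-separated tokens avoids that. The helper names `naive_has_class` and `has_class` are hypothetical and this is not the code from the PR; it only assumes that the earlier matching behaved like a `\b`-delimited regex.

```python
import re


def naive_has_class(class_attr, name):
    # Hypothetical "before" behaviour: \b treats '-' as a word boundary,
    # so 'info' is also found inside 'plist-info'.
    return re.search(r'\b%s\b' % re.escape(name), class_attr) is not None


def has_class(class_attr, name):
    # Compare against whole whitespace-separated class tokens instead.
    return name in class_attr.split()


print(naive_has_class('plist-info title', 'info'))  # True (false positive)
print(has_class('plist-info title', 'info'))        # False
print(has_class('plist info', 'info'))              # True
```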

Code taken from: ytdl-org/youtube-dl#29845 Fixes: ytdl-org/youtube-dl#29948 Authored by: dirkf

* 'master' of https://github.com/yt-dlp/yt-dlp:
  [CBC] Fix CBC Gem extractors (#1013)
  [Peertube] Add channel extractor (#1023)
  [youtube] Warn when trying to download clips
  [test/cookies] Improve logging
  [Nuvid] Fix extractor (#1022)
  [aes] Add `aes_gcm_decrypt_and_verify` (#1020)
  [CGTN] Add extractor (#981)
  [utils] Improve `extract_timezone`
    Code taken from: ytdl-org/youtube-dl#29845 Fixes: ytdl-org/youtube-dl#29948 Authored by: dirkf

C0D3D3V · 2021-10-09T14:07:55Z

Can you also fix the regex so that it can skip the UTF8 filename if it is in front :) ?

My suggestion is: r'attachment;.*filename="(?P<filename>[^"]+)"'

dirkf · 2021-10-09T20:06:19Z

If we're going to support the extended parameter syntax in RFCs 6266/5987 (the latter has been updated by RFC 8187, but it's the normative reference for 6266), we'd better do it properly.

Apparently, there is either the normal parameter syntax ("" around value optional)
attachment; filename=file_name.ext
attachment; filename="file name.ext"
or the extended parameter syntax, with a * added to the parameter name and a charset-name and optional single-quoted language code before the unquoted encoded value
attachment; filename*=ISO-8859-1'fr-CA'nom%20de%20fichier.ext

In the latter case we would need to decode the value appropriately, but I guess any language code can be ignored.
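
As a rough illustration of that decoding step, the sketch below splits an extended value into charset, language and percent-encoded text, then decodes it with the named charset. The function name `decode_ext_value` is hypothetical, and the code is Python 3 only (it uses `urllib.parse.unquote` directly), so it is not the Py2-compatible approach this PR needs.

```python
from urllib.parse import unquote


def decode_ext_value(ext_value):
    """Decode a value of the form charset'[language]'percent-encoded-text."""
    charset, _language, encoded = ext_value.split("'", 2)
    try:
        # The language tag is ignored; only the charset affects decoding.
        return unquote(encoded, encoding=charset)
    except LookupError:
        # Unrecognised charset name: return the value still encoded.
        return encoded


print(decode_ext_value("ISO-8859-1'fr-CA'nom%20de%20fichier.ext"))  # nom de fichier.ext
print(decode_ext_value("UTF-8''%e2%82%ac%20rates"))                 # € rates
```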

In RFC 6266 section 5, we have a set of examples, showing that we can actually have both (or according to the grammar, 0 or more filename/filename* parameters), although we SHOULD skip the first if the second is present, or just take the filename=... if it comes first:

    Content-Disposition: Attachment; filename=example.html

    Content-Disposition: attachment;
                         filename*= UTF-8''%e2%82%ac%20rates

    Content-Disposition: attachment;
                         filename="EURO rates";
                         filename*=utf-8''%e2%82%ac%20rates

As in the first example, the keywords attachment and filename should be matched case-insensitively.

Obviously one could ask whether anyone's download has ever failed as a result of its mimetype not having been detected according to a filename*-type Content-Disposition.

C0D3D3V · 2021-10-09T20:24:09Z

If the regex finds nothing, the function falls back to mimetype2ext, so an extension always comes back and no download can fail because of that. But it would still be nice to get a better extension than the one derived from the MIME type.

(for a "docx" file like in my example the MIME type is application/vnd.openxmlformats-officedocument.wordprocessingml.document, so we get a rather long file extension)
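
To show where such a long extension would come from, here is a small sketch that assumes the fallback simply returns the MIME subtype when it has no better mapping. `naive_mimetype_ext` is a hypothetical stand-in written for this illustration, not youtube-dl's actual mimetype2ext.

```python
def naive_mimetype_ext(mimetype):
    # Hypothetical stand-in: drop any parameters, then keep the subtype
    # (the part after '/') as the "extension".
    return mimetype.split(';')[0].strip().rpartition('/')[2]


docx = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
print(naive_mimetype_ext(docx))
# vnd.openxmlformats-officedocument.wordprocessingml.document
```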

C0D3D3V · 2021-10-09T20:29:16Z

Thanks for researching this in such detail, of course this makes it a bit more complicated. I would keep it as simple as possible as it is only about the file extension and not the filename.

C0D3D3V · 2021-10-09T20:58:05Z

We could also use the cgi lib

>>> import cgi

>>> cgi.parse_header('attachment; filename=file_name.ext')
('attachment', {'filename': 'file_name.ext'})

>>> cgi.parse_header('''attachment; filename="EURO rates"; filename*=utf-8''%e2%82%ac%20rates''')
('attachment', {'filename': 'EURO rates', 'filename*': "utf-8''%e2%82%ac%20rates"})

>>> cgi.parse_header('''attachment;
                           filename*=utf-8''%e2%82%ac%20rates''')
('attachment', {'filename*': "utf-8''%e2%82%ac%20rates"})

Exists only since python 2.7 not since 2.6 :(

dirkf · 2021-10-10T00:51:48Z

This fragment (as updated) could do the trick

        m = re.match(r'''(?xi)
            attachment;\s*
            (?:filename\s*=[^;]+?;\s*)?                    # possible initial filename=...;, ignored
            filename(?P<x>\*)?\s*=\s*                      # filename/filename* = 
                (?(x)(?P<charset>\S+?)'[\w-]*'|(?P<q>")?)  # if * then charset'...' else maybe "
                (?P<filename>(?(q)[^"]+(?=")|\S+))         # actual name of file
            ''', cd)
        if m:
            m = m.groupdict()
            filename = m.get('filename')
            if m.get('x'):
                try:
                    filename = compat_urllib_parse_unquote(filename, encoding=m.get('charset', 'utf-8'))
                except LookupError:  # unrecognised character set name
                    pass
            e = determine_ext(filename, default_ext=None)

replacing

        m = re.match(r'attachment;\s*filename="(?P<filename>[^"]+)"', cd)
        if m:
            e = determine_ext(m.group('filename'), default_ext=None)

C0D3D3V · 2021-10-10T12:46:14Z

Thank you very much!
I have tested it and it works, but the trailing semicolon is not handled. Either we use rstrip to remove the semicolon, or we build that into the regex. (As a reminder, my Content-Disposition was: attachment; filename*=UTF-8''asdasd.docx; filename="asdasd.docx" )

Here is the working version with rstrip:

    def urlhandle_detect_ext(self, url_handle):
        getheader = url_handle.headers.get

        cd = getheader('Content-Disposition')
        if cd:
            m = re.match(
                r'''(?xi)
                attachment;\s*
                (?:filename\s*=[^;]+?;\s*)?                    # possible initial filename=...;, ignored
                filename(?P<x>\*)?\s*=\s*                      # filename/filename* =
                    (?(x)(?P<charset>\S+?)'[\w-]*'|(?P<q>")?)  # if * then charset'...' else maybe "
                    (?P<filename>(?(q)[^"]+(?=")|\S+))         # actual name of file
                ''',
                cd,
            )
            if m:
                m = m.groupdict()
                filename = m.get('filename')
                if m.get('x'):
                    try:
                        filename = compat_urllib_parse_unquote(filename, encoding=m.get('charset', 'utf-8'))
                    except LookupError:  # unrecognised character set name
                        pass
                # determine the extension for both the plain and the extended syntax
                e = determine_ext(filename.rstrip(';'), default_ext=None)
                if e:
                    return e

        return mimetype2ext(getheader('Content-Type'))

dirkf · 2021-10-10T13:04:56Z

> Thank you very much! I have tested it and it works, but the trailing semicolon is not handled. Either we use rstrip to remove the semicolon, or we build that into the regex. (As a reminder, my Content-Disposition was: attachment; filename*=UTF-8''asdasd.docx; filename="asdasd.docx" )

Ah, I missed that. It's the non-recommended ordering with the extended syntax first. Instead of \S+ for the unquoted branch, we need [^\s;]+. I've put in another test for that.
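
For reference, a self-contained sketch of the pattern with `[^\s;]+` in the unquoted branch, run against the RFC 6266 examples and the header from this thread. It is Python 3 only, the names `CD_RE` and `filename_from_content_disposition` are hypothetical, and `os.path.splitext` stands in for determine_ext, so treat it as an approximation of the approach rather than the code in the PR.

```python
import os.path
import re
from urllib.parse import unquote

CD_RE = re.compile(r'''(?xi)
    attachment;\s*
    (?:filename\s*=[^;]+?;\s*)?                    # optional leading plain filename, ignored
    filename(?P<x>\*)?\s*=\s*                      # filename or filename*
        (?(x)(?P<charset>\S+?)'[\w-]*'|(?P<q>")?)  # extended: charset and language; plain: optional quote
        (?P<filename>(?(q)[^"]+(?=")|[^\s;]+))     # the name, stopping at a quote, space or semicolon
    ''')


def filename_from_content_disposition(cd):
    m = CD_RE.match(cd)
    if not m:
        return None
    filename = m.group('filename')
    if m.group('x'):
        try:
            filename = unquote(filename, encoding=m.group('charset'))
        except LookupError:  # unrecognised character set name
            pass
    return filename


headers = [
    'Attachment; filename=example.html',
    "attachment; filename*= UTF-8''%e2%82%ac%20rates",
    'attachment; filename="EURO rates"; filename*=utf-8\'\'%e2%82%ac%20rates',
    'attachment; filename*=UTF-8\'\'asdasd.docx; filename="asdasd.docx"',
]
for cd in headers:
    name = filename_from_content_disposition(cd)
    print(repr(name), repr(os.path.splitext(name)[1]))
```

With the original `\S+` in place of `[^\s;]+`, the last header would yield `asdasd.docx;` including the semicolon, which is the case reported above.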

C0D3D3V · 2021-10-10T13:19:03Z

Works perfectly! Thank you very much! You are the best :)

dirkf force-pushed the df-utils-triv-patch branch from bd96faf to bc6ad58 Compare August 22, 2021 12:19

dirkf marked this pull request as draft August 22, 2021 17:03

dirkf force-pushed the df-utils-triv-patch branch 6 times, most recently from 81b2a9d to 1972157 Compare August 23, 2021 17:02

dirkf marked this pull request as ready for review August 23, 2021 17:10

dirkf force-pushed the df-utils-triv-patch branch 4 times, most recently from 625666f to 1e22200 Compare August 29, 2021 05:27

dirkf force-pushed the df-utils-triv-patch branch from e96bfce to f798b40 Compare September 13, 2021 00:15

dirkf mentioned this pull request Sep 13, 2021

[utils] extract_timezone() removes trailing year from time+date-type date-time string #29948

pukkandan added a commit to yt-dlp/yt-dlp that referenced this pull request Sep 19, 2021

[utils] Improve extract_timezone

f137e4c

Code taken from: ytdl-org/youtube-dl#29845 Fixes: ytdl-org/youtube-dl#29948 Authored by: dirkf

dirkf force-pushed the df-utils-triv-patch branch from f54575f to f33e312 Compare October 10, 2021 12:50

dirkf force-pushed the df-utils-triv-patch branch from f33e312 to 74b5310 Compare October 10, 2021 13:06

nixxo pushed a commit to nixxo/yt-dlp that referenced this pull request Nov 22, 2021

[utils] Improve extract_timezone

69359b7

Code taken from: ytdl-org/youtube-dl#29845 Fixes: ytdl-org/youtube-dl#29948 Authored by: dirkf

dirkf force-pushed the df-utils-triv-patch branch from 16b1fe3 to d7d8e0c Compare January 27, 2022 05:25

dirkf force-pushed the df-utils-triv-patch branch from d7d8e0c to 51d3d0c Compare January 27, 2022 05:38

dirkf force-pushed the master branch from 01bf89e to 4c6fba3 Compare August 26, 2022 07:51

dirkf commented Mar 4, 2024

test/test_utils.py (outdated)

dirkf commented Mar 4, 2024

test/test_utils.py (outdated)

dirkf added 18 commits March 11, 2024 16:06

[utils] Small fixes to utils and compat and test

06d489c

[utils] Fix urlhandle_detect_ext() non-ASCII error in Py2, with test

d87e2ad

[utils] Disambiguate 4-digit year and time-zone suffix

7990d1e

Restore check omitted from extract_timezone(); adjust DATE_FORMATS_DAY/MONTH_FIRST; add tests.

[utils] Detect extension from any RFC Content-Disposition syntax

973f76c

Add support for unquoted token and RFC 5987 extended parameter syntax

[utils] Avoid scrubbing supplied cookie file on failed update

a3fe1d1

[utils] Don't find classname as part of class="... x-classname ...", etc

58f15bb

Eg, in [1], the class with name 'plist-info' was found when searching for 'info'.
[1] ytdl-org#30230

[utils] Improve ExtractorError with msg IV and ie constructor param

eb93aaf

[utils] Work-around for yt-dlp issue 1060 (skip bad certs from Windows, Py>=3.7)

045ff70

[utils] Ensure a value from determine_protocol()

7a438da

[utils] Simplify int_or_none(), based on yt-dlp 9e907eb

5988a39

Also swallow inf, nan

[utils] Sort Chrome versions used for UAs; drop obsolete `Accept-Charset` header

77c778b

[utils] Recognise FLAC audio in parse_codecs()

1d9df28

[utils] Add parsing YYYYMMDD dates, also in Nov/Dec (yt-dlp PR ytdl-org#2094)

f5f1908

[utils] Improve parse_count() with single regex, based on yt-dlp 352d5da

0205fea

[utils] Fix/improve InAdvancePagedList, from yt-dlp d37707b

645d7a3

[utils] Handle ss:xxx in parse_duration(), based on yt-dlp 8bd1c00

059ef5b

[utils] Unescape HTML5 named character references (with no ;)

80cb917

[utils] mode might be None in write_string()

05aa2ad

* see yt-dlp/yt-dlp#8816

dirkf force-pushed the df-utils-triv-patch branch from e456d46 to 05aa2ad Compare March 11, 2024 18:36
