Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[utils] Small fixes to utils, make tests pass in Py2 #29845

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

dirkf
Copy link
Contributor

@dirkf dirkf commented Aug 22, 2021

[utils] Small fixes to utils, make tests pass in Py2

Please follow the guide below


Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

Tweaks:

  • use enumerate() instead of zip(..., itertools.count());
  • pass encodeFilename() test in Py2 by encoding with 'backslashreplace' and decoding with 'unicode_escape' when filesystem encoding isn't Unicode;
  • ditto for shell_quote();
  • consequent changes in compat_setenv(), compat_getenv();
  • make compat_expanduser() test pass;
  • consistently test for OS ('nt', 'ce') (ha!);
  • fix error in urlhandle_detect_ext() where non-ASCII character couldn't be coerced to Unicode in Py2 (resolves Soundcloud 'ascii' codec can't decode byte ... #29417);
  • fix issue where extract_timezone() could be confused by a date-time string ending in a 4-digit year (check was added, then apparently removed with unified_timestamp()), add some more date formats and tests (resolves [utils] extract_timezone() removes trailing year from time+date-type date-time string #29948)
  • following the discussion below, add support to urlhandle_detect_ext() for all the filename parameter syntaxes specified in RFC6266
  • avoid truncating a supplied cookie file, resolves Local cookie file erased after ENOSPC error #30082
  • fix get_elements_by_classname() matching elements with classname (eg) plist-info when searching for class info, as noted here.

@dirkf dirkf marked this pull request as draft August 22, 2021 17:03
@dirkf dirkf force-pushed the df-utils-triv-patch branch 6 times, most recently from 81b2a9d to 1972157 Compare August 23, 2021 17:02
@dirkf dirkf marked this pull request as ready for review August 23, 2021 17:10
@dirkf dirkf force-pushed the df-utils-triv-patch branch 4 times, most recently from 625666f to 1e22200 Compare August 29, 2021 05:27
pukkandan added a commit to yt-dlp/yt-dlp that referenced this pull request Sep 19, 2021
Lesmiscore added a commit to ytdl-patched/ytdl-patched that referenced this pull request Sep 19, 2021
* 'master' of https://github.com/yt-dlp/yt-dlp:
  [CBC] Fix CBC Gem extractors (#1013)
  [Peertube] Add channel extractor (#1023)
  [youtube] Warn when trying to download clips
  [test/cookies] Improve logging
  [Nuvid] Fix extractor (#1022)
  [aes] Add `aes_gcm_decrypt_and_verify` (#1020)
  [CGTN] Add extractor (#981)
  [utils] Improve `extract_timezone` Code taken from: ytdl-org/youtube-dl#29845 Fixes: ytdl-org/youtube-dl#29948 Authored by: dirkf
@C0D3D3V
Copy link

C0D3D3V commented Oct 9, 2021

Can you also fix the regex so that it can skip the UTF8 filename if it is in front :) ?
urlhandle_detect_ext

My suggestion is: r'attachment;.*filename="(?P<filename>[^"]+)

@dirkf
Copy link
Contributor Author

dirkf commented Oct 9, 2021

If we're going to support the extended parameter syntax in RFCs 6266/5987 (the latter has been updated by RFC 8187, but it's the normative reference for 6266), we'd better do it properly.

Apparently, there is either the normal parameter syntax ("" around value optional)
attachment; filename=file_name.ext
attachment; filename="file name.ext"
or the extended parameter syntax, with a * added to the parameter name and a charset-name and optional single-quoted language code before the unquoted encoded value
attachment; filename*=ISO-8859-1'fr-CA'nom%20de%20fichier.ext

In the latter case we would need to decode the value appropriately, but I guess any language code can be ignored.

In RFC 6266 section 5, we have a set of examples, showing that we can actually have both (or according to the grammar, 0 or more filename/filename* parameters), although we SHOULD skip the first if the second is present, or just take the filename=... if it comes first:

    Content-Disposition: Attachment; filename=example.html

    Content-Disposition: attachment;
                         filename*= UTF-8''%e2%82%ac%20rates

    Content-Disposition: attachment;
                         filename="EURO rates";
                         filename*=utf-8''%e2%82%ac%20rates

As in the first example, the keywords attachment and filename should be matched case-insensitively.

Obviously one could ask whether anyone's download has ever failed as a result of its mimetype not having been detected according to a filename*-type Content-Disposition.

@C0D3D3V
Copy link

C0D3D3V commented Oct 9, 2021

If the regex finds nothing, the function falls back to mimetype2ext. So in any case an extension always comes back, therefore no download can fail because of that. But it would still be nice to get a better ending than the one from the MIME type.

(for a "docx" file like in my example the MIME type is application/vnd.openxmlformats-officedocument.wordprocessingml.document, so we get a rather long file extension)

@C0D3D3V
Copy link

C0D3D3V commented Oct 9, 2021

Thanks for researching this in such detail, of course this makes it a bit more complicated. I would keep it as simple as possible as it is only about the file extension and not the filename.

@C0D3D3V
Copy link

C0D3D3V commented Oct 9, 2021

We could also use the cgi lib

>>> import cgi

>>> cgi.parse_header('attachment; filename=file_name.ext')
('attachment', {'filename': 'file_name.ext'})

>>> cgi.parse_header('attachment; filename="EURO rates"; filename*=utf-8''%e2%82%ac%20rates')
('attachment', {'filename': 'EURO rates', 'filename*': 'utf-8%e2%82%ac%20rates'})

>>> cgi.parse_header('''attachment;
                           filename*=utf-8''%e2%82%ac%20rates''')
('attachment', {'filename*': "utf-8''%e2%82%ac%20rates"})                     
                           

Exists only since python 2.7 not since 2.6 :(

@dirkf
Copy link
Contributor Author

dirkf commented Oct 10, 2021

This fragment (as updated) could do the trick

        m = re.match(r'''(?xi)
            attachment;\s*
            (?:filename\s*=[^;]+?;\s*)?                    # possible initial filename=...;, ignored
            filename(?P<x>\*)?\s*=\s*                      # filename/filename* = 
                (?(x)(?P<charset>\S+?)'[\w-]*'|(?P<q>")?)  # if * then charset'...' else maybe "
                (?P<filename>(?(q)[^"]+(?=")|\S+))         # actual name of file
            ''', cd)
        if m:
            m = m.groupdict()
            filename = m.get('filename')
            if m.get('x'):
                try:
                    filename = compat_urllib_parse_unquote(filename, encoding=m.get('charset','utf-8'))
                except LookupError:  # unrecognised character set name
                    pass
            e = determine_ext(filename, default_ext=None)

replacing

        m = re.match(r'attachment;\s*filename="(?P<filename>[^"]+)"', cd)
        if m:
            e = determine_ext(m.group('filename'), default_ext=None)

@C0D3D3V
Copy link

C0D3D3V commented Oct 10, 2021

Thank you very much!
I have tested it, it works, but the semicolon was forgotten. Either we use rstrip to remove the semicolon or one builds it yet into the Regex. (As a reminder, my Content-Disposition was: attachment; filename*=UTF-8''asdasd.docx; filename="asdasd.docx" )

grafik

Here is the working version with rstrip:

 def urlhandle_detect_ext(self, url_handle):
        getheader = url_handle.headers.get

        cd = getheader('Content-Disposition')
        if cd:
            m = re.match(
                r'''(?xi)
                attachment;\s*
                (?:filename\s*=[^;]+?;\s*)?                   # possible initial filename=...;, ignored
                filename(?P<x>\*)?\s*=\s*                      # filename/filename* = 
                    (?(x)(?P<charset>\S+?)'[\w-]*'|(?P<q>")?)  # if * then charset'...' else maybe "
                    (?P<filename>(?(q)[^"]+(?=")|\S+))         # actual name of file
                ''',
                cd,
            )
            if m:
                m = m.groupdict()
                filename = m.get('filename')
                if m.get('x'):
                    try:
                        filename = compat_urllib_parse_unquote(filename, encoding=m.get('charset', 'utf-8'))
                    except LookupError:  # unrecognised character set name
                        pass
                    e = determine_ext(filename.rstrip(';'), default_ext=None)
                    if e:
                        return e

        return mimetype2ext(getheader('Content-Type'))

@dirkf
Copy link
Contributor Author

dirkf commented Oct 10, 2021

Thank you very much! I have tested it, it works, but the semicolon was forgotten. Either we use rstrip to remove the semicolon or one builds it yet into the Regex. (As a reminder, my Content-Disposition was: attachment; filename*=UTF-8''asdasd.docx; filename="asdasd.docx" )

Ah, I missed that. It's the non-recommended ordering with the extended syntax first. Instead of \S+ for the unquoted branch, we need [^\s;]+. I've put in another test for that.

@C0D3D3V
Copy link

C0D3D3V commented Oct 10, 2021

Works perfectly! Thank you very much! You are the best :)

nixxo pushed a commit to nixxo/yt-dlp that referenced this pull request Nov 22, 2021
test/test_utils.py Outdated Show resolved Hide resolved
test/test_utils.py Outdated Show resolved Hide resolved
dirkf added 18 commits March 11, 2024 16:06
Restore check omitted from extract_timezone(); adjust DATE_FORMATS_DAY/MONTH_FIRST; add tests.
Add support for unquoted token and RFC 5987 extended parameter syntax
Eg, in [1], the class with name 'plist-info' was found when searching for 'info'.

1. ytdl-org#30230
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants