Some URLs containing an apostrophe (') cause an internal error #745

Open
rrthomas opened this issue May 31, 2023 · 11 comments

@rrthomas

Summary

Pointing linkchecker at some URLs containing an ASCII apostrophe causes an internal error.

Steps to reproduce

linkchecker -Dall "https://boyde.ithaky.net/Mark's Gospel"

Actual result

Internal error.

Expected result

No error!

Environment

  • Operating system: Ubuntu 22.04.2 LTS
  • Linkchecker version: 10.2.1
  • Python version: 3.10.6
  • Install method: PyPI

Configuration

DEBUG linkcheck.cmdline 2023-05-31 11:52:45,354 MainThread Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
INFO linkcheck.cmdline 2023-05-31 11:52:45,360 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
DEBUG linkcheck.cmdline 2023-05-31 11:52:45,361 MainThread configuration: [('aborttimeout', 300),
 ('allowedschemes', []),
 ('authentication', []),
 ('checkextern', False),
 ('cookiefile', None),
 ('csv', {}),
 ('debugmemory', False),
 ('dot', {}),
 ('enabledplugins', []),
 ('externlinks', []),
 ('failures', {}),
 ('fileoutput', []),
 ('gml', {}),
 ('gxml', {}),
 ('html', {}),
 ('ignoreerrors', []),
 ('ignorewarnings',
  ['http-robots-denied', 'url-unnormed', 'http-moved-permanent']),
 ('internlinks', []),
 ('localwebroot', None),
 ('logger', 'NoneLogger'),
 ('loginextrafields', {}),
 ('loginpasswordfield', 'password'),
 ('loginurl', None),
 ('loginuserfield', 'login'),
 ('maxfilesizedownload', 5242880),
 ('maxfilesizeparse', 1048576),
 ('maxhttpredirects', 10),
 ('maxnumurls', None),
 ('maxrequestspersecond', 10),
 ('maxrunseconds', None),
 ('nntpserver', None),
 ('none', {}),
 ('output', 'text'),
 ('pluginfolders', []),
 ('quiet', False),
 ('recursionlevel', -1),
 ('resultcachesize', 100000),
 ('robotstxt', True),
 ('sitemap', {}),
 ('sql', {}),
 ('sslverify', '/etc/ssl/certs/ca-certificates.crt'),
 ('status', True),
 ('status_wait_seconds', 5),
 ('text', {}),
 ('threads', 10),
 ('timeout', 60),
 ('trace', False),
 ('useragent',
  'Mozilla/5.0 (compatible; LinkChecker/10.2.1; '
  '+https://linkchecker.github.io/linkchecker/)'),
 ('verbose', False),
 ('warnings', True),
 ('xml', {})]

Logs

$ linkchecker -Dall "https://boyde.ithaky.net/Mark's Gospel"
DEBUG linkcheck.cmdline 2023-05-31 11:55:09,962 MainThread Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
DEBUG linkcheck.check 2023-05-31 11:55:09,963 MainThread reading configuration from ['/home/rrt/.linkchecker/linkcheckerrc']
INFO linkcheck.cmdline 2023-05-31 11:55:09,970 MainThread Checking intern URLs only; use --check-extern to check extern URLs.
DEBUG linkcheck.cmdline 2023-05-31 11:55:09,970 MainThread configuration: [('aborttimeout', 300),
 ('allowedschemes', []),
 ('authentication', []),
 ('checkextern', False),
 ('cookiefile', None),
 ('csv', {}),
 ('debugmemory', False),
 ('dot', {}),
 ('enabledplugins', []),
 ('externlinks', []),
 ('failures', {}),
 ('fileoutput', []),
 ('gml', {}),
 ('gxml', {}),
 ('html', {}),
 ('ignoreerrors', []),
 ('ignorewarnings',
  ['http-robots-denied', 'url-unnormed', 'http-moved-permanent']),
 ('internlinks', []),
 ('localwebroot', None),
 ('logger', 'TextLogger'),
 ('loginextrafields', {}),
 ('loginpasswordfield', 'password'),
 ('loginurl', None),
 ('loginuserfield', 'login'),
 ('maxfilesizedownload', 5242880),
 ('maxfilesizeparse', 1048576),
 ('maxhttpredirects', 10),
 ('maxnumurls', None),
 ('maxrequestspersecond', 10),
 ('maxrunseconds', None),
 ('nntpserver', None),
 ('none', {}),
 ('output', 'text'),
 ('pluginfolders', []),
 ('quiet', False),
 ('recursionlevel', -1),
 ('resultcachesize', 100000),
 ('robotstxt', True),
 ('sitemap', {}),
 ('sql', {}),
 ('sslverify', '/etc/ssl/certs/ca-certificates.crt'),
 ('status', True),
 ('status_wait_seconds', 5),
 ('text', {}),
 ('threads', 10),
 ('timeout', 60),
 ('trace', False),
 ('useragent',
  'Mozilla/5.0 (compatible; LinkChecker/10.2.1; '
  '+https://linkchecker.github.io/linkchecker/)'),
 ('verbose', False),
 ('warnings', True),
 ('xml', {})]
DEBUG linkcheck.check 2023-05-31 11:55:09,971 MainThread HttpUrl handles url https://boyde.ithaky.net/Mark's Gospel
DEBUG linkcheck.check 2023-05-31 11:55:09,971 MainThread checking syntax
DEBUG linkcheck.check 2023-05-31 11:55:09,971 MainThread cache_url 'https://boyde.ithaky.net/Mark%27s%20Gospel'
DEBUG linkcheck.check 2023-05-31 11:55:09,971 MainThread Add intern pattern '^https?://(www\\.|)boyde\\.ithaky\\.net'
DEBUG linkcheck.check 2023-05-31 11:55:09,971 MainThread Link pattern '^https?://(www\\.|)boyde\\.ithaky\\.net' strict=False
DEBUG linkcheck.cache 2023-05-31 11:55:09,972 MainThread queueing https://boyde.ithaky.net/Mark%27s%20Gospel
LinkChecker 10.2.1
Copyright (C) 2000-2016 Bastian Kleineidam, 2010-2022 LinkChecker Authors
LinkChecker comes with ABSOLUTELY NO WARRANTY!
This is free software, and you are welcome to redistribute it under
certain conditions. Look at the file `LICENSE' within this distribution.
Read the documentation at https://linkchecker.github.io/linkchecker/
Write comments and bugs to https://github.com/linkchecker/linkchecker/issues

Start checking at 2023-05-31 11:55:09+001
DEBUG linkcheck.check 2023-05-31 11:55:09,973 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Checking https link
base_url="https://boyde.ithaky.net/Mark's Gospel"
parent_url=None
base_ref=None
recursion_level=0
url_connection=None
line=None
column=None
page=0
name=''
anchor=''
cache_url=https://boyde.ithaky.net/Mark%27s%20Gospel
DEBUG linkcheck.check 2023-05-31 11:55:09,974 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel checking connection
DEBUG linkcheck.check 2023-05-31 11:55:10,591 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel 'https://boyde.ithaky.net/robots.txt' allow all (HTTP error)
DEBUG linkcheck.check 2023-05-31 11:55:10,591 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel 'https://boyde.ithaky.net/robots.txt' check allowance for:
  user agent: 'Mozilla/5.0 (compatible; LinkChecker/10.2.1; +https://linkchecker.github.io/linkchecker/)'
  url: 'https://boyde.ithaky.net/Mark%27s%20Gospel' ...
DEBUG linkcheck.check 2023-05-31 11:55:10,591 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel  ... allow all.
DEBUG linkcheck.check 2023-05-31 11:55:10,591 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Prepare request with {'method': 'GET', 'url': 'https://boyde.ithaky.net/Mark%27s%20Gospel', 'headers': {}}
DEBUG linkcheck.check 2023-05-31 11:55:10,595 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Min wait time: 0.1 Max wait time: 0.6000000000000001 for host: boyde.ithaky.net
DEBUG linkcheck.check 2023-05-31 11:55:10,596 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Send request <PreparedRequest [GET]> with {'stream': True, 'timeout': 60, 'verify': '/etc/ssl/certs/ca-certificates.crt', 'allow_redirects': False}
DEBUG linkcheck.check 2023-05-31 11:55:10,596 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Request headers {'User-Agent': 'Mozilla/5.0 (compatible; LinkChecker/10.2.1; +https://linkchecker.github.io/linkchecker/)', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
DEBUG linkcheck.check 2023-05-31 11:55:10,618 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Response headers {'Date': 'Wed, 31 May 2023 10:55:10 GMT', 'Server': 'Apache/2.4.56 (Debian)', 'Location': "https://boyde.ithaky.net/Mark's%20Gospel/", 'Content-Length': '332', 'Keep-Alive': 'timeout=5, max=99', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=iso-8859-1'}
DEBUG linkcheck.check 2023-05-31 11:55:10,619 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Response encoding iso-8859-1
DEBUG linkcheck.check 2023-05-31 11:55:10,619 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel cannot extract SSL certificate from connection
DEBUG linkcheck.check 2023-05-31 11:55:10,619 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Got SSL certificate None
DEBUG linkcheck.check 2023-05-31 11:55:10,620 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel MIME type: text/html
DEBUG linkcheck.check 2023-05-31 11:55:10,620 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel follow all redirections
 1 thread active,     0 links queued,    0 links in   1 URL checked, runtime 1 seconds
DEBUG linkcheck.check 2023-05-31 11:55:11,284 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Redirected to "https://boyde.ithaky.net/Mark's%20Gospel/"
DEBUG linkcheck.check 2023-05-31 11:55:11,285 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Intern URL "https://boyde.ithaky.net/Mark's%20Gospel/"
DEBUG linkcheck.check 2023-05-31 11:55:11,285 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Got SSL certificate {'subject': ((('commonName', 'boyde.ithaky.net'),),), 'issuer': ((('countryName', 'US'),), (('organizationName', "Let's Encrypt"),), (('commonName', 'R3'),)), 'version': 3, 'serialNumber': '0407DDFEB9B70E62AAF1661B2538BBE56D1B', 'notBefore': 'Apr 21 18:12:52 2023 GMT', 'notAfter': 'Jul 20 18:12:51 2023 GMT', 'subjectAltName': (('DNS', 'boyde.ithaky.net'),), 'OCSP': ('http://r3.o.lencr.org',), 'caIssuers': ('http://r3.i.lencr.org/',)}
DEBUG linkcheck.check 2023-05-31 11:55:11,285 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel MIME type: text/html
DEBUG linkcheck.check 2023-05-31 11:55:11,285 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Redirected response headers {'Date': 'Wed, 31 May 2023 10:55:10 GMT', 'Server': 'Apache/2.4.56 (Debian)', 'Expires': 'Wed, 31 May 2023 10:55:11 GMT', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Content-Length': '1944', 'Keep-Alive': 'timeout=5, max=98', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=utf-8'}
DEBUG linkcheck.check 2023-05-31 11:55:11,286 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Redirected response encoding utf-8
DEBUG linkcheck.check 2023-05-31 11:55:11,286 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel MIME type: text/html
DEBUG linkcheck.check 2023-05-31 11:55:11,286 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel checking recursion of "https://boyde.ithaky.net/Mark's%20Gospel/" ...
DEBUG linkcheck.check 2023-05-31 11:55:11,286 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Get content of "https://boyde.ithaky.net/Mark's%20Gospel/"
DEBUG linkcheck.check 2023-05-31 11:55:11,311 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Beautiful Soup detected utf-8
DEBUG linkcheck.check 2023-05-31 11:55:11,311 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel Content encoding utf-8
DEBUG linkcheck.check 2023-05-31 11:55:11,312 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel ... yes, recursion.
DEBUG linkcheck.check 2023-05-31 11:55:11,312 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel LinkFinder tag html attrs {}
DEBUG linkcheck.check 2023-05-31 11:55:11,312 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel line 2 col 1
DEBUG linkcheck.check 2023-05-31 11:55:11,313 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel LinkFinder finished tag html
DEBUG linkcheck.check 2023-05-31 11:55:11,313 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel LinkFinder tag head attrs {}
DEBUG linkcheck.check 2023-05-31 11:55:11,313 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel line 3 col 3
DEBUG linkcheck.check 2023-05-31 11:55:11,313 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel LinkFinder finished tag head
DEBUG linkcheck.check 2023-05-31 11:55:11,313 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel LinkFinder tag meta attrs {'charset': 'utf-8'}
DEBUG linkcheck.check 2023-05-31 11:55:11,313 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel line 4 col 5
DEBUG linkcheck.check 2023-05-31 11:55:11,313 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel LinkFinder finished tag meta
DEBUG linkcheck.check 2023-05-31 11:55:11,313 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel LinkFinder tag meta attrs {'name': 'viewport', 'content': 'width=device-width, initial-scale=1, shrink-to-fit=no'}
DEBUG linkcheck.check 2023-05-31 11:55:11,313 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel line 5 col 5
DEBUG linkcheck.check 2023-05-31 11:55:11,314 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel LinkFinder finished tag meta
DEBUG linkcheck.check 2023-05-31 11:55:11,314 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel LinkFinder tag meta attrs {'http-equiv': 'x-ua-compatible', 'content': 'ie=edge'}
DEBUG linkcheck.check 2023-05-31 11:55:11,314 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel line 6 col 5
DEBUG linkcheck.check 2023-05-31 11:55:11,314 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel LinkFinder finished tag meta
DEBUG linkcheck.check 2023-05-31 11:55:11,314 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel LinkFinder tag link attrs {'href': 'https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css', 'rel': 'stylesheet', 'integrity': 'sha384-GLhlTQ8iRABdZLl6O3oVMWSktQOp6b7In1Zl3/Jr59b6EGGoI1aFkw7cmDA6j6gD', 'crossorigin': 'anonymous'}
DEBUG linkcheck.check 2023-05-31 11:55:11,314 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel line 9 col 5
DEBUG linkcheck.check 2023-05-31 11:55:11,314 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel HttpUrl handles url https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css
DEBUG linkcheck.cache 2023-05-31 11:55:11,314 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel task_done https://boyde.ithaky.net/Mark's%20Gospel/


********** Oops, I did it again. *************

You have found an internal error in LinkChecker. Please write a bug report
at https://github.com/linkchecker/linkchecker/issues
and include the following information:
- the URL or file you are testing
- the system information below

When using the commandline client:
- your commandline arguments and any custom configuration files.
- the output of a debug run with option "-Dall"

Not disclosing some of the information above due to privacy reasons is ok.
I will try to help you nonetheless, but you have to give me something
I can work with ;) .

Traceback (most recent call last):
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/director/checker.py", line 104, in check_url
    line: self.check_url_data(url_data)
    locals:
      self = <local> <Checker(CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel, started 139785810527808)>
      self.check_url_data = <local> <bound method Checker.check_url_data of <Checker(CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel, started 139785810527808)>>
      url_data = <local> <https link, base_url="https://boyde.ithaky.net/Mark's Gospel", parent_url=None, base_ref=None, recursion_level=0, url_connection=None, line=None, column=None, page=0, name='', anchor='', cache_url=https://boyde.ithaky.net/Mark%27s%20Gospel>
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/director/checker.py", line 116, in check_url_data
    line: check_url(url_data, self.logger)
    locals:
      check_url = <global> <function check_url at 0x7f226c32ab00>
      url_data = <local> <https link, base_url="https://boyde.ithaky.net/Mark's Gospel", parent_url=None, base_ref=None, recursion_level=0, url_connection=None, line=None, column=None, page=0, name='', anchor='', cache_url=https://boyde.ithaky.net/Mark%27s%20Gospel>
      self = <local> <Checker(CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel, started 139785810527808)>
      self.logger = <local> <linkcheck.director.logger.Logger object at 0x7f226c1c28c0>
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/director/checker.py", line 64, in check_url
    line: parser.parse_url(url_data)
    locals:
      parser = <global> <module 'linkcheck.parser' from '/home/rrt/.local/lib/python3.10/site-packages/linkcheck/parser/__init__.py'>
      parser.parse_url = <global> <function parse_url at 0x7f226c329c60>
      url_data = <local> <https link, base_url="https://boyde.ithaky.net/Mark's Gospel", parent_url=None, base_ref=None, recursion_level=0, url_connection=None, line=None, column=None, page=0, name='', anchor='', cache_url=https://boyde.ithaky.net/Mark%27s%20Gospel>
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/parser/__init__.py", line 43, in parse_url
    line: globals()[funcname](url_data)
    locals:
      globals = <builtin> <built-in function globals>
      funcname = <local> 'parse_html', len = 10
      url_data = <local> <https link, base_url="https://boyde.ithaky.net/Mark's Gospel", parent_url=None, base_ref=None, recursion_level=0, url_connection=None, line=None, column=None, page=0, name='', anchor='', cache_url=https://boyde.ithaky.net/Mark%27s%20Gospel>
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/parser/__init__.py", line 52, in parse_html
    line: linkparse.find_links(url_data.get_soup(), url_data.add_url, linkparse.LinkTags)
    locals:
      linkparse = <global> <module 'linkcheck.htmlutil.linkparse' from '/home/rrt/.local/lib/python3.10/site-packages/linkcheck/htmlutil/linkparse.py'>
      linkparse.find_links = <global> <function find_links at 0x7f226c32a200>
      url_data = <local> <https link, base_url="https://boyde.ithaky.net/Mark's Gospel", parent_url=None, base_ref=None, recursion_level=0, url_connection=None, line=None, column=None, page=0, name='', anchor='', cache_url=https://boyde.ithaky.net/Mark%27s%20Gospel>
      url_data.get_soup = <local> <bound method UrlBase.get_soup of <https link, base_url="https://boyde.ithaky.net/Mark's Gospel", parent_url=None, base_ref=None, recursion_level=0, url_connection=None, line=None, column=None, page=0, name='', anchor='', cache_url=https://boyde.ithaky.net/Mark%27s%20Gospel>>
      url_data.add_url = <local> <bound method UrlBase.add_url of <https link, base_url="https://boyde.ithaky.net/Mark's Gospel", parent_url=None, base_ref=None, recursion_level=0, url_connection=None, line=None, column=None, page=0, name='', anchor='', cache_url=https://boyde.ithaky.net/Mark%27s%20Gospel>>
      linkparse.LinkTags = <global> {'a': ['href'], 'applet': ['archive', 'src'], 'area': ['href'], 'audio': ['src'], 'bgsound': ['src'], 'blockquote': ['cite'], 'body': ['background'], 'button': ['formaction'], 'del': ['cite'], 'embed': ['pluginspage', 'src'], 'form': ['action'], 'frame': ['src', 'longdesc'], 'head': ['profile'], ..., len = 35
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/htmlutil/linkparse.py", line 223, in find_links
    line: lf.html_element(
    locals:
      lf = <local> <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7f226cc2cf40>
      lf.html_element = <local> <bound method LinkFinder.html_element of <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7f226cc2cf40>>
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/htmlutil/linkparse.py", line 168, in html_element
    line: self.parse_tag(tag, attr, value, name, base, lineno, column)
    locals:
      self = <local> <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7f226cc2cf40>
      self.parse_tag = <local> <bound method LinkFinder.parse_tag of <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7f226cc2cf40>>
      tag = <local> 'link'
      attr = <local> 'href'
      value = <local> 'https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css', len = 78
      name = <local> ''
      base = <local> ''
      lineno = <local> 9
      column = <local> 5
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/htmlutil/linkparse.py", line 209, in parse_tag
    line: self.found_url(value, name, base, lineno, column)
    locals:
      self = <local> <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7f226cc2cf40>
      self.found_url = <local> <bound method LinkFinder.found_url of <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7f226cc2cf40>>
      value = <local> 'https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css', len = 78
      name = <local> ''
      base = <local> ''
      lineno = <local> 9
      column = <local> 5
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/htmlutil/linkparse.py", line 214, in found_url
    line: self.callback(url, line=lineno, column=column, name=name, base=base)
    locals:
      self = <local> <linkcheck.htmlutil.linkparse.LinkFinder object at 0x7f226cc2cf40>
      self.callback = <local> <bound method UrlBase.add_url of <https link, base_url="https://boyde.ithaky.net/Mark's Gospel", parent_url=None, base_ref=None, recursion_level=0, url_connection=None, line=None, column=None, page=0, name='', anchor='', cache_url=https://boyde.ithaky.net/Mark%27s%20Gospel>>
      url = <local> 'https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css', len = 78
      line = <not found>
      lineno = <local> 9
      column = <local> 5
      name = <local> ''
      base = <local> ''
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/checker/urlbase.py", line 801, in add_url
    line: url_data = get_url_from(
    locals:
      url_data = <not found>
      get_url_from = <global> <function get_url_from at 0x7f226ff07e20>
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/checker/__init__.py", line 138, in get_url_from
    line: return klass(
    locals:
      klass = <local> <class 'linkcheck.checker.httpurl.HttpUrl'>
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/checker/urlbase.py", line 142, in __init__
    line: self.init(
    locals:
      self = <local> <None link, base_url='https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css', parent_url="https://boyde.ithaky.net/Mark's%20Gospel/", base_ref=None, recursion_level=1, url_connection=None, line=9, column=5, page=0, name='', anchor=None, cache_url=None>
      self.init = <local> <bound method UrlBase.init of <None link, base_url='https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css', parent_url="https://boyde.ithaky.net/Mark's%20Gospel/", base_ref=None, recursion_level=1, url_connection=None, line=9, column=5, page=0, name='', anchor=None, cache...
  File "/home/rrt/.local/lib/python3.10/site-packages/linkcheck/checker/urlbase.py", line 206, in init
    line: assert not urlutil.url_needs_quoting(self.parent_url), (
    locals:
      urlutil = <global> <module 'linkcheck.url' from '/home/rrt/.local/lib/python3.10/site-packages/linkcheck/url.py'>
      urlutil.url_needs_quoting = <global> <function url_needs_quoting at 0x7f226ff07c70>
      self = <local> <None link, base_url='https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css', parent_url="https://boyde.ithaky.net/Mark's%20Gospel/", base_ref=None, recursion_level=1, url_connection=None, line=9, column=5, page=0, name='', anchor=None, cache_url=None>
      self.parent_url = <local> "https://boyde.ithaky.net/Mark's%20Gospel/", len = 41
AssertionError: unquoted parent URL "https://boyde.ithaky.net/Mark's%20Gospel/"
System info:
LinkChecker 10.2.1
Released on: 2022-12-05
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Modules: Beautiful Soup, dnspython, Requests, Argcomplete, sqlite3, SQLite
Local time: 2023-05-31 11:55:11+001
sys.argv: ['/home/rrt/.local/bin/linkchecker', '-Dall', "https://boyde.ithaky.net/Mark's Gospel"]
LANGUAGE = 'en_GB'
LANG = 'en_GB.UTF-8'
Default locale: ('en', 'UTF-8')

 ******** LinkChecker internal error, over and out ********
WARNING linkcheck.check 2023-05-31 11:55:11,319 CheckThread-https://boyde.ithaky.net/Mark%27s%20Gospel internal error occurred

Statistics:
Downloaded: 5KB.
Content types: 0 image, 1 text, 0 video, 0 audio, 0 application, 0 mail and 0 other.
URL lengths: min=41, max=41, avg=41.

That's it. 1 link in 2 URLs checked. 0 warnings found. 0 errors found.
There was 1 internal error.
Stopped checking at 2023-05-31 11:55:11+001 (2 seconds)

Other notes

This site contains lots of very similar pages, and this is the only one with an apostrophe in its name, so the apostrophe would seem to be the cause of the problem. As far as I can tell, apostrophes do not need to be escaped in URIs, so my CMS doesn't escape them, and indeed the web server and browsers seem to be quite happy with that.

linkchecker is also happy with other URLs I give it that contain apostrophes, so I'm not exactly sure why it goes wrong in this case, except that it's a directory URL not a file URL.
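
For illustration, here is a minimal sketch of how a percent-encoder can treat the apostrophe. It uses Python's standard `urllib.parse.quote`, not LinkChecker's own quoting code, so it only shows the general behaviour. With default arguments `quote` escapes `'` as `%27`, which is the form seen in `cache_url` in the log above; listing the apostrophe in `safe` leaves it alone, which is the form the server sends in its `Location` header.

```python
from urllib.parse import quote

path = "/Mark's Gospel"

# Default: only letters, digits, "_.-~" and "/" are left unescaped,
# so both the apostrophe and the space are percent-encoded.
escaped = quote(path)

# Declare the apostrophe safe as well; only the space is encoded.
apostrophe_kept = quote(path, safe="/'")

print(escaped)
print(apostrophe_kept)
```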

Thanks for linkchecker!

@rrthomas
Author

rrthomas commented May 31, 2023

If I escape the URL on the command line (adding %20 and %27 in the relevant places), I still get the internal error, so I presume the problem is caused by the fact that the page as served contains an unescaped apostrophe.

@cjmayo
Contributor

cjmayo commented Jun 1, 2023

You could try adding ' to _nopathquote_chars in your local copy:
https://github.com/linkchecker/linkchecker/blob/4d9a39089d579208e0e073fc26e3ba8a24489e4f/linkcheck/url.py#LL402C1-L402C19

I think any permanent change would need some careful investigation and more tests, not least because url_needs_quoting() is called from UrlBase and potentially used for more than just HTTP.
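
To make the suggestion concrete, here is a rough sketch of how a "safe character" check of this kind can work and why adding `'` to the allowed set changes the result. It is a simplified stand-in written for this thread: `BASE_SAFE_PUNCTUATION` and `make_needs_quoting` are hypothetical names, and the character set is illustrative rather than copied from `linkcheck/url.py`.

```python
import re

# Illustrative punctuation allowed unquoted. This is an assumption for the
# sketch; the real _nopathquote_chars in linkcheck/url.py may differ.
BASE_SAFE_PUNCTUATION = "-;/=,~*+()@!_:.&#%?[]"


def make_needs_quoting(extra_safe=""):
    """Build a checker that flags URLs containing characters outside the safe set."""
    safe = re.escape(BASE_SAFE_PUNCTUATION + extra_safe) + "a-zA-Z0-9"
    safe_re = re.compile(r"[%s]*" % safe)

    def needs_quoting(url):
        # True when at least one character is not in the safe set.
        return safe_re.fullmatch(url) is None

    return needs_quoting


redirect_target = "https://boyde.ithaky.net/Mark's%20Gospel/"

strict = make_needs_quoting()
relaxed = make_needs_quoting(extra_safe="'")

print(strict(redirect_target))   # apostrophe is outside the safe set
print(relaxed(redirect_target))  # apostrophe is now accepted
```

In this sketch the redirect target from the log fails the strict check and passes the relaxed one, which mirrors the effect of the local patch described above.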

@rrthomas
Author

rrthomas commented Jun 2, 2023

Thanks very much for the suggestion, which fixes things for me. I sympathise with this being a delicate issue to fix, so if you agree that at least it's not obvious that an unescaped apostrophe is illegal in a URL, perhaps it might be possible to add a (hidden?) configuration option so that it's not necessary to edit the source to work around this problem?

@rrthomas
Author

rrthomas commented Jun 2, 2023

I just had a quick look into this. Forgive me if I'm telling you things you already know!

RFC 3986 seems to be relevant here. There's a nice summary. ' is a "reserved character", for example, and the suggestion is that it should be escaped if it's not being used as a delimiter (so, I should be escaping it in my context). However, linkchecker's _nopathquote_chars doesn't seem to correspond to any particular list in this RFC, or indeed any combination that I can see.
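
For reference, the character classes from RFC 3986 (sections 2.2, 2.3 and 3.3) can be written out as plain Python sets. This is only a sketch for checking which class a character falls into; the names are chosen here and are not part of LinkChecker.

```python
import string

# RFC 3986 section 2.3
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# RFC 3986 section 2.2
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
RESERVED = GEN_DELIMS | SUB_DELIMS

# RFC 3986 section 3.3:
#   pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# These are the characters that may appear literally in a path segment
# (percent-encoded triplets are handled separately).
PCHAR_LITERALS = UNRESERVED | SUB_DELIMS | set(":@")

for ch in ("'", " ", "~"):
    print(repr(ch), "reserved:", ch in RESERVED, "allowed in path:", ch in PCHAR_LITERALS)
```

By these definitions the apostrophe is a reserved sub-delimiter that the grammar nevertheless permits unescaped in a path segment, whereas a space is not permitted at all.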

@cjmayo
Contributor

cjmayo commented Jun 5, 2023

Indeed the reason for the contents of _nopathquote_chars is exactly the sort of thing that needs investigating.

I think there is a workaround for now without any changes: in my attempt to recreate the problem, adding a trailing slash to the URL avoids the exception, although the link is then seen as outside the domain filter. It looks like you aren't using --check-extern (there are some obsolete ignorewarnings, though), so instead I suggest setting internlinks in linkcheckerrc to match your site:

internlinks=REGEX

Regular expression to add more URLs recognized as internal links. Default is that URLs given on the command line are internal. Command line option: none

https://linkchecker.github.io/linkchecker/man/linkcheckerrc.html#filtering

@rrthomas
Author

rrthomas commented Jun 5, 2023

Thanks for the workaround. I don't quite understand why it works: I'm indeed not using --check-extern, so how come changing (indeed, enlarging) the set of URLs considered to be internal avoids the problem?

@cjmayo
Contributor

cjmayo commented Jun 6, 2023

For syntax checking, every link is broken down and put back together. Somehow, because of the apostrophe, the reconstituted link doesn't appear to LinkChecker to be a child of the URL that was passed, i.e. it is not treated as internal. Setting internlinks overrides that.
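
A rough sketch of what "broken down and put back together" can look like, using only Python's standard `urllib.parse`. This is not LinkChecker's actual normalisation code, and `renormalise` is a hypothetical helper; it just shows how a split, re-quote and rejoin cycle can produce a URL that no longer shares a prefix with the form the server served.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def renormalise(url):
    """Split a URL, re-quote its path with default rules, and rebuild it."""
    parts = urlsplit(url)
    # Undo existing escapes, then escape again with quote()'s defaults,
    # which percent-encode the apostrophe.
    path = quote(unquote(parts.path))
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


served = "https://boyde.ithaky.net/Mark's%20Gospel/"
rebuilt = renormalise(served)

print(rebuilt)
print(rebuilt.startswith(served))
```

Here the rebuilt URL contains `%27` where the served URL contains a literal apostrophe, so a simple prefix comparison between the two fails.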

@rrthomas
Author

rrthomas commented Jun 6, 2023

Thanks for the explanation!

@rrthomas
Author

rrthomas commented Sep 13, 2023

I just realised that I hadn't actually tried your workaround on an unpatched version of linkchecker. It doesn't seem to work.

I am using the following linkcheckerrc:

[filtering]
internlinks=^https://boyde\.ithaky\.net/.*

Then when I run: linkchecker --config=linkcheckerrc "https://boyde.ithaky.net/Mark's Gospel"

I get the same error as before.
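
The `internlinks` pattern itself can be sanity-checked outside LinkChecker against the URL forms that appear in the log. This sketch assumes the pattern is applied with Python's `re.search`; how LinkChecker applies it internally is not confirmed here.

```python
import re

# Same regular expression as in the linkcheckerrc above.
internlinks = re.compile(r"^https://boyde\.ithaky\.net/.*")

urls = [
    "https://boyde.ithaky.net/Mark%27s%20Gospel",   # cache_url form
    "https://boyde.ithaky.net/Mark's%20Gospel/",    # redirect target
    "https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha1/dist/css/bootstrap.min.css",
]

matches = {url: bool(internlinks.search(url)) for url in urls}

for url, is_internal in matches.items():
    print(is_internal, url)
```

Both forms of the site URL match, so the pattern does not distinguish between the quoted and unquoted apostrophe.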

@cjmayo
Contributor

cjmayo commented Sep 13, 2023

I haven't tried it again, but reading #745 (comment) again, there were two parts: internlinks and a trailing slash. Hopefully this works:

linkchecker --config=linkcheckerrc "https://boyde.ithaky.net/Mark's Gospel/"

@rrthomas
Author

rrthomas commented Sep 13, 2023

Thanks, I hadn't understood that adding the trailing slash was needed on top of the internlinks setting. However, running with that setting still gives me internal errors.
