a_b_c.domain.com — Neither domain, nor publicSuffix? (but valid) #73

Open
ikari-pl opened this issue Dec 7, 2015 · 12 comments

@ikari-pl

ikari-pl commented Dec 7, 2015

The URL http://wsc4_1.webspectator.com/ returns null for both getDomain and getPublicSuffix. I can't find webspectator.com on the Public Suffix List, so I assume the correct result would be webspectator.com for the domain and com for the public suffix.

Demo:

var tld = require('tldjs');
tld.getDomain('http://wsc4_1.webspectator.com/'); // null
tld.getDomain('wsc4_1.webspectator.com'); // null
tld.getPublicSuffix('http://wsc4_1.webspectator.com/'); // null
tld.isValid('http://wsc4_1.webspectator.com/'); // true

but:

> tld.getDomain('wsc41.webspectator.com')
'webspectator.com'

So it seems it's all about the _ character.
See:

> tld.getDomain('a_b.google.com')
null
> tld.getDomain('a-b.google.com')
'google.com'
@ZLightning

Technically, hostnames containing an underscore are not RFC compliant (only A-Z, a-z, 0-9, -, and . are allowed). However, a newer RFC notes that a DNS server can be used to serve arbitrary data, and that no DNS server should refuse to load a zone that contains invalid characters in hostnames.

@thom4parisot
Owner

Yes, it is indeed tied to the _ character.

@ZLightning do you have a link to the newer RFC?

One possibility would be an optional strict mode (disabled by default, I guess) so that domains and the like are extracted properly. For cookie creation, we might want to stick to the RFC-compliant mode, but that's something to discuss later on.

What do you think, folks?

@ZLightning

RFC 2181 is only a proposed standard, but I have confirmed that subdomains containing an _ still resolve. I think having both a strict and a sloppy mode would be a great feature. Defaulting to strict is a good idea for backwards compatibility.
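To make the strict/sloppy idea concrete, here is a minimal sketch of what such a switch could look like. It is not tldjs code: the function name `isValidHostname`, the `strict` option, and both regular expressions are hypothetical and only illustrate the proposal (strict accepts letters, digits, and hyphens; lenient additionally accepts underscores).

```javascript
// Hypothetical helper, NOT part of tldjs: validates a hostname label by label.
// Strict: letters, digits, and hyphens only, no leading or trailing hyphen.
const STRICT_LABEL = /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/i;
// Lenient: same rule, but "_" is also accepted anywhere in the label.
const LENIENT_LABEL = /^[a-z0-9_]([a-z0-9_-]*[a-z0-9_])?$/i;

function isValidHostname(hostname, options = {}) {
  // Strict unless the caller explicitly opts out (assumed default).
  const strict = options.strict !== false;
  if (typeof hostname !== 'string' || hostname.length === 0 || hostname.length > 253) {
    return false;
  }
  const pattern = strict ? STRICT_LABEL : LENIENT_LABEL;
  // Every dot-separated label must be 1 to 63 characters and match the pattern.
  return hostname.split('.').every(
    (label) => label.length <= 63 && pattern.test(label)
  );
}

console.log(isValidHostname('a-b.google.com'));                    // true
console.log(isValidHostname('a_b.google.com'));                    // false
console.log(isValidHostname('a_b.google.com', { strict: false })); // true
```

Keeping strict as the default in this sketch mirrors the backwards-compatibility argument above; callers who parse real-world logs would opt in to the lenient mode.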

@nebulade

nebulade commented Nov 1, 2017

Is there any update on this? Unfortunately, I just hit it as well.

@LesBarstow

LesBarstow commented Apr 26, 2018

Note if anyone's still following this: HOSTNAMES cannot contain underscores, but other DNS entries can. e.g. _spf.google.com is a valid DNS name.

$ dig +short TXT _spf.google.com
"v=spf1 include:_netblocks.google.com include:_netblocks2.google.com include:_netblocks3.google.com ~all"

From RFC 2181, section 11: "The DNS itself places only one restriction on the particular labels that can be used to identify resource records. That one restriction relates to the length of the label and the full name. [...] Implementations of the DNS protocols must not place any restrictions on the labels that can be used. In particular, DNS servers must not refuse to serve a zone because it contains labels that might not be acceptable to some DNS client programs."

AFAIK, no registrar allows you to register a domain under a TLD with an underscore, but technically that too is allowed.
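As an illustration of the "length only" rule quoted above, the following sketch checks nothing but label and name length (labels of 1 to 63 octets, at most 255 octets for the full name in wire format). The helper name `isValidDnsName` is hypothetical, and measuring labels as UTF-8 bytes is an assumption made for this example.

```javascript
// Hypothetical helper: applies only the length limits the DNS itself imposes.
// It deliberately places no restriction on which characters a label contains.
function isValidDnsName(name) {
  if (typeof name !== 'string' || name.length === 0) return false;
  // Drop one optional trailing dot (the root label).
  const trimmed = name.endsWith('.') ? name.slice(0, -1) : name;
  // The bare root name "." is not handled by this sketch.
  if (trimmed.length === 0) return false;

  let wireLength = 1; // the terminating root label takes one octet
  for (const label of trimmed.split('.')) {
    const octets = Buffer.byteLength(label, 'utf8');
    if (octets < 1 || octets > 63) return false;
    wireLength += 1 + octets; // one length octet plus the label itself
  }
  return wireLength <= 255;
}

console.log(isValidDnsName('_spf.google.com'));        // true
console.log(isValidDnsName('a'.repeat(64) + '.com'));  // false (label too long)
```

Under this rule `_spf.google.com` and `wsc4_1.webspectator.com` are both fine; only the stricter hostname rules reject the underscore.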

@thom4parisot
Owner

@LesBarstow I find your comment valuable, but I had not considered hostnames in the context of DNS entries.

There is a proposal in issue #122 to be either strict or lenient on hostnames with underscores.

Do you think it will address what you mention?

@LesBarstow

My personal opinion: the only calls that should care about character restrictions (aside from length) are isValidHostname() and the isValid property returned by parse(). We use both tldExists() and getDomain(), and those shouldn't care, ever.

For isValidHostname() and parse().isValid: FWIW, the defaults in PHP filtering and Perl Net regex patterns are both lenient, with options for strict. This matches the DNS RFC itself - no restrictions except for proper hostnames, which are limited by RFCs 952 and 1123.

Just my two cents.

@LesBarstow

Alternately, the code could care about the validity of the publicSuffix in a strict form while the rest of the domain name would be lenient. (No registrar registers domains with an underscore as they can't be used for hostnames at all...) This is more annoying, though, because if someone does want to be lenient on the publicSuffix, you now need two flags to cover three modes: reallyStrict, default, and reallyLenient.
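A rough sketch of that "strict suffix, lenient rest" idea follows. Everything here is hypothetical: the function name `isValidMixed` is made up, and because the sketch does not consult the Public Suffix List, the caller has to say how many trailing labels form the public suffix (`suffixLabelCount`, defaulting to 1). Length limits are left out to keep it short.

```javascript
// Hypothetical sketch: strict rules for the public suffix labels,
// lenient rules (underscore allowed) for everything to the left of them.
const STRICT_LABEL = /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/i;
const LENIENT_LABEL = /^[a-z0-9_]([a-z0-9_-]*[a-z0-9_])?$/i;

function isValidMixed(hostname, suffixLabelCount = 1) {
  const labels = hostname.split('.');
  if (labels.length < suffixLabelCount) return false;
  // Index of the first label that belongs to the public suffix.
  const split = labels.length - suffixLabelCount;
  const restOk = labels.slice(0, split).every((l) => LENIENT_LABEL.test(l));
  const suffixOk = labels.slice(split).every((l) => STRICT_LABEL.test(l));
  return restOk && suffixOk;
}

console.log(isValidMixed('wsc4_1.webspectator.com')); // true
console.log(isValidMixed('example.c_m'));             // false
console.log(isValidMixed('a_b.co.uk', 2));            // true
```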

@remusao
Collaborator

remusao commented May 2, 2018

Hi @LesBarstow and thanks for the great feedback! It's really interesting to get another perspective. I would like to add the following, which is just my opinion on the matter. Currently isValid is used for two different purposes internally:

  1. It's used to quickly check if the input to any of the functions is already a valid hostname, in which case we can skip the expensive parsing step. In this case, we could probably use the lenient version of isValid.
  2. It is used to indicate if the input is a valid URL/hostname through the two functions exposed as part of the public API: parse and isValid.

So perhaps we could use the lenient mode for 1. (as an internal optimization) and, for 2., accept an extra parameter with options controlling the behavior of isValid.
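The internal fast path described in point 1 could be sketched roughly as below. This is not how tldjs is implemented: `extractHostname` and `looksLikeHostname` are hypothetical names, and the WHATWG `URL` class built into Node.js stands in for the library's own, more expensive parsing step.

```javascript
// Lenient label rule (underscore allowed), used only for the internal shortcut.
const LENIENT_LABEL = /^[a-z0-9_]([a-z0-9_-]*[a-z0-9_])?$/i;

// Hypothetical helper: true if the input already looks like a bare hostname.
function looksLikeHostname(input) {
  return input.split('.').every((label) => LENIENT_LABEL.test(label));
}

// Hypothetical helper: returns the hostname, or null if none can be extracted.
function extractHostname(input) {
  if (typeof input !== 'string') return null;
  // Fast path: nothing to parse, the input is already a hostname.
  if (looksLikeHostname(input)) return input.toLowerCase();
  // Slow path: fall back to full URL parsing (assumed stand-in for tldjs).
  try {
    return new URL(input).hostname;
  } catch (err) {
    return null;
  }
}

console.log(extractHostname('wsc4_1.webspectator.com'));         // 'wsc4_1.webspectator.com'
console.log(extractHostname('http://wsc4_1.webspectator.com/')); // 'wsc4_1.webspectator.com'
```

Because the shortcut is lenient, an input such as `wsc4_1.webspectator.com` never reaches the slow path; whether the result is then reported as valid would be a separate decision made by the public `isValid` options.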

Last but not least, we had similar discussions in the past regarding hostname parsing (which is hard, and different libraries behave differently). In the end, we made the opinionated choice of using a specific module but gave users of the library the flexibility to provide their own parsing logic. In a way, tldjs is not about validating URLs/hostnames. So maybe it is OK to pick one option (let's say we always validate hostnames in a lenient way), and let users who need it validate hostnames in more elaborate ways depending on their use case.

As was pointed out, tldjs could only care about validating the public suffix part, since it's what the library is about.

We can of course recommend/suggest other libraries which can be used along-side tld.js to do this validation.

@7c

7c commented Sep 10, 2018

Hi, I am using the parse() function with real-world URLs from Squid logs to determine domain names. I understand that this repo is all about the Public Suffix List, but look at this real-world example:

console.log(parse('http://spons_700.spns.nrb-apps.com/ajax/footpanel_process.php'));
{ hostname: 'spons_700.spns.nrb-apps.com',
  isValid: false,
  isIp: false,
  tldExists: false,
  publicSuffix: null,
  domain: null,
  subdomain: null }

console.log(parse('http://spons700.spns.nrb-apps.com/ajax/footpanel_process.php'));
{ hostname: 'spons700.spns.nrb-apps.com',
  isValid: true,
  isIp: false,
  tldExists: true,
  publicSuffix: 'com',
  domain: 'nrb-apps.com',
  subdomain: 'spons700.spns' }

Many bigger providers have _ in their hostnames, and if the purpose of parse() is to determine the publicSuffix, then this function fails on real-world URLs.

@remusao
Collaborator

remusao commented Sep 23, 2018

Hi @taskinosman, thank you for your input. A few weeks ago I proposed a solution in the form of an option to enable a "lenient mode" for hostname validation in PR #122, but unfortunately the PR has not been reviewed or merged yet. In the meantime, I forked and published tldts, which is based on tld.js (rewritten in TypeScript, with a few other modifications) and ships a different set of defaults, including more permissive hostname validation. Maybe this would solve your problem? Don't hesitate to give me any feedback on it.

@7c

7c commented Sep 24, 2018

Thanks, and sorry, I should have seen #122. I have commented on that one.


7 participants