URL extraction does not capture entire path and query if "www" is omitted. #346

danielittlewood0 · 2021-05-11T14:18:11Z

Describe the bug
URL extraction appears to truncate URLs that omit "www." and have at least one path segment: the full path and query string are not captured. As far as I can tell, only the first path segment is matched, and none of the query parameters.

To Reproduce

JS examples

I ran the following examples in the Chrome DevTools console of an app on a Rails stack. The app uses the twitter-text-js-rails wrapper gem at v1.9.1, which claims to use v1.9.1 of twitter-text. I don't have a convenient environment in which to upgrade the JS version.

Does not behave as expected:

twttr.txt.extractUrls("https://t.co/a?amp=1")
> ["https://t.co/a"]
twttr.txt.extractUrls("https://t.co/a/b?amp=1")
> ["https://t.co/a"]
twttr.txt.extractUrls("https://t.co/a/b")
> ["https://t.co/a"]

Behaves as expected:

twttr.txt.extractUrls("https://www.t.co?amp=1")
> ["https://www.t.co?amp=1"]
twttr.txt.extractUrls("https://www.t.co/a/b?amp=1")
> ["https://www.t.co/a/b?amp=1"]
twttr.txt.extractUrls("https://www.t.co/a/b")
> ["https://www.t.co/a/b"]
twttr.txt.extractUrls("https://t.co?amp=1")
> ["https://t.co?amp=1"]

Ruby examples

I also reproduced this on version 3.1.0 of the twitter-text gem, which I think is the latest version.

Does not behave as expected:

irb(main):002:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a?amp=1")
=> ["https://t.co/a"]
irb(main):004:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a/b?amp=1")
=> ["https://t.co/a"]
irb(main):005:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a/b")
=> ["https://t.co/a"]

Expected behavior
I think the string argument is a complete URL in every case, so I would expect the return value to be an array containing the whole string.

Environment
Apart from the details above, I'm running Ubuntu.


danielittlewood0 · 2021-05-11T14:41:04Z

I took a closer look at the source code for extractUrls. I didn't realise that t.co URLs were a special case! In particular, this comment seems to imply that the behaviour I'm describing is intentional.

      // In the case of t.co URLs, don't allow additional path characters.
      if (url.match(validTcoUrl)) {
      ...

Indeed,

twttr.txt.extractUrls("https://t.co/abc?amp=1")
> ["https://t.co/abc"]
twttr.txt.extractUrls("https://g.co/abc?amp=1")
> ["https://g.co/abc?amp=1"]
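
To make the effect concrete, here is a minimal sketch of a special case that would produce the results I'm seeing. The names assumedTcoUrl and applyTcoSpecialCase are hypothetical, and the pattern is a simplified stand-in I wrote for illustration, not the library's actual validTcoUrl regex.

```javascript
// Hypothetical stand-in for the library's t.co pattern; the real regex may differ.
// It matches the scheme, the t.co host, and one alphanumeric path segment.
const assumedTcoUrl = /^https?:\/\/t\.co\/[a-z0-9]+/i;

// Hypothetical helper: takes a URL that was already matched in full and
// cuts it back to the t.co prefix when the special case applies.
function applyTcoSpecialCase(url) {
  const match = url.match(assumedTcoUrl);
  return match ? match[0] : url;
}

console.log(applyTcoSpecialCase("https://t.co/abc?amp=1"));     // "https://t.co/abc"
console.log(applyTcoSpecialCase("https://t.co/a/b?amp=1"));     // "https://t.co/a"
console.log(applyTcoSpecialCase("https://g.co/abc?amp=1"));     // "https://g.co/abc?amp=1"
console.log(applyTcoSpecialCase("https://www.t.co/a/b?amp=1")); // "https://www.t.co/a/b?amp=1"
```

Under this assumption, anything after the first path segment of a t.co URL is dropped, while www.t.co and other hosts never hit the special case. That would explain why adding "www." seemed to fix the problem.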

The issue arose from users copying links from Twitter and pasting them onto our platform. The copied link carries an extra ?amp=1 query string, so when somebody pastes https://t.co/LY1EMFy7TW?amp=1 to us, the resulting HTML looks like <a>https://t.co/LY1EMFy7TW</a>?amp=1, with the query string left outside the link.

danielittlewood0 · 2021-05-11T15:01:56Z

This is not a very nice solution, but I found that if I set

(ruby)

Twitter::TwitterText::Regex::REGEXEN[:valid_tco_url] = /$^/

(js)

twttr.txt.regexen.validTcoUrl = /$^/;

then I get the behaviour I expect.
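
For the JS side, here is a small sketch of why I believe this works. It assumes the extractor only truncates when the URL matches the t.co pattern, and truncateIfMatches is a hypothetical helper of my own, not a library function. The pattern /$^/ asks for the end of the input to come before the start, which no non-empty string satisfies, so the special case never fires.

```javascript
// The workaround pattern: end-of-input immediately followed by start-of-input.
const neverMatches = /$^/;

// Hypothetical helper mimicking a "truncate only if the pattern matches" branch.
function truncateIfMatches(url, pattern) {
  const match = url.match(pattern);
  return match ? match[0] : url;
}

const pastedUrl = "https://t.co/LY1EMFy7TW?amp=1";

console.log(neverMatches.test(pastedUrl));               // false
console.log(truncateIfMatches(pastedUrl, neverMatches)); // "https://t.co/LY1EMFy7TW?amp=1"

// The empty string is the one input this pattern does match in JavaScript.
console.log(neverMatches.test(""));                      // true
```

The empty-string match should not matter here, since an extracted URL is never empty. I have only checked this reasoning for JavaScript; Ruby treats ^ and $ as line anchors, so I wouldn't assume the same argument carries over unchanged.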
