Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL extraction does not capture entire path and query if "www" is omitted. #346

Open
danielittlewood0 opened this issue May 11, 2021 · 2 comments

Comments

@danielittlewood0
Copy link

Describe the bug
In URL extraction, I believe URLs missing "www." and with at least one path segment do not correctly recognise the full path and query string. As far as I can tell only the first path segment is matched, and none of the query parameters.

To Reproduce

JS examples

The following examples were checked in the chrome devtools console, running a Rails stack. It uses the twitter-text-js-rails wrapper gem at v1.9.1, so that claims to be using v1.9.1 of twitter-text. I don't have a convenient environment that can upgrade the JS version.

  • Does not behave as expected:
twttr.txt.extractUrls("https://t.co/a?amp=1")
> ["https://t.co/a"]
twttr.txt.extractUrls("https://t.co/a/b?amp=1")
> ["https://t.co/a"]
twttr.txt.extractUrls("https://t.co/a/b")
> ["https://t.co/a"]
  • Behaves as expected:
twttr.txt.extractUrls("https://www.t.co?amp=1")
> ["https://www.t.co?amp=1"]
twttr.txt.extractUrls("https://www.t.co/a/b?amp=1")
> ["https://www.t.co/a/b?amp=1"]
twttr.txt.extractUrls("https://www.t.co/a/b")
> ["https://www.t.co/a/b"]
twttr.txt.extractUrls("https://t.co?amp=1")
> ["https://t.co?amp=1"]

Ruby examples

I also reproduced this on the 3.1.0 version of the twitter-text gem, which I think is the latest version.

  • Does not behave as expected:
irb(main):002:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a?amp=1")
=> ["https://t.co/a"]
irb(main):004:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a/b?amp=1")
=> ["https://t.co/a"]
irb(main):005:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a/b")
=> ["https://t.co/a"]

Expected behavior
I think the string argument is in all cases a complete URL, and I would have expected the return value to be the whole string (in an array).

Environment
Apart from the details above, I'm running Ubuntu.

@danielittlewood0
Copy link
Author

danielittlewood0 commented May 11, 2021

I took a closer look into the source code for extractUrls. I didn't realise that t.co URLs were a special case! In particular, this comment seems to imply that the behaviour I'm describing is done intentionally.

      // In the case of t.co URLs, don't allow additional path characters.
      if (url.match(validTcoUrl)) {
      ...

Indeed,

twttr.txt.extractUrls("https://t.co/abc?amp=1")
["https://t.co/abc"]
twttr.txt.extractUrls("https://g.co/abc?amp=1")
["https://g.co/abc?amp=1"]

The issue arose from users copying links from twitter and pasting them onto our platform. The link has an additional ?amp=1, so the effect is that somebody pastes https://t.co/LY1EMFy7TW?amp=1 to us, and the resulting HTML looks like <a>https://t.co/LY1EMFy7TW</a>?amp=1.

@danielittlewood0
Copy link
Author

danielittlewood0 commented May 11, 2021

This is not a very nice solution, but I found that if I set

(ruby)

Twitter::TwitterText::Regex::REGEXEN[:valid_tco_url] = /$^/

(js)

twttr.txt.regexen.validTcoUrl = /$^/;

then I get the behaviour I expect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

1 participant