
Auto generate link for URL on news and comments #214

Open
Vamoss opened this issue Mar 4, 2020 · 16 comments


Vamoss commented Mar 4, 2020

Hi,

I have recently implemented automatic hyperlink generation for URLs in news posts.

This was done in these two commits:

  1. News list:
    EncontrosDigitais@6681653

  2. News Comments:
    EncontrosDigitais@dc9e319

If there is interest, I could create a pull request.
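
To illustrate the idea (a minimal sketch, not the code from the commits above), wrapping each URL found in a text in an anchor tag can look like this; Django's built-in urlize filter does something similar:

import re

URL_RE = re.compile(r"https?://[^\s<]+")

def linkify(text):
    """Wrap every URL found in text in a clickable HTML link."""
    return URL_RE.sub(lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', text)

print(linkify("see https://github.com for details"))
# see <a href="https://github.com">https://github.com</a> for details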

Cheers,
Carlos


Vamoss commented Mar 11, 2020

Hello again,

I have improved on this suggestion: it now takes the first URL found in the text and extracts its meta information, the way Facebook or Twitter do:
[Screenshot of the extracted link preview, 2020-03-11]

This was the commit that implemented it:
EncontrosDigitais@a74f15b

The file mainly responsible is metadatareader.py:
https://github.com/EncontrosDigitais/bootcamp/blob/master/bootcamp/news/metadatareader.py

If you like this idea, I can merge it with the previous suggestion and open the pull request.

My only doubt is about adding this new requirement in this file; I am not sure if this is the right way:
EncontrosDigitais@a74f15b#diff-7fe11226cff646f5d9f35faa76217059
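
For readers following along, a minimal sketch of this kind of metadata extraction, assuming requests and BeautifulSoup (the actual implementation is the metadatareader.py linked above):

import requests
from bs4 import BeautifulSoup

def get_page_metadata(url):
    """Fetch a page and return its Open Graph title/description/image."""
    response = requests.get(url, timeout=5)
    soup = BeautifulSoup(response.text, "html.parser")
    metadata = {}
    for prop in ("title", "description", "image"):
        tag = soup.find("meta", property=f"og:{prop}")
        if tag and tag.get("content"):
            metadata[prop] = tag["content"]
    if "title" not in metadata and soup.title:
        metadata["title"] = soup.title.string  # fall back to the <title> tag
    return metadata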

@sebastian-code (Collaborator)

This is actually awesome, @Vamoss. I have been thinking about implementing this for some time now, but I haven't had the time to do it myself. I appreciate that you took the time to show it to me. Yes, please, I will gladly accept your PR for this.


Vamoss commented Mar 21, 2020

Great to hear you like it.

I started learning Django because of Bootcamp, so I still don't know much.
Can you please check whether this requirement was added in the correct place:
EncontrosDigitais@a74f15b#diff-7fe11226cff646f5d9f35faa76217059
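
For context, a dependency like this is usually pinned in the requirements file with a single line; the version here is hypothetical, the actual pin is in the diff linked above:

beautifulsoup4==4.8.2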

@sebastian-code (Collaborator)

Yes, @Vamoss, it is in the right place. Can you replace the URL you use to describe the package with the URL of its repository?


Vamoss commented Apr 5, 2020

Hi @sebastian-code,
BeautifulSoup has no official GitHub/Git repository; the ones out there are outdated mirrors, so I think it is better to link to the official website.
I just made the pull request: #219

Here is a screenshot:
[Screenshot of the rendered link preview, 2020-04-05]


Vamoss commented Apr 5, 2020

The only feature not implemented yet is caching the metatag image.
This matters because, on servers like Facebook and Instagram, the image link expires after a few hours.
I think it is simple to implement; I would love it if someone could do it :)
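
A hedged sketch of what such caching could look like with Django's storage API (cache_preview_image and the previews/ path are hypothetical, not part of Bootcamp):

import requests
from django.core.files.base import ContentFile
from django.core.files.storage import default_storage

def cache_preview_image(url, name):
    """Download a remote preview image and store a local copy."""
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    # default_storage writes to MEDIA_ROOT (or the configured storage backend)
    return default_storage.save(f"previews/{name}", ContentFile(response.content))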


Vamoss commented Apr 13, 2020

Hi @sebastian-code, amazing work improving the code!

Just one small fix: news posts throw an error when there are no URLs in them. This should fix it:
Vamoss@6d991dd

Before a pull request I would like to discuss two topics:

  1. By replacing subprocess with requests we lose the ability to force a timeout on servers that never respond. I chose subprocess because I noticed that a lot of sites never respond, and I think this is critical: Bootcamp can now easily hang forever on such a request.
  2. By simplifying the regex, we lose the ability to find URLs like google.com (typed without the rigor of http://). I think this is important because many users don't pay attention to that. I did a lot of research, and extracting a URL from free text is not a simple task (see the sketch below).

Considering the complexity of these two subjects, I would recommend rolling back to the original suggestion, or finding a better way to solve these issues.
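
A simplified illustration of a scheme-optional pattern (not the more rigorous regex from the commit):

import re

# Matches both "http(s)://..." URLs and bare domains like "google.com".
URL_PATTERN = re.compile(
    r"(?:https?://|ftp://)?"       # optional scheme
    r"(?:[\w-]+\.)+[a-zA-Z]{2,}"   # domain, e.g. vamoss.com.br
    r"(?:/[^\s]*)?"                # optional path/query
)

def find_urls(text):
    """Return every URL-looking substring found in text."""
    return URL_PATTERN.findall(text)

print(find_urls("check google.com and https://reddit.com/r/django"))
# ['google.com', 'https://reddit.com/r/django']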

sebastian-code added a commit that referenced this issue Apr 13, 2020
Added a really nice feature from PR #220 fixing #214 and done by @Vamoss
@sebastian-code (Collaborator)

Now I see the value of using a really complex regex like the one you gave, @Vamoss; I'll recover it then. About the subprocess method: that is a bad idea. It exposes any installation to unnecessary risks and also makes it harder to use Bootcamp in different environments. Besides, you can pass a timeout to the requests call; I'll add the same 5-second timeframe to it.


Vamoss commented Apr 13, 2020

I know requests has a native timeout parameter, but please consider this note from the documentation:

timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds). If no timeout is specified explicitly, requests do not time out.
https://requests.readthedocs.io/en/latest/user/quickstart/#timeouts

During development I found some cases where a website returned the first byte (avoiding the requests timeout) and then took 30 seconds to fully return the content. Another approach to use with requests is eventlet:
https://stackoverflow.com/a/13592680/1177566


sebastian-code commented Apr 13, 2020

@Vamoss I see your point. I think this can be solved by implementing the proper exception workflow from requests when declaring the timeout. I was thinking of something along these lines:

import requests

try:
    r = requests.get(url, timeout=3)
    r.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print("HTTP Error:", errh)
except requests.exceptions.ConnectionError as errc:
    print("Error Connecting:", errc)
except requests.exceptions.Timeout as errt:
    print("Timeout Error:", errt)
except requests.exceptions.RequestException as err:
    print("Oops, something else:", err)

But thinking about what the documentation says, perhaps your solution with eventlet would be a safer approach. The implementation in your reference looks weird, though; what do you think of something like this:

import eventlet
eventlet.monkey_patch()  # make sockets green so Timeout can interrupt the request
with eventlet.Timeout(10):
    requests.get("http://someverytrickyURL.com")

@sebastian-code (Collaborator)

By the way @Vamoss, I could not reproduce a timeout situation. Do you have a URL I can work with for testing?


Vamoss commented Apr 13, 2020

Great! There is hope :)

Yes, that is hard to simulate. I don't remember exactly which site I was testing in that case, but I suspect it was https://reddit.com, or my personal website https://vamoss.com.br (my server can be very slow sometimes).

This site may help (I tested some pages and not all of them were actually slow):
http://internetsupervision.com/scripts/urlcheck/report.aspx?reportid=slowest

Nonetheless, I think you have a good strategy; I would be happy to see it in action.
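
For anyone trying to reproduce this locally, a hedged sketch of a "trickle" server: it sends one byte per second, so every byte resets the requests read timeout, and only a total-time guard like eventlet.Timeout will catch it.

import socket
import time

server = socket.socket()
server.bind(("127.0.0.1", 8080))
server.listen(1)
while True:
    conn, _ = server.accept()
    conn.recv(1024)  # consume the HTTP request
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 30\r\n\r\n")
    for _ in range(30):      # 30 bytes over ~30 seconds
        conn.sendall(b"x")
        time.sleep(1)
    conn.close()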

@sebastian-code (Collaborator)

@Vamoss I'm trying to revert the regex, but when I tested it in the specific case you describe (I tried vamoss.com.br), it throws a MissingSchema error; the log actually mentions detecting the right pattern: Invalid URL 'vamoss.com.br': No schema supplied. Perhaps you meant http://vamoss.com.br?

Do you know what might be going on? My regex knowledge is really bad.


Vamoss commented Apr 15, 2020

It happens because requests requires a schema (e.g. http://) in the URL; this didn't happen with subprocess because curl fixed it automatically.

This code works for me:

def fetch_metadata(text):
    """Consolidate the workflow to recover the metadata of the page behind
    the first URL found in a given text block.

    :param text: block of text of any length
    """
    urls = get_urls(text)
    if len(urls) > 0:
        url = urls[0]
        # requests needs an explicit schema, so prepend one when missing
        if not url.lower().startswith(("http://", "https://", "ftp://")):
            url = "http://" + url

        return get_metadata(url)

    return None

@sebastian-code (Collaborator)

No problem @Vamoss, I fixed it using a different approach. Thanks for the help, though.


Vamoss commented Feb 21, 2021

I think we can close this issue @sebastian-code
