Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore empty OpenGraph props #120

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Ignore empty OpenGraph props #120

wants to merge 2 commits into from

Conversation

croqaz
Copy link
Member

@croqaz croqaz commented Jul 16, 2019

Ignore empty OpenGraph properties or content.

Fixes #117

Upgraded six>=1.11 to make Python 3.4 tests pass.

This was also done in #119

@croqaz croqaz requested a review from ivanprado July 16, 2019 13:59
@croqaz croqaz changed the title WIP Ignore empty OpenGraph props Ignore empty OpenGraph props Jul 16, 2019
{
"@value": "Buy tickets for an upcoming Elysian Fields concert near you. List of all Elysian Fields tickets and tour dates for 2017."
}
],
"http://ogp.me/ns#image": [
{
"@value": "http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg"
"@value": "http://images.sk-static.com/SECONDARY_IMAGE.jpg"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this order will create a bug in function test_rdfa_not_preserving_order. Why do you have to change the order here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the reason is that it is because you are comparing with the sorting values... could you sort the values in the test function instead? That way test_rdfa_not_preserving_order would keep working as xfail case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to sort the list in this failing function to fix it. I'm almost there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@croqaz the function test_rdfa_not_preserving_order should be preserved as it is. It is an xfail function, which means it is a test that should be passing, but it is not. So don't sort on this function, sort it in test_all.

Probably you understood well, but just in case :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Well, to fix the test_rdfa_not_preserving_order() all we need to do is:

for rdf in data['rdfa']:
            for key, pairs in rdf.items():
                if ':' in key and isinstance(pairs, list):
                    rdf[key] = sorted(pairs, key=lambda e: e["@value"])
        self.assertEqual(jsonize_dict(data)['rdfa'], expected['rdfa'])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_rdfa_not_preserving_order() fails randomly as it is exactly now, with my changes or not, because the order of the duplicate values is random.
So it doesn't matter if the duplicate values are sorted correctly or not in the JSON file.
The function is still a good way to replicate the bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Anyway, still, the order of the values for this case in elysianfields.json should be reversed. The first image should be http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg and the second image should be http://images.sk-static.com/SECONDARY_IMAGE.jpg as it was before, because this is the appearance order in the page.

This change will require some more code in the function test_all so that test passes, but this is the correct way. Otherwise test_rdfa_not_preserving_order would be incorrect, so wouldn't represent properly a case that should pass but is failing. (with current PR code test_rdfa_not_preserving_order represents a case that is failing and would also fail even in case #116 is fixed)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. I'm working on it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this is done, I forgot to mention.
All the meta tags are converted to JSON objects and the sorting of JSON in the tests matches the actual HTML.

@codecov
Copy link

codecov bot commented Jul 16, 2019

Codecov Report

Merging #120 into master will increase coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #120      +/-   ##
==========================================
+ Coverage   87.63%   87.68%   +0.05%     
==========================================
  Files          11       11              
  Lines         469      471       +2     
  Branches      101      102       +1     
==========================================
+ Hits          411      413       +2     
  Misses         52       52              
  Partials        6        6
Impacted Files Coverage Δ
extruct/_extruct.py 73.33% <ø> (ø) ⬆️
extruct/opengraph.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 50a0915...d4645ef. Read the comment docs.

@@ -28,6 +28,7 @@
<meta property="og:type" content="songkick-concerts:artist">
<meta property="og:title" content="Elysian Fields">
<meta property="og:description" content="Buy tickets for an upcoming Elysian Fields concert near you. List of all Elysian Fields tickets and tour dates for 2017.">
<meta property="og:description" content="" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could have the tests also check that whitespace characters, like spaces, are now ignored as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! That's a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gallaecio Do you want to take a look or I can just go ahead and merge this?
This is my first contribution to Extruct ;)

@croqaz
Copy link
Member Author

croqaz commented Jul 17, 2019

I believe I adressed all the comments.

The Travis build seems to constantly fail with Python 3.4 at "pip install -U tox codecov".

@ivanprado
Copy link
Contributor

@croqaz I have noticed that the same is happening to master, see: https://travis-ci.org/scrapinghub/extruct/builds/558857852?utm_source=github_status&utm_medium=notification

Digging deeper, the failure in the logs is:

FileNotFoundError: [Errno 2] No such file or directory: '/home/travis/virtualenv/python3.4.6/lib/python3.4/site-packages/six-1.10.0.dist-info/METADATA'

I googled and this could be relevant: tox-dev/tox-travis#76. Could you have a look?

@croqaz
Copy link
Member Author

croqaz commented Jul 17, 2019

Squashed all commits to have a clean history on merge.

Copy link
Contributor

@ivanprado ivanprado left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank @croqaz for implementing that! I see everything right after the changes.

@kmike
Copy link
Member

kmike commented Jul 19, 2019

In #117 idea was to only drop empty values when deciding which value to choose, to be on a safe side. But maybe we can drop them always, as in this PR. @croqaz could you plese check that properties with empty values are always ignored, e.g. using https://developers.facebook.com/tools/debug/sharing/?

@croqaz
Copy link
Member Author

croqaz commented Jul 19, 2019

@kmike Just to make sure I understand this correctly, because I think I might have misunderstood how to solve the problem.

This is the problem:

<meta property="og:description" content="" />
<blah-blah>... </blah-blah> ...
<meta property="og:description" content="My cool website" />

Duplicate OpenGraph with the same name, but one of them has an empty value.
In many cases, the empty value is the first, but it might be second just as well.

Solution:

  1. Do we want to completely drop OpenGraph meta tags with empty value?
  2. Do we want to keep the tags with empty value, but move them at the end of the values list, so they have a lower priority?

In this PR I implemented the first option.

@croqaz
Copy link
Member Author

croqaz commented Jul 19, 2019

Oh, Facebook's Debug tool only shows one of the OpenGraph meta tags.
It seems to be the most relevant.
But I don't think they care much about the order - in case of images, I'm not 100% they always take the first one - but I might be wrong, because I don't know how to find a website that has multiple images in a "random" order and check what Facebook would extract...

@kmike
Copy link
Member

kmike commented Jul 19, 2019

The question is: if there is only a single property with an empty value, is it returned or dropped?

@croqaz
Copy link
Member Author

croqaz commented Jul 19, 2019

Ok, I understand now. I wonder if we can quickly craft a page like that and feed that to Facebook and see how it works?
Because I don't know how to find a page like that "out there".

Actually I might craft a page like that relatively easy.
Let me try.

@croqaz
Copy link
Member Author

croqaz commented Jul 19, 2019

For:

<meta property="og:description" content="" />
<meta property="og:image" content="" />
<meta property="og:title" content="" />

Facebook Sharing Debugger shows:

  • Provided og:image URL, was not a valid URL
  • og:title - Taken from the <title> tag
  • og:description - empty

@croqaz
Copy link
Member Author

croqaz commented Jul 19, 2019

I also tested with:

<meta property="og:image" content="" />
<meta property="og:image" content="/images/logo.png" />

And Facebook Sharing Debugger shows: "Provided og:image URL, was not a valid URL"
But also the logo is corretly identified and displayed as a main image.

So if the image is first and the empty is second, OR if the image is second and the empty is first, the Sharing Debugger still finds the correct image.

@jakubwasikowski
Copy link
Contributor

Hey guys! Is there something which prevents us from merging that fix?
Would be great to have it merged because it blocks a bit a PR in another project.

@ivanprado
Copy link
Contributor

My cent: Looking at http://ogp.me/ and https://developers.facebook.com/docs/sharing/webmasters/#markup I don't see any definition of a property that could contain an empty value and still have some meaning. It seems that properties cannot be used to label a page somehow. So I would include the removal of every single property without value from the output. What do you think the rest of reviewers?

@croqaz
Copy link
Member Author

croqaz commented Jul 25, 2019

@ivanprado You mean in this case of: property="og:title" content=" " or property="og:description" content="" both properties should be dropped from the result?

@ivanprado
Copy link
Contributor

ivanprado commented Jul 25, 2019

@croqaz well, I have some doubts regarding the fix. By one side, I would say that it should be the user using extruct the one with the responsibility to filter empty values (option 1). extruct should only expose the data in the proper order (appearance order), empty or not.

But by the other side, I have not seen anything in the protocol definition pointing that is possible to create properties with empty values, so removing all of them could be reasonable (option 2). But the question is, it is the same a document with property="og:description" content="" than a document without any annotation? I'm not sure 100%.

The solution of cleaning empty values but only if there is at least one non-empty value for the duplicated property seems like a strange mix (option 3). But maybe could do the job:

  • User don't have to deal with empty values by himself, as they are removed
  • Meaning could be retained in case of empty content

But is weird as the user still will have to deal with empty value for the particular case. It is a little bit strange.

So in conclusión, I have not a clear view about what we should do, the three options make some sense. The prefered one by my side would be the option 1, but I'm also ok with option 2 and 3. @croqaz
@kmike @Gallaecio, what do you think?

@kmike
Copy link
Member

kmike commented Jul 25, 2019

I have the same concerns as Ivan. For OpenGraph we return a single value for an attribute, and empty values affect prioritization in case of multiple meta tags, so doing a correct prioritization (non-empty over empty) is fixing a bug.

Removing all empty values is a different thing; it solves the above mention prioritization problem, but it is a larger change, with unclear outcome - it can be a right thing to do (if having an empty value is the same as not having tag at all), or it can be a data loss (if a presence of a tag with empty value is meaningful information). To decide we need to answer this question, and I'm not sure how to do it :) My suggestion was a tool by Facebook (as standard says nothing), and it seems to confirm that empty value is meaningful - if I'm reading @croqaz's results properly, it does show an empty description if a description is present, but empty - and shows nothing if description is absent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

First non-empty result should be extracted in case of OpenGraph
5 participants