Is it really that difficult to stop duplicate downloads? #1361

Open
AndyM48 opened this issue May 13, 2024 · 5 comments

Comments

@AndyM48

AndyM48 commented May 13, 2024

A picture is worth a thousand words:

[Screenshot: 2024-05-13_07-22]

These are from the BBC Feed (http://feeds.bbci.co.uk/news/rss.xml). Isn't it just a question of comparing titles and times?

@lwindolf
Owner

lwindolf commented May 15, 2024

No, I believe it isn't that simple. Please check out the complexity of the item comparison code in src/itemset.c; there is already a lot of logic for eliminating duplicates.

The BBC feed in question provides unique identifiers for feed items. If those are present, a difference between them is taken as an indication of different items. If such a feed provider issues the same content with a new UID, the RSS spec says it is to be considered new content.

There are use cases where you want this behavior, and your suggestion would break them. For example, a feed that alerts on something and provides the same content at different times to show you that a problem persists.
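
To make the identifier rule concrete, here is a minimal Python sketch. It is not the logic in src/itemset.c, which is considerably more involved; the `Item` class, the `same_item` function, and the title-and-link fallback are all hypothetical and only illustrate why two entries with identical content but different UIDs count as different items.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    title: str
    link: str
    guid: Optional[str] = None  # unique identifier from the feed, if any


def same_item(a: Item, b: Item) -> bool:
    """Return True if a and b should be treated as the same feed item."""
    if a.guid is not None and b.guid is not None:
        # Both items carry an identifier, so the identifier alone decides.
        return a.guid == b.guid
    # Hypothetical fallback for feeds without identifiers (an assumption).
    return a.title == b.title and a.link == b.link


url = "https://www.bbc.com/sport/football/videos/cx88ezex0jzo"
first = Item("Example title", url, guid=url + "#5")
second = Item("Example title", url, guid=url + "#6")
print(same_item(first, second))  # False: same content, different UID
```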

@AndyM48
Author

AndyM48 commented May 16, 2024

Thank you for the explanation. I understand what you have said.
Could there be an option, or maybe a plugin, to hide "apparent" duplicates, i.e. ignore the UID when displaying the feeds?

  • Turn the option on and they are not displayed
  • Turn the option off and they are displayed again
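
As a rough illustration of the toggle described above, the following Python sketch filters only what is displayed and never deletes anything, so switching the option off brings the hidden items back. The function name `visible_items`, the `(guid, title)` tuple layout, and the choice of the title as the comparison key are assumptions made for this example, not an existing option or plugin API.

```python
def visible_items(items, hide_apparent_duplicates):
    """Return the items to display.

    items is a list of (guid, title) tuples in display order.
    """
    if not hide_apparent_duplicates:
        # Option off: show everything, duplicates included.
        return list(items)
    seen_titles = set()
    shown = []
    for guid, title in items:
        if title in seen_titles:
            # Same title as an earlier item: hide it, ignoring the guid.
            continue
        seen_titles.add(title)
        shown.append((guid, title))
    return shown


stored = [("a#5", "Example title"), ("a#6", "Example title"), ("b", "Other")]
print(len(visible_items(stored, True)))   # 2
print(len(visible_items(stored, False)))  # 3
```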

@lwindolf
Owner

Such an option would be possible; maintaining the feature is the problem. This is a one-man project, and all code paths that the maintainer does not use daily tend to rot :-(

@AndyM48
Author

AndyM48 commented May 20, 2024

This is really very frustrating. Many, many feeds have apparently duplicated items, especially from the BBC. The only difference in the SQL database (items) seems to be in the source_id, where a number is appended to the string, e.g.:

https://www.bbc.com/sport/football/videos/cx88ezex0jzo#5
https://www.bbc.com/sport/football/videos/cx88ezex0jzo#6

Are these the "unique identifiers" you referred to above?

There is an informative article here

@AndyM48
Author

AndyM48 commented May 21, 2024

So I solved this problem, which seems to mainly affect the BBC feeds. Thanks to DanQ for the info.

The answer was to intercept the BBC feed and remove the "#nn" numbers which the BBC had helpfully added to each guid. Unfortunately I could not get the Ruby script that DanQ offered to work, so I rewrote it in Tcl, and it works fine.
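
For readers who want to try the same idea, here is a minimal Python sketch of the guid clean-up. It is neither DanQ's Ruby script nor the Tcl rewrite mentioned above (neither is shown in this thread); the function name `strip_guid_suffixes` is hypothetical, and the sketch assumes the feed text has already been fetched and that every unwanted suffix has the form "#" followed by digits at the end of the guid.

```python
import re
import xml.etree.ElementTree as ET

# Matches a trailing "#" plus digits, e.g. the "#5" in ".../cx88ezex0jzo#5".
_TRAILING_NUMBER = re.compile(r"#\d+$")


def strip_guid_suffixes(rss_xml: str) -> str:
    """Return the feed with the trailing "#nn" removed from every <guid>."""
    root = ET.fromstring(rss_xml)
    for guid in root.iter("guid"):
        if guid.text:
            guid.text = _TRAILING_NUMBER.sub("", guid.text.strip())
    return ET.tostring(root, encoding="unicode")
```

The cleaned output would then be handed to the feed reader in place of the original feed, for example through a local filter script. Note that ElementTree may rename namespace prefixes when it writes the XML back out, which feed parsers should tolerate but which changes the text of the feed.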
