Is it really that difficult to stop duplicate downloads? #1361

AndyM48 · 2024-05-13T05:27:12Z

A picture is worth a thousand words:

These are from the BBC Feed (http://feeds.bbci.co.uk/news/rss.xml). Isn't it just a question of comparing titles and times?

lwindolf · 2024-05-15T21:32:22Z

No, I believe it isn't that simple. Please check out the complexity of the item comparison code in src/itemset.c; there is already a lot of logic there for eliminating duplicates.

The BBC feed in question provides unique identifiers for feed items. If those are present, a difference between them is taken as an indication of different items. If a feed provider issues the same content with a new UID, the RSS spec says it is to be considered new content.

There are use cases where you want this behaviour, and your suggestion would kill them. For example, a feed that alerts on something and provides the same content at different times to show you that a problem persists.
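To make the rule above concrete, here is a minimal sketch of an identifier-first comparison. It is not the logic in src/itemset.c; the `Item` class, the `is_same_item` function, and the title-plus-link fallback for feeds without identifiers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """Hypothetical stand-in for a stored feed item."""
    title: str
    link: str
    guid: Optional[str] = None  # feed-supplied unique identifier, if any


def is_same_item(a: Item, b: Item) -> bool:
    # When both items carry an identifier, the identifier alone decides:
    # identical content with a different identifier counts as a new item.
    if a.guid is not None and b.guid is not None:
        return a.guid == b.guid
    # Assumed fallback for feeds without identifiers: compare title and link.
    return a.title == b.title and a.link == b.link


first = Item("Match highlights", "https://example.org/a", guid="https://example.org/a#5")
repost = Item("Match highlights", "https://example.org/a", guid="https://example.org/a#6")
print(is_same_item(first, repost))  # False
```

Under this rule a reissued alert with a fresh identifier shows up as a new item, which is the use case described above, and a reposted news story does too.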

AndyM48 · 2024-05-16T05:14:03Z

Thank you for the explanation. I understand what you have said.
Could there be an option, or maybe a plugin, to hide "apparent" duplicates, i.e. to ignore the UID when displaying the feeds?

Turn the option on and they are not displayed.
Turn the option off and they are displayed again.
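A rough sketch of what such a display-only option could look like, assuming items are available as (title, UID) pairs. The `visible_items` function and the `hide_apparent_duplicates` flag are hypothetical names, not an existing Liferea setting or plugin API.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation of a stored item: (title, uid).
ItemRow = Tuple[str, str]


def visible_items(items: Iterable[ItemRow], hide_apparent_duplicates: bool) -> List[ItemRow]:
    """Return the items to display; nothing is deleted from storage."""
    if not hide_apparent_duplicates:
        return list(items)
    seen = set()
    shown = []
    for title, uid in items:
        # Ignore the UID and treat items with the same title as duplicates.
        key = title.strip().casefold()
        if key in seen:
            continue
        seen.add(key)
        shown.append((title, uid))
    return shown


rows = [
    ("Match highlights", "https://example.org/a#5"),
    ("Match highlights", "https://example.org/a#6"),
    ("Other story", "https://example.org/b#1"),
]
print(len(visible_items(rows, hide_apparent_duplicates=True)))   # 2
print(len(visible_items(rows, hide_apparent_duplicates=False)))  # 3
```

Because the filter only affects what is shown, turning the option off brings the hidden items back unchanged.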

lwindolf · 2024-05-16T22:51:48Z

Such an option would be possible. Maintaining the feature is the problem. This is a one-man project, and all code paths that the maintainer does not use daily tend to rot :-(

AndyM48 · 2024-05-20T05:11:29Z

This is really very frustrating. Many, many feeds have apparently duplicated items, especially from the BBC. The only difference in the SQL database (items table) seems to be in the source_id, where a number is appended to the string, e.g.:

https://www.bbc.com/sport/football/videos/cx88ezex0jzo#5
https://www.bbc.com/sport/football/videos/cx88ezex0jzo#6

Are these the "unique identifiers" you referred to above?

There is an informative article here

AndyM48 · 2024-05-21T09:33:12Z

So I solved this problem, which seems to mainly affect the BBC feeds. Thanks to DanQ for the info.

The answer was to intercept the BBC feed and remove the "#nn" numbers which the BBC had helpfully added to each guid. Unfortunately I could not get the Ruby script that DanQ offered to work, so I rewrote it in Tcl, and it works fine.
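For anyone wanting to try the same approach, here is a minimal Python sketch of the guid clean-up step. It is not the Tcl script mentioned above or DanQ's Ruby script. The `strip_guid_suffixes` function name and the sample XML are assumptions for illustration, and fetching the feed and handing the cleaned result to the reader are left out.

```python
import re
import xml.etree.ElementTree as ET

# Matches a trailing "#" followed by digits, e.g. "#5" or "#12".
_SUFFIX = re.compile(r"#\d+$")


def strip_guid_suffixes(rss_xml: str) -> str:
    """Return the RSS document with any trailing "#nn" removed from each guid."""
    root = ET.fromstring(rss_xml)
    for guid in root.iter("guid"):
        if guid.text:
            guid.text = _SUFFIX.sub("", guid.text.strip())
    # Note: ElementTree renames namespace prefixes it does not know (ns0, ns1, ...)
    # on output; register them with ET.register_namespace to keep the originals.
    return ET.tostring(root, encoding="unicode")


# Illustrative sample built around the guid quoted earlier in this thread.
sample = """<rss version="2.0"><channel><item>
<title>Clip</title>
<guid isPermaLink="false">https://www.bbc.com/sport/football/videos/cx88ezex0jzo#5</guid>
</item></channel></rss>"""

cleaned = strip_guid_suffixes(sample)
print(cleaned)
```

With the suffix gone, a reposted item carries the same guid as the original, so a reader that compares identifiers sees one item instead of two.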
