Is it really that difficult to stop duplicate downloads? #1361
Comments
No, I believe it isn't that simple. Please check out the complexity of the item comparison code in src/itemset.c; there is already a lot of logic there for eliminating duplicates. The BBC feed in question provides unique identifiers for feed items, and when those are present, a difference between them is taken as an indication of different items. If a feed provider issues the same content with a new UID, the RSS spec says it is to be considered new content. There are use cases where you want this, and your suggestion would kill them. For example, a feed that alerts on something and provides the same content at different times to show you that a problem persists.
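To make the rule in the comment above concrete, here is a minimal Python sketch of GUID-first item comparison. It is not the code from src/itemset.c; the `Item` class, the `is_same_item` function, and the fallback to comparing titles when no GUID is present are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """Hypothetical feed item; only the fields needed for comparison."""
    title: str
    guid: Optional[str] = None


def is_same_item(a: Item, b: Item) -> bool:
    """Illustrative duplicate check, not the project's actual logic."""
    # When both items carry a unique identifier, the identifier decides:
    # different GUIDs mean different items, even if the content matches.
    if a.guid and b.guid:
        return a.guid == b.guid
    # Assumed fallback for items without GUIDs: compare titles.
    return a.title == b.title
```

Under this rule, two items with the same title but GUIDs that differ only in a trailing number are treated as distinct, which matches the behaviour reported in this issue.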
Thank you for the explanation. I understand what you have said.
Such an option would be possible. Maintaining the feature is the problem. This is a one-man project, and all code paths that the maintainer does not use daily tend to rot :-(
This is really very frustrating. Many, many feeds have apparently duplicated items, especially those from the BBC. The only difference in the SQL database (items) seems to be in the source_id, where a number is appended to the string, e.g.:
Are these the "unique identifiers" you referred to above? There is an informative article here
So I solved this problem, which seems to mainly affect the BBC feeds. Thanks to DanQ for the info. The answer was to intercept the BBC feed and remove the "#nn" numbers which the BBC had helpfully added to each guid. Unfortunately, I could not get the Ruby script that DanQ offered to work, so I rewrote it in Tcl, and it works fine.
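For anyone wanting to try the same approach, here is a rough Python sketch of that kind of filter. It is not the Tcl script or DanQ's Ruby script mentioned above; the function name `strip_guid_suffixes` is made up, and it assumes the suffix is always a `#` followed by digits at the very end of the guid text.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the unwanted suffix is "#" plus one or more digits at the end.
_GUID_SUFFIX = re.compile(r"#\d+$")


def strip_guid_suffixes(rss_xml: str) -> str:
    """Return the feed XML with trailing "#nn" removed from every <guid>."""
    root = ET.fromstring(rss_xml)
    for guid in root.iter("guid"):
        if guid.text:
            guid.text = _GUID_SUFFIX.sub("", guid.text.strip())
    return ET.tostring(root, encoding="unicode")
```

The cleaned XML would then be handed to the feed reader in place of the original feed, so that re-issued items share one guid and are no longer treated as new.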
A picture is worth a thousand words:
These are from the BBC feed (http://feeds.bbci.co.uk/news/rss.xml). Isn't it just a question of comparing titles and times?