Deduplicate items across sources #1444

Open
mrichtarsky opened this issue Jul 10, 2023 · 3 comments

Comments

@mrichtarsky

Hi,

selfoss only adds an item from a feed when it is not already present for that source. However, newspapers often have separate feeds for different topics. When you subscribe to multiple feeds, you can end up with the same article from multiple feeds/sources.

So it would be nice if selfoss could check whether the article is already present regardless of source. This is usually safe because the ID is typically the URL of the article, which should be unique across sources.

I have implemented this change in behavior here, controlled by an ini parameter:
mrichtarsky@f31bf4f
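
To illustrate the difference between the two behaviours, here is a minimal sketch in Python. It is not the code from the linked commit and not how selfoss is implemented; `Item`, `filter_new_items`, and the `dedupe_across_sources` switch (standing in for the ini parameter) are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    uid: str        # feed-provided identifier, often the article URL
    source_id: int  # the subscribed feed that delivered the item
    title: str


def filter_new_items(existing, incoming, dedupe_across_sources=False):
    """Return the incoming items that are not yet stored.

    With dedupe_across_sources=False an item counts as known only if the
    same source already has that uid. With True, the uid alone decides.
    """
    def key(item):
        if dedupe_across_sources:
            return item.uid
        return (item.source_id, item.uid)

    seen = {key(item) for item in existing}
    accepted = []
    for item in incoming:
        k = key(item)
        if k in seen:
            continue
        seen.add(k)  # also catches repeats within the same batch
        accepted.append(item)
    return accepted
```

With one stored article from source 1 and the same uid arriving from source 2, the default mode would store it again, while the cross-source mode would skip it.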

Would this be interesting for others as well?

Thanks and best regards,
Martin

@jtojnar
Member

jtojnar commented Jul 10, 2023

Thanks, that is an interesting idea. I wonder if we could make it always enabled and have the item appear in multiple sources.

We would probably need to replace the source column in the items table with an m:n association table. Will need to check the performance implications.
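
A rough sketch of what such an m:n layout could look like, using Python's `sqlite3` module. The table and column names (`items_sources`, `item_id`, `source_id`) and the helper functions are assumptions for illustration, not the actual selfoss schema, and the sketch treats the uid as globally unique, which ignores possible uid collisions between unrelated feeds.

```python
import sqlite3

# Hypothetical schema: items no longer carry a source column;
# the association table links each item to every source that listed it.
SCHEMA = """
CREATE TABLE sources (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    uid TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL
);
CREATE TABLE items_sources (
    item_id INTEGER NOT NULL REFERENCES items(id),
    source_id INTEGER NOT NULL REFERENCES sources(id),
    PRIMARY KEY (item_id, source_id)
);
"""


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_item(conn, source_id, uid, title):
    """Store the item once and link it to the given source."""
    conn.execute(
        "INSERT OR IGNORE INTO items (uid, title) VALUES (?, ?)", (uid, title)
    )
    item_id = conn.execute(
        "SELECT id FROM items WHERE uid = ?", (uid,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO items_sources (item_id, source_id) VALUES (?, ?)",
        (item_id, source_id),
    )
    return item_id


def items_for_source(conn, source_id):
    """List the uids of all items linked to one source."""
    rows = conn.execute(
        "SELECT i.uid FROM items i "
        "JOIN items_sources s ON s.item_id = i.id "
        "WHERE s.source_id = ? ORDER BY i.id",
        (source_id,),
    )
    return [row[0] for row in rows]
```

Every per-source listing now needs a join, which is where the performance question comes from.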

@davidoskky
Contributor

This is a very nice idea. What are you using as the identifier to deduplicate? The URL?
What if the two feeds return different content? It should not be an issue if you're using the full text recovery, though.

@jtojnar
Member

jtojnar commented Jul 11, 2023

> What are you using as the identifier to deduplicate? The URL?

The UID. Most commonly, this is the post URL but it is not required. For example blogger.com will use something like tag:blogger.com,1999:blog-6112936277054198647.post-403878284366003238.

> What if the two feeds return different content? It should not be an issue if you're using the full text recovery, though.

We could have findAll return the source ID in addition to the item ID, check whether the content and URL match when the source ID does not, and only deduplicate in that case.

That would also probably resolve the uid collisions.
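
As a sketch of that check in Python: `StoredItem` and `is_duplicate` are hypothetical names, and `stored` stands in for whatever the lookup would return once it includes the source ID. This is an illustration under those assumptions, not the actual selfoss code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredItem:
    item_id: int
    source_id: int
    uid: str
    link: str
    content: str


def is_duplicate(source_id, uid, link, content, stored):
    """Decide whether an incoming item should be skipped.

    Same source and same uid: skipped, as is already the case per source.
    Different source and same uid: skipped only if link and content match,
    so unrelated items that merely share a uid are both kept.
    """
    for item in stored:
        if item.uid != uid:
            continue
        if item.source_id == source_id:
            return True
        if item.link == link and item.content == content:
            return True
    return False
```

Comparing the content exactly is the strictest choice; feeds that serve the same article with slightly different markup would not be merged under this sketch.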

The problem that items will be missing from some of the sources would still remain, though, which is why I would like to test the performance impact of putting the sources table in an m:n relation to items.
