Deduplicate items across sources #1444

mrichtarsky · 2023-07-10T09:19:03Z

Hi,

selfoss only adds an item from a feed when it is not already present for that source. However, newspapers often have separate feeds for different topics. When you subscribe to multiple feeds, you can end up with the same article from multiple feeds/sources.

So it would be nice if selfoss could check whether the article is present regardless of source. This is usually ok since the ID is the URL to the article, which should be unique across sources.

I have implemented this change in behavior here, controlled by an ini parameter:
mrichtarsky@f31bf4f
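
As a rough illustration of the behavior described above (this is not the code from the linked commit), here is a minimal Python sketch. The `Item` type, the `add_new_items` function and the `dedup_across_sources` flag are hypothetical names standing in for the ini-controlled switch; selfoss itself is written in PHP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    uid: str        # usually the article URL, per the discussion
    source_id: int  # feed/source the item was fetched from
    title: str


def add_new_items(existing, fetched, dedup_across_sources):
    """Return the fetched items that should be stored.

    With dedup_across_sources=False an item is skipped only if the same
    source already has that UID. With True, a matching UID in any
    source is enough to skip it.
    """
    if dedup_across_sources:
        def key(item):
            return item.uid
    else:
        def key(item):
            return (item.source_id, item.uid)

    seen = {key(item) for item in existing}
    added = []
    for item in fetched:
        k = key(item)
        if k not in seen:
            seen.add(k)
            added.append(item)
    return added
```

With the flag off, an article already stored for source 1 is stored again when source 2 delivers it; with the flag on, it is skipped.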

Would this be interesting for others as well?

Thanks and best regards,
Martin

jtojnar · 2023-07-10T21:15:35Z

Thanks, that is an interesting idea. I wonder if we could make it always enabled and have the item appear in multiple sources.

We would probably need to replace the source column in the items table with an m:n association table. Will need to check the performance implications.
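
To make the m:n idea concrete, here is a small sketch using Python's `sqlite3` module. The table and column names (`items`, `sources`, `items_sources`, `uid`) and the helpers `open_db` and `store_item` are assumptions for illustration only and are not taken from the selfoss schema.

```python
import sqlite3

# Hypothetical schema: an item is stored once and linked to every
# source that delivered it through an association table.
SCHEMA = """
CREATE TABLE sources (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    uid TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL
);
CREATE TABLE items_sources (
    item_id INTEGER NOT NULL REFERENCES items(id),
    source_id INTEGER NOT NULL REFERENCES sources(id),
    PRIMARY KEY (item_id, source_id)
);
"""


def open_db():
    """Create an in-memory database with the sketch schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def store_item(db, source_id, uid, title):
    """Insert the item if its UID is new, then link it to the source."""
    db.execute(
        "INSERT OR IGNORE INTO items (uid, title) VALUES (?, ?)", (uid, title)
    )
    (item_id,) = db.execute(
        "SELECT id FROM items WHERE uid = ?", (uid,)
    ).fetchone()
    db.execute(
        "INSERT OR IGNORE INTO items_sources (item_id, source_id) VALUES (?, ?)",
        (item_id, source_id),
    )
    return item_id
```

Listing a source then becomes a join through `items_sources`, which is the part whose performance would need measuring.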

davidoskky · 2023-07-11T09:50:33Z

This is a very nice idea. What are you using as the identifier to deduplicate? The URL?
What if the two feeds return different content? That should not be an issue if you're using full-text retrieval, though.

jtojnar · 2023-07-11T10:23:32Z

> What are you using as the identifier to deduplicate? The URL?

The UID. Most commonly, this is the post URL, but that is not required. For example, blogger.com uses something like tag:blogger.com,1999:blog-6112936277054198647.post-403878284366003238.

> What if the two feeds return different content? That should not be an issue if you're using full-text retrieval, though.

We could have findAll return the source ID in addition to the item ID, check whether the content and URL match when the source ID does not, and deduplicate only in that case.

That would also probably resolve the uid collisions.
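
A minimal sketch of that check, in Python for illustration: the `Stored` record, the `stored_by_uid` lookup and `is_duplicate` are hypothetical names, and the real `findAll` return shape may differ.

```python
from typing import Dict, List, NamedTuple


class Stored(NamedTuple):
    item_id: int
    source_id: int
    url: str
    content: str


def is_duplicate(
    stored_by_uid: Dict[str, List[Stored]],
    uid: str,
    source_id: int,
    url: str,
    content: str,
) -> bool:
    """Decide whether a fetched item should be skipped.

    Same source: the UID alone decides, as selfoss does today.
    Different source: skip only if URL and content also match, so two
    unrelated items that happen to share a UID are both kept.
    """
    for stored in stored_by_uid.get(uid, []):
        if stored.source_id == source_id:
            return True
        if stored.url == url and stored.content == content:
            return True
    return False
```

Comparing content exactly is the strictest choice; feeds that differ only in whitespace or tracking parameters would not be deduplicated under this sketch.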

The issue that items will be missing from some of the sources will still remain, though, which is why I would like to test the performance impact of putting the sources table in an m:n relation to items.

jtojnar added the enhancement label Jul 10, 2023
