
Add drop_duplicates #4

Open
jbesomi opened this issue Apr 26, 2020 · 4 comments · May be fixed by #150
Labels
enhancement New feature or request

Comments

jbesomi commented Apr 26, 2020


Add hero.drop_duplicates(s, representation, distance_algorithm, threshold).

Where:

  • s is a Pandas Series
  • representation is either a Flair embedding or a hero representation function. Need to define a default value.
  • distance_algorithm is either a string or a function that takes two vectors as input and computes their distance. An example of such a function is sklearn.metrics.pairwise.euclidean_distances (see the scikit-learn repository)
  • threshold is a numeric value. All pairs of vectors whose distance is less than this value will be considered a single document; the first in order of appearance in the Pandas Series will be kept.

Task:
Drop all duplicates from the given Pandas Series and return a cleaned version of it.

TODO:
It would be interesting to drop_duplicates from a DataFrame, specifying which columns to consider (as done in Pandas).
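A minimal sketch of the proposed interface might look as follows. This is not a settled API: the TF-IDF default, the greedy keep-first loop, and the threshold value are all assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

def drop_duplicates(s, representation=None,
                    distance_algorithm=euclidean_distances, threshold=1.0):
    """Hypothetical sketch: vectorize the texts (TF-IDF as a stand-in
    default), compute pairwise distances, and keep only the first
    document of each near-duplicate group."""
    if representation is None:
        vectors = TfidfVectorizer().fit_transform(s).toarray()
    else:
        vectors = representation(s)
    dist = distance_algorithm(vectors)
    keep = []
    for i in range(len(s)):
        # Keep document i only if it is not too close to an already-kept one.
        if all(dist[i, j] >= threshold for j in keep):
            keep.append(i)
    return s.iloc[keep]
```

Note that TfidfVectorizer L2-normalizes its output, so identical texts sit at distance 0 and disjoint texts at sqrt(2), which makes a threshold around 1.0 a workable placeholder.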

@jbesomi jbesomi added the enhancement New feature or request label Apr 26, 2020
selimelawwa (Contributor) commented
@jbesomi Should it check line by line and remove a line if it is a duplicate?
Or should it not remove anything and only report that duplicates exist?


jbesomi commented May 16, 2020

The idea here is to compare long documents and find those that are too similar to each other; in that case, the documents are probably duplicates. There are many applications for this, for instance detecting plagiarism in papers.

A naive approach is to apply TF-IDF and look at the distance between vectors.
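The naive TF-IDF approach can be sketched in a few lines. The example sentences and the 0.8 similarity cutoff are made up for illustration; cosine similarity is used here in place of a raw distance, since it is the more common choice for TF-IDF vectors.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = pd.Series([
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumped over the lazy dog",   # near-duplicate
    "An entirely unrelated sentence about databases",
])

# Vectorize with TF-IDF and compute the pairwise cosine-similarity matrix.
vectors = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(vectors)

# Pairs above a chosen cutoff are flagged as likely duplicates.
dup_pairs = [(i, j)
             for i in range(len(docs))
             for j in range(i + 1, len(docs))
             if sim[i, j] > 0.8]
```

Here only the first two sentences, which differ by a single word, end up flagged as a duplicate pair.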


igponce commented Jul 8, 2020

I suggest having several methods for handling duplicated content.

In the very simplest form, you might just need to check against a hash (SHA-1, for instance) to be sure you don't have exact duplicates (ok, this might be a preprocessing job).
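The exact-duplicate case via hashing could look like this (function name is hypothetical; pandas' duplicated() does the keep-first bookkeeping):

```python
import hashlib
import pandas as pd

def drop_exact_duplicates(s):
    """Keep the first occurrence of each text, comparing SHA-1 digests."""
    hashes = s.map(lambda text: hashlib.sha1(text.encode("utf-8")).hexdigest())
    # duplicated() marks every repeat after the first; invert to keep firsts.
    return s[~hashes.duplicated()]
```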

The interface might look like Pandas.Series.unique() but specifying a method / way to do the deduplication: unique(method='hash | jaccard | etc.', threshold=xx).


jbesomi commented Jul 8, 2020

Hey @igponce,

Exactly, the interface would look like hero.unique(df['text']).

A simple-yet-powerful solution is to simply compute a good representation of each text and remove documents that have very similar vectors.

Right, as you point out, the function will take a threshold argument. We will need to run some tests and pick a good default; it will largely depend on the underlying algorithm.

Would you be interested in implementing this solution? Jaccard might work as well, but it's easy to do better by using word vectors instead of just counting.

Food for thought: what if the input must already be a representation? That would be an even better solution. In this case, the arguments might be the distance function as well as the threshold parameter.
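That representation-first variant reduces to a small, text-agnostic helper. This is a sketch under the assumptions above (the function name, the euclidean default, and the 0.5 threshold are all placeholders):

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def drop_duplicate_vectors(vectors, distance=euclidean_distances, threshold=0.5):
    """Return indices of rows to keep, given precomputed document vectors.

    A row is dropped when it lies within `threshold` of an earlier kept row,
    so the first occurrence of each near-duplicate group survives.
    """
    dist = distance(vectors)
    keep = []
    for i in range(len(vectors)):
        if all(dist[i, j] >= threshold for j in keep):
            keep.append(i)
    return keep
```

The caller then maps the kept indices back onto the original Series, which keeps the deduplication logic independent of how the vectors were produced (TF-IDF, Flair embeddings, etc.).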

@jbesomi jbesomi changed the title Duplicates Add remove_duplicates Jul 8, 2020
@jbesomi jbesomi changed the title Add remove_duplicates Add drop_duplicates Jul 8, 2020
@jbesomi jbesomi self-assigned this Jul 8, 2020
@mk2510 mk2510 linked a pull request Aug 11, 2020 that will close this issue