
Add drop_duplicates #4

Open
jbesomi opened this issue Apr 26, 2020 · 4 comments · May be fixed by #150
Labels
enhancement New feature or request

Comments

jbesomi commented Apr 26, 2020


Add hero.drop_duplicates(s, representation, distance_algorithm, threshold).

Where:

  • s is a Pandas Series
  • representation is either a Flair embedding or a hero representation function. Need to define a default value.
  • distance_algorithm is either a string or a function that takes two vectors as input and computes their distance. An example of such a function is sklearn.metrics.pairwise.euclidean_distances (see the scikit-learn repository)
  • threshold is a numeric value. All pairs of vectors whose distance is less than this value will be considered a single document; the first in order of appearance in the Pandas Series will be kept.

Task:
Drop all duplicates from the given Pandas Series and return a cleaned version of it.

TODO:
It would be interesting to drop_duplicates from a DataFrame, specifying which columns to consider (as done in Pandas).
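A minimal sketch of the proposed interface might look as follows. This is not a settled API: the TF-IDF default, the greedy keep-first loop, and the threshold value are all assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances

def drop_duplicates(s, representation=None,
                    distance_algorithm=euclidean_distances, threshold=1.0):
    """Hypothetical sketch: vectorize the texts (TF-IDF as a stand-in
    default), compute pairwise distances, and keep only the first
    document of each near-duplicate group."""
    if representation is None:
        vectors = TfidfVectorizer().fit_transform(s).toarray()
    else:
        vectors = representation(s)
    dist = distance_algorithm(vectors)
    keep = []
    for i in range(len(s)):
        # Keep document i only if it is not too close to an already-kept one.
        if all(dist[i, j] >= threshold for j in keep):
            keep.append(i)
    return s.iloc[keep]
```

Note that TfidfVectorizer L2-normalizes its output, so identical texts sit at distance 0 and disjoint texts at sqrt(2), which makes a threshold around 1.0 a workable placeholder.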

@jbesomi jbesomi added the enhancement New feature or request label Apr 26, 2020
selimelawwa (Contributor) commented
@jbesomi Should it check line by line and remove a line if it is a duplicate?
Or should it not remove anything and only report that duplicates exist?


jbesomi commented May 16, 2020

The idea here is to compare long documents and find those that are too similar to each other; in that case, the documents are probably duplicates. There are many applications for this, for instance detecting plagiarism in papers.

A naive approach is to apply TF-IDF and look at the distance between vectors.
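The naive TF-IDF approach can be sketched in a few lines. The example sentences and the 0.8 similarity cutoff are made up for illustration; cosine similarity is used here in place of a raw distance, since it is the more common choice for TF-IDF vectors.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = pd.Series([
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumped over the lazy dog",   # near-duplicate
    "An entirely unrelated sentence about databases",
])

# Vectorize with TF-IDF and compute the pairwise cosine-similarity matrix.
vectors = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(vectors)

# Pairs above a chosen cutoff are flagged as likely duplicates.
dup_pairs = [(i, j)
             for i in range(len(docs))
             for j in range(i + 1, len(docs))
             if sim[i, j] > 0.8]
```

Here only the first two sentences, which differ by a single word, end up flagged as a duplicate pair.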


igponce commented Jul 8, 2020

I suggest having several methods for handling duplicated content.

In the very simplest form, you might just need to check against a hash (SHA-1, for instance) to be sure you don't have exact duplicates (ok, this might be a preprocessing job).
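The exact-duplicate case via hashing could look like this (function name is hypothetical; pandas' duplicated() does the keep-first bookkeeping):

```python
import hashlib
import pandas as pd

def drop_exact_duplicates(s):
    """Keep the first occurrence of each text, comparing SHA-1 digests."""
    hashes = s.map(lambda text: hashlib.sha1(text.encode("utf-8")).hexdigest())
    # duplicated() marks every repeat after the first; invert to keep firsts.
    return s[~hashes.duplicated()]
```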

The interface might look like Pandas.Series.unique() but specifying a method / way to do the deduplication: unique(method='hash | jaccard | etc.', threshold=xx).


jbesomi commented Jul 8, 2020

Hey @igponce,

Exactly, the interface would look like hero.unique(df['text']).

A simple-yet-powerful solution is to simply compute a good representation of each text and remove documents that have very similar vectors.

Right, as you point out, the function will take a threshold argument. We will need to run some tests and pick a good default; it will largely depend on the underlying algorithm.

Would you be interested in implementing this solution? Jaccard might work as well, but it's easy to do better by using word vectors instead of just counting.

Food for thought: what if the input must already be a representation? That would be an even better solution. In this case, the arguments might be the distance function as well as the threshold parameter.
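That representation-first variant reduces to a small, text-agnostic helper. This is a sketch under the assumptions above (the function name, the euclidean default, and the 0.5 threshold are all placeholders):

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def drop_duplicate_vectors(vectors, distance=euclidean_distances, threshold=0.5):
    """Return indices of rows to keep, given precomputed document vectors.

    A row is dropped when it lies within `threshold` of an earlier kept row,
    so the first occurrence of each near-duplicate group survives.
    """
    dist = distance(vectors)
    keep = []
    for i in range(len(vectors)):
        if all(dist[i, j] >= threshold for j in keep):
            keep.append(i)
    return keep
```

The caller then maps the kept indices back onto the original Series, which keeps the deduplication logic independent of how the vectors were produced (TF-IDF, Flair embeddings, etc.).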

@jbesomi jbesomi changed the title Duplicates Add remove_duplicates Jul 8, 2020
@jbesomi jbesomi changed the title Add remove_duplicates Add drop_duplicates Jul 8, 2020
@jbesomi jbesomi self-assigned this Jul 8, 2020
@mk2510 mk2510 linked a pull request Aug 11, 2020 that will close this issue