Inconsistent skipping behavior in TextReuseCorpus #90

Open
tylerandrewscott opened this issue Jan 17, 2020 · 3 comments
Comments

@tylerandrewscott

I am encountering an issue with the TextReuseCorpus function when I pass in a vector of texts (using the "text =" argument). I (1) receive warnings that texts were skipped for insufficient length, even though the character strings should be long enough; and (2) get a different number of skip warnings on each run. I am reading in a large vector (>300,000) of texts, ranging from 155 to 9,900 characters, and usually 30k to 150k are skipped for being too short. When I take those same skipped strings and run TextReuseCorpus on them again, they are processed without being skipped. Perhaps I'm simply doing something wrong?
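
One way to check point (1) independently of the package is to count the words in each text directly. The following is a minimal base R sketch, not the package's own logic: `texts`, `count_words`, and `min_words` are hypothetical names, and the threshold is an assumption chosen for illustration, since the real cutoff depends on the tokenizer and its arguments.

```r
# Hypothetical stand-in for the real vector of >300,000 texts.
texts <- c(
  a = "the quick brown fox jumps over the lazy dog near the river bank",
  b = "too short",
  c = "another reasonably long sentence with more than enough words in it"
)

# Count whitespace-delimited words in each text.
count_words <- function(x) {
  lengths(strsplit(trimws(x), "[[:space:]]+"))
}

# Assumed threshold for illustration only; the package's own cutoff
# depends on the tokenizer and its arguments.
min_words <- 5

word_counts <- count_words(texts)
too_short <- names(texts)[word_counts < min_words]

print(word_counts)
print(too_short)
```

If a text flagged as skipped by the corpus constructor is well above the threshold here, that would support the suspicion that the skip is spurious.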

@lmullen
Member

lmullen commented Jan 17, 2020

Can you please provide a reproducible example?

@tylerandrewscott
Author

tylerandrewscott commented Jan 17, 2020 via email

@tylerandrewscott
Author

Following up: I can't seem to produce a reproducible example because the behavior is different every time, which I suspect might point to an issue outside the package. The behavior occurs when the number of texts is above a certain threshold. For instance, I consistently get skip notices when n = 50k, but never when n = 25k.
[Screenshot: Screen Shot 2020-01-16 at 10 08 35 PM]

However, I can run the same code twice at 50k and get different sets of skipped values:

[Screenshot: Screen Shot 2020-01-16 at 10 27 36 PM]
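
Since the screenshots are not reproducible as text, the comparison of two runs could be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the code from the screenshots: the synthetic `texts` vector and the helper `build_and_list_skipped` are hypothetical, the corpus size is far smaller than the 50k where the behavior was observed, and it assumes `skipped()` returns the names of skipped documents.

```r
library(textreuse)

# Synthetic texts, for illustration only: each has 20 one-letter words,
# so none should be too short for 5-word n-grams.
set.seed(90)
make_text <- function() {
  paste(sample(letters, 20, replace = TRUE), collapse = " ")
}
texts <- vapply(seq_len(1000), function(i) make_text(), character(1))
names(texts) <- paste0("doc_", seq_along(texts))

# Hypothetical helper: build a corpus and return the skipped document names.
build_and_list_skipped <- function(x) {
  corpus <- suppressWarnings(
    TextReuseCorpus(text = x, tokenizer = tokenize_ngrams, n = 5,
                    progress = FALSE)
  )
  skipped(corpus)
}

run1 <- build_and_list_skipped(texts)
run2 <- build_and_list_skipped(texts)

# Documents skipped in one run but not the other.
setdiff(run1, run2)
setdiff(run2, run1)
```

Non-empty results from either `setdiff()` call on identical input would document the inconsistency in a form that can be pasted into the thread.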

Here is the session info:
[Screenshot: Screen Shot 2020-01-16 at 10 08 16 PM]
