Assignment 1.3 - Rare words #82

Open
daviddao opened this issue Jul 17, 2015 · 2 comments

@daviddao

Assignment 1.3 states: "This will load the data in a bag-of-words representation where rare words (occurring less than 5 times in the training data) are removed". However, when I sum the word occurrences in the provided training dataset, loaded with

scr = srs.SentimentCorpus("books")

I get words that do not appear in the training data at all (that is, they occur fewer than 5 times):

>> scr.train_X.sum(0)
[..., 0.0, ...]
@negrinho

Maybe the index was built using the development data as well. We will take a look.

@ramon-astudillo added this to To Do in lxmls2018 on Apr 8, 2018
@ramon-astudillo added this to TODO in lxmls 2019 on Mar 22, 2019
@ChristopherBrix
Contributor

Yes, the whole corpus (training + dev) is used to discard rare words. This is because training and dev are not separated until after this filtering is performed.

>> srs.SentimentCorpus('books').X.sum(0)
[   5.  221. 1639. ...    5.    5.    6.]
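
To illustrate the effect described above, here is a minimal sketch using a made-up toy matrix rather than the actual "books" corpus. The names `X`, `min_count`, and the two-way split are assumptions for illustration only and are not taken from the toolkit's code. It filters rare words on the full corpus first and splits afterwards, which is enough to produce a zero count in the training portion.

```python
import numpy as np

# Toy bag-of-words matrix: rows are documents, columns are words.
# Made-up data for illustration, not the lxmls "books" corpus.
X = np.array([
    [3, 0, 1],
    [2, 0, 4],
    [0, 2, 1],
    [0, 3, 0],
], dtype=float)

min_count = 5  # threshold quoted in the assignment text

# Step 1: drop rare words using counts over the whole corpus (train + dev).
keep = X.sum(axis=0) >= min_count
X_filtered = X[:, keep]

# Step 2: only now split into training and development sets.
# Hypothetical split: first two documents are train, last two are dev.
train_X, dev_X = X_filtered[:2], X_filtered[2:]

print(X_filtered.sum(axis=0))  # counts over the full corpus
print(train_X.sum(axis=0))     # counts over the training split only
```

In this toy case every word survives the filter because its full-corpus count is at least 5, yet the second word occurs only in the development documents, so its training count is 0.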
