Assignment 1.3 - Rare words #82

daviddao · 2015-07-17T13:54:28Z

Assignment 1.3 states: "This will load the data in a bag-of-words representation where rare words (occurring less than 5 times in the training data) are removed". However, when I sum the word occurrences over the provided training data, loaded with

scr = srs.SentimentCorpus("books")

I get words that do not appear in the training data at all (so they occur fewer than 5 times):

>> scr.train_X.sum(0)
[..., 0.0, ...]


negrinho · 2015-07-17T14:15:18Z

Maybe the index was built using the development data as well. We will take a look.

ChristopherBrix · 2019-07-07T17:08:53Z

Yes, the whole corpus (training + dev) is used to discard rare words. This is because training and dev are not separated until after this filtering is performed.

>> srs.SentimentCorpus('books').X.sum(0)
[   5.  221. 1639. ...    5.    5.    6.]
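
To illustrate why a column of train_X can sum to zero, here is a minimal toy sketch in plain NumPy. It does not use the lxmls code: the count matrix, the 8/2 train/dev split, and all variable names are made up for illustration, and only the threshold of 5 comes from the assignment text.

```python
import numpy as np

# Hypothetical toy corpus: 10 documents x 4 words (counts per document).
X = np.zeros((10, 4))
X[:, 0] = 1          # word 0: once in every document (total 10)
X[8, 1] = 3          # word 1: only in the last two documents (total 5)
X[9, 1] = 2
X[0, 2] = 2          # word 2: rare (total 2), should be discarded
X[0:4, 3] = 1        # word 3: 4 occurrences in the first documents...
X[8, 3] = 2          # ...plus 2 more in a later one (total 6)

# Step 1: discard rare words using counts over the WHOLE corpus.
min_count = 5
keep = X.sum(0) >= min_count
X_kept = X[:, keep]

# Step 2: only now split into training and dev (assumed 8/2 split).
train_X, dev_X = X_kept[:8], X_kept[8:]

print(X_kept.sum(0))   # every kept word has at least 5 occurrences overall
print(train_X.sum(0))  # but a kept word can have 0 occurrences in training
```

In this toy case word 1 survives the filter because of its dev occurrences alone, so its training column sums to zero, which matches the behaviour reported above.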

ramon-astudillo added this to To Do in lxmls2018 Apr 8, 2018

ramon-astudillo added this to TODO in lxmls 2019 Mar 22, 2019
