
Extracted topics make no sense; might have something to do with unicodes #132

Open
hedgy123 opened this issue Sep 26, 2017 · 5 comments

@hedgy123

Hi,

I've just installed the latest version of textacy in python 2.7 on a Mac. I am trying to extract topics from a set of comments that do have quite a few non-ASCII characters. The topics I am getting make no sense.

Here's what's going on. I create a corpus of comments like this:

    corpus = textacy.Corpus('en', texts=the_data)

This creates a Corpus(3118 docs; 71018 tokens). If I print out the first three documents in the corpus, they look normal:

    [Doc(45 tokens; "verrrrry slow pharmacy staff-pharmacist was wai..."),
     Doc(17 tokens; "prices could be a bit lower. service desk could..."),
     Doc(11 tokens; "i got what i wanted at the price i wanted.")]

Then:

    vectorizer = textacy.Vectorizer(weighting='tfidf', normalize=True, smooth_idf=True,
                                    min_df=2, max_df=0.95)
    doc_term_matrix = vectorizer.fit_transform(
        (doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True)
         for doc in corpus))

    # initialize and train topic model
    model = textacy.tm.TopicModel('nmf', n_topics=10)
    model.fit(doc_term_matrix)
    doc_topic_matrix = model.transform(doc_term_matrix)

    for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
        print('topic', topic_idx, ':', '   '.join(top_terms))

And that's where I get back "topics" that make no sense:

    (u'topic', 0, u':', u"be   's   p.m.   -PRON-   because   will   would   have   not")
    (u'topic', 1, u':', u"not   p.m.   because   's   -PRON-   will   would   have   be")
    (u'topic', 2, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
    (u'topic', 3, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
    (u'topic', 4, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
    (u'topic', 5, u':', u"have   's   p.m.   -PRON-   because   will   would   not   be")
    (u'topic', 6, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
    (u'topic', 7, u':', u"will   's   p.m.   -PRON-   because   would   have   not   be")
    (u'topic', 8, u':', u"would   's   p.m.   -PRON-   because   will   have   not   be")
    (u'topic', 9, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")

Somehow, the fact that everything comes up with a u prefix seems to indicate to me that unicode handling is potentially messing things up, but I am not sure how to fix that. The printed corpus seemed perfectly fine.

Could you please help? Thanks a lot!

@bdewilde
Collaborator

Hey @hedgy123, it's hard for me to tell what's going wrong here, but since your code looks correct, I'm guessing the garbage topics result from some combination of problems with the data, the term normalization, and the parameters of the topic model being trained.

Here are a few things to try:

  1. Confirm that there aren't duplicates in your training data, since those have been known to degrade topic model outputs.
  2. Don't lemmatize your terms in doc.to_terms_list(), by specifying either normalize='lower' or normalize=False.
  3. Try a different model type, e.g. 'lda' or 'lsa'. Try varying your topic model's n_topics, both higher and lower. Try increasing max_iter, in case the model is simply failing to converge. (A rough sketch combining all three suggestions follows below.)
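
Putting those together, an untested sketch that reuses your variable names; the exact normalize kwarg and the pass-through of max_iter to the underlying scikit-learn model may vary across textacy versions:

    import textacy
    from collections import OrderedDict

    # 1. drop exact-duplicate comments before building the corpus
    unique_texts = list(OrderedDict.fromkeys(the_data))
    corpus = textacy.Corpus('en', texts=unique_texts)

    # same vectorizer settings as in the original report
    vectorizer = textacy.Vectorizer(weighting='tfidf', normalize=True, smooth_idf=True,
                                    min_df=2, max_df=0.95)

    # 2. keep lowercased surface forms instead of lemmas
    doc_term_matrix = vectorizer.fit_transform(
        (doc.to_terms_list(ngrams=1, named_entities=True,
                           normalize='lower', as_strings=True)
         for doc in corpus))

    # 3. try a different model type, more topics, and more iterations
    model = textacy.tm.TopicModel('lda', n_topics=20, max_iter=500)
    model.fit(doc_term_matrix)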

If none of that works, I'd assume that either your corpus isn't conducive to topic modeling (ugh for you) or there's a bug somewhere in textacy (ugh for me). Please let me know how your experiments go!

@LeonardoReyes

LeonardoReyes commented Sep 27, 2017

Off the top of my head, I think it might be an escaping/formatting issue related to ' and ", because of structures like (u'topic', 0, u':', u"be 's ...

It's worth trying to escape those characters properly, or removing them altogether from your raw data before pushing it to doc.to_terms_list().

This might help to escape them if you want to keep the punctuation: https://stackoverflow.com/questions/18935754/how-to-escape-special-characters-of-a-string-with-single-backslashes
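
For instance, a quick sketch (assuming the_data is a list of unicode strings) that simply removes straight and curly quote characters before building the corpus:

    # -*- coding: utf-8 -*-
    import re

    import textacy

    # strip straight and curly quote characters from each raw comment
    QUOTES_RE = re.compile(u'[\'"\u2018\u2019\u201c\u201d]')
    cleaned_data = [QUOTES_RE.sub(u'', text) for text in the_data]
    corpus = textacy.Corpus('en', texts=cleaned_data)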

@anamariakantar

@hedgy123 I get the same issue: the topics do not make sense. Did you figure out what the problem was?

@lyons422

lyons422 commented Oct 4, 2017

Ran into the same issue here 🤔

@bdewilde
Collaborator

bdewilde commented Oct 5, 2017

Okay, sounds like I should confirm that the topic model behavior is as expected... I've been punting on major textacy development while I wait for the official spacy v2 release, but this issue is probably independent of that. Will let y'all know if I find anything.
