
Extracted topics make no sense; might have something to do with unicodes #132

Open
hedgy123 opened this issue Sep 26, 2017 · 5 comments

@hedgy123

Hi,

I've just installed the latest version of textacy in python 2.7 on a Mac. I am trying to extract topics from a set of comments that do have quite a few non-ASCII characters. The topics I am getting make no sense.

Here's what's going on. I create a corpus of comments like this:

    corpus = textacy.Corpus('en', texts=the_data)

This creates a Corpus(3118 docs; 71018 tokens). If I print out the first three documents in the corpus, they look normal:

    [Doc(45 tokens; "verrrrry slow pharmacy staff-pharmacist was wai..."),
     Doc(17 tokens; "prices could be a bit lower. service desk could..."),
     Doc(11 tokens; "i got what i wanted at the price i wanted.")]

Then:

    vectorizer = textacy.Vectorizer(weighting='tfidf', normalize=True, smooth_idf=True,
                                    min_df=2, max_df=0.95)
    doc_term_matrix = vectorizer.fit_transform(
        (doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True)
         for doc in corpus))

    # initialize and train topic model
    model = textacy.tm.TopicModel('nmf', n_topics=10)
    model.fit(doc_term_matrix)
    doc_topic_matrix = model.transform(doc_term_matrix)

    for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
        print('topic', topic_idx, ':', '   '.join(top_terms))

And that's where I get back "topics" that make no sense:

    (u'topic', 0, u':', u"be   's   p.m.   -PRON-   because   will   would   have   not")
    (u'topic', 1, u':', u"not   p.m.   because   's   -PRON-   will   would   have   be")
    (u'topic', 2, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
    (u'topic', 3, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
    (u'topic', 4, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
    (u'topic', 5, u':', u"have   's   p.m.   -PRON-   because   will   would   not   be")
    (u'topic', 6, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")
    (u'topic', 7, u':', u"will   's   p.m.   -PRON-   because   would   have   not   be")
    (u'topic', 8, u':', u"would   's   p.m.   -PRON-   because   will   have   not   be")
    (u'topic', 9, u':', u"'s   p.m.   -PRON-   because   will   would   have   not   be")

Somehow, the fact that everything comes up with a u prefix seems to indicate to me that unicode handling is potentially messing things up, but I am not sure how to fix that. The printed corpus seemed perfectly fine.

Could you please help? Thanks a lot!

@bdewilde
Collaborator

Hey @hedgy123, it's hard for me to tell what's going wrong here, but since your code looks correct, I'm guessing the garbage topics result from some combination of problems with the data, the term normalization, and the parameters of the topic model being trained.

Here are a few things to try:

  1. Confirm that there aren't duplicates in your training data, since those have been known to degrade topic model outputs.
  2. Don't lemmatize your terms in doc.to_terms_list(), by specifying either normalize='lower' or normalize=False.
  3. Try a different model type, e.g. 'lda' or 'lsa'. Try varying your topic model's n_topics, both higher and lower. Try increasing max_iter, in case the model is simply failing to converge. (A rough sketch combining all three suggestions follows below.)
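
Putting those together, an untested sketch that reuses your variable names; the exact normalize kwarg and the pass-through of max_iter to the underlying scikit-learn model may vary across textacy versions:

    import textacy
    from collections import OrderedDict

    # 1. drop exact-duplicate comments before building the corpus
    unique_texts = list(OrderedDict.fromkeys(the_data))
    corpus = textacy.Corpus('en', texts=unique_texts)

    # same vectorizer settings as in the original report
    vectorizer = textacy.Vectorizer(weighting='tfidf', normalize=True, smooth_idf=True,
                                    min_df=2, max_df=0.95)

    # 2. keep lowercased surface forms instead of lemmas
    doc_term_matrix = vectorizer.fit_transform(
        (doc.to_terms_list(ngrams=1, named_entities=True,
                           normalize='lower', as_strings=True)
         for doc in corpus))

    # 3. try a different model type, more topics, and more iterations
    model = textacy.tm.TopicModel('lda', n_topics=20, max_iter=500)
    model.fit(doc_term_matrix)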

If none of that works, I'd assume that either your corpus isn't conducive to topic modeling (ugh for you) or there's a bug somewhere in textacy (ugh for me). Please let me know how your experiments go!

@LeonardoReyes

LeonardoReyes commented Sep 27, 2017

Off the top of my head, I think it might be an escaping/formatting issue related to ' and ", because of structures like (u'topic', 0, u':', u"be 's ...

It's worth trying to escape those characters properly, or removing them altogether from your raw data before pushing it to doc.to_terms_list().

This might help to escape them if you want to keep the punctuation: https://stackoverflow.com/questions/18935754/how-to-escape-special-characters-of-a-string-with-single-backslashes
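
For instance, a quick sketch (assuming the_data is a list of unicode strings) that simply removes straight and curly quote characters before building the corpus:

    # -*- coding: utf-8 -*-
    import re

    import textacy

    # strip straight and curly quote characters from each raw comment
    QUOTES_RE = re.compile(u'[\'"\u2018\u2019\u201c\u201d]')
    cleaned_data = [QUOTES_RE.sub(u'', text) for text in the_data]
    corpus = textacy.Corpus('en', texts=cleaned_data)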

@anamariakantar

@hedgy123 I get the same issue: the topics do not make sense. Did you figure out what the problem was?

@lyons422

lyons422 commented Oct 4, 2017

Ran into the same issue here 🤔

@bdewilde
Collaborator

bdewilde commented Oct 5, 2017

Okay, sounds like I should confirm that the topic model behavior is as expected... I've been punting on major textacy development while I wait for the official spacy v2 release, but this issue is probably independent of that. Will let y'all know if I find anything.
