
How to train/modify collocation model with existing (ngram) dictionary? (question) #224

Open
manuelbickel opened this issue Dec 10, 2017 · 6 comments

Comments

@manuelbickel
Contributor

manuelbickel commented Dec 10, 2017

Dear Dmitriy,

thank you again for solving issue #218 concerning replacement of terms by multiple synonyms. I now have a question about how to best incorporate dictionaries that include information on ngrams/collocations, e.g., city names. A standard solution would be to simply replace all matched patterns in the text by the dashed_version_of_patterns, e.g., via stri_replace_all. However, this is slow for large corpora, and I am interested in how you would solve this task in text2vec.
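
For reference, the slow baseline I mean looks roughly like this (a minimal sketch; object names are just for illustration):

library(stringi)

txt <- c("new york city is big", "san francisco is foggy")
patterns <- c("new york city", "san francisco")
replacements <- gsub(" ", "_", patterns, fixed = TRUE)

#apply every pattern/replacement pair to every document in one call;
#fine for small data, but scales poorly with corpus and dictionary size
stri_replace_all_fixed(txt, patterns, replacements, vectorize_all = FALSE)
# [1] "new_york_city is big"   "san_francisco is foggy"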

As a workaround, I trained a collocation model on a modified dictionary in which all terms are bound by dashes except for the first unigram, so that the model sees exactly one prefix and one suffix per entry, e.g., "new york_city". Please see the code example below.

I was wondering if you would incorporate such dictionary information differently, e.g., without training a model, perhaps by manually defining collocation_stat or similar.

I would appreciate your thoughts. Thank you in advance.

library(text2vec)

txt <- c("new york city", "new york city district in new york", "san francisco")

dict_ngrams <- c("new york", "san francisco", "new york city", "the state of new york city", "city district")

#modify dict for limiting to one prefix/suffix
dict_ngrams_dashed <- gsub(" ", "_", dict_ngrams)
dict_ngrams_dashed <- sub("_", " ", dict_ngrams_dashed)
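#for illustration, the modified dictionary now keeps exactly one space per entry:
# dict_ngrams_dashed
# [1] "new york"                   "san francisco"
# [3] "new york_city"              "the state_of_new_york_city"
# [5] "city district"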

#train model based on dict (all thresholds fully permissive so every observed pair is accepted)
cc_model <- Collocations$new(collocation_count_min = 1
                             ,pmi_min = 0
                             ,gensim_min = 0
                             ,lfmd_min = -Inf
)

it_dict_dashed <- itoken(dict_ngrams_dashed, progressbar = FALSE)
cc_model$partial_fit(it_dict_dashed)
# cc_model$collocation_stat
# prefix                 suffix n_i n_j n_ij      pmi      lfmd gensim rank_pmi rank_lfmd rank_gensim
# 1:   city               district   1   1    1 3.321928 -3.321928      0        1         1           3
# 2:    the state_of_new_york_city   1   1    1 3.321928 -3.321928      0        2         2           4
# 3:    san              francisco   1   1    1 3.321928 -3.321928      0        3         3           5
# 4:    new              york_city   2   1    1 2.321928 -4.321928      0        4         4           1
# 5:    new                   york   2   1    1 2.321928 -4.321928      0        5         5           2

it_txt <- itoken(txt, progressbar = FALSE)
it_txt_cc <- cc_model$transform(it_txt)

v <- create_vocabulary(it_txt_cc)
#side effect of cc model that might be interesting: "city_district" is ranked lower than "new_york_city"
#and thus the latter is preferred over the former, see below
v
# Number of docs: 3 
# 0 stopwords:  ... 
# ngram_min = 1; ngram_max = 1 
# Vocabulary: 
#   term term_count doc_count
# 1:      district          1         1
# 2:            in          1         1
# 3:      new_york          1         1
# 4: san_francisco          1         1
# 5: new_york_city          2         2
@dselivanov
Owner

The current flow is that if you want "new york city" to be identified/collapsed as a collocation, then there should be pairs `c("new", "york")` and `c("new_york", "city")` (or `c("york", "city")` and `c("new", "york_city")`) among the rows of the cc_model$collocation_stat table.
There is no such functionality at the moment in the package. However, it should not be that hard to add. If you want to give it a try, I will definitely help.
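
For illustration, the left-to-right decomposition of an n-gram into such (prefix, suffix) rows could look like this (just a sketch; ngram_pairs is not an existing text2vec function):

#decompose an ngram into the successive (prefix, suffix) pairs that would
#have to appear in collocation_stat for a left-to-right collapse
ngram_pairs <- function(ngram) {
  tokens <- strsplit(ngram, " ", fixed = TRUE)[[1]]
  lapply(seq_len(length(tokens) - 1L), function(i) {
    c(prefix = paste(tokens[1:i], collapse = "_"), suffix = tokens[i + 1L])
  })
}

ngram_pairs("new york city")
# [[1]] prefix "new"      suffix "york"
# [[2]] prefix "new_york" suffix "city"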

@manuelbickel
Contributor Author

manuelbickel commented Dec 21, 2017

Thank you for your reply. So I was halfway on the right track by introducing dashes into the ngrams of the dictionary. Something like cc_model$collocation_stat <- cc_model$MY_collocation_stat is currently not possible from the R interface; at least, I could not figure out how. From my understanding, the main task would be identifying the leading and trailing ngrams in the dictionary, binding them with dashes, adding them to the dict, and setting collocation_stat internally on this basis, e.g., around line 185 in model_Collocations.R (with a condition introduced somewhere like if (!is.null(cc_dictionary)) ...)?
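
Purely for illustration, if such an assignment were possible, the table to set would need the core columns shown in collocation_stat above (dummy counts/scores here; data.table is what the package uses for this field):

library(data.table)

#hypothetical table for collapsing "new york city" left to right;
#currently there is no supported way to assign this from the R interface
MY_collocation_stat <- data.table(
  prefix = c("new", "new_york"),
  suffix = c("york", "city"),
  n_i = 1L, n_j = 1L, n_ij = 1L,
  pmi = 0, lfmd = 0, gensim = 0
)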

So far, I have come up with a workaround (see code below) that identifies trailing ngrams in the dictionary and does a stepwise iteration over N (from low to high) to transform the main iterator for the documents. I have used this stepwise iteration because, for ngrams longer than trigrams, the inner ngrams should presumably not be recognized as collocations (e.g., in "the state of new york city" the collocation "the state" would usually not be desired). This workaround still requires training of cc_models, which, I guess, could probably be skipped with the right implementation.

I would still have to think over the order in which collocations should be bound, but I would be happy to work on a solution for this problem. Since I am not a computer/data scientist, my programming skills are not of a mature professional nature, so I would be glad if you could provide some guidance/pointers/critique to keep me on the right track.

library(text2vec)
library(stringi)
dict_cc <- c("new york", "new york city", "the state of new york city")

docs <- c("new york"
          , "new york city"
          , "the state of new york city"
          , "new york state of mind"
          ,"the state of new hampshire"
          )

iterator_docs <- itoken(docs)

#helper function to find trailing ngrams
#for binding collocations from the end to the front of the string
find_trailing_ngrams <- function(dict, unlist_result = TRUE) {
  dict_tok <- tokenizers::tokenize_words(dict)
  trailing_ngrams <- lapply(dict_tok, function(x) {
    lx <- length(x)
    if (lx > 2) { #trailing (inner) ngrams only occur in trigrams or longer
      trailing_ngrams <- sapply( 2:(lx-1), function(y) {
        paste(x[y:lx], collapse = " ")
      })
    } else { #unigrams and bigrams are returned as NULL
      NULL
    }
  })
  
  if (unlist_result == TRUE) {
    return(unlist(trailing_ngrams, recursive = TRUE))
  } else {
    return(trailing_ngrams)
  }
}


dict_cc <- unique(c(dict_cc, find_trailing_ngrams(dict_cc))) %>% 
           .[order(stringi::stri_count_fixed(. , " "), decreasing = T)] %>% 
            split(., stringi::stri_count_fixed(., " "))

# $`1`
# [1] "new york"  "york city"
# $`2`
# [1] "new york city"
# $`3`
# [1] "of new york city"
# $`4`
# [1] "state of new york city"
# $`5`
# [1] "the state of new york city"

#iterate through dict from lowest to highest n
#and transform iterator_docs stepwise with cc_model
for (dict_cc_n in dict_cc) {
  #dash ngrams so that collocations are learned as-is from the dictionary:
  #join all words with "_", then turn the underscore before the last word back
  #into a space so each entry consists of exactly one prefix and one suffix
  dict_cc_n <-  gsub(" ", "_", dict_cc_n) %>%
                gsub("_(?=[a-z]+$)", " ", ., perl = T)
  
  #set up basic model for collocations based on *modified* dictionary with ngrams
  cc_model_dict <- Collocations$new(collocation_count_min = 1
                                    ,pmi_min = 0
                                    ,gensim_min = 0
                                    ,lfmd_min = -Inf
  )
  
  iterator_dict_cc <- itoken(dict_cc_n, progressbar = FALSE)
  
  cc_model_dict$partial_fit(iterator_dict_cc)
  
  iterator_docs <- cc_model_dict$transform(iterator_docs)
}

create_vocabulary(iterator_docs)
# term term_count doc_count
# 1:                  hampshire          1         1
# 2:              new_york_city          1         1
# 3: the_state_of_new_york_city          1         1
# 4:                        the          1         1
# 5:                        new          1         1
# 6:                       mind          1         1
# 7:                      state          2         2
# 8:                         of          2         2
# 9:                   new_york          2         2

@dselivanov
Copy link
Owner

Sorry for not being very responsive these days, a lot of things going on... I will try to spend some time tomorrow to clarify how we can proceed with this issue.

@manuelbickel
Contributor Author

That's fine; I guess you have more important/complex problems to solve than some dictionary lookups. For the time being, I think I can use my workaround, but as soon as you find some time, I am happy to provide support where possible (or to the extent my limited skills allow) to establish a more elegant/robust solution.

@manuelbickel
Contributor Author

Just realized that I had forgotten to insert the helper function that finds trailing ngrams into the code, sorry for that. I have updated the code in my comment above accordingly, so you have my latest attempt as soon as you find time for a further look.

@manuelbickel
Contributor Author

This is an update on this issue, though not yet a solution.

As per your first comment in this thread, I have created a collocation_stat from a cc_dictionary (here only the single 5-gram "state of new york city" for testing) by generating all combinations of dashed versions of its ngrams (I just used the next best regex approach from SO to set up the combinations, certainly not the fastest option), training a cc_model on this collection, and transforming the iterator of the documents. (I hope this resembles what you had in mind.)

This approach correctly trains the collocation "state_of_new_york_city" as present in the dictionary. However, it also trains word combinations embedded within longer phrases, such as "state_of", which would not be a desired output. To avoid this behaviour, I can only think of some kind of iterative training of collocations (e.g., from low to high ngrams) to improve the results. Let me know if you have any other ideas on how to approach this further (as far as you have time).
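
One pragmatic idea for cleaning up such undesired embedded collocations afterwards (a rough sketch that post-processes tokenized documents rather than fixing the model itself, with dict_cc as defined in the code below): split every learned "_"-token back into unigrams unless it is a full dictionary entry.

#whitelist of dashed versions of the full dictionary entries
dict_cc_dashed <- gsub(" ", "_", dict_cc, fixed = TRUE)

split_unlisted_collocations <- function(tokens, whitelist) {
  unlist(lapply(tokens, function(tok) {
    if (grepl("_", tok, fixed = TRUE) && !(tok %in% whitelist)) {
      strsplit(tok, "_", fixed = TRUE)[[1]]
    } else {
      tok
    }
  }))
}

split_unlisted_collocations(c("the", "state_of", "state_of_new_york_city"),
                            dict_cc_dashed)
# [1] "the"                    "state"                  "of"
# [4] "state_of_new_york_city"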

library(text2vec)
library(stringi)
packageVersion("text2vec")
#[1] ‘0.5.0.9’
dict_cc <- c("state of new york city")
docs <- c("new york"
          , "new york city"
          , "the state of new york city"
          , "new york state of mind"
          ,"the state of new hampshire"
)
#initial iterator to be transformed
iterator_docs <- itoken(docs, progressbar = FALSE)

ngrams_dict_cc <- create_vocabulary(itoken(dict_cc, progressbar = FALSE)
                                    ,ngram = c(2L, (max(stringi::stri_count_fixed(dict_cc, " "))+1))
                                    )
# term term_count doc_count
# 1: state_of_new_york_city          1         1
# 2:                 of_new          1         1
# 3:               state_of          1         1
# 4:          new_york_city          1         1
# 5:           state_of_new          1         1
# 6:              york_city          1         1
# 7:               new_york          1         1
# 8:      state_of_new_york          1         1
# 9:       of_new_york_city          1         1
# 10:            of_new_york          1         1

#create combinations of each ngram with each dash in the ngram replaced by a space individually
#(the regex skips the first x underscores and replaces the (x+1)-th one)
ngrams_dict_cc <- sapply(c(0:max(stringi::stri_count_fixed(dict_cc, " "))), function(x) {
    #https://stackoverflow.com/questions/41989775/replacing-the-ith-occurrence-of-h-in-a-string-in-r
    sub(paste0("((?:[^_]*_){", x,"}[^_]*)_"), "\\1 ",   ngrams_dict_cc$term)
  })
ngrams_dict_cc <- unique(as.character(ngrams_dict_cc))
# [1] "state of_new_york_city" "of new"                 "state of"               "new york_city"         
# [5] "state of_new"           "york city"              "new york"               "state of_new_york"     
# [9] "of new_york_city"       "of new_york"            "state_of new_york_city" "of_new"                
# [13] "state_of"               "new_york city"          "state_of new"           "york_city"             
# [17] "new_york"               "state_of new_york"      "of_new york_city"       "of_new york"           
# [21] "state_of_new york_city" "new_york_city"          "state_of_new"           "state_of_new york"     
# [25] "of_new_york city"       "of_new_york"            "state_of_new_york city" "state_of_new_york"     
#[29] "of_new_york_city"       "state_of_new_york_city"

cc_model_ngrams_dict_cc <- Collocations$new(collocation_count_min = 1
                                            ,pmi_min = 0
                                            ,gensim_min = 0
                                            ,lfmd_min = -Inf
                                            ,log_lik_min = 0
                                            )
iterator_ngrams_dict_cc <- itoken(ngrams_dict_cc, progressbar = FALSE)
cc_model_ngrams_dict_cc$partial_fit(iterator_ngrams_dict_cc)
cc_model_ngrams_dict_cc$collocation_stat[,1:2]
# prefix           suffix
# 1: state_of_new_york             city
# 2:             state of_new_york_city
# 3:      state_of_new             york
# 4:       of_new_york             city
# 5:          state_of    new_york_city
# 6:             state      of_new_york
# 7:                of    new_york_city
# 8:      state_of_new        york_city
# 9:          state_of              new
# 10:            of_new             york
# 11:               new             york
# 12:                of         new_york
# 13:            of_new        york_city
# 14:             state           of_new
# 15:              york             city
# 16:                of              new
# 17:          new_york             city
# 18:               new        york_city
# 19:          state_of         new_york
# 20:             state               of

iterator_docs <- cc_model_ngrams_dict_cc$transform(iterator_docs)  
create_vocabulary(iterator_docs)
# term term_count doc_count
# 1: state_of_new_york_city          1         1
# 2:              hampshire          1         1
# 3:               state_of          1         1
# 4:          new_york_city          1         1
# 5:           state_of_new          1         1
# 6:                   mind          1         1
# 7:               new_york          2         2
# 8:                    the          2         2
