Review verbose behaviours #2329

kbenoit · 2023-12-15T11:16:04Z

Some functions don't even have the verbose argument, e.g. tokens_split(). We should review the functions to see which need it, and review the behaviours when verbose = TRUE to make sure they are consistent.

Not sure anyone uses it much, but for the sake of consistency it's still worth reviewing.

koheiw · 2024-01-03T17:37:28Z

We tried to make verbose messages more consistent using message_select(). We could do something similar across methods for corpus, tokens and dfm objects.

quanteda/R/message.R

Lines 56 to 81 in 84ecce8

    
    message_select <- function(selection, nfeats, ndocs, nfeatspad = 0, ndocspad = 0) {
        catm(if (selection == "keep") "kept" else "removed", " ",
             format(nfeats, big.mark = ",", scientific = FALSE),
             " feature", if (nfeats != 1L) "s" else "", sep = "")
        if (ndocs > 0) {
            catm(" and ",
                 format(ndocs, big.mark = ",", scientific = FALSE),
                 " document", if (ndocs != 1L) "s" else "",
                 sep = "")
        }
        if ((nfeatspad + ndocspad) > 0) {
            catm(", padded ", sep = "")
        }
        if (nfeatspad > 0) {
            catm(format(nfeatspad, big.mark = ",", scientific = FALSE),
                 " feature", if (nfeatspad != 1L) "s" else "",
                 sep = "")
        }
        if (ndocspad > 0) {
            if (nfeatspad > 0) catm(" and ", sep = "")
            catm(format(ndocspad, big.mark = ",", scientific = FALSE),
                 " document", if (ndocspad != 1L) "s" else "",
                 sep = "")
        }
        catm("", appendLF = TRUE)
    }

However, it is not easy to provide detailed information about operations that are performed in C++. For example, the message below reports removing more features than actually exist. The 20,180 it reports is only the number of possible token sequences that the pattern could match (and that would be removed); we know the actual number of tokens removed (4,584) only after the operation.

require(quanteda)
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)
length(types(toks))
#> [1] 10090

toks2 <- tokens_remove(toks, phrase("a *"), verbose = TRUE)
#> removed 20,180 features
#> 
length(types(toks2))
#> [1] 9942

sum(ntoken(toks)) - sum(ntoken(toks2))
#> [1] 4584

Further, repeated use of types() is not a good idea because it triggers recompilation of tokens_xptr, which reduces the performance gain of the new objects.

The best approach would be to simplify the message so that it includes only the number of documents (and/or tokens), the type of operation (remove/keep, lookup, ngrams, etc.) and, maybe, a main parameter (e.g. pattern, dictionary, n). This would also make it easier to create a one-size-fits-all messaging function.
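
A rough sketch of what such a simplified message could look like, in base R only. The function name format_verbose() and the exact wording of the message are assumptions for illustration, not anything that exists in quanteda; the counts are taken from the inaugural corpus examples in this thread and describe the input object.

```r
# Hypothetical helper (not part of quanteda): builds a short verbose message
# from the operation name, the token and document counts, and, optionally,
# one main parameter.
format_verbose <- function(operation, ndoc, ntoken, param = NULL) {
    # Format a count with thousands separators and a pluralised unit
    fmt <- function(n, unit) {
        paste0(format(n, big.mark = ",", scientific = FALSE, trim = TRUE),
               " ", unit, if (n != 1) "s" else "")
    }
    msg <- sprintf("%s: %s in %s", operation,
                   fmt(ntoken, "token"), fmt(ndoc, "document"))
    if (!is.null(param)) {
        # param is assumed to be a named list of length one,
        # e.g. list(pattern = "a *")
        msg <- sprintf("%s (%s = %s)", msg, names(param)[1],
                       paste(deparse(param[[1]]), collapse = ""))
    }
    msg
}

message(format_verbose("tokens_remove()", 59, 151442, list(pattern = "a *")))
#> tokens_remove(): 151,442 tokens in 59 documents (pattern = "a *")
```

Returning the string rather than printing it leaves the choice between message() and cat() to the caller.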

koheiw · 2024-01-12T01:15:37Z

How about making message_tokens() and message_dfm() that can be used in all the methods?

require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)

stats_tokens <- function(x) {
    list(ndoc = ndoc(x),
         ntoken = sum(ntoken(x, remove_padding = TRUE)))
}

message_tokens <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d tokens (%d documents) to %d tokens (%d documents)",
                   operation, pre$ntoken, pre$ndoc, post$ntoken, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}

stats_dfm <- function(x) {
    list(ndoc = ndoc(x),
         nfeat = nfeat(dfm_remove(x, "")))
}

message_dfm <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d features (%d documents) to %d features (%d documents)",
                   operation, pre$nfeat, pre$ndoc, post$nfeat, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}

before <- stats_tokens(toks)
toks <- tokens_remove(toks, stopwords())
after <- stats_tokens(toks)
message_tokens("tokens_remove()", before, after)
#> tokens_remove(): from 151,442 tokens (59 documents) to 79,535 tokens (59 documents)

before <- stats_tokens(toks)
toks <- tokens_subset(toks, Year > 2000)
after <- stats_tokens(toks)
message_tokens("tokens_subset()", before, after)
#> tokens_subset(): from 79,535 tokens (59 documents) to 7,459 tokens (6 documents)

dfmt <- dfm(toks)

before <- stats_dfm(dfmt)
dfmt <- dfm_trim(dfmt, min_termfreq = 10)
after <- stats_dfm(dfmt)
message_dfm("dfm_trim()", before, after)
#> dfm_trim(): from 2,185 features (6 documents) to 104 features (6 documents)

Created on 2024-01-12 with reprex v2.0.2
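
To show how the before/after pattern above might hang off a verbose argument, here is a minimal sketch. The wrapper with_verbose() and the toy helpers are hypothetical, and a plain list of character vectors stands in for a tokens object so that the example runs without quanteda; inside the package the stats function would be something like stats_tokens().

```r
# Hypothetical wrapper (not part of quanteda): runs `fun` on `x` and, when
# verbose = TRUE, reports the statistics before and after the operation.
with_verbose <- function(operation, x, fun, stats, verbose = FALSE, ...) {
    before <- if (verbose) stats(x)  # skip counting unless a message is needed
    result <- fun(x, ...)
    if (verbose) {
        after <- stats(result)
        fmt <- function(n) format(n, big.mark = ",", scientific = FALSE, trim = TRUE)
        message(sprintf("%s: from %s tokens (%s documents) to %s tokens (%s documents)",
                        operation, fmt(before$ntoken), fmt(before$ndoc),
                        fmt(after$ntoken), fmt(after$ndoc)))
    }
    result
}

# Base R stand-in for a tokens object: a named list of character vectors
toy <- list(d1 = c("a", "b", "a", "c"), d2 = c("a", "d"))
toy_stats <- function(x) list(ndoc = length(x), ntoken = sum(lengths(x)))
drop_a <- function(x) lapply(x, function(doc) doc[doc != "a"])

toy2 <- with_verbose("drop_a()", toy, drop_a, toy_stats, verbose = TRUE)
#> drop_a(): from 6 tokens (2 documents) to 3 tokens (2 documents)
```

With verbose = FALSE the statistics are never computed, so the wrapper adds no counting cost to the default path.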

kbenoit · 2024-01-12T01:56:23Z

Makes sense to me!

kbenoit added this to the v4.1 milestone Dec 15, 2023

koheiw mentioned this issue Jan 4, 2024

Add remove_padding to ntoken() #2336

Merged
