Review verbose behaviours #2329

Open
kbenoit opened this issue Dec 15, 2023 · 3 comments
Collaborator

kbenoit commented Dec 15, 2023

Some functions don't even have the verbose argument, e.g. tokens_split(). We should review the functions to see which need it, and review the behaviours when verbose = TRUE to make sure they are consistent.

Not sure anyone uses it much, but for the sake of consistency it's still worth reviewing.
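
A first pass at the review could be automated. The sketch below is only an illustration: audit_argument() is a hypothetical helper, not part of quanteda, and it assumes that checking the formal arguments of exported functions is enough to see which ones lack verbose. It does not check S3 methods separately.

```r
# Hypothetical helper (not part of quanteda): list the exported functions of a
# package whose names match `prefix` and flag whether each has the argument `arg`.
audit_argument <- function(pkg, arg = "verbose", prefix = "^tokens_") {
    ns <- asNamespace(pkg)
    exports <- sort(grep(prefix, getNamespaceExports(ns), value = TRUE))
    # keep only functions (exports can also include data objects)
    is_fun <- vapply(exports, function(f) is.function(get(f, envir = ns)),
                     logical(1))
    exports <- exports[is_fun]
    # formals() returns NULL for primitives, so those are reported as FALSE
    has_arg <- vapply(exports,
                      function(f) arg %in% names(formals(get(f, envir = ns))),
                      logical(1))
    data.frame(fun = exports, has_arg = unname(has_arg),
               stringsAsFactors = FALSE)
}

# Only runs when quanteda is installed; prints the tokens_*() functions
# that have no `verbose` argument.
if (requireNamespace("quanteda", quietly = TRUE)) {
    print(subset(audit_argument("quanteda"), !has_arg))
}
```

Changing prefix to "^dfm_" or "^corpus_" would cover the other object classes in the same way.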

kbenoit added this to the v4.1 milestone Dec 15, 2023
Collaborator

koheiw commented Jan 3, 2024

We tried to make verbose messages more consistent using message_select(). We could do something similar across methods for corpus, tokens and dfm objects.

quanteda/R/message.R

Lines 56 to 81 in 84ecce8

message_select <- function(selection, nfeats, ndocs, nfeatspad = 0, ndocspad = 0) {
    catm(if (selection == "keep") "kept" else "removed", " ",
         format(nfeats, big.mark = ",", scientific = FALSE),
         " feature", if (nfeats != 1L) "s" else "", sep = "")
    if (ndocs > 0) {
        catm(" and ",
             format(ndocs, big.mark = ",", scientific = FALSE),
             " document", if (ndocs != 1L) "s" else "",
             sep = "")
    }
    if ((nfeatspad + ndocspad) > 0) {
        catm(", padded ", sep = "")
    }
    if (nfeatspad > 0) {
        catm(format(nfeatspad, big.mark = ",", scientific = FALSE),
             " feature", if (nfeatspad != 1L) "s" else "",
             sep = "")
    }
    if (ndocspad > 0) {
        if (nfeatspad > 0) catm(" and ", sep = "")
        catm(format(ndocspad, big.mark = ",", scientific = FALSE),
             " document", if (ndocspad != 1L) "s" else "",
             sep = "")
    }
    catm("", appendLF = TRUE)
}

However, it is not easy to report detailed information about operations that are performed in C++. For example, the message below claims to have removed more features than actually exist. The 20,180 is only the number of possible token sequences that the pattern could match (and remove); the actual number of tokens removed (4,584) is known only after the operation.

require(quanteda)
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)
length(types(toks))
#> [1] 10090

toks2 <- tokens_remove(toks, phrase("a *"), verbose = TRUE)
#> removed 20,180 features
#> 
length(types(toks2))
#> [1] 9942

sum(ntoken(toks)) - sum(ntoken(toks2))
#> [1] 4584

Further, the repeated use of types() is not a good idea because it triggers recompilation of tokens_xptr, reducing the new objects' performance gain.

The best approach would be to simplify the message so that it includes only the number of documents (and/or tokens), the type of operation (remove/keep, lookup, ngrams etc.) and, maybe, a main parameter (e.g. pattern, dictionary, n). This would also make it easier to create a one-size-fits-all messaging function.
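
As a rough illustration of such a simplified message, here is a minimal base R sketch. The name message_operation() and its arguments are hypothetical and only show the shape of the idea (operation, counts, one main parameter); it uses message() rather than quanteda's internal catm().

```r
# Hypothetical helper (not part of quanteda): one line per operation with the
# operation name, document/token counts and, optionally, one main parameter.
message_operation <- function(operation, ndoc, ntoken = NULL, param = NULL) {
    # format a count with thousands separators and pluralise the unit
    fmt <- function(n, unit) {
        paste0(format(n, big.mark = ",", scientific = FALSE), " ", unit,
               if (n != 1L) "s" else "")
    }
    parts <- fmt(ndoc, "document")
    if (!is.null(ntoken)) parts <- c(parts, fmt(ntoken, "token"))
    msg <- paste0(operation, ": ", paste(parts, collapse = ", "))
    # `param` is a one-element named list, e.g. list(pattern = "a *")
    if (!is.null(param)) {
        msg <- paste0(msg, " (", names(param)[1], " = ", param[[1]], ")")
    }
    message(msg)
    invisible(msg)
}

message_operation("tokens_remove()", ndoc = 59, ntoken = 151442,
                  param = list(pattern = "a *"))
#> tokens_remove(): 59 documents, 151,442 tokens (pattern = a *)
```

Because the counts are passed in by the caller, the helper never needs to call types() on the object itself.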

Collaborator

koheiw commented Jan 12, 2024

How about making message_tokens() and message_dfm() that can be used in all the methods?

require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens(data_corpus_inaugural, remove_numbers = TRUE)

stats_tokens <- function(x) {
    list(ndoc = ndoc(x),
         ntoken = sum(ntoken(x, remove_padding = TRUE)))
}

message_tokens <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d tokens (%d documents) to %d tokens (%d documents)",
                   operation, pre$ntoken, pre$ndoc, post$ntoken, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}

stats_dfm <- function(x) {
    list(ndoc = ndoc(x),
         nfeat = nfeat(dfm_remove(x, "")))
}

message_dfm <- function(operation, pre, post) {
    msg <- sprintf("%s: from %d features (%d documents) to %d features (%d documents)",
                   operation, pre$nfeat, pre$ndoc, post$nfeat, post$ndoc)
    msg <- prettyNum(msg, big.mark = ",")
    cat(msg)
}

before <- stats_tokens(toks)
toks <- tokens_remove(toks, stopwords())
after <- stats_tokens(toks)
message_tokens("tokens_remove()", before, after)
#> tokens_remove(): from 151,442 tokens (59 documents) to 79,535 tokens (59 documents)

before <- stats_tokens(toks)
toks <- tokens_subset(toks, Year > 2000)
after <- stats_tokens(toks)
message_tokens("tokens_subset()", before, after)
#> tokens_subset(): from 79,535 tokens (59 documents) to 7,459 tokens (6 documents)

dfmt <- dfm(toks)

before <- stats_dfm(dfmt)
dfmt <- dfm_trim(dfmt, min_termfreq = 10)
after <- stats_dfm(dfmt)
message_dfm("dfm_trim()", before, after)
#> dfm_trim(): from 2,185 features (6 documents) to 104 features (6 documents)

Created on 2024-01-12 with reprex v2.0.2
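
To avoid repeating the before/after bookkeeping in every method, the pattern above could be wrapped once. The following is only a sketch under stated assumptions: with_verbose() is a hypothetical name, and a plain list of character vectors with the toy functions stats_list() and drop_short() stands in for a tokens object so that the example runs without quanteda.

```r
# Hypothetical wrapper (not part of quanteda): apply fun(x, ...) and, when
# verbose = TRUE, report the statistics returned by `stats` before and after.
with_verbose <- function(x, fun, stats, label, ..., verbose = FALSE) {
    if (verbose) pre <- stats(x)
    result <- fun(x, ...)
    if (verbose) {
        post <- stats(result)
        f <- function(n) format(n, big.mark = ",", scientific = FALSE)
        message(sprintf("%s: from %s tokens (%s documents) to %s tokens (%s documents)",
                        label, f(pre$ntoken), f(pre$ndoc),
                        f(post$ntoken), f(post$ndoc)))
    }
    result
}

# Stand-ins for stats_tokens() and a tokens_*() method, working on a plain list
stats_list <- function(x) list(ndoc = length(x), ntoken = sum(lengths(x)))
drop_short <- function(x, min_nchar) {
    lapply(x, function(d) d[nchar(d) >= min_nchar])
}

toks_list <- list(d1 = c("a", "quick", "fox"), d2 = c("of", "the", "people"))
out <- with_verbose(toks_list, drop_short, stats_list, "drop_short()",
                    min_nchar = 3, verbose = TRUE)
#> drop_short(): from 6 tokens (2 documents) to 4 tokens (2 documents)
```

With verbose = FALSE the statistics are never computed, so the wrapper adds no cost in the default case.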

Collaborator Author

kbenoit commented Jan 12, 2024

Makes sense to me!
