'meta' data is lost when using '+' to concatenate corpus objects #2295

Open
mrstlee opened this issue Sep 24, 2023 · 6 comments

mrstlee commented Sep 24, 2023

Describe the bug

Corpus-level metadata assigned with meta() is lost when two corpus objects are concatenated with the '+' operator.

Reproducible code

Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.

library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.

c1 <- corpus(letters)
## Default docnames will clash when the 2 corpus objects are added together
docnames(c1) <- rownames(mtcars)[1:26]

meta(c1) <- list(a = 1)

c2 <- corpus(LETTERS)
meta(c2) <- list(b = 2)

print(list(meta(c1), meta(c2)))
#> [[1]]
#> [[1]]$a
#> [1] 1
#> 
#> 
#> [[2]]
#> [[2]]$b
#> [1] 2

c3 <- c1 + c2

print(meta(c3))
#> list()


Created on 2023-09-24 with reprex v2.0.2 (https://reprex.tidyverse.org)

Expected behavior

The separate metadata for each corpus should be merged, e.g.

print(meta(c3)) should give:

$a
[1] 1

$b
[1] 2

or similar.

System information

Please run sessionInfo() and paste the output.

R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.5

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] shiny_1.7.2               qreport_0.1.0             spacyr_1.2.1             
 [4] xfun_0.40                 rvest_1.0.3               lubridate_1.9.2          
 [7] forcats_1.0.0             Hmisc_5.1-0               ggplot2_3.4.2            
[10] quanteda.textstats_0.96.3 magrittr_2.0.3            quanteda.textplots_0.94.3
[13] quanteda_3.3.1            stringr_1.5.0             rlist_0.4.6.2            
[16] readr_2.1.4               data.table_1.14.2        

loaded via a namespace (and not attached):
  [1] TH.data_1.1-2      colorspace_2.0-3   ellipsis_0.3.2     rprojroot_2.0.3    htmlTable_2.4.1   
  [6] markdown_1.8       base64enc_0.1-3    fs_1.6.3           rstudioapi_0.14    MatrixModels_0.5-1
 [11] bit64_4.0.5        fansi_1.0.3        mvtnorm_1.2-2      xml2_1.3.3         R.methodsS3_1.8.2 
 [16] codetools_0.2-18   splines_4.2.1      cachem_1.0.6       knitr_1.39         pkgload_1.3.0     
 [21] Formula_1.2-5      jsonlite_1.8.4     gt_0.9.0           cluster_2.1.3      R.oo_1.25.0       
 [26] png_0.1-8          clipr_0.8.0        compiler_4.2.1     httr_1.4.5         backports_1.4.1   
 [31] Matrix_1.5-4.1     fastmap_1.1.0      cli_3.6.1          later_1.3.0        htmltools_0.5.6   
 [36] quantreg_5.95      tools_4.2.1        gtable_0.3.0       glue_1.6.2         dplyr_1.1.2       
 [41] fastmatch_1.1-3    Rcpp_1.0.9         styler_1.9.1       jquerylib_0.1.4    vctrs_0.6.2       
 [46] nlme_3.1-157       ps_1.7.1           stopwords_2.3      miniUI_0.1.1.1     timechange_0.2.0  
 [51] nsyllable_1.0.1    mime_0.12          lifecycle_1.0.3    sparkline_2.0      polspline_1.1.22  
 [56] MASS_7.3-57        zoo_1.8-12         scales_1.2.1       vroom_1.6.3        hms_1.1.3         
 [61] promises_1.2.0.1   parallel_4.2.1     sandwich_3.0-2     SparseM_1.81       RColorBrewer_1.1-3
 [66] yaml_2.3.5         memoise_2.0.1      reticulate_1.28    gridExtra_2.3      sass_0.4.7        
 [71] rms_6.7-0          rpart_4.1.16       stringi_1.7.8      highr_0.9          checkmate_2.1.0   
 [76] rlang_1.1.1        pkgconfig_2.0.3    commonmark_1.9.0   evaluate_0.16      lattice_0.20-45   
 [81] purrr_1.0.1        htmlwidgets_1.6.2  processx_3.7.0     bit_4.0.4          tidyselect_1.2.0  
 [86] here_1.0.1         R6_2.5.1           generics_0.1.3     multcomp_1.4-24    pillar_1.9.0      
 [91] foreign_0.8-82     withr_2.5.0        survival_3.3-1     nnet_7.3-17        tibble_3.2.1      
 [96] crayon_1.5.1       utf8_1.2.2         tzdb_0.3.0         rmarkdown_2.14     viridis_0.6.3     
[101] grid_4.2.1         callr_3.7.1        reprex_2.0.2       digest_0.6.29      R.cache_0.16.0    
[106] xtable_1.8-4       httpuv_1.6.5       R.utils_2.12.2     RcppParallel_5.1.7 munsell_0.5.0     
[111] viridisLite_0.4.0  bslib_0.4.0 

Additional info

The 'Quick Start' states:

"Corpus-level meta-data is also concatenated."

Collaborator

koheiw commented Oct 19, 2023

Corpus meta fields exist to keep track of things like the sources of the texts, so a meta field should be kept only when both objects have the same value for it.
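
A minimal sketch of this policy in base R, operating on plain named lists of the kind meta() returns in the reprex above. The helper name keep_shared_meta is hypothetical and is not part of quanteda; it only illustrates "keep a field when both sides agree".

```r
# Hypothetical helper (not a quanteda function): keep a meta field only
# when it is present in both lists with an identical value.
keep_shared_meta <- function(m1, m2) {
  shared <- intersect(names(m1), names(m2))
  same <- vapply(shared, function(k) identical(m1[[k]], m2[[k]]), logical(1))
  m1[shared[same]]
}

m1 <- list(source = "news", a = 1)
m2 <- list(source = "news", b = 2)

# Only "source" survives: it is the one field with the same value in both.
str(keep_shared_meta(m1, m2))
```

Under this rule the lists from the bug report, list(a = 1) and list(b = 2), share no fields, so the combined result is empty, which matches the list() output shown in the reprex.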

Author

mrstlee commented Oct 20, 2023

Thank you for the feedback.

In this case maybe the 'Quick Start' documentation is misleading?

"Corpus-level meta-data is also concatenated."

I apologise if I have misunderstood. You seem to be saying that unless the metadata for the two corpus objects have the same field/attribute names, the new corpus object formed by '+' will not have the metadata from either of the source corpus objects?

Thanks!

Collaborator

kbenoit commented Oct 20, 2023

It should probably just say: "docvars are combined".

Collaborator

kbenoit commented Oct 21, 2023

I have a working branch on this, but the unresolved policy question is what to do with metadata from two corpus objects that have the same "key". In other words, "title" and "title". Options are:

  1. just keep the first corpus object's metadata;
  2. make two new fields "title.1" and "title.2"; or
  3. concatenate the individual values into a list, such as title = c("First corpus", "Second corpus").

And what happens if two previously combined corpus objects are then added together?
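
To make the three options concrete, here is a rough base R sketch on plain named lists. The function merge_meta and its policy names ("first", "suffix", "collect") are hypothetical and do not reflect the working branch; "collect" stores the clashing values in a list rather than a character vector, which is an assumption of this sketch.

```r
# Hypothetical sketch (not quanteda code) of the three collision policies.
merge_meta <- function(m1, m2, policy = c("first", "suffix", "collect")) {
  policy <- match.arg(policy)
  clash <- intersect(names(m1), names(m2))   # keys present in both
  only2 <- m2[setdiff(names(m2), clash)]     # keys unique to the second

  if (policy == "first") {
    # Option 1: on a clash, the first object's value wins.
    return(c(m1, only2))
  }

  if (policy == "suffix") {
    # Option 2: rename clashing keys to key.1 / key.2.
    rename <- function(m, suffix) {
      hit <- names(m) %in% clash
      names(m)[hit] <- paste0(names(m)[hit], suffix)
      m
    }
    return(c(rename(m1, ".1"), rename(m2, ".2")))
  }

  # Option 3: collect both clashing values under the original key.
  out <- c(m1, only2)
  for (k in clash) out[[k]] <- list(m1[[k]], m2[[k]])
  out
}

m1 <- list(title = "First corpus", a = 1)
m2 <- list(title = "Second corpus", b = 2)

names(merge_meta(m1, m2, "first"))    # "title" "a" "b"
names(merge_meta(m1, m2, "suffix"))   # "title.1" "a" "title.2" "b"
str(merge_meta(m1, m2, "collect"))    # title holds both values
```

Only "first" depends on operand order for the values it keeps; "suffix" and "collect" retain both values, although the order of the fields still follows the operands.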

Author

mrstlee commented Oct 21, 2023

I've hacked up a version of 3 for my particular case.

  1. Seems to go against the usual commutative implication of '+'. X + Y != Y + X.
  2. Seems better since the data from both is kept and a consistent naming convention makes it easy to program with.
  3. Seems best - a long vs wide data structure would scale up better for chained concatenation. Although I would go for a list rather than vector as the underlying data structure.

Still, I've been wrong before. And that's just this morning.

Author

mrstlee commented Oct 21, 2023

In answer to the last question I would vote for:

c3 <-  c1 + c2  # title = list("First", "Second")
c5 <-  c3 + c4  # title = list("First", "Second","Third")
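
One way this chained behaviour could be sketched in base R, again on plain named lists rather than corpus objects. The helpers as_values and combine_meta are hypothetical names, and the sketch assumes that a field whose value is already a list is the result of an earlier combination, so a field whose single original value is itself a list would be flattened too.

```r
# Hypothetical helpers (not quanteda code) for list-based accumulation.

# Treat an existing list as already-accumulated values; wrap anything else.
as_values <- function(x) if (is.list(x)) x else list(x)

combine_meta <- function(m1, m2) {
  out <- m1
  for (k in names(m2)) {
    out[[k]] <- if (k %in% names(m1)) {
      # Clashing key: append rather than nest, so chains stay flat.
      c(as_values(m1[[k]]), as_values(m2[[k]]))
    } else {
      m2[[k]]
    }
  }
  out
}

m3 <- combine_meta(list(title = "First"), list(title = "Second"))
m5 <- combine_meta(m3, list(title = "Third"))

str(m5$title)  # a flat list of "First", "Second", "Third"
```

Because clashing values are appended instead of nested, adding two previously combined objects gives one flat list per key, in operand order.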
