'meta' data is lost when using '+' to concatenate corpus objects #2295

mrstlee · 2023-09-24T11:59:27Z

Describe the bug

Corpus-level metadata assigned with meta() is lost when two corpus objects are concatenated with the '+' operator.

Reproducible code


library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.

c1 <- corpus(letters)
## Default docnames will clash when the 2 corpus objects are added together
docnames(c1) <- rownames(mtcars)[1:26]

meta(c1) <- list(a = 1)

c2 <- corpus(LETTERS)
meta(c2) <- list(b = 2)

print(list(meta(c1), meta(c2)))
#> [[1]]
#> [[1]]$a
#> [1] 1
#> 
#> 
#> [[2]]
#> [[2]]$b
#> [1] 2

c3 <- c1 + c2

print(meta(c3))
#> list()


<sup>Created on 2023-09-24 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

Expected behavior

The metadata from each corpus should be merged, e.g.

print(meta(c3)) should give:

$a
[1] 1

$b
[1] 2

or similar.

System information


R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.5

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] shiny_1.7.2               qreport_0.1.0             spacyr_1.2.1             
 [4] xfun_0.40                 rvest_1.0.3               lubridate_1.9.2          
 [7] forcats_1.0.0             Hmisc_5.1-0               ggplot2_3.4.2            
[10] quanteda.textstats_0.96.3 magrittr_2.0.3            quanteda.textplots_0.94.3
[13] quanteda_3.3.1            stringr_1.5.0             rlist_0.4.6.2            
[16] readr_2.1.4               data.table_1.14.2        

loaded via a namespace (and not attached):
  [1] TH.data_1.1-2      colorspace_2.0-3   ellipsis_0.3.2     rprojroot_2.0.3    htmlTable_2.4.1   
  [6] markdown_1.8       base64enc_0.1-3    fs_1.6.3           rstudioapi_0.14    MatrixModels_0.5-1
 [11] bit64_4.0.5        fansi_1.0.3        mvtnorm_1.2-2      xml2_1.3.3         R.methodsS3_1.8.2 
 [16] codetools_0.2-18   splines_4.2.1      cachem_1.0.6       knitr_1.39         pkgload_1.3.0     
 [21] Formula_1.2-5      jsonlite_1.8.4     gt_0.9.0           cluster_2.1.3      R.oo_1.25.0       
 [26] png_0.1-8          clipr_0.8.0        compiler_4.2.1     httr_1.4.5         backports_1.4.1   
 [31] Matrix_1.5-4.1     fastmap_1.1.0      cli_3.6.1          later_1.3.0        htmltools_0.5.6   
 [36] quantreg_5.95      tools_4.2.1        gtable_0.3.0       glue_1.6.2         dplyr_1.1.2       
 [41] fastmatch_1.1-3    Rcpp_1.0.9         styler_1.9.1       jquerylib_0.1.4    vctrs_0.6.2       
 [46] nlme_3.1-157       ps_1.7.1           stopwords_2.3      miniUI_0.1.1.1     timechange_0.2.0  
 [51] nsyllable_1.0.1    mime_0.12          lifecycle_1.0.3    sparkline_2.0      polspline_1.1.22  
 [56] MASS_7.3-57        zoo_1.8-12         scales_1.2.1       vroom_1.6.3        hms_1.1.3         
 [61] promises_1.2.0.1   parallel_4.2.1     sandwich_3.0-2     SparseM_1.81       RColorBrewer_1.1-3
 [66] yaml_2.3.5         memoise_2.0.1      reticulate_1.28    gridExtra_2.3      sass_0.4.7        
 [71] rms_6.7-0          rpart_4.1.16       stringi_1.7.8      highr_0.9          checkmate_2.1.0   
 [76] rlang_1.1.1        pkgconfig_2.0.3    commonmark_1.9.0   evaluate_0.16      lattice_0.20-45   
 [81] purrr_1.0.1        htmlwidgets_1.6.2  processx_3.7.0     bit_4.0.4          tidyselect_1.2.0  
 [86] here_1.0.1         R6_2.5.1           generics_0.1.3     multcomp_1.4-24    pillar_1.9.0      
 [91] foreign_0.8-82     withr_2.5.0        survival_3.3-1     nnet_7.3-17        tibble_3.2.1      
 [96] crayon_1.5.1       utf8_1.2.2         tzdb_0.3.0         rmarkdown_2.14     viridis_0.6.3     
[101] grid_4.2.1         callr_3.7.1        reprex_2.0.2       digest_0.6.29      R.cache_0.16.0    
[106] xtable_1.8-4       httpuv_1.6.5       R.utils_2.12.2     RcppParallel_5.1.7 munsell_0.5.0     
[111] viridisLite_0.4.0  bslib_0.4.0

Additional info

The 'Quick Start' states:

"Corpus-level meta-data is also concatenated."


koheiw · 2023-10-19T07:57:23Z

We have corpus meta fields to keep track of the sources of the texts etc., so a meta field should be kept only when the two objects have the same value for it.

mrstlee · 2023-10-20T11:03:46Z

Thank you for the feedback.

In this case maybe the 'Quick Start' documentation is misleading?

"Corpus-level meta-data is also concatenated."

I apologise if I have misunderstood. You seem to be saying that unless the metadata of the two corpus objects have the same field/attribute names, the new corpus object formed by '+' will not have the metadata from either of the source corpus objects?

Thanks!

kbenoit · 2023-10-20T11:05:19Z

It should probably just say: "docvars are combined".

kbenoit · 2023-10-21T08:33:48Z

I have a working branch on this, but the unresolved policy question is what to do with metadata from two corpus objects that have the same "key", for example when both have a "title" field. Options are:

1. just keep the first corpus object's metadata;
2. make two new fields "title.1" and "title.2"; or
3. concatenate the individual values into a list, such as title = c("First corpus", "Second corpus").

And what happens if two previously combined corpus objects are then added together?
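To make the first two options concrete, here is a minimal sketch on plain named lists rather than on corpus objects. The helper names keep_first() and suffix_clashes() are hypothetical and are not part of quanteda; the third option is left out here because it needs a decision on how repeated merges should behave.

```r
# Hypothetical helpers working on plain named lists, not on corpus objects.
m1 <- list(title = "First corpus", a = 1)
m2 <- list(title = "Second corpus", b = 2)

# Option 1: when a key clashes, keep the value from the first object.
keep_first <- function(x, y) {
  c(x, y[setdiff(names(y), names(x))])
}

# Option 2: rename clashing keys with ".1" and ".2" suffixes.
suffix_clashes <- function(x, y) {
  clash <- intersect(names(x), names(y))
  ix <- names(x) %in% clash
  iy <- names(y) %in% clash
  names(x)[ix] <- paste0(names(x)[ix], ".1")
  names(y)[iy] <- paste0(names(y)[iy], ".2")
  c(x, y)
}

str(keep_first(m1, m2))
str(suffix_clashes(m1, m2))
```

Note that keep_first(m1, m2) and keep_first(m2, m1) return different titles, so under option 1 the result depends on operand order.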

mrstlee · 2023-10-21T09:35:16Z

I've hacked up a version of option 3 for my particular case.

1. Seems to go against the usual commutative implication of '+'. X + Y != Y + X.
2. Seems better since the data from both is kept and a consistent naming convention makes it easy to program with.
3. Seems best - a long vs wide data structure would scale up better for chained concatenation. Although I would go for a list rather than vector as the underlying data structure.

Still, I've been wrong before. And that's just this morning.

mrstlee · 2023-10-21T10:00:01Z

In answer to the last question I would vote for:

c3 <-  c1 + c2  # title = list("First", "Second")
c5 <-  c3 + c4  # title = list("First", "Second","Third")
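A rough sketch of that behaviour on plain named lists, assuming the values of a clashing key are collected into one flat list so that chained merges do not nest. merge_meta() is a hypothetical helper, not a quanteda function, and as.list() would also split a vector-valued field into its elements, which a real implementation would need to handle.

```r
# Hypothetical helper: merge two named metadata lists, collecting the values
# of each key into a flat list. Keys with a single value stay unwrapped.
merge_meta <- function(x, y) {
  keys <- union(names(x), names(y))
  out <- lapply(keys, function(k) {
    # x[[k]] is NULL when the key is absent, and as.list(NULL) is list()
    vals <- c(as.list(x[[k]]), as.list(y[[k]]))
    if (length(vals) == 1L) vals[[1L]] else vals
  })
  names(out) <- keys
  out
}

m1 <- list(title = "First", a = 1)
m2 <- list(title = "Second", b = 2)
m4 <- list(title = "Third")

m3 <- merge_meta(m1, m2)  # title = list("First", "Second")
m5 <- merge_meta(m3, m4)  # title = list("First", "Second", "Third")
str(m5)
```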
