corpus_registry_dir() duplicating on Windows #267

Open
maw44989 opened this issue Oct 4, 2023 · 3 comments

maw44989 commented Oct 4, 2023

We are using polmineR for a Text and Corpus Analysis class for undergraduate and graduate students. Students running polmineR on Windows hit a recurring error that prevents any further use of the package. The error output is:

" error in evaluating the argument '.Object' in selecting a method for function 'count': Cannot initialize corpus object - corpus defined by two different registry files."

Included below is the issue on Windows and a positive example of how polmineR correctly works on Mac.

Windows Issue

Version Numbers

packageVersion("RcppCWB")
[1] 0.6.2
packageVersion("polmineR")
[1] 0.8.8
R version 4.3.1 (2023-06-16 ucrt) -- "Beagle Scouts"

Check path before loading polmineR

c_regdir <- fs::path(RcppCWB::corpus_registry_dir("BNC"))
c_regdir
NA

Check path after loading

library(polmineR)
c_regdir <- fs::path(RcppCWB::corpus_registry_dir("BNC"))
c_regdir
R:/windows_registry

Run count command once

QD <- count("BNC", query = "'quite' [pos = '(DT0|DTQ)']", cqp = T, regex = T)
head(QD)
query count freq
1: 'quite' [pos = '(DT0|DTQ)'] 626 5.581494e-06

Registry is now duplicated (length = 2), prompting the above error on all future commands using polmineR

c_regdir <- fs::path(RcppCWB::corpus_registry_dir("BNC"))
c_regdir
R:/windows_registry R:/windows_registry

Mac Success

Versions

R version 4.3.1 (2023-06-16) -- "Beagle Scouts"

packageVersion("polmineR")
[1] ‘0.8.8’
packageVersion("RcppCWB")
[1] ‘0.6.2’

Set Registry Environment and load polmineR

Sys.setenv("CORPUS_REGISTRY" = "/Volumes/cwb_registry/mac_registry")

library(polmineR)

Run count command --> registry is still length 1

QD <- count("BNC", query = "'quite' [pos = '(DT0|DTQ)']", cqp = T, regex = T)
fs::path(RcppCWB::corpus_registry_dir("BNC"))
/Volumes/cwb_registry/mac_registry
RA <- count("BNC", query = "'rather' [pos = '(A.)']", cqp = T, regex = T)
RA
query count freq
1: 'rather' [pos = '(A.)'] 12658 0.0001128603

Even after running count() twice, corpus_registry_dir() for the British National Corpus still has length 1. On Windows it doubles to length 2.

fs::path(RcppCWB::corpus_registry_dir("BNC"))
/Volumes/cwb_registry/mac_registry
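
To make the length check above easier to repeat on student machines, here is a minimal sketch of a helper that classifies whatever corpus_registry_dir() returns. The name registry_state() and its four labels are my own and hypothetical; they are not part of RcppCWB or polmineR. The example inputs are copied from the outputs reported in this issue, so the block runs without a corpus installed.

```r
# Hypothetical helper (NOT part of RcppCWB or polmineR): classify the value
# returned by RcppCWB::corpus_registry_dir() for a single corpus.
registry_state <- function(dirs) {
  dirs <- as.character(dirs)
  # NA (or nothing) is what we saw before polmineR was loaded
  if (length(dirs) == 0L || all(is.na(dirs))) return("not loaded")
  # Exactly one registry directory is the expected state
  if (length(dirs) == 1L) return("ok")
  # Several entries that are all the same path: the duplication reported here
  if (length(unique(dirs)) == 1L) return("duplicated")
  # Several entries pointing at different paths
  "conflicting"
}

# Inputs copied from the Windows session above
registry_state(NA)                                               # before library(polmineR)
registry_state("R:/windows_registry")                            # after loading polmineR
registry_state(c("R:/windows_registry", "R:/windows_registry"))  # after the first count()
```

In a live session the call would be registry_state(RcppCWB::corpus_registry_dir("BNC")), run once before and once after the first count(). This only detects the duplication; it does not fix it.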


jthale76 commented Oct 8, 2023

We can reproduce the duplication using only lower-level RcppCWB functions. The call to cqp_subcorpus_size() shows that the search itself was successful. Have there been any changes to RcppCWB in the last year that might lead to such doubling? The demonstration below uses Windows version 10.0.19045.

library(RcppCWB)
packageVersion("RcppCWB")
[1] ‘0.6.2’

Sys.setenv("CORPUS_REGISTRY"="R:/windows_registry")
cqp_reset_registry(registry = Sys.getenv("CORPUS_REGISTRY"))
[1] TRUE

cqp_query(corpus = "BNC", query = '"the";')
<pointer: 0x000001e1fc908c50>

cqp_subcorpus_size("BNC",subcorpus="QUERY")
[1] 5405646

corpus_registry_dir("BNC")
R:/windows_registry R:/windows_registry

cqp_query(corpus = "BNC", query = '"of";')
<pointer: 0x000001e1fc908c50>

corpus_registry_dir("BNC")
R:/windows_registry R:/windows_registry

@ablaette
Collaborator

I face a similar issue when doing this on macOS:

library(polmineR)
use("GermaParl2")

foo <- corpus("GERMAPARL2MINI") %>%
  subset(protocol_date == "1949-09-07", verbose = TRUE) %>% 
  subset(speaker_name == "Konrad Adenauer", verbose = TRUE)


foo <- corpus("GERMAPARL2MINI") %>%
  subset(protocol_date == "1949-09-07", verbose = TRUE) %>% 
  subset(speaker_name == "Konrad Adenauer", verbose = TRUE)

It's absolutely clear that this issue needs to be solved. Apologies for taking it up this late!

@ablaette
Collaborator

There is a closely related issue on macOS: RcppCWB::cl_struc_values() will result in a corpus being loaded twice.

PolMine/RcppCWB#77
