
CRAN plans #5

Open
lmullen opened this issue Mar 13, 2018 · 15 comments


lmullen commented Mar 13, 2018

It's been a while, but I wanted to ask whether there are plans to put this tif package on CRAN. I still think it would be a worthwhile addition to the ecosystem, and the work that @statsmaths did is a good base for a first release. I'm not sure whether there are unresolved discussions to be had about the interchange formats, or whether we should have them now or at the NYU meeting. But I would find this package very helpful for moving between the various formats required by different text mining packages.

I'm willing to do whatever is necessary if it would be helpful to get this on CRAN.

What are your thoughts, @kbenoit @statsmaths @patperry?

statsmaths commented:

I agree that getting this on CRAN would be helpful. The only work on tif that I have done since September was to make cleanNLP compliant with the format; I believe @kbenoit has done the same with quanteda.

In addition to CRAN, how do we feel about making this an official rOpenSci project? In many ways I see that as perhaps even more important for getting wider adoption.


lmullen commented Mar 13, 2018

Yes, I agree that going through rOpenSci onboarding would be a good step to take before publishing to CRAN.

I'm checking this now, but as far as I know all the functions in tokenizers output lists in the format tif expects, and so the package should work with tif.


statsmaths commented Mar 13, 2018

@lmullen It looks like tokenizers is good in terms of the output:

song <- paste0("How many roads must a man walk down\n",
               "Before you call him a man?\n",
               "How many seas must a white dove sail\n",
               "Before she sleeps in the sand?\n",
               "\n",
               "How many times must the cannonballs fly\n",
               "Before they're forever banned?\n",
               "The answer, my friend, is blowin' in the wind.\n",
               "The answer is blowin' in the wind.\n")

tokens <- tokenizers::tokenize_words(song)
tif::tif_is_tokens_list(tokens)
#> [1] TRUE

However, tokenizers does not currently accept a data frame corpus object as input. The current specification says that "packages should accept both and return or coerce to at least one of these".


lmullen commented Mar 13, 2018

@statsmaths Right. What I'm wondering is whether the "accept both" language should be "accept at least one". With the tif package available, packages wouldn't need to support both forms of input, because they could explain in their documentation how to use tif to coerce to the formats needed by other packages.

In other words, tif could serve as the hub in a hub-and-spokes model with all the other text analysis packages, and each package only needs one spoke back to the hub.

For instance, a user might do something like this:

library(tif)
library(tokenizers)
library(dplyr)

corpus <- c("Aujourd'hui, maman est morte.",
            "It was a pleasure to burn.",
            "All this happened, more or less.")
names(corpus) <- c("Camus", "Bradbury", "Vonnegut")

corpus_df <- tif_as_corpus_df(corpus)

corpus_df %>%
  mutate(token = tokenize_words(text)) %>%
  select(-text) %>%
  tidyr::unnest() %>%
  tif_is_tokens_df()
#> [1] TRUE

corpus_df %>%
  tif_as_corpus_character() %>%
  tokenize_words() %>%
  tif_is_tokens_list()
#> [1] TRUE

If the consensus is that "accept both" is the right way to go, then I am willing to adopt that in tokenizers. I'd just need tif to be published to CRAN before I could publish a new version of tokenizers.


kbenoit commented Mar 13, 2018

I fully support the idea of returning to and completing the tif package, along with extensive guidelines for adoption. I would propose that tif provide the checking functions and conversions to and from its own general types, while we issue guidelines for each package to import and export its formats.

I'm not sure that these guidelines should dictate nomenclature, since each package has its own preferences for naming things. In quanteda, for instance, we have our custom tokens class object, but as.list(x) produces a list compliant with the tif tokens list standard. We also have an as.tokens(x) that converts the tif formats into the tokens class. Similar methods exist for corpus input and output.
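
To make that concrete, here is a sketch of the round trip I have in mind (untested here, but this is the shape of the quanteda methods):

library(quanteda)

corp <- c(d1 = "It was the best of times.",
          d2 = "It was the worst of times.")
toks <- tokens(corp)          # quanteda's own tokens class

tif_toks <- as.list(toks)     # out: a named list of character vectors (tif tokens)
tif::tif_is_tokens_list(tif_toks)
#> [1] TRUE

toks2 <- as.tokens(tif_toks)  # in: back to the quanteda tokens class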

We could, however, maintain a checklist and table on the GitHub site recording tif compliance for each package, with the associated function names for I/O. This could be based on a template of tests that each package must pass, after substituting a list of generic function references with the package's own functions. If those pass, the package is fully compliant.

I am sprinting through Mar 23 to complete our spring term, but could work on this in the first or second week of April.


lmullen commented Mar 13, 2018

@kbenoit I agree with that description of the basic functionality. That's what the tif package as currently written already does, right, @statsmaths? If so, I don't think there is anything hindering the package from going on CRAN as it is.

I like the idea of a table of compliance and would be willing to help with that whenever.

statsmaths commented:

@lmullen Yes, that is exactly what the package does as written. I've fixed the column ordering issue with the corpus and tokens objects. Can you think of any other outstanding issues that we need to address before publishing?

I do think it would be good to have a vignette with a table of compliant packages before pushing to CRAN (and, as mentioned above, doing the rOpenSci onboarding would be a good next step after that).


lmullen commented Mar 14, 2018

@statsmaths I've raised a separate question in #8. But if the answer to that is no, then I don't see any reason that this couldn't be published to CRAN.

I do think that such a vignette would be great, but it could wait for a point release, since I suspect it will be very time-consuming to create. Your call, of course.


lmullen commented Mar 14, 2018

BTW, the master branch of tokenizers now meets the requirement to take corpus data frames, which I was waffling on earlier.
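
A sketch of what that enables (assuming the development version keys the output by the doc_id column; I haven't shown actual output here):

# using the master branch of tokenizers
library(tokenizers)

corpus_df <- data.frame(
  doc_id = c("Camus", "Bradbury"),
  text = c("Aujourd'hui, maman est morte.", "It was a pleasure to burn."),
  stringsAsFactors = FALSE
)

tokens <- tokenize_words(corpus_df)  # tif corpus data frame in, tokens list out
tif::tif_is_tokens_list(tokens)
#> [1] TRUE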

statsmaths commented:

Okay, I agree that we can probably wait on developing a table of compliant packages. I just added some details to follow the rOpenSci onboarding guidelines and the goodpractice checks. Once we resolve issue #8 (I just commented there), I'm comfortable uploading to CRAN.


lmullen commented Mar 14, 2018

🚀

I'm fine with passing over the suggestion in #8. It would make things needlessly complicated.


kbenoit commented Mar 14, 2018

I think it would be crucial to include a vignette explaining not only the standard but also how a package can meet the requirements for compliance. I also think we can include a set of tests with global references that each package replaces with its own functions; running the same test code against those references would then check compliance for any package. Passing the tests = compliance. A rough sketch follows.
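
Here is the sketch, using testthat (the pkg_* placeholders and the native_corpus/native_tokens fixtures are hypothetical; each package would bind them to its own functions and objects):

library(testthat)

# Placeholders: each package binds these to its own export functions.
pkg_as_tif_corpus_df   <- function(x) stop("bind to your package's corpus exporter")
pkg_as_tif_tokens_list <- function(x) stop("bind to your package's tokens exporter")

test_that("exported corpus is tif-compliant", {
  expect_true(tif::tif_is_corpus_df(pkg_as_tif_corpus_df(native_corpus)))
})

test_that("exported tokens are tif-compliant", {
  expect_true(tif::tif_is_tokens_list(pkg_as_tif_tokens_list(native_tokens)))
})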

I think it would be more natural to complete this before publishing on CRAN. I'm happy to work on this from Mar 26.


lmullen commented Mar 14, 2018

A good rule of thumb is that any argument for deliberateness should trump an argument for speed. 😄

I'm willing to pick this back up at the end of March / beginning of April and help in any way with the vignettes and tests.

If we are going to wait, I wonder if we should also pursue a more formal means of comment on the formats before release. That could mean e-mailing a request for comment to all the people Ken assembled last year for the meeting, and perhaps another e-mail to the maintainers of packages in the relevant CRAN task view. We're really talking about a human problem, not a technical problem, and a widespread review might help get buy-in from the community.


kbenoit commented Mar 14, 2018

Sounds wise to me. See ropensci/textworkshop18#8: we thought this could be worth revisiting next month, with an aim of reaching closure for an initial release.

I'm happy to draft my test-suite-for-certification idea, but I just cannot squeeze more time out of my next 10 days.

u17289077 commented:

Hello, how can I resolve this problem?

Error in parse(outFile) :
  C:/Users/hugues/AppData/Local/Temp/Rtmpstb2KH/R.INSTALL2c2020bf2630/tif/R/coercion.R:177:27: unexpected ')'
176:     stop("Cannot convert object of class ", class(tokens),
177:          " to tif tokens"))
                              ^
ERROR: unable to collate and parse R files for package 'tif'
* removing 'C:/Users/hugues/Documents/R/win-library/3.5/tif'

In R CMD INSTALL
Error in i.p(...) :
  (converted from warning) installation of package 'C:/Users/hugues/AppData/Local/Temp/RtmpeqtJai/file1d983bd2c12/tif_0.3.tar.gz' had non-zero exit status
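
The parser is flagging an extra closing parenthesis at the end of line 177 of coercion.R. Assuming those lines are otherwise as shown in the error above, the balanced call would be:

stop("Cannot convert object of class ", class(tokens),
     " to tif tokens")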
