Releases: PolMine/cwbtools

Blackbird

29 Apr 12:03
  • All packages listed in 'Suggests' section of DESCRIPTION used conditionally (pass checks with R_CHECK_DEPENDS_ONLY=true) #74.
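
Conditional use of a suggested package usually follows the pattern sketched below. This is a minimal illustration, not cwbtools code: the helper optional_pkg_status() is hypothetical and both package names are placeholders.

```r
# Hypothetical helper illustrating conditional use of a package from 'Suggests'.
optional_pkg_status <- function(pkg) {
  # requireNamespace() returns FALSE (instead of failing) if pkg is missing
  if (requireNamespace(pkg, quietly = TRUE)) {
    sprintf("'%s' %s is available", pkg, as.character(utils::packageVersion(pkg)))
  } else {
    sprintf("'%s' is not installed, optional functionality is skipped", pkg)
  }
}

optional_pkg_status("stats")                    # ships with R
optional_pkg_status("aPackageThatDoesNotExist") # placeholder for a missing package
```

Guarding every call this way is what allows checks to pass when only the hard dependencies are installed.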

Flying Panda

29 Feb 17:27

New features

  • New method encode() to prospectively supersede the CorpusData class. Includes argument properties #13.
  • New function corpus_reload() for convenient unloading/reloading corpora #68.
  • New utility function registry_set_name() #13.

Minor improvements

  • cwb_get_url() will get CWB v3.5 installation files #63.
  • corpus_remove() returns FALSE (rather than failing with ERROR) when corpus
    does not exist. More telling messages.
  • p_attribute_encode() has a new argument quietly, which is passed into the RcppCWB
    functions cwb_compress(), cwb_huffcode() and cwb_compress_rdx() to control verbosity.
  • Method $encode() of the CorpusData class has a new argument quietly, which is passed
    into p_attribute_encode().
  • Method $encode() has a new argument reload to trigger unloading and reloading
    the corpus, to make s-attributes available #57.
  • The CorpusData$encode() method uses messages from the cli package #59.
  • Outdated documentation of p_attribute_encode() rewritten, including explanation
    of argument compress and simplification of sample code #61.
  • Corrected inconsistencies in the vignette #55.
  • s_attribute_encode() coerces input values to character (rather than failing) #62.
  • The validity of attribute names is checked by s_attribute_encode(),
    p_attribute_encode() and CorpusData$encode() using a new (internal)
    function, a telling message is issued if non-ASCII or uppercase characters are
    used. The documentation has been augmented accordingly #48.
  • For method "R", p_attribute_encode() checks whether files for the encoded p-attribute
    exist and fails gracefully with a telling error message if they do #4.
  • Argument compress defaults to FALSE as corpus compression is not stable on Windows #3.
  • Functions corpus_as_tarball() and corpus_copy() now have registry_file_parse(corpus, registry_dir)[["home"]] as default value, so that values are more consistent across corpus_* functions #18.
  • cwb_get_bindir() tries to find cwb-config system utility, if it is on the path.
  • s_attribute_encode() issues warning on Windows when using s-attribute 'id' #69.
  • Replaced normalizePath() by fs::path() in p_attribute_encode() #65.
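
The check for valid attribute names described above (no non-ASCII and no uppercase characters) can be sketched as follows. The helper attribute_name_ok() is hypothetical and is not the internal cwbtools function, whose name and exact rules are not stated in these notes.

```r
# Hypothetical helper: TRUE for names without non-ASCII or uppercase characters.
attribute_name_ok <- function(x) {
  stopifnot(is.character(x), !anyNA(x))
  # Code points above 127 are outside the ASCII range
  has_non_ascii <- vapply(
    x, function(s) any(utf8ToInt(s) > 127L), logical(1), USE.NAMES = FALSE
  )
  # A name that changes when lowercased contains an uppercase character
  has_uppercase <- x != tolower(x)
  !(has_non_ascii | has_uppercase)
}

attribute_name_ok(c("speaker", "Speaker", "r\u00e9sum\u00e9"))
```

In this sketch only the first name passes: the second contains an uppercase letter and the third contains non-ASCII characters.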

Improved documentation

  • Simplifications of the vignette #60.
  • A scenario showing how to add a stemmed token stream to an existing corpus has been added to the vignette #14.

Solid Path

01 Sep 09:48
  • Package names, software names and API are wrapped in single quotes in the
    DESCRIPTION files, to follow section 1.1.1 of 'Writing R extensions' #43.
  • References in the description of the DESCRIPTION file have been standardized
    #44.
  • To meet CRAN requirements, any remaining usage of install.packages() has
    been removed from the package. Using argument pkg of corpus_install() will
    install corpora found in a package as system corpora defined in the default
    registry directory #46.
  • The vignettes 'opennnlp.Rmd' and 'sentences.Rmd' have been removed from the
    package; they are now part of the PolMine Cookbook repository at
    https://github.com/PolMine/cookbook. Packages 'NLP' and 'openNLP' are no
    longer suggested and the install.packages() call (though not evaluated) is
    omitted. Part of the fix for #46.
  • The fs::path() function replaces base R file.path() throughout to solidify
    the generation of paths and to improve the readability of the code throughout.
  • p_attribute_encode() checks that the character vector token_stream does
    not exceed the CWB corpus size limit (2^31 - 1) #40.
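
The size limit mentioned in the last item coincides with the largest integer value R can represent. A minimal sketch of such a check, assuming a hypothetical helper fits_cwb_limit() (not part of cwbtools):

```r
# The CWB corpus size limit stated above; equal to R's largest integer value.
cwb_size_limit <- 2^31 - 1

# Hypothetical helper: TRUE if a token stream of length n fits into a CWB corpus.
fits_cwb_limit <- function(n) n <= cwb_size_limit

cwb_size_limit == .Machine$integer.max            # the two limits coincide
fits_cwb_limit(length(c("Hello", "world", "!")))  # a tiny token stream fits
fits_cwb_limit(2^31)                              # one token too many
```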

Houston Calling

20 Jul 12:51
  • Ensure that zenodo_get_tarball() fails gracefully if Zenodo is temporarily
    not available.

Secret Spell

15 May 12:28
  • New function p_attribute_rename(), corresponding to s_attribute_rename().
  • p_attribute_encode() will remove the [p_attr].corpus file as suggested by
    cwb-makeall (if compress is TRUE).
  • Assumptions about the statement of an info file in registry files are relaxed,
    the line starting with "INFO" is not required.
  • Internally, functionality from the fs package for a consistent handling of
    paths (such as fs::path()) is used more widely (#36).
  • Assumptions about the definition of a version in the name of a corpus tarball
    are relaxed. If possible, the version is taken from the properties (i.e. the
    registry file).
  • New function zenodo_get_tarball() for downloading corpus tarballs from
    Zenodo. Restricted access can be handled too (personalized URL with token).
  • Function corpus_install() has new argument load to control whether corpus
    is loaded after installation.

Hemicycle

23 Feb 21:10

NEW FEATURES

  • Assumptions about the directory structure in a corpus tarball are somewhat relaxed: The name of the data directory may also be "data" (not just "indexed_corpora") and data files need not necessarily be in a subdirectory of the data directory. This makes downloading and installing the Europarl and the Dickens corpus possible.

MINOR IMPROVEMENTS

  • The dependency on the devtools package can be dropped as one consequence of removing the Europarl vignette.
  • The dependency on the usethis package has been removed.
  • The sentences-vignette is more robust by explicitly creating a temporary registry directory.

BUG FIXES

  • A unit test that involves calling cwb_install() is skipped on Solaris to ensure that Solaris CRAN tests will not fail: A CWB binary is not available for Solaris.

DOCUMENTATION FIXES

  • The vignette "europarl.Rmd" is dropped altogether: Putting corpora into packages is not the recommended approach any more.

Il Postino

22 Feb 13:37

NEW FEATURES

  • It is now possible to install a corpus from S3 by stating a S3-URI as argument tarball of corpus_install().
  • A new argument checksum for the corpus_install() function introduces functionality to check the integrity of a downloaded corpus tarball. If the tarball is downloaded from Zenodo (by stating a DOI using argument doi), the md5 checksum included in the record's metadata is extracted internally and used for checking.
  • A new vignette explains how an existing CWB corpus can be enhanced using openNLP.
  • The function corpus_copy() will accept a new argument remove. If TRUE (the default value is FALSE), files that have been copied will be removed. Removing files is reasonable to handle disk space parsimoniously if the source corpus is at a temporary location where nobody will miss it.
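
An integrity check of the kind described for the checksum argument can be sketched with tools::md5sum() from base R. The helper md5_matches() is hypothetical, and the temporary file merely stands in for a downloaded tarball; how corpus_install() performs the check internally is not shown in these notes.

```r
# Hypothetical helper: compare a file's md5 checksum with an expected value.
md5_matches <- function(file, expected) {
  actual <- unname(tools::md5sum(file))
  identical(actual, tolower(expected))
}

# Stand-in for a downloaded tarball: a file containing the five bytes "hello"
tarball <- tempfile(fileext = ".tar.gz")
writeBin(charToRaw("hello"), tarball)

md5_matches(tarball, "5d41402abc4b2a76b9719d911017c592") # md5 of "hello"
md5_matches(tarball, "00000000000000000000000000000000") # deliberate mismatch
```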

MINOR IMPROVEMENTS

  • The corpus_install() function will abort with a warning and return value FALSE rather than an error if the DOI is not offered by Zenodo.
  • If corpus_install() is used to install a corpus from a tarball present locally, a somewhat confusing message suggested that the tarball was downloaded. This message is not shown any more.
  • Extracting a corpus tarball present locally involved copying the tarball to a temporary location before extracting it. This step, which consumed more disk space than necessary (inefficient and potentially problematic with large corpora), is now omitted.
  • The function cwb_install() now exposes the previously hardcoded internal value cwb_dir as an argument cwb_dir; the function returns the directory where the CWB is installed rather than a NULL value.
  • The function cwb_get_bindir() now introduces an argument bindir.
  • Argument compress of p_attribute_encode() now has default value FALSE (#29).
  • Examples in documentation of p_attribute_encode() have been adapted so that GitHub Action unit test passes on Windows.
  • If the user aborts because installing the same version anew would remove an existing corpus, the result is no longer an error message but the return value FALSE (#25).

BUG FIXES

  • To avoid an issue with a false negative issued by RCurl::url.exists(), this function has been replaced by httr::http_error() (#31).
  • The corpus_install() function still showed some progress messages even when verbose was set to FALSE (the argument was not passed to corpus_copy()). Fixed.
  • The code in the vignette on adding a sentence annotation was not executed when building the package and a bug in the code went unnoticed. Fixed (#17).
  • The get_encoding() method would return NA if localeToCharset() fails to infer charset from locale. In this case, UTF-8 is assumed.
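
The fallback described in the last item can be sketched as follows. The helper encoding_or_utf8() is hypothetical and is not the get_encoding() method itself; its charset argument exists only so that the fallback can be tried out directly.

```r
# Hypothetical helper: first charset inferred from the locale, or "UTF-8"
# if no charset could be inferred (NA).
encoding_or_utf8 <- function(charset = utils::localeToCharset()[1]) {
  if (is.na(charset)) "UTF-8" else charset
}

encoding_or_utf8()               # charset of the current locale, if inferable
encoding_or_utf8(NA_character_)  # simulated failure, falls back to "UTF-8"
```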

DOCUMENTATION FIXES

  • A misleading, deprecated example in a dontrun section of the general package documentation has been removed (#23). The vignette includes a working and tested example of how to encode the REUTERS corpus.

Straight No Chaser

22 Jul 10:35

NEW FEATURES

  • The (weak) dependency on the polmineR package (it was in the 'Suggests:' section of the DESCRIPTION file) has been removed. Changes are purely internal (higher-level polmineR functions have been replaced by lower-level RcppCWB functions, some tests were re-written). Dropping the dependency has the advantage that there is a much clearer structure of dependencies now (RcppCWB -> cwbtools -> polmineR).

MINOR IMPROVEMENTS

  • A remaining CLI formatting issue has been removed from the user dialogue for modifying the .Renviron file.
  • Unit tests used a test download of the United Nations General Assembly (UNGA) corpus from Zenodo. To reduce the time required for testing the package, a test download of the (much smaller) GermaParlSample corpus is performed.

Apple Picker

17 Jul 06:13

NEW FEATURES

  • The corpus_install() function gives much better and nicer reports on steps performed during
    corpus downloads. User dialogues have been reworked thoroughly to provide better user guidance.
  • The use_corpus_registry_envvar() function is called by corpus_install() and will
    amend the .Renviron file as appropriate if the user so desires.
  • To resolve a DOI, the 'zen4R' package is used, to extract information on the whereabouts
    of a corpus tarball efficiently from the Zenodo API.
  • A function corpus_testload() has been implemented to check whether a (newly installed) corpus
    is accessible.

MINOR IMPROVEMENTS

  • Extracting the version number from the corpus tarball is somewhat more forgiving if the
    version number does not start with "v".
  • The registry file for a newly downloaded corpus is refreshed only if a temporary registry directory is used.
  • To remedy the fairly common error that the path to the info file is not stated correctly in the registry file, a fallback mechanism will look up potential alternatives to an info file stated wrongly.
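
The more forgiving version extraction mentioned in the first item of this list could look roughly like the sketch below. The helper extract_tarball_version() and the file names are hypothetical; the actual naming conventions and parsing rules of cwbtools are not specified in these notes.

```r
# Hypothetical helper: extract a version such as "1.2.3" from a tarball name,
# whether or not the version is prefixed with "v".
extract_tarball_version <- function(tarball) {
  fname <- basename(tarball)
  # Optional "v", then at least two dot-separated groups of digits
  hit <- regmatches(fname, regexpr("v?[0-9]+(\\.[0-9]+)+", fname))
  if (length(hit) == 0L) return(NA_character_)
  sub("^v", "", hit)
}

extract_tarball_version("mycorpus_v1.2.3.tar.gz") # version with leading "v"
extract_tarball_version("mycorpus_0.9.0.tar.gz")  # version without "v"
extract_tarball_version("mycorpus.tar.gz")        # no version in the name
```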

BUG FIXES

  • The json string returned from Zenodo may include newline strings that are escaped such
    that they cannot be processed by jsonlite::fromJSON(). The auxiliary function to get and
    process information from Zenodo now ensures that newline characters are escaped such that
    they can be processed.
  • The corpus_copy() function did not set the path to the info file to the new data directory - corrected.
  • The corpus_install() function failed when the registry_dir got a NULL value from the default call to cwbtools::cwb_registry_dir(). But if the directories are created, the registry directory is there. Fixed.
  • Removed a bug (faulty assignment) that prevented the path of a registry file
    from being handled correctly (i.e. wrapped in quotation marks) by registry_file_compose() when the
    path includes any whitespace characters.
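
The quoting rule from the last item can be illustrated as follows. The helper quote_path_if_needed() is hypothetical and only mimics the behaviour described for registry_file_compose(); the paths are made up.

```r
# Hypothetical helper: wrap a path in quotation marks if it contains whitespace.
quote_path_if_needed <- function(path) {
  needs_quotes <- grepl("[[:space:]]", path)
  ifelse(needs_quotes, sprintf('"%s"', path), path)
}

quote_path_if_needed("/data/cwb/registry")        # returned unchanged
quote_path_if_needed("/data/my corpora/registry") # wrapped in quotation marks
```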

DOCUMENTATION FIXES

  • A problem with updating the curl dependency of cwbtools that may arise when devtools::install_github() is used is addressed in an extended explanation in the README.md file how to install the development version of cwbtools using remotes::install_github() (#21).

Late Vintage

10 Dec 09:00

This is a minor release that anticipates an upcoming change in R's matrix class, which will inherit from the array class starting with R 4.0.

MINOR IMPROVEMENTS

  • The pkg_add_corpus() function will now create the cwb directories (registry and data directory) if necessary. Previously, these directories were required to exist before moving a corpus into a package, making it necessary to put dummy files into packages to keep R CMD build from issuing warnings and git from dropping these directories. Creating the directories on demand is a precondition for a CRAN release of data packages (#11).

BUG FIXES

  • In the upcoming R version 4.0, the matrix class will inherit from class array. The new package version now takes into account that length(class(matrix(1:4,2,2))) will return the value 2.
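
The consequence of this change for code that tests the class of an object can be sketched in a few lines of base R. The snippet is illustrative and not taken from the package.

```r
m <- matrix(1:4, 2, 2)

# Fragile: in R >= 4.0.0, class(m) is c("matrix", "array"), so a comparison
# such as class(m) == "matrix" yields a logical vector of length 2.
length(class(m))

# Robust in old and new R versions
is.matrix(m)
inherits(m, "matrix")
```

Using is.matrix() or inherits() avoids any assumption about the length of the class vector.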

DOCUMENTATION FIXES

  • The NEWS file now follows the styleguide such that pkgdown::build_site() will generate a proper changelog page.