Releases: PolMine/polmineR

Nested Boxes

01 Sep 09:39

New features

  • Using the corpus class throughout is an opportunity to keep the corpus ID
    together with the registry directory of a corpus. And as we are able now to
    handle corpora defined in different registry files, the temporary registry
    directory is not necessary any more. It still exists, yet only for temporary
    corpora and corpora that are described by registry files that cannot be
    modified, i.e. corpora shipped in packages. The test corpus of the polmineR
    package is an important case in point.
  • get_token_stream() now has an argument min_length.
  • registry_*() functions are superseded by RcppCWB::corpus_* functions and
    throw a warning that they are deprecated.
  • The REUTERS corpus is not included in the package any more: There was an
    identical copy of the REUTERS corpus included in the RcppCWB package. All
    examples and unit tests now use use(pkg = "RcppCWB", corpus = "REUTERS") to
    make the REUTERS corpus available.
  • size() works for partition/subcorpus with s-attribute that is a child
    of the s-attribute the object is based on #216.
  • The trim()-method for context objects has a new argument fn for
    supplying a (trimming) function to be applied to all match contexts.
  • A new s-attribute "protocol_date" has been added to sample corpus
    "GERMAPARLMINI", so that sample data for nested corpus data is available. To
    prevent confusion between s-attributes "protocol_date" (at protocol-level) and
    "date" (at speaker-level), argument s_attribute_date is stated explicitly in
    all examples.
  • Method size() has been refactored to work with nested corpora.
  • Method encoding() and replace method encoding<- are defined for call
    and quosure objects to get and adjust the encoding, replacing a previously
    unexported function .recode_call().
  • The subset() methods for corpus and subcorpus objects now handle
    expressions for subsetting as quosures, laying the ground to program against
    subset(), see respective update of the examples, #212.
  • Indexing bundle objects with single square brackets has been reworked.
    Indexing with double brackets while supplying multiple values for i is
    deprecated. The aim is consistent behavior: indexing a bundle with [ will
    always return a bundle, and indexing with [[ will always get a single object
    from the list of objects. #214
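
As a rough sketch of what the REUTERS and subset() entries above mean in practice, the following assumes that polmineR (this release or later) and RcppCWB are installed. The use() call is the one quoted in the notes; the s-attribute "places", the value "kuwait" and the variable name where are assumptions made for illustration, and unquoting with !! is assumed to work because expressions are handled as quosures.

```r
library(polmineR)

# Make the REUTERS corpus shipped with RcppCWB available
# (the call quoted in the release notes).
use(pkg = "RcppCWB", corpus = "REUTERS")

reuters <- corpus("REUTERS")
size(reuters) # number of tokens in the corpus

# Non-standard evaluation: the expression refers to s-attributes.
# Assumption: REUTERS has an s-attribute "places" with a value "kuwait".
kuwait <- subset(reuters, places == "kuwait")

# Programming against subset(): unquote a variable with !!
# (assumed to work as expressions are captured as quosures).
where <- "kuwait"
kuwait_again <- subset(reuters, places == !!where)
```

If the assumed s-attribute or value does not exist in the installed version of the corpus, subset() may not return a usable subcorpus, so the result should be checked before further processing.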

Minor improvements

  • The use() function now has an additional argument corpus to specify which
    corpus from a package shall be loaded (#138).
  • The get_token_stream()-method for partition_bundle objects is more memory
    efficient (no exhaustion for big corpora) and faster.
  • Significantly improved performance of split()-method for corpus objects.
  • The split()-method for corpus objects offers a progress bar.
  • as.speeches() for corpus objects has new argument subset, offering a
    significantly faster approach than the method for subcorpus objects in many
    cases.
  • The size() method will return NA and issue a telling warning if the slots
    corpus and registry_dir of the corpus object are not filled #222.
  • get_token_stream() will return a list of integer values if decode is
    TRUE (#213).
  • After applying trim() on a context object using arguments positivelist
    or negativelist, the count slot as reported by length was not updated.
    Fixed. (#220)
  • The enrich() method for context objects has a new argument stat for
    creating / updating the data.table in the slot stat.
  • Method subset() for subcorpus objects has been debugged to work with
    nested corpora.
  • New option polmineR.mdsub configures substitutions that are applied on
    markdown documents to prevent presence of characters that would be
    misinterpreted as formatting instructions. Fixes #166.
  • The messages issued by check_cqp_query() now include a hint that argument
    check can be used to omit checking the CQP syntax to prevent false positives.
    Addresses #171.

Bug fixes

  • The ability of cooccurrences() (and context()) to process more than one
    p-attribute had been lost temporarily. Fixed. #208.
  • Removed a bug for hits() method for partition objects #215.
  • After applying trim() on a context object using arguments positivelist
    or negativelist, the count statistics reported in the stat slot were not
    updated. Fixed. (#220)
  • Structural attributes do not disappear any more after adding tooltips to a
    kwic object #218.
  • Method subset() would not work reliably with argument regex if more than
    one expression is passed #212. Fixed.
  • terms() did not work for subcorpus objects. Fixed. #209
  • When applying as.speeches() on a subcorpus, the date may have been missing
    from the object names. Fixed. #219
  • Fixed an issue that minNchar in the noise() method would work exactly the
    way opposite to the way intended #211.
  • The slot registry_dir of a cooccurrences_bundle derived from a
    partition_bundle was not filled, resulting in an error of the show()-method
    for the cooccurrences_bundle. Fixed #222.

Documentation

  • The documentation of the cooccurrences() method now includes example code
    for creating a table using DT::datatable() with buttons for exporting tables
    (to Excel, for instance).

Yellow Submarine

05 May 21:16

New Features

  • The dispersion() method now accepts an argument fill, a logical value to
    explicitly control whether zero matches for a value of a structural
    attribute should be reported (#160). The performance of adding columns (required only if
    two structural attributes are provided) is improved substantially by using the
    reference semantics of the data.table package. If many columns are added at once,
    a warning issued by the data.table package is supplemented by a further
    explanatory warning of the polmineR package. Filling up the data.table was
    previously limited to freq = FALSE; this limitation is lifted.
  • The html() method is implemented for remote_subcorpus objects.
  • The hits() method is implemented for remote_corpus and remote_subcorpus
    class (#160).
  • A new S4 class ranges is introduced to manage ranges of corpus positions for
    query matches. This is a preparatory step to remove an inconsistency from the
    hits class that mixed two very different usages (getting ranges of corpus
    positions for matches and getting counts).
  • A new S4 method ranges serves as the constructor to prepare a ranges class
    object. In combination with as.data.table(), it replaces former functionality
    of hits() without argument s_attribute.
  • The output of the hits() method is altered, making it much more consistent
    than previously: The method will consistently return a hits object.
  • The method hits() has a new argument fill that will report zeros for
    combinations of s-attributes with no matches for a query.
  • The argument subset for the subset method for remote_corpus objects can
    now be a call (#162); this is a basis for passing vectors to an OpenCPU server.
  • p_attributes() is implemented for remote_corpus and remote_partition.
  • A new regions() method (for corpus class objects to start with) returns a
    regions class object with a regions matrix (slot cpos) with regions for an
    s-attribute (#176).
  • The get_token_stream()-method for regions and matrix objects will now
    accept a logical argument split. If TRUE, a list of character vectors is
    returned. The envisaged use case is a fast decoding of sentences (#176).
  • An encoding() method has been defined for the case that argument object is missing.
    Calling encoding() will return the session character set. If it cannot be
    determined using localeToCharset(), a UTF-8 session charset will be assumed.
    Internally, encoding() replaces a direct call of localeToCharset() to avoid
    errors that have occurred on GitHub Actions with Ubuntu 20.04 (#188).
  • If the session character set cannot be guessed by localeToCharset() (NA
    return value), a startup message will issue a warning that 'UTF-8' is assumed
    (#188).
  • The size() method is now able to handle nested s-attributes.
  • The trim() method for context objects will now accept a matrix with ranges
    as positivelist argument.
  • The highlight() method now accepts matrix objects as elements of the list
    of items to be highlighted. A matrix is treated as a set of regions, such as resulting
    from cpos(). Thus it is possible to highlight matches for CQP queries.
  • The package now requires at least RcppCWB v0.5.2, which includes a much more
    efficient worker for token contexts for the context() method.
  • The count()-method for partition_bundle objects failed with an opaque
    error message if there were no query matches at all. There is now a check for
    this scenario and the expected table is returned (zero values throughout).
  • The corpus class is now a superclass for the textstat class, starting to
    create a more coherent class structure in general. This is an important
    preparatory step to be able to keep all registry files in the temporary registry
    directory. To avoid a confusion in the class system resulting from the coerce
    method from partition to corpus objects, this coerce method (defined by
    setAs()) has been removed. The get_template()-method for partition objects
    using this coerce method has been removed - as it inherits the method anyway, it
    is not needed any more. See #201.
  • The kwic tab of the shiny app included in the package exposes the improved
    capabilities to determine the context of a query match based on an s-attribute
    (argument region) and to consider the changing value of an s-attribute as
    a boundary of a context (argument boundary). New menu "boundary" and radio
    buttons, conditional on presence of s-attributes "s" and/or "p".

Minor Improvements

  • If arguments sAttribute or pAttribute (instead of s_attribute and
    p_attribute) are still used with dispersion() method, a warning is issued
    declaring that the argument is deprecated.
  • Examples in packages that depend on polmineR would have faced the issue that
    loading/re-loading the package in several examples would not be possible, as the
    mechanism of cleaning up between examples would trigger a removal of polmineR's
    temporary directories but not their re-creation. Removing temporary files is now
    moved from polmineR's .onDetach() to .onUnload() (#164).
  • Significant improvement of the performance of the as.phrases() method (#172).
  • The as.corpusEnc() auxiliary function will now check whether non-convertible
    characters lead to an NA result and issue a warning explaining how this can be
    avoided (#151).
  • Significant performance improvement of the context() method for matrix
    objects if arguments left and right are named integer vectors. All
    context() methods benefit from the improved performance of this worker for
    creating contexts for query matches.
  • New coerce-method to derive matrix with ranges from a context object.
  • The enrich() method for context objects will now perform an in-place
    operation when adding new s-attributes.
  • The as.cqp() function includes arguments check and warn for running
    check_cqp_query() on queries.
  • The context() method for matrix objects includes a new argument boundary
    and relies on a new function RcppCWB::region_matrix_context().
  • Default value of argument verbose of context()-methods is now FALSE.
  • The as.corpusEnc() auxiliary function now includes a test whether input
    character vector includes unexpected encodings and issues a warning if this is
    the case.
  • The cpos() method will now check for accidental leading and/or trailing
    whitespace and remove it for token lookup. Note that hits(), count() and
    dispersion() will report queries without removing whitespace.
  • Internals of the count()-method for partition_bundle objects will be much
    more efficient when many columns with zero matches need to be added. The
    implementation avoids a data.table warning when the bulk action of adding new
    columns exceeds the number of columns reserved by data.table objects.
  • The DESCRIPTION file does not state "LazyData: yes" any more, as the package
    does not have a data directory.
  • Typo in messages of trim() is removed (#197).
  • encoding() relies on l10n_info() before using localeToCharset() as a
    matter of performance and robustness (#196).
  • Class corpus has a new slot registry_dir. This is a preparatory step that
    will facilitate managing corpora described by registry files in different
    registry directories.
  • Constructor corpus() for corpus-class objects has an argument
    registry_dir that will be required to distinguish corpora described by
    registry files in different registry directories.
  • The package now relies on the fs package to handle directories and paths.
    Slots in S4 classes are now fs_path classes.
  • Internally, functions registry_get_home() and registry_get_encoding() have
    been replaced by RcppCWB functions cl_charset_name() and corpus_data_dir()
    with equivalent result, but faster due to immediate access to C representation
    of the corpus.
  • The corpus() method will deduce the registry directory from the C representation
    of the corpus if possible.
  • An inefficiency in the implementation of as.markdown() has been removed,
    making fulltext display (using read() or html()) much faster.
  • Calling corpus() without any arguments now returns an expanded data.frame
    reporting all slots of the corpus class objects, skipping only the data
    directory of the corpus.
  • The cpos() method for matrix objects that turns a matrix with corpus
    positions into a vector of integer values now relies on a C-level
    implementation newly included in the RcppCWB package, that is significantly
    faster than the best possible implementation in R.
  • The table generated by kwic() shows row numbers, which is convenient
    when referring to specific rows (#184).
  • The as.cqp() now checks whether argument query meets the expectation that
    it is a query (#191).
  • The method make_region_matrix(), which has been used internally only, has
    been removed. RcppCWB::s_attr_regions() replaces the functionality.
  • The as.speeches() method had not yet been implemented for nested corpora. A
    limited rewrite makes this work now (#198).
  • Inconsistencies and unnecessary limitations of the get_token_stream() method
    for partition_bundle objects have been addressed: Multiple p-attributes can be
    used without providing phrases at the same time (#142) and using the subset
    argument does not depend on using phrases either (#141).
  • The as.sparseMatrix() method is now also defined for DocumentTermMatrix
    objects (was available previously only for TermDocumentMatrix objects).
  • If a vector of queries is named, these names are now used consistently by the
    hits() method (#195).
  • get_type() for subcorpus_bundle returns NULL if no type is defined as a
    matter of consistency (#169).
  • If an expression for subsetting a corpus/subcorpus includes invalid
    s-attributes, the warning is telling and NULL is returned (#179).
  • The cooccurrences option...

Putty Knive

29 Sep 15:04

New Features

  • A new decode() method for data.table objects shall serve as a more user-friendly access to the efficiency of the RcppCWB::cl_cpos2str() function.
  • The data.frame returned when calling corpus() will now include a column with the encoding of the corpus.

Bug fixes

  • The warn argument of the get_template()-method remained unused, so a warning was issued even if warn was FALSE, which resulted in a set of warning messages when calling corpus(). The argument is used as intended now and defaults to FALSE.
  • The as.markdown()-method for subcorpus objects now uses an (internal) default template accessible via polmineR:::default_template, if no template is defined for a corpus.
  • The registry_get_encoding() function returned a length-one character vector if the regular expression to extract the charset corpus property did not yield a match. To prevent errors, it now returns "latin1" as the CWB standard encoding (#159).

Unicorn Dream

23 Jul 11:14

Minor Improvements

  • The knit_print()-method for textstat objects does not accept the three dots argument any more. As an installation of pandoc is necessary to include resulting htmlwidget in an html document, the method will check now whether pandoc is available. If not, a formatted data.table is returned.
  • The knit_print()-method for kwic objects does not have the pagelength argument any more as it has been unused. The pagelength is controlled by the option polmineR.pagelength. Internally, the method will call the method for the textstat superclass of the kwic class, which is newly robust against a missing installation of pandoc.
  • Any Unicode characters that could be detected have been removed from the documentation to avoid warnings on the CRAN Solaris test machine (#156).

Bug Fixes

  • The chisquare() method needs to increase the number of digits temporarily, but failed to revert to the original value as expected. One implication was, that rounding the values in data.table objects would fail, and rounding in general yielded very strange results (#155). Fixed.
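
The underlying pattern, a temporary change of a global option that has to be undone, can be sketched in base R as follows. This is not the package's actual code: the function name with_more_digits() and the value 15 are hypothetical, and on.exit() is just one common way to make sure the original value is restored.

```r
# Hypothetical helper: call fn() with a temporarily raised number of
# digits and restore the caller's setting afterwards.
with_more_digits <- function(fn, digits = 15L) {
  old <- options(digits = digits)    # options() returns the previous values
  on.exit(options(old), add = TRUE)  # restore even if fn() throws an error
  fn()
}

before <- getOption("digits")
inside <- with_more_digits(function() getOption("digits"))
after <- getOption("digits")
```

After the call, after should equal before, so later calls to round() or print() are not affected by the temporary setting.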

Caterpillar Mambo

18 Dec 09:03

New Features

  • The corpus class has been put in a shape to become the default point of
    departure of most workflows. All core methods are now available for the
    corpus class, and have been implemented newly if necessary, e.g. show()
    and size()-method. The constructor method for a corpus object, the
    corpus() method, will now check whether the character vector with the corpus
    ID refers to an available corpus, whether all letters are upper case and
    issue informative warnings and error messages.
  • The s_attributes()-method for corpus objects has been reworked: It will decode
    binary files directly, without reliance on the corpus library functions, which is
    significantly faster.
  • The Corpus reference class is now obsolete after the introduction of the
    S4 corpus class. To maintain the functionality not covered otherwise,
    new generics get_info and show_info have been introduced and defined
    for the corpus class.
  • Methods available for the subcorpus class have been expanded so that this
    class can supersede the partition class: Methods newly available are
    cpos(), count(), p_attributes(), s_attributes(), get_token_stream(),
    and size(). Technically, there is a virtual slice-class, from which
    subcorpus inherits (methods called via callNextMethod()).
  • A new subset()-method for the corpus and subcorpus classes to generate subcorpora
    (i.e. subcorpus objects) has been introduced. It outperforms the
    partition() method. The subset()-method for corpus and subcorpus objects
    will be the default way to work with non standard evaluation in a manner that
    feels "R-ish" (#40).
  • The zoom()-method that has been introduced experimentally has
    been dropped again in favor of the subset()-method to get subcorpus objects
    from corpus and subcorpus objects. A set of experimental methods for an
    initial check of the feasibility of a non-standard evaluation approach to
    the generation of subcorpora has been dropped (methods $, ==, !=,
    zoom for corpus-class).
  • To facilitate the transition from the partition class (inheriting from
    the textstat class) to the subcorpus class (inheriting from the textstat
    class), there is a new coerce()-method to turn a partition object into
    a subcorpus object.
  • A new remote_corpus-class is the basis for accessing remote
    corpora. A remote_subcorpus can be derived from a remote_corpus. Methods
    available for remote corpora and subcorpora remain limited at this stage.
  • Consolidation of the class system: For all the S4 classes in the package, multiple
    contains have been checked, and multiple contains have been removed.
  • The subcorpus_bundle class now inherits from partition_bundle. This is not
    intended to be a long-term solution, but facilitates the implementation of new
    workflows based on the subcorpus class rather than the partition class.
  • Calling the polmineR shiny app via polmineR did not have safeguards if
    the suggested packages shiny and shinythemes were not installed. Now
    there will be a conditional installation of the packages required for running
    the shiny app.
  • The somewhat odd class CorpusOrSubcorpus has been removed. The ngrams-method
    now applies for corpus and subcorpus objects.
  • The pipe operator of the magrittr package is imported now, and magrittr has moved
    from a suggested package to a required package.
  • The label()-method, present for a while, is superseded by an edit()-method now.
    It will call a shiny gadget either using DataTables or Handsontable. The former
    Labels reference class has been turned into an S4 class, because the
    desired reference logic can also be achieved with a data.table in a slot of
    the labels class.
  • The table-slot of the kwic class has been renamed as stat slot (a data.table),
    so that the kwic class can now inherit from the textstat class. The
    enrich()-method for objects of class kwic now includes a new argument
    extra that will add extra tokens to the left of the windows for concordances so
    that qualitative inspections for query hits can work with more context.
  • The as.TermDocumentMatrix() and the as.DocumentTermMatrix()-methods are now
    also defined for kwic objects. They work exactly the same as for the context
    class. To avoid having to write new methods, a new neighborhood virtual class has
    been introduced. The aforementioned methods are defined for the virtual class and
    are available for context and kwic class objects.
  • Added CQP functionality to count tab in shiny app, and to the dispersion tab.
  • There is now a basic implementation of get_token_stream() for a partition_bundle
    object.
  • The Cooccurrences()-method is now available for subcorpus-objects (#88).
  • There is a new coerce method to turn a kwic-object into a context-object.
    The neighborhood virtual class could be discarded again, and a bug could be removed
    that left an enrich()-operation for kwic objects (argument p_attribute)
    ineffectual (#103).

Minor changes

  • Added a new argument regex to the cpos()-method (for corpus objects), which
    will interpret argument query as a regular expression. This may be faster than
    taking query as an outright CQP query.
  • The configure-script in the package that would adjust paths in the registry files
    for the corpora included in the package for documentation and testing purposes has
    been removed. Having switched to a temporary registry directory, it has lost
    its function.
  • The version of the data.table package now required is 1.12.2, because previous
    versions did not allow adding columns to a new data.table.
  • Implemented the possibility to use multiple queries in dispersion-method (#92).
  • To keep up with the renaming of functions and arguments in the package, "sAttributes"
    and "pAttributes" in the polmineR shiny app have been renamed ("s_attributes",
    and "p_attributes", respectively).
  • The shiny app module for kwic output will not show p_attribute and positivelist
    by default.
  • The format()-method is used to create proper output in the cooccurrences of the
    shiny app.
  • User names that include non-ASCII characters were a persistent problem on Windows
    machines (#66). The solution now is to check for non-ASCII characters in the path
    to the data directory, and to use the "old" short DOS path if necessary. The worker is
    a modified registry()-function.
  • The ordering of the table for ll-method had been somewhat mixed up, which is repaired
    now. Tokens with NA values for the ll-test will show up at the end of the table.
  • The registry_move()-function, used only internally at this stage, is exported now
    so that it can be used by other packages.
  • The return value of the get_token_stream()-method for regions objects was a
    data.table. The behavior is now in line with the other get_token_stream() methods.
  • The tempcorpus()-method and the tempcorpus class have been removed from the package,
    having become utterly deprecated.
  • The summary()-method for partition-class objects has been turned into a method
    for the count-class, to eliminate an inconsistency. The example of a workflow has been
    moved to the documentation object for the count-class.
  • The browse()-method has not proven to be useful and has been removed from the package.
    A new browse()-function is introduced to throw a warning, if browse should be
    called nevertheless.
  • A refactoring of the split()-method for partition-objects improved the readability
    of the code, but the performance gain is minimal.
  • A new kwic_bundle-class has been introduced, a list of kwic objects can be turned
    into this new class using as.bundle.
  • The context()-method will now take again as input character vectors for the arguments
    left and right to expand to the left and right boundaries of the designated
    region (#87).
  • Rework of the way messages are printed to make it easy to implement notifications in
    the shiny environment.
  • Default highlighting when a positivelist is supplied has been removed from the
    kwic()-method. This ensures that subsequent highlighting operations can assign
    new colors (#38).
  • Implemented feature request for dispersion() that results are reported for all
    values of structural attributes, including those with zero matches. (#104)
  • Performance improved for the cpos-method for matrix which unfolds a matrix with regions
    of corpus positions, useful for operations that require many calls.
  • The count-method for partition_bundle has been reworked and is much faster and more
    memory efficient.
  • as.TermDocumentMatrix() for partition_bundle optimized to work efficiently
    with large corpora.
  • Introduction of a context,matrix-method to have a unified auxiliary function
    to create contexts.
  • The as.corpusEnc()-function uses the localeToCharset()-function from the utils
    package to determine the charset of input strings. On RStudio Server, we have seen
    cases when the return value is NA. Then it will be assumed that the locale is UTF-8.
  • Functionality to highlight terms in kwic display has been restored for the shiny app.
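
The fallback described above for as.corpusEnc() can be sketched as follows. This is a minimal illustration rather than the package's implementation; the function name guess_session_charset() is hypothetical, while localeToCharset() is the utils function named in the entry.

```r
# Hypothetical helper: guess the charset of the session, assuming UTF-8
# if localeToCharset() cannot determine it (NA return value).
guess_session_charset <- function() {
  charset <- utils::localeToCharset()[1]
  if (is.na(charset)) "UTF-8" else charset
}

session_charset <- guess_session_charset()
```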

Bug fixes

  • Removed a bug in the context()/kwic() method that led to superfluous words in the
    right context.
  • Removed a bug that occurred with the as.data.frame()-method for kwic-objects
    when no metadata were added.
  • The count()-method for partition_bundle-objects did not perform iconv() if
    necessary - this has been corrected.
  • Indexing the concord...

Bright Side

15 Jan 12:24

polmineR 0.7.11

NEW FEATURES

  • A Cooccurrences()-method and a Cooccurrences-class have been migrated from the (experimental) polmineR.graph package to polmineR to generate and manage all cooccurrences in a corpus/partition. A cooccurrences()-method produces a subset of a Cooccurrences-class object and is the basis for ensuring that results are identical.
  • New functionality to make using corpora more robust when paths include special characters: There is now a temporary data directory which is a subdirectory of the per-session temporary directory. A new function data_dir() will return this temporary data directory. The use()-function will now check for non-ASCII characters in the path to binary corpus data and move the corpus data to the temporary data directory (a subdirectory of the directory returned by data_dir()), if necessary. An argument tmp added to use() will force using a temporary directory. The temporary files are removed when the package is detached.
  • Experimental functionality for a non-standard evaluation approach to create subcorpora via a zoom()-method. See documentation for (new) corpus-class (?"corpus-class") and extended documentation for partition-class (?"partition-class"). A new corpus()-method for character vector serves as a constructor. This is a beginning of somewhat re-arranging the class structure: The regions-class now inherits from the new corpus-class, and a new subcorpus-class inherits from the regions-class.
  • A new function check_cqp_query() offers a preliminary check whether a CQP query may be faulty. It is used by the cpos()-method, if the new argument check is TRUE. All higher-level functions calling cpos() also include this new argument. Faulty queries may still cause a crash of the R session, but the most common source is prevented now, hopefully.
  • A format()-method is defined for textstat, cooccurrences, and features, moving the formatting of tables out of the view() and print()-methods. This will be useful when including tables in Rmarkdown documents.

MINOR IMPROVEMENTS

  • Startup messages reporting the package version of polmineR and the registry path are omitted now.
  • The functions registry() and data_dir() now accept an argument pkg. The functions will return the path to the registry directory / the data directory within a package, if the argument is used.
  • The data.table-package used to be imported entirely, now the package is imported selectively. To avoid namespace conflicts, the former S4 method as.data.table() is now a S3 method. Warnings appearing if the data.table package is loaded after polmineR are now omitted.
  • The coerce()-methods to turn textstat, cooccurrences, features and kwic objects into htmlwidgets now set a pageLength.
  • New methods for partition_bundle objects: [[<-, $, $<-
  • Rework of indexing textstat objects.
  • A slot p_attribute has been added to the kwic-class; kwic()-methods and methods to process kwic-objects are now able to use the attribute thus indicated, and not just the p-attribute "word".
  • A new size()-method for context-objects will return the size of the corpus of interest (coi) and the reference corpus (ref).
  • New encoding()-method for character vector.
  • New name()-method for character vector.
  • A new count()-method for context-objects will return the data.table in the stat-slot with the counts for the tokens in the window.
  • The decode()-function replaces a decode()-method and can be applied to partitions. The return value is a data.table which can be coerced to a tibble, serving as an interface to tidytext (#37).
  • The ngrams()-method will work for corpora, and a new show()-method for textstat-object generates a proper output (#27).

BUG FIXES

  • Any usage of tempdir() is wrapped into normalizePath(..., winslash = "/"), to avoid mixture of file separators in a path, which may cause problems on Windows systems.
  • In the calculation of cooccurrences, the node has previously been included in the window size. This has been corrected.
  • The kwic()-method for corpora returned one surplus token to the left and to the right of the query. The excess tokens are now removed.
  • The object returned by the kwic()-method for character objects did not include the correct position of matches in the cpos slot. Corrected.
  • Bug removed that occurred when the context window reached beyond the beginning or end of a corpus (#48).
  • When generating a partition_bundle using the as.speeches()-method, an error could occur when an empty partition had been generated accidentally. This has been fixed. (#50)
  • The as.VCorpus()-method is not available if the tm-package has been loaded previously. A coerce method (as(OBJECT, "VCorpus")) solves the issue. The as.VCorpus()-method is still around, but serves as a wrapper for the formal coerce-method (#55).
  • The argument verbose as used by the use()-method did not have any effect. Now, messages are not reported as would be expected, if verbose is FALSE. On this occasion, we took care that corpora that are activated are now reported in capital letters, which is consistent with the uppercase logic you need to follow when using corpora. (#47)
  • A new check prevents an error that has occurred when a token queried by the context()-method would occurr at the very beginning or very end of a corpus and the window would transgress the beginning / end of the corpus without being checked (#44).
  • The as.speeches()-function caused an error when the type of the partition was not defined. Solved (#57).
  • To deal with issues resulting from an unset locale, there is a check during startup whether the locale is unset (i.e. 'C') (#39).
  • It was not possible to generate a TermDocumentMatrix from a partition_bundle reliably if the partitions in the partition_bundle were not named. The fix is to assign integer numbers as names to the partitions (#58).
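
The tempdir() fix listed first among the bug fixes above can be illustrated with base R alone. This is a minimal sketch, not the package's internal code; the subdirectory name "registry" is an assumption chosen for illustration.

```r
# Base R only; "registry" is a hypothetical subdirectory name.
# On Windows, tempdir() typically returns a backslash-separated path, while
# file.path() joins components with "/". Normalising first keeps the
# separators uniform. On other systems, winslash has no effect.
session_dir <- normalizePath(tempdir(), winslash = "/")
registry_dir <- file.path(session_dir, "registry")

print(registry_dir)
```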

DOCUMENTATION FIXES

  • Substantial rework of the documentation of the ll()- and chisquare()-methods to make the statistical procedure used transparent.
  • Expanded documentation for cooccurrences()-method to explain subsetting results vs applying positivelist/negativelist (#28).
  • Wrote some documentation for the round()-method for textstat-objects that will show up in documentation of textstat class.
  • Improved documentation of the mail()-method (#31).
  • In the examples for the decode()-function, using the REUTERS corpus replaces the usage
    of the GERMAPARLMINI corpus, to reduce time consumed when checking the package.

Bachelor's Delight

01 Oct 18:00

polmineR 0.7.10

NEW FEATURES

  • The package now offers a simplified and seamless workflow for dictionary-based sentiment analysis: The weigh()-method has been implemented for the classes count and count_bundle. Via inheritance, it will also be available for the partition- and partition_bundle-classes. Then, a new summary()-method for partition-class objects is introduced. If the object has been weighed, the list that is returned will include a report on weights. There is an example that explains the workflow.
  • The partition_bundle-method for context-objects has been reworked entirely (and is working again);
    a new partition-method for context-objects has been introduced. Both steps are intended for workflows for dictionary-based sentiment analysis.
  • The highlight()-method is now implemented for class kwic. You can highlight words in the neighborhood of a node that are part of a dictionary.
  • A new knit_print()-method for textstat- and kwic-objects offers a seamless inclusion of analyses in Rmarkdown documents.
  • A coerce()-method to turn a kwic-object into an htmlwidget has been singled out from the show()-method for kwic-objects. Now it is possible to generate an htmlwidget from a kwic object, and to include the widget in an Rmarkdown document.
  • A new coerce()-method to turn textstat-objects into an htmlwidget (DataTable), very useful for Rmarkdown documents such as slides.
  • A new argument height for the html()-method allows defining a scroll box. This is useful for embedding fulltext output in an Rmarkdown document.
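
As a rough illustration of what dictionary-based weighing of counts amounts to, the following base R sketch joins token counts with a weight dictionary and sums the weighted counts. It deliberately does not call the weigh()- or summary()-methods of the package; the data, the column names and the scoring rule are assumptions made up for this example.

```r
# Made-up token counts, standing in for the counts held by a count object
counts <- data.frame(
  word = c("good", "bad", "oil", "excellent"),
  count = c(3L, 2L, 10L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical sentiment dictionary: positive terms 1, negative terms -1
weights <- data.frame(
  word = c("good", "excellent", "bad"),
  weight = c(1, 1, -1),
  stringsAsFactors = FALSE
)

# Left join: keep all counted tokens; tokens not in the dictionary get weight 0
weighed <- merge(counts, weights, by = "word", all.x = TRUE)
weighed$weight[is.na(weighed$weight)] <- 0

# A simple report: sum of weighted counts, and its share of all tokens
score <- sum(weighed$count * weighed$weight)
share <- score / sum(weighed$count)
print(weighed)
print(score)
```

Tokens missing from the dictionary are kept with a weight of 0 so that the total token count stays available as a denominator.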

MINOR IMPROVEMENTS

  • The partition_bundle-class, rather than inheriting from bundle-class directly, will now inherit from the count_bundle-class.
  • The use()-function is limited now to activating the corpus in data packages. Having introduced the session registry, switching registry directories is not needed any more.
  • The as.regions()-function has been turned into an as.regions()-method to have a more generic tool.
  • Some refactoring of the context-method, so that full use of data.table speeds up things.
  • The highlight()-method allows definitions of terms to be highlighted to be passed in via three dots (...);
    no explicit list necessary.
  • A new as.character()-method for kwic-class objects is introduced.

BUG FIXES

  • The size_coi-slot (coi for corpus of interest) of the context-object included the node; the node (i.e. matches for queries) is excluded now from the count of size_coi.
  • When calling use(), the registry directory is reset for CQP, so that the corpora in the package that have been activated can be used with CQP syntax.
  • The script configure.win has been removed so that installation works on Windows without an installation of Rtools.
  • Bug removed from s_attributes()-method for partition-objects: "fast track" was activated without preconditions.
  • Bug removed that would swallow metadata/s-attributes to be displayed in kwic-output after highlighting.
  • As a matter of consistency, the argument meta has been renamed to s_attributes for the kwic()-method for context-objects, and for the enrich()-method for kwic-objects.
  • To avoid confusion (with argument s_attributes), the argument s_attribute to check for integrity within
    a struc has been renamed into boundary.
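
The size_coi fix in the list above (excluding the node from the count) can be sketched in base R. The helper below is hypothetical and not part of the package; it assumes zero-based corpus positions, as used by CWB, and also clips the window at the corpus boundaries.

```r
# Hypothetical helper, not part of polmineR: corpus positions of the context
# window around a match, excluding the node and clipped to corpus bounds.
context_window <- function(match_left, match_right, left, right, corpus_size) {
  # Zero-based corpus positions: valid values are 0 .. corpus_size - 1
  lower <- max(0L, match_left - left)
  upper <- min(corpus_size - 1L, match_right + right)
  left_cpos <- if (lower < match_left) seq.int(lower, match_left - 1L) else integer()
  right_cpos <- if (upper > match_right) seq.int(match_right + 1L, upper) else integer()
  c(left_cpos, right_cpos)
}

# Two-token match in the middle of a corpus of 100 tokens, window of 5 each side
mid <- context_window(50L, 51L, left = 5L, right = 5L, corpus_size = 100L)
length(mid)  # 10 tokens: the two node positions are not counted

# Match at the very beginning: the left context is empty instead of negative
start <- context_window(0L, 0L, left = 5L, right = 5L, corpus_size = 100L)
```

Summing the lengths of these windows over all matches would give a corpus-of-interest size that leaves out the node.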

DOCUMENTATION FIXES

  • Documentation for kwic-objects has been reworked thoroughly.

Jeanne d'Arc

09 Jul 14:17

The most visible change of polmineR v0.7.9 may be that the package moves to a snake_case coding style. This is increasingly the state-of-the-art, and feels much more intuitive when working with the arguments 's_attributes' and 'p_attributes' (rather than pAttributes and sAttributes). Functions/methods are fully backwards compatible, so old code should not break.

The package now uses a session registry directory, which is a subdirectory of the temporary session directory. This has become mandatory, because CRAN policies do not allow resetting paths within a package once it has been installed. But it is very useful, because switching registry directories can now be avoided. The use()-function will now add the corpora in an R data package to the session registry. So this is a good start to work with multiple corpora wrapped in various packages. This involves a set of new functions:

  • A (new) registry_move()-function is used to copy files to the tmp registry;
  • The (new) registry()-function will get the temporary registry directory;

A set of changes makes working with bundle objects more versatile and robust:

  • There is a new as.list()-method for bundle objects, to access the list in the slot objects;
  • as.bundle() is more generic now, so that any kind of object can be coerced to a bundle now;
  • The as.speeches()-method has been turned into a function that accepts a partition or a corpus as input.

The new version upgrades the count-class. So the count()-method will serve as a constructor for a count object, if no query is provided. This is particularly useful when working with count_bundle-objects.

Minor new features

  • There is a new is.partition()-function (a logical check);
  • A new argument 'type' has been added to partition_bundle()-method;
  • A new method get_type() has been introduced to make getting the corpus type more robust;
  • A new partition_bundle()-method for partition_bundle-objects has been introduced;

Bug fixes

  • s_attributes() for partition-objects in line with RcppCWB requirements (no negative values of strucs);
  • count() repaired for multiple p-attributes;
  • bug removed causing a crash for as.markdown()-method when cutoff is larger than number of tokens;
  • a bug removed that prevented the name<- method from working properly for bundle objects;
  • for count() for partition_bundle-objects, the column 'partition' will be a character vector now (not factor)
  • bug removed that has caused a crash when cutoff is larger than number of tokens in a partition when calling get_token_stream

Enjoy!

Panda Belly

18 May 16:26
  • upon loading the package, new check that data directories are set correctly in registry files to make sure that sample data in pre-compiled packages can be used
  • startup messages adjusted slightly
  • first version that works with sample data without complications

v0.7.5

04 Oct 19:57
  • class 'Regions' renamed to class 'regions' as a matter of consistency
  • data type of slot cpos of class 'regions' is a matrix now
  • rework and improved documentation for decode- and encode-methods
  • new functions copy.corpus and rename.corpus
  • as.DocumentTermMatrix-method checks for strucs with value -1
  • improved as.speeches-method: reordering of speeches, default values
  • blapply-method: verbose output will be suppressed if progress is TRUE