Releases: stanfordnlp/CoreNLP

v4.5.7 - Constituency to Dependency Converter Upgrades

28 Apr 05:36

UD converter upgrades

Inspired by UniversalDependencies/docs#717, although the work is not finished

  • Add an option to use the PTBCorrector, which fixes many (although not all) incorrect POS tags 5e57eab
  • Treat "sort of" the same as "kind of" bc4acf1
  • "en masse" is flat cb338cd
  • "dinna" is an MWT 1dd746c
  • Use AUX as the POS in the converter when appropriate 30f2f8e
  • Fix (heh) "all but" and "whether or not" 2513676
  • Dependency dep -> ccomp for fronted "say" verbs a76a854

Parser evaluation improvements

  • Include the F1 scores of each tree when scoring a constituency dataset 2725b06
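
For readers unfamiliar with per-tree scoring, here is a minimal sketch of an F1 score computed for a single tree from its labeled brackets. It is not CoreNLP's evaluation code: the class name, the helper, and the "LABEL:start:end" encoding are hypothetical, and duplicate brackets are ignored because sets are used.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class BracketF1Sketch {
    /** Builds a bracket set from strings such as "NP:0:2" (hypothetical encoding). */
    static Set<String> brackets(String... items) {
        return new HashSet<String>(Arrays.asList(items));
    }

    /** F1 of the predicted brackets against the gold brackets for one tree. */
    static double f1(Set<String> gold, Set<String> predicted) {
        Set<String> matched = new HashSet<String>(predicted);
        matched.retainAll(gold);  // brackets present in both sets
        if (matched.isEmpty()) {
            return 0.0;           // also covers an empty gold or predicted set
        }
        double precision = (double) matched.size() / predicted.size();
        double recall = (double) matched.size() / gold.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        Set<String> gold = brackets("S:0:3", "NP:0:1", "VP:1:3");
        Set<String> predicted = brackets("S:0:3", "NP:0:2", "VP:1:3");
        System.out.println(f1(gold, predicted));
    }
}
```

Reporting this number for each tree, rather than only the corpus total, makes it easier to find the sentences a parser gets most wrong.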

v4.5.6: Lemmatizer & Tokenizer bugfixes

01 Feb 20:39

English Lemmatizer upgrades

  • enroll, appall as American spellings, instead of enrol & appal. de- as a verb prefix, blog and xfer as double letter exceptions 8adcbfe
  • cowritten 2dd08da
  • elder / eldest 9b5bec8
  • Yazidi as a demonym 2852da8

Tokenizer upgrades

  • #number as a single thing after an abbreviation #1396 ad37f2a

UD Processing upgrades

  • 'twas and 'tis as MWT in the UD converter b9f19a6
  • Sort morpho features in alphabetical order when writing out UD f77a9b4
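
As an illustration of the alphabetical ordering of morpho features, the sketch below sorts a CoNLL-U style FEATS string by feature name. It is an assumption-laden stand-in, not the converter's code: the class and method names are hypothetical, and the case-insensitive comparison is assumed from the UD convention.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

final class MorphoFeatureSortSketch {
    /** Sorts "Name=Value|Name=Value" pairs by name; "_" means no features. */
    static String sortFeatures(String feats) {
        if (feats == null || feats.isEmpty() || feats.equals("_")) {
            return "_";
        }
        // TreeMap keeps keys ordered; case-insensitive order is an assumption here.
        Map<String, String> sorted = new TreeMap<String, String>(String.CASE_INSENSITIVE_ORDER);
        for (String pair : feats.split("\\|")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue;  // skip malformed entries in this sketch
            }
            sorted.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        StringJoiner joiner = new StringJoiner("|");
        for (Map.Entry<String, String> entry : sorted.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.length() == 0 ? "_" : joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortFeatures("Tense=Past|Mood=Ind|VerbForm=Fin"));
    }
}
```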

Other Bugfixes

  • Crash when deleting the endpoints of an IntervalTree #1405 6d17c23
  • Find and remove extraneous uses of yield, which became a keyword: e5c9d44 b084233
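
Some background on the yield change, as a hedged sketch: on Java 14 and later, a method may still be named yield, but calls to it must be qualified. The class below is hypothetical and only illustrates the language rule, not the code touched by the commits above.

```java
final class YieldNameSketch {
    private final String words;

    YieldNameSketch(String words) {
        this.words = words;
    }

    // Declaring a method named yield is still legal.
    String yield() {
        return words;
    }

    String describe() {
        // An unqualified call, yield(), is rejected by javac on Java 14 and later,
        // so the call is qualified with "this".
        return "yield: " + this.yield();
    }

    public static void main(String[] args) {
        System.out.println(new YieldNameSketch("the cat sat").describe());
    }
}
```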

Minor API change

  • Updating the text on a CoreLabel no longer wipes out the Lemma c03522b
  • Update to more recent Jakarta Servlet 8a671fd

Ssurgeon

  • UpdateMorphoFeatures edit 27c6703
  • Lemmatize operation (only works on English) c26b25e

v4.5.5: further Ssurgeon upgrades, SceneGraph server module, security bugfix

06 Sep 20:46

Ssurgeon updates beyond the capabilities listed in the GURT paper

  • MergeNodes operation: combine two words into one word in a graph. One word must be a leaf headed by the other for this to work 0660fa9
  • CombineMWT operation: mark MWT on two or more words. Stanza will treat these as Token 010a955
  • DeleteLeaf operation: remove a leaf, renumber the subsequent words 429f61a

Bugfixes

  • fix graph serialization for sentences longer than 128 words (IdentityHashSet doesn't work for integers beyond 128) d8d9d9f
  • fix valueOf for SemanticGraph if a word is just a dash 203eb06
  • fix memory usage of evaluating a PCFG model, which would run out of memory because it was saving all of the charts while evaluating b2e67b0
  • Tregex pattern would not correctly display when using optional patterns: a9965b2 8659653
  • Tregex would infinite loop on certain optional patterns which were theoretically legal cc7983e
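
The graph serialization bug above comes down to how Java boxes integers. A minimal sketch follows, using the JDK's IdentityHashMap as a stand-in for CoreNLP's IdentityHashSet; the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.IdentityHashMap;
import java.util.Set;

final class IntegerIdentitySketch {
    /** Looks up a boxed int in a set that compares elements with ==. */
    static boolean foundByIdentity(int value) {
        Set<Integer> seen = Collections.newSetFromMap(new IdentityHashMap<Integer, Boolean>());
        seen.add(Integer.valueOf(value));
        // A second boxing is the same object only if the value is in the Integer cache.
        return seen.contains(Integer.valueOf(value));
    }

    /** Looks up a boxed int in a set that compares elements with equals(). */
    static boolean foundByEquals(int value) {
        Set<Integer> seen = new HashSet<Integer>();
        seen.add(Integer.valueOf(value));
        return seen.contains(Integer.valueOf(value));
    }

    public static void main(String[] args) {
        // Values from -128 to 127 are always cached; larger values usually are not,
        // although the upper bound can be raised through JVM settings.
        System.out.println("127 by identity: " + foundByIdentity(127));
        System.out.println("128 by identity: " + foundByIdentity(128));
        System.out.println("128 by equals:   " + foundByEquals(128));
    }
}
```

An identity-based collection is therefore only safe for boxed integers inside the cached range, which is why word indices past that range went missing.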

Security fixes

English dependency converter fixes

  • addressing issue #1363
  • fix (QP up to ...) 8c46648 9a86ece
  • fix "up to 1700 kilograms" if misparsed in a predictable manner 6e14527
  • better LST coverage 5745de5
  • vmod/acl when the parser misinterprets NP vs NML ad4556d
  • treat lists of NML as repeated modifiers of a noun, instead of a list, as that is the likely meaning of NML. Example from PTB: "a 72-game, three-month season" 61ef545 5e748dc

Server features

  • Scenegraph endpoint 8b40947 #1346
  • remove one json library to reduce number of json libraries we depend on 357b1bb

Small changes

  • allow "fourty" as a number in SUTime 7fbb7b8
  • capture "forty (40) days" as a duration in SUTime b3c47a0
  • feature to print out the feature index of an NER model as a text file f636673
  • clarify the INTJ rule for the ChineseHeadFinder 56cd6bb
  • consider { } as punctuation when scoring English constituency treebanks a606afa
  • fix error in test case, from @tanloong #1373 #1372
  • dead code cleanup 86b6a03

v4.5.4: Minor Ssurgeon updates

16 Mar 01:23
  • Minor Ssurgeon bugfixes (make it harder to infinite loop with EditNode or RelabelNamedEdge)
  • Add a ReattachNamedEdge which is a combination of RemoveNamedEdge and AddEdge with new endpoints
  • include the Morphology CLI for using the CoreNLP lemmatizer from elsewhere, such as Python

v4.5.3: Ssurgeon interface, Collinizer fixes

11 Mar 05:40

Mostly changes to Semgrex, along with adding Ssurgeon to the download package for general consumption. This involved quite a few changes to classes such as AnnotationLookup. The released version should now match the Semgrex/Ssurgeon paper published at GURT 2023.

Ssurgeon / Semgrex

Bugfixes

  • Fix "Could not match" errors which occurred when scoring treebanks using a tagger that produces non-gold punct tags: #1344
  • Fix typo in KBP children rules: dbdb55b

Minor features

  • Add the choice of dependency graph to output to the TextOutputter 33e6c42 #1339
  • Hopefully minor interface change: make relation in SemanticGraphEdge final, get rid of setRelation e7a7657

v4.5.2: package dependencies, CLI additions

11 Mar 05:32

Bugfixes

  • Tokenize c'mon and $$$ 1e216de
  • Tokenize 'email' 76b5a6b #1316
  • Return empty mentions for empty document da08664 #1322
  • Fix CLI protobuf tools running too fast for some network conditions: 412da5c

CLI protobuf tools

  • Add output of lemmatizer to words 71bc95d
  • Convert constituency trees to dependencies b118082

Dependency updates

  • Protobuf 3.19.6 0439b62
  • xom 1.3.8, which no longer automatically includes xalan 3ded6f0

Semgraph / Semgrex improvements

  • Allow reuse of indices in SemanticGraph.valueOf cf97e36
  • Add Semgrex relations to match the capabilities introduced in Spacy 98be52a

v4.5.1: Bugfixes

30 Aug 04:13

CoreNLP 4.5.1

Bugfixes!

  • Fix tokenizer regression: 4.5.0 will tokenize ",5" as one word 974383a
  • Use a LinkedHashMap in the PTBTokenizer instead of Properties. Keeps the option processing order predictable. #1289 6550188
  • Fix \r\n not being properly processed on Windows: #1291 9889f4e
  • Handle one half of surrogate character pairs in the tokenizer w/o crashing #1298 1b12faa
  • Attempt to fix semgrex "Unknown vertex" errors which have plagued CoreNLP for years in hard to track down circumstances: #1296 #1229 #1169 f99b5ab
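
To make the surrogate pair item concrete: a Java String can contain one half of a surrogate pair with no partner, and code that assumes well-formed pairs can fail on it. The sketch below locates such a code unit. It is a hypothetical helper for illustration, not the tokenizer's actual fix.

```java
final class SurrogateScanSketch {
    /** Returns the index of the first unpaired surrogate code unit, or -1 if there is none. */
    static int firstUnpairedSurrogate(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
                    i++;  // well-formed pair: skip its low half
                } else {
                    return i;  // high half with no low half after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return i;  // low half with no high half before it
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstUnpairedSurrogate("ok \uD83D\uDE00"));
        System.out.println(firstUnpairedSurrogate("bad \uD83D!"));
    }
}
```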

v4.5.0

22 Jul 23:21

CoreNLP 4.5.0

Main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex.

  • All PTB and German tokens normalized now in PTBLexer (previously only German umlauts).
    This makes the tokenizer 2% slower, but should avoid issues with resume' for example
    d46fecd

  • log4j removed entirely from public CoreNLP (internal "research" branch still has a use)
    f05cb54

  • Fix NumberFormatException showing up in NER models: #547 5ee2c39

  • Fix "seconds" in the lemmatizer: e7a073b

  • Fix double escaping of & in the online demos: 8413fa1

  • Report the cause of an error if "tregex" is asked for but no parse annotator is added: 4db80c0

  • Merge ssplit and cleanxml into the tokenize annotator (done in a backwards compatible manner): #1259

  • Custom tregex pattern, ROOT tregex pattern, and tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: #1263

  • Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: 3c40ba3 58a2288 8b97d64

  • Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, invisible commas 9476a8e 6193934 afb1ea8 7c84960

  • Significant lemmatizer improvements: adjectives & adverbs, along with some various other special cases #1266

  • Include graph & semgrex indices in the results for a semgrex query (will make the results more usable) 45b47e2

  • Trim words in the NER training process. Spaces can still be inside a word, but random whitespace won't ruin the performance of the models 0d9e9c8

  • Fix NBSP in the Chinese segmenter stanfordnlp/stanza#1052 #1279

v4.4.0

25 Jan 11:49
04408ad

Enhancements

  • added -preTokenized option which will assume text should be tokenized on white space and sentence split on newline

  • tsurgeon CLI - python side added to stanza
    #1240

  • sutime WORKDAY definition
    0dfb118
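
The behavior described for -preTokenized (tokens split on white space, sentences split on newline) can be pictured with the plain-Java sketch below. It does not call CoreNLP; the class and method names are hypothetical, and skipping blank lines is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PreTokenizedSketch {
    /** Splits text into sentences (one per line) and tokens (separated by white space). */
    static List<List<String>> split(String text) {
        List<List<String>> sentences = new ArrayList<List<String>>();
        for (String line : text.split("\\R")) {  // \R matches \n, \r\n, and other line breaks
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;  // assumption: blank lines produce no sentence
            }
            sentences.add(Arrays.asList(trimmed.split("\\s+")));
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split("Hello world .\nSecond sentence here ."));
    }
}
```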

Fixes

  • rebuilt Italian dependency parser using CoreNLP predicted tags

  • XML security issue:
    #1241

  • NER server security issue:
    5ee097d

  • fix infinite loop in tregex:
    #1238

  • json utf-8 output on windows
    #1231
    stanfordnlp/stanza#894

  • fix openie crash in certain unusual graphs
    #1230
    #1082

  • fix nondeterministic results in certain SemanticGraph structures
    #1228
    cc806f2

  • workaround for NLTK sending % unescaped to the server
    #1226
    20fe1e9

  • make TimingTest function on Windows
    4aafb84

v4.3.2

18 Nov 22:42
@J38

Fixes

  • fix issues with default Italian pipeline