Releases: allenai/dolma

v1.0.3

10 Apr 21:10
6673ad3

What's Changed

  • Fix local shuffling failure by @soldni in #140
  • Fix issue in getting started tutorial using wikipedia data by @RohitRathore1 in #117
  • Add an option to improve tokenization shuffling by @soldni in #141
  • Optionally add total/sum to output of analyzer by @soldni in #144
  • Add extra tests for multi-byte unicode spans in deduper. by @soldni in #145
  • Bump s3 client lib and parameterize region in s3 tests + devcontainer by @undfined in #147

Full Changelog: v1.0.2...v1.0.3

v1.0.2

21 Mar 16:45
4e1d17f

What's Changed

  • Taggers for URL filtering by @soldni in #112
  • Updated CFF and Bibtex by @soldni in #118
  • Add preliminary Dolma v1.7 configurations, fix corner case in tokens. by @soldni in #120
  • Update CITATION.cff by @soldni in #126
  • Option to use ngram overlap to dedupe paragraphs by @rodneykinney in #122
  • Tagger modules import (fix for #128) by @soldni in #129
  • Added Support for JQ syntax in include/exclude mixer config by @soldni in #131
  • Added JQ syntax for replacements + added minimum score. by @soldni in #133
  • Bump the cargo group group with 1 update by @dependabot in #132
  • Improves tool to compute statistics; adds deduplication options. by @soldni in #135
  • use precompiled regex when loading url blocklists by @peterbjorgensen in #137

Full Changelog: v1.0.1...v1.0.2

v1.0.1

07 Feb 18:32
f6970d5

Full Changelog: v1.0.0...v1.0.1

v1.0.0

01 Feb 08:42
a74b78a

Full Changelog: v0.9.4...v1.0.0

v0.9.4

21 Jan 03:34
a44489f

What's Changed

  • Bump h2 from 0.3.20 to 0.3.24 by @dependabot in #101
  • BOS/EOS/PAD options in tokens cli; speed up tokenization by segmenting paragraphs. by @soldni in #102
  • Fixed Dangling CLI Options; E2E Tokenizer Tests by @soldni in #103

Full Changelog: v0.9.2...v0.9.4

v0.9.2

17 Jan 05:43
ede739f

Full Changelog: v0.9.1...v0.9.2

v0.9.1

26 Oct 04:55
2ee1ae2

Full Changelog: v0.9.0...v0.9.1

v0.9.0

15 Oct 19:26
1728f4f

Full Changelog: v0.8.0...v0.9.0

v0.8.0

18 Aug 13:27
705d358

What's Changed

  • Analyzer to save and plot taggers distribution by @soldni in #21
  • Scripts to compute statistics by @soldni in #22

Full Changelog: v0.7.0...v0.8.0

v0.7.0

21 Jul 14:49
a37e7c6

What's Changed

  • CLI improvements, remove need of experiment name by @soldni in #20

Full Changelog: v0.6.5...v0.7.0