Skip to content

Latest commit

 

History

History
441 lines (367 loc) · 18.6 KB

CHANGELOG.md

File metadata and controls

441 lines (367 loc) · 18.6 KB

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

Unreleased

Under data/

Changed

  • Fixes table alignment. (#539)
  • Repeats big scrape after #523. (#536)
  • Fixes excessive line wrapping. (#529)
  • Big scrape for 2024. (#514)

Under src/ and elsewhere

  • Upgrades black for Dependabot. (#530)
  • Removes Min Nan (nan) custom selector. (#529)

Added

  • Remove the case-folding attributes for the big scrape. (#469)

Changed.

  • Removed the case-folding test for the big scrape. (#469)

[1.3.1] - 2024-03-02

Under data/

Added

  • Added KPI computation to generate_summary.py. (#465)
  • Added "ː"-suffixed characters to list of valid IPAs. (#497)
  • Added Bengali (ben) phonelist. (#526)

Changed

  • Updated JSON to introduce Bengali dialect (Rarh and Dhaka). (#526)
  • Updated Maltese (mlt) phonelist. (#517)
  • Fixed path bug in generate_summary.py. (#517)
  • Fixed CLI arg bug in list_phones.py. (#516)
  • Big scrape for 2023. (#512)
  • Moved IPAs of words with tildes to multiple lines. (#379)
  • Caught iso639.language.LanguageNotFoundError error in codes.py. (#498)
  • Renamed the two TSV summaries to summary.tsv. (#494)
  • Renamed generate_tsv_summary.py to generate_summary.py. (#492)
  • Upstream cleaning wrt English tie bar. (#491)
  • Upstream cleaning wrt English high vowel and schwa. (#493)
  • Fixed Georgian (kat) phones and rescrapes. (#488)

Under src/ and elsewhere

Added

  • Added not-already-mentioned language names. (#478)

Fixed

  • Fixed dialect selector. (#513)

[1.3.0] - 2022-11-28

Under data/

Added

  • Big scrape for 2023. (#512)
  • Moved IPAs of words with tildes to multiple lines. (#379)
  • Caught iso639.language.LanguageNotFoundError error in codes.py. (#498)
  • Added KPI computation to generate_summary.py. (#465)
  • Added "ː"-suffixed characters to list of valid IPAs. (#497)
  • Renamed the two TSV summaries to summary.tsv. (#494)
  • Renamed generate_tsv_summary.py to generate_summary.py. (#492)
  • Upstream cleaning wrt English tie bar. (#491)
  • Upstream cleaning wrt English high vowel and schwa. (#493)
  • Fixed Georgian (kat) phones and rescrapes. (#488)
  • Big scrape for 2022. (#464)
  • Added the --fresh flag to data/scrape/scrape.py to facilitate running the big scrape in batches. (#464)
  • Added the --exclude flag for excluding one or more languages in data/scrape/scrape.py. (#460)
  • Added data/src/normalize.py. (#356)
  • Updated README.md. (#360)
  • Added data/cg/tsv/geo.tsv. (#367)
  • Added data/morphology. (#369)
  • Added SIGMORPHON 2021 morphology data. (#375)
  • Added data/cg/tsv/jpn_hira.tsv. (#384)
  • Enforced final newlines. (#387)
  • Adds all UniMorph languages to morphology. (#393)
  • Added data/covering_grammar/tsv/fre_latn_phonemic.tsv (#398)
  • Added data/covering_grammar/lib/make_test_file.py (#396, #399)
  • Added Komi-Zyrian (kpv). (#400)
  • Added Makasar (mak). (#415, #419)
  • Added Zou (zom). (#421)
  • Added Wiyot (wiy). (#422)
  • Added Sidamo (sid). (#423)
  • Added Central Atlas Tamazight (tzm). (#429)
  • Added Chibcha (chb). (#430)
  • Added Kashmiri (kas). (#431)
  • Added Malayalam (mal). (#434)
  • Added Dhivehi (div). (#437)
  • Added Akkadian (akk). (#441)
  • Added Central Nahuatl (nhn). (#443)
  • Added Etruscan (ett). (#444)
  • Added Gujarati (guj). (#445)
  • Added Kannada (kan). (#446)
  • Added Karelian (krl). (#447)
  • Added Romagnol (rgn). (#448)
  • Added Southern Yukaghir (yux). (#449)
  • Added Urak Lawoi' (urk). (#451)
  • Added Hausa (ha). (#452)
  • Added Kashubian (csb). (#453)
  • Added Tabaru (tby). (#455)
  • Added West Makian (mqs). (#457)
  • Added Amharic (amh). (#458)
  • Added Livvi (olo). (#459)
  • Added Kalmyk (xal). (#472)
  • Added Ternate (tft). (#473)
  • Added Abkhaz (abk). (#474)
  • Added Farefare (gur). (#475)
  • Added Iban (iba). (#476)
  • Added Laz (lzz). (#477)

Changed

  • Switched to ISO 639-3 language codes. (#468)
  • Updated scraped data in preparation for the SIGMORPHON 2022 shared task: swe nno ger dut ita rum ukr bel tgl ceb ben asm per pus tha lwl. (#461)
  • Made scripts under data/frequencies/ and data/morphology/ more flexible, especially for the purposes of preparing data for a shared task. (#461)
  • Fixed the --restriction flag for specifying multiple languages in data/scrape/scrape.py. (#460)
  • Added covering grammar coverage error log and specified error_type in error_analysis.py. (#424)
  • Added error log writing in error_analysis.py. (#420)
  • Added new columns in summary tables. (#365)
  • Fixed broken paths in data/src/generate_phones_summary.py and in data/phones/HOWTO.md. (#352)
  • Added Atong (India) (aot). (#353)
  • Added Egyptian Arabic (arz). (#354)
  • Added Lolopo (ycl). (#355)
  • Fixed Unicode normalization in data/phones/slv_phonemic.phones and re-scraped Slovenian data. (#356)
  • Updated data/phones/HOWTO.md to include instructions on applying the NFC Unicode normalization (#357)
  • Updated data/src/normalize.py to be more efficient. (#358)
  • Fixed inaccuracies in data/phones/geo_phonemic.phones. (#367)
  • Fixed typo in data/cg/tsv/geo.tsv and added missing character. (#370)
  • Morphology URLs are now provided as a list. (#376)
  • Configured and scraped Yamphu (ybi). (#380)
  • Configured and scraped Khumi Chin (cnk). (#381)
  • Made summary generation in common_characters.py optional. (#382)
  • Fixed phone counting in data/src/generate_phones_summary.py (#390, #392)
  • Reorganizes scraping scripts under data/scrape (#394)
  • Reorganizes .phones files and related scripts under data/phones (#395)
  • Reorganizes CG files and related scripts under data/covering_grammar (#395)
  • Reorganized data/phones/phones/fre_phonemic.phones (#398)
  • Removed data/src/ (#401)
  • Renamed TSV files and phonelists to use the terms "broad"/"narrow" instead of "phonemic"/"phonetic" (#389, #402, #405)
  • Fixed typo in README.md (#407)
  • Fixed column ordering of the test file read by the script in data/covering_grammar/lib/error_analysis.py (#411)
  • Fixed Common character collection in common_characters.py (#419)
  • Scraping test fixed for blt. (#436)
  • Changed URLs to point at CUNY-CL repo, where applicable. (#438)

Under src/ and elsewhere

Added

  • Adds Python 3.12 support. (#520)
  • Temporarily disables Latin testing in lieu of #514. (#519)
  • Fixed dialect selectors for languages other than Latin. (#511)
  • Moved wikipron/ directory under src/ and adjusted package finding. (#508)
  • Added documentation about selecting transcription level. (#502)
  • Added ckb in languagecodes.py. (#464)
  • Added support for Python 3.10. (#462)
  • Added test of phones list generation in test_data/test_summary.py (#363)
  • Added Min Nan extraction function. (#397)
  • Added Tai Dam extraction function, configuration and initial scrape. (#435)
  • Added test of casefold value for languages in data/scrape/lib/languages.json (#442)
  • Added support for Python 3.11. (#479)
  • Added checks for the Python source distribution and wheel on CI. (#479)
  • Turned on tests for Windows on CI. (#479)

Removed

  • Dropped support for Python 3.6. (#462)
  • Dropped support for Python 3.7. (#479)

Changed

  • Fixed missing logging for proto-languages. (#505)
  • Switched to ISO 639-3 language codes. (#468)
  • Converted setup.py to pyproject.toml. (#479)

[1.2.0] - 2021-01-30

Under data/

Added

  • Added generate_phones_summary.py, generating ./phones/README.md and ./phones/phones_summary.tsv. (#344)
  • Added Afrikaans whitelists, filtered TSV file, rescraped phonemic and phonetic TSV files. (#311)
  • Added German whitelists and filtered TSV file. (#285)
  • Added whitelisting capabilities to postprocess. (#152)
  • Added whitelists for Dutch, English, Greek, Latin, Korean, and Spanish. (#158, etc.)
  • Logged dialect configuration if specified. (#133)
  • Added typing to big scrape code. (#140)
  • Added argparse to allow limiting 'big scrape' to individual languages with --restriction flag. (#154)
  • Added Manchu (mnc). (#185)
  • Added Polabian (pox). (#186)
  • Added aar, bdq, jje, and lsi. (#202)
  • Added tyv to languagecodes.py (#203, #205)
  • Added bcl, egl, izh, ltg, azg, kir and mga to languagecodes.py. (#205)
  • Added nep to languagecodes.py. (#206)
  • Added Ingrian (izh). (#215)
  • Added French phoneme list and filtered TSV file. (#213, #217)
  • Added Corsican (cos). (#222)
  • Added Middle Korean (okm). (#223)
  • Added Middle Irish (mga). (#224)
  • Added Old Portuguese (opt). (#225)
  • Added Serbo-Croatian phoneme list and filtered TSV files. (#227)
  • Added Tuvan (tyv). (#228)
  • Added Shan (shn) with custom extraction. (#229)
  • Added Northern Kurdish (kmr). (#243)
  • Added a script to facilitate the creation of a .phones file. (#246)
  • Added IPA validity checks for phonemes. (#248)
  • Split multiple pronunciations joined by tilde in eng_us_phonetic.
  • Added Italian phoneme list and filtered TSV file. (#260, #261)
  • Added Adyghe phone list and filtered TSV file. (#262, #263)
  • Added Bulgarian phoneme list and filtered TSV file. (#264, #267)
  • Added Icelandic phoneme list and filtered TSV file. (#269, #270)
  • Added Slovenian phoneme list and filtered TSV file. (#271, #273)
  • Added normalization to list_phones.py. Corrected errors relating to ipapy (#275)
  • Added Welsh .phones lists and filtered TSV files. (#274, #276)
  • Added draft of covering grammar script. (#297)
  • Updated data/phones/README.md with instructions to re-scrape. (#279, #281)
  • Added Vietnamese .phones files and re-scraped and filtered .tsv files. (#278, #283)
  • Added Hindi .phones files and the re-scraped and filtered .tsv files. (#282, #284)
  • Added Old Frisian (ofs). (#294)
  • Added Dungan (dng). (#293)
  • Added Latgalian (ltg). (#296)
  • Added draft of covering grammar script. (#297)
  • Added Portuguese .phones files and re-scraped data. (#290, #304)
  • Added Japanese .phones files and re-scraped data. (#230, #307)
  • Added Moksha (mdf). (#295)
  • Added Azerbaijani .phones files and re-scraped data. (#306, #312)
  • Added Turkish .phones file and re-scraped data. (#313, #314)
  • Added Maltese .phones file and re-scraped data. (#317, #318)
  • Added Latvian .phones file and re-scraped data. (#321, #322)
  • Added Khmer .phones file and re-scraped data. (#324, #327)
  • Added Østnorsk (Bokmål) .phones file and re-scraped data. (#324, #327)
  • Added SIGMORPHON 2021 frequencies JSON. (#332)
  • Several languages added to languagecodes.py. (#334)
  • Configured scripts for Kazakh (kaz). (#345)
  • Added Easten Lawa (lwl). (#346)
  • Configuration for Western Lawa (lcp). (#347)
  • Added Nyahkur (cbn). (#348)
  • Split Tagalog (tgl) scripts into Latin and Baybayin, rescraped. (#351)

Changed

  • Changed the name of the existing ./phones/README.md to ./phones/HOWTO.md. (#344)
  • Edited the name of generate_summary.py to generate_tsv_summary.py.(#344)
  • Edited the output file name of generate_tsv_summary.py to tsv_summary.tsv.(#344)
  • Edited the arm_e_phonetic.phones and arm_w_phonetic.phones files. (#298)
  • Improved printing in the README table. (#145)
  • Renamed data directory data. (#147)
  • Split may into Latin and Arabic files. (#164)
  • Split pan into Gurmukhi and Shahmukhī. (#169)
  • Split uig into Perso-Arabic and Cyrillic. (#173)
  • Only allowed Latin spellings in Maltese lexicon. (#166).
  • Split mon into Cyrillic and Mongol Bichig (#179).
  • Merged whitelist.py into 'big scrape' script. src scrape.py now checks for existence of whitelist file during scrape to create second filtered TSV. New TSV placed under tsv/\*\_filtered.tsv. (#154).
  • Updated generate_summary.py to reflect presence of 'filtered' tsv. (#154)
  • Imperial Aramaic (arc) split into three scripts properly. (#187)
  • Flattened data directory structure. (#194)
  • Updated Georgian (geo) to take advantage of upstream bot-based consistency fixes. (#138)
  • Split arm into Eastern and Western dialects. (#197)
  • Rescraped files with new whitelists. (#199)
  • Updated logging statements for consistency. (#196)
  • Renamed .whitelist file extension name as .phones. (#207)
  • Split ban into Latin and Balinese scripts. (#214)
  • Split kir into Cyrillic and Arabic. (#216)
  • Split Latin (lat) into its dialects. (#233)
  • Added MyPy coverage for wikipron, tests and data directories. (#247)
  • Modified paths in codes.py, scrape.py, and split.py. (#251, #256)
  • Modified config flags in languages.json and scrape.py. (#258)
  • Edited Serbo-Croatian .phones file to list all vowel/pitch accent combinations. Re-scraped Serbo-Croatian data. (#288)
  • Moved list_phones.py to parent directory. (#265, #266)
  • Moved list_phones.py to src directory. (#297)
  • Frequencies code no longer overwrites TSV files. (#320)
  • Updated data/phones/README.md to specify that .phones files should be in NFC normalization form. (#333)
  • Kurdish (kur) and Opata (opt) removed from languages.json. (#334)
  • Re-scraped Armenian data. Fixed an error in West Armenian phone list. (#338)

Fixed

  • Fixed path issue with phonetic whitelisted files. (#195)

Under wikipron/ and elsewhere

Added

  • Added positive flags for stress, syllable boundaries, tones, segment to cli.py. (#141)
  • Added positive flags for space skipping to cli.py. (#257)
  • Added two Vietnamese dialects to languages.json. (#139)
  • Handled additional language codes. (#132, #148)
  • Added --no-skip-spaces-word and --no-skip-spaces-pron flag. (#135)
  • Allowed ASCII apostrophes (0x27) in spellings. (#172).
  • Added Vietnamese extraction function. (#181).
  • Modified pron selector in Latin extraction function. (#183).
  • Added --no-tone flag. (#188)
  • Customized extractor and new scraped prons for khb. (#219)
  • Added tests/test_data directory containing two tests. (#226, #251)
  • Added HTTP User-Agent header to API calls to Wiktionary. (#234)
  • Added support for python 3.9 (#240)
  • Added black style formatting to .circleci/config.yml. (#242)
  • Added logging for scraping a language with --dialect specified that requires its custom extraction logic. (#245)
  • Improved CircleCI workflow with orbs. (#249)
  • Added test_split.py to tests/test_data. (#256)
  • Handled Cantonese for scraping. (#277)
  • Added exclusion for reconstructions. (#302)
  • Added Vietnamese contour tone grouping test in tests/test_config.py (#308)
  • Added restart functionality. (#340)
  • Added very basic API for script detection. (#341)
  • Added --skip-parens and --no-skip-parens flags. (#343)

Changed

  • Renamed arguments to positive statements in wikipron/config.py and edited _get_process_pron function accordingly. (#141, #257)
  • Changed testing values used in tests/test_config.py in order to accomodate the addition of positive flags. (#141)
  • Specified UTF-8 encoding in handling text files. (#221)
  • Moved previous contents of tests into tests/test_wikipron (#226)
  • Updated the packages version numbers in requirements.txt to their latest according to PyPI (#239)
  • Updated the default pron selector to also look for IPA strings under paragraphs in addition to list items. (#295)
  • Updated segments package version to 2.2.0 (#308)

Removed

  • Moved Wiktionary querying functions from test_languagecodes.py to codes.py (#205)

[1.1.0] - 2020-03-03

Added

  • Added the extraction function for Mandarin Chinese and its scraped data. (#124)
  • Integrated Wortschatz frequencies. (#122)

Changed

  • Updated the Japanese extraction function and Japanese data. (#129)
  • Updated all scraped Wiktionary data and frequency data. (#127, #128)
  • Generalized the splitting script in the big scrape. (#123)
  • Moved small file removal to generate_summary.py. (#119)
  • Updated Russian data. (#115)

Fixed

  • Avoided and logged error in case of pron processing failure. (#130)

[1.0.0] - 2019-11-29

Added

  • Handled Japanese. (#109, #114)
  • Handled Latin, for which the actual graphemes cannot be the Wiktionary page titles and have to come from within the page. (#92, #93)
  • Handled Thai, whose pronunciations are embedded in HTML tables. (#90)
  • Handled Khmer, whose pronunciations are embedded in HTML tables. (#88)
  • IPA segmentation using spaces by default, with the --no-segment flag to optionally turn it off. (#69, #79, #83, #89, #100)
  • Added TSV files for all Wiktionary languages with over 100 entries. (#61, #76, #95, #97, #103, #104)
  • Resolved Wiktionary language names for languages with at least 100 pronunciation entries. (#52, #55)

Changed

  • Removed duplicate <word, pronunciation> pairs in the persisted data. (#85, #111, #116)
  • Split Welsh into Northern Wales and Southern dialects in the persisted data. (#110)
  • Factored out casefolding. (#102)
  • Split Serbo-Croatian into Cyrillic and Latin TSVs. (#96)
  • Generalized word and pronunciation extraction. (#88)

Removed

  • Removed the timeout in smoke tests. (#107)
  • Removed the output option. (#82)
  • Removed the require_dialect_label option. (#77)

Fixed

  • Skipped pronunciations with a dash. (#106)
  • Skipped empty pronunciations in scraping. (#59)
  • Updated the <li> XPath selector for an optional layer of <span> to cover previously unhandled languages (e.g., Korean). (#50)
  • Updated the <li> XPath selector for title="wikipedia:<language> phonology" to cover previously unhandled languages (e.g., Estonian and Slovak). (#49)

Security

  • Avoided using exec to retrieve the version string. Used pkg_resources instead. (#63)

[0.1.1] - 2019-08-14

Fixed

  • Fixed import bug. (#45)

[0.1.0] - 2019-08-14

First release.