Releases: CUNY-CL/wikipron

[1.3.1] - 2024-03-02

02 Mar 23:42
b122b18

Under data/

  • Updated Maltese (mlt) phonelist. (#517)
  • Fixed path bug in generate_summary.py. (#517)
  • Fixed CLI arg bug in list_phones.py. (#516)
  • Big scrape for 2023. (#512)
  • Moved IPAs of words with tildes to multiple lines. (#379)
  • Caught iso639.language.LanguageNotFoundError error in codes.py. (#498)
  • Added KPI computation to generate_summary.py. (#465)
  • Added "ː"-suffixed characters to list of valid IPAs. (#497)
  • Renamed the two TSV summaries to summary.tsv. (#494)
  • Renamed generate_tsv_summary.py to generate_summary.py. (#492)
  • Upstream cleaning wrt English tie bar. (#491)
  • Upstream cleaning wrt English high vowel and schwa. (#493)
  • Fixed Georgian (kat) phones and rescrapes. (#488)
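
The tilde item above (#379) moves pronunciations that were joined by a tilde onto separate lines. A minimal sketch of that kind of split, assuming two-column TSV rows (word, TAB, space-separated pronunciation) and an ASCII "~" as the joiner. The function name split_tilde_row is hypothetical, and this is not the project's actual code:

```python
def split_tilde_row(line: str) -> list[str]:
    """Split one TSV row into one row per tilde-separated pronunciation.

    Assumption: the row is "word<TAB>pron" and alternatives are joined
    by an ASCII "~". The combining tilde used for nasalization (U+0303)
    is a different character, so nasalized segments are left alone.
    """
    word, pron = line.rstrip("\n").split("\t", 1)
    variants = [variant.strip() for variant in pron.split("~")]
    # Drop empty pieces that a leading or trailing tilde would produce.
    return [f"{word}\t{variant}" for variant in variants if variant]


if __name__ == "__main__":
    for row in split_tilde_row("either\ti ð ə ~ a ɪ ð ə"):
        print(row)
```

A row without a tilde comes back unchanged as a single-element list, so the function can be applied to every line of a file.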

Under wikipron/ and elsewhere

  • Added not-already-mentioned language names. (#478)
  • Fixed dialect selector. (#513)

[1.3.0] - 2022-11-28

28 Nov 19:09
f1984b6

Under data/

Added

  • Big scrape for 2022. (#464)
  • Added the --fresh flag to data/scrape/scrape.py to facilitate running the big scrape in batches. (#464)
  • Added the --exclude flag for excluding one or more languages in data/scrape/scrape.py. (#460)
  • Added data/src/normalize.py. (#356)
  • Updated README.md. (#360)
  • Added data/cg/tsv/geo.tsv. (#367)
  • Added data/morphology. (#369)
  • Added SIGMORPHON 2021 morphology data. (#375)
  • Added data/cg/tsv/jpn_hira.tsv. (#384)
  • Enforced final newlines. (#387)
  • Added all UniMorph languages to morphology. (#393)
  • Added data/covering_grammar/tsv/fre_latn_phonemic.tsv. (#398)
  • Added data/covering_grammar/lib/make_test_file.py. (#396, #399)
  • Added Komi-Zyrian (kpv). (#400)
  • Added Makasar (mak). (#415, #419)
  • Added Zou (zom). (#421)
  • Added Wiyot (wiy). (#422)
  • Added Sidamo (sid). (#423)
  • Added Central Atlas Tamazight (tzm). (#429)
  • Added Chibcha (chb). (#430)
  • Added Kashmiri (kas). (#431)
  • Added Malayalam (mal). (#434)
  • Added Dhivehi (div). (#437)
  • Added Akkadian (akk). (#441)
  • Added Central Nahuatl (nhn). (#443)
  • Added Etruscan (ett). (#444)
  • Added Gujarati (guj). (#445)
  • Added Kannada (kan). (#446)
  • Added Karelian (krl). (#447)
  • Added Romagnol (rgn). (#448)
  • Added Southern Yukaghir (yux). (#449)
  • Added Urak Lawoi' (urk). (#451)
  • Added Hausa (ha). (#452)
  • Added Kashubian (csb). (#453)
  • Added Tabaru (tby). (#455)
  • Added West Makian (mqs). (#457)
  • Added Amharic (amh). (#458)
  • Added Livvi (olo). (#459)
  • Added Kalmyk (xal). (#472)
  • Added Ternate (tft). (#473)
  • Added Abkhaz (abk). (#474)
  • Added Farefare (gur). (#475)
  • Added Iban (iba). (#476)
  • Added Laz (lzz). (#477)
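
The --exclude flag listed above (#460) and the existing --restriction flag both narrow the set of languages a batch run touches. How data/scrape/scrape.py parses them is not spelled out in these notes, so the following is only a sketch under assumptions: both flags are taken to accept one or more space-separated language codes, and build_parser and select_languages are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the argument shapes are assumptions."""
    parser = argparse.ArgumentParser(description="Batch scrape sketch.")
    parser.add_argument(
        "--restriction", nargs="+", default=[],
        help="Only process these language codes.",
    )
    parser.add_argument(
        "--exclude", nargs="+", default=[],
        help="Skip these language codes.",
    )
    return parser


def select_languages(all_codes, restriction, exclude):
    """Apply the restriction first (if any), then drop excluded codes."""
    selected = [c for c in all_codes if not restriction or c in restriction]
    return [c for c in selected if c not in exclude]


if __name__ == "__main__":
    args = build_parser().parse_args(["--exclude", "eng", "fra"])
    print(select_languages(["eng", "fra", "mlt"], args.restriction, args.exclude))
```

Applying the restriction before the exclusion means a code named in both lists is skipped, which seems the safer reading when the two conflict.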

Changed

  • Switched to ISO 639-3 language codes. (#468)
  • Updated scraped data in preparation for the SIGMORPHON 2022 shared task:
    swe nno ger dut ita rum ukr bel tgl ceb ben asm per pus tha lwl. (#461)
  • Made scripts under data/frequencies/ and data/morphology/ more flexible,
    especially for the purposes of preparing data for a shared task. (#461)
  • Fixed the --restriction flag for specifying multiple languages in data/scrape/scrape.py. (#460)
  • Added covering grammar coverage error log and specified error_type in error_analysis.py. (#424)
  • Added error log writing in error_analysis.py. (#420)
  • Added new columns in summary tables. (#365)
  • Fixed broken paths in data/src/generate_phones_summary.py and in
    data/phones/HOWTO.md. (#352)
  • Added Atong (India) (aot). (#353)
  • Added Egyptian Arabic (arz). (#354)
  • Added Lolopo (ycl). (#355)
  • Fixed Unicode normalization in data/phones/slv_phonemic.phones and
    re-scraped Slovenian data. (#356)
  • Updated data/phones/HOWTO.md to include instructions on applying the
    NFC Unicode normalization. (#357)
  • Updated data/src/normalize.py to be more efficient. (#358)
  • Fixed inaccuracies in data/phones/geo_phonemic.phones. (#367)
  • Fixed typo in data/cg/tsv/geo.tsv and added missing character. (#370)
  • Morphology URLs are now provided as a list. (#376)
  • Configured and scraped Yamphu (ybi). (#380)
  • Configured and scraped Khumi Chin (cnk). (#381)
  • Made summary generation in common_characters.py optional. (#382)
  • Fixed phone counting in data/src/generate_phones_summary.py. (#390, #392)
  • Reorganized scraping scripts under data/scrape. (#394)
  • Reorganized .phones files and related scripts under data/phones. (#395)
  • Reorganized CG files and related scripts under data/covering_grammar. (#395)
  • Reorganized data/phones/phones/fre_phonemic.phones. (#398)
  • Removed data/src/. (#401)
  • Renamed TSV files and phonelists to use the terms "broad"/"narrow" instead
    of "phonemic"/"phonetic". (#389, #402, #405)
  • Fixed typo in README.md. (#407)
  • Fixed column ordering of the test file read by the script in
    data/covering_grammar/lib/error_analysis.py. (#411)
  • Fixed common character collection in common_characters.py. (#419)
  • Scraping test fixed for blt. (#436)
  • Changed URLs to point at CUNY-CL repo, where applicable. (#438)
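
Several items above concern NFC Unicode normalization of .phones files (#356, #357). As a minimal illustration of what NFC does, using only Python's standard unicodedata module (the helper names to_nfc and is_nfc are hypothetical and not taken from data/src/normalize.py):

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in NFC (canonical composition) form."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """Report whether text is already in NFC form (Python 3.8+)."""
    return unicodedata.is_normalized("NFC", text)


if __name__ == "__main__":
    decomposed = "e\u0301"  # "e" followed by a combining acute accent
    composed = to_nfc(decomposed)
    # The two code points compose into the single code point U+00E9.
    print(len(decomposed), len(composed), is_nfc(decomposed))
```

Two strings that render identically can differ in their code points, which is why a phone list and the scraped data need to share one normalization form before segments are compared.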

Under wikipron/ and elsewhere

Added

  • Added ckb in languagecodes.py. (#464)
  • Added support for Python 3.10. (#462)
  • Added test of phones list generation in test_data/test_summary.py. (#363)
  • Added Min Nan extraction function. (#397)
  • Added Tai Dam extraction function, configuration and initial scrape. (#435)
  • Added test of casefold value for languages in data/scrape/lib/languages.json. (#442)
  • Added support for Python 3.11. (#479)
  • Added checks for the Python source distribution and wheel on CI. (#479)
  • Turned on tests for Windows on CI. (#479)

Removed

  • Dropped support for Python 3.6. (#462)
  • Dropped support for Python 3.7. (#479)

Changed

  • Switched to ISO 639-3 language codes. (#468)
  • Converted setup.py to pyproject.toml. (#479)

[1.2.0] - 2021-01-30

30 Jan 20:59
a586f58

Under data/

Added

  • Added Afrikaans whitelists, filtered TSV file, rescraped phonemic and phonetic TSV files. (#311)
  • Added German whitelists and filtered TSV file. (#285)
  • Added whitelisting capabilities to postprocess. (#152)
  • Added whitelists for Dutch, English, Greek, Latin, Korean, and Spanish.
    (#158, etc.)
  • Logged dialect configuration if specified. (#133)
  • Added typing to big scrape code. (#140)
  • Added argparse to allow limiting 'big scrape' to individual languages
    with --restriction flag. (#154)
  • Added Manchu (mnc). (#185)
  • Added Polabian (pox). (#186)
  • Added aar, bdq, jje, and lsi. (#202)
  • Added tyv to languagecodes.py. (#203, #205)
  • Added bcl, egl, izh, ltg, azg, kir and mga to languagecodes.py. (#205)
  • Added nep to languagecodes.py. (#206)
  • Added Ingrian (izh). (#215)
  • Added French phoneme list and filtered TSV file. (#213, #217)
  • Added Corsican (cos). (#222)
  • Added Middle Korean (okm). (#223)
  • Added Middle Irish (mga). (#224)
  • Added Old Portuguese (opt). (#225)
  • Added Serbo-Croatian phoneme list and filtered TSV files. (#227)
  • Added Tuvan (tyv). (#228)
  • Added Shan (shn) with custom extraction. (#229)
  • Added Northern Kurdish (kmr). (#243)
  • Added a script to facilitate the creation of a .phones file. (#246)
  • Added IPA validity checks for phonemes. (#248)
  • Split multiple pronunciations joined by tilde in eng_us_phonetic.
  • Added Italian phoneme list and filtered TSV file. (#260, #261)
  • Added Adyghe phone list and filtered TSV file. (#262, #263)
  • Added Bulgarian phoneme list and filtered TSV file. (#264, #267)
  • Added Icelandic phoneme list and filtered TSV file. (#269, #270)
  • Added Slovenian phoneme list and filtered TSV file. (#271, #273)
  • Added normalization to list_phones.py. Corrected errors relating to
    ipapy. (#275)
  • Added Welsh .phones lists and filtered TSV files. (#274, #276)
  • Added draft of covering grammar script. (#297)
  • Updated data/phones/README.md with instructions to re-scrape. (#279, #281)
  • Added Vietnamese .phones files and re-scraped and filtered .tsv files.
    (#278, #283)
  • Added Hindi .phones files and the re-scraped and filtered .tsv files.
    (#282, #284)
  • Added Old Frisian (ofs). (#294)
  • Added Dungan (dng). (#293)
  • Added Latgalian (ltg). (#296)
  • Added Portuguese .phones files and re-scraped data. (#290, #304)
  • Added Japanese .phones files and re-scraped data. (#230, #307)
  • Added Moksha (mdf). (#295)
  • Added Azerbaijani .phones files and re-scraped data. (#306, #312)
  • Added Turkish .phones file and re-scraped data. (#313, #314)
  • Added Maltese .phones file and re-scraped data. (#317, #318)
  • Added Latvian .phones file and re-scraped data. (#321, #322)
  • Added Khmer .phones file and re-scraped data. (#324, #327)
  • Added Østnorsk (Bokmål) .phones file and re-scraped data. (#324, #327)
  • Several languages added to languagecodes.py. (#334)

Changed

  • Edited the arm_e_phonetic.phones and arm_w_phonetic.phones files. (#298)
  • Improved printing in the README table. (#145)
  • Renamed the data directory to data. (#147)
  • Split may into Latin and Arabic files. (#164)
  • Split pan into Gurmukhi and Shahmukhī. (#169)
  • Split uig into Perso-Arabic and Cyrillic. (#173)
  • Only allowed Latin spellings in Maltese lexicon. (#166)
  • Split mon into Cyrillic and Mongol Bichig. (#179)
  • Merged whitelist.py into 'big scrape' script. src scrape.py now checks for
    existence of whitelist file during scrape to create second filtered TSV.
    New TSV placed under tsv/*_filtered.tsv. (#154)
  • Updated generate_summary.py to reflect presence of 'filtered' tsv. (#154)
  • Imperial Aramaic (arc) split into three scripts properly. (#187)
  • Flattened data directory structure. (#194)
  • Updated Georgian (geo) to take advantage of upstream bot-based
    consistency fixes. (#138)
  • Split arm into Eastern and Western dialects. (#197)
  • Rescraped files with new whitelists. (#199)
  • Updated logging statements for consistency. (#196)
  • Renamed .whitelist file extension name as .phones. (#207)
  • Split ban into Latin and Balinese scripts. (#214)
  • Split kir into Cyrillic and Arabic. (#216)
  • Split Latin (lat) into its dialects. (#233)
  • Added MyPy coverage for wikipron, tests and data directories. (#247)
  • Modified paths in codes.py, scrape.py, and split.py. (#251, #256)
  • Modified config flags in languages.json and scrape.py. (#258)
  • Edited Serbo-Croatian .phones file to list all vowel/pitch accent
    combinations. Re-scraped Serbo-Croatian data. (#288)
  • Moved list_phones.py to parent directory. (#265, #266)
  • Moved list_phones.py to src directory. (#297)
  • Frequencies code no longer overwrites TSV files. (#320)
  • Updated data/phones/README.md to specify that .phones files should be
    in NFC normalization form. (#333)
  • Kurdish (kur) and Opata (opt) removed from languages.json. (#334)
  • Re-scraped Armenian data. Fixed an error in West Armenian phone list.
    (#338)
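
The whitelist (later .phones) items above describe producing a second, filtered TSV that keeps only entries whose pronunciation uses listed phones. A rough sketch of that idea, assuming a phone list with one phone per line and "#" starting a comment, and TSV rows of the form word, TAB, space-separated segments. These file conventions and the names read_phones and filter_rows are assumptions, not the project's code:

```python
from typing import Iterable, Iterator, Set


def read_phones(lines: Iterable[str]) -> Set[str]:
    """Collect allowed phones, ignoring blank lines and '#' comments."""
    phones = set()
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            phones.add(entry)
    return phones


def filter_rows(rows: Iterable[str], phones: Set[str]) -> Iterator[str]:
    """Yield only rows whose every segment appears in the phone list."""
    for row in rows:
        _word, pron = row.rstrip("\n").split("\t", 1)
        if all(segment in phones for segment in pron.split()):
            yield row


if __name__ == "__main__":
    allowed = read_phones(["# vowels", "a", "k  # velar stop", "t"])
    for kept in filter_rows(["kat\tk a t", "kit\tk ɪ t"], allowed):
        print(kept)
```

Rejecting a whole row when any one segment is unlisted keeps the filtered file strictly within the inventory, at the cost of dropping entries that contain a single stray symbol.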

Fixed

  • Fixed path issue with phonetic whitelisted files. (#195)

Under wikipron/ and elsewhere

Added

  • Added positive flags for stress, syllable boundaries, tones, segment to cli.py. (#141)
  • Added positive flags for space skipping to cli.py. (#257)
  • Added two Vietnamese dialects to languages.json. (#139)
  • Handled additional language codes. (#132, #148)
  • Added --no-skip-spaces-word and --no-skip-spaces-pron flag. (#135)
  • Allowed ASCII apostrophes (0x27) in spellings. (#172)
  • Added Vietnamese extraction function. (#181)
  • Modified pron selector in Latin extraction function. (#183)
  • Added --no-tone flag. (#188)
  • Customized extractor and new scraped prons for khb. (#219)
  • Added tests/test_data directory containing two tests. (#226, #251)
  • Added HTTP User-Agent header to API calls to Wiktionary. (#234)
  • Added support for Python 3.9. (#240)
  • Added black style formatting to .circleci/config.yml. (#242)
  • Added logging for scraping a language with --dialect specified
    that requires its custom extraction logic. (#245)
  • Improved CircleCI workflow with orbs. (#249)
  • Added test_split.py to tests/test_data. (#256)
  • Handled Cantonese for scraping. (#277)
  • Added exclusion for reconstructions. (#302)
  • Added Vietnamese contour tone grouping test in tests/test_config.py. (#308)
  • Added restart functionality. (#340)
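
One item above adds an HTTP User-Agent header to Wiktionary API calls (#234). The sketch below shows the general pattern with the standard library only. It builds a request object without sending it. The User-Agent string and the build_request helper are hypothetical, and the query parameters are ordinary MediaWiki categorymembers parameters rather than anything confirmed to match WikiPron's own calls:

```python
import urllib.request
from urllib.parse import urlencode

# Hypothetical identifier; a real client should name itself and give contact info.
USER_AGENT = "example-scraper/0.1 (https://example.org/contact)"


def build_request(category: str) -> urllib.request.Request:
    """Build (but do not send) a category-members query with a User-Agent."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
    }
    url = "https://en.wiktionary.org/w/api.php?" + urlencode(params)
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    request = build_request("Category:English terms with IPA pronunciation")
    print(request.full_url)
    # urllib stores header names capitalized as "User-agent".
    print(request.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen would perform the call; that step is left out here so the example runs without network access.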

Changed

  • Renamed arguments to positive statements in wikipron/config.py and edited _get_process_pron function accordingly. (#141, #257)
  • Changed testing values used in tests/test_config.py in order to accommodate the addition of positive flags. (#141)
  • Specified UTF-8 encoding in handling text files. (#221)
  • Moved previous contents of tests into tests/test_wikipron. (#226)
  • Updated the packages version numbers in requirements.txt to their latest according to PyPI. (#239)
  • Updated the default pron selector to also look for IPA strings under paragraphs in addition to list items. (#295)
  • Updated segments package version to 2.2.0. (#308)

Removed

  • Moved Wiktionary querying functions from test_languagecodes.py to codes.py (#205)

[1.1.0] - 2020-03-03

03 Mar 15:37
7ea49af

Added

  • Added the extraction function for Mandarin Chinese and its scraped data. (#124)
  • Integrated Wortschatz frequencies. (#122)

Changed

  • Updated the Japanese extraction function and Japanese data. (#129)
  • Updated all scraped Wiktionary data and frequency data. (#127, #128)
  • Generalized the splitting script in the big scrape. (#123)
  • Moved small file removal to generate_summary.py. (#119)
  • Updated Russian data. (#115)

Fixed

  • Avoided and logged error in case of pron processing failure. (#130)
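
The fix above (#130) logs a pron processing failure instead of letting it abort the scrape. The general shape of such a guard might look like the following. The safe_process helper, the choice of exception types, and the log wording are all assumptions for illustration, not the library's actual behavior:

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def safe_process(
    word: str, raw_pron: str, process: Callable[[str], str]
) -> Optional[str]:
    """Run process on raw_pron; on failure, log a warning and return None."""
    try:
        return process(raw_pron)
    except (IndexError, ValueError) as err:
        # Assumption: these are the failure types worth skipping over.
        logger.warning(
            "Skipping %r: could not process pron %r (%s)", word, raw_pron, err
        )
        return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(safe_process("cat", "kæt", str.strip))
    print(safe_process("bad", "", lambda pron: pron[0]))
```

Returning None lets the caller drop the entry and continue with the next one, so a single malformed pronunciation does not end a long scrape.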