Create dataset malindomorph__morphological_dictionary_and_analyser_for_malay_indonesian #360

Open
albertvillanova opened this issue Jan 19, 2022 · 0 comments
Labels
data catalog (Gathering data from data sources), need custodian permission

Comments

@albertvillanova
Member

  • uid: malindomorph__morphological_dictionary_and_analyser_for_malay_indonesian
  • type: processed
  • description:
    • name: MALINDOMorph: Morphological dictionary and analyser for Malay/Indonesian
    • description: Malay/Indonesian lacked an open wide-coverage dictionary that can be used for both NLP tasks and non-NLP purposes. The MALINDO Morph morphological dictionary is the first such dictionary. It provides morphological information (root, prefix, suffix, circumfix, reduplication) for roughly 232K surface forms. The entry forms are those found in the authoritative dictionaries in Malaysia (Kamus Dewan) and Indonesia (Kamus Besar Bahasa Indonesia) (core dictionary) as well as frequent words in the Leipzig Corpora Collection (Goldhahn et al., 2012) (expanded dictionary). The morphological analyses were checked by hand for all surface forms, except for (i) basic and di-forms in the expanded dictionary whose existence is predicted from the corresponding meN-active forms in the core dictionary and (ii) the case variants of the items in the core dictionary. This paper also discusses the morphological analyser that we developed to create our morphological dictionary. Our morphological analyser is more linguistically rigorous than previous morphological analysers and stemmers/lemmatizers such as MorphInd (Larasati et al., 2011) because it takes into account circumfixes, which have previously been neglected, largely due to a misunderstanding among NLP researchers that circumfixes are no more than combinations of a prefix and a suffix.
    • homepage: http://lrec-conf.org/workshops/lrec2018/W29/pdf/8_W29.pdf
    • validated: True
  • languages:
    • language_names:
      • Indonesian
    • language_comments:
    • language_locations:
      • Asia
      • Indonesia
    • validated: False
  • custodian:
    • name: Hiroki Nomoto
    • in_catalogue:
    • type: A university or research institution
    • location: Japan
    • contact_name: Hiroki Nomoto
    • contact_email: nomoto@tufs.ac.jp
    • contact_submitter: False
    • additional: http://lrec-conf.org/workshops/lrec2018/W29/pdf/8_W29.pdf
    • validated: False
  • availability:
    • procurement:
      • for_download: No - but the current owners/custodians have contact information for data queries
      • download_url:
      • download_email:
    • licensing:
      • has_licenses: Yes
      • license_text:
      • license_properties:
      • license_list:
    • pii:
      • has_pii: Yes
      • generic_pii_likely:
      • generic_pii_list:
      • numeric_pii_likely:
      • numeric_pii_list:
      • sensitive_pii_likely:
      • sensitive_pii_list:
      • no_pii_justification_class:
      • no_pii_justification_text:
    • validated: False
  • processed_from_primary:
    • from_primary: Taken from primary source
    • primary_availability: Yes - their documentation/homepage/description is available
    • primary_license: Unclear / I don't know
    • primary_types:
    • validated: False
    • from_primary_entries:
  • media:
    • category:
      • text
    • text_format:
      • .PDF
    • audiovisual_format:
    • image_format:
    • database_format:
      • other
      • pdf
    • text_is_transcribed: No
    • instance_type:
    • instance_count:
    • instance_size:
    • validated: False
  • fname: malindomorph__morphological_dictionary_and_analyser_for_malay_indonesian.json
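
Several sections of the entry above still carry `validated: False`. As a rough sketch of how those could be listed programmatically, the snippet below assumes the JSON file named in `fname` mirrors the bullet nesting shown here and stores the flags as booleans; neither is confirmed in this issue. The `record` dict is a hypothetical, trimmed-down copy of the entry, and `unvalidated_sections` is an illustrative helper, not part of any catalogue tooling.

```python
# Hypothetical, trimmed-down mirror of the entry above. The real JSON layout
# is not shown in this issue, so the nesting and boolean flags are assumptions.
record = {
    "uid": "malindomorph__morphological_dictionary_and_analyser_for_malay_indonesian",
    "type": "processed",
    "description": {"validated": True},
    "languages": {"language_names": ["Indonesian"], "validated": False},
    "custodian": {"name": "Hiroki Nomoto", "validated": False},
    "availability": {"validated": False},
    "processed_from_primary": {"validated": False},
    "media": {"category": ["text"], "validated": False},
}


def unvalidated_sections(entry):
    """Return the names of top-level sections whose 'validated' flag is False."""
    return [
        name
        for name, section in entry.items()
        # Scalar fields such as 'uid' and 'type' carry no flag and are skipped.
        if isinstance(section, dict) and section.get("validated") is False
    ]


print(unvalidated_sections(record))
```

With the values copied from this entry, only `description` is marked as validated, so the other five sections are reported.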
@albertvillanova albertvillanova added data catalog Gathering data from data sources need custodian permission labels Jan 19, 2022