i18n: european portuguese word list #6044

Open
wants to merge 1 commit into base: develop

lisbonjoker

Imported from Dicionários Natura: https://natura.di.uminho.pt/download/sources/Dictionaries/wordlists/
The list still needs trimming to remove words that would be difficult to memorize.

@lisbonjoker lisbonjoker requested a review from a team as a code owner July 19, 2021 12:00
@gonzalo-bulnes
Contributor

Hi @lisbonjoker!

I'd like to understand better how to review your PR. Could you please give some more context on that word list and what made you choose it?

For example, I'd love to hear more about:

  • How is the word list licensed?
  • How is it composed? By whom?
  • What words are in it, are those verbs, nouns, adjectives, adverbs? Where do they come from?
  • What purpose would it fulfill in the context of SecureDrop?

@lisbonjoker
Author

How is the word list licensed?

The dictionaries are covered by the GPL, LGPL, and MPL licenses (or at least one of them).

How is it composed? By whom?

The Natura Project is a small research group in Natural Language Processing at the Department of Computer Science, University of Minho. It is part of a larger Language Processing and Specification group.

More at: https://natura.di.uminho.pt/wiki/doku.php?id=dicionarios:main

Current Management

José João Almeida
Alberto Simões

Other collaborators

Rui Vilela
António Dias
Paulo Rocha
Ulisses Pinto

What words are in it, are those verbs, nouns, adjectives, adverbs? Where do they come from?

It is a list of Portuguese words (including some acronyms, etc.).

It contains proper names, acronyms, abbreviations and common loanwords. The list is derived from the Jspell dictionary for morphological analysis.

What purpose would it fulfill in the context of SecureDrop?

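It would let European Portuguese speakers use SecureDrop with a word list in their own variant of the language. There is a big difference between Brazilian Portuguese (PT BR) and European Portuguese (PT EU): some words used in one are unused or unrecognized in the other.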

@sts10
Contributor

sts10 commented Jan 9, 2024

I don't know if this is helpful to this conversation, but:

In an effort to cut this very long list down to a length closer to the existing SecureDrop wordlists, I took the most frequently appearing words from Portuguese Wikipedia articles (with help from this project), then kept only the words that also appear on this 994,951-word list.

I then removed all words with accented characters or non-UTF-8 characters (I think), all words shorter than 3 or longer than 15 characters, and any Roman numerals. (Notably, I didn't filter out profane words.) I arbitrarily chose to make this new list 10,000 words long. The result was this wordlist. Hope this helps, and sorry if it derails things.
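
For reviewers who want to reproduce or adjust that trimming, here is a minimal Python sketch of the steps as described above. It is not the tool that was actually used: the function name `build_wordlist`, its parameters, and the Roman numeral pattern are all hypothetical, and it assumes the Wikipedia frequencies are already available as a lowercase word-to-count mapping. "Accented" is approximated as "not pure ASCII letters".

```python
import re
from collections import Counter

# Hypothetical pattern for well-formed lowercase Roman numerals (e.g. "xiv").
# The lookahead rejects the empty string, which the rest would otherwise match.
ROMAN_RE = re.compile(
    r"(?=[ivxlcdm])m{0,3}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})"
)


def build_wordlist(frequencies, dictionary, size=10_000, min_len=3, max_len=15):
    """Return up to `size` words, most frequent first, that pass every filter.

    frequencies: mapping of lowercase word -> occurrence count (assumed input).
    dictionary:  set of words allowed to appear in the result.
    """
    kept = []
    for word, _count in Counter(frequencies).most_common():
        if word not in dictionary:
            continue  # keep only words present in the large source list
        if not (word.isascii() and word.isalpha()):
            continue  # drop accented or otherwise non-ASCII words
        if not (min_len <= len(word) <= max_len):
            continue  # drop words that are too short or too long
        if ROMAN_RE.fullmatch(word):
            continue  # drop Roman numerals
        kept.append(word)
        if len(kept) == size:
            break
    return kept


if __name__ == "__main__":
    sample_counts = {"de": 100, "casa": 50, "não": 40, "xiv": 30, "tempo": 20}
    sample_dictionary = {"de", "casa", "não", "xiv", "tempo"}
    print(build_wordlist(sample_counts, sample_dictionary, size=2))
```

As in the description above, this sketch does not filter profane words. The Roman numeral pattern also rejects any real word that happens to be a well-formed numeral, so the removed words would be worth reviewing by hand.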
