Spanish 150 word list and a suggestion #5

Open
fjcaceres opened this issue Jun 29, 2020 · 6 comments
Labels
trivia Really cool trivia about stuff

Comments

@fjcaceres

  1. The Spanish words seem OK. You may run into some issues if the letter "Ñ" is used; it is an n with a little ~ on top of it.

  2. If you can, please consider adding Colombia; its domain is ".co".

Javier

@ignis-sec
Owner

Hello! Thanks for the suggestion! During the first stages of this project, support for non-ASCII characters was really limited, and I wrote anything that caused an exception into an error file. I re-added them during the second phase, and I'm sure it will greatly impact the results for regional passwords.

It was also one of the main reasons why I released only very limited lists (150 lines) for the regional ones.

It will be fixed in the next release of regional passwords.
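The "write anything that raises into an error file" step described above could look roughly like this. It is only a sketch under assumptions: the input is treated as raw byte lines of unknown encoding, and `split_decodable` is a hypothetical helper name, not code from this project.

```python
def split_decodable(raw_lines, encoding="utf-8"):
    """Separate byte lines that decode cleanly from those that do not."""
    good, bad = [], []
    for raw in raw_lines:
        try:
            good.append(raw.rstrip(b"\r\n").decode(encoding))
        except UnicodeDecodeError:
            # Keep the raw bytes so they can be re-processed in a later phase.
            bad.append(raw)
    return good, bad


# Illustrative input: one UTF-8 line, one Latin-1 line, one plain ASCII line.
sample = [
    "contraseña\n".encode("utf-8"),
    "españa\n".encode("latin-1"),
    b"password\n",
]
good, bad = split_decodable(sample)
```

The rejected lines in `bad` could then be written to an error file and retried later with other encodings, which is roughly what "re-adding them during the second phase" suggests.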

Meanwhile, please enjoy this small list of passwords (ranked by most common to least common) containing "Ñ".

contraseña
asdfghjklñ
españa
CONTRASEÑA
micontraseña
muñeco
muñeca
Contraseña
cariño
pequeño
lacontraseña
niñas123
nuñez
contrasseña
muñequita
tinerfeño1839
Ciñera00
asdfghjklñ123
contraceña
coqueñaloly
mañana
muñecas
muñoz
teextraño
ññññññ
123contraseña
17años
Magariño
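A list like the one above can be produced by counting occurrences and keeping only the passwords that contain the letter. The sketch below assumes an in-memory list of passwords; `rank_containing` and the sample data are hypothetical. It normalizes to NFC first, since "ñ" can also be stored as "n" followed by a combining tilde.

```python
import unicodedata
from collections import Counter


def rank_containing(passwords, needle):
    """Return distinct passwords containing `needle`, most common first.

    The containment check is case-insensitive; the passwords themselves
    keep their original case.
    """
    needle = unicodedata.normalize("NFC", needle).casefold()
    normalized = (unicodedata.normalize("NFC", p) for p in passwords)
    counts = Counter(p for p in normalized if needle in p.casefold())
    return [password for password, _ in counts.most_common()]


# Made-up sample data, not taken from any real leak.
leaked = [
    "contraseña", "123456", "españa", "contraseña",
    "CONTRASEÑA", "españa", "contraseña",
]
ranked = rank_containing(leaked, "Ñ")
```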

About Colombia: I refrained from adding country codes for countries that had multiple languages, so I wouldn't taint the language lists.

I skipped Colombia because the language field said "Spanish, Castillian" and I thought those were two different languages.

I'll re-add Colombia to the Spanish list.

Your feedback and suggestions are very much appreciated. I hope next release will be a big change. Cheers!

@fjcaceres
Author

Great. And just as trivia:

  1. The real name of the Spanish language is "Castellano", which comes from the Castilla region of Spain where it originated. Spain in Spanish is España, which is why .es is used for Spain. So when you found two languages for Colombia, I guess they were trying to give the language both in its English translation and in its original form, Castellano, although they wrote "Castillian".

  2. The number 1 password you found in Spanish was "contraseña". Translation? "password".

  3. Will we be able to run localized passwords depending on the country, just like you did with the Dutch sample you tweeted?

I am fluent in Spanish as you may presume so if you need help let me know by emailing me at fjcaceres@gmail.com

Javier

@ignis-sec
Owner

Thanks for the trivia!

About 3., i really want to make that a thing, but i've hit a brick wall about that part of the service i want to provide.

In another issue, i've given some examples for services i want to give access to everyone:

  • Most common 1000 passwords containing word "admin"
  • Most common 1.000.000 passwords matching a certain complexity
  • Most common 100.000 passwords used by email addresses ending with ".tr"
  • Most common 1000 passwords used by accounts from @example.com email provider.

I kind of hit a brick wall about the language one because there are fundamentally difficult parts about that.

  • Leading wildcards such as %.nl are not optimized well.
  • Bulk processing emails is fast. Bulk processing passwords is fast. But when you want to process credentials (such as "all the passwords used by this kind of email"), it has to run a massive join operation, and it takes a while to get the result.
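One common workaround for the leading-wildcard problem (not necessarily what this project uses) is to store a reversed copy of the email, so that a suffix search such as %.nl becomes a prefix range scan that an ordinary index can serve. A minimal sketch with Python's `sqlite3`; the `creds` table, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE creds (email TEXT, email_rev TEXT, password TEXT)")
# The index on the reversed email is what makes the range scan cheap.
conn.execute("CREATE INDEX idx_email_rev ON creds (email_rev)")

rows = [
    ("jan@example.nl", "welkom01"),
    ("ana@example.es", "contraseña"),
    ("piet@mail.nl", "geheim"),
]
conn.executemany(
    "INSERT INTO creds VALUES (?, ?, ?)",
    [(email, email[::-1], password) for email, password in rows],
)


def passwords_for_suffix(conn, suffix):
    """Return passwords whose email ends with `suffix`, e.g. '.nl'."""
    prefix = suffix[::-1]  # '.nl' becomes 'ln.'
    # Smallest string greater than every string starting with `prefix`.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    cursor = conn.execute(
        "SELECT password FROM creds WHERE email_rev >= ? AND email_rev < ?",
        (prefix, upper),
    )
    return [row[0] for row in cursor]


dutch = passwords_for_suffix(conn, ".nl")
```

This only addresses the wildcard bullet; the cost of the large join between emails and passwords is a separate problem.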

As a result of this, i've been considering implementing a queue system.

Each user puts their request in a queue. Requests are processed one by one, and results are emailed back or made available to the user after login.

A second queue will also keep track of the query results. It'll insert query results at the front of the queue, while the last query at the back gets popped. If a new query is requested and its result is already in the queue, it'll be pushed to the front without inserting, so frequent requests won't be repeated.
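The second queue described here behaves like a least-recently-used (LRU) cache. A small sketch of that behaviour, assuming query strings as keys; the class name `ResultCache` and the sample queries are hypothetical, and nothing here is the project's actual implementation.

```python
from collections import OrderedDict


class ResultCache:
    """Keep the most recently requested query results, up to `capacity`."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # first entry = front of the queue

    def get(self, query):
        """Return a cached result and move it to the front, or None."""
        if query not in self._items:
            return None
        self._items.move_to_end(query, last=False)
        return self._items[query]

    def put(self, query, result):
        """Insert a result at the front; pop the entry at the back if full."""
        self._items[query] = result
        self._items.move_to_end(query, last=False)
        if len(self._items) > self.capacity:
            self._items.popitem(last=True)


cache = ResultCache(capacity=2)
cache.put("contains:admin", ["admin", "admin123"])
cache.put("tld:tr", ["123456"])
cache.get("contains:admin")         # requested again, moves to the front
cache.put("tld:nl", ["welkom01"])   # cache is full, "tld:tr" is popped
```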

I'll have to implement this to see if it could create enough of an impact to make it viable first.

But at this time, it looks like I have neither the time nor the funds to make this yet :(

If that's okay with you, I'll contact you once collection 2-5 is processed and I'm ready to update the regional lists with larger versions.

I'll keep this issue open until then.
Cheers!

@fjcaceres
Author

3 ideas

  1. Regarding passwords and countries, the first thing is to relate the domain name where the password was found to the associated passwords. For example, Italian sites all end in .it, which is what you did with the Irish case,

BUT

many Italian sites are not .it but .com, so how can you relate language and passwords?

You may assume, in general terms, that you will go to a site that is in the language you speak (unless you're multilingual like me; I go to sites in English and in Spanish).

So if this is the case, you cannot just consider the domain of the site, because we already saw that there are sites in Italian that are not .it but .com. This means there will be more sites with lang="it" in their source code than sites with both lang="it" and a .it domain.

The way I would suggest is to look at the source code of the site, where the first line declares what language the site is in. I am sure that if you go to an Italian site, the source code will probably say its language is Italian (lang="it"), lang="fr" for French sites, and so on.

So the relation would be the "lang" parameter found in the source code of the website where you found the password. Some lang parameters have two parts, because you have English but as spoken in the US, the UK or Australia (en-US, en-GB, en-AU, etc.), or Spanish as spoken in Spain, Colombia, Argentina or Mexico (es-ES, es-CO, es-AR, es-MX). So the second part of the lang value would give you the most information about the country the password relates to. Of course, this assumes the site was designed properly. You may also find a country parameter, but I don't think it is a must when designing a site.

  2. Regarding lookup functions for words, maybe you can take a look at scrabble or dictionary sites and use OSINT to see what they use. On these sites you can build your search: how many characters, starting with, ending with, having a string after x amount of words, etc.

  3. Going back to one of your tweets, I see that you have Castilian as the language for .bo; that is Bolivia, which speaks Spanish.
    See you

@ignis-sec ignis-sec added the trivia Really cool trivia about stuff label Jun 29, 2020
@ignis-sec
Owner

> many italian sites are not .it but are .com so how can you relate language and passwords ?

So I guess I was a bit ambiguous there: I'm not filtering on the TLDs of leaked websites, I'm filtering on the TLDs of email addresses.

For now, I've only used the top-level domain of users' email addresses to filter by language.
I know this really decreases the number of samples I get, because I'm missing a lot of accounts, and it is unfair to users with .com accounts, .edu accounts, etc.
That being said, it's the only quick solution I had for filtering confidently down to users of a certain language (for now).

So, for example, I'm betting on no French users using .de emails and no German users using .fr emails. If they are using .com, they are not going to get thrown into the dataset, but at least the dataset is not tainted.
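The email-TLD filter described here can be sketched as follows. The `TLD_LANGUAGE` table is an illustrative assumption (only a handful of country codes, chosen for the example), and `language_for_email` is a hypothetical helper, not the project's code. Generic TLDs such as .com are deliberately left out so they never enter a language bucket.

```python
# Illustrative mapping only; a real table would need careful curation.
TLD_LANGUAGE = {
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "co": "Spanish",
    "nl": "Dutch",
}


def language_for_email(email):
    """Return the language implied by the email's TLD, or None if unknown."""
    domain = email.rpartition("@")[2]
    tld = domain.rpartition(".")[2].lower()
    return TLD_LANGUAGE.get(tld)


def bucket_by_language(credentials):
    """Group passwords by language, dropping emails with unmapped TLDs."""
    buckets = {}
    for email, password in credentials:
        language = language_for_email(email)
        if language is not None:
            buckets.setdefault(language, []).append(password)
    return buckets


# Made-up credentials for demonstration.
buckets = bucket_by_language([
    ("javier@correo.co", "contraseña"),
    ("someone@gmail.com", "123456"),
    ("hans@post.de", "passwort"),
])
```

The .com account is silently skipped, which is the trade-off described above: fewer samples, but an untainted list.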

> The way I would suggest is by going inside the source code of the site where in the first line you have to declare what language the site is in . I am sure that if you go to an italian site the source code will probably say it's language is italian, ( lang="it" ) lang="fr" for french sites an so on.

It's a good idea, but there are a lot of multilingual websites with options to change language, and that's going to taint the dataset.

> Regarding lookup functions for words maybe you can take a look at scrabble sites or dictionary sites and using and OSINT what they use In these site you can build your search. How many characters, starting with, ending with, have a string after x amount of words etc.

It's a good idea that I've been considering, with small differences. Some alphabetical characters can hint at which language the user speaks. For example, a password containing "Ñ" can hint towards Spanish.

But I don't think I'll take this approach. It will create a bias in which kinds of passwords make it into the dataset and which won't. If I start filtering on the passwords themselves rather than the leak source, email, and other metadata, it might create a problem.

> Going back to one of your tweets I see that you have as language castilian and have .bo this is for Bolivia that speaks spanish.

Yup, I've taken note of it, and I'm going to merge Castilian and Spanish together in the next big release.

I've been a bit hasty reading your comment and replying, so please do not hesitate to tell me if I missed or misunderstood something.

Cheers!

@fjcaceres
Author

fjcaceres commented Jun 29, 2020

Not at all, it's nice of you to listen to opinions.

Regarding what you mentioned:

> For now, i've only taken top level domain of the email addresses of users to filter the languages.

  1. The TLD of the email address will work if it has one that is two letters long.

  2. Consider also the TLD of the compromised site. If the TLD is 2 letters, you have the country, and you can probably say that all emails associated with that site belong to that country.

  3. I think you will limit your findings very much, because to have an email with a country TLD, you need to have paid for a domain and hosting, unless your free email provider uses one, like Yahoo, which has yahoo.es, .fr, .it, etc. All Gmail accounts, which are the most popular free email service in the western world, are .com, and you would lose all of those.

That's why I mentioned you could filter on the TLD of the leaked website and, by scraping the source code, find the language and probably the country of the website the email account is coming from. For example, if you had a leak from the website http://hogarmania.com, how do you know what language it was in?

Looking in the source code, in the first line you will find <html xmlns="https://www.w3.org/1999/xhtml" lang="es">

in this case at least you know it's in Spanish
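Reading the declared language out of a page's source could be sketched like this, using only the standard library's `html.parser`. The names `LangFinder` and `page_language` are hypothetical, and the split into language and region is a simplification that assumes simple two-part tags such as "en-AU"; longer tags (for example ones with a script subtag) would need a proper language-tag parser.

```python
from html.parser import HTMLParser


class LangFinder(HTMLParser):
    """Record the lang attribute of the first <html> tag encountered."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def page_language(html_text):
    """Return (language, region) from the page's lang attribute.

    Either element is None when the page does not declare it.
    """
    finder = LangFinder()
    finder.feed(html_text)
    if not finder.lang:
        return None, None
    language, _, region = finder.lang.replace("_", "-").partition("-")
    return language.lower(), (region.upper() or None)


spanish = page_language(
    '<html xmlns="https://www.w3.org/1999/xhtml" lang="es"><head></head></html>'
)
australian = page_language('<html lang="en-AU"><body></body></html>')
```

As noted earlier in the thread, multilingual sites that let visitors switch language would still make this signal unreliable on its own.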

Or maybe the language is in a meta tag, though not all sites have a meta tag with the language.

look at this one http://webawards.com.au

Because the website ends in a 2-letter domain, au, you'll know right away that it belongs to Australia

BUT

look at their language meta tag it is ..........

See, it's in English from Australia, "en-AU". You are interested in the country, so here it is in the meta tag and not in the first line where it just said or sometimes both or just one place

So if the email was a Gmail account and the website was just webawards.com, you would not guess what country the email account was from unless you looked inside the code; this way you can.
