amacinho/Name-Gender-Guesser

Copying: Name Gender Guesser is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Name Gender Guesser is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Name Gender Guesser. If not, see <http://www.gnu.org/licenses/>.

Introduction: Name Gender Guesser helps you find out the gender of a given name. You can either use the two provided datasets (or your own), which consist of common American names with their frequencies in the male and female populations, or use the Yahoo! BOSS API to guess the gender of an unknown name by carrying out pattern-based searches.

Quick Start: Check out the code and run example.py.

Less Quick Start: This project contains two datasets of gender associations for common American names and two scripts: one to handle these datasets, the other to carry out a web-based search to guess the gender of unknown names.

The first dataset, us_census, comes from the US Census Bureau and is constructed as follows:

The names are fetched from the Bureau's web site (http://www.census.gov/genealogy/www/data/1990surnames/names_files.html) and put in two files, us_census_males and us_census_females, which contain the frequency of names for the sample male and female populations respectively (according to the 1990 census).

The second dataset, popular_baby_names, comes from the US Social Security Administration's statistics for popular baby names for every year between 1960 and 2010. The dataset is constructed as follows:

1) Fetch the 100 most popular female and male names for every year between 1960 and 2010 from http://www.ssa.gov/cgi-bin/popularnames.cgi
2) For each male and female name, calculate the average probability of usage between 1960 and 2010. Missing years are not used in the averaging. This implies that if a name was in the top 100 list for only one year of the given period, its final score is its probability for that year.
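The averaging rule in step 2 can be sketched as follows. This is a minimal illustration, not the script used to build the dataset: the function name `average_usage`, the year-to-probability mapping, and the sample figures are all assumptions made for the example.

```python
def average_usage(yearly_probabilities):
    """Average a name's usage probability over the years in which it appears.

    `yearly_probabilities` maps year -> probability (hypothetical layout).
    Years in which the name was outside the top 100 are simply absent,
    so they are skipped rather than counted as zero.
    """
    if not yearly_probabilities:
        return 0.0
    return sum(yearly_probabilities.values()) / len(yearly_probabilities)


# Made-up figures, for illustration only.
one_year = {1975: 0.004}
three_years = {1960: 0.010, 1961: 0.020, 1962: 0.030}

print(average_usage(one_year))     # the single year's probability is the score
print(average_usage(three_years))  # mean over the three listed years
```

Dividing by the number of years present, rather than by the 51 years in the period, is what makes a name listed in only one year keep that year's probability as its final score.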

The class NameGender (contained in name_gender.py) handles these datasets. If you have your own dataset, you can use it as well. The format is trivial (really, check the files yourself).
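To illustrate how per-gender frequencies can be turned into a guess, here is a small sketch. It does not reproduce the NameGender class or its file format: the function `guess_from_frequencies`, the in-memory dictionaries, the uppercase keys, and the numbers are all hypothetical.

```python
def guess_from_frequencies(name, male_freq, female_freq):
    """Return (gender, confidence) from two name -> frequency mappings.

    gender is "male", "female", or None when the name is unknown or tied.
    confidence is the winning gender's share of the combined frequency.
    """
    key = name.strip().upper()  # assumption: dataset keys are uppercase
    m = male_freq.get(key, 0.0)
    f = female_freq.get(key, 0.0)
    total = m + f
    if total == 0:
        return None, 0.0        # name not found in either dataset
    if m == f:
        return None, 0.5        # evenly split, no decision
    if m > f:
        return "male", m / total
    return "female", f / total


# Made-up frequencies, for illustration only.
male_freq = {"JOHN": 3.0, "LESLIE": 1.0}
female_freq = {"MARY": 2.0, "LESLIE": 3.0}

print(guess_from_frequencies("John", male_freq, female_freq))
print(guess_from_frequencies("Leslie", male_freq, female_freq))
print(guess_from_frequencies("Xyzzy", male_freq, female_freq))
```

A name missing from both mappings returns `None`, which is the case where a web-based fallback would be useful.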

The class WebNameGender does not use any dataset to guess the gender of a name. It simply carries out several web searches via the Yahoo! BOSS API and calculates a gender score according to the hit counts. It provides a fallback mechanism for names that are not contained in the datasets. It also works fairly well for common names in languages other than English (a proper evaluation is yet to be done). You will need a BOSS Application ID to use this class. Two example pattern pairs that WebNameGender uses for a given name X are:

* "X himself", "X herself"
* "husband of X", "wife of X"

In the first case, "X himself" provides evidence that X is a he. In the second case, "husband of X" provides evidence that X is a she. By comparing several pattern pairs like these, WebNameGender computes a gender score for X.
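The pattern-pair comparison can be sketched like this. It is only an illustration of the idea: the scoring formula, the function `gender_score`, and the `hit_count` callback are assumptions, and the sketch makes no calls to the Yahoo! BOSS API. Hit counts come from a made-up dictionary instead of a search engine.

```python
# Each pair is (query suggesting X is male, query suggesting X is female).
# Reading "wife of X" as evidence that X is a he mirrors the "husband of X" case.
PATTERN_PAIRS = [
    ("{name} himself", "{name} herself"),
    ("wife of {name}", "husband of {name}"),
]


def gender_score(name, hit_count):
    """Return a score in [-1, 1]: positive leans male, negative leans female.

    `hit_count` is any callable mapping a query string to a number of hits
    (hypothetical; a real one would query a search service).
    """
    male_hits = 0
    female_hits = 0
    for male_pattern, female_pattern in PATTERN_PAIRS:
        male_hits += hit_count(male_pattern.format(name=name))
        female_hits += hit_count(female_pattern.format(name=name))
    total = male_hits + female_hits
    if total == 0:
        return 0.0  # no evidence either way
    return (male_hits - female_hits) / total


# Made-up hit counts standing in for search results.
fake_counts = {
    "Maria himself": 10,
    "Maria herself": 90,
    "wife of Maria": 5,
    "husband of Maria": 95,
}

score = gender_score("Maria", lambda query: fake_counts.get(query, 0))
print(score)  # negative here, so the evidence leans female
```

Passing the hit counter as a callable keeps the scoring logic testable without network access and independent of any particular search provider.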
