Update README.rst

appeler · Aug 17, 2023 · 07287f9 · 07287f9
1 parent 96908f9
commit 07287f9
Showing 1 changed file with 14 additions and 303 deletions.
diff --git a/README.rst b/README.rst
@@ -25,318 +25,29 @@ We strongly recommend installing `ethnicolor2` inside a Python virtual environme
 
     pip install ethnicolr2
 
-General API
-------------------
-
-To see the available command line options for any function, please type in 
-``<function-name> --help``
-
-::
-
-   # census_ln --help
-   usage: census_ln [-h] [-y {2000,2010}] [-o OUTPUT] -l LAST input
-
-   Appends Census columns by last name
-
-   positional arguments:
-     input                 Input file
-
-   optional arguments:
-     -h, --help            show this help message and exit
-     -y {2000,2010}, --year {2000,2010}
-                           Year of Census data (default=2000)
-     -o OUTPUT, --output OUTPUT
-                           Output file with Census data columns
-     -l LAST, --last LAST  Name of the column containing the last name
-
-
-Examples
+Example
 ----------
 
-To append census data from 2010 to a `file with column header in the first row <ethnicolr2/data/input-with-header.csv>`__, specify the column name carrying last names using the ``-l`` option, keeping the rest the same:
-
-::
-
-   census_ln -y 2010 -o output-census2010.csv -l last_name input-with-header.csv   
-
-
-To predict race/ethnicity using the Florida Last Name Model, specify the column name of last name and first name by using ``-l`` and ``-f`` flags respectively.
+To predict race/ethnicity using the Florida Last Name Model to a `file with first and last names <ethnicolr2/data/input-with-header.csv>`__
 
 ::
 
-   pred_fl_last_name -o output-wiki-pred-race.csv -l last_name -f first_name input-with-header.csv
-
-
-Functions
-----------
-
-We expose 4 functions, each of which either take a pandas DataFrame or a
-CSV.
-
-- **census\_ln(df, lname_col, year=2000)**
-
-  -  What it does:
-
-     - Removes extra space
-     - For names in the `census file <ethnicolr/data/census>`__, it appends 
-       relevant data of what probability the name provided is of a certain race/ethnicity
-
- +------------+--------------------------------------------------------------------------------------------------------------------------+
- | Parameters |                                                                                                                          |
- +============+==========================================================================================================================+
- |            | **df** : *{DataFrame, csv}* Pandas dataframe of CSV file contains the names of the individual to be inferred             |
- +------------+--------------------------------------------------------------------------------------------------------------------------+
- |            | **lname_col** : *{string}* name of the column containing the last name                                                   |
- +------------+--------------------------------------------------------------------------------------------------------------------------+
- |            | **Year** : *{2000, 2010}, default=2000* year of census to use                                                            |
- +------------+--------------------------------------------------------------------------------------------------------------------------+
-
-
--  Output: Appends the following columns to the pandas DataFrame or CSV: 
-   pctwhite, pctblack, pctapi, pctaian, pct2prace, pcthispanic. 
-   See `here <https://github.com/appeler/ethnicolr/blob/master/ethnicolr/data/census/census_2000.pdf>`__ 
-   for what the column names mean.
-
-   ::
-
-      >>> import pandas as pd
-
-      >>> from ethnicolr import census_ln, pred_census_ln
-
-      >>> names = [{'name': 'smith'},
-      ...         {'name': 'zhang'},
-      ...         {'name': 'jackson'}]
-
-      >>> df = pd.DataFrame(names)
-
-      >>> df
-            name
-      0    smith
-      1    zhang
-      2  jackson
-
-      >>> census_ln(df, 'name')
-            name pctwhite pctblack pctapi pctaian pct2prace pcthispanic
-      0    smith    73.35    22.22   0.40    0.85      1.63        1.56
-      1    zhang     0.61     0.09  98.16    0.02      0.96        0.16
-      2  jackson    41.93    53.02   0.31    1.04      2.18        1.53
-
-
--  **pred\_census\_ln(df, lname_col, year=2000, num\_iter=100, conf\_int=1.0)**
-
-   -  What it does:
-
-      -  Removes extra space.
-      -  Uses the `last name census 2000 
-         model <ethnicolr/models/ethnicolr_keras_lstm_census2000_ln.ipynb>`__ or 
-         `last name census 2010 model <ethnicolr/models/ethnicolr_keras_lstm_census2010_ln.ipynb>`__ 
-         to predict the race and ethnicity.
-
-
-   +--------------+---------------------------------------------------------------------------------------------------------------------+
-   | Parameters   |                                                                                                                     |
-   +==============+=====================================================================================================================+
-   |              | **df** : *{DataFrame, csv}* Pandas dataframe of CSV file contains the names of the individual to be inferred        |
-   +--------------+---------------------------------------------------------------------------------------------------------------------+
-   |              | **namecol** : *{string}* name of the column containing the last name                                                |
-   +--------------+---------------------------------------------------------------------------------------------------------------------+
-   |              | **year** : *{2000, 2010}, default=2000* year of census to use                                                       |
-   +--------------+---------------------------------------------------------------------------------------------------------------------+
-   |              | **num\_iter** : *int, default=100* number of iterations to calculate uncertainty in model                           |
-   +--------------+---------------------------------------------------------------------------------------------------------------------+
-   |              | **conf\_int** : *float, default=1.0* confidence interval in predicted class                                         |
-   +--------------+---------------------------------------------------------------------------------------------------------------------+
-
-
-   -  Output: Appends the following columns to the pandas DataFrame or CSV:
-      race (white, black, asian, or hispanic), api (percentage chance
-      asian), black, hispanic, white. For each race it will provide the
-      mean, standard error, lower & upper bound of confidence interval
-
-   *(Using the same dataframe from example above)*
-   ::
-
-         >>> census_ln(df, 'name')
-               name pctwhite pctblack pctapi pctaian pct2prace pcthispanic
-         0    smith    73.35    22.22   0.40    0.85      1.63        1.56
-         1    zhang     0.61     0.09  98.16    0.02      0.96        0.16
-         2  jackson    41.93    53.02   0.31    1.04      2.18        1.53
-
-         >>> census_ln(df, 'name', 2010)
-               name   race pctwhite pctblack pctapi pctaian pct2prace pcthispanic
-         0    smith  white     70.9    23.11    0.5    0.89      2.19         2.4
-         1    zhang    api     0.99     0.16  98.06    0.02      0.62        0.15
-         2  jackson  black    39.89    53.04   0.39    1.06      3.12         2.5
-
-         >>> pred_census_ln(df, 'name')
-               name   race       api     black  hispanic     white
-         0    smith  white  0.002019  0.247235  0.014485  0.736260
-         1    zhang    api  0.997807  0.000149  0.000470  0.001574
-         2  jackson  black  0.002797  0.528193  0.014605  0.454405
-
-
--  **pred\_fl\_reg\_ln(df, lname_col, num\_iter=100, conf\_int=1.0)**
-
-   -  What it does?:
-
-      -  Removes extra space, if there.
-      -  Uses the `last name FL registration
-         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_ln.ipynb>`__
-         to predict the race and ethnicity.
-
-   +--------------+---------------------------------------------------------------------------------------------------------------------+
-   | Parameters   |                                                                                                                     |
-   +==============+=====================================================================================================================+
-   |              | **df** : *{DataFrame, csv}* Pandas dataframe of CSV file contains the names of the individual to be inferred        |
-   +--------------+---------------------------------------------------------------------------------------------------------------------+
-   |              | **lname_col** : *{string}* name of the column containing the last name                                              |
-   +--------------+---------------------------------------------------------------------------------------------------------------------+
-   |              | **num\_iter** : *int, default=100* number of iterations to calculate the uncertainty                                |
-   +--------------+---------------------------------------------------------------------------------------------------------------------+
-   |              | **conf\_int** : *float, default=1.0* confidence interval                                                            |
-   +--------------+---------------------------------------------------------------------------------------------------------------------+
-
-
-
-   -  Output: Appends the following columns to the pandas DataFrame or CSV:
-      race (white, black, asian, or hispanic), asian (percentage chance
-      Asian), hispanic, nh\_black, nh\_white. For each race it will provide
-      the mean, standard error, lower & upper bound of confidence interval
-
-   ::
-
-      >>> import pandas as pd
-
-      >>> names = [
-      ...             {"last": "sawyer", "first": "john", "true_race": "nh_white"},
-      ...             {"last": "torres", "first": "raul", "true_race": "hispanic"},
-      ...         ]
-      
-      >>> df = pd.DataFrame(names)
-
-      >>> from ethnicolr import pred_fl_reg_ln, pred_fl_reg_name, pred_fl_reg_ln_five_cat, pred_fl_reg_name_five_cat
-
-      >>> odf = pred_fl_reg_ln(df, 'last', conf_int=0.9)
-      ['asian', 'hispanic', 'nh_black', 'nh_white']
-
-      >>> odf
-         last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  hispanic_lb  hispanic_ub  nh_black_mean  nh_black_std  nh_black_lb  nh_black_ub  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race
-      0  Sawyer  john  nh_white    0.009859   0.006819  0.005338  0.019673       0.021488      0.004602     0.014802     0.030148       0.180929      0.052784     0.105756     0.270238       0.787724      0.051082     0.705290     0.860286  nh_white
-      1  Torres  raul  hispanic    0.006463   0.001985  0.003915  0.010146       0.878119      0.021998     0.839274     0.909151       0.013118      0.005002     0.007364     0.021633       0.102300      0.017828     0.075911     0.130929  hispanic
-
-      [2 rows x 20 columns]
-
-      >>> odf.iloc[0]
-      last               Sawyer
-      first                john
-      true_race        nh_white
-      asian_mean       0.009859
-      asian_std        0.006819
-      asian_lb         0.005338
-      asian_ub         0.019673
-      hispanic_mean    0.021488
-      hispanic_std     0.004602
-      hispanic_lb      0.014802
-      hispanic_ub      0.030148
-      nh_black_mean    0.180929
-      nh_black_std     0.052784
-      nh_black_lb      0.105756
-      nh_black_ub      0.270238
-      nh_white_mean    0.787724
-      nh_white_std     0.051082
-      nh_white_lb       0.70529
-      nh_white_ub      0.860286
-      race             nh_white
-      Name: 0, dtype: object
-
-
--  **pred\_fl\_reg\_name(df, lname_col, num\_iter=100, conf\_int=1.0)**
-
-   -  What it does:
-
-      -  Removes extra space.
-      -  Uses the `full name FL
-         model <ethnicolr/models/ethnicolr_keras_lstm_fl_voter_name.ipynb>`__
-         to predict the race and ethnicity.
-
-   +--------------+-------------------------------------------------------------------------------------------------------------------+
-   | Parameters   |                                                                                                                   |
-   +==============+===================================================================================================================+
-   |              | **df** : *{DataFrame, csv}* Pandas dataframe of CSV file contains the names of the individual to be inferred      |
-   +--------------+-------------------------------------------------------------------------------------------------------------------+
-   |              | **namecol** : *{list}* name of the column containing the name.                                                    |
-   +--------------+-------------------------------------------------------------------------------------------------------------------+
-   |              | **num\_iter** : *int, default=100* number of iterations to calculate the uncertainty                              |
-   +--------------+-------------------------------------------------------------------------------------------------------------------+
-   |              | **conf\_int** : *float, default=1.0* confidence interval in predicted class                                       |
-   +--------------+-------------------------------------------------------------------------------------------------------------------+
-
-
-   -  Output: Appends the following columns to the pandas DataFrame or CSV:
-      race (white, black, asian, or hispanic), asian (percentage chance
-      Asian), hispanic, nh\_black, nh\_white. For each race it will provide
-      the mean, standard error, lower & upper bound of confidence interval
-
+   import pandas as pd
+   df = pd.read_csv("ethnicolr2/data/input-with-header.csv")
+   pred_fl_last_name(df, lname_col = "last_name")
    
-   *(Using the same dataframe from example above)*
-   ::
-
-      >>> odf = pred_fl_reg_name(df, 'last', 'first', conf_int=0.9)
-      ['asian', 'hispanic', 'nh_black', 'nh_white']
 
-      >>> odf
-         last first true_race  asian_mean  asian_std  asian_lb  asian_ub  hispanic_mean  hispanic_std  hispanic_lb  hispanic_ub  nh_black_mean  nh_black_std  nh_black_lb  nh_black_ub  nh_white_mean  nh_white_std  nh_white_lb  nh_white_ub      race
-      0  Sawyer  john  nh_white    0.001534   0.000850  0.000636  0.002691       0.006818      0.002557     0.003684     0.011660       0.028068      0.015095     0.011488     0.055149       0.963581      0.015738     0.935445     0.983224  nh_white
-      1  Torres  raul  hispanic    0.005791   0.002906  0.002446  0.011748       0.890561      0.029581     0.841328     0.937706       0.011397      0.004682     0.005829     0.020796       0.092251      0.026675     0.049868     0.139210  hispanic
-
-      >>> odf.iloc[1]
-      last               Torres
-      first                raul
-      true_race        hispanic
-      asian_mean       0.005791
-      asian_std        0.002906
-      asian_lb         0.002446
-      asian_ub         0.011748
-      hispanic_mean    0.890561
-      hispanic_std     0.029581
-      hispanic_lb      0.841328
-      hispanic_ub      0.937706
-      nh_black_mean    0.011397
-      nh_black_std     0.004682
-      nh_black_lb      0.005829
-      nh_black_ub      0.020796
-      nh_white_mean    0.092251
-      nh_white_std     0.026675
-      nh_white_lb      0.049868
-      nh_white_ub       0.13921
-      race             hispanic
-      Name: 1, dtype: object
-
-
-
-Application
---------------
-
-To illustrate how the package can be used, we impute the race of the campaign contributors recorded by FEC for the years 2000 and 2010 and tally campaign contributions by race.
-
-- `Contrib 2000/2010 using census_ln <ethnicolr/examples/ethnicolr_app_contrib20xx-census_ln.ipynb>`__
-- `Contrib 2000/2010 using pred_census_ln <ethnicolr/examples/ethnicolr_app_contrib20xx.ipynb>`__
-- `Contrib 2000/2010 using pred_fl_reg_name <ethnicolr/examples/ethnicolr_app_contrib20xx-fl_reg.ipynb>`__
-
-
-Data
-----------
+   names = [
+    {"last": "sawyer", "first": "john", "true_race": "nh_white"},
+    {"last": "torres", "first": "raul", "true_race": "hispanic"},
+   ]
+   df = pd.DataFrame(names) 
+   df = pred_fl_full_name(df, lname_col = "last", fname_col = "first")
 
-In particular, we utilize the last-name--race data from the `2000
-census <http://www.census.gov/topics/population/genealogy/data/2000_surnames.html>`__
-and `2010
-census <http://www.census.gov/topics/population/genealogy/data/2010_surnames.html>`__,
-the `Wikipedia data <ethnicolr/data/wiki/>`__ collected by Skiena and colleagues,
-and the Florida voter registration data from early 2017.
+         last  first true_race   preds
+   0  sawyer   john  nh_white nh_white
+   1  torres   raul  hispanic hispanic
 
--  `Census <ethnicolr/data/census/>`__
--  `The Wikipedia dataset <ethnicolr/data/wiki/>`__
--  `Florida voter registration database <http://dx.doi.org/10.7910/DVN/UBIG3F>`__
 
 Authors
 ----------