Indian Electoral Rolls

We have built a dataset of nearly all Indian electors. The data include each elector's first and last name, gender, polling station (with its constituency, district, and state), and father's or husband's name, among other details. We assembled the data by scraping and parsing the electoral rolls.

This repository includes scripts for downloading the PDF electoral rolls from the various state election commission sites. The Parse PDF Rolls repository (https://github.com/in-rolls/parse_elex_rolls) has scripts for parsing the electoral rolls, scripts for translating native-language rolls to English, and information about the resulting CSVs.

Electoral Rolls

To ameliorate concerns about eligible voters not being on the rolls (and ineligible electors being on the rolls), the Election Commission of India mandates that state election commissions publish electoral rolls. As a result, the 36 different election commissions (29 states and 7 union territories) post electoral rolls for each polling station on their websites. The websites vary enormously in design, in the metadata they provide about the polling stations, and in the language in which they provide the electoral rolls. For instance, some commissions provide electoral rolls in English, some in the main native language(s) of the state, and some in both. The one constant is that the electoral rolls are provided as dense PDFs. So we wrote separate scrapers to download the PDFs. In many cases, we also downloaded the metadata that the website provided for each polling station (PDF). (A separate repository uses a different source of data to collate metadata on polling stations.) For scripts, information about the source of the electoral rolls, and such, see the table below.

How Do I Get the Electoral Roll PDFs?

Given privacy concerns, we are releasing the data only for research purposes. To access the PDFs, you must agree to take all precautions to maintain the privacy of Indian electors. (There is a difference between data being available in PDFs, split across different sites, sometimes behind a CAPTCHA, and a single common data dump.) If you would like access to the electoral rolls, please fill out the following form.

You will also need to get IRB approval from your university or institution. The IRB-approved proposal should include:

  1. A case for why the data are necessary
  2. An acknowledgment that the data will be kept in a secure environment
  3. A list of all the people who will have access to the data
  4. A statement that the data will only be used on projects with IRB approval
  5. A statement that the data won't be shared with anyone not identified in item 3
  6. A statement that publications and presentations will not reveal identifying individual information: only statistical summaries will be presented

Accessing the Data

The data are available on Harvard Dataverse and via Google Cloud Storage (Coldline). The GCS buckets are set up as requester pays, so you need to create a Google Cloud project that will be used for billing.

To access data from GCS, pass the name of your billing project to gsutil with the -u flag. For example, to list the archives in the bucket:

gsutil -u projectname_for_billing ls gs://in-electoral-rolls/
gs://in-electoral-rolls/andaman.tar.gz
gs://in-electoral-rolls/andhra_pdfs.tar.gz
gs://in-electoral-rolls/arunachal.tar.gz
gs://in-electoral-rolls/assam.tar.gz
gs://in-electoral-rolls/bihar.tar.gz
gs://in-electoral-rolls/chandigarh_pdfs.tar.gz
gs://in-electoral-rolls/dadra_pdfs.tar.gz
gs://in-electoral-rolls/daman_2015.tar.gz
gs://in-electoral-rolls/daman_2016.tar.gz
...
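
Once you know an archive's name from the listing, downloading and unpacking it could look roughly like the sketch below. It assumes that gsutil is installed and authenticated and that you have been granted access to the bucket. The function names `archive_url` and `fetch_archive` are hypothetical helpers written for illustration and are not part of this repository.

```shell
# Illustrative helpers only: archive_url and fetch_archive are hypothetical
# names, not part of this repository.
BUCKET="gs://in-electoral-rolls"

# Build the object URL for an archive name taken from the bucket listing,
# e.g. "andaman" -> gs://in-electoral-rolls/andaman.tar.gz
archive_url() {
  printf '%s/%s.tar.gz\n' "$BUCKET" "$1"
}

# Download one archive, billing the given project (requester pays),
# then unpack it into the destination directory.
# Usage: fetch_archive <billing_project> <archive_name> [dest_dir]
fetch_archive() {
  fa_project="$1"
  fa_name="$2"
  fa_dest="${3:-.}"
  mkdir -p "$fa_dest" &&
    gsutil -u "$fa_project" cp "$(archive_url "$fa_name")" "$fa_dest/" &&
    tar -xzf "$fa_dest/$fa_name.tar.gz" -C "$fa_dest"
}
```

For example, `fetch_archive projectname_for_billing andaman ./rolls` would copy andaman.tar.gz into ./rolls and extract it there. Archive names are not uniform (compare andhra_pdfs and daman_2015), so take the name from the listing rather than guessing it.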

If you would like access to the CSVs produced by parsing the electoral roll PDFs, see https://github.com/in-rolls/parse_elex_rolls. The data are posted on the Harvard Dataverse at http://dx.doi.org/10.7910/DVN/MUEGDT.

Suggested Citation

Gaurav Sood and Atul Dhingra. 2018. Indian Electoral Rolls PDF Corpus. https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OG47IV

Scripts and Information by State

| State | Year(s) | Language(s) |
|---|---|---|
| Andaman & Nicobar Islands | 2017 | English |
| Andhra Pradesh | 2017 | Telugu, English |
| Arunachal Pradesh | 2017 | English |
| Assam | 2018 | Bengali |
| Bihar* | 2017 | Hindi |
| Chhattisgarh (not reachable) | -- | -- |
| Chandigarh | 2018 | Hindi |
| Dadra & Nagar Haveli | 2017 | Gujarati, English |
| Daman & Diu | 2017 | Gujarati, English |
| Goa | 2018 | English |
| Gujarat | 2017 | Gujarati |
| Haryana | 2018 | Hindi |
| Himachal Pradesh | 2017 | Hindi |
| Jammu & Kashmir | 2018 | Hindi, English, Urdu |
| Jharkhand | 2018 | Hindi |
| Lakshadweep | 2017 | Malayalam |
| Karnataka | 2018 | Kannada |
| Kerala | 2018 | Malayalam, English |
| Madhya Pradesh | 2017 | Hindi |
| Maharashtra | 2018 | Marathi |
| Manipur | 2018 | Manipuri, English |
| Meghalaya | 2018 | English |
| Mizoram | 2018 | English |
| Nagaland | 2018 | English |
| NCT of Delhi | 2018 | Hindi, English |
| Odisha | 2018 | Odia |
| Punjab | 2018 | Punjabi |
| Puducherry | 2018 | Tamil, English |
| Rajasthan | 2014 | Hindi |
| Sikkim | 2018 | English |
| Tamil Nadu | 2018 | Tamil |
| Telangana | 2017 | Telugu |
| Tripura | 2018 | Bengali |
| Uttar Pradesh | 2018 | Hindi |
| Uttarakhand | 2017 | Hindi |
| West Bengal | 2018 | Bengali |

Archives and 2020

| State | Year(s) | Language(s) |
|---|---|---|
| Bihar (see acknowledgments) | 2015 | Hindi |
| Bihar | 2020 | Hindi |
| Daman | 2015--2016 | English, Gujarati |
| Karnataka | 2015--2017 | Kannada |
| Kerala | 2011--2016 | Malayalam |
| Uttarakhand | 2007--2016 | Hindi |

Acknowledgments

  • Bihar 2015 electoral rolls were contributed by Aaditya Dar. Aaditya also pointed us to the right way to set up a data access procedure in which researchers need to get IRB approval.
  • The specifics of IRB are 'inspired' by http://adfdell.pstc.brown.edu/arisreds_data/readme.txt
  • Elian Carsenat helped us craft better directions for how to access data on Google Cloud Storage.

License

The scripts are provided under the MIT license.
