RPGroup-PBoC/human_impacts

Welcome! This repository collects and annotates data sets pertaining to the impacts humans have on the Earth writ large and serves as the central data source for Anthroponumbers.org, the Human Impacts Database. The repository is open to the public, though please contact Griffin if you are interested in becoming a contributing member of the effort. You do not need special permissions to clone or fork this repository or to submit pull requests. As this is a living, breathing research repository, its structure and scope are subject to change at any time without warning.

Using The Repository

If you are interested in contributing to this repository, please contact Griffin. You do not require any special permissions to fork the repository and submit pull requests.

Being able to use Git and GitHub is vital for keeping track of the data, and many tutorials show how to use both efficiently. If you are not familiar with Git and GitHub, please work through one of them before contributing.

Note that in using this repository, it is imperative that you give coherent and meaningful commit messages.

Repository Structure

⚠️ An example data file has been added to flora_fauna/ if you want to learn by example how to manage this repository.

The general layout of this repository follows the ReproducibleResearch template. All primary data are stored in the data directory of the root folder. Data sources are categorized by their primary type of human and Earth-system interaction (such as land use, agriculture, water use, etc.). These categorizations are all preliminary and subject to change.

Within each folder, you will find a README.md file explaining the purpose of the folder and other information that will help you manage the repository. The structure of this repository is very much in its infancy and may change! Keep track of this README.md file to see if things are changing.

If you are confused by the layout of the repository or have suggestions for improving it, please open an issue on this repository.

This repository houses all raw data sets collected from databases, primary scientific literature, organizational and/or governmental reports, and industry data sets. The data are organized into the subdirectories listed below. Of course, some data sets will fall into multiple categories. When adding your data to the repository, choose what you think is the best match.

  • agriculture: All data sets related to food generation and consumption.
  • water: All data sets related to water consumption and usage.
  • atmosphere_biogeochemistry: All data sets related to atmospheric impacts and biogeochemical cycles.
  • land_use: All data sets related to land usage. This may include data relating to anthropomass, farmland, city sizes, etc.
  • flora_fauna: All data sets pertaining to human impacts on the biosphere.
  • energy: All data sets pertaining to energy generation, consumption, and harvesting.
  • anthropocentric: All data sets specifically pertaining to the anthroposphere, including global populations, anthropomass production, etc.
  • other: All data sets which do not fit into any of the above categories.

Within each of these subdirectories, there is a further directory for each data set. The naming of this directory is much less clear-cut and will depend on the nature of the data set. That being said, you should give the folder of your deposited data set a clear, descriptive name. For example, say you collect a data set which tabulates the global population by country for the years 1820-2020. You would deposit this data set under flora_fauna in another directory titled world_population_1820-2020. Within this folder, you would deposit the data along with a README.md file, as described in the Annotation section below.

Data Types

Data can be added to these folders in a variety of formats, but they must be text-based. This means that spreadsheets (.xlsx or .numbers) should be converted to plain-text data formats (.txt or .csv) whenever possible. Large data sets (≥ 100 MB) will need to be added using the GitHub Large File Storage system. Email Griffin (gchure@caltech.edu) for guidance on how to deal with this. Some of the data you collect may be a single number or statistic. Even if this is the case, it should be added to the data set folder as a text-based file.
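Before committing, it can help to check whether any file reaches the size at which Large File Storage is needed. The following is a minimal sketch, not part of the repository tooling: the function name find_large_files is hypothetical, and "100 MB" is assumed to mean 100 × 10^6 bytes.

```python
from pathlib import Path

# Assumption: "100 MB" is interpreted as 100 * 10**6 bytes.
LFS_THRESHOLD_BYTES = 100 * 10**6


def find_large_files(root, threshold=LFS_THRESHOLD_BYTES):
    """Return (path, size_in_bytes) for every file under root at or above threshold."""
    large = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size >= threshold:
                large.append((path, size))
    return large


if __name__ == "__main__":
    # Scan the data directory of the repository root (run from the root folder).
    for path, size in find_large_files("data"):
        print(f"{path}: {size / 10**6:.1f} MB")
```

Any file this reports is a candidate for Large File Storage rather than a regular commit.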

In general, data should be stored in a long-form, tidy format. In this format, each measurement gets its own row. Unfortunately, the raw data you get from the internet will rarely be in this convenient format, meaning that you will have to reformat your data to be "tidy".

For example, consider the world population in our fictional data set described above. The raw data set may look something like the following:

country 1820 1821 1822 ...
Afghanistan 3290000 3300000 3310000 ...
Albania 437000 439000 441000 ...
... ... ... ... ...

This is not in tidy format. Rather than each year having its own column, each combination of country and year should have its own row. Transforming the data into tidy format would make it look like the following:

country year population
Afghanistan 1820 3290000
Afghanistan 1821 3300000
Afghanistan 1822 3310000
... ... ...
Albania 1820 437000
Albania 1821 439000
Albania 1822 441000
... ... ...

To ensure that things are properly transformed, you should include the raw data and the tidy data in the repository. The original data for our hypothetical example would be saved as

world_population_1820-2020_raw.csv

with the tidy format being renamed to

world_population_1820-2020_tidy.csv

For large data sets, doing this transformation by hand is impractical, so some computational cleaning of the data may be required.
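As an illustration of such computational cleaning, here is a minimal sketch that reshapes the wide population table from the example into tidy form using only the Python standard library. The function name wide_to_tidy and its parameters are hypothetical, and the input is assumed to be comma-separated with a single identifier column.

```python
import csv
import io


def wide_to_tidy(raw_file, tidy_file, id_column="country",
                 variable_name="year", value_name="population"):
    """Write one row per (id, variable) pair from a wide table."""
    reader = csv.DictReader(raw_file)
    # Every column other than the identifier holds one measurement per row.
    value_columns = [c for c in reader.fieldnames if c != id_column]
    writer = csv.writer(tidy_file, lineterminator="\n")
    writer.writerow([id_column, variable_name, value_name])
    for row in reader:
        for column in value_columns:
            writer.writerow([row[id_column], column, row[column]])


# A small in-memory stand-in for world_population_1820-2020_raw.csv.
raw = io.StringIO(
    "country,1820,1821,1822\n"
    "Afghanistan,3290000,3300000,3310000\n"
    "Albania,437000,439000,441000\n"
)
tidy = io.StringIO()
wide_to_tidy(raw, tidy)
print(tidy.getvalue())
```

With real files, the same function can be given handles opened with open(path, newline=""), reading the _raw.csv file and writing the _tidy.csv file. If pandas is available, its melt function performs the same reshaping.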

Annotation

Every data set you add to this repository MUST be accompanied by a README.md file describing several features of the data set. This is not optional, and the file MUST follow the structure outlined in README_TEMPLATE.md.

The data set readme file has several fields which you will need to populate when you add your data set. This will make curating the data sets much more manageable and human readable. Please see the README_TEMPLATE.md file for more information about what to include.
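One way to catch a missing annotation before opening a pull request is to scan for data set folders that lack a README.md. This is a rough sketch under an assumed layout of data/<category>/<data set>/; the function name datasets_missing_readme is hypothetical, and it checks only that the file exists, not that it follows the template.

```python
from pathlib import Path


def datasets_missing_readme(data_root):
    """Return data set directories (data_root/<category>/<data set>) lacking README.md."""
    missing = []
    for category in sorted(Path(data_root).iterdir()):
        if not category.is_dir():
            continue
        for dataset in sorted(category.iterdir()):
            if dataset.is_dir() and not (dataset / "README.md").is_file():
                missing.append(dataset)
    return missing
```

Calling datasets_missing_readme("data") from the root folder returns an empty list when every data set folder is annotated.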

Software

Any code you write to analyze a data set must be housed within a code subdirectory in your deposited data folder. Consider our example of a hypothetical data set containing the global population by country for the years 1820-2020. Say that you wanted to generate a plot of the population of Brazil over these years using Python. The script you use to generate the plot (brazil_population.py) and the generated plot itself (brazil_population.pdf) would be housed in the associated folders as follows.

flora_fauna/
|
|--> world_population_1820-2020/
     |
     |--> world_population_1820-2020_raw.csv
     |--> world_population_1820-2020_tidy.csv
     |--> code/
     |     |
     |     |--> brazil_population.py
     |--> media/
           |
           |--> brazil_population.pdf

Please see the data set already present in this directory for a complete example.
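If it helps when starting a new data set, the folder skeleton described above can be created programmatically. This is only a sketch: scaffold_dataset is a hypothetical helper, not part of the repository, and the empty README.md it creates is a placeholder that still has to be filled in from README_TEMPLATE.md.

```python
from pathlib import Path


def scaffold_dataset(category_dir, dataset_name):
    """Create the data set folder with its code/ and media/ subfolders."""
    dataset_dir = Path(category_dir) / dataset_name
    for sub in ("code", "media"):
        (dataset_dir / sub).mkdir(parents=True, exist_ok=True)
    # Placeholder only: the README must be completed by hand from the template.
    readme = dataset_dir / "README.md"
    if not readme.exists():
        readme.touch()
    return dataset_dir
```

For the running example, the call would be scaffold_dataset("data/flora_fauna", "world_population_1820-2020"), after which the raw and tidy CSV files are added by hand.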

anthro software module

This project comes with a custom Python software package titled anthro which will be used repeatedly to clean, annotate, collate, and present the data contained in this repository.

To run many of the scripts involved in cleaning and presentation of data within this repository, you will need to have the anthro package locally installed. Assuming you have cloned this repository, you can install the package using the following command,

pip install -e .

assuming you are in the root directory.
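To confirm that the installation worked, you can check whether Python can locate the package. The snippet below is a small sketch using only the standard library; is_installed is a hypothetical helper and makes no assumptions about the contents of the anthro module.

```python
import importlib.util


def is_installed(package_name="anthro"):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    print("anthro importable:", is_installed())
```

If this reports False, re-run the install command from the root directory of the cloned repository.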

License

All data sources carry with them the licenses they had when they were released. Please see the README.md files within each dataset directory for information regarding the original data licensing and preferred citations.

All software within this repository that originates from the class should be licensed under the standard MIT license as is given below.

All creative work generated specifically for this work (e.g. writing and graphics) is similarly licensed under a Creative Commons CC-BY 4.0 permissive license. All data, creative works, and software not originating from this course carry the original license and copyright as is present in the source material.

Copyright 2020 The Authors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.