docs/dataset_addition_guide.html


<!DOCTYPE html>

<html>
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>Guide to adding datasets to matminer &#8212; matminer 0.7.8 documentation</title>
    <link rel="stylesheet" href="_static/nature.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    <script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
    <script src="_static/jquery.js"></script>
    <script src="_static/underscore.js"></script>
    <script src="_static/doctools.js"></script>
    <script src="_static/language_data.js"></script>
    <link rel="index" title="Index" href="genindex.html" />
    <link rel="search" title="Search" href="search.html" />
 
<link href='https://fonts.googleapis.com/css?family=Lato:400,700' rel='stylesheet' type='text/css'>

  </head><body>
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="py-modindex.html" title="Python Module Index"
             >modules</a> |</li>
        <li class="nav-item nav-item-0"><a href="index.html">matminer 0.7.8 documentation</a> &#187;</li>
        <li class="nav-item nav-item-this"><a href="">Guide to adding datasets to matminer</a></li> 
      </ul>
    </div>  

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body" role="main">
            
  <div class="section" id="guide-to-adding-datasets-to-matminer">
<h1>Guide to adding datasets to matminer<a class="headerlink" href="#guide-to-adding-datasets-to-matminer" title="Permalink to this headline">¶</a></h1>
<p><em>All information current as of 10/24/2018</em></p>
<p>In addition to providing tools for retrieving current data from several standard
materials science databases, matminer also provides a suite of static datasets
pre-formatted as pandas DataFrame objects and stored as compressed JSON files.
These files are stored on Figshare, a long term academic data storage platform,
and include metadata making each dataset discoverable and clearly linked to the
research and contributors that generated said data.</p>
<p>This functionality serves two purposes. First as a way for others to quickly
and easily access data used in research at the Hacking Materials group and
second to provide the community with a set of standard datasets that can be
used for benchmarking purposes. As the application of machine learning in
materials science matures, there is a growing need for a group of benchmark
datasets that researchers can use as standard training and testing sets for
comparing model performance against.</p>
<p>To add a dataset to the collection currently supported by matminer there are
six primary steps:</p>
<div class="section" id="fork-the-matminer-repository-on-github">
<h2>1. Fork the matminer repository on GitHub<a class="headerlink" href="#fork-the-matminer-repository-on-github" title="Permalink to this headline">¶</a></h2>
<ul class="simple">
<li><p>The matminer code base, including the metadata that defines how matminer handles datasets,
is available <a class="reference external" href="https://github.com/hackingmaterials/matminer">on GitHub.</a></p></li>
<li><p>All editing should take place either within the dev_scripts/dataset_management
folder or within matminer/datasets</p></li>
</ul>
</div>
<div class="section" id="prepare-the-dataset-for-long-term-hosting">
<h2>2. Prepare the dataset for long term hosting<a class="headerlink" href="#prepare-the-dataset-for-long-term-hosting" title="Permalink to this headline">¶</a></h2>
<p>To work properly with matminer’s loading functions it is assumed that
all datasets are pandas DataFrame objects stored as JSON files using the
MontyEncoder encoding scheme available in the monty Python package. Any
datasets added to matminer should ensure this requirement.</p>
<p>The script <code class="docutils literal notranslate"><span class="pre">prep_dataset_for_figshare.py</span></code> was written to expedite and
standardize this process. If the dataset being uploaded needs no
modification from the contents stored in the file, one can simply run this
script to convert your dataset to the desired format like so:
<code class="docutils literal notranslate"><span class="pre">python</span> <span class="pre">prep_dataset_for_figshare.py</span> <span class="pre">-fp</span> <span class="pre">/path/to/dataset(s)</span>
<span class="pre">-ct</span> <span class="pre">(compression_type:</span> <span class="pre">gz</span> <span class="pre">or</span> <span class="pre">bz2)</span></code> This script can take multiple file
paths and/or directory names. If given a directory name it will crawl
the directory and try to process all files within. The prepped files will
then be available in <code class="docutils literal notranslate"><span class="pre">~/dataset_to_json/</span></code> along with a .txt file containing
some metadata on the files which will be used later on.</p>
<p>If modification does need to be made to the dataset or if you would
like the dataset name to be separate from that of the file being converted,
users will need to make a small modification to this script prior to
running it on their selected datasets.</p>
<div class="section" id="to-update-the-script-to-preprocess-your-dataset">
<h3><strong>To update the script to preprocess your dataset:</strong><a class="headerlink" href="#to-update-the-script-to-preprocess-your-dataset" title="Permalink to this headline">¶</a></h3>
<ul>
<li><p>Write a preprocessor for the dataset in <code class="docutils literal notranslate"><span class="pre">prep_dataset_for_figshare.py</span></code></p>
<p>The preprocessor should take the dataset path, do any necessary
preprocessing to turn it into a usable dataframe, and return a tuple
of the form <code class="docutils literal notranslate"><span class="pre">(string_of_dataset_name,</span> <span class="pre">dataframe)</span></code>. If the preprocessor
produces more than one dataset it should return a tuple of two lists of the form
<code class="docutils literal notranslate"><span class="pre">([dataset_name_1,</span> <span class="pre">dataset_name_2,</span> <span class="pre">…],</span> <span class="pre">[df_1,</span> <span class="pre">df_2,</span> <span class="pre">…])</span></code></p>
<p>For example:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">_preprocess_heusler_magnetic</span><span class="p">(</span><span class="n">file_path</span><span class="p">):</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">_read_dataframe_from_file</span><span class="p">(</span><span class="n">file_path</span><span class="p">)</span>

    <span class="n">dropcols</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;gap width&#39;</span><span class="p">,</span> <span class="s1">&#39;stability&#39;</span><span class="p">]</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">dropcols</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">return</span> <span class="s2">&quot;heusler_magnetic&quot;</span><span class="p">,</span> <span class="n">df</span>
</pre></div>
</div>
<p>Here <code class="docutils literal notranslate"><span class="pre">_read_dataframe_from_file()</span></code> is a simple utility function that
determines what pandas loading function to use based on the file type
of the path passed to it. Keyword arguments passed to this function are
passed on to the underlying pandas loading functions.</p>
<p>An example for preprocessors that return multiple datasets:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">_preprocess_double_perovskites_gap</span><span class="p">(</span><span class="n">file_path</span><span class="p">):</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">file_path</span><span class="p">,</span> <span class="n">sheet_name</span><span class="o">=</span><span class="s1">&#39;bandgap&#39;</span><span class="p">)</span>

    <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">&#39;A1_atom&#39;</span><span class="p">:</span> <span class="s1">&#39;a_1&#39;</span><span class="p">,</span> <span class="s1">&#39;B1_atom&#39;</span><span class="p">:</span> <span class="s1">&#39;b_1&#39;</span><span class="p">,</span>
    <span class="s1">&#39;A2_atom&#39;</span><span class="p">:</span> <span class="s1">&#39;a_2&#39;</span><span class="p">,</span> <span class="s1">&#39;B2_atom&#39;</span><span class="p">:</span> <span class="s1">&#39;b_2&#39;</span><span class="p">})</span>
    <span class="n">lumo</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">file_path</span><span class="p">,</span> <span class="n">sheet_name</span><span class="o">=</span><span class="s1">&#39;lumo&#39;</span><span class="p">)</span>

    <span class="k">return</span> <span class="p">[</span><span class="s2">&quot;double_perovskites_gap&quot;</span><span class="p">,</span> <span class="s2">&quot;double_perovskites_gap_lumo&quot;</span><span class="p">],</span> <span class="p">[</span><span class="n">df</span><span class="p">,</span> <span class="n">lumo</span><span class="p">]</span>
</pre></div>
</div>
</li>
<li><p>Add the preprocessor function to a dictionary which maps file names to preprocessors in <code class="docutils literal notranslate"><span class="pre">prep_dataset_for_figshare.py</span></code></p>
<p>The prep script identifies datasets by their file name, a dictionary
called <code class="docutils literal notranslate"><span class="pre">_datasets_to_preprocessing_routines</span></code> maps these dataset names
to their preprocessor and should be updated like so:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">_datasets_to_preprocessing_routines</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">&quot;elastic_tensor_2015&quot;</span><span class="p">:</span> <span class="n">_preprocess_elastic_tensor_2015</span><span class="p">,</span>
<span class="s2">&quot;piezoelectric_tensor&quot;</span><span class="p">:</span> <span class="n">_preprocess_piezoelectric_tensor</span><span class="p">,</span>
<span class="o">.</span>
<span class="o">.</span>
<span class="o">.</span>
<span class="s2">&quot;wolverton_oxides&quot;</span><span class="p">:</span> <span class="n">_preprocess_wolverton_oxides</span><span class="p">,</span>
<span class="s2">&quot;m2ax_elastic&quot;</span><span class="p">:</span> <span class="n">_preprocess_m2ax</span><span class="p">,</span>
<span class="n">YOUR_DATASET_FILE_NAME</span><span class="p">:</span> <span class="n">YOUR_PREPROCESSOR</span><span class="p">,</span>
<span class="p">}</span>
</pre></div>
</div>
</li>
</ul>
<p>Once this is done the preprocessor is ready to use.</p>
</div>
</div>
<div class="section" id="upload-the-dataset-to-long-term-hosting">
<h2>3. Upload the dataset to long term hosting<a class="headerlink" href="#upload-the-dataset-to-long-term-hosting" title="Permalink to this headline">¶</a></h2>
<p>Once the dataset file is ready, it should be hosted on Figshare
or a comparable open access academic data hosting service. The Hacking Materials
group maintains a collective Figshare account and follows the following procedure
when adding a dataset. Other contributors should follow a similar protocol.</p>
<ul class="simple">
<li><p>Add the dataset compressed json file as well as the original file as an entry in the “matminer datasets” figshare project</p></li>
<li><p>Fill out <strong>ALL</strong> metadata carefully, see existing entries for examples of expected quality of citations and descriptions.</p></li>
<li><p>If the dataset was originally generated from a source outside the group the source should be thoroughly cited within the dataset description and metadata</p></li>
</ul>
</div>
<div class="section" id="update-the-matminer-dataset-metadata-file">
<h2>4. Update the matminer dataset metadata file<a class="headerlink" href="#update-the-matminer-dataset-metadata-file" title="Permalink to this headline">¶</a></h2>
<p>Matminer stores a file called <code class="docutils literal notranslate"><span class="pre">dataset_metadata.json</span></code> which contains
information on all datasets available in the package. This file is
automatically checked by CircleCI for proper formatting and the
available datasets are regularly checked to ensure they match the
descriptors contained in this metadata. While the appropriate metadata
can be added manually, it is preferable to run the helper script
<code class="docutils literal notranslate"><span class="pre">modify_dataset_metadata.py</span></code> to do the bulk of interfacing with this file
to prevent missing data or formatting mistakes.</p>
<ul>
<li><p>Run the <code class="docutils literal notranslate"><span class="pre">modify_dataset_metadata.py</span></code> file and add the appropriate
metadata, see existing metadata as a guideline for new datasets.</p>
<p>The url attribute should be filled with a figshare download link for
the individual file on figshare. Other items will be dataset specific
or included in the .txt file produced in step 2.</p>
</li>
<li><p>Replace the metadata file in matminer/datasets with the newly generated file (should be done automatically)</p></li>
<li><p>Look over the modified <code class="docutils literal notranslate"><span class="pre">dataset_metadata.json</span></code> file and fix mistakes if necessary.</p></li>
</ul>
</div>
<div class="section" id="update-the-dataset-tests-and-loading-code">
<h2>5. Update the dataset tests and loading code<a class="headerlink" href="#update-the-dataset-tests-and-loading-code" title="Permalink to this headline">¶</a></h2>
<p>Dataset testing uses unit tests to ensure dataset metadata and dataset content
is formatted properly and available. When adding new datasets these tests
need updated. In addition matminer provides a set of convenience functions that
explicitly load a single dataset as opposed to the keyword based generic loader.
These convenience functions provide additional post processing options for
filtering or modifying data in the dataset after it has been loaded. A
convenience function should be added alongside dataset tests.</p>
<ul>
<li><p>Update dataset names saved in <cite>matminer/datasets/tests/base.py</cite></p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">DatasetTest</span><span class="p">(</span><span class="n">unittest</span><span class="o">.</span><span class="n">TestCase</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">setUp</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">dataset_names</span> <span class="o">=</span> <span class="p">[</span>
        <span class="s1">&#39;flla&#39;</span><span class="p">,</span>
        <span class="s1">&#39;elastic_tensor_2015&#39;</span><span class="p">,</span>
        <span class="s1">&#39;piezoelectric_tensor&#39;</span><span class="p">,</span>
        <span class="o">.</span>
        <span class="o">.</span>
        <span class="o">.</span>
        <span class="n">YOUR</span> <span class="n">DATASET</span> <span class="n">NAME</span> <span class="n">HERE</span>
        <span class="p">]</span>
</pre></div>
</div>
</li>
<li><p>Write a test for loading the dataset in test_datasets.py.</p>
<p>These tests ensure that the dataset is downloadable and that its data matches what
is described in the file metadata. See prior datasets for examples and
the .txt file from step 2 for column type info. A typical test consists of
a call to a universal test function that only needs specifiers of what
dataframe columns should be of what type and dataset specific type tests
if necessary.</p>
<p>Example:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">test_dielectric_constant</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">object_headers</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;material_id&#39;</span><span class="p">,</span> <span class="s1">&#39;formula&#39;</span><span class="p">,</span> <span class="s1">&#39;structure&#39;</span><span class="p">,</span>
                      <span class="s1">&#39;e_electronic&#39;</span><span class="p">,</span> <span class="s1">&#39;e_total&#39;</span><span class="p">,</span> <span class="s1">&#39;cif&#39;</span><span class="p">,</span> <span class="s1">&#39;meta&#39;</span><span class="p">,</span>
                      <span class="s1">&#39;poscar&#39;</span><span class="p">]</span>

    <span class="n">numeric_headers</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;nsites&#39;</span><span class="p">,</span> <span class="s1">&#39;space_group&#39;</span><span class="p">,</span> <span class="s1">&#39;volume&#39;</span><span class="p">,</span> <span class="s1">&#39;band_gap&#39;</span><span class="p">,</span>
                       <span class="s1">&#39;n&#39;</span><span class="p">,</span> <span class="s1">&#39;poly_electronic&#39;</span><span class="p">,</span> <span class="s1">&#39;poly_total&#39;</span><span class="p">]</span>

    <span class="n">bool_headers</span> <span class="o">=</span> <span class="p">[</span><span class="s1">&#39;pot_ferroelectric&#39;</span><span class="p">]</span>

    <span class="c1"># Unique Tests</span>
    <span class="k">def</span> <span class="nf">_unique_tests</span><span class="p">(</span><span class="n">df</span><span class="p">):</span>
     <span class="bp">self</span><span class="o">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="nb">type</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">&#39;structure&#39;</span><span class="p">][</span><span class="mi">0</span><span class="p">]),</span> <span class="n">Structure</span><span class="p">)</span>

    <span class="c1"># Universal Tests</span>
    <span class="bp">self</span><span class="o">.</span><span class="n">universal_dataset_check</span><span class="p">(</span>
        <span class="s2">&quot;dielectric_constant&quot;</span><span class="p">,</span> <span class="n">object_headers</span><span class="p">,</span> <span class="n">numeric_headers</span><span class="p">,</span>
        <span class="n">bool_headers</span><span class="o">=</span><span class="n">bool_headers</span><span class="p">,</span> <span class="n">test_func</span><span class="o">=</span><span class="n">_unique_tests</span>
    <span class="p">)</span>
</pre></div>
</div>
</li>
<li><p>Write a convenience function for the dataset in <cite>convenience_loaders.py</cite></p>
<p>This can be as simple as just returning the results of load_dataset
or provide the user with extra options to return only subsets
of the dataset with certain properties.</p>
<p>Example:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">load_elastic_tensor</span><span class="p">(</span><span class="n">version</span><span class="o">=</span><span class="s2">&quot;2015&quot;</span><span class="p">,</span> <span class="n">include_metadata</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">data_home</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
                        <span class="n">download_if_missing</span><span class="o">=</span><span class="kc">True</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;</span>
<span class="sd">    Convenience function for loading the elastic_tensor dataset.</span>

<span class="sd">    Args:</span>
<span class="sd">    version (str): Version of the elastic_tensor dataset to load</span>
<span class="sd">    (defaults to 2015)</span>

<span class="sd">    include_metadata (bool): Whether or not to include the cif, meta,</span>
<span class="sd">    and poscar dataset columns. False by default.</span>

<span class="sd">    data_home (str, None): Where to loom for and store the loaded dataset</span>

<span class="sd">    download_if_missing (bool): Whether or not to download the dataset if</span>
<span class="sd">    it isn&#39;t on disk</span>

<span class="sd">    Returns: (pd.DataFrame)</span>
<span class="sd">    &quot;&quot;&quot;</span>
    <span class="n">df</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="s2">&quot;elastic_tensor&quot;</span> <span class="o">+</span> <span class="s2">&quot;_&quot;</span> <span class="o">+</span> <span class="n">version</span><span class="p">,</span> <span class="n">data_home</span><span class="p">,</span>
                      <span class="n">download_if_missing</span><span class="p">)</span>

    <span class="k">if</span> <span class="ow">not</span> <span class="n">include_metadata</span><span class="p">:</span>
        <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s1">&#39;cif&#39;</span><span class="p">,</span> <span class="s1">&#39;kpoint_density&#39;</span><span class="p">,</span> <span class="s1">&#39;poscar&#39;</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">df</span>
</pre></div>
</div>
</li>
<li><p>Write a test for the added convenience function</p>
<p>These tests can be simple and will depend on the options provided in the
convenience function. See existing tests for examples.</p>
</li>
</ul>
</div>
<div class="section" id="make-a-pull-request-to-the-matminer-github-repository">
<h2>6. Make a Pull Request to the matminer GitHub repository<a class="headerlink" href="#make-a-pull-request-to-the-matminer-github-repository" title="Permalink to this headline">¶</a></h2>
<ul class="simple">
<li><p>Make a commit describing the added dataset</p></li>
<li><p>Make a pull request from your fork to the primary repository</p></li>
</ul>
</div>
</div>


            <div class="clearer"></div>
          </div>
        </div>
      </div>
      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
        <div class="sphinxsidebarwrapper">
  <h3><a href="index.html">Table of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">Guide to adding datasets to matminer</a><ul>
<li><a class="reference internal" href="#fork-the-matminer-repository-on-github">1. Fork the matminer repository on GitHub</a></li>
<li><a class="reference internal" href="#prepare-the-dataset-for-long-term-hosting">2. Prepare the dataset for long term hosting</a><ul>
<li><a class="reference internal" href="#to-update-the-script-to-preprocess-your-dataset"><strong>To update the script to preprocess your dataset:</strong></a></li>
</ul>
</li>
<li><a class="reference internal" href="#upload-the-dataset-to-long-term-hosting">3. Upload the dataset to long term hosting</a></li>
<li><a class="reference internal" href="#update-the-matminer-dataset-metadata-file">4. Update the matminer dataset metadata file</a></li>
<li><a class="reference internal" href="#update-the-dataset-tests-and-loading-code">5. Update the dataset tests and loading code</a></li>
<li><a class="reference internal" href="#make-a-pull-request-to-the-matminer-github-repository">6. Make a Pull Request to the matminer GitHub repository</a></li>
</ul>
</li>
</ul>

  <div role="note" aria-label="source link">
    <h3>This Page</h3>
    <ul class="this-page-menu">
      <li><a href="_sources/dataset_addition_guide.rst.txt"
            rel="nofollow">Show Source</a></li>
    </ul>
   </div>
<div id="searchbox" style="display: none" role="search">
  <h3 id="searchlabel">Quick search</h3>
    <div class="searchformwrapper">
    <form class="search" action="search.html" method="get">
      <input type="text" name="q" aria-labelledby="searchlabel" />
      <input type="submit" value="Go" />
    </form>
    </div>
</div>
<script>$('#searchbox').show(0);</script>
        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             >index</a></li>
        <li class="right" >
          <a href="py-modindex.html" title="Python Module Index"
             >modules</a> |</li>
        <li class="nav-item nav-item-0"><a href="index.html">matminer 0.7.8 documentation</a> &#187;</li>
        <li class="nav-item nav-item-this"><a href="">Guide to adding datasets to matminer</a></li> 
      </ul>
    </div>

    <div class="footer" role="contentinfo">
        &#169; Copyright 2015, Anubhav Jain.
      Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 3.2.1.
    </div>

  </body>
</html>