Docstring Best Practice

Brandon Lockhart edited this page Feb 21, 2021 · 5 revisions

Dataprep uses a few Sphinx packages to speed up docstring writing, which brings some additional best practices. They are listed below; please give them a read.

  • Automatic parameter type inference.

    Dataprep strongly enforces typing for all functions, classes and variables. The NumPy convention says you should write the parameter type after a colon (:) in the docstring. Here we don't, as long as the type is annotated correctly in the function signature. Take dataprep.eda.basic.plot as an example. Since the function signature is typed,

      def plot(
          df: Union[pd.DataFrame, dd.DataFrame],
          x: Optional[str] = None,
          y: Optional[str] = None,
          *,
          bins: int = 10,
          ngroups: int = 10,
          largest: bool = True,
          nsubgroups: int = 5,
          bandwidth: float = 1.5,
          sample_size: int = 1000,
          value_range: Optional[Tuple[float, float]] = None,
          yscale: str = "linear",
          tile_size: Optional[float] = None,
      ) -> Figure:
          ...
    1. No Type for Function Parameters

      In the docstring, you don't need to write the type of a parameter:

      Parameters
      ----------
      df
        Dataframe from which plots are to be generated
      

      We already have the type of df from the signature, and the documentation is still generated correctly:

      [Figure: generated parameter df]

    2. Give the Type for Default Values

      Alternatively, you can still write the parameter type to override the auto-generated one. A very good use case would be default values:

      Parameters
      ----------
      x: Optional[str], default None
          A valid column name from the dataframe.
      

      In the generated documentation, notice how the parameter type changes from bold to italic: this is the sign of an overridden parameter type.

    3. No Returns Unless for Comments

      We can also infer the function return type from the signature! This means no need for docstrings like this:

      Returns
      -------
      Figure
        An object of figure
      

      The exception is when you want to write a meaningful comment about the return value:

      Returns
      -------
      Figure
        A meaningful message!!!
      
  • Make class members private with a leading underscore (_).

    Remember that all members without a leading underscore will be shown in the documentation!
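
The conventions above can be sketched together in one small, self-contained example. The class ColumnSummary and its members are hypothetical and exist only for illustration; they are not part of Dataprep. The sketch leaves types out of the docstring where the signature already gives them, overrides one type to state a default value, omits the Returns section, and marks internal members with a leading underscore. The call to typing.get_type_hints only demonstrates that the signature already carries the type information; it is not a claim about how the Sphinx packages read it.

```python
from typing import Optional, get_type_hints


class ColumnSummary:
    """Hold a column name and produce short text descriptions of it."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Leading underscore: internal state, not meant for the documentation.
        self._cache: dict = {}

    def describe(self, bins: int = 10, label: Optional[str] = None) -> str:
        """
        Build a short text description of the column.

        Parameters
        ----------
        bins: int, default 10
            Number of bins to mention in the description.
        label
            Display label used instead of the column name.
        """
        # No type is written for `label` and there is no Returns section:
        # both are already stated by the signature.
        shown = label if label is not None else self.name
        return self._format(shown, bins)

    def _format(self, shown: str, bins: int) -> str:
        """Format the description (internal helper)."""
        return f"{shown} ({bins} bins)"


# The annotations can be read back from the signature at runtime.
hints = get_type_hints(ColumnSummary.describe)

# Only names without a leading underscore count as public members.
public_members = [name for name in vars(ColumnSummary) if not name.startswith("_")]
print(hints)
print(public_members)
```

Here only describe is a public member; _cache and _format stay internal because of the leading underscore.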

DataPrep.Clean Docstring Recommendations:

  1. Module Docstring: one short description of the main purpose of the file. E.g.,

    """Clean and validate a DataFrame column containing geographic coordinates."""
  1. Function Docstring

    a. Start with a high-level, one-sentence description of the function. E.g.,

    """
    Clean and standardize latitude and longitude coordinates.
    

    b. Optionally, further relevant information can be given in paragraphs under the first sentence.

    c. If there exists an associated User Guide, the last sentence before the parameter descriptions should reference it. E.g.,

    Read more in the :ref:`User Guide <clean_lat_long_user_guide>`.
    
    Parameters
    ----------
    
  2. Parameter Descriptions

    a. If a parameter defines a format, an example should be given. E.g.

    output_format
        The desired format of the coordinates.
            - 'dd': decimal degrees (51.4934, 0.0098)
            - 'ddh': decimal degrees with hemisphere ('51.4934° N, 0.0098° E')
            - 'dm': degrees minutes ('51° 29.604′ N, 0° 0.588′ E')
            - 'dms': degrees minutes seconds ('51° 29′ 36.24″ N, 0° 0′ 35.28″ E')
    
        (default: 'dd')
    

    b. The default value should be specified after a blank line at the end of the parameter description. E.g.,

    report
        If True, output the summary report. Otherwise, no report is outputted.
    
        (default: True)
    

    c. If a parameter has the exact same functionality as in other functions, the description should be the same. E.g., the report parameter above.

  3. Examples: after defining the parameters, include a short example that demonstrates the function. E.g.

    Examples
    --------
    Split a column containing latitude and longitude strings into separate
    columns in decimal degrees format.

    >>> df = pd.DataFrame({'coordinates': ['51° 29′ 36.24″ N, 0° 0′ 35.28″ E', '51.4934° N, 0.0098° E']})
    >>> clean_lat_long(df, 'coordinates', split=True)
                            coordinates  latitude  longitude
    0  51° 29′ 36.24″ N, 0° 0′ 35.28″ E   51.4934     0.0098
    1             51.4934° N, 0.0098° E   51.4934     0.0098
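
As a rough sketch, the recommendations above can be combined into one complete docstring. The function clean_codes, its parameters and its formats are hypothetical and were invented for this illustration; they are not part of DataPrep.Clean. No User Guide reference is included because this made-up function has none. Checking the Examples section with the standard library's doctest module is an illustrative choice here, not something the recommendations require.

```python
"""Clean and validate a list of two-letter country codes."""

import doctest
from typing import List


def clean_codes(codes: List[str], output_format: str = 'upper', report: bool = True) -> List[str]:
    """
    Clean and standardize two-letter country codes.

    Surrounding whitespace is stripped before the codes are converted.

    Parameters
    ----------
    codes
        The country codes to clean.
    output_format
        The desired format of the codes.
            - 'upper': upper case ('US')
            - 'lower': lower case ('us')

        (default: 'upper')
    report
        If True, output the summary report. Otherwise, no report is outputted.

        (default: True)

    Examples
    --------
    Convert codes with mixed case and stray whitespace to upper case.

    >>> clean_codes([' us', 'Ca '], report=False)
    ['US', 'CA']
    """
    convert = str.upper if output_format == 'upper' else str.lower
    cleaned = [convert(code.strip()) for code in codes]
    if report:
        print(f'Cleaned {len(cleaned)} codes.')
    return cleaned


# Parse the Examples section of the docstring and run it as a doctest.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    clean_codes.__doc__, {'clean_codes': clean_codes}, 'clean_codes', None, 0
)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results)
```

The description of report is copied word for word from the example above, following the rule that a parameter with the same functionality keeps the same description.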

Notes:

  1. Each statement should begin with a capital letter and end with a period.
  2. All internal functions should begin with an underscore so they do not appear in the documentation.
  3. Please use single quotes for text (e.g., 'US' not "US") in the docstring.

Auto-generating the API Reference

To make a file appear in the API reference section of the documentation, add it in alphabetical order here.

Reference a User Guide from a Docstring

To create a link to a user guide from a docstring, follow the instructions here.

Reference a DataPrep Function from a User Guide

To link to the API reference of a function from a user guide, first set the raw NBConvert format of the cell to reST as explained in the previous section. Then use the syntax :func:`Text you want to link <full path to function>` to reference the function's API docstring. For example, :func:`clean_country() <dataprep.clean.clean_country.clean_country>`. Please link to a function where it is first introduced in the user guide.
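
The "full path to function" in the syntax above is the module path followed by the function name. As a minimal sketch, the role text can be built from a function object using its __module__ and __qualname__ attributes. The helper func_role is hypothetical and not part of DataPrep, and json.dumps stands in for a DataPrep function so that the example runs without extra dependencies.

```python
import json
from typing import Callable


def func_role(func: Callable) -> str:
    """Build the text of a :func: role that links to the API reference of func."""
    # __module__ is the module where the function is defined, which may
    # differ from the package it is re-exported from.
    path = f'{func.__module__}.{func.__qualname__}'
    return f':func:`{func.__name__}() <{path}>`'


role = func_role(json.dumps)
print(role)
```

Because __module__ names the defining module, a function that is re-exported from a package's top level still resolves to the longer path, as in the clean_country example above.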

Generate the Documentation Locally

To preview the documentation, run poetry run sphinx-build -M html docs/source docs/build in your dataprep directory. A local copy of the main page can then be accessed from docs/build/html/index.html.