Badge + docs updates (#348)
* Badge on docs updates

* default to CCDS template

* Style updates

* darken links a bit

* block quotes

* formatting and bare ccds

* Apply suggestions from code review

Co-authored-by: Chris Kucharczyk <chris@drivendata.org>

---------

Co-authored-by: Chris Kucharczyk <chris@drivendata.org>
pjbull and chrisjkuch committed Mar 16, 2024
1 parent e51553f commit 6b9eb7c
Showing 9 changed files with 267 additions and 26 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -26,7 +26,7 @@ pip install cookiecutter-data-science
To start a new project, run:

```bash
ccds https://github.com/drivendata/cookiecutter-data-science
ccds
```

[![asciicast](https://asciinema.org/a/244658.svg)](https://asciinema.org/a/244658)
13 changes: 12 additions & 1 deletion ccds/__main__.py
@@ -23,7 +23,18 @@
from cookiecutter import cli
from cookiecutter import main as api_main # noqa: F401 referenced by tests

main = cli.main

def default_ccds_main(f):
"""Set the default for the cookiecutter template argument to the CCDS template."""

def _main(*args, **kwargs):
f.params[1].default = "https://github.com/drivendata/cookiecutter-data-science"
return f(*args, **kwargs)

return _main


main = default_ccds_main(cli.main)


if __name__ == "__main__":
Binary file added docs/docs/ccds.png
76 changes: 75 additions & 1 deletion docs/docs/css/extra.css
@@ -1,3 +1,23 @@
:root {
--md-primary-fg-color: #328F97;
--md-primary-fg-color--light: #328F97;
--md-primary-fg-color--dark: #328F97;

--md-accent-fg-color: #328F97;

--md-footer-bg-color: white;
--md-footer-fg-color: #222;
--md-footer-fg-color--light: #222;
--md-footer-fg-color--lighter: #222;
}

.md-typeset {
-webkit-print-color-adjust: exact;
color-adjust: exact;
font-size: 0.85rem;
line-height: 1.4;
}

.md-typeset h1 {
font-weight: 800;
color: #222;
@@ -10,6 +30,47 @@
color: #222;
}

.md-typeset a {
color: #297c82;
word-break: break-word;
}

.md-typeset code {
font-size: .8em;
background-color: #f5f5f5;
color: #193d3d;
}

.md-typeset .admonition.info,
.md-typeset details.info {
border-color: #328F97;
}
.md-typeset .info > .admonition-title,
.md-typeset .info > summary {
background-color: #328F9726;
}

.md-typeset .info > .admonition-title::before,
.md-typeset .info > summary::before {
background-color: #328F97;
}

.md-header__title {
font-family: "Space Mono";
font-weight: 400;
font-style: normal;
font-size: 0.9rem;
}

.md-typeset > p, .md-typeset > ul, .md-typeset > ol, .md-typeset > blockquote, .md-typeset > div.admonition {
max-width: 35rem;
}

.md-typeset blockquote {
font-size: 1.0rem;
font-weight: 300;
}

#termynal {
/* 40 lines of 2ex */
height: 80ex !important;
@@ -31,4 +92,17 @@
.inline-input,
.default-text {
display: inline-block !important;
}
}

.md-logo img {
height: 3rem !important;
}

.md-header, .md-footer, .md-footer-meta {
color: #222;
background-color: white;
}

.md-nav__link--active {
font-weight: 600;
}
44 changes: 24 additions & 20 deletions docs/docs/index.md
@@ -2,21 +2,21 @@

_A logical, flexible, and reasonably standardized project structure for doing and sharing data science work._

[![tests](https://github.com/drivendata/cookiecutter-data-science/workflows/tests/badge.svg?branch=v2)](https://github.com/drivendata/cookiecutter-data-science/actions/workflows/tests.yml?query=branch%3Av2)
<a target="_blank" href="https://cookiecutter-data-science.drivendata.org/">
<img src="https://img.shields.io/badge/CCDS-Project%20template-328F97?logo=cookiecutter" />
</a>

## Quickstart

!!! info "Changes in v2"

Cookiecutter Data Science v2 now requires installing the new `cookiecutter-data-science` Python package, which extends the functionality of the [`cookiecutter`](https://cookiecutter.readthedocs.io/en/stable/README.html) templating utility. Use the provided `ccds` command-line program instead of `cookiecutter`.
Cookiecutter Data Science v2 requires Python 3.7+. Since this is a cross-project utility application, we recommend installing it with [pipx](https://pypa.github.io/pipx/). Installation command options:

=== "With pipx (recommended)"

```bash
pipx install cookiecutter-data-science

# From the parent directory where you want your project
ccds https://github.com/drivendata/cookiecutter-data-science
ccds
```

=== "With pip"
@@ -25,7 +25,7 @@ _A logical, flexible, and reasonably standardized project structure for doing an
pip install cookiecutter-data-science
# From the parent directory where you want your project
ccds https://github.com/drivendata/cookiecutter-data-science
ccds
```

=== "With conda (coming soon!)"
@@ -34,7 +34,7 @@ _A logical, flexible, and reasonably standardized project structure for doing an
# conda install cookiecutter-data-science -c conda-forge

# From the parent directory where you want your project
# ccds https://github.com/drivendata/cookiecutter-data-science
# ccds
```

=== "Use the v1 template"
@@ -46,33 +46,37 @@ _A logical, flexible, and reasonably standardized project structure for doing an
cookiecutter https://github.com/drivendata/cookiecutter-data-science -c v1
```

## Installation

Cookiecutter Data Science v2 requires Python 3.7+. Since this is a cross-project utility application, we recommend installing it with [pipx](https://pypa.github.io/pipx/). Installation command options:
!!! info "Changes in v2"

```bash
# With pipx from PyPI (recommended)
pipx install cookiecutter-data-science
Cookiecutter Data Science v2 now requires installing the new `cookiecutter-data-science` Python package, which extends the functionality of the [`cookiecutter`](https://cookiecutter.readthedocs.io/en/stable/README.html) templating utility. Use the provided `ccds` command-line program instead of `cookiecutter`.

# With pip from PyPI
pip install cookiecutter-data-science

# With conda from conda-forge (coming soon)
# conda install cookiecutter-data-science -c conda-forge
```

## Starting a new project

Starting a new project is as easy as running this command at the command line. No need to create a directory first; the cookiecutter will do it for you.

```bash
ccds https://github.com/drivendata/cookiecutter-data-science
ccds
```

The `ccds` command-line tool defaults to the Cookiecutter Data Science template, but you can pass your own template as the first argument if you want.
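
For example, to point `ccds` at a different template (the URL below is a placeholder for your own repository, not a real one):

```bash
# Hypothetical: use your own cookiecutter template instead of the CCDS default
ccds https://github.com/your-org/your-custom-template
```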


## Example

<!-- TERMYNAL OUTPUT -->


Now that you've got your project, you're ready to go! You should do the following:

- **Check out the directory structure** below so you know what's in the project and how to use it.
- **Read the [opinions](opinions.md)** that are baked into the project so you understand best practices and the philosophy behind the project structure.
- **Read the [using the template](using-the-template.md) guide** to understand how to get started on a project that uses the template.


Enjoy!


## Directory structure

The directory structure of your new project will look something like this (depending on the settings that you choose):
136 changes: 136 additions & 0 deletions docs/docs/using-the-template.md
@@ -0,0 +1,136 @@
# Using the template

You've [created](index.md#starting-a-new-project) your project. You've [read the opinions section](opinions.md). You're ready to start doing some work.

Here's a quick guide to the kinds of things we do once our project is ready to go. We'll walk through this example using Git and GitHub for version control and Jupyter notebooks for exploration, but you can use whatever tools you like.

## Set up version control

Often, we start by initializing a `git` repository to track the code we write in version control and collaborate with teammates. At the command line, the following commands turn the folder into a git repository, add all of the files and folders created by CCDS to source control (except for what is in the `.gitignore` file), and make an initial commit to the repository.

```bash
# From inside your newly created project directory
git init
git add .
git commit -m "CCDS defaults"
```

We usually commit the entire default CCDS structure so that the changes we make to it are easy to track in version history.

Now that the default layout is committed, you should push it to a shared repository. You can do this through the interface of whatever source control platform you use. This may be GitHub, GitLab, Bitbucket, or something else.

If you use GitHub and have the [gh CLI tool](https://cli.github.com/), you can easily create a new repository for the project from the command line.

```bash
gh repo create
```

You'll be asked a series of questions to set up the repository on GitHub. Once you're done, you'll be able to push the changes in your local repository to GitHub.
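
For example, assuming your remote is configured and your default branch is named `main` (both are assumptions about your setup):

```bash
# Publish the local repository to the new GitHub remote
git push -u origin main
```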

## Create a Python virtual environment

We often use Python for our data science projects, and we use a virtual environment to manage the packages each project depends on. This keeps one project's packages separate from another's, which is especially important when you are working on multiple projects at the same time.

Cookiecutter Data Science supports [a few options](opinions.md#build-from-the-environment-up) for Python virtual environment management, but no matter which you choose, you can create an environment with the following command:

```bash
make create_environment
```

Once the environment is created, you'll want to make sure to activate it. You'll have to do this following the instructions for your specific environment manager. We recommend using a shell prompt that shows you which environment you are in, so you can easily tell if you are in the right environment, for example [starship](https://starship.rs/). You can also use the command `which python` to make sure that your shell is pointing to the version of Python associated with your virtual environment.
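
As a rough illustration, activation might look like one of the following, depending on your environment manager (the environment name and path here are assumptions, not template defaults):

```bash
# virtualenv-style managers: source the activate script in the project directory
source .venv/bin/activate

# conda: activate the environment created by `make create_environment`
conda activate my_project

# verify that the shell now resolves Python from the virtual environment
which python
```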

Once you are sure that your environment is activated in your shell, you can install the packages you need for your project. You can do this with the following command:

```bash
make requirements
```

## Add your data

There's no universal advice for how to manage your data, but here are some recommendations for starting points depending on where the data comes from:

- **Flat files (e.g., CSVs or spreadsheets) that are static** - Put these files into your `data/raw` folder and then run `make sync_data_up` to push the raw data to your cloud provider.
- **Flat files that change and are extracted from somewhere** - Add a Python script to your source module in `data/make_dataset.py` that downloads the data and puts it in the `data/raw` folder. Then you can use this to get the latest data and push it up to your cloud host as it changes (being careful not to [overwrite your raw data](opinions.md#data-analysis-is-a-directed-acyclic-graph)).
- **Databases you connect to with credentials** - Store your credentials in `.env`. We recommend adding a `db.py` file or similar to your `data` module that connects to the database and pulls data; a minimal sketch follows this list. If your queries generally fit into memory, you can just have functions in `db.py` that load the data you use in analysis. If not, you'll want to add a script like the above to download the data to the `data/raw` folder.
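
A minimal sketch of what such a `db.py` might look like, assuming a SQLAlchemy-compatible database and a `DATABASE_URL` entry in `.env` (the variable name and the `load_query` helper are our own illustration, not part of the template):

```python
# db.py: loose sketch; assumes sqlalchemy, pandas, and python-dotenv are installed
import os

import pandas as pd
import sqlalchemy
from dotenv import load_dotenv

load_dotenv()  # read DATABASE_URL (and any other secrets) from the project's .env file

engine = sqlalchemy.create_engine(os.environ["DATABASE_URL"])


def load_query(query: str) -> pd.DataFrame:
    """Run a SQL query against the project database and return the result as a DataFrame."""
    return pd.read_sql(query, engine)
```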

## Check out a branch

We'll talk about code review later, but it's a good practice to use feature branches and pull requests to keep your development organized. Now that you have source control configured, you can check out a branch to work with:

```bash
git checkout -b initial-exploration
```

## Open a notebook

!!! note

The following assumes you're using a Jupyter notebook; the specific commands for another notebook tool may look a little different, but the process guidance still applies.

Now you're ready to do some analysis! Make sure that your project-specific environment is activated (you can check with `which jupyter`) and run `jupyter notebook notebooks` to open a Jupyter notebook in the `notebooks/` folder. You can start by creating a new notebook and doing some exploratory data analysis. We often name notebooks with a scheme that looks like this:

```
0.01-pjb-data-source-1.ipynb
```

- `0.01` - Helps keep work in chronological order. The structure is `PHASE.NOTEBOOK`. `NOTEBOOK` is just the Nth notebook in that phase to be created. For phases of the project, we generally use a scheme like the following, but you are welcome to design your own conventions:
- `0` - Data exploration - often just for exploratory work
- `1` - Data cleaning and feature creation - often writes data to `data/processed` or `data/interim`
- `2` - Visualizations - often writes publication-ready viz to `reports`
- `3` - Modeling - training machine learning models
- `4` - Publication - Notebooks that get turned directly into reports
- `pjb` - Your initials; this is helpful for knowing who created the notebook and helps prevent collisions when people work in the same notebook.
- `data-source-1` - A description of what the notebook covers

Now that you have your notebook going, start your analysis!

## Refactoring code into shared modules

As your project goes on, you'll want to refactor your code in a way that makes it easy to share between notebooks and scripts. We recommend creating a module in the `{{ cookiecutter.module_name }}` folder that contains the code you use in your project. This is a good way to make sure that you can use the same code in multiple places without having to copy and paste it.

Because the default structure is a Python package that is installed by default, you can do the following to make that code available within a Jupyter notebook.

First, we recommend turning on the `autoreload` extension. This makes Jupyter always go back to the source code for the module rather than caching it in memory. If your notebook isn't reflecting the latest changes to a `.py` file, try restarting the kernel and making sure `autoreload` is on. We add a cell at the top of the notebook with the following:

```
%load_ext autoreload
%autoreload 2
```

Now all your code should be importable. At the start of the CCDS project, you picked a module name. It's the same name as the folder in the root project directory. For example, if the module name were `my_project`, you could use the code by importing it like:

```python
from my_project.data import make_dataset

data = make_dataset()
```

Now it should be easy to do any refactoring you need to make your code more modular and reusable.


## Make your code reviewable

We try to review every line of code written at DrivenData. Data science code in particular has the risk of executing without erroring, but not being "correct" (for example, you use standard deviation in a calculation rather than variance). We've found the best way to catch these kinds of mistakes is a second set of eyes looking at the code.

Right now on GitHub, it is hard to observe and comment on changes that happen in Jupyter notebooks. We develop and maintain a tool called [`nbautoexport`](https://nbautoexport.drivendata.org/stable/) that automatically exports a `.py` version of your Jupyter notebook every time you save it. This means that you can commit both the `.ipynb` and the `.py` to source control so that reviewers can leave line-by-line comments on your notebook code. To use it, you will need to add `nbautoexport` to your requirements file and then run `make requirements` to install it.
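
For example (assuming a pip-style `requirements.txt`; the file name depends on your chosen dependency manager):

```bash
echo "nbautoexport" >> requirements.txt  # add the dependency
make requirements                        # install it into the environment
```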

Once `nbautoexport` is installed, you can set it up for your project with the following commands at the command line:

```bash
nbautoexport install
nbautoexport configure notebooks
```

Once you're done with your work, you'll want to add it to a commit and push it to GitHub so you can open a pull request. You can do that with the following commands:

```bash
git add . # stage all changed files to include them in the commit
git commit -m "Initial exploration" # commit the changes with a message
git push # publish the changes
```

Now you'll be able to [create a Pull Request in GitHub](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request).

## Changing the `Makefile`

There's no magic in the Makefile. We often add project-specific commands or update the existing ones over the course of a project. For example, we've added scripts to generate reports with pandoc, build and serve documentation, publish static sites from assets, package code for distribution, and more.
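
As a loose example of the kind of rule we might add (the target name and paths are hypothetical):

```makefile
## Render the final report to PDF with pandoc
report: reports/report.md
	pandoc reports/report.md -o reports/report.pdf
```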
10 changes: 8 additions & 2 deletions docs/mkdocs.yml
@@ -9,14 +9,20 @@ theme:
features:
- navigation.instant
- toc.integrate
logo: logo.svg
logo: ccds.png
name: material
custom_dir: overrides
palette:
primary: black
primary: custom
accent: custom
font:
text: Work Sans
code: Space Mono
nav:
- Home: index.md
- Why ccds?: why.md
- Opinions: opinions.md
- Using the template: using-the-template.md
- Contributing: contributing.md
- Related projects: related.md
- v1 Template: v1.md
7 changes: 7 additions & 0 deletions docs/overrides/main.html
@@ -0,0 +1,7 @@
{% extends "base.html" %}

{% block extrahead %}
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Space+Mono:ital,wght@0,400;0,700;1,400;1,700&display=swap" rel="stylesheet">
{% endblock %}
