iimport

A tool for making reusable code from linear ipython notebooks.

Problem

  1. When you write linear code in the notebook, it’s easy to do experiments and interactively debug your code, but you can’t reuse the code and long notebooks usually are hard to read and comprehend.
  2. In contrast, when you use a structured or object-oriented approach and shape your code into objects, functions and modules, you can reuse it, but you’ll need special tools to debug it, so prototyping and experimenting become more complicated.

Both the ease of experiment and reusability are valuable:

  • If code is shaped in functions you can for example make a luigi pipeline or dask computational graph out of it, implement results caching to shorten computation time, launch your pipeline with different inputs and so on.
  • If code is in linear notebook form you can explore data, do complex visualizations, make interactive widgets.

Solution

This module attempts to resolve the tradeoff between ease of experimentation and reusability. You write your code in the notebook in an exploratory manner and, when you feel it has matured, mark logically independent pieces of code with macro commands. Those pieces can then be imported elsewhere without abandoning the linear notebook structure, so you can keep working in the notebook as before.

Two things are implemented:

  • a very simple markup language that defines procedures and excludes unwanted (debug, visualization) code;
  • import mechanics that turn a marked-up notebook file into a set of procedures.

There are no intermediate python files, you work with the same code both in notebook interface and as imported procedures.
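
To illustrate the idea of importing straight from the notebook file, here is a minimal sketch that reads the code cells of an .ipynb file (which is plain JSON) and executes them inside a fresh module object. It is not the actual iimport implementation: the helper name load_notebook_as_module is hypothetical, markup and IPython magics are not handled, and the __source__ attribute is set here only to mirror the attribute mentioned later in this README.

```python
import json
import os
import tempfile
import types


def load_notebook_as_module(path, name):
    """Execute the code cells of an .ipynb file in a fresh module object.

    Hypothetical helper for illustration only; not part of iimport.
    """
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)

    chunks = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        src = cell.get("source", "")
        # The notebook format stores source as one string or a list of lines
        chunks.append(src if isinstance(src, str) else "".join(src))
    source = "\n".join(chunks)

    module = types.ModuleType(name)
    module.__source__ = source  # kept so the executed code can be inspected
    exec(compile(source, path, "exec"), module.__dict__)
    return module


# Usage: write a tiny notebook to a temporary directory, then load it.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Demo"]},
        {
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "execution_count": None,
            "source": ["def double(x):\n", "    return 2 * x\n"],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "execution_count": None,
            "source": "ANSWER = double(21)",
        },
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    nb_path = os.path.join(tmp, "demo.ipynb")
    with open(nb_path, "w", encoding="utf-8") as f:
        json.dump(notebook, f)
    demo = load_notebook_as_module(nb_path, "demo")
```

The only file on disk is the notebook itself; the module lives in memory, which is the property the paragraph above describes.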

How it works

The module registers an IPython extension. Once loaded with the %load_ext iimport statement, it enables the markup filter and provides the %iimport magic for importing other notebooks.

In import mode (when you run %iimport notebook as nb) it processes the input file line by line and passes each line through a pipeline of filters. These filters collect all procedure definitions and their code. When a procedure ends, the filters pass its body to the output, so the procedure is declared as a top-level function.
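
The line-by-line transformation can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the module's real filter pipeline: the function name to_functions is hypothetical, nesting is not supported, and the multiline exclusion tags are assumed to sit on lines of their own.

```python
def to_functions(lines):
    """Turn %def/%return markup into plain Python source (no nesting).

    Hypothetical single-pass filter for illustration only.
    """
    out = []
    in_proc = False   # True between a %def line and its %return line
    skipping = False  # True between %/* and %*/
    for line in lines:
        stripped = line.strip()
        if stripped == "%/*":
            skipping = True
        elif stripped == "%*/":
            skipping = False
        elif skipping or stripped.startswith(("%-", "%//")):
            continue  # excluded from the imported code
        elif stripped.startswith("%def "):
            out.append("def " + stripped[len("%def "):])
            in_proc = True
        elif stripped.startswith("%return"):
            out.append("    return" + stripped[len("%return"):])
            in_proc = False
        elif in_proc:
            # Body lines are indented one level; blank lines stay blank
            out.append("    " + line if stripped else "")
        else:
            out.append(line)
    return "\n".join(out)


marked_up = """\
%def total(values):
%- print(values)
result = sum(values)
%/*
print("debug")
%*/
%return result
"""

source = to_functions(marked_up.splitlines())
namespace = {}
exec(source, namespace)  # defines total() as a top-level function
```

After this runs, source holds an ordinary function definition with the debug lines removed, and namespace["total"] can be called like any other function.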

In the notebook it simply ignores all markup commands so you can execute marked up code as if there’s no markup.
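
A rough sketch of what ignoring the markup could look like in notebook mode is given below. It assumes that %def, %return and the multiline exclusion tags are dropped while the code after a %- or %// prefix is still executed; that behaviour is inferred from the examples in this README rather than taken from the source code, and the name strip_markup is hypothetical.

```python
def strip_markup(lines):
    """Remove markup tokens so a cell runs as ordinary linear code.

    Hypothetical helper for illustration only; not part of iimport.
    """
    out = []
    for line in lines:
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        # Structural tokens carry no code of their own in notebook mode
        if stripped.startswith(("%def ", "%return", "%/*", "%*/")):
            continue
        # Exclusion prefixes are removed, the code after them is kept
        for token in ("%-", "%//"):
            if stripped.startswith(token):
                line = indent + stripped[len(token):].lstrip()
                break
        out.append(line)
    return "\n".join(out)


cell = """\
%def total(values):
%- seen = list(values)
result = sum(values)
%return result
"""

namespace = {"values": [1, 2, 3]}
exec(strip_markup(cell.splitlines()), namespace)
```

Here both the marked-down debug line and the procedure body run in the same flat namespace, as they would in an unmarked cell.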

For technical details see:

Installation

$ pip install git+https://github.com/krvkir/iimport.git

Examples

Simple example

Suppose you’ve saved the following piece of code in notebook1.ipynb, and the code was prototyped, debugged and tested on one data set (say, scenario 1).

import pandas as pd

f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'

df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
df = df1.join(df2, on='id2', how='inner').dropna(how='any')
df.info()

sums = df.groupby('a')['b'].sum()
sums.plot.hist(bins=50)

Now you want to repeat those sums calculations on some other data, and perhaps the path to that data is different. You modify your notebook as follows:

import pandas as pd
%load_ext iimport

f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'

%def calc_sums(f1_path, f2_path):
df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
df = df1.join(df2, on='id2', how='inner').dropna(how='any')
%- df.info()

sums = df.groupby('a')['b'].sum()
%- sums.plot.hist(bins=50)
%return sums

Those modifications won’t affect interactive execution of your code: you can still run your cells and won’t notice any difference.

But now you can import your notebook and reuse its code in another file:

%load_ext iimport
%iimport notebook1 as nb1

f1_path = './scenario2/file1.csv'
f2_path = './scenario2/file2.csv'

sums = nb1.calc_sums(f1_path, f2_path)

If you want to see the actual code of the nb1 module, you can either enable debug logging (import logging; logging.basicConfig(level=logging.DEBUG)), in which case the code is printed when %iimport runs, or inspect the nb1.__source__ variable. Either way you’ll see this output:

import pandas as pd
get_ipython().magic('load_ext iimport')

f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'

def calc_sums(f1_path, f2_path):
    df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
    df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
    df = df1.join(df2, on='id2', how='inner').dropna(how='any')
    
    sums = df.groupby('a')['b'].sum()
    return sums

Note the following code transformations:

  • the code between the %def and %return lines became a function;
  • the %return line became the return statement at the end of it;
  • lines starting with %- were excluded from the code.

Advanced example

Let’s consider somewhat more complicated code:

import os
import json
import pandas as pd
import matplotlib.pyplot as plt

# Configure input data
f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'
ref_path = './some_useful_reference.csv'
dir_path = './some_dir/'

# Load data
df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
df = df1.join(df2, on='id2', how='inner').dropna(how='any')
df.info()

sums = df.groupby('a')['b'].sum()

# Make complicated plot that should not appear in imported code
sums.plot.hist(bins=50)
plt.title('Histogram of sums by a of column b')
plt.xlim(0, 10)
plt.ylim(-3, 3)
plt.grid()

# Load important reference and prepare it for usage
ref = pd.read_csv(ref_path, sep=';', encoding='cp1251').set_index('ref_id')
# Drop rows using some condition
ref['calculated_field'] = ref['field_a'] * ref['field_b'] + ref['field_c']
ref = ref[ref.calculated_field > 10]

# Load and process files from disk
datas = {}
for ix, row in df.iterrows():
    fpath = dir_path + row['file_path']
    if os.path.exists(fpath):
        with open(fpath, 'r') as f:
            # Load an object from the file
            obj = json.load(f)
            # Remove some unused fields if any
            if 'garbage' in obj:
                del obj['garbage']
            if 'trash' in obj:
                del obj['trash']
            # Load some data from reference table into an object
            if 'ref_id' in obj:
                obj['ref'] = ref.loc[obj['ref_id']]
        datas[ix] = obj

After placing tokens the code should look like this:

import os
import json
import pandas as pd
import matplotlib.pyplot as plt
%load_ext iimport

# Configure input data
f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'
ref_path = './some_useful_reference.csv'
dir_path = './some_dir/'

%def calc_sum(f1_path=f1_path, f2_path=f2_path):
%def load_df(f1_path=f1_path, f2_path=f2_path):
# Load data
df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
df = df1.join(df2, on='id2', how='inner').dropna(how='any')
%- df.info()
%return df

sums = df.groupby('a')['b'].sum()
%return sums

%/*
# Make complicated plot that should not appear in imported code
sums.plot.hist(bins=50)
plt.title('Histogram of sums by a of column b')
plt.xlim(0, 10)
plt.ylim(-3, 3)
plt.grid()
%*/

%def load_objs(df, ref_path=ref_path, dir_path=dir_path):

%def load_ref(ref_path=ref_path):
# Load important reference and prepare it for usage
ref = pd.read_csv(ref_path, sep=';', encoding='cp1251').set_index('ref_id')
# Drop rows using some condition
ref['calculated_field'] = ref['field_a'] * ref['field_b'] + ref['field_c']
ref = ref[ref.calculated_field > 10]
%return ref

# Load and process files from disk
objs = {}
for ix, row in df.iterrows():
    fpath = dir_path + row['file_path']
    if os.path.exists(fpath):

        %def load_obj(fpath, ref):
        with open(fpath, 'r') as f:
            # Load an object from the file
            obj = json.load(f)
            # Remove some unused fields if any
            if 'garbage' in obj:
                del obj['garbage']
            if 'trash' in obj:
                del obj['trash']
            # Load some data from reference table into an object
            if 'ref_id' in obj:
                obj['ref'] = ref.loc[obj['ref_id']]
        %return obj

        objs[ix] = obj
%return objs

This is what happened:

  • the code for plotting the sums histogram was excluded from import by wrapping it in the multiline exclusion tags (%/* and %*/), so it won’t clutter the output;
  • we used nested functions:
    • load_df inside calc_sum;
    • load_ref and load_obj inside load_objs;
  • we set default values for procedure parameters.

Now let’s see what we get on import time:

import os
import json
import pandas as pd
import matplotlib.pyplot as plt


# Configure input data
f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'
ref_path = './some_useful_reference.csv'
dir_path = './some_dir/'


def load_df(f1_path=f1_path, f2_path=f2_path):
    """
    :param f1_path=f1_path
    :param f2_path=f2_path
    Returns: df
    """
    # Load data
    df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
    df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
    df = df1.join(df2, on='id2', how='inner').dropna(how='any')
    return df

def calc_sum(f1_path=f1_path, f2_path=f2_path):
    """
    :param f1_path=f1_path
    :param f2_path=f2_path
    Returns: sums
    """
    df = load_df(f1_path, f2_path)

    sums = df.groupby('a')['b'].sum()
    return sums


def load_ref(ref_path=ref_path):
    """
    :param ref_path=ref_path
    Returns: ref
    """
    # Load important reference and prepare it for usage
    ref = pd.read_csv(ref_path, sep=';', encoding='cp1251').set_index('ref_id')
    # Drop rows using some condition
    ref['calculated_field'] = ref['field_a'] * ref['field_b'] + ref['field_c']
    ref = ref[ref.calculated_field > 10]
    return ref


def load_obj(fpath, ref):
    """
    :param fpath
    :param ref
    Returns: obj
    """
    with open(fpath, 'r') as f:
        # Load an object from the file
        obj = json.load(f)
        # Remove some unused fields if any
        if 'garbage' in obj:
            del obj['garbage']
        if 'trash' in obj:
            del obj['trash']
        # Load some data from reference table into an object
        if 'ref_id' in obj:
            obj['ref'] = ref.loc[obj['ref_id']]
    return obj


def load_objs(df, ref_path=ref_path, dir_path=dir_path):
    """
    :param df
    :param ref_path=ref_path
    :param dir_path=dir_path
    Returns: objs
    """

    ref = load_ref(ref_path)
    # Load and process files from disk
    objs = {}
    for ix, row in df.iterrows():
        fpath = dir_path + row['file_path']
        if os.path.exists(fpath):

            obj = load_obj(fpath, ref)

            objs[ix] = obj
    return objs

Note that all the procedures (including nested ones) became top-level functions, and that nested procedure bodies were folded into calls to those functions. Now these functions can be easily chained together:

%load_ext iimport
%iimport notebook as nb
from dask import delayed

# Wrap the functions (not their results) so the calls are deferred until compute()
df = delayed(nb.load_df)()
objs = delayed(nb.load_objs)(df)
objs.compute()
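
The folding of a nested procedure into a call, as seen in the generated code above (for example df = load_df(f1_path, f2_path)), can be sketched as below. This is an illustration only: fold_call is a hypothetical helper, and it assumes the call simply passes every parameter positionally under its own name, which matches the generated output shown in this README.

```python
import ast


def fold_call(signature, returned):
    """Build the call line that replaces a nested procedure body.

    signature: text that follows %def, without the trailing colon
    returned:  variable name given on the %return line
    Hypothetical helper for illustration only.
    """
    # Reuse Python's own parser to read the parameter names
    func = ast.parse(f"def {signature}: pass").body[0]
    names = [arg.arg for arg in func.args.args]
    return f"{returned} = {func.name}({', '.join(names)})"


call_df = fold_call("load_df(f1_path=f1_path, f2_path=f2_path)", "df")
call_obj = fold_call("load_obj(fpath, ref)", "obj")
```

Parsing the signature with ast rather than splitting on commas keeps default values that themselves contain commas from confusing the helper.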

References

List of tokens

  • Beginning of procedure: %def
  • End of procedure: %return

Note that beginning and ending tokens may be placed in different notebook cells, so that you can split a procedure into several cells.

  • Skip this line on import: %- or %//
  • Skip multiple lines on import: %/* ... %*/
  • TODO Include this line on import (but skip in the notebook): %+

List of commands

  • %iimport – import ipynb file. Examples of correct commands:
    • %iimport notebook1;
    • %iimport notebook1 as nb1;
    • TODO %iimport ../notebooks/2017 Some notebook as some_nb;
    • import 2017_Some_notebook as some_nb – regular import statement works too.

    Note that file extension (.ipynb) should be omitted.

  • %iimport_enabled 1 – enable parsing of the code and defining functions inside the current notebook. Useful for debugging; switched off by default.
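
As an illustration of the two supported %iimport forms listed above, the following sketch parses the argument string into a notebook file name and a module alias. The function parse_iimport and its regular expression are assumptions made for this example and do not reflect how the magic actually parses its arguments; paths and names with spaces (the TODO item) are deliberately rejected.

```python
import re


def parse_iimport(args):
    """Split '%iimport' arguments into (notebook file name, alias).

    Hypothetical parser for illustration only; handles 'name' and
    'name as alias', with the .ipynb extension omitted by the caller.
    """
    match = re.fullmatch(r"\s*(\w+)(?:\s+as\s+(\w+))?\s*", args)
    if match is None:
        raise ValueError(f"unsupported %iimport arguments: {args!r}")
    name, alias = match.group(1), match.group(2)
    # Without an alias the module is bound under the notebook's own name
    return name + ".ipynb", alias or name


plain = parse_iimport("notebook1")
aliased = parse_iimport("notebook1 as nb1")
```

Passing a name that already ends in .ipynb raises ValueError in this sketch, in line with the note that the extension should be omitted.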
