A tool for making reusable code from linear ipython notebooks.
- When you write linear code in the notebook, it’s easy to do experiments and interactively debug your code, but you can’t reuse the code and long notebooks usually are hard to read and comprehend.
- In contrast, when you use structural/object-oriented programming approach and shape your code into objects, functions and modules, you can reuse it but you’ll need special tools to debug, hence prototyping and experimenting become more complicated.
Both the ease of experiment and reusability are valuable:
- If code is shaped in functions you can for example make a
luigi
pipeline ordask
computational graph out of it, implement results caching to shorten computation time, launch your pipeline with different inputs and so on. - If code is in linear notebook form you can explore data, do complex visualizations, make interactive widgets.
This module attempts to solve this experimentability vs reusability tradeoff, allowing you to write your code in the notebook in the exploratory manner, and when you fell that it has reached maturity – to highlight logically independent pieces of code with macro commands and to import these pieces to other places without diverging from the linear notebook structure. You will still be able to continue with the notebook in the linear manner.
Two things are implemented:
- very simple markup language that defines procedures and excludes unwanted (debug, visualization) code;
- import mechanics that process marked-up notebook file into a set of procedures.
There are no intermediate python files, you work with the same code both in notebook interface and as imported procedures.
The module registers IPython extension. When loaded with %load_ext iipython
statement, it enables markup filter and provides you with %iimport
magic to import other notebooks.
In import mode (when you do %iimport notebook as nb
) it processes the input file line by line and passes it through a pipeline of filters. These filters collect all procedure definitions and their code. When procedure ends, filters pass its body to the output so the procedure is declared as top-level function.
In the notebook it simply ignores all markup commands so you can execute marked up code as if there’s no markup.
For technical details see:
- PEP 342 – Coroutines via Enhanced Generators
- IPython docs:
- Jupyter notebook docs:
$ pip install git+https://github.com/krvkir/iimport.git
Suppose you’ve saved the following piece of code in notebook1.ipynb
:, and the code was prototyped, debugged and tested on one data set (say, on scenario 1).
import pandas as pd
f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'
df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
df = df1.join(df2, on='id2', how='inner').dropna(how='any')
df.info()
sums = df.groupby('a')['b'].sum()
sums.plt.hist(bins=50)
Now you want to repeat those sums
calculations on some other data, and perhaps path to that data is different. You modify your notebook as follows:
import pandas as pd
%load_ext iimport
f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'
%def calc_sums(f1_path, f2_path):
df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
df = df1.join(df2, on='id2', how='inner').dropna(how='any')
%- df.info()
sums = df.groupby('a')['b'].sum()
%- sums.plt.hist(bins=50)
%return sums
Those modifications won’t affect interactive execution of your code: you can still run your cells and won’t notice any difference.
But now you can import your notebook and reuse its code in another file:
%load_ext iimport
%iimport notebook1 as nb1
f1_path = './scenario2/file1.csv'
f2_path = './scenario2/file2.csv'
sums = nb1.calc_sums(f1_path, f2_path)
If you want to see the acual code from nb1
module, you can either enable debug logging (import logging; logging.basicConfig(level=logging.DEBUG)
) and the code will be printed on %iimport
execution, or you may set nb1.__source__
variable, then you’ll see this output:
import pandas as pd
get_ipython().magic('load_ext iimport')
f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'
def calc_sums(f1_path, f2_path):
df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
df = df1.join(df2, on='id2', how='inner').dropna(how='any')
sums = df.groupby('a')['b'].sum()
return sums
Note the following code transformations:
- the code between
%def
and%end
lines became a function; return
statement was inserted at the end of it;- lines starting with
%-
were excluded from the code.
Let’s consider somewhat more complicated code:
import os
import json
import pandas as pd
import matplotlib.pyplot as plt
# Configure input data
f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'
ref_path = './some_useful_reference.csv'
dir_path = './some_dir/'
# Load data
df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
df = df1.join(df2, on='id2', how='inner').dropna(how='any')
df.info()
sums = df.groupby('a')['b'].sum()
# Make complicated plot that should not appear in imported code
sums.plt.hist(bins=50)
plt.title('Histogram of sums by a of column b')
plt.xlim(0, 10)
plt.ylim(-3, 3)
plt.grid()
# Load important reference and prepare it for usage
ref = pd.read_csv(ref_path, sep=';', encoding='cp1251').set_index('ref_id')
# Drop rows using some condition
ref['calculated_field'] = ref['field_a'] * ref['field_b'] + ref['field_c']
ref = ref[ref.calculated_field > 10]
# Load and process files from disk
datas = {}
for ix, row in df.iterrows():
fpath = dir_path + row['file_path']
if os.path.exists(fpath):
with open(fpath, 'r') as f:
# Load an object from the file
obj = json.load(f)
# Remove some unused fields if any
if 'garbage' in obj:
del obj['garbage']
if 'trash' in obj:
del obj['trash']
# Load some data from reference table into an object
if 'ref_id' in obj:
obj['ref'] = ref.loc[obj['ref_id']]
datas[ix] = obj
After placing tokens the code should look like this:
import os
import json
import pandas as pd
import matplotlib.pyplot as plt
%load_ext iimport
# Configure input data
f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'
ref_path = './some_useful_reference.csv'
dir_path = './some_dir/'
%def calc_sum(f1_path=f1_path, f2_path=f2_path):
%def load_df(f1_path=f1_path, f2_path=f2_path):
# Load data
df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
df = df1.join(df2, on='id2', how='inner').dropna(how='any')
%- df.info()
%return df
sums = df.groupby('a')['b'].sum()
%return sums
%/*
# Make complicated plot that should not appear in imported code
sums.plt.hist(bins=50)
plt.title('Histogram of sums by a of column b')
plt.xlim(0, 10)
plt.ylim(-3, 3)
plt.grid()
%*/
%def load_objs(df, ref_path=ref_path, dir_path=dir_path):
%def load_ref(ref_path=ref_path):
# Load important reference and prepare it for usage
ref = pd.read_csv(ref_path, sep=';', encoding='cp1251').set_index('ref_id')
# Drop rows using some condition
ref['calculated_field'] = ref['field_a'] * ref['field_b'] + ref['field_c']
ref = ref[ref.calculated_field > 10]
%return ref
# Load and process files from disk
objs = {}
for ix, row in df.iterrows():
fpath = dir_path + row['file_path']
if os.path.exists(fpath):
%def load_obj(fpath, ref):
with open(fpath, 'r') as f:
# Load an object from the file
obj = json.load(f)
# Remove some unused fields if any
if 'garbage' in obj:
del obj['garbage']
if 'trash' in obj:
del obj['trash']
# Load some data from reference table into an object
if 'ref_id' in obj:
obj['ref'] = ref.loc[obj['ref_id']]
%return obj
objs[ix] = obj
%return objs
This is what happened:
- The code for plotting sums histogram was excluded from import by marking it with multiline exclusion tag (
%/*
…%*/
), so it won’t clutter the output - we used nested functions:
load_df
insidecalc_sum
load_ref
andload_obj
insideload_objs
- we set default values for procedure parameters
Now let’s see what we get on import time:
import os
import json
import pandas as pd
import matplotlib.pyplot as plt
# Configure input data
f1_path = './scenario1/file1.csv'
f2_path = './scenario1/file2.csv'
ref_path = './some_useful_reference.csv'
dir_path = './some_dir/'
def load_df(f1_path=f1_path, f2_path=f2_path):
"""
:param f1_path=f1_path
:param f2_path=f2_path
Returns: df
"""
# Load data
df1 = pd.read_csv(f1_path, sep=';').set_index('id1')
df2 = pd.read_csv(f2_path, sep=',').set_index('id2')
df = df1.join(df2, on='id2', how='inner').dropna(how='any')
return df
def calc_sum(f1_path=f1_path, f2_path=f2_path):
"""
:param f1_path=f1_path
:param f2_path=f2_path
Returns: sum
"""
df = load_df(f1_path, f2_path)
sums = df.groupby('a')['b'].sum()
return sum
def load_ref(ref_path=ref_path):
"""
:param ref_path=ref_path
Returns: ref
"""
# Load important reference and prepare it for usage
ref = pd.read_csv(ref_path, sep=';', encoding='cp1251').set_index('ref_id')
# Drop rows using some condigion
ref['calculated_field'] = ref['field_a'] * ref['field_b'] + ref['field_c']
ref = ref[ref.calculated_field > 10]
return ref
def load_obj(fpath, ref):
"""
:param fpath
:param dir_path=dir_path
Returns: obj
"""
with open(fpath, 'r') as f:
# Load an object from the file
obj = json.load(f)
# Remove some unused fields if any
if 'garbage' in obj:
del obj['garbage']
if 'trash' in obj:
del obj['trash']
# Load some data from reference table into an object
if 'ref_id' in obj:
obj['ref'] = ref.loc[obj['ref_id']]
return obj
def load_objs(df, ref_path=ref_path, dir_path=dir_path):
"""
:param df
:param ref_path=ref_path
Returns: objs
"""
ref = load_ref(ref_path)
# Load and process some files from disk
objs = {}
for ix, row in df.iterrows():
fpath = dir_path + row['file_path']
if os.path.exists(fpath):
obj = load_obj(fpath, ref)
objs[ix] = obj
return objs
Note that all the procedures (including nested ones) became top-level functions, and that these procedures folded into function calls. Now these functions can be easily chained together:
%load_ext iimport
%iimport notebook as nb
from dask import delayed
df = delayed(nb.load_df())
objs = delayed(nb.load_objs(df))
objs.compute()
- Beginning of procedure:
%def
- End of procedure:
%return
Note that beginning and ending tokens may be placed in different notebook cells, so that you can split a procedure into several cells.
- Skip this line on import:
%-
or%//
- Skip multiple lines on import:
%/*
…%*/
- TODO Include this line on import (but skip in the notebook):
%+
%iimport
– import ipynb file. Examples of correct commands:%iimport notebook1
;%iimport notebook1 as nb1
;- TODO
%iimport ../notebooks/2017 Some notebook as some_nb
; import 2017_Some_notebook as some_nb
– regular import statement works too.
Note that file extension (
.ipynb
) should be omitted.%iimport_enabled 1
– enable parsing of the code and defining functions inside current notebook. Useful for debugging, by default is switched off.