Jupyter
The Jupyter ecosystem provides a wide range of tools to support data analysis and reproducible research, at scale, locally or virtually. Jupyter also promotes the use of a document format, the Jupyter notebook (.ipynb), that encourages a literate programming approach. This notebook style of computing supports communication based on code and code output, and is well suited to documentation, education, and training.
OpenRefine can integrate with the Jupyter ecosystem in two main ways:
- as a graphical browser-based application that forms part of a Jupyter mediated multi-application workbench environment;
- as a (potentially headless) client application to support notebook fronted computation.
The Jupyter ecosystem encompasses a range of applications, protocols, and document formats. It was originally focused on the Jupyter notebook user interface; JupyterLab is now developing as a structured IDE whilst still supporting notebook documents and notebook-style interactions.
One popular way of deploying Jupyter notebooks is using repo2docker, which builds containerized environments that run a single-user Jupyter notebook server. The environments are defined by the contents of a GitHub repository or local directory. The repository may contain configuration files that describe the required computational environment (for example, the required Linux and Python packages), as well as "content" in the form of programme or documentation files and notebooks.
The repo2docker application is a key part of the BinderHub service, which supports the building of Docker images from a repository and the running of container instances generated from them using scalable web services.
The MyBinder service provides a publicly available BinderHub deployment that allows anyone to build and run a temporary, disposable containerized service from a public GitHub repository.
Within this operational context, Jupyter provides two standardised mechanisms for integrating third-party applications:
- a notebook UI extension mechanism that allows third party services to be launched from a notebook server UI;
- a server proxy that allows third-party web services to be accessed in the same web context as the notebook server.
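The server proxy mechanism can be sketched as follows. This is a minimal illustration, assuming the jupyter-server-proxy package and its convention of a setup function that returns a dictionary describing the process to launch and proxy. The function name `setup_openrefine`, the `refine` executable path, and the data directory are hypothetical placeholders rather than values taken from any particular repository.

```python
import os


def setup_openrefine(refine_path="openrefine-2.8/refine", data_dir=None):
    """Describe how a proxy should launch OpenRefine (illustrative sketch).

    The returned keys follow the jupyter-server-proxy server-process
    convention; refine_path and data_dir are assumed placeholder values.
    """
    if data_dir is None:
        data_dir = os.path.join(os.path.expanduser("~"), "openrefine")
    return {
        # "{port}" is a template that the proxy fills in with the port it allocates.
        "command": [refine_path, "-p", "{port}", "-d", data_dir],
        # Seconds the proxy waits for the process to start answering requests.
        "timeout": 120,
        # Adds an entry for the application to the Jupyter launcher.
        "launcher_entry": {"title": "OpenRefine"},
    }
```

A function of this kind is typically registered through the `jupyter_serverproxy_servers` entry point group of an installable Python package, so that the notebook server discovers it at start-up.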
To install OpenRefine support within a Jupyter notebook server, see betatim/openrefineder. That repository has been "Binderised" so you can use it to build and launch a container on MyBinder that runs Jupyter notebooks and OpenRefine.
Jupyter notebooks can be used as an interactive programming environment for a wide variety of languages, not just the triumvirate of Julia, Python, and R that give Jupyter its name.
Within the Python context, a Python OpenRefine client allows a user to script interactions within a Jupyter notebook against an OpenRefine application instance, essentially as a headless service (although workflows are possible where both notebook-scripted and live interactions take place).
An example of a demo notebook using a fork of a popular OpenRefine Python client can be found here. The repo has the OpenRefine service port number hard-coded to the default OpenRefine port number (3333) and can be run via MyBinder.
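As a rough sketch of what scripting against a headless OpenRefine instance involves, the following uses only the Python standard library rather than the client library mentioned above. It assumes a server listening on the default port 3333 and relies on OpenRefine's `get-all-project-metadata` command endpoint; the helper names `command_url` and `list_project_names` are hypothetical.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumed location of the server: localhost on the default port 3333.
OPENREFINE_URL = "http://localhost:3333/"


def command_url(command, base_url=OPENREFINE_URL):
    """Build the URL of an OpenRefine 'core' command endpoint."""
    return urljoin(base_url, "command/core/" + command)


def list_project_names(base_url=OPENREFINE_URL, timeout=10):
    """Return the sorted names of the projects held by a running server."""
    url = command_url("get-all-project-metadata", base_url)
    with urlopen(url, timeout=timeout) as response:
        metadata = json.load(response)
    # The response maps project ids to metadata records under "projects".
    projects = metadata.get("projects", {})
    return sorted(info["name"] for info in projects.values())
```

In a notebook cell, calling `list_project_names()` would return the project names only if an OpenRefine server is actually running at the assumed address; a dedicated client library wraps many more commands in the same style.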
The OpenRefine server can be automatically started as a headless service by adding a start config file to the repository containing a start-up invocation of the form:
#!/bin/bash
#Start OpenRefine
OPENREFINE_DIR="$HOME/openrefine"
mkdir -p "$OPENREFINE_DIR"
nohup openrefine-2.8/refine -p 3333 -d "$OPENREFINE_DIR" > /dev/null 2>&1 &
#Do the normal Binder start thing here...
exec "$@"
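Because the start script launches OpenRefine in the background, notebook code that runs soon after start-up may find that the server is not yet accepting requests. One possible way to handle this, sketched here with the standard library only, is to poll the service until it responds. The function names, the retry count, and the delay are illustrative assumptions.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def server_is_up(url, timeout=2):
    """Return True if an HTTP request to url completes without error."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except (URLError, OSError):
        return False


def wait_for_server(url="http://localhost:3333/", attempts=30, delay=1.0,
                    probe=server_is_up):
    """Poll until probe(url) succeeds or the attempts are exhausted."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

The `probe` parameter exists so that the polling logic can be exercised without a live server, for example by passing a stub function in a test.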
Several ideas exploring possible OpenRefine/Jupyter integration points are described in OpenRefine integration ideas with R lang and Jupyter. This section complements that page.
JupyterLab provides a panel based display that allows multiple document views within a single window. The Jupyter ecosystem supports a wide range of interactive widgets that support rapid application development and might be used to create simplified interactive user interfaces running within their own panels exposing some or all OpenRefine services against an OpenRefine backend application.
Behind the interactive Jupyter notebook code cells lies a Jupyter kernel, the computational language environment in which code is executed and state maintained. Available Jupyter kernels include ones capable of running Java and other JVM-based language environments. Demonstration kernels exist that allow code cells to be used to run code directly against APIs or command line applications (for example, a Gnuplot kernel that allows users to embed an interactive form in a web page and run Gnuplot commands against an on-demand launched MyBinder backend). At this point, we might ask what form an OpenRefine kernel might take and whether it makes sense to have a script-based interface to OpenRefine.
OpenRefine supports scripting in a variety of languages including GREL and Jython. If the expression language editor supported the Jupyter client protocol, then code could be executed using arbitrary Jupyter kernels (though some means would have to be found for communicating OpenRefine data/state to the kernel as well as the code).
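To make the idea more concrete, the sketch below builds the payload of an `execute_request` message in the shape defined by the Jupyter messaging protocol. It is only an illustration: a real client would also sign the message and send it over a kernel connection. The `wrap_cell_value` helper shows one hypothetical way of passing OpenRefine state to a Python kernel, by embedding the cell value in the code as a JSON literal; nothing of this kind is part of OpenRefine itself.

```python
import json
import uuid
from datetime import datetime, timezone


def make_execute_request(code, session_id, username="openrefine"):
    """Build the dict form of a Jupyter 'execute_request' message."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session_id,
            "username": username,
            "date": datetime.now(timezone.utc).isoformat(),
            "msg_type": "execute_request",
            "version": "5.3",
        },
        "parent_header": {},
        "metadata": {},
        "content": {
            "code": code,
            "silent": False,
            "store_history": False,
            "user_expressions": {},
            "allow_stdin": False,
            "stop_on_error": True,
        },
    }


def wrap_cell_value(expression, value):
    """Prefix a Python expression with code that binds `value`.

    Hypothetical convention: the cell value travels as a JSON literal
    inside the submitted code, so only JSON-serialisable values work.
    """
    literal = json.dumps(value)
    return "import json\nvalue = json.loads({!r})\n{}".format(literal, expression)
```

For example, `make_execute_request(wrap_cell_value("value.upper()", "abc"), "session-1")` produces a message whose code first binds `value` to the string `abc` and then evaluates the expression against it.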
Related:
- SoS — Script of Scripts — polyglot notebooks address concerns relating to moving state between different language kernels;
- ThebeLab, Juniper and nbinteract are JavaScript libraries that demonstrate how to work with session-based remote kernels launched via MyBinder from within a web page.