Felix Lohmeier edited this page Jul 29, 2019 · 6 revisions

The Jupyter ecosystem provides a wide range of tools to support data analysis and reproducible research, at scale, locally or remotely. Jupyter also promotes a document format, the Jupyter notebook (.ipynb), that encourages a literate programming approach. This notebook style of computing supports communication built around code and code output, and is well suited to documentation, education, and training.

OpenRefine can integrate with the Jupyter ecosystem in two main ways:

  • as a graphical browser-based application that forms part of a Jupyter-mediated multi-application workbench environment;
  • as a (potentially headless) backend service that supports notebook-fronted computation.

As a Graphical Application Within a Jupyter Mediated Environment

The Jupyter ecosystem encompasses a range of applications, protocols, and document formats. The ecosystem was originally focused on the Jupyter notebook user interface; JupyterLab is now developing as a more structured IDE whilst still supporting notebook documents and notebook-style interactions.

One popular way of deploying Jupyter notebooks is to use repo2docker, which builds containerized environments that run a single-user Jupyter notebook server. The environments are defined by the contents of a GitHub repository or local directory. The repository may contain configuration files that describe the required computational environment (for example, the required Linux and Python packages) as well as "content" in the form of programme or documentation files and notebooks.

The repo2docker application is a key part of the BinderHub service, which builds Docker images from a repository and runs container instances generated from them using scalable web services.

The MyBinder service is a publicly available BinderHub deployment that allows anyone to build and run a temporary, disposable containerized service from a public GitHub repository.

Within this operational context, Jupyter provides two standardised mechanisms for integrating third-party applications:

  • a notebook UI extension mechanism that allows third party services to be launched from a notebook server UI;
  • a server proxy that allows third-party web services to be accessed in the same web context as the notebook server.
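
To illustrate the second mechanism: assuming the proxy is provided by the jupyter-server-proxy package, a web service listening on a local port becomes reachable under the notebook server's own URL at the path /proxy/<port>/. The sketch below only builds that URL; the helper name `proxied_url` and the example base URL are hypothetical.

```shell
# Hypothetical helper: build the URL at which jupyter-server-proxy
# would expose a local service.
#   $1 = notebook server base URL, $2 = local port of the service
proxied_url() {
  # ${1%/} strips one trailing slash so the result contains no "//"
  printf '%s/proxy/%s/\n' "${1%/}" "$2"
}

# Example: OpenRefine on port 3333 behind a local notebook server
proxied_url "http://localhost:8888/" 3333
```

On a hosted deployment such as MyBinder the base URL would also include a per-user path, so the prefix will differ from this local example.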

Example of OpenRefine Launcher from Jupyter notebook

To install OpenRefine support within a Jupyter notebook server, see betatim/openrefineder. That repository has been "Binderised" so you can use it to build and launch a container on MyBinder that runs Jupyter notebooks and OpenRefine.

As a Backend Service

Jupyter notebooks can be used as an interactive programming environment for a wide variety of languages, not just the triumvirate of Julia, Python, and R that give Jupyter its name.

Within the Python context, a Python OpenRefine client allows a user to script interactions within a Jupyter notebook against an OpenRefine application instance, essentially treating it as a headless service (although workflows are also possible in which both notebook-scripted and live interactions take place).

An example demo notebook using a fork of a popular OpenRefine Python client can be found here. The repo hard-codes the OpenRefine service port to the default OpenRefine port number (3333) and can be run via MyBinder.
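
Scripted access of this kind ultimately comes down to HTTP requests against the running OpenRefine instance. As a minimal sketch, assuming curl is installed and OpenRefine is listening on port 3333, the project list can be fetched from the command line. The helper names `refine_command_url` and `list_projects` are hypothetical; the /command/core/ path is the one OpenRefine uses for its core commands.

```shell
# Base URL of the OpenRefine instance; 3333 is OpenRefine's default port.
OPENREFINE_URL="${OPENREFINE_URL:-http://127.0.0.1:3333}"

# Hypothetical helper: build the URL for a core command endpoint.
#   $1 = command name, e.g. get-version
refine_command_url() {
  printf '%s/command/core/%s\n' "${OPENREFINE_URL%/}" "$1"
}

# Hypothetical helper: fetch metadata for all projects as JSON.
# -s silences progress output, -f makes curl fail on HTTP errors.
list_projects() {
  curl -sf "$(refine_command_url get-all-project-metadata)"
}
```

Calling `list_projects` from a notebook terminal (or via a `!` shell escape in a Python notebook) is a quick way to confirm that the notebook environment can reach the service.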

The OpenRefine server can be automatically started as a headless service by adding a start config file to the repository containing a start-up invocation of the form:

#!/bin/bash
 
#Start OpenRefine
OPENREFINE_DIR="$HOME/openrefine"
mkdir -p "$OPENREFINE_DIR"
# -p sets the port, -d sets the data directory
nohup openrefine-2.8/refine -p 3333 -d "$OPENREFINE_DIR" > /dev/null 2>&1 &
 
#Do the normal Binder start thing here...
exec "$@"
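
Because the start script launches OpenRefine in the background, a notebook may begin running before the server is ready to accept requests. The following is a minimal sketch of a readiness check, assuming curl is available in the image; `wait_for_openrefine` is a hypothetical helper name, not part of OpenRefine or Binder.

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
#   $1 = URL to probe (defaults to the port used in the start script)
#   $2 = maximum number of attempts (default 30)
#   $3 = seconds to wait between attempts (default 1)
wait_for_openrefine() {
  local url="${1:-http://127.0.0.1:3333/}"
  local attempts="${2:-30}"
  local delay="${3:-1}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: no progress output, -f: treat HTTP errors as failure,
    # -o /dev/null: discard the body, --max-time: do not hang on one probe
    if curl -sf -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (not run here): wait_for_openrefine && echo "OpenRefine is up"
```

The function returns a non-zero status if the server never responds, so a calling script can decide whether to retry or report the failure.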

Possible Futures

Several ideas exploring possible OpenRefine/Jupyter integration points are described in OpenRefine integration ideas with R lang and Jupyter. This section complements that page.

JupyterLab Integration

JupyterLab provides a panel-based display that allows multiple document views within a single window. The Jupyter ecosystem also supports a wide range of interactive widgets for rapid application development. These might be used to create simplified interactive user interfaces, each running in its own panel, that expose some or all OpenRefine services against an OpenRefine backend application.

OpenRefine Jupyter Kernel

Behind the interactive Jupyter notebook code cells lies a Jupyter kernel, the computational language environment in which code is executed and state maintained. Available Jupyter kernels include ones capable of running Java and other JVM-based language environments. Demonstration kernels exist that allow code cells to be used to run code directly against APIs or command line applications (for example, a Gnuplot kernel that allows users to embed an interactive form in a web page and run Gnuplot commands against an on-demand launched MyBinder backend). At this point, we might ask what form an OpenRefine kernel might take and whether it makes sense to have a script-based interface to OpenRefine.

Jupyter Kernels for OpenRefine Scripting

OpenRefine supports scripting in a variety of languages including GREL and Jython. If the expression language editor supported the Jupyter client protocol, then code could be executed using arbitrary Jupyter kernels (though some means would have to be found for communicating OpenRefine data/state to the kernel as well as the code).

Related:
