Distributed Image Processing with OpenCV and Pachyderm

Intro

In this tutorial we're going to do edge detection using the Canny edge detection algorithm implemented in OpenCV. We'll be deploying our code as a Pachyderm pipeline, which means it will be both streaming and distributed.

This tutorial assumes that you've already got a Pachyderm cluster up and running and that you can talk to it with the pachctl CLI tool. You'll know it's working if pachctl version returns without any errors. If not, head over to the local installation or production deployment docs to get a cluster up and running. You should also clone the repo, since we'll use the code and files it provides.

Load Images Into Pachyderm

The first thing we'll need to do is create a repo to store our images in. We'll call the repo "images".

$ pachctl create-repo images

Now we need to put some images in that repo. You can do this with put-file:

$ pachctl put-file images master -c -i images.txt

With this command you're telling Pachyderm that you want to put files in the images repo on the branch master. You're also passing two flags, -c and -i. -c means that you want Pachyderm to start and finish the commit for you. If you wanted multiple put-file calls to write to the same commit, you'd use start-commit and finish-commit instead. -i lets you specify a line-delimited input file. Each line can be a URL to scrape or a local file to add. In our case, we've got a file images.txt with an image URL that we'll add to Pachyderm.
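
To make the -i input format concrete, here is a minimal Python sketch of how a line-delimited file such as images.txt could be read, with each entry classified as a URL or a local path. It is an illustration only, not how pachctl itself is implemented; the function name classify_sources and the example URL are hypothetical.

```python
from urllib.parse import urlparse


def classify_sources(text):
    """Split line-delimited text into (kind, value) pairs.

    kind is "url" for http(s) entries and "file" for anything else.
    Blank lines are skipped.
    """
    sources = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue
        scheme = urlparse(entry).scheme
        kind = "url" if scheme in ("http", "https") else "file"
        sources.append((kind, entry))
    return sources


# Hypothetical contents of an input file; the URL is a placeholder.
example = "http://example.com/photo.jpg\n\n./local/cat.png\n"
print(classify_sources(example))
```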

Build and Distribute the Docker Image

Now that you've got some data in Pachyderm, it's time to process it. First, you need to create a Docker image containing the code we will use to perform our edge detection. For this example, you can use the supplied Dockerfile. You can build the Docker image with:

$ docker build -t opencv .

To understand what's going on, take a look at the Dockerfile and edges.py.
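
If you don't have the repo open, the following is a minimal sketch of what an edge detection script in the spirit of edges.py could look like. It is not the supplied file. It assumes the container has the cv2 package installed, that input images appear under /pfs/images, and that results should be written to /pfs/out. The function names and the Canny thresholds of 100 and 200 are illustrative choices.

```python
import os


def output_path(image_path, out_dir):
    """Return where the edge map for image_path is written: same base name, .png."""
    base = os.path.splitext(os.path.basename(image_path))[0]
    return os.path.join(out_dir, base + ".png")


def make_edges(image_path, out_dir):
    """Run Canny edge detection on one image and save the result as a PNG."""
    import cv2  # imported here so output_path works without OpenCV installed

    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:  # imread returns None for unreadable or non-image files
        return None
    edges = cv2.Canny(img, 100, 200)  # lower and upper hysteresis thresholds
    dest = output_path(image_path, out_dir)
    cv2.imwrite(dest, edges)
    return dest


if __name__ == "__main__":
    # Assumed directory conventions: inputs under /pfs/images, outputs in /pfs/out.
    for dirpath, _dirs, files in os.walk("/pfs/images"):
        for name in files:
            make_edges(os.path.join(dirpath, name), "/pfs/out")
```

Because each image is handled independently of the others, a script shaped like this can be run on any subset of the input files, which is what lets the work be spread across several workers.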

Distribute the Docker Image

You'll need to push the image to a registry, such as Docker Hub, so Pachyderm can find it. You can do this with:

$ docker tag opencv <your-docker-hub-username>/opencv
$ docker push <your-docker-hub-username>/opencv

Now the image can be referenced on any Docker host as <your-docker-hub-username>/opencv and Docker will be able to find it.

Deploy the Pipeline

Now that you have an image, you need to tell Pachyderm how to run it. To do this, you'll need to create a pipeline. The pipeline for this example is edges.json.

First, notice that this file references the image you just created in the field transform.image. Since you just pushed your image to Docker Hub, you'll need to tell Pachyderm to go there to find it. Open edges.json in a text editor and change the following line:

"image": "opencv"
## should become ##
"image": "<your-docker-hub-username>/opencv"

Now that edges.json points to your image, you can create the pipeline with:

# this cmd assumes you've cloned our repo. If not, use the GitHub raw file URL since `-f` can take a URL as well.
$ pachctl create-pipeline -f edges.json

You can check on the status of the pipeline with:

$ pachctl inspect-pipeline edges

You should see that a single job has been run (it was run on the initial commit you made in the images repo).

Inspect the Results

Results from a pipeline are stored in an output repo whose name matches the name of the pipeline. You can see it with list-repo:

$ pachctl list-repo
NAME    CREATED             SIZE
edges   11 minutes ago      557.7 KiB
images  15 minutes ago      1.136 MiB

You can view this data by mounting it locally:

$ mkdir /tmp/pfs
$ pachctl mount /tmp/pfs &
# This command will block (so we run in the background)

While the mount is up, navigate to file:///tmp/pfs in your web browser. Here you can browse the contents of pfs, which should be the raw images and the results of the edge detection.
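
If you'd rather check the mount from a script than from a browser, here is a minimal sketch that lists every file under a directory. It assumes nothing about how the mount lays out repos and commits; it simply walks whatever is under the path you give it. The function name list_files is hypothetical.

```python
import os


def list_files(root):
    """Return sorted paths, relative to root, of all files below root."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


if __name__ == "__main__":
    # /tmp/pfs is the mount point created above.
    for path in list_files("/tmp/pfs"):
        print(path)
```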

Stream More Data In

Pipelines are smart. They don't just process the input data that's present when they're created; they also process any new data that's added later. You can process any image on the internet or your local disk. All you have to do is put it in the images repo (with put-file), and your pipeline will automatically trigger and run the edge detection code.
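
Conceptually, a streaming pipeline only needs to run on inputs it hasn't seen yet. The sketch below is a toy model of that idea, not Pachyderm's implementation: it compares the files in a new commit against those already processed and returns the ones that still need work. The function name and the file names are made up for illustration.

```python
def files_to_process(already_processed, current_files):
    """Return the files in current_files not yet processed, in sorted order."""
    return sorted(set(current_files) - set(already_processed))


processed = {"liberty.png"}
commit = ["liberty.png", "kitten.png", "skyline.jpg"]
print(files_to_process(processed, commit))  # ['kitten.png', 'skyline.jpg']
```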

Next Steps

OpenCV can do a lot more than edge detection, and now that you've built the container image, you can use anything in OpenCV on Pachyderm. OpenCV has several Python tutorials available online if you're looking for inspiration.

JoeyZwicker/Pachyderm-opencv
