Arc Starter

A starter project to begin coding an Arc job using the Jupyter Notebook interface.

Running

Clone this repository, then run the included shell script. The user interface will then be available at http://localhost:8888, and the access token will be printed to the console.

./.develop.sh

The .develop.sh script contains a hard-coded memory allocation for Apache Spark, set through the Java Virtual Machine options, which should be configured for your specific environment. For example, to change from 4 gigabytes to 8 gigabytes, change:

-e JAVA_OPTS="-Xmx4g" \

to

-e JAVA_OPTS="-Xmx8g" \
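The same substitution can be scripted. The following is a minimal sketch, assuming the script contains the literal string -Xmx4g; it runs against a sample line (the `line` and `updated` variables are illustrative) rather than the real file.

```shell
#!/bin/sh
# Sample line copied from the example above; the real .develop.sh is not touched here.
line='-e JAVA_OPTS="-Xmx4g" \'

# Replace the 4 GB heap limit with 8 GB.
updated=$(printf '%s\n' "$line" | sed 's/-Xmx4g/-Xmx8g/')

printf '%s\n' "$updated"
```

To apply the change to the script itself, the same expression could be run as `sed -i.bak 's/-Xmx4g/-Xmx8g/' .develop.sh`, which also keeps a backup copy named .develop.sh.bak.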

How to execute

By default everything will be executed as an Arc stage.

If needed, SQL can be executed directly using the Jupyter %sql magic, which can speed up development:

%sql numRows=10 truncate=100 outputView=green_tripdata0
SELECT * 
FROM green_tripdata0_raw
WHERE fare_amount < 10

numRows specifies the number of rows to display in the table.
truncate specifies the maximum character length of any output string.
outputView registers the result as a Spark view so it can be referenced in later stages.

These other 'magics' have been defined:

%env which allows setting job variables via the notebook (e.g. %env ETL_CONF_KEY0=value0 ETL_CONF_KEY1=value1). These can be used in both %arc and %sql stages.
%metadata which will try to create and print the correct Arc metadata file for the supplied view.
%printschema which will print the Spark schema in a simple text mode.
%schema which will print the Spark schema of a view.
%summary which will print summary statistics of a view.
%version which will print relevant versions.

Exporting

To export an Arc job, use the option provided in the File > Download as menu, which exports all the Arc stages from the notebook and creates a job file. Note that Jupyter Notebook has been modified so that the .ipynb file does not save any output datasets, which prevents data from being accidentally committed to version control.

Issues

Important:

If you are running Docker for Mac or Docker for Windows, ensure that the Docker memory allocation is large enough to support the memory requested by -Xmx4g.
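One way to check this from a terminal is sketched below. It assumes the docker CLI is installed and relies on `docker info --format '{{.MemTotal}}'`, which reports the memory available to the Docker engine in bytes. The function names `has_enough_memory` and `check_docker_memory` are hypothetical helpers, not part of this repository.

```shell
#!/bin/sh
# Succeeds when the first argument (bytes) is at least the second argument (GiB).
has_enough_memory() {
  total_bytes=$1
  required_gib=$2
  [ "$total_bytes" -ge $((required_gib * 1024 * 1024 * 1024)) ]
}

# Compares the memory reported by the Docker engine with the requested heap size in GiB.
check_docker_memory() {
  wanted_gib=$1
  reported=$(docker info --format '{{.MemTotal}}' 2>/dev/null)
  case $reported in
    ''|*[!0-9]*)
      # Empty or non-numeric output: Docker is missing or not running.
      echo "Could not read Docker memory; is Docker running?"
      return 2
      ;;
  esac
  if has_enough_memory "$reported" "$wanted_gib"; then
    echo "Docker memory is sufficient for -Xmx${wanted_gib}g"
  else
    echo "Docker memory is below ${wanted_gib} GiB; increase it in the Docker settings"
    return 1
  fi
}
```

Calling `check_docker_memory 4` matches the default -Xmx4g setting. The JVM uses some memory beyond its heap, so an allocation comfortably above the heap size is the safer choice.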
