seddonm1/arc-starter

Arc Starter

A starter project to begin coding an Arc job using the Jupyter Notebook interface.

Running

Clone this repository, then run the included shell script. The user interface will then be available at http://localhost:8888 and the access token will be printed to the console.

./.develop.sh

The .develop.sh script contains a hard-coded memory allocation for Apache Spark, set via the Java Virtual Machine options, which should be configured for your specific environment. For example, to change from 4 gigabytes to 8 gigabytes, change:

-e JAVA_OPTS="-Xmx4g" \

to

-e JAVA_OPTS="-Xmx8g" \
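The same change can be scripted rather than edited by hand. The following is a minimal sketch, assuming the script contains a single -Xmx setting in the form shown above; set_heap is a hypothetical helper and is not part of this repository.

```shell
#!/bin/sh
# Sketch: print a copy of a launch script with the JVM heap size replaced.
# set_heap is a hypothetical helper, not something shipped with arc-starter.
set_heap() {
  # $1 = path to the script, $2 = new heap size such as 8g
  # Matches -Xmx followed by digits and a k/m/g unit suffix.
  sed "s/-Xmx[0-9][0-9]*[kKmMgG]/-Xmx$2/" "$1"
}

# Example: preview the change without touching the original file.
#   set_heap .develop.sh 8g | grep Xmx
```

The helper writes to standard output instead of editing in place, so the result can be reviewed before it replaces the original script.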

How to execute

By default, everything in the notebook is executed as an Arc stage.

If needed, SQL can be executed directly by using the Jupyter %sql magic, which can speed up development:

%sql numRows=10 truncate=100 outputView=green_tripdata0
SELECT * 
FROM green_tripdata0_raw
WHERE fare_amount < 10
  • numRows specifies the number of rows to display in the table.
  • truncate specifies the maximum character length of any output string.
  • outputView registers the result as a Spark view so it can be referenced in later stages.

These other 'magics' have been defined:

  • %env which allows setting job variables via the notebook (e.g. %env ETL_CONF_KEY0=value0 ETL_CONF_KEY1=value1). These can be used in both %arc and %sql stages.
  • %metadata which will try to create and print the correct Arc metadata file for the supplied view.
  • %printschema which will print the Spark schema in a simple text mode.
  • %schema which will print the Spark schema of a view.
  • %summary which will print summary statistics of a view.
  • %version which will print relevant versions.

Exporting

To export an Arc job, use the option provided in the File > Download as menu, which exports all the Arc stages from the notebook and creates a job file. Note that Jupyter Notebook has been modified so that the .ipynb file does not save any output datasets, which prevents data from being accidentally committed to version control.


Issues

Important:

If you are running Docker for Mac or Docker for Windows, ensure that the Docker memory allocation is large enough to support the JVM memory requested (-Xmx4g):

[Screenshots: Docker for Mac memory settings, Docker for Windows memory settings]
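The allocation can also be checked from the command line. The sketch below assumes the docker CLI is installed and the daemon is running; heap_fits is a hypothetical helper that only compares two numbers. The JVM uses some memory beyond its heap, so treat the result as a lower bound rather than a guarantee.

```shell
#!/bin/sh
# Sketch: check whether a memory total (in bytes) covers a JVM heap given in GiB.
# heap_fits is a hypothetical helper, not something shipped with arc-starter.
heap_fits() {
  # $1 = memory available to Docker in bytes, $2 = heap size in GiB
  required=$(( $2 * 1024 * 1024 * 1024 ))
  [ "$1" -ge "$required" ]
}

# Example: compare the Docker total against the default -Xmx4g.
#   total=$(docker info --format '{{.MemTotal}}')
#   if heap_fits "$total" 4; then echo "ok"; else echo "increase the Docker memory allocation"; fi
```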

Screenshot

ARC in Jupyter Notebooks

About

A starter project to define Arc Data Transformation Pipelines (https://aglenergy.github.io/arc/) within Jupyter Notebooks
