Local Docker container to develop in AWS Glue

This container was created because the official image has problems running on ARM environments such as Raspberry Pi and MacBooks with M1 chips.

With this container, you can run Spark code in Python or Scala and use the AWS Glue context and AWS libraries.

For example, you can do the following:

  • Create DynamicFrame
  • Read and write in S3
  • Read tables in Athena
  • Use AWS Services
  • Etc

Tested in:

  • AMD64
  • ARM64

Based on the official AWS documentation: Development local python

Docker image in DockerHub: aws-glue-local-interpreter


Requirements:

  • AWS CLI
    • MacOS :
      brew install awscli
    • Ubuntu :
      sudo apt-get install awscli
    • Windows : AWS docs
  • You must have AWS credentials in the path: ~/.aws

There must be two files:

  • config
  • credentials


  • ~/.aws/config
    [default]
    region = us-east-1
    output = json
  • ~/.aws/credentials
    [default]
    aws_access_key_id = XXXXXXXXXXXXXXXXXXXXXXXX
    aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXX
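Before starting the container, the two files can be sanity-checked from Python. The sketch below is only an illustration: the helper name `check_aws_files` and the `EXPECTED_KEYS` table are hypothetical, and it assumes both files are INI-style with a `[default]` section. It only checks that the keys are present; it does not validate the credentials against AWS.

```python
import configparser
from pathlib import Path

# Keys this sketch expects in the [default] section of each file (assumption).
EXPECTED_KEYS = {
    "config": ["region"],
    "credentials": ["aws_access_key_id", "aws_secret_access_key"],
}


def check_aws_files(aws_dir: Path) -> dict:
    """Report which expected keys are present under [default] in aws_dir."""
    report = {}
    for filename, keys in EXPECTED_KEYS.items():
        parser = configparser.ConfigParser()
        try:
            # read() silently skips missing files, so a missing file simply
            # shows up as missing keys below.
            parser.read(aws_dir / filename)
        except configparser.Error:
            # Malformed file (for example, keys without a section header).
            pass
        for key in keys:
            report[f"{filename}:{key}"] = parser.has_option("default", key)
    return report
```

Calling `check_aws_files(Path.home() / ".aws")` returns a dictionary such as `{"config:region": True, ...}`; any `False` entry points at a missing file, section, or key.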

    If you would like to check the access to S3:

    aws s3 ls

    You should get a list of buckets in S3.

    How to use:

    • By bash command
    • With docker-compose

    Both use the Docker image anthonypernia/aws-glue-local-interpreter.


    Bash command

    docker run -p 8888:8888 -v ~/.aws:/root/.aws:ro --name aws-glue-local-interpreter anthonypernia/aws-glue-local-interpreter

    This creates the container aws-glue-local-interpreter and mounts the local path ~/.aws at /root/.aws (read-only), so the container uses the same credentials.
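If the container is launched from a script rather than typed by hand, the same command can be assembled as an argument list. This is a minimal sketch: the function name `build_docker_run_args` and its parameters are hypothetical, and it only builds the command without running it.

```python
import shlex
from pathlib import Path


def build_docker_run_args(
    image: str = "anthonypernia/aws-glue-local-interpreter",
    name: str = "aws-glue-local-interpreter",
    port: int = 8888,
    aws_dir: Path = Path.home() / ".aws",
) -> list:
    """Build the argument list for the docker run command shown above."""
    return [
        "docker", "run",
        "-p", f"{port}:{port}",            # publish the Jupyter port
        "-v", f"{aws_dir}:/root/.aws:ro",  # share credentials read-only
        "--name", name,
        image,
    ]


args = build_docker_run_args()
# Shell-ready string for inspection; pass `args` to subprocess.run to launch.
print(shlex.join(args))
```

Keeping the arguments in a list avoids shell-quoting problems if the credentials directory path contains spaces.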


    With docker-compose

    You must have a file named:

    • docker-compose.yml

    With this content:

    version: '3'
    services:
      aws-glue-local-interpreter:  # service name is illustrative
        image: "anthonypernia/aws-glue-local-interpreter"
        volumes:
          - ~/.aws:/root/.aws
          - ~/aws-glue-developments:/root/developments ##(OPTIONAL)
        ports:
          - "8888:8888"

    Then run:

    docker-compose up

    You can add another volume where scripts are stored and edited locally and executed in the container. In this case, the folder "aws-glue-developments" is used.

    When the container is running, go to:

  • http://localhost:8888/
  • {serverIP}:8888

    You should see a Jupyter notebook running.
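If the page does not load, it can help to check whether anything is listening on the published port at all. The following is a minimal sketch using only the standard library; the helper name `is_port_open` is hypothetical, and a successful TCP connection only shows that the port is reachable, not that Jupyter itself is healthy.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8888, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False
```

For a remote machine, pass the server address instead of `localhost`, for example `is_port_open("192.0.2.10")` (a placeholder address).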

    Using VS Code to develop with containers

    With the VS Code Remote Development extension, you can open VS Code inside the container.


    Creating GlueContext and SparkContext

    Create a notebook and run the code:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from awsglue.context import GlueContext

    def get_gluecontext() -> GlueContext:
        """Get the glue context

        Returns:
            GlueContext: Glue context
        """
        sc = SparkContext.getOrCreate()
        return GlueContext(sc)

    def get_spark_context() -> SparkContext:
        """Get the spark context

        Returns:
            SparkContext: Spark context
        """
        return SparkContext.getOrCreate()

    def get_spark_sql_context(sparkContext: SparkContext) -> SQLContext:
        """Get the spark sql context

        Args:
            sparkContext (SparkContext): Spark context

        Returns:
            SQLContext: Spark sql context
        """
        return SQLContext(sparkContext)

    glueContext: GlueContext = get_gluecontext()
    sparkContext: SparkContext = get_spark_context()
    sqlContext: SQLContext = get_spark_sql_context(sparkContext)
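If the imports above fail, the notebook is probably running on a kernel outside the container. A quick way to confirm this is to check whether the two top-level packages can be found. This is only a sketch; the helper name `missing_modules` is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages imported by the context-creation code; an empty list means both resolve.
print(missing_modules(["pyspark", "awsglue"]))
```

Inside the container the list is expected to be empty; a non-empty list names the packages the current interpreter cannot see.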

    Creating DynamicFrame and DataFrame

    After creating the contexts:

    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql import DataFrame

    def create_df_from_path(glueContext: GlueContext, path: str, format_file: str) -> DynamicFrame:
        """Create a DynamicFrame from a path

        Args:
            glueContext (GlueContext): Glue context
            path (str): Path to read
            format_file (str): Format of the file

        Returns:
            DynamicFrame: DynamicFrame
        """
        return glueContext.create_dynamic_frame_from_options(connection_type="s3", connection_options={"paths": [path]}, format=format_file)

    def create_spark_df_from_path(sqlContext: SQLContext, path: str, format_file: str) -> DataFrame:
        """Create a spark dataframe from a path

        Args:
            sqlContext (SQLContext): Spark sql context
            path (str): Path to read
            format_file (str): Format of the file

        Returns:
            DataFrame: Spark dataframe
        """
        # The original function body is missing; reading through the generic
        # DataFrameReader is an assumed reconstruction.
        return sqlContext.read.format(format_file).load(path)

    path: str = "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"
    format_file: str = "json"
    df: DynamicFrame = create_df_from_path(glueContext, path, format_file)
    spark_df: DataFrame = create_spark_df_from_path(sqlContext, path, format_file)
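Since the section works with both a DynamicFrame and a Spark DataFrame, it may be useful to convert between the two. The sketch below wraps the `toDF()` and `DynamicFrame.fromDF()` calls from the awsglue library; the wrapper names `to_spark_df` and `to_dynamic_frame` and the default frame name `"converted"` are hypothetical.

```python
def to_spark_df(dynamic_frame):
    """Convert a Glue DynamicFrame to a Spark DataFrame via DynamicFrame.toDF()."""
    return dynamic_frame.toDF()


def to_dynamic_frame(spark_df, glue_context, name: str = "converted"):
    """Wrap a Spark DataFrame in a Glue DynamicFrame via DynamicFrame.fromDF()."""
    # Imported lazily so this module can also be loaded outside the container.
    from awsglue.dynamicframe import DynamicFrame

    return DynamicFrame.fromDF(spark_df, glue_context, name)
```

In the notebook, `to_spark_df(df).printSchema()` would show the inferred schema of the DynamicFrame created earlier, and `to_dynamic_frame(spark_df, glueContext)` goes the other way.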

    Write data in S3

    def write_spark_df(df: DataFrame, bucket: str, key: str) -> None:
        """Write a dataframe to S3 in parquet format

        Args:
            df (DataFrame): Dataframe to write
            bucket (str): S3 bucket
            key (str): S3 key
        """
        df.write.parquet(f"s3://{bucket}/{key}", mode="overwrite")

    bucket: str = "example-bucket-demo-aws"
    key: str = "test-folder-output"
    write_spark_df(spark_df, bucket, key)
    (Screenshot: AWS console showing the data that was written.)
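The write function joins the bucket and key with a plain f-string, so a stray leading or trailing slash would end up in the output path. A small helper can normalize this before writing. This is a sketch only; the name `build_s3_uri` is hypothetical and not part of the repository.

```python
def build_s3_uri(bucket: str, key: str) -> str:
    """Join a bucket and key into an s3:// URI without duplicate slashes."""
    bucket = bucket.strip("/")
    if not bucket:
        raise ValueError("bucket must not be empty")
    return f"s3://{bucket}/{key.lstrip('/')}"


print(build_s3_uri("example-bucket-demo-aws", "test-folder-output"))
# prints: s3://example-bucket-demo-aws/test-folder-output
```

The resulting string can be passed directly to `df.write.parquet(...)` in place of the inline f-string.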