
Local Docker container to develop in AWS Glue

This container was created because the official image has problems running on ARM environments such as the Raspberry Pi and MacBooks with Apple M1 chips.

With this container, you can run Spark code in Python or Scala and use the AWS Glue context and AWS libraries.

For example, you can do the following:

  • Create DynamicFrames
  • Read from and write to S3
  • Read tables through Athena
  • Use other AWS services
  • etc.

Tested in:

  • AMD64
  • ARM64

Based on the official AWS documentation for developing Glue jobs locally with Python.

Docker image on Docker Hub: aws-glue-local-interpreter

Requirements:

  • AWS CLI
    • macOS:
      brew install awscli
    • Ubuntu:
      sudo apt-get install awscli
    • Windows: see the AWS docs
  • You must have AWS credentials in the path: ~/.aws

There must be two files:

  • config
  • credentials

Examples:

  • ~/.aws/config

    [default]
    region = us-east-1
    output = json

  • ~/.aws/credentials

    [default]
    aws_access_key_id = XXXXXXXXXXXXXXXXXXXXXXXX
    aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXX

If you would like to check access to S3, run:

    aws s3 ls

You should get a list of the buckets in your S3 account.
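
If you would also like to verify the credentials from Python inside the container, here is a minimal sketch using boto3 (assuming boto3 is installed in the image; if not, install it with pip):

    import boto3

    # Lists your S3 buckets using the credentials mounted at /root/.aws
    s3 = boto3.client("s3")

    for bucket in s3.list_buckets()["Buckets"]:
        print(bucket["Name"])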

How to use:

  • With a bash command
  • With docker-compose

Both use the image:

    anthonypernia/aws-glue-local-interpreter

Bash command

    docker run -p 8888:8888 -v ~/.aws:/root/.aws:ro --name aws-glue-local-interpreter anthonypernia/aws-glue-local-interpreter

This creates a container named aws-glue-local-interpreter and mounts ~/.aws to /root/.aws so that the container uses the same credentials as your host.

Docker-compose

You must have a file named:

  • docker-compose.yml

With this content:

    version: '3'
    services:
      aws-glue-local-interpreter:
        image: "anthonypernia/aws-glue-local-interpreter"
        volumes:
          - ~/.aws:/root/.aws
          - ~/aws-glue-developments:/root/developments ##(OPTIONAL)
        ports:
          - "8888:8888"
    

Then run:

    docker-compose up

You can add another volume where scripts are stored and edited locally and executed in the container. In this case, the folder "aws-glue-developments" is used.

When the container is running, go to:

  • http://localhost:8888/
  • or {serverIP}:8888

You should see a Jupyter notebook running.

Using VS Code to develop in the container

Using the VS Code Remote Development extension, you can launch VS Code inside the container.

GlueContext

Creating GlueContext and SparkContext

Create a notebook and run the following code:

    from pyspark import SparkContext
    from awsglue.context import GlueContext
    from pyspark.sql import SQLContext
    
    
    def get_gluecontext() -> GlueContext:
        """Get the glue context
        Returns:
            GlueContext: Glue context
        """    
        sc = SparkContext.getOrCreate()
        return GlueContext(sc)
    
    
    def get_spark_context() -> SparkContext:
        """Get the spark context
        Returns:
            SparkContext: Spark context
        """    
        return SparkContext.getOrCreate()
    
    
    def get_spark_sql_context(sparkContext: SparkContext) -> SQLContext:
        """Get the spark sql context
        Args:
            sparkContext (SparkContext): Spark context
        Returns:
            SQLContext: Spark sql context
        """    
        return SQLContext(sparkContext)
    
    
    glueContext: GlueContext = get_gluecontext()
    sparkContext: SparkContext = get_spark_context()
    sqlContext: SQLContext = get_spark_sql_context(sparkContext)
    
(Screenshot: GlueContext, SparkContext, and SQLContext created in the notebook)
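
The same GlueContext can also read tables registered in the Glue Data Catalog (the catalog queried by Athena). A minimal sketch, using hypothetical database and table names that you would replace with entries from your own catalog:

    from awsglue.dynamicframe import DynamicFrame

    # "example_database" and "example_table" are hypothetical names;
    # replace them with entries from your own Glue Data Catalog.
    catalog_df: DynamicFrame = glueContext.create_dynamic_frame.from_catalog(
        database="example_database",
        table_name="example_table",
    )
    catalog_df.printSchema()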

Creating DynamicFrame and DataFrame

After creating the contexts:

    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql import DataFrame
    
    
    def create_df_from_path(glueContext: GlueContext, path: str, format_file: str) -> DynamicFrame:
        """Create a dataframe from a path
        Args:
            glueContext (GlueContext): Glue context
            path (str): Path to read
            format_file (str): Format of the file
        Returns:
            DynamicFrame: DynamicFrame
        """    
        return glueContext.create_dynamic_frame_from_options(connection_type = "s3", connection_options = {"paths": [path]}, format = format_file)
    
    
    def create_spark_df_from_path(sqlContext: SQLContext, path: str, format_file: str) -> DataFrame:
        """Create a spark dataframe from a path
        Args:
            sqlContext (SQLContext): Spark sql context
            path (str): Path to read
            format_file (str): Format of the file
        Returns:
            DataFrame: Spark dataframe
        """    
        return sqlContext.read.format(format_file).load(path)
    
    
    path: str = "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"
    format_file: str = "json"
    
    df: DynamicFrame = create_df_from_path(glueContext, path, format_file)
    
    spark_df: DataFrame = create_spark_df_from_path(sqlContext, path, format_file)
    
(Screenshot: DynamicFrame and DataFrame loaded in the notebook)
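
A DynamicFrame can be converted to a Spark DataFrame and back, which is useful when mixing Glue transforms with plain Spark code. A minimal sketch, reusing df, spark_df, and glueContext from the cells above:

    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql import DataFrame

    # DynamicFrame -> Spark DataFrame
    converted_spark_df: DataFrame = df.toDF()

    # Spark DataFrame -> DynamicFrame (the last argument is just a name for the frame)
    converted_dynamic_frame: DynamicFrame = DynamicFrame.fromDF(spark_df, glueContext, "converted")

    converted_spark_df.printSchema()
    converted_spark_df.show(5)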

Write data to S3

    def write_spark_df(df: DataFrame, bucket: str, key: str) -> None:
        """Write a dataframe to S3 in parquet format
        Args:
            df (DataFrame): Dataframe to write
            bucket (str): S3 bucket
            key (str): S3 key
        """    
        df.write.parquet(f"s3://{bucket}/{key}", mode="overwrite")
    
    
    bucket: str = "example-bucket-demo-aws"
    key: str = "test-folder-output"
    write_spark_df(spark_df, bucket, key)
    
(Screenshot: AWS console showing the data that was written)
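
You can also write a DynamicFrame directly through the GlueContext instead of converting it to a Spark DataFrame first. A minimal sketch with a hypothetical helper and output prefix, reusing glueContext, df, and bucket from the cells above:

    def write_dynamic_frame(glueContext: GlueContext, dyf: DynamicFrame, bucket: str, key: str) -> None:
        """Write a DynamicFrame to S3 in parquet format (hypothetical helper)."""
        glueContext.write_dynamic_frame.from_options(
            frame=dyf,
            connection_type="s3",
            connection_options={"path": f"s3://{bucket}/{key}"},
            format="parquet",
        )


    write_dynamic_frame(glueContext, df, bucket, "test-folder-output-dynamic")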