yelp-challenge

Project to demonstrate basic data engineering skills.

How to Use It (The Easy Path)

  1. Clone the repository:

    git clone git@github.com:pavel-filatov/yelp-challenge.git

    Alternatively, download only the bash script run_docker_and_prepare_environment.sh.

  2. Download the Yelp dataset.

  3. Run the bash script to download the Docker image and prepare the environment:

    bash run_docker_and_prepare_environment.sh </yelp/data/directory/path.tar>

    This script will:

    1. run the Docker container in detached mode, publishing port 4040 so that Spark jobs can be inspected from the host,
    2. copy the Yelp data into the container,
    3. open an interactive bash session in the container.

    Note that the Docker image used here (pfilatov/spark-cassandra) will be downloaded if it is not already present locally.

  4. Inside a container, run:

    bash ingest_yelp_data_into_cassandra.sh

    What this script does:

    1. creates the keyspace and tables in Cassandra,
    2. runs the Spark application for data ingestion.

    IMPORTANT: This script may fail several times with the following message:

    Connection error: ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9042)]. Last error: Connection refused")})

    This happens when Cassandra has not finished starting yet. Wait a little and run the script again.

  5. Once the ingestion app has completed, you can explore the data in Cassandra using cqlsh.

  6. To exit the container, type exit.

  7. To open a shell in the running container again, use docker exec -it spark-cassandra bash.

  8. To stop the container (without removing the data), use docker stop spark-cassandra.

  9. To start it again, use docker start spark-cassandra.

  10. To remove the container completely (including the data), use docker rm -f spark-cassandra.
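
Step 4 notes that the ingestion script can fail with "Connection refused" until Cassandra accepts connections. Instead of re-running it by hand, a small retry wrapper can do the waiting. The following is a minimal sketch, assuming the script is safe to run repeatedly; the retry function, the attempt count, and the delay are illustrative and are not part of this repository:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repository): re-run a command
# until it succeeds, pausing between attempts.
#
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local max_attempts=$1
  local delay=$2
  shift 2
  local attempt=1

  # Keep running the command until it exits with status 0.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s..." >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Inside the container (assumed limits: up to 10 attempts, 15 seconds apart):
# retry 10 15 bash ingest_yelp_data_into_cassandra.sh
```

Note that this wrapper retries on any non-zero exit status, not only on the connection error shown above, so a failure with a different cause will also be retried until the attempt limit is reached.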
