ArmanShakeri/Data-Engineering-challenge

Data-Engineering-challenge

This Python code is a solution to the Data Engineering challenge. Points to consider:

  • The given impressions sample file contains many duplicate records. I assumed that the "id" column must be unique, so I applied a distinct operation and dropped the duplicates.
  • Records that do not conform to the JSON schema are ignored and printed to the console as warnings when the code runs.
  • In the final result of section 2, some aggregated records have a click count greater than their impression count. This does not seem logical, so I assumed that either the sample file contains inconsistencies or I am not familiar with the logic behind the data.
  • The input files must be placed in the input directory.
  • The results of the challenge are written to the output directory.
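The first two points can be illustrated with a minimal sketch. This is not the code in main.py: the function name `validate_and_dedupe` and the `REQUIRED_FIELDS` mapping are hypothetical, and the check below is only a stand-in for the real JSON schema, which is not shown in this README. It assumes each record is a dict with a string "id".

```python
import warnings

# Hypothetical minimal schema: the real JSON schema is not shown in this README.
REQUIRED_FIELDS = {"id": str}


def validate_and_dedupe(records):
    """Skip records that fail the minimal check, then keep the first record per id."""
    seen_ids = set()
    kept = []
    for record in records:
        valid = isinstance(record, dict) and all(
            isinstance(record.get(field), expected)
            for field, expected in REQUIRED_FIELDS.items()
        )
        if not valid:
            # Invalid records are reported as warnings and ignored.
            warnings.warn(f"Ignoring invalid record: {record!r}")
            continue
        if record["id"] in seen_ids:
            # Duplicate id: keep only the first occurrence.
            continue
        seen_ids.add(record["id"])
        kept.append(record)
    return kept
```

Keeping the first occurrence per id is one possible reading of "distinct"; if duplicates can differ in other columns, a different tie-breaking rule may be needed.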

Steps to run

Step 1 Install Python 3.10 and the packages listed in requirements.txt:

pip3 install -r requirements.txt

Step 2 Place the impression and click files in the input directory.

Step 3 Change to the project directory and run this command:

python3.10 main.py

Step 4 Enter the file names, separated by commas, for example: file1.json,file2.json,file3.json. The code asks for two lists of files: impressions and clicks.
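As a rough sketch of how such comma-separated input could be turned into paths under the input directory (the helper name `parse_file_list` is hypothetical and may differ from what main.py does):

```python
from pathlib import Path


def parse_file_list(raw, input_dir="input"):
    """Split a comma-separated string into paths inside input_dir."""
    # Strip surrounding whitespace and skip empty entries such as "a.json,,b.json".
    names = [name.strip() for name in raw.split(",") if name.strip()]
    return [Path(input_dir) / name for name in names]
```

Under this assumption, entering `file1.json, file2.json` would resolve to `input/file1.json` and `input/file2.json`.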

Step 5 See the results in the output directory. The section 2 report is named like section2_YYYYMMDDHHMISS.json and the section 3 report like section3_YYYYMMDDHHMISS.json.
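The timestamped names could be produced along these lines. This is a sketch only: `report_name` is a hypothetical helper, and it assumes YYYYMMDDHHMISS means year, month, day, hour, minute, second in local time.

```python
from datetime import datetime


def report_name(section, now=None):
    """Build a report file name such as section2_YYYYMMDDHHMISS.json."""
    # Use the current local time unless a timestamp is passed in explicitly.
    now = now or datetime.now()
    return f"section{section}_{now.strftime('%Y%m%d%H%M%S')}.json"
```

Because the timestamp sorts lexicographically, the most recent report for a section is the last one when the output directory is listed by name.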
