Data-Engineering-challenge

This Python code is for Data Engineering challenge. Points to consider:

The given impressions sample file has many duplicate records and my assumption was that "id" column must be unique so I used distinct and I dropped duplicated records.
Based on json schema some records are ignored and they are displayed on console as a warning when the code is executed.
In the final result of section 2, clicks count of some aggregated records are greater than count of impressions and it is known that it doesn't seem logical, so I assumed either there are some inconsistencies in the given sample file or I am not familiar with the logic behind it.
The input files must be placed in the input directory.
the result of the challenge will be placed in the output directory.

Step to run

Step 1 install python3.10 and packages in requirements.txt

pip3 install -r requirements.txt

Step 2 place impression and click files in input directory.

Step 3 Change directory to project location and run this syntax:

python3.10 main.py

Step 4 Enter file names and seperate them with commas. for example: file1.json,file2.json,file3.json This code gets two lists of files: impressions and clicks.

Step 5 See the result in output directory. The report of section 2 is like section2_YYYYMMDDHHMISS.json and section 3 is like section3_YYYYMMDDHHMISS.json.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
input		input
output		output
LICENSE		LICENSE
README.md		README.md
main.py		main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

input

input

output

output

LICENSE

LICENSE

README.md

README.md

main.py

main.py

Repository files navigation

Data-Engineering-challenge

Step to run

About

Releases

Packages

Languages

License

ArmanShakeri/Data-Engineering-challenge

Folders and files

Latest commit

History

Repository files navigation

Data-Engineering-challenge

Step to run

About

Resources

License

Stars

Watchers

Forks

Languages