
Challenge: NEAZYIT submission. #168

Open
NEAZYIT wants to merge 4 commits into main
Conversation

@NEAZYIT commented Feb 25, 2024

No description provided.

- Introduced a modular data processing pipeline using Dask for efficient handling of large datasets.
- Implemented the following functions for distinct tasks:

    1. read_data(input_file): Reads CSV data using Dask, handling decoding issues and creating a Dask DataFrame.

    2. validate_prices(data): Filters out prices outside the range of 1.00 to 100.00.

    3. clean_data(data): Removes missing values and duplicate entries, providing a cleaned Dask DataFrame.

    4. group_and_aggregate(data): Groups data by city and computes the total prices, returning a Dask Series.

    5. sort_and_select_top_products(data, cheapest_city): Filters data for the cheapest city, sorts, and selects the top 5 products.

    6. write_results(output_file, cheapest_city, total_price, sorted_products): Writes results to an output file.

- The main process_data(input_file, output_file) function orchestrates these steps, ensuring a structured flow.

- Implemented a logging system that captures errors and important information for debugging and monitoring.

- Maintained clarity and adherence to best practices for readability and maintainability.

To test, run the provided example with 'input.txt' as the input file and 'output.txt' as the output file; a sketch of how the pieces might fit together is shown below.
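Based on the steps described above, here is a minimal, hedged sketch of how such a pipeline could be wired together. The column names (`city`, `product`, `price`), the headerless CSV layout, and the output format are assumptions made for illustration and are not taken from the actual submission:

```python
import logging

import dask.dataframe as dd

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)

# Assumption: the input is a headerless CSV of city,product,price rows.
COLUMNS = ["city", "product", "price"]


def read_data(input_file):
    """Read the CSV lazily into a Dask DataFrame, tolerating decoding issues."""
    return dd.read_csv(
        input_file,
        header=None,
        names=COLUMNS,
        encoding="utf-8",
        encoding_errors="ignore",  # skip undecodable bytes instead of failing
    )


def validate_prices(data):
    """Keep only rows whose price falls within the accepted 1.00-100.00 range."""
    return data[(data["price"] >= 1.00) & (data["price"] <= 100.00)]


def clean_data(data):
    """Drop rows with missing values, then remove duplicate entries."""
    return data.dropna().drop_duplicates()


def group_and_aggregate(data):
    """Total price per city, as a lazy Dask Series."""
    return data.groupby("city")["price"].sum()


def sort_and_select_top_products(data, cheapest_city):
    """Five cheapest products in the cheapest city."""
    city_data = data[data["city"] == cheapest_city].compute()
    return city_data.sort_values("price").head(5)


def write_results(output_file, cheapest_city, total_price, sorted_products):
    """Write the cheapest city, its total, and its top products to a file."""
    with open(output_file, "w") as f:
        f.write(f"{cheapest_city} {total_price:.2f}\n")
        for _, row in sorted_products.iterrows():
            f.write(f"{row['product']} {row['price']:.2f}\n")


def process_data(input_file, output_file):
    """Orchestrate the pipeline end to end, logging any failure."""
    try:
        data = clean_data(validate_prices(read_data(input_file)))
        totals = group_and_aggregate(data).compute()
        cheapest_city = totals.idxmin()
        top_products = sort_and_select_top_products(data, cheapest_city)
        write_results(output_file, cheapest_city,
                      totals[cheapest_city], top_products)
        logger.info("Wrote results for %s to %s", cheapest_city, output_file)
    except Exception:
        logger.exception("Pipeline failed")
        raise


if __name__ == "__main__":
    process_data("input.txt", "output.txt")
```

Each step returns a lazy Dask object, so nothing is materialized until the explicit `.compute()` calls in `process_data` and `sort_and_select_top_products`; that laziness is what makes the pipeline usable on datasets larger than memory.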
@aboullaite
Collaborator

Please clean up your PR! You are adding too many unnecessary files. Follow the submission guide in the README.

@aboullaite aboullaite added the invalid This doesn't seem right label Feb 25, 2024
@NEAZYIT NEAZYIT closed this Feb 25, 2024
@NEAZYIT NEAZYIT reopened this Feb 25, 2024
@NEAZYIT
Author

NEAZYIT commented Feb 25, 2024

I have cleaned up my PR. Could you please check it now?
