bgrayburn/itemSetCount
Why I made this code

This code was written out of a need to count groupings within a large number of user events generated by a web app. See the section "What it is and isn't" before trying to use this code.

Quickstart

Assuming you have MongoDB running, set the addresses, database names, and collection names in both common_multicore.py and common_item_analysis_agg.py.
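
As a rough sketch, the settings that need to match up in the two files might look something like the following. The variable names here are hypothetical, chosen for illustration only; check the actual identifiers in each file.

    # Hypothetical sketch of the Mongo settings that must be edited (and
    # kept identical) in common_multicore.py and common_item_analysis_agg.py.
    # The real variable names in those files may differ.
    MONGO_ADDRESS = "localhost:27017"     # address of the MongoDB server
    DB_NAME = "item_set_count"            # database both scripts read/write
    INPUT_COLLECTION = "transactions"     # raw pipe-delimited rows are loaded here
    OUTPUT_COLLECTION = "itemset_counts"  # aggregated frequencies are written here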

Running ./run_it_all.sh from the root folder of this repo will run the complete workflow on test data located in the /test/input directory. That directory also contains a script for regenerating the test data in case you want to vary its parameters. Outputs of the workflow are written to /test/output.

What it is and isn't

This repo uses Python multiprocessing and MongoDB to accomplish a (local) MapReduce job that measures the frequencies of items and sets of items in a list of transactions. The code can easily be modified to perform a number of MapReduce-style jobs, including word count, large-dataset filtering, and joining a large dataset with one or more small datasets.
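
To make the shape of the job concrete, below is a minimal, self-contained sketch of the counting step. It uses Python multiprocessing as this repo does, but it skips the MongoDB load/store stages, and none of the function names come from this repo.

    # Minimal map/reduce sketch: count item sets (here, singletons and pairs)
    # in pipe-delimited rows, where each row is one transaction.
    # Illustrative only; the real workflow stages data through MongoDB.
    from collections import Counter
    from itertools import combinations
    from multiprocessing import Pool

    def count_itemsets(row, max_size=2):
        """Map step: emit every item subset of one row up to max_size."""
        items = sorted(set(row.strip().split("|")))
        counts = Counter()
        for k in range(1, max_size + 1):
            for subset in combinations(items, k):
                counts[subset] += 1
        return counts

    def count_rows(rows, processes=4):
        """Reduce step: merge per-row counters into one frequency table."""
        total = Counter()
        with Pool(processes) as pool:
            for partial in pool.imap_unordered(count_itemsets, rows, chunksize=1000):
                total.update(partial)
        return total

    if __name__ == "__main__":
        rows = ["a|b|c", "a|b", "b|c", "a|b|c"]
        for itemset, freq in count_rows(rows).most_common():
            print(itemset, freq)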

This code is not meant to provide a generalized framework for parallel processing or MapReduce. Instead, it is meant to provide an example of a targeted solution for accomplishing such tasks when complete code transparency and accessibility are necessary. I recommend you try a library like mrjob or a framework like Spark before writing your own parallelization code. Asking questions of data is typically most successful when done quickly, and relying on libraries and frameworks to handle as much of the complexity as possible facilitates that speed.

Code structure

The core of this code lives in two files: common_multicore.py and common_item_analysis_agg.py.

Some information is necessarily duplicated between these two files, and if the copies fall out of agreement things will break. Ideally this redundancy will be removed in the future for robustness, but for now it is kept to keep things simple. The current development focus is to measure performance, take advantage of any quick gains there, and increase ease of use by reducing the chance of human error.

Both of the files mentioned at the beginning of this section specify a Mongo database name and a set of collection names. These must agree for common_item_analysis_agg.py to operate on the data loaded by common_multicore.py.
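
One lightweight guard against that drift, assuming the two modules expose their settings as importable module-level constants (DB_NAME and COLLECTION_NAMES below are hypothetical names, not identifiers from this repo), is a startup assertion:

    # Hypothetical consistency check; DB_NAME and COLLECTION_NAMES are
    # illustrative names, not the actual identifiers in this repo.
    import common_multicore as mc
    import common_item_analysis_agg as agg

    assert mc.DB_NAME == agg.DB_NAME, "Mongo database names are out of sync"
    assert mc.COLLECTION_NAMES == agg.COLLECTION_NAMES, "collection names are out of sync"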

Future and contributions

  1. Use the Apriori algorithm to prune higher-order combinations (a sketch of the pruning rule follows this list)
  2. Benchmark (single vs. multiple processes)
  3. Allow for arbitrary jobs (separate job-code .py file? Is this too much like a framework?)
  4. Use a common config file to specify the Mongo database name and collection names
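
For context on item 1, Apriori's pruning rule says a candidate item set of size k can be frequent only if every one of its size k-1 subsets is frequent. A minimal sketch of that rule (illustrative code, not from this repo):

    # Apriori candidate pruning: keep a size-k candidate only if every
    # (k-1)-item subset of it was already found frequent.
    # Illustrative only; not code from this repo.
    from itertools import combinations

    def prune_candidates(candidates, frequent_smaller):
        """candidates: size-k item sets (tuples of items);
        frequent_smaller: set of frequent size-(k-1) item sets (tuples)."""
        kept = []
        for cand in candidates:
            if all(sub in frequent_smaller
                   for sub in combinations(cand, len(cand) - 1)):
                kept.append(cand)
        return kept

    # With frequent pairs {("a","b"), ("b","c")}, the triple ("a","b","c")
    # is pruned because its subset ("a","c") is not frequent.
    print(prune_candidates([("a", "b", "c")], {("a", "b"), ("b", "c")}))  # []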

About

A Python script, callable from the command line, that counts the frequency of item sets in a pipe-delimited file where each row is a set.
