
Frequent Itemsets

Project submitted as part of the Data Mining and Machine Learning course taught by Madhavan Mukund.

We wrote a Python 3 program that computes frequent itemsets with the Apriori algorithm on the UCI Bag of Words dataset.

We read the bag-of-words files as input and computed frequent itemsets of size up to 10 at different minimum frequencies for the three datasets: Kos, Enron Emails, and NIPS full papers.
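
The repository's exact reader isn't reproduced here, but the UCI Bag of Words files (e.g. docword.kos.txt) share a fixed layout: three header lines (document count, vocabulary size, number of non-zero entries) followed by one "docID wordID count" triple per line. A loader along the following lines would produce the per-document word sets that the Apriori passes operate on; the function name is illustrative, not taken from the code.

```python
# Hypothetical loader for a UCI Bag of Words file such as docword.kos.txt.
def read_bag_of_words(path):
    with open(path) as f:
        num_docs = int(f.readline())
        f.readline()  # vocabulary size, not needed for itemset mining
        f.readline()  # number of non-zero entries, also unused here
        docs = [set() for _ in range(num_docs)]
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip any blank trailing lines
            doc_id, word_id = int(parts[0]), int(parts[1])
            # Treat each document as the set of word IDs it contains;
            # the per-document count is irrelevant for support counting.
            docs[doc_id - 1].add(word_id)
    return docs
```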

Parameters

For the different datasets, we used different minimum frequencies, chosen experimentally:

  • For Kos, we used the frequencies 0.5, 0.4, 0.3, 0.25, 0.2, 0.15, 0.1
  • For Enron, we used the frequencies 0.5, 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.04, 0.03
  • For NIPS, we used the frequencies 0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45

For each frequency, the number of frequent itemsets, the running time, and the itemsets themselves are stored in the corresponding output file.
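
A driver along these lines (a sketch under assumptions, not the repository's actual code) shows how a relative frequency can be turned into an absolute support count with math.ceil, how the running time can be measured, and how the results could be written out; the apriori function and the output file naming are hypothetical.

```python
import math
import time

def run_for_frequency(docs, frequency, dataset, max_k=10):
    # Convert the relative frequency threshold into an absolute support count.
    min_support = math.ceil(frequency * len(docs))

    start = time.time()
    levels = apriori(docs, min_support, max_k)  # assumed level-wise Apriori miner
    elapsed = time.time() - start

    # Record the number of frequent itemsets, the running time, and the itemsets.
    with open(f"output_{dataset}_{frequency}.txt", "w") as out:
        out.write(f"frequency {frequency}, minimum support count {min_support}\n")
        out.write(f"running time: {elapsed:.2f} s\n")
        for k, itemsets in enumerate(levels, start=1):
            out.write(f"{len(itemsets)} frequent itemsets of size {k}\n")
            for itemset in itemsets:
                out.write(" ".join(map(str, sorted(itemset))) + "\n")
```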

Dependencies

  • math: for the math.ceil function.
  • time: for measuring running time.
  • copy: to make copies of mutable data types.
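
To make the overall structure concrete, here is a rough sketch of a level-wise Apriori pass (an illustration, not the repository's code): size-k candidates are generated by joining frequent (k-1)-itemsets, pruned if any (k-1)-subset is infrequent, and kept only if their support meets the threshold. The actual implementation reportedly uses copy to duplicate mutable candidate structures between levels; this sketch sidesteps that by building a fresh set at each level.

```python
from itertools import combinations

def apriori(docs, min_support, max_k):
    # Level 1: frequent individual word IDs.
    counts = {}
    for doc in docs:
        for word in doc:
            counts[word] = counts.get(word, 0) + 1
    current = {frozenset([w]) for w, c in counts.items() if c >= min_support}

    levels = [current]
    for k in range(2, max_k + 1):
        # Join step: merge pairs of frequent (k-1)-itemsets into size-k candidates.
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in current for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Count support and keep the candidates that meet the threshold.
        next_level = {c for c in candidates
                      if sum(1 for doc in docs if c <= doc) >= min_support}
        if not next_level:
            break
        levels.append(next_level)
        current = next_level
    return levels
```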

Analysis

We fixed K to be 10. For almost all frequencies, there was some size k < 10 at which no frequent itemsets remained, so the algorithm stopped before reaching K.

Different frequencies turned out to be of interest for different datasets. For Kos, at frequency 0.5 we found only one frequent itemset, of size 1; for NIPS, we found frequent itemsets even at frequency 0.95; for Enron, we had to lower the frequency all the way to 0.15 before any itemsets appeared. We believe this behaviour is explained by the nature of the datasets themselves.

Kos is a dataset of US daily news posts from the time of the 2004 US presidential election. This is reflected in the occurrence of words like "Bush", "Kerry", "democracy", "general", "war", etc. Since many of the articles followed a similar theme, we found frequent itemsets starting at frequency 0.5; by frequency 0.1, the itemsets were no longer meaningful.

The NIPS dataset consists of full research papers from the Conference on Neural Information Processing Systems. Since the documents themselves are much larger, and most papers share common words like "references", "abstract", "neural", "function", "result", etc., we found frequent itemsets at much higher frequencies, starting at 0.95. We lowered the frequency to 0.45, at which point we obtained 78,399 frequent itemsets of size 5.

The Enron dataset consists of e-mails of the Enron Corporation that were released to the public during the FERC investigation after the company's collapse. Since e-mails are much shorter and much more varied in content, we expected the meaningful frequencies to be much lower. Indeed, we found the first frequent itemsets at frequency 0.15; they were common corporate jargon such as "attached", "market", "meeting", "number", "think", etc. Due to the sheer size of the dataset, computation was very slow, and when we lowered the frequency to 0.03 the computation became too time-consuming to continue.
