
[WIP] Implement apriori-gen as in original paper #646

Open

wants to merge 6 commits into master

Conversation


@dbarbier (Contributor) commented Dec 18, 2019

Description

Implement apriori-gen as in the original Apriori paper.
This is a draft PR for discussion; several different changes are proposed, and they should be benchmarked.

  • The first commit implements the join step
  • The second commit implements the prune step
  • The third commit enforces low_memory=True processing; thanks to the previous optimizations, it is now as fast as low_memory=False and requires less memory; frequent itemsets are stored as a list of tuples instead of NumPy arrays
  • The fourth commit replaces the trie implementation by a set
  • The fifth commit replaces the _support function, which was really slow on some test cases

Related issues or pull requests

Reported in #644.

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modified the documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

The checklist is empty for now since this is a draft pull request.

@dbarbier:

Grrr, the UI for draft PRs is terrible; I forgot to change the PR type to draft :-/

@dbarbier changed the title from "Implement apriori-gen as in original paper" to "[WIP] Implement apriori-gen as in original paper" on Dec 18, 2019
@pep8speaks commented Dec 20, 2019

Hello @dbarbier! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-06 13:19:09 UTC

@dbarbier commented Dec 20, 2019

Here are some benchmark results; the data must first be downloaded from http://fimi.uantwerpen.be/data/ and put inside a data subdirectory. Some runs take a very long time, so I set a timeout of 120s; in that case, NA is reported in the tables below, together with the itemset size being processed. The script below only works on Linux; I am not sure how to implement the timeout on Windows.

Benchmark script
```python
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import pandas as pd
import numpy as np
import gzip
import os
from time import time
import signal
from contextlib import contextmanager


def raise_timeout(signum, frame):
    raise TimeoutError


@contextmanager
def timeout(seconds):
    # Register a handler that raises a TimeoutError on SIGALRM.
    signal.signal(signal.SIGALRM, raise_timeout)
    # Schedule the signal to be sent after ``seconds``.
    signal.alarm(seconds)
    try:
        yield
    except TimeoutError:
        pass
    finally:
        # Cancel the alarm and ignore the signal so it won't be
        # triggered if the timeout was not reached.
        signal.alarm(0)
        signal.signal(signal.SIGALRM, signal.SIG_IGN)


files = [
    # "chess.dat.gz",
    "connect.dat.gz",
    "mushroom.dat.gz",
    "pumsb.dat.gz",
    "pumsb_star.dat.gz",
    # "T10I4D100K.dat.gz",  these 3 files are too large
    # "T40I10D100K.dat.gz",
    # "kosarak.dat.gz"
]

# Modify these 2 variables
sparse = False
low_memory = True

for filename in files:
    with gzip.open(os.path.join("data", filename)) if filename.endswith(
        ".gz"
    ) else open(os.path.join("data", filename)) as f:
        data = f.readlines()

    dataset = [list(map(int, line.split())) for line in data]
    items = np.unique([item for itemset in dataset for item in itemset])
    print(f"{filename} contains {len(dataset)} transactions and {len(items)} items")

    te = TransactionEncoder()
    te_ary = te.fit(dataset).transform(dataset, sparse=sparse)
    if sparse:
        try:
            df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
        except AttributeError:
            # pandas < 0.25
            df = pd.SparseDataFrame(te_ary, columns=te.columns_,
                                    default_fill_value=False)
    else:
        df = pd.DataFrame(te_ary, columns=te.columns_)
    df.columns = ["c" + str(i) for i in df.columns]

    for min_support in [0.5, 0.3, 0.1, 0.05, 0.03, 0.01, 0.005]:
        tick = time()
        with timeout(120):
            print(apriori(df, min_support=min_support, verbose=1,
                          use_colnames=False, low_memory=low_memory))
        print(f"\nmin_support={min_support} time: {time() - tick}\n")
```

Some commits have either an asterisk (*) or a plus sign (+), as well as (F) or (T).

  • An asterisk indicates that the code was modified to operate on smaller matrix blocks, because there would otherwise be memory errors on my machine. All such commits use the same block size, so their relative speed is still meaningful.
  • A plus sign indicates that 1 should be added to the number in parentheses; the reason is that commit 4c82dcf displays the number of combinations after processing them instead of before, so the last number is missing when the timeout is reached.
  • (F) or (T) refers to the low_memory argument (False or True, respectively)

With a dense DataFrame:

connect.dat.gz 67557 transactions and 129 items

| min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | NA(5) | NA(4) | NA(5) | NA(5) | NA(5) | NA(6) | NA(4) | NA(7) | NA(7) | NA+(6) | NA+(6) |
| 0.3 | NA(5) | NA(4) | NA(4) | NA(5) | NA(5) | NA(6) | NA(4) | NA(6) | NA(6) | NA+(5) | NA+(5) |
| 0.1 | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(5) | NA(4) | NA(5) | NA(5) | NA+(4) | NA+(4) |
| 0.05 | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(5) | NA(5) | NA+(4) | NA+(4) |
| 0.03 | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(4) | NA(4) | NA+(4) | NA+(4) |
| 0.01 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(4) | NA(4) | NA+(4) | NA+(4) |
| 0.005 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(3) | NA(4) | NA(4) | NA+(4) | NA+(4) |
mushroom.dat.gz 8124 transactions and 119 items

| min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 0.04 | 0.05 | 0.04 | 0.04 | 0.04 | 0.02 | 0.09 | 0.02 | 0.02 | 0.02 | 0.02 |
| 0.3 | 1.24 | 0.64 | 0.38 | 0.35 | 0.37 | 0.13 | 0.89 | 0.11 | 0.11 | 0.13 | 0.10 |
| 0.1 | NA(7) | NA(12) | 107 | 103 | 109 | 35 | 89 | 32 | 33 | 38 | 25 |
| 0.05 | NA(6) | NA(7) | NA(7) | NA(7) | NA(7) | NA(7) | NA(7) | NA(9) | NA(9) | NA+(8) | NA+(9) |
| 0.03 | NA(5) | NA(6) | NA(6) | NA(6) | NA(6) | NA(7) | NA(7) | NA(7) | NA(7) | NA+(7) | NA+(7) |
| 0.01 | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(6) | NA(6) | NA(6) | NA(6) | NA+(5) | NA+(6) |
| 0.005 | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA(5) | NA+(5) | NA+(5) |
pumsb.dat.gz 49046 transactions and 2113 items

| min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | NA(5) | NA(3) | NA(5) | NA(5) | NA(5) | NA(6) | NA(3) | NA(6) | NA(6) | NA+(6) | NA+(6) |
| 0.3 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(5) | NA(2) | NA(5) | NA(5) | NA+(4) | NA+(4) |
| 0.1 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(4) | NA+(4) |
| 0.05 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(3) | NA+(3) |
| 0.03 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(3) | NA+(3) |
| 0.01 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(2) | NA(3) | NA(3) | NA+(3) | NA+(3) |
| 0.005 | NA(3) | NA(2) | NA(3) | NA(3) | NA(3) | NA(3) | NA(2) | NA(3) | NA(3) | NA+(3) | NA+(3) |
pumsb_star.dat.gz 49046 transactions and 2088 items

| min_support | master*(F) | master(T) | 96dfd4d*(F) | 4efeb8c*(F) | b90a146*(F) | 1fabe25*(F) | 1fabe25(T) | 58c95f1(F) | 58c95f1(T) | 4c82dcf+ | 423f9e7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 3.76 | 9.40 | 0.76 | 0.73 | 0.59 | 0.46 | 53 | 0.34 | 0.37 | 0.34 | 0.35 |
| 0.3 | NA(5) | NA(4) | NA(7) | NA(7) | NA(7) | 66 | NA(3) | 35 | 36 | 41 | 31 |
| 0.1 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(5) | NA(2) | NA(5) | NA(5) | NA+(5) | NA+(5) |
| 0.05 | NA(4) | NA(3) | NA(4) | NA(4) | NA(4) | NA(4) | NA(2) | NA(5) | NA(5) | NA+(4) | NA+(4) |
| 0.03 | NA(3) | NA(3) | NA(4) | NA(3) | NA(4) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(4) | NA+(4) |
| 0.01 | NA(3) | NA(3) | NA(3) | NA(3) | NA(3) | NA(4) | NA(2) | NA(4) | NA(4) | NA+(3) | NA+(3) |
| 0.005 | NA(3) | NA(2) | NA(3) | NA(3) | NA(3) | NA(3) | NA(2) | NA(3) | NA(3) | NA+(3) | NA+(3) |

With a sparse DataFrame:

connect.dat.gz 67557 transactions and 129 items

| min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
|---|---|---|---|---|
| 0.5 | NA(4) | NA(4) | NA+(4) | NA+(4) |
| 0.3 | NA(4) | NA(4) | NA+(3) | NA+(3) |
| 0.1 | NA(3) | NA(3) | NA+(3) | NA+(3) |
| 0.05 | NA(3) | NA(3) | NA+(3) | NA+(3) |
| 0.03 | NA(3) | NA(3) | NA+(3) | NA+(3) |
| 0.01 | NA(3) | NA(3) | NA+(3) | NA+(3) |
| 0.005 | NA(3) | NA(3) | NA+(3) | NA+(3) |
mushroom.dat.gz 8124 transactions and 119 items

| min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
|---|---|---|---|---|
| 0.5 | 0.18 | 0.18 | 0.07 | 0.06 |
| 0.3 | 3.07 | 3.19 | 0.92 | 0.84 |
| 0.1 | NA(6) | NA(6) | NA+(7) | NA+(8) |
| 0.05 | NA(5) | NA(5) | NA+(5) | NA+(5) |
| 0.03 | NA(5) | NA(5) | NA+(5) | NA+(5) |
| 0.01 | NA(4) | NA(4) | NA+(4) | NA+(4) |
| 0.005 | NA(4) | NA(4) | NA+(4) | NA+(4) |
pumsb.dat.gz 49046 transactions and 2113 items

| min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
|---|---|---|---|---|
| 0.5 | NA(4) | NA(4) | NA+(3) | NA+(3) |
| 0.3 | NA(3) | NA(3) | NA+(3) | NA+(3) |
| 0.1 | NA(3) | NA(3) | NA+(2) | NA+(2) |
| 0.05 | NA(3) | NA(3) | NA+(2) | NA+(2) |
| 0.03 | NA(3) | NA(2) | NA+(2) | NA+(2) |
| 0.01 | NA(2) | NA(2) | NA+(2) | NA+(2) |
| 0.005 | NA(2) | NA(2) | NA+(2) | NA+(2) |
pumsb_star.dat.gz 49046 transactions and 2088 items

| min_support | master(T) | 58c95f1(T) | 4c82dcf+ | 423f9e7+ |
|---|---|---|---|---|
| 0.5 | 2.97 | 3.08 | 1.16 | 1.16 |
| 0.3 | NA(5) | NA(5) | NA+(5) | NA+(5) |
| 0.1 | NA(3) | NA(3) | NA+(3) | NA+(3) |
| 0.05 | NA(3) | NA(3) | NA+(3) | NA+(3) |
| 0.03 | NA(3) | NA(3) | NA+(3) | NA+(3) |
| 0.01 | NA(3) | NA(3) | NA+(2) | NA+(2) |
| 0.005 | NA(2) | NA(2) | NA+(2) | NA+(2) |

@dbarbier:

More tests should be run, maybe with kosarak.dat; once the best option is decided, I will remove/squash/rearrange commits and update the docstrings. IMHO the best option is the current head.

@dbarbier commented Dec 20, 2019

Okay, I ran some tests with kosarak.dat and sparse=False. For small inputs, there is a noticeable overhead, which is caused by np.asfortranarray. All these tests were run with a DataFrame in row-major storage (order='C' in NumPy). In the tables below, an asterisk means that the run was performed with a DataFrame already in column-major storage, so that np.asfortranarray becomes a no-op; this is achieved by adding

```python
df = pd.DataFrame({col: df[col] for col in df.columns})
```

(F) and (T) still refer to the low_memory=False and low_memory=True arguments, respectively.

1k first lines of kosarak.dat.gz 1000 transactions and 3259 items, sparse=False

| min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
|---|---|---|---|---|---|---|
| 0.5 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| 0.3 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 |
| 0.1 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 |
| 0.05 | 0.02 | 0.02 | 0.02 | 0.04 | 0.02 | 0.02 |
| 0.03 | 0.02 | 0.02 | 0.02 | 0.04 | 0.02 | 0.02 |
| 0.01 | 0.11 | 0.04 | 0.06 | 0.10 | 0.07 | 0.07 |
| 0.005 | 3.58 | 0.39 | 1.48 | 0.68 | 0.83 | 0.89 |
10k first lines of kosarak.dat.gz 10000 transactions and 10094 items, sparse=False

| min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
|---|---|---|---|---|---|---|
| 0.5 | 0.10 | 0.15 | 0.10 | 0.22 | 0.33 | 0.09 |
| 0.3 | 0.10 | 0.17 | 0.10 | 0.47 | 0.33 | 0.09 |
| 0.1 | 0.10 | 0.19 | 0.10 | 0.70 | 0.33 | 0.09 |
| 0.05 | 0.11 | 0.20 | 0.11 | 1.24 | 0.33 | 0.09 |
| 0.03 | 0.13 | 0.22 | 0.11 | 1.77 | 0.33 | 0.09 |
| 0.01 | 0.69 | 0.30 | 0.24 | 4.93 | 0.37 | 0.13 |
| 0.005 | 8.43 | 0.54 | 2.16 | 11.4 | 0.61 | 0.40 |
50k first lines of kosarak.dat.gz 50000 transactions and 18936 items, sparse=False

| min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
|---|---|---|---|---|---|---|
| 0.5 | 0.78 | 1.00 | 1.17 | 1.90 | 2.70 | 0.86 |
| 0.3 | 0.78 | 1.20 | 0.74 | 4.23 | 2.71 | 0.86 |
| 0.1 | 0.78 | 1.37 | 1.17 | 6.35 | 2.71 | 0.86 |
| 0.05 | 0.82 | 1.70 | 0.75 | 11.7 | 2.71 | 0.87 |
| 0.03 | 0.90 | 1.92 | 1.21 | 16.9 | 2.71 | 0.87 |
| 0.01 | 5.02 | 2.95 | 1.51 | 46.4 | 2.77 | 0.92 |
| 0.005 | 51.3 | 4.96 | 12.8 | 107 | 3.12 | 1.30 |
100k first lines of kosarak.dat.gz 100000 transactions and 23496 items, sparse=False

| min_support | master(F) | master(T) | master*(F) | master*(T) | 423f9e7 | 423f9e7* |
|---|---|---|---|---|---|---|
| 0.5 | - | 2.83 | - | 4.79 | 7.51 | 2.89 |
| 0.3 | - | 3.02 | - | 10.7 | 6.43 | 2.89 |
| 0.1 | - | 3.41 | - | 16.1 | 6.43 | 2.90 |
| 0.05 | - | 4.44 | - | 29.9 | 6.43 | 2.90 |
| 0.03 | - | 4.75 | - | 43.0 | 6.43 | 2.90 |
| 0.01 | - | 7.46 | - | 116 | 6.50 | 2.97 |
| 0.005 | - | 13.8 | - | NA(2) | 7.03 | 3.52 |

With all these results, it is not clear to me whether 1fabe25 is such a good idea; an alternative is to make users aware of the importance of storage order and let them check which one is best adapted to their use case.

@rasbt (Owner) commented Dec 28, 2019

Sorry, I have been busy with grading and was then traveling over Xmas. The work in this PR is amazing, thanks a lot! Regarding the row vs. column format, as you suggest, maybe it would be better to document this prominently in the docstring and have users decide which format they provide via the DataFrame.

@rasbt commented Dec 28, 2019

I just saw the unit tests failing. I don't know why this happens on AppVeyor, but it seems that switching installations from pip to conda helped. Maybe something changed in the backend.

Regarding Travis CI, some discrepancies occurred after things got switched to scikit-learn 0.22 in the "latest" version. I addressed this as well in #652. I can rebase this PR if you like, or you can do it yourself if you prefer -- I wanted to ask before I started messing around here ;)

@dbarbier:

I just rebased but won't be able to take care of failures for the next 24 hours; feel free to push fixes.

@rasbt commented Dec 29, 2019

All good. Based on the error logs, these are "natural" failures due to the WIP state, not issues with the CI.

@dbarbier:

There were indeed some bugs; because of these fixes, timings may be slightly different, so I will rerun the benchmarks in a few days.

@dbarbier force-pushed the db/apriori-gen branch 2 times, most recently from bcc64c5 to 4c43f5c on December 30, 2019
The apriori-gen function described in section 2.1.1 of the Apriori paper
has two steps; first, the join step looks for itemsets with the same
prefix and creates new candidates by appending all pair combinations
to this prefix. Here is pseudocode copied from the paper:

  select p.1, p.2, ..., p.k-1, q.k-1
  from p in L(k-1), q in L(k-1)
  where p.1 = q.1, ..., p.k-2 = q.k-2, p.k-1 < q.k-1

The reason is that if a sequence q with the same prefix as p does not
belong to L(k-1), itemset p+(q.k-1,) cannot be frequent.
Before this commit, we were considering p+(q.k-1,) for any q.k-1 > p.k-1.

The second step of the apriori-gen function is called the prune step;
it will be implemented in a distinct commit.

See discussion in rasbt#644.
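
As an illustrative sketch (hypothetical function names, not the exact code in this PR), the join step above can be written in Python with itemsets as sorted tuples:

```python
from itertools import combinations

def join_step(old_itemsets):
    """Sketch of the apriori-gen join step: combine frequent (k-1)-itemsets
    that share the same (k-2)-prefix into candidate k-itemsets."""
    # Group the (k-1)-itemsets by their common prefix p.1 ... p.k-2
    by_prefix = {}
    for itemset in old_itemsets:
        by_prefix.setdefault(itemset[:-1], []).append(itemset[-1])
    for prefix, last_items in by_prefix.items():
        last_items.sort()
        # Taking ordered pairs enforces the p.k-1 < q.k-1 condition
        for p_last, q_last in combinations(last_items, 2):
            yield prefix + (p_last, q_last)

candidates = sorted(join_step([(1, 2), (1, 3), (1, 4), (2, 3)]))
# -> [(1, 2, 3), (1, 2, 4), (1, 3, 4)]; (2, 3) has no join partner
```

Only itemsets sharing a prefix are joined, which is exactly why candidates of the form p+(q.k-1,) with q not in L(k-1) are no longer generated.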
@rasbt commented Jan 2, 2020

What do you think about adding the benchmark script(s) as mlxtend/frequent_patterns/tests/benchmark.py (not test_benchmark.py, so it doesn't get executed by default when running pytest)?

It may be useful to have this as a reference in case of future modifications to the codebase.

@rasbt closed this Jan 2, 2020
@rasbt reopened this Jan 2, 2020
@rasbt commented Jan 2, 2020

Sorry, not sure why this PR was closed. I must have hit some keyboard combination -- it never happened before. This was certainly not intentional.

The apriori-gen function described in section 2.1.1 of the Apriori paper
has two steps; the first step was implemented in the previous commit.

The second step of the apriori-gen function is called the prune step; it
takes candidates c from the first step and checks that all (k-1)-itemsets
built by removing any single element from c are in L(k-1).

As NumPy arrays are not hashable, we cannot use set() for itemset lookup,
so we define a very simple prefix tree class.
Thanks to the previous optimizations, processing with low_memory=True is
now as efficient as with low_memory=False, and makes it possible to
process much larger datasets.
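
An illustrative sketch of the prune step (hypothetical code; here itemsets are tuples, so a plain set suffices for the L(k-1) lookup, whereas the commit above uses a small prefix tree because NumPy arrays are not hashable):

```python
from itertools import combinations

def prune_step(candidates, old_itemsets):
    """Sketch of the apriori-gen prune step: keep a candidate k-itemset
    only if every (k-1)-subset obtained by dropping one element is frequent."""
    frequent = set(old_itemsets)  # tuples are hashable, unlike NumPy arrays
    for c in candidates:
        if all(s in frequent for s in combinations(c, len(c) - 1)):
            yield c

old = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4)]
kept = list(prune_step([(1, 2, 3), (1, 2, 4), (1, 3, 4)], old))
# -> [(1, 2, 3), (1, 2, 4)]; (1, 3, 4) is pruned because (3, 4) is not frequent
```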

Removing the low_memory=False processing path makes the code simpler.

The downside is that we do not know in advance the number of itemsets to
process, so the number of itemsets is now displayed after the prune step
instead of before.
Note that commit 2f928cb introduced a bug: the number of processed
combinations was multiplied by the itemset's length.

Since vectorized operations are no longer performed on frequent itemsets,
they are stored as a list of tuples.
This is now possible because tuples are hashable.

For unknown reasons, np.sum is slow on a very large boolean array.
@dbarbier commented Jan 3, 2020

I rearranged the commits; they look good now IMHO. About the benchmark script, I do not know how to do that; there are many parameters: data files, sparse=True/False, column_major=True/False, and the list of min_support values (which may depend on the data files). Anyway, it has been committed.

Should the data files be copied into mlxtend/data/data with a Python function to load them? Here are their sizes:

  chess.dat.gz         14K
  connect.dat.gz      362K
  kosarak-1k.dat.gz    14K
  kosarak-10k.dat.gz  126K
  kosarak-50k.dat.gz  616K
  kosarak-100k.dat.gz 1.3M
  kosarak-200k.dat.gz 2.5M
  kosarak.dat.gz       13M
  mushroom.dat.gz      34K
  pumsb.dat.gz        1.3M
  pumsb_star.dat.gz   1.2M
  T10I4D100K.dat.gz   1.4M
  T40I10D100K.dat.gz  4.8M

@dbarbier commented Jan 3, 2020

Here are more benchmarks. In these tables, s= refers to the sparse variable, c= to col_major, and T/F to True/False.

T10I4D100K.dat.gz 100000 transactions and 870 items, low_memory=True

| min_support | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
|---|---|---|---|---|---|---|
| master (s=F,c=T) | 0.26 | 0.92 | 4.0 | 7.2 | 14.5 | 37.5 |
| master (s=T,c=T) | 0.06 | 0.24 | 3.9 | 8.9 | 29.3 | NA(3) |
| eb80667(s=F,c=T) | 0.01 | 0.05 | 2.0 | 4.8 | 7.3 | 11.8 |
| eb80667(s=T,c=T) | 0.05 | 0.44 | 11.3 | 23.5 | 33.5 | 57.2 |
T40I10D100K.dat.gz 100000 transactions and 942 items, low_memory=True

Note that the timeout had been extended to 300s when running commit eb80667, which explains values above 120s.

| min_support | 0.1 | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
|---|---|---|---|---|---|---|---|
| master (s=F,c=T) | 2.0 | 7.2 | 15.0 | NA(2) | NA(2) | NA(2) | NA(2) |
| master (s=T,c=T) | 0.67 | 5.4 | 13.3 | NA(2) | NA(2) | NA(2) | NA(2) |
| eb80667(s=F,c=T) | 0.01 | 1.3 | 3.5 | 20.7 | 176 | NA(6) | NA(2) |
| eb80667(s=T,c=T) | 1.32 | 14.0 | 32.9 | 232 | NA(2) | NA(2) | NA(2) |
kosarak-*k.dat.gz, low_memory=True

kosarak-1k.dat.gz

| min_support | 0.5 | 0.3 | 0.1 | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
|---|---|---|---|---|---|---|---|---|---|
| eb80667(s=F,c=T) | 0.005 | 0.006 | 0.007 | 0.008 | 0.01 | 0.27 | 0.31 | NA(11) | NA(2) |
| eb80667(s=F,c=F) | 0.009 | 0.009 | 0.01 | 0.01 | 0.01 | 0.05 | 0.60 | NA(9) | NA(2) |
| eb80667(s=T,c=T) | 0.08 | 0.08 | 0.09 | 0.09 | 0.09 | 0.14 | 0.92 | NA(8) | NA(2) |

kosarak-10k.dat.gz

| min_support | 0.5 | 0.3 | 0.1 | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
|---|---|---|---|---|---|---|---|---|---|
| eb80667(s=F,c=T) | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.04 | 0.14 | 0.58 | NA(5) |
| eb80667(s=F,c=F) | 0.26 | 0.26 | 0.26 | 0.26 | 0.27 | 0.36 | 1.02 | 4.4 | NA(4) |
| eb80667(s=T,c=T) | 0.26 | 0.26 | 0.26 | 0.27 | 0.28 | 0.36 | 0.80 | 2.4 | NA(4) |

kosarak-50k.dat.gz

| min_support | 0.5 | 0.3 | 0.1 | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
|---|---|---|---|---|---|---|---|---|---|
| eb80667(s=F,c=T) | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.12 | 0.33 | 1.06 | 61.1 |
| eb80667(s=F,c=F) | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.6 | 6.7 | 20.1 | NA(1) |
| eb80667(s=T,c=T) | 0.49 | 0.49 | 0.50 | 0.51 | 0.53 | 0.78 | 2.1 | 5.9 | NA(5) |

kosarak-100k.dat.gz

| min_support | 0.5 | 0.3 | 0.1 | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
|---|---|---|---|---|---|---|---|---|---|
| eb80667(s=F,c=T) | 0.20 | 0.20 | 0.20 | 0.20 | 0.20 | 0.24 | 0.60 | 1.85 | 67.9 |
| eb80667(s=F,c=F) | 4.9 | 4.9 | 4.9 | 5.0 | 5.1 | 6.4 | 15.7 | - | - |
| eb80667(s=T,c=T) | 0.62 | 0.62 | 0.62 | 0.65 | 0.70 | 1.15 | 3.6 | 10.2 | NA(9) |

kosarak-200k.dat.gz

| min_support | 0.5 | 0.3 | 0.1 | 0.05 | 0.03 | 0.01 | 0.005 | 0.003 | 0.001 |
|---|---|---|---|---|---|---|---|---|---|
| eb80667(s=F,c=T) | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.60 | 1.3 | 3.8 | 123 |
| eb80667(s=T,c=T) | 0.80 | 0.80 | 0.80 | 0.85 | 0.92 | 1.83 | 6.3 | 18.8 | NA(5) |

@dbarbier commented Jan 5, 2020

Some remarks:

  • Processing times are much higher with sparse DataFrames. I did not really investigate this issue; memory usage for dense arrays is now very low (except for the array itself), so using sparse DataFrames may get deprecated. Also note that TransactionEncoder uses a CSR format, which is logical, but it has to be converted to CSC for apriori, and this conversion takes time and memory.
  • The implementation does not use hash-trees. The reason is that the current API expects a DataFrame, and with hash-trees it seems more logical to work directly on a list of transactions.

This is a work in progress.
@rasbt commented Jan 7, 2020

Sorry for the sparse responses; I have been traveling over the holidays and am currently working on two manuscripts with submission deadlines in mid-January.

In any case, I am really thankful for all the good work you put into this. This is really awesome. And I can take care of the automatic data downloads from here.

Regarding:

> but it has to be converted to CSC for apriori, and this conversion takes time and memory.

Maybe that's something we could add to the apriori docs, i.e., adding it as a cell to the notebooks for each example. What do you think? (I could take care of this then.)

> Processing times are much higher with sparse dataframes. I did not really investigate this issue, memory usage for dense array is now very low (except for the array itself), so using sparse dataframe may get deprecated.

I am not an expert on this by any means, but I think sparse structures are usually only memory efficient, not necessarily efficient in terms of processing times.

@dbarbier commented Jan 7, 2020

> Sorry for the sparse responses, I have been traveling over the holidays and am currently working on two manuscripts with submissions deadlines mid Jan.

No worries, this issue is not trivial and requires careful thinking, please take your time.

Other apriori implementations take a list of transactions as input; several optimizations can then be performed:

  • use a sparse representation of this list of transactions (the Apriori paper suggests hash-trees; some other authors prefer prefix trees), which speeds up itemset counting (supports)
  • remove transactions during processing when they can no longer contain frequent itemsets (for instance, if their number of items is less than next_max_itemset)
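
A minimal sketch of what those optimizations look like on a transaction list (hypothetical helper, not part of this PR; a naive linear scan stands in for the hash-tree lookup the paper suggests):

```python
def support_counts(transactions, candidates, k):
    """Count candidate k-itemset supports over a list of transactions."""
    # Optimization: a transaction with fewer than k items cannot contain
    # any k-itemset candidate, so it is dropped for good.
    surviving = [set(t) for t in transactions if len(t) >= k]
    counts = {c: 0 for c in candidates}
    for t in surviving:
        for c in counts:
            if t.issuperset(c):
                counts[c] += 1
    return counts

counts = support_counts([[1, 2, 3], [1, 2], [2, 3, 4], [1]],
                        candidates=[(1, 2), (2, 3)], k=2)
# -> {(1, 2): 2, (2, 3): 2}; the single-item transaction [1] was skipped
```

With a boolean DataFrame as input, this transaction-level pruning has no direct equivalent, which is the crux of the design question below.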

Here it is very likely that the user loaded their dataset as a list of transactions and called TransactionEncoder to convert it into a pandas DataFrame in order to call the apriori function, so it does not make sense to me to let apriori internally convert its argument back into a list of transactions. IMHO the first design decision to consider is: do you want to keep the current input as a pandas DataFrame, or can it be changed to a list of transactions?

If the former, optimizations mentioned above cannot be performed and this PR is almost done. If the latter, a lot more work is needed.

[...]

> but it has to be converted to CSC for apriori, and this conversion takes time and memory.

> Maybe that's something we could add to the apriori docs. I.e., adding it as a cell to the notebooks for each example. What do you think? (I could take care of this then).

[...]

Sorry, I do not understand your point; a dense pandas DataFrame can use either row-major or column-major storage, and it looks like this depends on which DataFrame constructor was called (2-D array vs. dict). We could indeed add a cell to show how to convert the input DataFrame to speed up the apriori function.
But as far as I can tell, this is different for a sparse DataFrame: the user has no control over the internal storage, and all we can do is call df.to_coo().tocsc(). If the input argument could be passed as a NumPy array (or SciPy sparse matrix), that would be different.
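
To illustrate the storage-order point (a sketch, not code from the PR): np.asfortranarray must copy a row-major array but is a no-op on one that is already column-major, which is why rebuilding the DataFrame from a dict of columns (as in the earlier one-liner) removes the overhead:

```python
import numpy as np

a = np.ones((1000, 200), dtype=bool)   # row-major (C order), as from a 2-D array
assert a.flags['C_CONTIGUOUS']

f = np.asfortranarray(a)               # copies into column-major (Fortran) order
assert f.flags['F_CONTIGUOUS']

# On an already column-major array, the call returns the same object: a no-op
assert np.asfortranarray(f) is f
```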

@rasbt commented Jan 29, 2020

Sorry, I still haven't had time to look into this more. Along with the new semester (lots of teaching) and two paper deadlines in January, there wasn't time for much else recently. I am currently making a 0.17.1 bugfix release with the recent changes -- someone from industry contacted me about this because, due to company firewalls, several people can only install it from PyPI (not from GitHub directly). I will revisit this PR soon though -- thanks for all the work on it so far!
