
refactor: replace pandas apply with more efficient means (#655) #684

Open · wants to merge 1 commit into base: master
Conversation

keyanyang

Description

Goal: Replace pandas' .apply wherever possible.

What I did:

  • Checked all functions that use .apply. Although there is no place where np.where/np.select can be leveraged, NumPy's vectorize can speed up these functions considerably.

  • Improved the runtime performance of the apriori function and its tests by replacing pandas' .apply with NumPy's vectorize; the benchmarks below show that vectorize is faster than apply for the current use cases.

  • Improved the efficiency of the generate_itemsets function by replacing Python lists with NumPy arrays and iterative division with array division.
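The two changes above can be sketched as follows. This is a minimal illustration with made-up data, not the actual mlxtend code:

```python
import numpy as np
import pandas as pd

s = pd.Series([frozenset([3]), frozenset([3, 5]), frozenset([8, 3, 5])])

# Before: elementwise .apply (a Python-level loop with per-element Series overhead)
lengths_apply = s.apply(len)

# After: np.vectorize wraps the same callable but loops over the
# underlying ndarray, avoiding the per-element Series machinery
lengths_vec = np.vectorize(len)(s)

assert (lengths_apply.to_numpy() == lengths_vec).all()

# Iterative division replaced by a single array division
# (illustrative of the generate_itemsets change; numbers are invented)
supports = np.array([0.6, 0.8, 1.0])
itemset_counts = np.array([3, 4, 5])
ratios = supports / itemset_counts  # one vectorized op instead of a loop
```

Note that np.vectorize is still a Python-level loop under the hood; the win here comes from skipping pandas' apply machinery, not from true NumPy vectorization.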

Related issues or pull requests

Refactor #655

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

Performance comparison

Data Preparation

import pandas as pd
import numpy as np

frozenset_array = np.array([frozenset([3]), frozenset([10]), frozenset([8]), frozenset([6]), frozenset([3,5]), 
     frozenset([10,5]), frozenset([8,3]), frozenset([8,5]), frozenset([8,3,5]), frozenset([5,6])])
# frozenset_array = np.repeat(frozenset_array, 500)  # optionally enlarge the benchmark data
df = pd.DataFrame(frozenset_array, columns=['col1'])

len

# current
%timeit -n 100 np.max(df.col1.apply(len))
322 µs ± 32.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# new
%timeit -n 100 np.vectorize(len)(df.col1).max()
73.6 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

set

# current
%timeit -n 100 df['col1'].apply(lambda x: set(x))
229 µs ± 34.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# new
%timeit -n 100 np.vectorize(set)(df.col1)
65.9 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

frozenset to str

# current
%timeit -n 100 df['col1'].apply(lambda x: str(x))
243 µs ± 29.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# new
%timeit -n 100 np.vectorize(str)(df.col1)
85.3 µs ± 8.76 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

colname map

colname_map = {0: 'Apple', 1: 'Corn', 2: 'Dill', 3: 'Eggs', 4: 'Ice cream', 5: 'Kidney Beans', 6: 'Milk', 7: 'Nutmeg', 8: 'Onion', 9: 'Unicorn', 10: 'Yogurt'}

# current
%timeit -n 100 df['col1'].apply(lambda x: frozenset([colname_map[i] for i in x]))
225 µs ± 27.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# new
%timeit -n 100 [frozenset(a) for a in np.vectorize(map)(colname_map.get, df['col1'])]
75.1 µs ± 6.93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
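Since the rewritten expression is less obvious than the original lambda, here is a quick sanity check (on a small illustrative mapping, not the full benchmark data) that the two produce equal frozensets:

```python
import numpy as np
import pandas as pd

colname_map = {3: 'Eggs', 5: 'Kidney Beans', 8: 'Onion'}
s = pd.Series([frozenset([3]), frozenset([3, 5]), frozenset([8, 3, 5])])

# current approach: dict lookup inside a lambda
via_apply = s.apply(lambda x: frozenset([colname_map[i] for i in x]))

# proposed approach: vectorized map() producing lazy iterators,
# materialized into frozensets by the list comprehension
via_vectorize = [frozenset(a) for a in np.vectorize(map)(colname_map.get, s)]

assert list(via_apply) == via_vectorize
```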

Cast each element to a stringified frozenset

# copy the data below from tests
cols = ['antecedents', 'consequents', 'antecedent support', 'consequent support', 'support', 'confidence', 'lift', 'leverage', 'conviction']
expect = pd.DataFrame([
        [(8,), (5,), 0.6, 1.0, 0.6, 1.0, 1.0, 0.0, np.inf],
        [(6,), (5,), 0.6, 1.0, 0.6, 1.0, 1.0, 0.0, np.inf],
        [(8, 3), (5,), 0.6, 1.0, 0.6, 1.0, 1.0, 0.0, np.inf],
        [(8, 5), (3,), 0.6, 0.8, 0.6, 1.0, 1.25, 0.12, np.inf],
        [(8,), (3, 5), 0.6, 0.8, 0.6, 1.0, 1.25, 0.12, np.inf],
        [(3,), (5,), 0.8, 1.0, 0.8, 1.0, 1.0, 0.0, np.inf],
        [(5,), (3,), 1.0, 0.8, 0.8, 0.8, 1.0, 0.0, 1.0],
        [(10,), (5,), 0.6, 1.0, 0.6, 1.0, 1.0, 0.0, np.inf],
        [(8,), (3,), 0.6, 0.8, 0.6, 1.0, 1.25, 0.12, np.inf]],
        columns=cols
    )

# current
%timeit -n 100 expect['antecedents'].apply(lambda x: str(frozenset(x)))
%timeit -n 100 expect['consequents'].apply(lambda x: str(frozenset(x)))
245 µs ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
214 µs ± 8.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# new
%timeit -n 100 np.vectorize(str)(np.vectorize(frozenset)(expect['antecedents']))
%timeit -n 100 np.vectorize(str)(np.vectorize(frozenset)(expect['consequents']))
103 µs ± 6.69 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
86.6 µs ± 3.25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
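The timings above only compare speed; as a quick equivalence check (with a small illustrative Series of tuples, as in the test data), the nested vectorize calls yield the same strings as the apply version:

```python
import numpy as np
import pandas as pd

s = pd.Series([(8,), (6,), (8, 3)])

# current approach
via_apply = s.apply(lambda x: str(frozenset(x)))

# proposed approach: inner vectorize builds the frozensets,
# outer vectorize stringifies them
via_vectorize = np.vectorize(str)(np.vectorize(frozenset)(s))

assert list(via_apply) == list(via_vectorize)
```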

@pep8speaks

pep8speaks commented May 1, 2020

Hello @keyanyang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-01 01:49:14 UTC

@rasbt
Owner

rasbt commented May 5, 2020

Thanks a lot for your PR! Just to get an idea of how much the efficiency has improved, I ran the code against some benchmark datasets from the website http://fimi.uantwerpen.be/data/ that @dbarbier shared via #646 (a related PR that I really want to revisit at some point; sorry for the delay @dbarbier, it has been a very stressful semester).

I ran it as apriori(df, min_support=0.6) on pumsb and as apriori(df, min_support=0.1) on connect. The first number is the original implementation, and the second number is the one submitted in this PR:

  • chess: 40.9 sec vs 41.9 sec
  • pumsb (first 50 rows and first 50 columns): 28.1 vs 27.4

The code snippets for loading the unzipped datasets are

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

df = pd.read_csv('chess.dat.txt', sep=r'\s+', header=None)
te = TransactionEncoder()
te_ary = te.fit(df.values).transform(df.values)
df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()

and

df = pd.read_csv('pumsb.dat.txt', sep=r'\s+', header=None)
df = df.iloc[:20, :20]
te = TransactionEncoder()
te_ary = te.fit(df.values).transform(df.values)
df = pd.DataFrame(te_ary, columns=te.columns_)

It looks like the runtime wasn't really affected by the change from apply to vectorize+map -- I guess that's because the bottleneck is more in the combination generation section.

I propose to leave this PR open for now and apply these changes to #646 later, which is a big overhaul of the apriori implementation by @dbarbier -- I am hoping to find some hours of uninterrupted time to take a closer look at #646 this summer. In the meantime, if you are interested, it would be great to get some additional feedback (and another pair of eyes) on #646 :)
