
Associative Learning Algorithms #2662

Closed
dfrusdn opened this issue Dec 13, 2013 · 38 comments

@dfrusdn

dfrusdn commented Dec 13, 2013

I noticed that there were no associative learning algorithms such as:

Apriori Algorithm
Eclat (Equivalence Class Transformation)
PrefixSpan
FP-Growth

All of them are used to detect combinations of patterns in a dataset.

Some of them are fairly difficult to implement; I would estimate around 200 lines of code each.
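For readers unfamiliar with these algorithms, the core of Apriori is compact enough to sketch directly. The following is a minimal, illustrative implementation (not an sklearn API, and the grocery items are made up):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: find all itemsets appearing in at least
    min_support fraction of the transactions."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {i for t in transactions for i in t}
    result = {frozenset([i]): support(frozenset([i]))
              for i in items if support(frozenset([i])) >= min_support}
    current = set(result)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune candidates with an infrequent subset (the Apriori property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in result for s in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
        result.update({c: support(c) for c in current})
        k += 1
    return result

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"milk", "butter"}, {"bread", "milk", "butter"}]
freq = apriori(baskets, min_support=0.5)
# Each single item has support 0.75, each pair 0.5;
# the triple (support 0.25) is pruned.
```

The Apriori property (every subset of a frequent itemset is frequent) is what keeps the candidate set from exploding at each level.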

@amueller
Member

I'm not sure item set mining is in the scope of sklearn. I only know the Apriori algorithm, but I know there are more advanced ones. I guess one could fit them into the API using sparse indicator matrices, but somehow they seem very disjoint from the rest of sklearn.

@dfrusdn
Author

dfrusdn commented Dec 14, 2013

They can be used as a precursor to the CBA algorithm, a decision tree algorithm for categorical data.

@amueller
Member

There are no decision trees (or any other algorithm) for categorical data without a one-hot-transform in sklearn.

@larsmans
Member

I think frequent itemset mining should be considered off-topic. None of the core developers works in that area, so any submitted code is likely to become orphaned. We've been trying to reduce the scope of the library for this very reason.

@GaelVaroquaux
Member

Also, I believe that the kind of code patterns involved will be very different from what we currently have.

Not saying that it is not interesting, just saying the tool should be a different one.

@ajaybhat

Hi,

I have some knowledge of the Apriori and FP-Growth algorithms, and I'd like to work on this issue. Is anyone else already working on it? If so, I'd like to help with that too.

@larsmans
Member

Closing this issue. I think association learning should be prototyped in a separate package; if it turns out that the code and interfaces are similar enough to ours, we can consider the code for merging into scikit-learn.

@lmarti

lmarti commented Jul 11, 2014

Sad decision!

@ogrisel
Member

ogrisel commented Jul 11, 2014

Closing this issue. I think association learning should be prototyped in a separate package; if it turns out that the code and interfaces are similar enough to ours, we can consider the code for merging into scikit-learn.

Very reasonable decision :)

@joernhees
Contributor

👍 for focus
👎 for excluding a whole class of well known unsupervised learning algorithms

@jnothman
Member

@joernhees could you explain how this formulation of unsupervised learning even fits into the scikit-learn API? If not easily, then it probably belongs in scope of a different project that can establish its own API. I think @larsmans made that quite clear above, and it doesn't deserve a snide response.

@joernhees
Contributor

Sorry if this came across as snide; that wasn't my intention.

I originally arrived here searching for association rule learning algorithms and just expected to find them in sklearn (as it's a pretty awesome collection of machine learning algorithms, and usually I find most things I need in it; big thank you for that).

After reading this thread I was both pleased and disappointed, and wanted to voice both:

  • Pleased to see that you made the good software engineering decision to focus (which is difficult).
  • Disappointed that association rule mining isn't part of it, and that there's another person out there who misses it. As I said, it can be seen as its own class of unsupervised learning algorithms, and it's quite successful (Amazon). Maybe it's a bit too much data mining and a bit too little machine learning for sklearn, but twist it a bit and you get rule learning, which is quite useful for the explainable prediction of, for example, the next action an actor might take.

You're right that association rule mining doesn't fully fit into the current API. Conceptually I see it somewhere between dimensionality reduction techniques and hierarchical clustering. API-wise it's probably closest to hierarchical clustering.

As two lines were probably too short to express that in a friendly way, please accept my apologies.

@jnothman
Member

No problem. There are definitely Python implementations of Apriori. Building a good library that collects together alternatives and gives them a consistent (scikit-learn-like) API seems like a nice project. I think classifiers based on association rule mining may well be in scope for scikit-learn, but unless they are sufficiently popular and standardized already, such code runs the risk of becoming code without a maintainer.


@jamesmcm

I think this would be worthwhile; the article "Comparing Association Rules and Decision Trees for Disease Prediction" demonstrates clear advantages in comparison with decision trees.

This blog post includes Python code for Apriori; it might be interesting to have a go at implementing these algorithms sometime. Is there any work on a separate prototyping package?

@larsmans
Member

None so far. Maybe you can try to gather support for this on the mailing list?

@hlin117
Contributor

hlin117 commented Mar 24, 2015

I, for one, am disappointed that these algorithms are not implemented in sklearn. My advisor is Jiawei Han, an author of FP-Growth and PrefixSpan, and the citation counts for both of those papers ("Mining frequent patterns without candidate generation" and "Mining sequential patterns by pattern-growth") are proof that both algorithms have a place in sklearn.

@jnothman
Member

Just because scikit-learn has a popularity criterion for included algorithms, that doesn't mean every popular algorithm should be included. Scikit-learn needs to have limited scope, and this is simply too far from classification and regression-like problems (although I'd be interested to see a successful association-based classifier implemented).

Feel free to be disappointed, but I strongly doubt that ARL techniques will be directly included in scikit-learn in the foreseeable future (although another project may provide them with a scikit-learn-like API). There are other projects where these algorithms are more appropriate, but if you're disappointed with them too, go make your own.


@aloknayak29

Association learning algorithms are simply too far from classification- and regression-like problems. However, frequent itemset / pattern mining algorithms could instead be treated as feature generation algorithms, like CountVectorizer and TfidfVectorizer. The mined frequent patterns could then be used as input features in any classifier, which would be more intuitive and somewhat different from applying information-gain-based decision tree learning.
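The vectorizer analogy above could look roughly like this. A sketch only: the itemsets would normally come from a mining step, but are hard-coded here, and no such transformer exists in sklearn:

```python
# Sketch: use mined frequent itemsets as binary features, in the spirit
# of CountVectorizer. One column per frequent itemset.
frequent_itemsets = [frozenset({"bread"}), frozenset({"milk"}),
                     frozenset({"bread", "milk"})]

def transform(transactions, itemsets):
    """One row per transaction, one column per itemset:
    1 if the transaction contains the whole itemset, else 0."""
    return [[int(s <= set(t)) for s in itemsets] for t in transactions]

X = transform([{"bread", "milk", "eggs"}, {"milk"}], frequent_itemsets)
# X == [[1, 1, 1], [0, 1, 0]] -- ready to feed to any classifier
```

The resulting matrix is an ordinary binary design matrix, so the mining step composes naturally with downstream sklearn estimators.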

@larsmans
Member

That's an option. Kudo and Matsumoto show how to sample a subset of the polynomial kernel with PrefixSpan.

@aloknayak29

I could look it up in the scikit-learn documentation, but I'll ask you directly: is this option (Kudo and Matsumoto) available in scikit-learn?

@larsmans
Member

No. I'm just saying it could be.

@mrandrewandrade
Contributor

+1 for the Apriori algorithm

@rmenich

rmenich commented Apr 18, 2016

Note that there are ML algorithms which depend upon frequent itemsets as input. For example, see Cynthia Rudin's Bayesian Rule Lists (c.f., http://www.stat.washington.edu/research/reports/2012/tr609%20-%20old.pdf).

Consider a data set with a response variable to be predicted for which all the features are binary indicators (perhaps as a result of one-hot-encoding). We can consider a training set row to be a 'basket' and the presence of a feature for that training set row to be an 'item' within the basket. Thus, fairly generic data sets could be operated upon by apriori, FP-growth, and other frequent itemset mining techniques.

In the Bayesian Rule List algorithm, the frequent itemsets are evaluated and eventually an if-then-else structure is created from them. See the referenced paper for more details.

The point is that having frequent itemset mining approaches available could support classifiers and regressors --- already within the scope of sklearn --- not just market basket analysis.
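The row-to-basket correspondence described above is mechanical. A sketch, with invented column names:

```python
# Sketch: view each one-hot-encoded training row as a 'basket' whose
# 'items' are the names of the columns set to 1.
feature_names = ["is_red", "is_round", "is_sweet"]
X = [[1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]]

baskets = [{name for name, bit in zip(feature_names, row) if bit}
           for row in X]
# baskets == [{'is_red', 'is_sweet'},
#             {'is_red', 'is_round', 'is_sweet'},
#             {'is_round'}]
# Any frequent-itemset miner (Apriori, FP-Growth, ...) can now run on these.
```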

@jnothman
Member

That's motivation for such algorithms to be available in scipy, perhaps. Of course, if a classifier or similar that meets scikit-learn's inclusion guidelines were implemented with itemset mining, it's got a good chance of inclusion, Apriori and all.


@actsasgeek

I don't know how much sklearn has changed since this conversation started, but there's an entire "cluster" package that isn't regression/classification either. I think a good implementation of the latest algorithms for association rules and frequent itemsets would be welcomed by many sklearn users.

@jnothman
Member

Clustering is much like classification, but unsupervised, and has long been part of scikit-learn. Association rule mining remains outside the primary tasks scikit-learn focuses on, and does not neatly fit its API, but might be relevant in the context of an association-based classifier.

"latest algorithms" isn't what scikit-learn is about. See our FAQ.

It would be nice not to have to repeat myself.

@amueller
Member

@actsasgeek if you want to implement association rule mining in a scikit-learn compatible way, we'd be happy to include it into scikit-learn-contrib: https://github.com/scikit-learn-contrib/scikit-learn-contrib/blob/master/README.md

@un-lock-me

I hope my repetitive question does not bother you; I sense some opposition toward adding association rule mining to such a great library as scikit-learn. I just want an update: has any frequent itemset mining been implemented in scikit-learn in the three years since this thread was created?

@jnothman
Member

jnothman commented Aug 17, 2017 via email

@gbroques

For those who are interested,

A library called mlxtend implements the Apriori algorithm:
http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/

@Sandy4321

Yes, everybody needs it, so it would be great to have in scikit-learn.
One more link on using it in ML:
http://www2.cs.uh.edu/~ordonez/pdfwww/w-2006-HIKM-ardtmed.pdf
("Comparing Association Rules and Decision Trees for Disease Prediction")

@jnothman
Member

jnothman commented Jul 1, 2019 via email

@remiadon

remiadon commented Mar 8, 2021

Hi everyone,

I am a research engineer working on the implementation of a standard pattern mining library in Python, scikit-mine, which is being designed for compatibility with scikit-learn.

If you'll allow me, I would like to share my opinion and thoughts on the interactions between pattern mining and machine learning:

  1. Pattern mining IS NOT machine learning. It is a different area of research, and proper inclusion of this family of algorithms into the Python ecosystem is a topic in itself. Echoing @amueller, @larsmans, @GaelVaroquaux and @ogrisel, I also believe this is out of the scope of sklearn (hence the need for other libraries to handle it).
  2. Echoing @ajaybhat, @hlin117, @jnothman and @rmenich: Apriori and FP-Growth are standard frequent itemset mining algorithms that many people know, but IMO they have been outperformed by other methods in the last decade, both in computational runtime and in the quality of the discovered patterns. SLIM is one of these.
  3. Echoing @Sandy4321 and @rmenich: interaction between pattern mining and other libraries in the Python ecosystem is definitely something to be considered. I am working on this :)
  4. Inclusion in scikit-learn-contrib is also something to hope for, at least if people express the need for such algorithms.

NB: I know it can seem frustrating sometimes; I was frustrated myself when I wanted to use pattern mining algorithms and couldn't find any tool that suited me. Maintainers have to make hard choices, including saying NO to people in need. Hopefully the community will converge and everyone will be satisfied.

@rmenich

rmenich commented Mar 8, 2021 via email

@remiadon

remiadon commented Mar 8, 2021

@rmenich here are the main differences. Again, this is only my opinion:

  • ML model parameters are usually numerical objects (vectors/matrices of numbers). PM algorithms discover (~learn) symbols (pattern sets) from symbols (a set of transactions). In the case of Apriori, due to the pattern explosion property, storing this pattern set as an internal attribute of a Python class, as is done in sklearn, would blow up memory on most modern datasets.
  • sklearn models expect numpy matrices as input. The literature on itemset mining (a subset of pattern mining) mentions datasets presented as tabular data (a transactional dataset can be one-hot encoded, e.g. with sklearn's MultiLabelBinarizer), but also "raw" transactional datasets in the form of lists of lists.
  • Once patterns are extracted from a dataset, one can run rule extraction. As far as I know there is no way to make this fit into the sklearn predict/predict_proba/decision_function API. If you match the left-hand side of a rule, you get the right-hand side of that rule as output, but you will never get a matrix/vector.

Concerning Cynthia Rudin's work: if I am correct, the model is trained from pre-mined rules. In other words, one can use patterns discovered by a PM algorithm as knowledge to build a machine learning model, but I would not say PM is ML.

Also good to note: the algorithms mentioned in this thread deal with itemset mining, which is actually a subpart of what the pattern mining literature offers. There exists a plethora of other pattern types:

  • sequential patterns (e.g. substrings in strings)
  • subgraph mining
  • periodic patterns (e.g. log analysis)
  • ...
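The API-mismatch point about rule extraction can be made concrete: applying an association rule is symbolic lookup, not vector prediction. A sketch, with invented rules and confidences:

```python
# Sketch: applying mined association rules. Each rule maps an antecedent
# itemset to a consequent itemset; 'predicting' means collecting the
# consequents of every rule whose antecedent is contained in the input.
rules = [
    (frozenset({"bread", "milk"}), frozenset({"butter"}), 0.8),  # (lhs, rhs, confidence)
    (frozenset({"diapers"}), frozenset({"beer"}), 0.6),
]

def apply_rules(basket, rules, min_confidence=0.5):
    """Return the union of consequents of all matching rules --
    a set of symbols, not the fixed-shape array predict() returns."""
    out = set()
    for lhs, rhs, conf in rules:
        if conf >= min_confidence and lhs <= basket:
            out |= rhs
    return out

print(apply_rules({"bread", "milk", "eggs"}, rules))  # {'butter'}
```

The output is a variable-size set of symbols per input, which is why it resists being squeezed into predict/predict_proba/decision_function.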

@Sandy4321

@remiadon @rmenich

It is a big mistake to say that Dr. Rudin's approach is not machine learning.

Try reading her papers again.

c.f., Cynthia Rudin's work on Bayesian rule lists (https://arxiv.org/abs/1602.08610) and her other decision list papers (https://users.cs.duke.edu/~cynthia/papers.html)

@TensorBlast

This comment was marked as abuse.

@adrinjalali
Member

adrinjalali commented Jan 22, 2024

This conversation is no longer constructive and isn't following our CoC (https://github.com/scikit-learn/scikit-learn/blob/main/CODE_OF_CONDUCT.md). I'm locking the conversation.

@scikit-learn scikit-learn locked and limited conversation to collaborators Jan 22, 2024