
Associative Learning Algorithms #2662

Closed
dfrusdn opened this issue Dec 13, 2013 · 38 comments

@dfrusdn

dfrusdn commented Dec 13, 2013

I noticed that there were no associative learning algorithms such as:

Apriori Algorithm
Eclat (Equivalence Class Transformation)
PrefixSpan
FP-Growth

All of them are used to detect combinations of patterns in a dataset.

Some of them are fairly difficult to implement; I would estimate around 200 lines of code each.
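For readers unfamiliar with these algorithms, the core of Apriori is compact enough to sketch directly. The following is a minimal, illustrative implementation (not an sklearn API, and the grocery items are made up):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: find all itemsets appearing in at least
    min_support fraction of the transactions."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {i for t in transactions for i in t}
    result = {frozenset([i]): support(frozenset([i]))
              for i in items if support(frozenset([i])) >= min_support}
    current = set(result)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune candidates with an infrequent subset (the Apriori property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in result for s in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
        result.update({c: support(c) for c in current})
        k += 1
    return result

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"milk", "butter"}, {"bread", "milk", "butter"}]
freq = apriori(baskets, min_support=0.5)
# Each single item has support 0.75, each pair 0.5;
# the triple (support 0.25) is pruned.
```

The Apriori property (every subset of a frequent itemset is frequent) is what keeps the candidate set from exploding at each level.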

@amueller
Member

I'm not sure item set mining is in the scope of sklearn. I only know the Apriori algorithm, but I know there are more advanced ones. I guess one could fit them into the API using sparse indicator matrices, but somehow they seem very disjoint from the rest of sklearn.

@dfrusdn
Author

dfrusdn commented Dec 14, 2013

They can be used as a precursor to the CBA algorithm, a decision tree algorithm for categorical data.

@amueller
Member

There are no decision trees (or any other algorithm) for categorical data without a one-hot-transform in sklearn.

@larsmans
Member

I think frequent itemset mining should be considered off-topic. None of the core developers works in that area, so any submitted code is likely to become orphaned. We've been trying to reduce the scope of the library for this very reason.

@GaelVaroquaux
Member

Also, I believe that the kind of code patterns involved will be very different from what we currently have.

Not saying that it is not interesting, just saying the tool should be a different one.

@ajaybhat

Hi,

I have some knowledge of the Apriori and FP-Growth algorithms, and I'd like to work on this issue. Is anyone else already working on it? If so, I'd like to help with that too.

@larsmans
Member

Closing this issue. I think association learning should be prototyped in a separate package; if it turns out that the code and interfaces are similar enough to ours, we can consider the code for merging into scikit-learn.

@lmarti

lmarti commented Jul 11, 2014

Sad decision!

@ogrisel
Member

ogrisel commented Jul 11, 2014

Closing this issue. I think association learning should be prototyped in a separate package; if it turns out that the code and interfaces are similar enough to ours, we can consider the code for merging into scikit-learn.

Very reasonable decision :)

@joernhees
Contributor

👍 for focus
👎 for excluding a whole class of well known unsupervised learning algorithms

@jnothman
Member

@joernhees could you explain how this formulation of unsupervised learning even fits into the scikit-learn API? If not easily, then it probably belongs in scope of a different project that can establish its own API. I think @larsmans made that quite clear above, and it doesn't deserve a snide response.

@joernhees
Contributor

Sorry if this came across as snide; that wasn't my intention.

I originally arrived here searching for association rule learning algorithms and just expected to find them in sklearn (as it's a pretty awesome collection of machine learning algorithms, and usually I find most things I need in it; big thank you for that).

After reading this thread I was both pleased and disappointed, and wanted to voice both:

  • Pleased to see that you made the good software engineering decision to focus (which is difficult).
  • Disappointed that association rule mining isn't part of it, and that there's another person out there who misses it. As I said, it can be seen as its own class of unsupervised learning algorithms, and it's quite successful (Amazon). Maybe it's a bit too much data mining and a bit too little machine learning for sklearn, but twist it a bit and you get rule learning, which is quite useful for the explainable prediction of, for example, the next action an actor might take.

You're right that association rule mining doesn't fully fit into the current API. Conceptually I see it somewhere between dimensionality reduction techniques and hierarchical clustering. API-wise it's probably closest to hierarchical clustering.

As two lines were probably too short to express that in a friendly way, please accept my apologies.

@jnothman
Member

No problem. There are definitely Python implementations of Apriori. Building a good library that collects together alternatives and gives them a consistent (scikit-learn-like) API seems like a nice project. I think classifiers based on association rule mining may well be in scope for scikit-learn, but unless they are sufficiently popular and standardized already, such code runs the risk of becoming code without a maintainer.


@jamesmcm

I think this would be worthwhile; the article "Comparing Association Rules and Decision Trees for Disease Prediction" demonstrates clear advantages in comparison with decision trees.

This blog post includes Python code for Apriori; it might be interesting to have a go at implementing these algorithms sometime. Is there any work on a separate prototyping package?

@larsmans
Member

None so far. Maybe you can try to gather support for this on the mailing list?

@hlin117
Contributor

hlin117 commented Mar 24, 2015

I, for one, am disappointed that these algorithms are not implemented in sklearn. My advisor is Jiawei Han, an author of FP-Growth and PrefixSpan, and the citation counts for both of those papers ("Mining frequent patterns without candidate generation" and "Mining sequential patterns by pattern-growth") are proof that both algorithms have a place in sklearn.

@jnothman
Member

Just because scikit-learn has a popularity criterion for included algorithms, that doesn't mean every popular algorithm should be included. Scikit-learn needs to have limited scope, and this is simply too far from classification and regression-like problems (although I'd be interested to see a successful association-based classifier implemented).

Feel free to be disappointed, but I strongly doubt that ARL techniques will be directly included in scikit-learn in the foreseeable future (although another project may provide them with a scikit-learn-like API). There are other projects where these algorithms are more appropriate, but if you're disappointed with them too, go make your own.


@aloknayak29

Association learning algorithms are simply too far from classification- and regression-like problems. However, frequent itemset / pattern mining algorithms could instead be treated as feature generation algorithms, like CountVectorizer and TfidfVectorizer. The mined frequent patterns could then be used as input features in any classifier, which would be more intuitive and somewhat different from applying information-gain-based decision tree learning.
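The vectorizer analogy above could look roughly like this. A sketch only: the itemsets would normally come from a mining step, but are hard-coded here, and no such transformer exists in sklearn:

```python
# Sketch: use mined frequent itemsets as binary features, in the spirit
# of CountVectorizer. One column per frequent itemset.
frequent_itemsets = [frozenset({"bread"}), frozenset({"milk"}),
                     frozenset({"bread", "milk"})]

def transform(transactions, itemsets):
    """One row per transaction, one column per itemset:
    1 if the transaction contains the whole itemset, else 0."""
    return [[int(s <= set(t)) for s in itemsets] for t in transactions]

X = transform([{"bread", "milk", "eggs"}, {"milk"}], frequent_itemsets)
# X == [[1, 1, 1], [0, 1, 0]] -- ready to feed to any classifier
```

The resulting matrix is an ordinary binary design matrix, so the mining step composes naturally with downstream sklearn estimators.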

@larsmans
Member

That's an option. Kudo and Matsumoto show how to sample a subset of the polynomial kernel with PrefixSpan.

@aloknayak29

I could look it up in the scikit-learn documentation, but I'll ask you directly: is this option (Kudo and Matsumoto) available in scikit-learn?

@larsmans
Member

No. I'm just saying it could be.

@mrandrewandrade
Contributor

+1 for the Apriori algorithm

@rmenich

rmenich commented Apr 18, 2016

Note that there are ML algorithms which depend upon frequent itemsets as input. For example, see Cynthia Rudin's Bayesian Rule Lists (c.f., http://www.stat.washington.edu/research/reports/2012/tr609%20-%20old.pdf).

Consider a data set with a response variable to be predicted for which all the features are binary indicators (perhaps as a result of one-hot-encoding). We can consider a training set row to be a 'basket' and the presence of a feature for that training set row to be an 'item' within the basket. Thus, fairly generic data sets could be operated upon by apriori, FP-growth, and other frequent itemset mining techniques.

In the Bayesian Rule List algorithm, the frequent itemsets are evaluated and eventually an if-then-else structure is created from them. See the referenced paper for more details.

The point is that having frequent itemset mining approaches available could support classifiers and regressors --- already within the scope of sklearn --- not just market basket analysis.
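The row-to-basket correspondence described above is mechanical. A sketch, with invented column names:

```python
# Sketch: view each one-hot-encoded training row as a 'basket' whose
# 'items' are the names of the columns set to 1.
feature_names = ["is_red", "is_round", "is_sweet"]
X = [[1, 0, 1],
     [1, 1, 1],
     [0, 1, 0]]

baskets = [{name for name, bit in zip(feature_names, row) if bit}
           for row in X]
# baskets == [{'is_red', 'is_sweet'},
#             {'is_red', 'is_round', 'is_sweet'},
#             {'is_round'}]
# Any frequent-itemset miner (Apriori, FP-Growth, ...) can now run on these.
```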

@jnothman
Member

That's motivation for such algorithms to be available in scipy, perhaps. Of course, if a classifier or similar that meets scikit-learn's inclusion guidelines were implemented with itemset mining, it's got a good chance of inclusion, Apriori and all.


@actsasgeek

I don't know how much sklearn has changed since this conversation started, but there's an entire "cluster" package that isn't regression/classification either. I think a good implementation of the latest algorithms for association rules and frequent itemsets would be welcomed by many sklearn users.

@jnothman
Member

Clustering is much like classification, but unsupervised, and has long been part of scikit-learn. Association rule mining remains outside the primary tasks scikit-learn focuses on, and does not neatly fit its API, but might be relevant in the context of an association-based classifier.

"latest algorithms" isn't what scikit-learn is about. See our FAQ.

It would be nice not to have to repeat myself.

@amueller
Member

@actsasgeek if you want to implement association rule mining in a scikit-learn compatible way, we'd be happy to include it into scikit-learn-contrib: https://github.com/scikit-learn-contrib/scikit-learn-contrib/blob/master/README.md

@un-lock-me

I hope my repetitive question does not bother you; I sense some opposition toward adding association rule mining to such a great library as scikit-learn. I just want an update: has any frequent itemset mining been implemented in scikit-learn in the three years since this thread was created?

@jnothman
Member

jnothman commented Aug 17, 2017 via email

@gbroques

For those who are interested,

A library called mlxtend implements the Apriori algorithm:
http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/

@Sandy4321

Yes, everybody needs it, so it would be great to have in scikit-learn.
One more link on using it in ML:
http://www2.cs.uh.edu/~ordonez/pdfwww/w-2006-HIKM-ardtmed.pdf
("Comparing Association Rules and Decision Trees for Disease Prediction")

@jnothman
Member

jnothman commented Jul 1, 2019 via email

@remiadon

remiadon commented Mar 8, 2021

Hi everyone,

I am a research engineer working on the implementation of a standard pattern mining library in Python, scikit-mine, which is being designed for compatibility with scikit-learn.

If you'll allow me, I would like to share my opinion and thoughts on the interactions between pattern mining and machine learning:

  1. Pattern mining IS NOT machine learning. It is a different area of research, and proper inclusion of this family of algorithms into the Python ecosystem is a topic in itself. Echoing @amueller, @larsmans, @GaelVaroquaux and @ogrisel, I also believe this is out of the scope of sklearn (hence the need for other libraries to handle it).
  2. Echoing @ajaybhat, @hlin117, @jnothman and @rmenich: Apriori and FP-Growth are standard frequent itemset mining algorithms that many people know, but IMO they have been outperformed by other methods in the last decade, both in computational runtime and in the quality of the discovered patterns. SLIM is one of these.
  3. Echoing @Sandy4321 and @rmenich: interaction between pattern mining and other libraries in the Python ecosystem is definitely something to be considered. I am working on this :)
  4. Inclusion in scikit-learn-contrib is also something to hope for, at least if people express the need for such algorithms.

NB: I know it can seem frustrating sometimes; I was frustrated myself when I wanted to use pattern mining algorithms and couldn't find any tool that suited me. Maintainers have to make hard choices, including saying NO to people in need. Hopefully the community will converge and everyone will be satisfied.

@rmenich

rmenich commented Mar 8, 2021 via email

@remiadon

remiadon commented Mar 8, 2021

@rmenich here are the main differences. Again, this is only my opinion:

  • ML model parameters are usually numerical objects (vectors/matrices of numbers). PM algorithms discover (~learn) symbols (pattern sets) from symbols (a set of transactions). In the case of Apriori, due to the pattern explosion property, storing this pattern set as an internal attribute of a Python class, as is done in sklearn, would blow up memory on most modern datasets.
  • sklearn models expect numpy matrices as input. The literature on itemset mining (a subset of pattern mining) mentions datasets presented as tabular data (a transactional dataset can be one-hot encoded, e.g. with sklearn's MultiLabelBinarizer), but also "raw" transactional datasets in the form of lists of lists.
  • Once patterns are extracted from a dataset, one can run rule extraction. As far as I know there is no way to make this fit into the sklearn predict/predict_proba/decision_function API. If you match the left-hand side of a rule, you get the right-hand side of that rule as output, but you will never get a matrix/vector.

Concerning Cynthia Rudin's work: if I am correct, the model is trained from pre-mined rules. In other words, one can use patterns discovered by a PM algorithm as knowledge to build a machine learning model, but I would not say PM is ML.

Also good to note: the algorithms mentioned in this thread deal with itemset mining, which is actually a subpart of what the pattern mining literature offers. There exists a plethora of other pattern types:

  • sequential patterns (e.g. substrings in strings)
  • subgraph mining
  • periodic patterns (e.g. log analysis)
  • ...
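The API-mismatch point about rule extraction can be made concrete: applying an association rule is symbolic lookup, not vector prediction. A sketch, with invented rules and confidences:

```python
# Sketch: applying mined association rules. Each rule maps an antecedent
# itemset to a consequent itemset; 'predicting' means collecting the
# consequents of every rule whose antecedent is contained in the input.
rules = [
    (frozenset({"bread", "milk"}), frozenset({"butter"}), 0.8),  # (lhs, rhs, confidence)
    (frozenset({"diapers"}), frozenset({"beer"}), 0.6),
]

def apply_rules(basket, rules, min_confidence=0.5):
    """Return the union of consequents of all matching rules --
    a set of symbols, not the fixed-shape array predict() returns."""
    out = set()
    for lhs, rhs, conf in rules:
        if conf >= min_confidence and lhs <= basket:
            out |= rhs
    return out

print(apply_rules({"bread", "milk", "eggs"}, rules))  # {'butter'}
```

The output is a variable-size set of symbols per input, which is why it resists being squeezed into predict/predict_proba/decision_function.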

@Sandy4321

@remiadon @rmenich

It is a big mistake to say that Dr. Rudin's approach is not machine learning.

Try reading her papers again.

c.f., Cynthia Rudin's work on Bayesian rule lists (https://arxiv.org/abs/1602.08610) and her other decision list papers (https://users.cs.duke.edu/~cynthia/papers.html)

@TensorBlast

This comment was marked as abuse.

@adrinjalali
Member

adrinjalali commented Jan 22, 2024

This conversation is no longer constructive and isn't following our CoC (https://github.com/scikit-learn/scikit-learn/blob/main/CODE_OF_CONDUCT.md). I'm locking the conversation.

@scikit-learn scikit-learn locked and limited conversation to collaborators Jan 22, 2024