
Store validation score in file for ensemble learning #173

Open
kprimice opened this issue Mar 23, 2016 · 2 comments

Comments

@kprimice

I haven't dug into your code, but I would appreciate this feature (or another equivalent way to do it):

I don't know what your algorithm for ensemble learning is, but I guess there is some kind of bootstrapping or cross-validation involved.

Maybe it's easier to explain it with an example:

Suppose you use 5-fold CV in your ensemble learning algorithm, so you try to predict 1/5th of the observations each time, and score it by comparing against the true values.
Would it be possible to save these predictions in a .csv, by merging the 5 * 1/5th above (*)?

What I want to do with it:
I would like to run machineJS a first time on 100% of my training data, then take the output (*) and run a second machineJS pass on it.

(Right now, what I do:
Run machineJS a first time on 50% of my training data, --predict on the remaining 50%, then take those predictions and run machineJS on them.)
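For illustration, the behaviour being requested resembles what scikit-learn's cross_val_predict provides: every training row gets a prediction from the fold in which it was held out, and those out-of-fold predictions can be merged and written to a .csv. A minimal sketch, assuming a placeholder train.csv, a "target" column, and a placeholder estimator (this is not machineJS code):

```python
# Sketch of the requested behaviour using scikit-learn, not machineJS itself.
# "train.csv", the "target" column, and RandomForestClassifier are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

train = pd.read_csv("train.csv")
X, y = train.drop(columns=["target"]), train["target"]

# 5-fold CV: each row is predicted by the model that never saw it during training.
oof_preds = cross_val_predict(RandomForestClassifier(), X, y, cv=5, method="predict_proba")[:, 1]

# Merge the 5 * 1/5th back together and save, alongside the true values for scoring.
pd.DataFrame({"oof_prediction": oof_preds, "target": y}).to_csv("oof_predictions.csv", index=False)
```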

@ClimbsRocks
Owner

@kprimice I love your focus on cross-validation! We're going to get along handsomely.

machineJS already does this under the hood. Before we do anything, we split off a certain portion of our data (say 30%) for ensembling. Let's call this the ensembleData. I actually put a fair bit of effort into making sure that we keep track of the same rows for ensembleData no matter what. That way you can perform different feature engineering, but still use the same ensemble data in the end. That lets you use different forms of feature engineering as their own form of hyperparameters.

Once we've split out the 30% ensembleData, that leaves us with 70% of the training dataset remaining.
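A minimal sketch of that kind of deterministic split, assuming a placeholder train.csv and a fixed random seed (this illustrates the idea, not machineJS's actual implementation):

```python
# Sketch: deterministically hold out ~30% of the rows as ensembleData, so that
# differently feature-engineered versions of the same dataset always hold out
# the same rows. Not machineJS code; file name, seed, and split size are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")

# A fixed random_state means the same row indices land in ensemble_data on every
# run, no matter which extra feature columns have been added since.
train_data, ensemble_data = train_test_split(train, test_size=0.30, random_state=42)
```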

With that 70%, we're going to run a cross-validated search for the optimal hyperparameters. By default, we're using a pretty small number of folds for this, since our goal at this point is not to train the best model, but rather to find the optimal hyperparameters.

Once we have our optimal hyperparameters, we train the algorithm on the entire 70%.
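A rough sketch of that search-then-refit step, continuing the placeholder split above and assuming a placeholder estimator and parameter grid (not machineJS's internals):

```python
# Sketch: small-fold, cross-validated hyperparameter search on the 70% split,
# then refit on all of it with the best hyperparameters found.
# Continues the placeholder split above; estimator and grid are placeholders.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X_train = train_data.drop(columns=["target"])
y_train = train_data["target"]

# A small number of folds is enough here: the goal is only to pick
# hyperparameters, not to train the final model.
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 5]},
    cv=3,
    refit=True,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_  # already refit on the entire 70%
```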

Once we've trained the algorithm on the 70% dataset, using the optimal hyperparameters found from the cross-validated search, we then test it on our ensembleData. Remember that since ensembleData is a subsample from our training data, we do have the answers available here to score on. As expected, the scores from the hyperparameter search generally closely mirror the scores on the ensembleData.

Once we've scored it on the ensembleData, we use the algorithm to make predictions on both the test dataset, and the ensembleData. Again, it's critical at this juncture that every algorithm be making predictions on the same ensembleData, which is why we split this dataset off before anything else.
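In sketch form, continuing the placeholders above and assuming ROC AUC as the scoring metric and a placeholder test.csv:

```python
# Sketch: score the refitted model on the held-out ensembleData, then have it
# predict on both ensembleData and the test set. Continues the placeholders above;
# ROC AUC and "test.csv" are assumptions, not machineJS's actual choices.
import pandas as pd
from sklearn.metrics import roc_auc_score

X_ens = ensemble_data.drop(columns=["target"])
y_ens = ensemble_data["target"]
test = pd.read_csv("test.csv")  # same feature columns as X_train, no target

# ensembleData was carved out of the training file, so the true answers are
# available here to score against.
ens_preds = best_model.predict_proba(X_ens)[:, 1]
print("score on ensembleData:", roc_auc_score(y_ens, ens_preds))

# Every stage 0 model predicts on the *same* ensembleData rows and on the test set.
test_preds = best_model.predict_proba(test)[:, 1]
```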

Once we have done this for all algorithms, we pass control to another module I wrote called ensembler. ensembler will aggregate together all these predictions from all the stage 0 models. Partly it just gathers them all into one block that is added to our ensembleData. So, if our training dataset originally had 10 columns, that means our ensembleData has 10 columns. But if we train up 20 models on this, each of those models will make a prediction. ensembler will add each of those 20 predictions to the ensembleData as features, so we now have 30 features in ensembleData.

More than that though, ensembler also performs some feature engineering. It does things like figure out the min, max, median, mode, and variance of all the predictions from the stage 0 classifiers.
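Conceptually, that aggregation step looks something like the following, continuing the sketches above (this is an illustration of the idea, not the ensembler module's code; the model names are placeholders and mode is omitted for brevity):

```python
# Sketch: add every stage 0 model's ensembleData predictions as new columns, plus
# summary statistics across those predictions as engineered features.
# Illustration only, not ensembler's code.
import pandas as pd

# One entry per trained stage 0 model; here just the single model from above.
stage0_pred_dict = {"model_00": ens_preds}
stage0_preds = pd.DataFrame(stage0_pred_dict, index=ensemble_data.index)

# 10 original columns + one prediction column per stage 0 model.
augmented = pd.concat([ensemble_data, stage0_preds.add_prefix("pred_")], axis=1)

# Row-wise feature engineering across all stage 0 predictions.
augmented["pred_min"] = stage0_preds.min(axis=1)
augmented["pred_max"] = stage0_preds.max(axis=1)
augmented["pred_median"] = stage0_preds.median(axis=1)
augmented["pred_variance"] = stage0_preds.var(axis=1)
```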

And of course, we repeat this exact same process on our testData. So now our ensembleData and our testData have the original 10 features, plus the additional 20 features that are the predictions from our stage 0 classifiers. At this point, we're not too sure which of those new features are useful, which aren't, maybe some are more useful in some circumstances than others are. And that sounds like a perfect problem for a machine learning algorithm to sort out!

So we do exactly that. We feed the ensembleData back into machineJS. It performs a cross-validated search for the hyperparameters on the ensembleData that now includes all the stage 0 predictions. Once we've got the right hyperparameters, we use that model to make predictions on our testData, which also has all the accumulated stage 0 predictions in it.

Once we've trained up a few models here, we ship control back over to ensembler again. ensembler aggregates together the predictions from this ensemble round (simple averaging of our most accurate models), and outputs the final results.
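Roughly, and again only as a sketch with placeholder names rather than ensembler's real code:

```python
# Sketch: the stage 1 ("ensembling") round. The augmented ensembleData is treated
# as a new training set, the same search-then-refit process is run on it, and the
# predictions of the most accurate stage 1 models on the (equally augmented) test
# set are averaged. Placeholder estimator and names throughout.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_stage1 = augmented.drop(columns=["target"])
y_stage1 = augmented["target"]

stage1_search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=3)
stage1_search.fit(X_stage1, y_stage1)

# augmented_test: the test set with the same accumulated stage 0 prediction
# columns and summary features added to it (built the same way as `augmented`).
stage1_preds = stage1_search.best_estimator_.predict_proba(augmented_test)[:, 1]

# Final output: a simple average of the prediction arrays from the most accurate
# stage 1 models (top_model_preds holds one array per retained model).
top_model_preds = [stage1_preds]
final_predictions = np.mean(top_model_preds, axis=0)
```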

Whew.

As you can tell, we heavily believe in cross-validation, consistency, and ensembling.

Oh, and we do write nearly all the intermediate results to files; check out the various nested sub-folders within the "Predictions" folder.

Hope this helps explain that we're already doing what you want to do! That's kind of the whole point of machineJS: it just automates everything for you, letting you focus on feature engineering and deciding how to put the results to use at the end.

Keep submitting tickets- I love what you've found so far!

@kprimice
Author

Thank you for your great answer! I think it should be pinned somewhere, because it really helps to understand how your whole project works!

Unfortunately it didn't answer my question, because I wasn't clear enough in my explanation. I will try to rephrase it with an example:

Imagine that you have training data with 10 features and 10,000 rows, and the output (target) is 0 or 1.
The first thing I would do is put it into machineJS. It will perform everything you described above and output some predictions on the test set.

Now imagine that the 10,000 rows in my data are actually time dependent, meaning the prediction on row t should be strongly affected by the predictions on rows t-1 and t+1.

So my way of dealing with this problem (and I don't know if it is the optimal way) would be to add new features to my 10 original ones: 2 new features (or more) corresponding to the predictions at t-1 and t+1.
Then I would feed these 12 features to another machineJS session, and get new predictions which should now take into account the time dependency of my data.

The "trivial" way of doing this would be to split the training set in 2 (5 000 rows each). And perform the first step on 1/2 of the data, make some predictions on the other half. And then perform the second step on 1/2 of the data and make predictions on the test. This is actually already possible with you algorithm, as I can split the data myself and feed it to your machineJS.
But a more efficient way of doing this (I think) would be to use the full data set (10 columns, 10 000 rows) from the beginning. And, during the first step, to concatenate all the predictions made on the validation set during cross-validation. So that you get predictions on the full training data, you add this predictions (t-1 and t+1) to your training data and get your 10 + 2 features that you can feed back to a new machineJS session.
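A minimal sketch of that proposal, using scikit-learn's cross_val_predict as a stand-in for machineJS's internal cross-validation, with a placeholder estimator, file names, and column names:

```python
# Sketch of the proposed approach: get out-of-fold predictions for the whole
# training set, then add shifted copies (t-1 and t+1) as two new features.
# Placeholder estimator, file names, and column names; not machineJS code.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

train = pd.read_csv("train.csv")  # 10 features + "target", rows in time order
X, y = train.drop(columns=["target"]), train["target"]

# Out-of-fold predictions: every one of the 10,000 rows gets a prediction from
# a model that never saw it during training.
oof = cross_val_predict(RandomForestClassifier(), X, y, cv=5, method="predict_proba")[:, 1]

# Two new features: the prediction on the previous row and on the next row
# (the first/last rows get NaN, which a later run would need to handle).
train["pred_t_minus_1"] = pd.Series(oof).shift(1)
train["pred_t_plus_1"] = pd.Series(oof).shift(-1)

# These 10 + 2 features can now be fed back into a second machineJS run.
train.to_csv("train_with_neighbor_preds.csv", index=False)
```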

I am quite new to the machine learning community, so I don't know if my approach is actually relevant. What do you think?

@ClimbsRocks ClimbsRocks reopened this Mar 24, 2016