Qlearning+sarsa v2.0 #1005

adrianjav · 2017-12-09T22:11:42Z

This pull request adds more reinforcement learning functionality to dlib. Namely it implements the Qlearning and SARSA algorithms. I think I've done a good job with respect to backward compability and style code so far but let me know anything that needs a review.
A list of the changes:

Qlearning and SARSA algorithms (files qlearning.h and sarsa.h)
Separated policy in its own header (policy.h) where there are two implementations (greedy_policy and epsilon_policy).
An example that shows both algorithms working (qlearning_sarsa_ex.cpp)
The same example as a unitary test (reinforcement_learning.cpp). It works for me but it'd be nice some more testing since convergence could depend on the machine.
An interface for the model the agents will work with (model_abstract.h).

About the mechanics: These new algorithms avoid using the process_sample of lspi. Instead they run on a model (usually an online model) that allows them to take steps and get feedback from their choices. What the algorithms really train aren't models but policies. That way you can specify a custom policy and custom initial weights and, when training, the algorithms use an epsilon-greedy policy that uses the given policy as the underlying one.

Related PR: #824

Edit:
Mmmm... AppVeyor fails at the unitary tests. Any suggestions on how to test if they work? Since their convergence is non deterministic and qlearning can get worse after converging I find it hard to test.

davisking · 2017-12-10T19:03:32Z

Wow, this is a lot of stuff. Thanks for working on this PR :)

I haven't looked it over very carefully, but it generally looks good. You should flesh out the comments in the qlearning_sarsa_ex.cpp example a bit so that it reads like a tutorial (e.g. explains the relevant model details so that someone who isn't already familiar with Q learning can understand and learn what it does).

Also definitely fix the failing unit tests. Most of them are working though so that's good. Make sure your random number generators are all seeded the same every run so the tests reliably do the same thing. You don't want unit tests to have some small chance of failure. Beyond that I don't have much advice other than to dig in and debug it.

Let me know when you think the PR is ready for a full review.

adrianjav · 2017-12-12T10:41:58Z

Thanks god that the tests failed!
Reviewing the code I saw a terrible bug when adding rewards. I've changed that and added custom PRNG support when training. Now everything seems to go smoothly as intended.
I've also rewritten the comments in the example.

I think now it's ready for a full review.

travek · 2017-12-28T19:20:00Z

Hi!
When this feature will be merged into Master?

davisking · 2017-12-28T22:52:56Z

Some time in the next two weeks. I've been really busy with other stuff and so haven't had time to review it yet.

davisking

I spent a while looking this over, and I think it's pretty good overall. Although I do have a lot of comments.

A big one is I think that the feature extractor and model classes should just be one class. Both of these classes are not provided by dlib but implemented by the user. The dlib code doesn't really even know that there are two classes.

The central challenge is communicating to the user what they have to do to correctly implement a "model". That would be easier if it was presented to them as a single class they have to implement which has some methods that they are only required to implement if using certain algorithms.

davisking · 2018-01-07T13:44:57Z

dlib/control/approximate_linear_models_abstract.h


 namespace dlib
 {

 // ----------------------------------------------------------------------------------------

+    template <
+        typename T,
+        typename U


What are these template arguments? The documentation doesn't mention them.

davisking · 2018-01-07T13:56:57Z

dlib/control/approximate_linear_models_abstract.h

    struct example_feature_extractor 
    {
        /*!
            WHAT THIS OBJECT REPRESENTS
                This object defines the interface a feature extractor must implement if it
-                is to be used with the process_sample and policy objects defined at the
-                bottom of this file.  Moreover, it is meant to represent the core part


This wrong now? Since process_sample and the policy objects now take models rather than feature extractors.

Also, at this point in my review I still find the need to have these other objects depend on models rather than directly on feature extractors. But maybe I'll see why that's a good idea when I get more into the review.

davisking · 2018-01-07T13:57:33Z

dlib/control/approximate_linear_models_abstract.h

                of a model used in a reinforcement learning algorithm.

                In particular, this object models a Q(state,action) function where
                    Q(state,action) == dot(w, PSI(state,action))
                    where PSI(state,action) is a feature vector and w is a parameter
                    vector.

-                Therefore, a feature extractor defines how the PSI(x,y) feature vector is
+                Therefore,  a feature extractor defines how the PSI(x,y) feature vector is


Fix whitespace.

Also, try not to make whitespace changes since it makes the review harder.

davisking · 2018-01-07T14:05:28Z

dlib/control/approximate_linear_models_abstract.h

@@ -56,41 +59,29 @@ namespace dlib
                - returns the dimensionality of the PSI() feature vector.  
        !*/

-        action_type find_best_action (
-            const state_type& state,
-            const matrix<double,0,1>& w


This seems confusing. I see that the lspi code (in dlib/test/lspi.cpp) still defines feature extractors this way. So this is wrong isn't it? This function still needs to be here.

A better way to organize this model/feature extractor stuff is probably to have the documentation say there is just one class, not two. Then you can define two different versions of the model class. There is the basic one which was previously defined by the example_feature_extractor. Then there is the more detailed one with the new things you want to add to models. The new one has all the functions the feature extractor had, but with some additional new ones. Then when you use any model class it's not like it depends on some feature extractor class. It's just one class.

After all, the model is the thing the user makes. So it's weird that later on it's defined to take a feature extractor as a template argument.

Anyway, my point is that it's really unclear what a model or feature extractor is right now. Clarifying that is the most important thing here since that's the interface a user of the software is expected to implement. If it's not really clear people will just not use it and this whole RL module will never be used. Making that model interface simple so people can easily implement it is the central challenge.

davisking · 2018-01-07T14:07:02Z

dlib/control/approximate_linear_models_abstract.h

-            const state_type& state,
-            const action_type& action,
-            matrix<double,0,1>& feats
+        matrix<double,0,1> get_features (


This was changed in the new code, but the LSPI code still does it the old way. Everything needs to be the same.

Also, don't do this. This change makes the code slower, since now each call to get_features allocates a block of memory. The other way avoids the reallocation. I realize move constructors make returning the matrix not super slow. But reallocating memory in a tight loop is also bad.

davisking · 2018-01-07T14:21:40Z

dlib/control/policy_abstract.h

+template <
+    typename model_type
+    >
+class example_policy


Why is there an example policy object?

davisking · 2018-01-07T14:25:36Z

dlib/control/qlearning_abstract.h

+                  model_abstract.h.
+
+            WHAT THIS OBJECT REPRESENTS
+                This objects is an implementation of the well-known reinforcement learning


Fix spelling.

objects and algorithms

The text is wrong, it doesn't use any process_samples

Please make sure all the documentation is correct. I am extremely anal about having correct and clear documentation :). You should review all the comments a few times to make sure they are correct and clear before the next PR review.

davisking · 2018-01-07T14:34:38Z

dlib/control/qlearning_abstract.h

+        ) const;
+        /*!
+            requires
+                - policy is of the form example_policy<model_type>, i.e., an instance of


This is too general. You should change this function to take a policy<model> which is much more specific. So far I don't see any reason to not be more specific. That is, why would a user want to create their own policy object?

davisking · 2018-01-07T14:35:39Z

dlib/control/qlearning_abstract.h

+                  defined in std::random. By default it assumes it to be the standard
+                  default_random_engine class.
+            ensures
+                - returns a policy of the type policy_type as the result of applying the


The wording of this is confusing and ungrammatical.

davisking · 2018-01-07T14:38:19Z

dlib/control/sarsa.h

+            );
+
+            reward_type total_reward = static_cast<reward_type>(0);
+            for(auto iter = 0u; iter < iterations; ++iter){


Put { on a line by itself so the code style is consistent with all the other code in this file and other parts of dlib.

adrianjav · 2018-01-07T18:04:51Z

Thanks for your comments!

I strongly agree with you in merging feature_extractor and model in one class. I might have overthought it just because feature_extractor was already there and I tried to separate the model and its representation. But at the end it's confusing for the user.

The other things come mostly from: rushing; being used to my own code style; English not being my first language.

I'll try to fix them all in the next few days.

adrianjav · 2018-02-05T22:10:08Z

It's been a month since my last post and I feel like I have to say something. I've been applying the changes of the review but I'm still missing a few things.

I haven't forgotten it, just being really busy with my actual job. As soon as I have some spare time I'll post the last changes

adrianjav · 2018-02-18T15:13:19Z

Finally I've applied those requested changes into the old code. The most remarkable change is that, as you suggested, I describe in the abstract two classes: offline model (the former feature extractor) and online model; So the user can choose to implement one of them depending on its necessities.

Overall I agree with the notes you took in the review, the only thing I could argue is that I returned matrices by value on purpose trusting the compiler's ROV optimization. Either way I've also changed that and reviewed the comments twice.

davisking · 2018-02-19T13:41:28Z

Cool, thanks for the update. I'll review the PR sometime over the coming week.

davisking · 2018-03-06T12:28:13Z

I haven't forgotten about this PR. But I've got a new baby at home and have been a little busy :). I'll get to it though.

adrianjav · 2018-03-06T21:16:41Z

Hahaha don't worry, I'm sure you'll do it. I'm quite busy as well.
Congratulations for your newborn! 👶

davisking · 2018-03-07T02:36:05Z

Thanks :)

travek · 2018-04-15T18:21:09Z

Davis,
when can you push this PR ?

davisking · 2018-04-15T21:01:37Z

Not sure. I'll get to it thought. You can use it now though, no need to wait on me. In fact, other people testing and reporting back here will help me in my own review when I get to it.

davisking

Ok, so I looked this over and it's pretty good. I think the class structure is nice and overall it's pretty good. I did find a bunch of things that need to be fixed before being merged though.

One of the big ones is that the comments say that the models are thread safe but then there are these mutable random number generators in the implementations. I noted this in my comments. That definitely needs to be fixed, and I think the right thing to do is probably to make them not mutable and then let member functions that mutate them not be const. I haven't worked out the implications of that on the interface, if it's unreasonable then don't do it and keeping mutable things is fine. But the best interface pattern with regards to thread safety is to be able to say that const members are thread safe and non-const aren't, because they mutate memory. So if that can be done here without making the API irritating to the user in some way then that's the best. The other option is to simply remove the notes about the model being thread safe. I don't think there is any need in the code we have for it to actually be thread safe, although it would be nice. But the main issue is that whatever type of thread safety claimed by the documentation must really exist in the code.

Also, sorry about the delay. I've been unusually busy :)

davisking · 2018-04-26T11:41:29Z

dlib/control/approximate_linear_models.h

+        ) const
+        {
+            std::bernoulli_distribution d(epsilon);
+            return d(gen) ? get_model().random_action(state) : underlying_policy(state);


This was a little confusing. Why doesn't this say?

return d(gen) ? underlying_policy.random_action(state) : underlying_policy(state);

I thought there were two different models here at first. In general, less indirection is better, or at least the code should use consistent indirection so that it's clear what is being referenced. The way it's written here makes it look like there are two different models in play when really it's just one.

random action is a method from online model, I have sustituted that for

return d(gen) ? underlying_policy.get_model().random_action(state) : underlying_policy(state);

davisking · 2018-04-26T11:43:40Z

dlib/control/approximate_linear_models.h

+        serialize(version, out);
+        serialize(item.get_policy(), out);
+        serialize(item.get_epsilon(), out);
+        serialize(item.get_generator(), out);


Does this work? Are there serialize routines defined for the random number generators in std::? You should add unit tests that invoke the serialization routines for these new objects to make sure they all work.

davisking · 2018-04-26T11:45:34Z

dlib/control/approximate_linear_models_abstract.h

-                    where PSI(state,action) is a feature vector and w is a parameter
-                    vector.
+                This object defines the inferface that any model has to implement if it
+                is to be used in an offline fashion along with some method like the lspi


This wording is confusing. lspi is a class, not a method.

davisking · 2018-04-26T11:49:54Z

dlib/control/approximate_linear_models_abstract.h

+        /*!
+            WHAT THIS OBJECT REPRESENTS
+                This object defines the inferface that any model has to implement if it
+                is to be used in an online fashion along with some method like the qlearning


Say "if it is to be used by an object such as the qlearning class". The word "method" means a member function and is confusing in this context.

davisking · 2018-04-26T11:50:24Z

dlib/control/approximate_linear_models_abstract.h

+                method defined in the file qlearning_abstract.h.
+
+                Being online means that the model doesn't hold prior data but it interacts
+                with the environment and performing actions from some given state turning


fix grammar

I also don't really understand what this is saying. This needs to be clarified. A good way to do this is to talk explicitly about the member functions in this interface that are not in the offline one. You can talk about how this is like the offline version but that it additional has such and such and that is useful for so and so.

davisking · 2018-04-26T14:56:51Z

dlib/control/qlearning_abstract.h

+        );
+        /*!
+            requires
+                - discount >= 0 and discount <= 1.


Can it really be 0?

I think so, a discount of 0 means that the algorithm is totally short-sighted. It will only consider the immediate reward.

davisking · 2018-04-26T14:57:59Z

dlib/control/qlearning_abstract.h

+                - prng_engine is a pseudo-random number generator class like the ones
+                  defined in std::random. By default it is the standard one.
+            ensures
+                - returns the policy resulting of applying the learning function over


What is the policy object used for? The text here doesn't say anything about it.

davisking · 2018-04-26T15:01:03Z

dlib/control/sarsa_abstract.h

+                that is, the current expected reward from there, and the new expected qvalue.
+
+                Note that, unlike qlearning, sarsa is an on-policy reinforcement learning
+                algorithm meaning that it takes the policy into account while learning.


See my other comment about this being a little confusing. Reader unfamiliar with these algorithms will think this is saying something like "q-learning" ignores the policy or something wrong like that.

davisking · 2018-04-26T15:01:42Z

dlib/control/sarsa_abstract.h

+                - prng_engine is a pseudo-random number generator class like the ones
+                  defined in std::random. By default it is the standard one.
+            ensures
+                - returns the policy resulting of applying the learning function over


The policy argument is not mentioned. What is it used for?

davisking · 2018-04-26T15:03:54Z

dlib/test/reinforcement_learning.cpp

+        ) const
+        {
+            auto best = numeric_limits<double>::lowest();
+            auto best_indexes = std::vector<int>();


You don't have to change it, but I just wanted to say that writing it this way is a little bizarre :)

Why not just write: std::vector<int> best_indexes;?

adrianjav · 2018-04-30T13:54:14Z

Thanks for another good review Davis! I've just looked quickly the comments and I agree with what I've read so far. I'm quite busy these days but 'll try to fix everything in the next weeks

davisking · 2018-04-30T14:17:34Z

No problem :)

adrianjav · 2018-06-03T23:14:47Z

I think I've fixed everything so far. Some things that are worth mentioning (skiping documentation changes):

I have removed the thread-safety requirements. It doesn't that useful for those classes and "keep it simple, stupid".
I'm keeping the mutable random_engine attribute on the classes. I don't consider it a internal state of the object and the only reason is not static/global is because of serialization.
Serialization works. I have added a serialization test to see whether it works as expected.
For keeping the code clean, I have added to serialize.h methods for all the instances of random devices from the standard library. They are serialized by just dumping their internal code into streams.

Waiting for travis and appveyor tests to finish though.

sandsmark · 2020-02-25T17:33:18Z

Are all the comments from the previous review fixed? Just in case Someone™ wanted to pick this up to get it merged.

davisking · 2020-02-26T02:19:01Z

I'm not sure. I had my second child right around the time this PR was made, and got severely sidetracked by that and a bunch of work stuff. I never ended up looking at this PR, since it's big and time consuming to review. I admit I feel bad about this, and I've left the PR open as a reminder of my shame in reviewing this. One of the issues though is I'm not sure if anyone really wants to use this. The state-of-the-art in RL is all deep learning stuff, rather than the simpler things in this diff. So I'm hesitant to merge things that are not really useful to users (that's not to say @adrianjav didn't do a nice job. He absolutely did.)

Anyway, are you @sandsmark or @adrianjav using this stuff or interested in it? If so then maybe it's worth it. In particular, is @adrianjav using this and getting value out of it?

adrianjav · 2020-02-26T09:10:26Z

Hi @sandsmark. This PR is two-year old, as far as I remember I fixed everything that was reviewed, although I am not sure anymore.

First, thanks @davisking :) Regarding your questions, I am not using nor interested in RL stuff. This was an attempt of making my university exercises useful for a broader audience and I found a sweet spot in your library. At this moment I am doing a PhD in machine learning, yet I don't focus on RL at all, so I don't get any real value out of this.

With that said, I believe that having these (simple) implementations in your library can't be harmful. Worst case it will be used for educational purposes or as a template to implement more complex algorithms (e.g., PPO). I will be happy to rescue this PR if that will useful for someone.

davisking · 2020-02-28T03:52:34Z

Yeah, I guess at least it's useful as an educational resource. I'll go over this again this weekend and see about merging it :)

davisking · 2020-02-29T14:52:00Z

Can you give me write access to this PR? See for instructions https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/allowing-changes-to-a-pull-request-branch-created-from-a-fork

I just rebased this branch on master, fixed some conflicts, and tried to push to the PR but it's not letting me.

adrianjav · 2020-03-01T11:11:49Z

The "Allow edits from maintainers" option was already selected for me, you should have write access to my branch.

davisking · 2020-03-01T15:48:16Z

Huh. I just tried again and still the same error.

remote: Resolving deltas: 100% (108/108), completed with 11 local objects.
To https://github.com/adrianjav/dlib.git
 ! [remote rejected]   adrianjav-qlearning+sarsa -> adrianjav-qlearning+sarsa (permission denied)
error: failed to push some refs to 'https://github.com/adrianjav/dlib.git'

Anyway, maybe it's because I'm trying to push a rebased change. Can you rebase on master and resolve the errors?

adrianjav · 2020-03-01T21:59:04Z

Ok, that is weird, this is what I actually see from my side:

I will try to rebase to my master branch tomorrow, I don't have the git in local at the moment. I will be back to you as soon as I do it.

adrianjav · 2020-03-08T11:12:29Z

I just rebased this branch with the latest version of dlib. I locally tested the example and unit tests and both work. Hope you can edit the branch now.

…d compability with lspi.

- Improved English notes - Merged feature_extractor and model (now the user can choose which one to implement, offline or online model) - Improved training function header in qlearning and sarsa

This reverts commit ee5428a.

There was a compilation error that happened on gcc4.8.4 on travis but not in my local compiler

sandsmark · 2020-04-10T15:40:47Z

Anyway, are you @sandsmark or @adrianjav using this stuff or interested in it?

For some definition of "use", I guess. I used to make a game a year for an AI competition I held, and I wanted to test SARSA with them. Just for fun, and because I've been thinking about starting the competition up again, and it's much easier to tune games before release with a non-trivial bot (and so far I only used my own trivial Q-learning implementations).

And the reason I asked was because I thought about fixing any remaining issues myself to get it merged, but I wasn't entirely sure what needed to be fixed.

davisking · 2020-04-11T02:14:07Z

I'll go over this PR again and probably fix whatever needs fixing (if anything) and merge it. It's a big PR though so it needs a chunk of time where I can sit and work on it. I just haven't been able to get time for it the last few weeks. I'll do it soon though.

adrianjav · 2020-04-11T14:18:30Z

Totally understandable, these last weeks haven't been easy for anyone. Stay safe!

davisking requested changes Jan 7, 2018

View reviewed changes

davisking requested changes Apr 27, 2018

View reviewed changes

davisking added the enhancement label Sep 4, 2018

Repository owner deleted a comment from dlib-issue-bot Sep 4, 2018

adrianjav closed this Mar 1, 2020

adrianjav reopened this Mar 1, 2020

Adrián Javaloy and others added 14 commits March 14, 2020 19:34

Added a first version of Qlearning + updated rl interface

650c1d5

Added Sarsa and an example. Everything is working alright.

f0bd6a8

Added test for the reinforcement learning methods and checked backwar…

8b241b2

…d compability with lspi.

Fixed reward bug + model bug + prng support

d20c510

Commented example + updated abstracts

3e56eb5

Applied review notes

160db7d

- Improved English notes - Merged feature_extractor and model (now the user can choose which one to implement, offline or online model) - Improved training function header in qlearning and sarsa

travis hotfix

8830753

Revert "travis hotfix"

c5cf9b3

This reverts commit ee5428a.

real travis hotfix

ebc5648

There was a compilation error that happened on gcc4.8.4 on travis but not in my local compiler

templated template parameters must have class, not typename

9650cd8

Applied the notes of the second review

cade6fb

changed uint type to int

b6f1fde

Fixed "cannot call member function without object"

eaa621a

cleanup and add some tests

e3f2d28

davisking force-pushed the qlearning+sarsa branch from 286dedf to e3f2d28 Compare March 14, 2020 23:35

Qlearning+sarsa v2.0 #1005

Are you sure you want to change the base?

Qlearning+sarsa v2.0 #1005

Conversation

adrianjav commented Dec 9, 2017 • edited

davisking commented Dec 10, 2017

adrianjav commented Dec 12, 2017 • edited

travek commented Dec 28, 2017

davisking commented Dec 28, 2017 via email

davisking left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adrianjav commented Jan 7, 2018

adrianjav commented Feb 5, 2018

adrianjav commented Feb 18, 2018

davisking commented Feb 19, 2018

davisking commented Mar 6, 2018 via email

adrianjav commented Mar 6, 2018

davisking commented Mar 7, 2018 via email

travek commented Apr 15, 2018

davisking commented Apr 15, 2018

davisking left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adrianjav commented Apr 30, 2018

davisking commented Apr 30, 2018

adrianjav commented Jun 3, 2018

sandsmark commented Feb 25, 2020

davisking commented Feb 26, 2020

adrianjav commented Feb 26, 2020

davisking commented Feb 28, 2020

davisking commented Feb 29, 2020

adrianjav commented Mar 1, 2020

davisking commented Mar 1, 2020

adrianjav commented Mar 1, 2020

adrianjav commented Mar 8, 2020

sandsmark commented Apr 10, 2020

davisking commented Apr 11, 2020

adrianjav commented Apr 11, 2020

adrianjav commented Dec 9, 2017 •

edited

adrianjav commented Dec 12, 2017 •

edited