
Qlearning+sarsa v2.0 #1005

Open · wants to merge 14 commits into base: master

Conversation

@adrianjav commented Dec 9, 2017

This pull request adds more reinforcement learning functionality to dlib. Namely, it implements the Q-learning and SARSA algorithms. I think I've done a good job with respect to backward compatibility and code style so far, but let me know about anything that needs a review.
A list of the changes:

  • Qlearning and SARSA algorithms (files qlearning.h and sarsa.h)
  • Separated the policy into its own header (policy.h) with two implementations (greedy_policy and epsilon_policy).
  • An example that shows both algorithms working (qlearning_sarsa_ex.cpp)
  • The same example as a unit test (reinforcement_learning.cpp). It works for me, but some more testing would be nice since convergence could depend on the machine.
  • An interface for the model the agents will work with (model_abstract.h).

About the mechanics: these new algorithms avoid using the process_sample objects that lspi relies on. Instead they run on a model (usually an online model) that lets them take steps and get feedback on their choices. What the algorithms really train aren't models but policies. That way you can specify a custom policy and custom initial weights, and, while training, the algorithms use an epsilon-greedy policy with the given policy as the underlying one.
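To make this concrete, the intended usage looks roughly like this (a sketch only; the class names come from the headers above, but check model_abstract.h and qlearning_abstract.h for the exact member names and signatures):

// A user-implemented class satisfying the model interface in model_abstract.h.
example_online_model model;

// Set up the trainer (member names here are illustrative).
qlearning<example_online_model> trainer;
trainer.set_discount(0.8);        // how much future rewards matter
trainer.set_learning_rate(0.2);   // step size of each update

// Training interacts with the model online and returns a trained policy;
// the exact signature may also take an initial policy and a prng engine.
auto policy = trainer.train(model);

// The returned policy can then be asked for an action in a given state.
auto action = policy(model.initial_state());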

Related PR: #824

Edit:
Hmm... AppVeyor fails on the unit tests. Any suggestions on how to test whether they work? Since their convergence is non-deterministic and qlearning can get worse after converging, I find it hard to test.

@davisking (Owner)

Wow, this is a lot of stuff. Thanks for working on this PR :)

I haven't looked it over very carefully, but it generally looks good. You should flesh out the comments in the qlearning_sarsa_ex.cpp example a bit so that it reads like a tutorial (e.g. explains the relevant model details so that someone who isn't already familiar with Q learning can understand and learn what it does).

Also definitely fix the failing unit tests. Most of them are working though so that's good. Make sure your random number generators are all seeded the same every run so the tests reliably do the same thing. You don't want unit tests to have some small chance of failure. Beyond that I don't have much advice other than to dig in and debug it.
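For example, a test can construct its engines from a fixed literal seed instead of std::random_device, so every run follows the same trajectory (generic C++, not code from this PR):

#include <random>

// Fixed seed: every test run draws exactly the same sequence.
std::default_random_engine engine(42);

// Avoid this in tests, it reseeds differently on every run:
// std::default_random_engine bad_engine(std::random_device{}());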

Let me know when you think the PR is ready for a full review.

@adrianjav (Author) commented Dec 12, 2017

Thank god the tests failed!
Reviewing the code I saw a terrible bug when adding rewards. I've changed that and added custom PRNG support when training. Now everything seems to go smoothly as intended.
I've also rewritten the comments in the example.

I think now it's ready for a full review.

@travek commented Dec 28, 2017

Hi!
When will this feature be merged into master?

@davisking (Owner) commented Dec 28, 2017 via email

@davisking (Owner) left a comment

I spent a while looking this over, and I think it's pretty good overall, although I do have a lot of comments.

A big one is I think that the feature extractor and model classes should just be one class. Both of these classes are not provided by dlib but implemented by the user. The dlib code doesn't really even know that there are two classes.

The central challenge is communicating to the user what they have to do to correctly implement a "model". That would be easier if it was presented to them as a single class they have to implement which has some methods that they are only required to implement if using certain algorithms.
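Roughly something like this, where the second group of members is only required by some of the algorithms (all names below are illustrative, not a final interface; initial_state and step in particular are made-up placeholders for whatever the online interaction members end up being):

// Illustrative sketch of a single user-implemented "model" class.
class example_model
{
public:
    typedef int state_type;     // whatever fits the user's problem
    typedef int action_type;

    // Required by everything (this is what example_feature_extractor provides today):
    unsigned long num_features() const;
    void get_features (
        const state_type& state,
        const action_type& action,
        matrix<double,0,1>& feats
    ) const;
    action_type find_best_action (
        const state_type& state,
        const matrix<double,0,1>& w
    ) const;

    // Required only when using the online algorithms (qlearning, sarsa):
    state_type initial_state() const;
    action_type random_action (const state_type& state) const;
    state_type step (const state_type& state, const action_type& action, double& reward) const;
};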


namespace dlib
{

// ----------------------------------------------------------------------------------------

template <
typename T,
typename U
@davisking (Owner):

What are these template arguments? The documentation doesn't mention them.

struct example_feature_extractor
{
/*!
WHAT THIS OBJECT REPRESENTS
This object defines the interface a feature extractor must implement if it
is to be used with the process_sample and policy objects defined at the
bottom of this file. Moreover, it is meant to represent the core part
@davisking (Owner):

This is wrong now, right? Since process_sample and the policy objects now take models rather than feature extractors.

Also, at this point in my review I still don't see the need to have these other objects depend on models rather than directly on feature extractors. But maybe I'll see why that's a good idea when I get more into the review.

of a model used in a reinforcement learning algorithm.

In particular, this object models a Q(state,action) function where
Q(state,action) == dot(w, PSI(state,action))
where PSI(state,action) is a feature vector and w is a parameter
vector.

Therefore, a feature extractor defines how the PSI(x,y) feature vector is
Therefore, a feature extractor defines how the PSI(x,y) feature vector is
@davisking (Owner):

Fix whitespace.

Also, try not to make whitespace changes since it makes the review harder.

@@ -56,41 +59,29 @@ namespace dlib
- returns the dimensionality of the PSI() feature vector.
!*/

action_type find_best_action (
const state_type& state,
const matrix<double,0,1>& w
@davisking (Owner):

This seems confusing. I see that the lspi code (in dlib/test/lspi.cpp) still defines feature extractors this way. So this is wrong isn't it? This function still needs to be here.

A better way to organize this model/feature extractor stuff is probably to have the documentation say there is just one class, not two. Then you can define two different versions of the model class. There is the basic one which was previously defined by the example_feature_extractor. Then there is the more detailed one with the new things you want to add to models. The new one has all the functions the feature extractor had, but with some additional new ones. Then when you use any model class it's not like it depends on some feature extractor class. It's just one class.

After all, the model is the thing the user makes. So it's weird that later on it's defined to take a feature extractor as a template argument.

Anyway, my point is that it's really unclear what a model or feature extractor is right now. Clarifying that is the most important thing here since that's the interface a user of the software is expected to implement. If it's not really clear people will just not use it and this whole RL module will never be used. Making that model interface simple so people can easily implement it is the central challenge.

const state_type& state,
const action_type& action,
matrix<double,0,1>& feats
matrix<double,0,1> get_features (
@davisking (Owner):

This was changed in the new code, but the LSPI code still does it the old way. Everything needs to be the same.

Also, don't do this. This change makes the code slower, since now each call to get_features allocates a block of memory. The other way avoids the reallocation. I realize move constructors make returning the matrix not super slow. But reallocating memory in a tight loop is also bad.
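In other words (illustrative signatures only):

// Preferred: the caller owns the buffer, so a tight training loop can reuse
// the same matrix across calls instead of allocating a new one every time.
void get_features (
    const state_type& state,
    const action_type& action,
    matrix<double,0,1>& feats
) const;

// Discouraged here: each call constructs and returns a fresh matrix, so the
// loop pays an allocation per call even though RVO/moves avoid the deep copy.
matrix<double,0,1> get_features (
    const state_type& state,
    const action_type& action
) const;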

template <
typename model_type
>
class example_policy
@davisking (Owner):

Why is there an example policy object?

model_abstract.h.

WHAT THIS OBJECT REPRESENTS
This objects is an implementation of the well-known reinforcement learning
@davisking (Owner):

Fix spelling.


objects and algorithms

@davisking (Owner):

The text is wrong, it doesn't use any process_samples

Please make sure all the documentation is correct. I am extremely anal about having correct and clear documentation :). You should review all the comments a few times to make sure they are correct and clear before the next PR review.

) const;
/*!
requires
- policy is of the form example_policy<model_type>, i.e., an instance of
@davisking (Owner):

This is too general. You should change this function to take a policy<model> which is much more specific. So far I don't see any reason to not be more specific. That is, why would a user want to create their own policy object?

defined in std::random. By default it assumes it to be the standard
default_random_engine class.
ensures
- returns a policy of the type policy_type as the result of applying the
@davisking (Owner):

The wording of this is confusing and ungrammatical.

);

reward_type total_reward = static_cast<reward_type>(0);
for(auto iter = 0u; iter < iterations; ++iter){
@davisking (Owner):

Put { on a line by itself so the code style is consistent with all the other code in this file and other parts of dlib.
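I.e., formatted like this:

reward_type total_reward = static_cast<reward_type>(0);
for (auto iter = 0u; iter < iterations; ++iter)
{
    // loop body unchanged
}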

@adrianjav (Author)

Thanks for your comments!

I strongly agree with you about merging feature_extractor and model into one class. I might have overthought it just because feature_extractor was already there and I tried to separate the model from its representation. But in the end it's confusing for the user.

The other things come mostly from rushing, being used to my own code style, and English not being my first language.

I'll try to fix them all in the next few days.

@adrianjav (Author)

It's been a month since my last post and I feel like I have to say something. I've been applying the changes from the review but I'm still missing a few things.

I haven't forgotten about it; I've just been really busy with my actual job. As soon as I have some spare time I'll post the last changes.

@adrianjav (Author)

I've finally applied the requested changes to the old code. The most notable change is that, as you suggested, I describe two classes in the abstract: an offline model (the former feature extractor) and an online model, so the user can choose which one to implement depending on their needs.

Overall I agree with the notes you took in the review. The only thing I could argue is that I returned matrices by value on purpose, trusting the compiler's RVO (return value optimization). Either way, I've also changed that and reviewed the comments twice.

@davisking (Owner)

Cool, thanks for the update. I'll review the PR sometime over the coming week.

@davisking (Owner) commented Mar 6, 2018 via email

@adrianjav (Author)

Hahaha don't worry, I'm sure you'll do it. I'm quite busy as well.
Congratulations on your newborn! 👶

@davisking (Owner) commented Mar 7, 2018 via email

@travek commented Apr 15, 2018

Davis,
when can you merge this PR?

@davisking (Owner)

Not sure. I'll get to it though. You can use it now, no need to wait on me. In fact, other people testing and reporting back here will help me in my own review when I get to it.

@davisking (Owner) left a comment

Ok, so I looked this over and it's pretty good. I think the class structure is nice. I did find a bunch of things that need to be fixed before merging though.

One of the big ones is that the comments say the models are thread safe, but then there are these mutable random number generators in the implementations. I noted this in my comments. That definitely needs to be fixed, and I think the right thing to do is probably to make them not mutable and then let the member functions that mutate them be non-const. I haven't worked out the implications of that for the interface; if it's unreasonable then don't do it and keeping the mutable members is fine. But the best interface pattern with regards to thread safety is to be able to say that const members are thread safe and non-const ones aren't, because they mutate memory. So if that can be done here without making the API irritating to the user in some way then that's the best. The other option is to simply remove the notes about the model being thread safe. I don't think there is any need in the code we have for it to actually be thread safe, although it would be nice. But the main issue is that whatever type of thread safety is claimed by the documentation must really exist in the code.

Also, sorry about the delay. I've been unusually busy :)
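To sketch the pattern I have in mind on the epsilon greedy policy (illustrative only, I haven't checked how it plays with the rest of the interface; the nested typedefs and get_model() access are assumptions):

#include <random>

template <typename policy_type, typename prng_engine = std::default_random_engine>
class epsilon_policy
{
public:
    // Non-const: drawing from the engine advances its state, and the signature
    // says so. Concurrent calls on the same object are then clearly not safe,
    // while const members stay safe to call from multiple threads.
    typename policy_type::action_type operator() (
        const typename policy_type::state_type& state
    )
    {
        std::bernoulli_distribution d(epsilon);
        return d(gen) ? underlying_policy.get_model().random_action(state)
                      : underlying_policy(state);
    }

private:
    policy_type underlying_policy;
    double epsilon;
    prng_engine gen;   // an ordinary member rather than a mutable one
};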

) const
{
std::bernoulli_distribution d(epsilon);
return d(gen) ? get_model().random_action(state) : underlying_policy(state);
@davisking (Owner):

This was a little confusing. Why doesn't this say the following?

return d(gen) ? underlying_policy.random_action(state) : underlying_policy(state);

I thought there were two different models here at first. In general, less indirection is better, or at least the code should use consistent indirection so that it's clear what is being referenced. The way it's written here makes it look like there are two different models in play when really it's just one.

@adrianjav (Author):

random_action is a method of the online model. I have substituted that with:

return d(gen) ? underlying_policy.get_model().random_action(state) : underlying_policy(state);

serialize(version, out);
serialize(item.get_policy(), out);
serialize(item.get_epsilon(), out);
serialize(item.get_generator(), out);
@davisking (Owner):

Does this work? Are there serialize routines defined for the random number generators in std::? You should add unit tests that invoke the serialization routines for these new objects to make sure they all work.
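For reference, the standard engines do define operator<< and operator>> for streaming their state, so serialization can be layered on top of that, but verify it with a round-trip test roughly along these lines (generic C++, not the actual dlib test code):

#include <cassert>
#include <random>
#include <sstream>

int main()
{
    std::default_random_engine e1(7);
    e1.discard(100);                  // advance the engine a bit

    std::stringstream ss;
    ss << e1;                         // standard: writes the engine's state as text

    std::default_random_engine e2;
    ss >> e2;                         // restore that state into a fresh engine

    assert(e1 == e2);                 // same internal state...
    assert(e1() == e2());             // ...and the same future output
    return 0;
}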

where PSI(state,action) is a feature vector and w is a parameter
vector.
This object defines the inferface that any model has to implement if it
is to be used in an offline fashion along with some method like the lspi
@davisking (Owner):

This wording is confusing. lspi is a class, not a method.

/*!
WHAT THIS OBJECT REPRESENTS
This object defines the inferface that any model has to implement if it
is to be used in an online fashion along with some method like the qlearning
@davisking (Owner):

Say "if it is to be used by an object such as the qlearning class". The word "method" means a member function and is confusing in this context.

method defined in the file qlearning_abstract.h.

Being online means that the model doesn't hold prior data but it interacts
with the environment and performing actions from some given state turning
@davisking (Owner):

fix grammar

@davisking (Owner):

I also don't really understand what this is saying. This needs to be clarified. A good way to do this is to talk explicitly about the member functions in this interface that are not in the offline one. You can talk about how this is like the offline version but that it additionally has such and such, and that this is useful for so and so.

);
/*!
requires
- discount >= 0 and discount <= 1.
@davisking (Owner):

Can it really be 0?

@adrianjav (Author):

I think so; a discount of 0 means that the algorithm is totally short-sighted. It will only consider the immediate reward.
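In terms of the standard (tabular) Q-learning update, the discount only scales the estimated future value, so a discount of 0 simply drops that term:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

\text{with } \gamma = 0:\qquad Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r - Q(s,a) \right]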

- prng_engine is a pseudo-random number generator class like the ones
defined in std::random. By default it is the standard one.
ensures
- returns the policy resulting of applying the learning function over
@davisking (Owner):

What is the policy object used for? The text here doesn't say anything about it.

that is, the current expected reward from there, and the new expected qvalue.

Note that, unlike qlearning, sarsa is an on-policy reinforcement learning
algorithm meaning that it takes the policy into account while learning.
@davisking (Owner):

See my other comment about this being a little confusing. Readers unfamiliar with these algorithms will think this is saying something like "q-learning ignores the policy" or something wrong like that.
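Maybe spelling out the actual difference would help. In pseudocode (standard definitions, not wording from this diff), both algorithms follow the policy when acting; they only differ in the target of the update:

// Q-learning (off-policy): the target uses the best action in the next state,
// whatever the behaviour policy actually does next.
target = reward + discount * max_over_actions(Q(next_state, a));

// SARSA (on-policy): the target uses the action the policy actually picked
// in the next state.
target = reward + discount * Q(next_state, next_action);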

- prng_engine is a pseudo-random number generator class like the ones
defined in std::random. By default it is the standard one.
ensures
- returns the policy resulting of applying the learning function over
@davisking (Owner):

The policy argument is not mentioned. What is it used for?

) const
{
auto best = numeric_limits<double>::lowest();
auto best_indexes = std::vector<int>();
@davisking (Owner):

You don't have to change it, but I just wanted to say that writing it this way is a little bizarre :)

Why not just write: std::vector<int> best_indexes;?

@adrianjav (Author)

Thanks for another good review, Davis! I've just looked quickly through the comments and I agree with what I've read so far. I'm quite busy these days but I'll try to fix everything in the next few weeks.

@davisking (Owner)

No problem :)

@adrianjav (Author)

I think I've fixed everything so far. Some things worth mentioning (skipping documentation changes):

  • I have removed the thread-safety requirements. It doesn't seem that useful for these classes, and "keep it simple, stupid".
  • I'm keeping the mutable random_engine attribute on the classes. I don't consider it an internal state of the object, and the only reason it's not static/global is serialization.
  • Serialization works. I have added a serialization test to check whether it works as expected.
  • To keep the code clean, I have added methods to serialize.h for all the instances of random devices from the standard library. They are serialized by just dumping their internal state into streams.

Waiting for travis and appveyor tests to finish though.

Repository owner deleted a comment from dlib-issue-bot Sep 4, 2018
@sandsmark (Contributor)

Are all the comments from the previous review fixed? Just in case Someone™ wanted to pick this up to get it merged.

@davisking (Owner)

I'm not sure. I had my second child right around the time this PR was made, and got severely sidetracked by that and a bunch of work stuff. I never ended up looking at this PR, since it's big and time consuming to review. I admit I feel bad about this, and I've left the PR open as a reminder of my shame in reviewing this. One of the issues though is I'm not sure if anyone really wants to use this. The state-of-the-art in RL is all deep learning stuff, rather than the simpler things in this diff. So I'm hesitant to merge things that are not really useful to users (that's not to say @adrianjav didn't do a nice job. He absolutely did.)

Anyway, are you @sandsmark or @adrianjav using this stuff or interested in it? If so then maybe it's worth it. In particular, is @adrianjav using this and getting value out of it?

@adrianjav (Author)

Hi @sandsmark. This PR is two years old; as far as I remember I fixed everything that was reviewed, although I am not sure anymore.

First, thanks @davisking :) Regarding your questions, I am not using nor interested in RL stuff. This was an attempt to make my university exercises useful for a broader audience, and I found a sweet spot in your library. At the moment I am doing a PhD in machine learning, but I don't focus on RL at all, so I don't get any real value out of this.

With that said, I believe that having these (simple) implementations in your library can't hurt. Worst case, they will be used for educational purposes or as a template for implementing more complex algorithms (e.g., PPO). I will be happy to rescue this PR if that would be useful for someone.

@davisking (Owner)

Yeah, I guess at least it's useful as an educational resource. I'll go over this again this weekend and see about merging it :)

@davisking (Owner)

Can you give me write access to this PR? See here for instructions: https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/allowing-changes-to-a-pull-request-branch-created-from-a-fork

I just rebased this branch on master, fixed some conflicts, and tried to push to the PR but it's not letting me.

@adrianjav (Author)

The "Allow edits from maintainers" option was already selected for me, you should have write access to my branch.

@davisking (Owner)

Huh. I just tried again and still the same error.

remote: Resolving deltas: 100% (108/108), completed with 11 local objects.
To https://github.com/adrianjav/dlib.git
 ! [remote rejected]   adrianjav-qlearning+sarsa -> adrianjav-qlearning+sarsa (permission denied)
error: failed to push some refs to 'https://github.com/adrianjav/dlib.git'

Anyway, maybe it's because I'm trying to push a rebased change. Can you rebase on master and resolve the errors?

@adrianjav closed this Mar 1, 2020
@adrianjav reopened this Mar 1, 2020
@adrianjav (Author)

Ok, that is weird, this is what I actually see from my side:
[screenshot omitted]

I will try to rebase onto my master branch tomorrow; I don't have the git repo locally at the moment. I will get back to you as soon as I do it.

@adrianjav (Author)

I just rebased this branch onto the latest version of dlib. I locally tested the example and the unit tests and both work. Hope you can edit the branch now.

@sandsmark (Contributor)

Anyway, are you @sandsmark or @adrianjav using this stuff or interested in it?

For some definition of "use", I guess. I used to make a game a year for an AI competition I held, and I wanted to test SARSA with them. Just for fun, and because I've been thinking about starting the competition up again, and it's much easier to tune games before release with a non-trivial bot (and so far I only used my own trivial Q-learning implementations).

And the reason I asked was because I thought about fixing any remaining issues myself to get it merged, but I wasn't entirely sure what needed to be fixed.

@davisking (Owner)

I'll go over this PR again and probably fix whatever needs fixing (if anything) and merge it. It's a big PR though so it needs a chunk of time where I can sit and work on it. I just haven't been able to get time for it the last few weeks. I'll do it soon though.

@adrianjav (Author)

Totally understandable, these last weeks haven't been easy for anyone. Stay safe!
