"An End-to-end Deep Reinforcement Learning Approach for the Long-term Short-term Planning on the Frenet Space"
-
[
2020
] [📝] [] [ 🎓University of California
,Istanbul Technical University
] -
[
highway
,mid-to-mid
,CARLA
]
Click to expand
Left: actually, this is not end-to-end since the agent does not receive raw images and does not control low-level actuators. Rather mid-to-mid. Middle: the reward function penalizes speed deviations, considering whether a lane change was 'beneficial'. Right: take-away: RL can bring efficiency (higher speed) but lacks safety. Source. |
For feature extraction, two different 1D convolution layers are used separately. One for ego states, (represented w.r.t a fixed reference point) and one for surrounding vehicles’ states (relative to the ego vehicle). Source. |
Authors: Moghadam, M., Alizadeh, A., Tekin, E., & Elkaim, G. H.
-
Main motivation:
- Compare
mid-to-mid
RL
to classical modular rule-based architectures on highway scenarios. - Conclusion:
RL
can be more efficient but less safe.
-
Baseline:
-
C.f.
"An Autonomous Driving Framework for Long-term Decision-making and Short-term Trajectory Planning on Frenet Space"
by the same authors in the section Rule-based Decision Making. -
Hierarchical layers:
BP
decides about lane changes (discrete) and target speeds based onIDM
andMOBIL
.LP
generates a continuous spline in the Frenet space to implementBP
's decisions.
-
action
:- Do not predict low-level commands. Rather produce a trajectory. Hence
x-to-mid
. -
[Elegant parametrization] "A polynomial trajectory, aka a lattice, can be characterized using three continuous values:
vf
,df
, andtf
[speed
,lateral position
, andarrival time
]." - Each value is in the [
−1
,1
] range. -
"Exploring various regions in the
action
space is equivalent to examining different splines in the driving corridors." Delta-t
?
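[My own sketch, not the authors' code] The (`vf`, `df`, `tf`) lattice parametrization can be illustrated with the classic rest-to-rest quintic used for lateral motion in the Frenet frame: once the target lateral offset `df` and arrival time `tf` are fixed, the whole spline is determined.

```python
def quintic_lateral(d0, df, tf):
    """Rest-to-rest quintic: d(0)=d0, d(tf)=df, with zero lateral velocity
    and acceleration at both ends (the classic minimum-jerk profile)."""
    def d(t):
        s = t / tf
        return d0 + (df - d0) * (10 * s**3 - 15 * s**4 + 6 * s**5)
    return d

# Map a normalized action in [-1, 1] to a lateral target, e.g. a fraction
# of one lane width (the mapping and values below are hypothetical).
lane_width = 3.5
action_df = 0.6                 # hypothetical normalized action value
d = quintic_lateral(0.0, action_df * lane_width, tf=4.0)
```

Exploring the `action` space then indeed amounts to exploring different splines in the driving corridor.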
-
state
.-
About
ego
:lat-position
(lane),speed
.- I do not see the point of considering its
longitudinal
position for this highway task.-
"The
longitudinal
position of the ego is normalized w.r.t. its initial position and the total track length TL." - Highway driving is a continuous task. Thus there cannot be any notion of "progress" toward a "goal position".
-
-
About
others
:position
andspeed
relative to the ego.N=14
slots available.-
"If a region is unoccupied, the corresponding value in the input tensor get
−1
." [One could also have used additional flags instead.]
-
Temporal dimension: describe the trajectories.
- Stack of past
30
state
s. -
"We use convolution operation along time channel to extract the encoded time-series features from the past trajectories."
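[My own sketch] The `state` encoding described above (14 surrounding-vehicle slots filled with −1 when unoccupied, stacked over 30 past steps) could be built like this; the slot layout and ego features are assumptions:

```python
# Hypothetical shapes: 30 past steps, 14 surrounding-vehicle slots,
# each slot carrying (relative position, relative speed).
T, N = 30, 14

def encode_step(ego_state, surrounding):
    """surrounding: dict slot_index -> (rel_pos, rel_speed).
    Unoccupied slots are filled with -1, as in the paper."""
    features = list(ego_state)                     # e.g. (lat_pos, speed)
    for slot in range(N):
        rel_pos, rel_speed = surrounding.get(slot, (-1.0, -1.0))
        features += [rel_pos, rel_speed]
    return features

history = [encode_step((0.0, 25.0), {0: (12.0, -2.0)}) for _ in range(T)]
# history is a T x (2 + 2N) tensor; the 1D convolution slides along the
# time axis to extract time-series features from the past trajectories.
```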
-
reward
- Coarse-grain: large penalties for
collisions
andoff-road
. - Fine-grain: deviation to the
target speed
, conditioned on thespeed gain
brought by the lane change. - The ego-
longitudinal
position is not considered. No reward for longitudinalprogress
, no penalty fortimeout
.- Again, I do not see the point of considering this information in the
state
.
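[My own guess at the structure, with placeholder coefficients] The two-level `reward` could look like:

```python
def reward(collision, off_road, speed, target_speed, lane_changed, speed_gain):
    """Sketch of the coarse-grain / fine-grain reward described above.
    The coefficients are my own placeholders, not the paper's values."""
    if collision or off_road:
        return -10.0                     # coarse-grain: large penalty
    r = -abs(speed - target_speed) / target_speed   # fine-grain: speed deviation
    if lane_changed and speed_gain <= 0.0:
        r -= 0.5                         # the lane change was not 'beneficial'
    return r
```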
"Combining Reinforcement Learning with Model Predictive Control for On-Ramp Merging"
-
[
2020
] [📝] [ 🎓University of Illinois
] [ 🚗Xmotors.ai
] -
[
prediction model
]
Click to expand
Bottom-left: the MPC solver performs a search over the discretized S-T space. Right: (disappointing?) results for the proposed combined approach. As for the on-distribution evaluation, the combined RL-MPC agent is outperformed by the RL agent in terms of efficiency and comfort. In addition, it is not able to avoid crashes in some out-of-distribution traffic scenarios (different traffic densities and traffic speeds). Source. |
Authors: Lubars, J., Gupta, H., Raja, A., Srikant, R., Li, L., & Wu, X.
-
Main motivation:
- Combine
MPC
with model-freeRL
. - Limitation of model-free
RL
: lack of safety guarantees and interpretability. - Limitation of
MPC
: imperfect predictive model.-
"
DDPG
can implicitly learn how to best interact with other drivers, without an explicit model for their behavior."
-
-
Main idea:
- Mask
RL
actions based on some safety checker that considersprediction
, as opposed to checks on single steps. - And use a
MPC
-based planner as "safe
" fallback policy.-
[About the solver] "The resolution of the lattice is
0.3
seconds in time and0.05
meters in distance, and planning was done over a horizon of5
seconds and150
meters." - Examples of
weights
incost function
forMPC
:w1
=10,000,000
for getting within an unsafe distance of an obstacle, whilew3
=0.5
for the difference from thedesired speed
.- Therefore no hard safety constraint.
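[My own sketch] Such a soft-constraint cost, using the two weights quoted above (all other terms, the time window and the obstacle representation are my placeholders):

```python
def mpc_cost(traj, obstacles, desired_speed, d_unsafe=5.0,
             w1=10_000_000.0, w3=0.5):
    """Soft-constraint cost over a candidate trajectory.
    traj: list of (s, t, v) points; obstacles: list of (s, t) points.
    Unsafe proximity is penalized heavily but finitely (no hard constraint)."""
    cost = 0.0
    for s, t, v in traj:
        if any(abs(s - so) < d_unsafe and abs(t - to) < 0.3
               for so, to in obstacles):
            cost += w1                       # unsafe distance: huge but finite
        cost += w3 * (v - desired_speed) ** 2
    return cost
```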
-
- The key question: which
transition model
?- Despite parameter tuning, the
ST
solver (used forMPC
) cannot achieve the level of efficiency or comfort of theRL
-agent. Due to imperfect model.
-
How to derive a
RL
PLAN? And how to evaluate if it issafe
?-
"While our
ST
solver naturally produces a trajectory for the ego car, theRL
agent only produces anaction
u0
=µ
(x0
) for the givenobservation
. In order to predict a longer trajectory, we use our prediction modelfˆ
from theST
solver to “roll out” a longer trajectory." - Which
model
?- A "simple" model that ignores interactions:
-
"These predictions are made independently of the ego vehicle’s planned trajectory."
-
- Safety of a trajectory is evaluated based on:
1-
... some minimum distancedmin
(=5m
).-
"In the presence of imperfect prediction models, tuning the value of
dmin
can provide a trade off betweensafety
andperformance
."
-
2-
... the finalstate
.- The
ST
solver (used forMPC
) is run from the lateststate
in theRL
trajectory. - If it cannot find a safe way to proceed, the planned sequence of
RL
actions
is deemed unsafe.
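[My own sketch] The safety evaluation described above: roll the ego forward with the one-step model while replaying the `RL` policy, and reject the plan if any predicted gap drops below `dmin` (all function names are placeholders; a 1-D state stands in for the full state):

```python
D_MIN = 5.0   # meters, the safety margin quoted above

def rollout_safe(x0, policy, predict_obstacles, step_model,
                 horizon=50, d_min=D_MIN):
    """Replay the policy through the (assumed) one-step model f^ and check
    the minimum gap to the predicted obstacle positions at every step."""
    x = x0
    for k in range(horizon):
        u = policy(x)
        x = step_model(x, u)                 # ego position after one step
        for obs in predict_obstacles(k):     # obstacle-only predictions
            if abs(x - obs) < d_min:
                return False                 # trajectory deemed unsafe
    return True
```

A second check (not sketched here) then runs the `ST` solver from the final state: if it cannot find a safe continuation, the plan is also rejected.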
-
"Reinforcement Learning for Autonomous Driving with Latent State Inference and Spatial-Temporal Relationships"
-
[
2020
] [📝] [ 🎓Stanford University
,Berkeley
] [ 🚗Honda
] -
[
privileged learning
,latent variable inference
,auxiliary task
,POMDP
,driver model
,interaction modelling
,influence-passing
,GNN
,LSTM
]
Click to expand
When deciding to estimate the latent aggressiveness variable as an auxiliary task, two design choices arise: 1- Is it beneficial to share layers (shared encoder) between the policy net and the classification net? 2- Are losses from two tasks somehow coupled, e.g. by some weighted sum? The authors answer twice 'no' and show the benefit of a separated inference for this particular task. In the coupled versions, the auxiliary task does not really help, acting as a distractor. Source. |
Top-left: the T-intersection task. Bottom: in the latent state inference, the LSTM parameters are shared for all the vehicles except the ego vehicle. The output of the LSTM is then used as the initial embedding for the corresponding vehicle node in the Gt . The node embeddings are then updated by several layers of GraphSage convolution. Left: modelling the so-called influence passing between different vehicles through GNN offers interpretability: the auxiliary latent inference gives information about how the policy infers the latent states of surrounding vehicles. Right: How the performance of a trained agent degrades when the conditions vary. Source. |
Authors: Ma, X., Li, J., Kochenderfer, M. J., Isele, D., & Fujimura, K.
-
Can we draw some parallels with
"Learning by Cheating"
?- Privileged learning: Both leverage access to information that is not seen by the agent.
- Both split the task into
2
stages, which makes the initial task easier, while offering interpretability."Learning by Cheating"
:"learn to see"
/"learn to act"
.- Here:
"learn to infer hidden driving style"
/"learn to act"
.
-
Main motivations:
1-
Improve the efficiency of model-freeRL
when thestate
is not fully observable.- The idea is to extract/estimate this hidden useful information (separately) to help the primary
RL
task.-
"Inferring the latent state of traffic participants would extract useful information that an unassisted learning system would otherwise miss."
- Thus the concept of "latent state inference as an auxiliary task".
-
- It works when some
state
variables are hidden to the agent but can be retrieved from the simulator. - Examples of latent states include
intention
ordriving style
parameters.- Therefore close to topics such as driver modelling and
intent
recognition. - And their knowledge would be precious, since different values could predict drastically different outcomes.
-
"The
conservative
drivers would yield to the ego driver if they intercept laterally or the ego vehicle is approaching the lane center with a speed larger than0.5m
, while theaggressive
driver would ignore the ego vehicle."
- Therefore close to topics such as driver modelling and
2-
Interpretability.- The estimated aggressiveness forwarded to the agent can explain its reaction.
-
"The graph representation used by the
STGSage
also gives additional interpretability on the influence passing structure inside the network."
3-
Robustness against distribution shift.- One idea is to train with
observation
andtransition
noise. The separation of the auxiliary task is also beneficial.
-
Main ideas:
1-
Explicitly infer the latent state.- Main (strong) assumption:
- The true latent states of the surrounding drivers are known at
training
time.
- The true latent states of the surrounding drivers are known at
- This enables supervised learning: learn to estimate the non-observable parts of the
state
to help theRL
agent.
2-
Encode spatial-temporal relationships.-
"By combining the latent inference with relational modelling, we propose a
RL
framework that explicitly learns to infer the latent states of surrounding drivers using graph neural networks."
-
-
POMDP
state
- For each driver:
- Physical part:
position
andspeed
. - Latent part:
driving style
in {CONSERVATIVE
,AGGRESSIVE
}.
- Physical part:
- For each driver:
observation
- Only the physical part of the
state
. - The agent must estimate the latent
driving style
variables.-
"The goal of the auxiliary latent inference is to learn P(
zi.t
|o.1:t
), whereo.1:t
is the ego agent’s historicalobservation
up to timet
." - One non-learning-based option could be to perform max-likelihood inference using the
IDM
model.
-
- Only the physical part of the
action
- Target
speed
in {0m/s
,0.5m/s
,3m/s
}, implemented by some low levelPD
controller. -
"It also has a safety checker that performs an emergency break if the ego vehicle is too close to other vehicles."
- No
steering
decision: the ego-path is pre-defined.
- Target
reward
- Penalize collisions
- Reward
speed
and merge completion.
dt
0.1 s
observation
model- A small observation noise is added to the observation on the physical state of each surrounding vehicle.
- It comes from a zero-mean Gaussian. But what is the standard deviation?
transition
model- An
IDM
model decides the acceleration
of other cars. - The desired
speed
is set to3 m/s
. - But the
desired front gap
is function of the aggressiveness hidden parameter.-
"Conservative drivers vary their
desired front gap
uniformly between0.5
to0.8
of the original gap, where the aggressive driver varies between0.4
to0.7
."- Note the distributions are overlapping!
-
-
"The actual
acceleration
of the surrounding vehicles are also influenced by a Gaussian noise with a standard deviation of0.1 m/s²
."
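[My own sketch] The style-dependent `transition` model: standard `IDM`, with only the desired-gap scaling depending on the hidden style. I use the means of the two quoted (overlapping) ranges; all parameters except `v0 = 3 m/s` are placeholders:

```python
import math

def idm_accel(v, v_lead, gap, style, v0=3.0, a_max=1.0, b=1.5,
              T=1.0, s0=1.0, delta=4):
    """IDM acceleration; conservative drivers keep a larger desired gap
    (0.65 vs 0.55 of the original gap, means of the quoted ranges)."""
    gap_scale = 0.65 if style == "conservative" else 0.55
    s_star = s0 + max(0.0, v * T + v * (v - v_lead) / (2 * math.sqrt(a_max * b)))
    s_star *= gap_scale
    return a_max * (1 - (v / v0) ** delta - (s_star / gap) ** 2)
```

For the same situation, the aggressive driver accelerates more, since its desired gap is smaller.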
- Policy optimization:
PPO
, with RLkit.
-
Auxiliary tasks: how to learn both
RL
task and supervised latent state inference?1-
Coupled inference = {shared encoder
} + {shared update
}:shared encoder
= two heads (policy
anddriving style
s predictions) based on some shared encoder.shared update
:-
[Baseline
1
] "Besides the shared encoder, the losses from two tasks are also coupled by a weighted sum and the entire network is trained with the policy optimization optimizer." - In this baseline, the weight of the supervised learning loss is
0.1
.
-
2-
Shared inference = {shared encoder
} + {independent update
}:independent update
:-
[Baseline
2
] "An encoder is shared between thepolicy
network andlatent inference
network, but the two tasks are trained with separate losses and optimizers."
-
3-
Separated inference - minimal coupling (here):-
"In this work, we treat the policy and the latent inference as two separated modules. The
latent inference
network learns the mapping from the historical observations to the latent distribution, i.e.Pφ
(zi.t
|o.1:t
)." -
Benefit
1
: Theprivileged
learning makes theRL
task easier during training. Hence speed up.-
"Feeding the ground truth latent state at exploration time helps the policy find the trajectory leading to the task goal. This is especially important when the task is difficult and the reward is sparse."
-
-
Benefit
2
: It removes the need to weight losses contributions when computing gradients.-
"By using a separate network for each task, we minimize the mutual influence of the
gradients
from different tasks because such mutual influence could be harmful as shown in our experiments." -
"The effect of the feature shaping from the auxiliary task is not always helping, which is likely due to the distraction caused by the multiple loss sources so that the gradient estimates on both tasks are biased. Such distraction is minimized in the separated structure."
-
-
Benefit
3
: Flexibility in network design.- The separation allows different choices for network structures in different modules.
- The experiments show that an
LSTM
network is more suitable for theRL
task, but aGNN
is more suitable for latent inference:-
"The latent inference is more decentralized where the network needs to focus on the local interactions of each surrounding vehicle with its neighbors. Such local interactions have a shared structure which are better captured by the graph representation and convolutions."
-
"On the contrary, the
RL
task is more centralized on the ego vehicle. In this case, theLSTM
structure provides more flexibility on focusing on the ego-relevant features."
-
-
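[My own sketch] The "separated inference" idea in miniature: a latent-inference module with its own supervised loss and optimizer, independent of the policy network. A single logistic unit stands in for the paper's `GNN`; all names and the gap-ratio feature are illustrative placeholders:

```python
import math
import random

class LatentInference:
    """Maps an observed feature to P(aggressive), trained separately
    from the policy with its own supervised cross-entropy loss."""
    def __init__(self):
        self.w, self.b = 0.0, 0.0

    def predict(self, feature):
        return 1.0 / (1.0 + math.exp(-(self.w * feature + self.b)))

    def supervised_step(self, feature, label, lr=0.5):
        grad = self.predict(feature) - label   # d(cross-entropy)/d(logit)
        self.w -= lr * grad * feature
        self.b -= lr * grad

# Training uses privileged labels from the simulator; at test time the
# policy would consume [observation, predicted latent] instead of the truth.
net = LatentInference()
rng = random.Random(0)
for _ in range(200):
    aggressive = rng.random() < 0.5
    # hypothetical feature: observed front-gap ratio (aggressive -> smaller)
    feature = (0.55 if aggressive else 0.65) + rng.gauss(0.0, 0.02)
    net.supervised_step(feature, 1.0 if aggressive else 0.0)
```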
-
How to model
interaction
when encoding driving scenes?- About graph representations:
-
"The shared dependence of different traffic participants can be formalized as a graph with each car represented as a node."
- Graph (
Gt
):nodes
: vehicles in the scene.edges
: direct influence between vehicles-
-
"The drivers are only directly influenced by the closest vehicles in its lane as well as the ego vehicle which is trying to merge. The ego vehicle is consider to be influenced by all vehicles to make the optimal long-term plan."
-
- About graphical nets:
-
"
GNNs
are a type of deep learning model that is directly applied to graph structures. They naturally incorporate relational inductive bias into learning-based models."
-
- About
influence-passing
:-
"As drivers are influenced by their surrounding vehicles, different traffic participants may directly or indirectly impact other drivers. For example, in highway driving, the ego driver is more directly influenced by neighboring vehicles, while vehicles farther away influence the ego driver indirectly through chains of influence that propagate to the neighbors. Such
influence-passing
forms a graph representation of the traffic scenario, where eachnode
represents a traffic participant and eachedge
represents a direct influence. With the recent development of graph neural networks (GNN
), we are able to model such relational information contained in graphs efficiently."
-
-
Which graph convolutional layers to process the spatial-temporal (
ST
) graph?- The authors test different convolutional layers in the latent inference network:
-
"The latent inference accuracy from the
STGSage
is significantly higher than that ofSTGCN
andSTGAT
." - About
STGSage
:-
"We adopt a three layer network architecture similar as
STGAT
to process both the spatial relational information in [the graph]Gt
and the temporal information ino1:t
." -
"We propose
STGSage
, a spatial-temporalGNN
structure, which efficiently learn the spatial interactions as well as the temporal progressions of different traffic participants." -
"By concatenating the self
node
embedding with the aggregated neighbour embedding, theGraphSage
convolution is able to flexibly distinguish the selfnode
embedding from its neighbors in the embedding update."
-
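[My own sketch] The `GraphSage`-style update quoted above, on plain lists: aggregate (mean) the neighbours' embeddings, then concatenate with the self embedding, so the update can distinguish self from neighbourhood. Scalar weights stand in for the learned matrices:

```python
def graphsage_update(embeddings, neighbors, weight_self=0.5, weight_nbr=0.5):
    """One GraphSage-style layer: new embedding = [W_self * self_emb,
    W_nbr * mean(neighbour embeddings)], concatenated."""
    new = {}
    for node, emb in embeddings.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            agg = [sum(embeddings[n][i] for n in nbrs) / len(nbrs)
                   for i in range(len(emb))]
        else:
            agg = [0.0] * len(emb)
        # concatenation keeps self and neighbourhood information separate
        new[node] = [weight_self * x for x in emb] + [weight_nbr * x for x in agg]
    return new
```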
-
Auxiliary inference:
-
Cannot
RL
infer hidden parameters on its own?-
"While most
RL
methods implicitly learn the latent states through returns, we show that learning this explicitly will improve the efficiency and interpretability of the learning." -
"As the divergence between the two
desired gap
distributions is small, it is very difficult for a reinforcement learner to infer the intentions of the driver implicitly."
-
-
Can the primary task also help the auxiliary one?
- Allegedly, yes:
-
"Mutual benefits of the coupling with relational modelling."
-
"Graph nets improve latent inference in the auxiliary task."
- Not clear to me. I thought the
GNN
are used for the auxiliary task only?
-
-
Non model-free-
RL
approaches:- Planning requires a
transition
model.-
"While a
POMDP
solver is able to consider the uncertainty in the intention inference, it is computationally expensive."
-
-
"Morton and Kochenderfer investigates the simultaneous learning of the latent space and the vehicle control through
imitation learning
."
-
-
Robustness to distribution shift: how the trained agent performs when:
1-
... the distribution of the hidden variables varies?- At training time:
P
(CONSERVATIVE
) =0.5
. -
"As
P
(CONSERVATIVE
) becomes smaller, more drivers in the main lane are likely to be aggressive, and the difficulty for making a successful right turn increases."
2-
... thefront gap sample interval
varies?- It indicates the interval length of the uniform distribution where the surrounding driver samples its desired front gap.
- The
mean
of the distribution is kept the same. - At training time:
interval length
=0.3
.
"Trajectory Planning for Autonomous Vehicles Using Hierarchical Reinforcement Learning"
-
[
2020
] [📝] [📝] [🎞️] [🎞️] [ 🎓Hong Kong Polytechnic University
,Carnegie Mellon University
] [ 🚗Argo AI
] -
[
HRL
,PID
,LSTM
]
Click to expand
What I do not understand: the idea of HRL is to enable temporal / action abstraction, i.e. produce a high-level decision such as change-lane at a low-frequency and then rely and wait for a low-level policy to compute the WPs at a higher-frequency. But here, both levels are queried at each time-step? What is the difference with a "flat" net that would directly choose between one of these 6 predefined trajectories, especially if no temporal abstraction is performed? Apart from that, note in the algorithm that the replay buffer is initially filled with experiences collected by running a rule-based controller. Source. |
Source. |
Authors: Ben Naveed, K., Qiao, Z., & Dolan, J. M.
-
Main motivations:
-
1-
Improve convergence rate and yield better performances.- One difficulty of
AD
is to learn a model able to deal with multiple sub-goals, such as different manoeuvres.-
"
HRL
helps divide the task of autonomous vehicle driving intosub-goals
[...] It allows the [low-level] policies learned to be reused for any other scenario." - The idea is to enforce the
Q
-net to produce an intermediatesub-goal
that conditions the rest of the net.
-
- More generally, Hierarchical
RL
reduces the complexity by abstractingactions
.
-
2-
Ensure smoothness and stability (avoid lane invasion).- The idea is to predict some kind of
trajectory
, instead of low-level commands such asthrottle
,steering
andbrake
. - A
PID
controller is then used to implement and track the producedtrajectory
.
-
3-
Ensure robustness wrt. noisy observations.- Some
LSTM
layers are added to both high-leveloptions
and low-leveltrajectory
networks. LSTM
takes longer to train. Therefore, if no noise is added at test time, models withoutLSTM
perform better.
-
Hierarchical
MDP
.-
state
state
=14
parameters describing cars'speeds
,lane-ids
and thegaps
between them, normalized with somesafe distances
.- Both
Q
-nets take a so-calledhistory
as input: a stack of3
state
s.
-
action
:2
x3
=6
possibilities:- Binary high-level
options
: {keep lane
,change lane
}. - For each
option
,3
discrete low-level "trajectory
choices":- For
sub-goal
=keep lane
:- Follow the lane for a "long" distance. At constant speed?
- Follow the lane for a "short" distance. What about the speed? Does that mean it is completed faster than the previous choice?
- Slow down.
- For
sub-goal
=change lane
:- Three
speed profiles
are possible.
- Three
- From the selected
option
andtrajectory
choice, a targetspeed
and a finalwaypoint
are derived and passed to thePID
controller.
- For
- That means that the meaning of the output of the low-level net depends on the
sub-goal
.- What is the difference with a "flat" net that would directly choose between one of these
6
predefined trajectories, especially if no temporal abstraction is performed?
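[My own sketch] The two-level selection (2 `options` x 3 `trajectory` choices) could be written as two nested argmax over the Q-nets; names and tables below are placeholders:

```python
# High-level Q-net picks the option; low-level Q-net, conditioned on that
# sub-goal, picks one of 3 trajectory choices.
OPTIONS = ["keep_lane", "change_lane"]
TRAJ_CHOICES = {
    "keep_lane": ["follow_long", "follow_short", "slow_down"],
    "change_lane": ["fast", "medium", "slow"],
}

def select(history, q_high, q_low):
    option = max(OPTIONS, key=lambda o: q_high(history, o))
    traj = max(TRAJ_CHOICES[option], key=lambda c: q_low(history, option, c))
    # (option, traj) is then mapped to a target speed and a final waypoint
    # that are passed to the PID controller.
    return option, traj
```

Written this way, the question raised above is apparent: without temporal abstraction, this is equivalent to a flat argmax over the 6 predefined trajectories.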
-
reward
- Similar to
"Hierarchical Reinforcement Learning Method for Autonomous Vehicle Behavior Planning"
] by same team. -
"The ego-car gets penalized separately for choosing the wrong high-level
option
or choosing the wrong low-leveltrajectory
choice." high-level
options.- Not detailed? Perhaps the final
success
/failure
and theper-step penalties
.
low-level
trajectory choices are penalized for:- Collisions.
- Allegedly because that means the chosen trajectory does not complete its
sub-goal
. - Ok, but what if the sub-goal is not applicable? E.g. the high-level
option
is tochange lane
, but the target lane is occupied.
- Risk, based on safety distances.
- I see it as a way to anticipate collisions, which helps for credit assignment.
- Inefficiency, i.e. for conservative choices that were not required.
- For
sub-goal
=lane change
: thewait
choice gets penalized if it is selected unnecessarily. - For
sub-goal
=keep lane
: if no obstacle is detected, a longer trajectory should be planned.
- In a way,
action
are penalized if they do not match the "expected"action
.- What is then the difference with rule-based approaches such as
slot
-based methods?
-
About
dt
?-
Carla is run at
30
fps (said in paper) or at10
fps (said in the video).- But at which frequencies decisions are made? Is some kind of consistency enforced over time?
-
Do both level work at the same frequency?
- It would not make any sense since the goal of
HRL
is to offer temporal abstraction by abstractingaction
s. -
"
Q1
generates thesub-goal
g
for the following steps and a controllerQ2
outputs theaction
a based on thesub-goal
selected until the nextsub-goal
is generated by the meta-controller." [Does that mean the high-level policy can interrupt the low-level one? Isn't it rather "until the sub-goal
is considered as completed (success / failed)" instead? I am confused here]
-
Do low-level
action
s have same durations?- I think no. For
sub-goal
=keep lane
, the point is that the low-level can decide to follow the lane for only a "short" distance, i.e. apply the sub-goal but come back quickly to ask for the high-level to make a new decision. - This mechanism seems beneficial in cases of uncertainty or where it is important to be able to react quickly.
-
-
"Safe Trajectory Planning Using Reinforcement Learning for Self Driving"
-
[
2020
] [📝] [📝] [🎞️] [ 🎓Texas University
,Carnegie Mellon University
] -
[
high-level decision making
,action masking
,prediction-based safety checks
]
Click to expand
The road is flattened and divided into slices called junctions. For each junction the agent should decide the changes in speed and in lateral position, compared to the previous junction. From this, a sequence of waypoints can be computed, and a PID controller converts this trajectory into low-level commands. A safety module assumes all obstacles are static and defines collision-free states. If the proposed trajectory always stays close to "safe" states, it is implemented. Otherwise, its projection onto the safe space is chosen (not necessarily feasible). Source. |
Authors: Coad, J., Qiao, Z., & Dolan, J. M.
-
Main motivation:
1-
Applysafety
checks on decisions produced by aRL
agent before applying them to the car.2-
Anticipate. Predict / detectunsafe
situations in advance.- Safety checkers must be perform on a long horizon.
- Using low level controls as
action
and evaluating one (not a sequence) of them would lead to a too "reactive" approach.- Because it may be too late to react.
-
"Revoking the next
steer
orthrottle
if it is deemed that such anaction
would take the car into anunsafe
state
, would not work, since the car is often travelling at speeds that cannot be instantaneously stopped or diverted." -
"In addition,
controls
must be generated at a high and consistent frequency; and thus the system is brittle to any unexpected system latency."
- Predictive checks on high-level level, such as a desired
trajectory
, must be preferred.-
"Because we are planning seconds into the future, we can detect potential traffic violations, kinematic violations and collisions with obstacles far before they happen and account for them."
-
-
Main idea:
X-to-mid
instead ofX-to-end
.- A
trajectory
is proposed - It is then fed to a
PID
controller to generate the vehiclesteering
andthrottle
/break
. -
"By using trajectory planning instead of direct control, we can apply higher level safety systems to restrict the
RL
-generated trajectory if it is seen to take the car into an unsafestate
, such as off the road or into an obstacle."
-
Hierarchical decision making: Should
behavioural planning
andtrajectory planning
be performed together?- Some non-learnt approaches (e.g.
Focus
andHierarchical
) separate them into two steps:1-
Low-fidelity grid search: perform an exhaustive search over all possible discrete paths to choose an optimal behaviouraltrajectory
.-
"They call it a behavioral
trajectory
because thepath
andvelocity
profile chosen are greatly influenced by environmental factors such as obstacles, regulatory elements and road geometry and thus require behavioral choices."
-
2-
Non-convex numerical path optimization to produce the finaltrajectory
.-
"[Issue] We could be constraining ourselves to a local minimum by choosing a behavioral trajectory is only optimal in the discrete space."
- The proposed
RL
agent combine them.- Its forward pass produces trajectory faster than when using an exhaustive search and numerical optimization.
-
"One exhaustive search in the relatively small
action
space we used takes an average of7.9 ms
, while querying for the next trajectory from theRL
policy takes0.87 ms
, makingRL
nearly a magnitude faster."
-
A
trajectory
is a sequence that contains both spatial and temporal information.-
"The goal of
trajectory
planning is to find a sequence ofstates
Q = <q1
,q2
, ...qn
> whereqi
= (x.i
,y.i
,speed.i
)." - How about the "temporal" information? What is the time gap between two consecutive
qi
?- I.e. how much time do I have to transition from (
x1
,y1
,speed1
) to (x2
,y2
,speed2
)? - Is that time gap constant?
- How to represent a "stay stopped" decision?
- Does the trajectory (sequence) have a constant size?
-
-
About the
MDP
.-
Important: The road is "linearised" and divided into cells.
- Its longitudinal discretization results in slices called
layers
.
-
state
(also used by the safety module to definesafe
/unsafe
states
):S
: static occupancy grid.O
:speed
regulatory grid.- Ego-
lateral position
, relative to the middle lane. - Ego-
speed
.
-
Continuous
action
space:- For each "
layer
", two quantities are predicted:1-
Thelateral
change between layerj−1
andj
.2-
Thevelocity
change between layerj−1
andj
.- It says for each layer how the
speed
and thelateral position
will change. A sequence of waypoints can be derived from it. -
"We compare to the discretized exhaustive search method and see that we obtain a smoother behavioral path with less drastic turns. We see that the
RL
results in a smoother and more efficient path."
-
"The
action
space is of size2H
whereH
is the number of layers we are planning through, spaced out by some distanceL
between the layers in curvilinear space."- What is the value of
H
?
- What is the value of
- Again, how to represent an "I am at speed=0 and want to stay there and wait" decision, e.g. at a red traffic light. Is information about the next layers then irrelevant?
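[My own sketch] Integrating the per-layer (`lateral change`, `velocity change`) pairs into a waypoint sequence in curvilinear coordinates; the layer spacing `L` is a placeholder value:

```python
def deltas_to_waypoints(lat0, v0, actions, layer_spacing=10.0):
    """actions: one (d_lateral, d_speed) pair per layer, so an action
    vector of size 2H. Returns waypoints as (s, lateral, speed)."""
    lat, v, s = lat0, v0, 0.0
    waypoints = []
    for d_lat, d_v in actions:
        s += layer_spacing
        lat += d_lat
        v = max(0.0, v + d_v)       # speeds cannot go negative
        waypoints.append((s, lat, v))
    return waypoints
```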
-
reward
- A combination of
speed error
,acceleration
,jerk
,extra distance
,curvature
,lane crossing
,centripetal acceleration
. -
"Additionally, a success is given a
reward
of10
, a failure is given a reward of-20
and each step is given areward
of+1
." [how is a "success" defined?]
-
dt
? What duration between two decisions?- It needs to be able to react.
- But trajectories must be consistent.
-
Performance metrics:
mean step reward
,max acceleration
,max jerk
,mean extra distance
,max curvature
,max centrip. acc.
. -
Which simulator is used for training? Which scenarios?
-
-
About the safety checker.
- First, define a set of
unsafe
states
.- Apparently it is defined here based on the:
1-
The occupancy grid (presence of obstacles): to avoid collisions.2-
The centripetalacceleration
: for comfort and kinematics constraints.
- Apparently it is defined here based on the:
- How to accept / reject the proposed
action
, here atrajectory
?- One could check if none of the (
x
,y
,speed
) of thetrajectory
are colliding in the predicted future occupancy grids. And also ensure comfort and feasibility of thetransitions
. - Here, another approach:
1- Find the safe space.- How do they predict what positions will be "safe" in the future? By assuming a static environment.
2- The trajectory is projected onto the safe space. This forms a second trajectory which is safe (but not necessarily feasible).
3- The residuals, i.e. distances between corresponding points, are measured.
4- If the maximum gap is below some threshold, i.e. if the proposed trajectory always stays close to the safe space, it is accepted.
- What to do if the proposed
trajectory
is rejected?- The second
trajectory
(projection onto thesafe
space) is chosen.- But, as mentioned, this may not be feasible from a kinematics/dynamics point of view?
- Need for a prediction module:
- These forward roll-outs used to evaluate the
risk
rely on predictions: how will the scene evolve. - Limitation: dynamics objects (e.g. cars) are ignored here.
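[My own sketch] The accept / project-and-fallback logic described above, on a 1-D "lateral position" trajectory; `project_safe` and the threshold are placeholders:

```python
def check_and_project(traj, project_safe, max_residual=1.0):
    """Project each trajectory point onto the safe space; accept the
    original trajectory iff the largest residual stays below a threshold,
    otherwise fall back to the (possibly infeasible) projected one."""
    projected = [project_safe(p) for p in traj]
    residuals = [abs(p - q) for p, q in zip(traj, projected)]
    if max(residuals) < max_residual:
        return traj, True
    return projected, False
```

For instance, with a safe space defined as the corridor [-2, 2], the projection is a simple clamp.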
-
Interesting: Which low-level
action
to limit oscillatory behaviours?-
"In
Driving in dense traffic with model-free reinforcement learning
, the authors use anRL
agent to control the derivative ofacceleration
andsteer
, which isjerk
andsteering rate
. They claim this increases the comfort of the ride by reducing oscillatory and jerky behavior."
-
"Combining reinforcement learning with rule-based controllers for transparent and general decision-making in autonomous driving"
-
[
2020
] [📝] [ 🎓Politecnico di Milano
] -
[
parametrized model
,scene decomposition
,rule-based
]
Click to expand
The thresholds of a non-differentiable rule-based policy are optimized using RL . Hence the name policy gradient with parameter-based exploration . Somewhat similar to decision trees. How to constrain θ2 < θ3 for instance? I do not know. Bottom-right: scene decomposition to handle a varying number of surrounding cars. Source. |
Authors: Likmeta, A., Metelli, A. M., Tirinzoni, A., Giol, R., Restelli, M., & Romano, D.
-
Main motivation:
- 1- Keep the interpretability of handcrafted rule-based controllers ...
- 2- ... but avoid the biased and difficult manual tuning of their parameters.
-
Main idea:
- Instead of manually tuned, parameters of the rule-based controller are learnt.
- The objective of the learning is to maximize the
expected return
, produced by the rule-based policy.
- The objective of the learning is to maximize the
- Impose the structure of the policy, leaving some of its parameters, such as threshold values, as degrees of freedom.
-
"Exploration is moved at the
parameter
level".
-
- Different parameters are sampled from a differentiable function, for instance a Gaussian distribution.
- A search is performed based on
reward
signals.
- A search is performed based on
- Example: learn the scalar parameter `θ` for the rule "if `f(x)` > `θ` then take `action-1`, else take `action-2`", where `f`
can be non-differentiable. - Note that the choice of the parameters is independent of the input
state
!
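A toy sketch of exploration at the parameter level (a naive hill-climbing stand-in for the actual policy-gradient-with-parameter-based-exploration update; the rule, the return, and the Gaussian parameters are all made up):

```python
import random

def rule_policy(theta, x):
    """Non-differentiable rule: if f(x) > theta take action-1, else action-2."""
    return "action-1" if x > theta else "action-2"

def episode_return(theta):
    """Toy return: counts how often the rule agrees with a target threshold of 0.5."""
    grid = [i / 10 for i in range(11)]
    return sum(rule_policy(theta, x) == rule_policy(0.5, x) for x in grid)

def tune(mu=0.0, sigma=0.3, iters=200, seed=0):
    """Sample the parameter from a differentiable Gaussian, evaluate the
    induced rule-based policy, and move the distribution towards good samples.
    Exploration happens in parameter space, not in action space."""
    rng = random.Random(seed)
    best_theta, best_ret = mu, episode_return(mu)
    for _ in range(iters):
        theta = rng.gauss(mu, sigma)
        ret = episode_return(theta)
        if ret > best_ret:
            best_theta, best_ret = theta, ret
            mu = theta  # shift the sampling distribution towards good parameters
    return best_theta, best_ret

theta, ret = tune()
```

No gradient of the rule itself is ever needed, which is the whole point: `f` and the threshold comparison can stay non-differentiable.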
- Instead of manually tuned, parameters of the rule-based controller are learnt.
-
About
safety
.- A safety checker indicates if a (
state
,action
) pair is dangerous or not: - "Can we avoid a crash if you chose this
action
while the leader brakes during the next time step?" - This is used to mask dangerous
actions
and providereward
signals to the parameter tuner.
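The check can be sketched with a constant-deceleration worst case (the braking limit, time step, and kinematics here are assumptions for illustration, not the paper's exact model):

```python
def is_safe(gap, v_ego, v_lead, a_ego, dt=1.0, a_min=-8.0):
    """True if ego can still avoid a crash when the leader brakes at full
    deceleration during the next time step, while ego first applies the
    candidate acceleration a_ego and then brakes fully itself."""
    v_ego2 = max(0.0, v_ego + a_ego * dt)
    v_lead2 = max(0.0, v_lead + a_min * dt)
    # Gap after one step (trapezoidal integration of both speeds).
    gap2 = gap + (v_lead + v_lead2) / 2 * dt - (v_ego + v_ego2) / 2 * dt
    if gap2 <= 0:
        return False
    # Afterwards both brake fully: compare stopping distances.
    d_ego = v_ego2 ** 2 / (2 * abs(a_min))
    d_lead = v_lead2 ** 2 / (2 * abs(a_min))
    return gap2 + d_lead - d_ego > 0

def mask_actions(candidate_accels, gap, v_ego, v_lead):
    """Remove dangerous accelerations before the learnt policy picks one."""
    return [a for a in candidate_accels if is_safe(gap, v_ego, v_lead, a)]
```

The same boolean can be reused as a penalty signal for the parameter tuner, as described above.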
-
How to be independent of the number of surrounding vehicles and their permutations?
- Canonical scene decomposition:
0-
The controller is trained to cope with one vehicle.1-
For each vehicle, compute theaction
.2-
Aggregate, here select the most conservativeaction
, i.e. the lowestacceleration
.
- The time complexity is linear in the number of vehicles.
- I guess it would be relevant to consider the closest vehicles first.
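The decomposition can be sketched as follows (the single-vehicle controller below is a hypothetical stand-in; in the paper it is the learnt rule-based policy):

```python
def single_vehicle_policy(ego, other):
    """Stand-in for a controller trained to cope with ONE surrounding vehicle:
    brake harder the closer the other vehicle is (illustrative only)."""
    gap = other["position"] - ego["position"]
    return min(2.0, max(-8.0, 0.5 * (gap - 20.0)))

def decomposed_policy(ego, others):
    """Canonical scene decomposition: run the single-vehicle controller once
    per surrounding vehicle, then keep the most conservative action
    (the lowest acceleration). Linear in the number of vehicles, and
    permutation-invariant by construction."""
    if not others:
        return 2.0  # free-driving acceleration
    return min(single_vehicle_policy(ego, o) for o in others)

ego = {"position": 0.0}
others = [{"position": 50.0}, {"position": 22.0}, {"position": 90.0}]
accel = decomposed_policy(ego, others)
```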
"Deep Surrogate Q-Learning for Autonomous Driving"
-
[
2020
] [📝] [ 🎓University of Freiburg
] [ 🚗BMW
] -
[
offline RL
,sampling efficiency
]
Click to expand
The goal is to reduce the required driving time to collect a certain number of lane-changes . The idea is to estimate the q-values not only for the ego car, but also for the surrounding ones. Assuming it is possible to infer which action each car was performing, this results in many (s , a , r , s' ) transitions collected at each time step. The replay buffer hence can be filled way faster compared to considering only ego-transitions. Source. |
From another paper: Illustration of the difference between online (on-policy ), off-policy and offline RL . The latter resembles supervised learning, where data collection is easier. It therefore holds huge potential for applications where interactions with the environment are expensive or dangerous. Source. |
Authors: Huegle, M., Kalweit, G., Werling, M., & Boedecker, J.
-
Main motivation:
- Interaction efficiency: limit the costly interactions with the environment.
- The goal is to reduce the required
driving time
to collect a certain number of lane-changes
- Application: learn to make high-level decisions on the highway.
- Interaction efficiency: limit the costly interactions with the environment.
-
Main idea:
- Leverage the
transitions
of surrounding observed cars, instead of only considering the ego-agent'stransitions
.- Hence not only
off-policy
: transitions are collected with policies possibly different from the current one. - But also
offline RL
: forget aboutexploration
andinteractions
and utilize previously collected offline data.
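The harvesting of surrogate transitions can be sketched as follows (the state/action extraction is a made-up placeholder; in the paper, actions of observed cars must be detected or inferred, e.g. from their lane changes):

```python
def harvest_transitions(snapshots, reward_fn):
    """Collect one (s, a, r, s') tuple per OBSERVED car and time step,
    not only for the ego vehicle: every surrounding driver acts as a
    'surrogate' agent. `snapshots` is a list of {car_id: (state, action)}."""
    buffer = []
    for t in range(len(snapshots) - 1):
        now, nxt = snapshots[t], snapshots[t + 1]
        for car_id, (state, action) in now.items():
            if car_id not in nxt:
                continue  # the car left the observable area
            next_state = nxt[car_id][0]
            # Only the EGO reward function is used, for everyone's transitions.
            buffer.append((state, action, reward_fn(state, action, next_state), next_state))
    return buffer

# Two consecutive snapshots with three observed cars:
snaps = [
    {"ego": (0, "keep"), "a": (1, "left"), "b": (2, "keep")},
    {"ego": (1, "keep"), "a": (2, "keep")},  # car "b" drove out of sensor range
]
buf = harvest_transitions(snaps, reward_fn=lambda s, a, s2: 1.0)
```

One camera pass over a busy highway thus yields many transitions per time step instead of one.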
-
Main hypotheses:
- The agent can detect or infer
actions
of its surrounding cars. - All cars are assumed to have the same action space as the agent or
actions
that are mappable to the agent'saction
space. - No assumption about the
reward function
of other cars. Here, only the egoreward function
is used to evaluate the (s
,a
,s'
) transitions. -
"This is in contrast to the multi-agent
RL
setting, where the long-termreturn
of all agents involved is optimized together."
-
action
space and decisionperiod
.- About
keep lane
,left lane-change
, andright lane-change
. - Step size of
2s
.- Hence not about
safety
. Rather about efficiency and long-term planning capability.
"Collision avoidance and maintaining safe distance in longitudinal direction are controlled by an integrated
safety
module." -
"
acceleration
is handled by a low-level execution layer with model-based control of acceleration to guarantee comfort andsafety
."
-
"Additionally, we filter a time span of
5s
before and after all lane changes in the dataset with a step size of2s
, leading to a consecutive chain of5
time-steps."
-
About
offline
RL
:- Recordings are easier:
-
[static sensors] "This drastically simplifies data-collection, since such a
transition
set could be collected by setting up a camera on top of a bridge above a highway." -
[sensors on a driving test vehicle] "In this setting, a policy can even be learned without performing any lane-changes with the test vehicle itself."
-
- Promising for
sim-to-real
:-
"If it was possible to simply train policies with previously collected data, it would likely be unnecessary in many cases to manually design high-fidelity simulators for simulation-to-real-world transfer.
-
- What about
exploration
,distributional shift
andoptimality
?
-
Why "surrogate"?
- Because other drivers are used as surrogates, i.e. they replace the ego-agent to learn the ego-agent's
value function
.
-
How to deal with a variable-sized input vehicles list and be permutation-independent?
- Idea of
PointNet
, here calledDeepSet
: use an aggregator, for instancepooling
.
-
How to estimate multiple
action-values
at once?- As opposed to multiple `individual forward-passes per sample.
- In parallel (if the hardware enables it!).
- Extend the `DeepSet`-like architecture which already computes `Ψ`(`scene[t]`), the representation of the scene. This `Ψ` is needed here to estimate each `q-value`.
-
How to estimate the
q-value
of a (s
,a
) pair for a given (e.g. ego) agent?1-
The car-state
is concatenated with the representation of the sceneΨ
.2-
This then goes through a `Q` module, here a fully-connected net.
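A toy numerical sketch of this two-step estimate (hand-set linear maps stand in for the trained `φ` encoder and `Q` head; the dimensions and weights are invented):

```python
def phi(vehicle_state):
    """Per-vehicle encoder (toy: a fixed linear map)."""
    return [2.0 * vehicle_state[0], vehicle_state[0] + vehicle_state[1]]

def scene_representation(vehicle_states):
    """Psi(scene[t]): encode each vehicle, then sum-pool.
    The sum makes the representation permutation-invariant."""
    pooled = [0.0, 0.0]
    for s in vehicle_states:
        e = phi(s)
        pooled[0] += e[0]
        pooled[1] += e[1]
    return pooled

def q_values(car_state, psi, n_actions=3):
    """1- concatenate the car-state with Psi, 2- pass the result through a
    Q head (toy: one linear map per action)."""
    x = list(car_state) + psi
    return [sum((a + 1) * 0.1 * xi for xi in x) for a in range(n_actions)]

scene = [(1.0, 0.0), (0.0, 1.0)]
psi = scene_representation(scene)
q_ego = q_values((0.5, 0.5), psi)  # same psi is reused for every car's q-values
```

Because `psi` is shared, estimating the q-values of all surrounding cars costs only one extra concatenation and head pass per car.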
-
- How to sample experiences from the replay buffer?
- Uniform sampling is an option.
- But as for Prioritized Experience Replay for
DQN
, one may prefer to sample "interesting" cases. - But one should be careful with the induced distribution (c.f. importance sampling in
PER
)
- Here: the more complex the scene, the more
q-values
are estimated, the more efficient the sampling.-
"We propose a sample distribution dependent on the complexity of the scene, i.e. the number of vehicles."
- Therefore called Scene-centric Experience Replay (
SCER
).
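The scene-dependent sampling distribution can be sketched as follows (weighting directly proportional to the vehicle count is an assumption for illustration; the paper's exact distribution may differ):

```python
import random

def scene_sampling_weights(scenes):
    """Sampling probability proportional to scene complexity,
    measured here by the number of vehicles in the scene."""
    counts = [len(s["vehicles"]) for s in scenes]
    total = sum(counts)
    return [c / total for c in counts]

scenes = [
    {"vehicles": ["ego"]},
    {"vehicles": ["ego", "a", "b", "c"]},
]
weights = scene_sampling_weights(scenes)
rng = random.Random(0)
sample = rng.choices(scenes, weights=weights, k=1000)
```

Complex scenes, which yield more q-value predictions per forward pass, are drawn more often.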
-
- About the resulting distribution.
-
"
Scene-centric Experience Replay
via a permutation-equivariant architecture leads to a more consistent gradient, since theTD
-errors are normalized w.r.t. all predictions for the different positions in the scene while keeping thei.i.d.
assumption of stochastic gradient descent by sampling uniformly from the replay buffer."
-
"Deep Reinforcement Learning and Transportation Research: A Comprehensive Review"
-
[
2020
] [📝] [ 🎓University of Illinois
] -
[
review
]
Click to expand
Most approaches are used for low-level control in specific tasks. Instead of predicting the actuator commands such as throttle , most are predicting the acceleration which should be easier to transfer from the simulator to a real car that has different dynamics and capabilities. Source. |
Authors: Farazi, N. P., Ahamed, T., Barua, L., & Zou, B.
- In short:
- Review of some of the latest papers on model-free
RL
for several applications such as autonomous driving and traffic management e.g.signal
androute
control. - No mention of the
safety
limitations, which is one of the major problems of model-free methods.
- Review of some of the latest papers on model-free
"MIDAS: Multi-agent Interaction-aware Decision-making with Adaptive Strategies for Urban Autonomous Navigation"
-
[
2020
] [📝] [ 🎓University of Pennsylvania
] [ 🚗Nuro
] -
[
attention
,parametrized driver-type
]
Click to expand
Top: scenarios are generated to require interaction-aware decisions from the ego agent. Bottom: A driver-type parameter is introduced to learn a single policy that works across different planning objectives. It represents the driving style such as the level of aggressiveness . It affects the terms in the reward function in an affine way. Source. |
The conditional parameter (ego’s driver-type ) should only affect the encoding of its own state , not the state of the other agents. Therefore it is injected after the observation encoder. MIDAS is compared to DeepSet and “Social Attention”. SAB stands for ''set-attention block'', ISAB for ''induced SAB'' and PMA for ''pooling by multi-head attention''. Source. |
Authors: Chen, X., & Chaudhari, P.
-
Motivations:
1-
Derive an adaptive ego policy.- For instance, conditioned on a parameter that represents the driving style such as the level of
aggressiveness
. -
"
MIDAS
includes adriver-type
parameter to learn a single policy that works across different planning objectives."
2-
Handle an arbitrary number of other agents, with a permutation-invariant input representation.-
[
MIDAS
uses anattention
-mechanism] "The ability to pay attention to only the part of theobservation
vector that matters for control irrespective of the number of other agents in the vicinity."
-
3-
Decision-making should be interaction-aware.-
"A typical
planning
algorithm would predict the forward motion of the other cars and ego would stop until it is deemed safe and legal to proceed. While this is reasonable, it leads to overly conservative plans because it does not explicitly model the mutual influence of the actions of interacting agents." - The goal here is to obtain a policy more optimistic than a worst-case assumption via the tuning of the
driver-type
.
-
-
How to train a user-tuneable adaptive policy?
-
"Each agent possesses a real-valued parameter
βk
∈ [−1
,1
] that models its “driver-type”. A large value ofβk
indicates an aggressive agent and a small value ofβk
indicates that the agent is inclined to wait for others around it before making progress." β
is not observable to others and is used to determine the agent’s velocity asv = 2.7β + 8.3
.- This affine form
wβ+b
is also used in all sub-rewards of thereward
function:1-
Time-penalty for every timestep.2-
Reward for non-zero speed.3-
Timeout penalty that discourages ego from stopping the traffic flow.-
"This includes a stalemate penalty where all nearby agents including ego are standstill waiting for one of them to take initiative and break the tie."
-
4-
Collision penalty.5-
A penalty for following too close to the agent in front.
      - This one does not depend on `β`.
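The affine driver-type dependence can be sketched as follows (the weights `w`, `b` and the magnitudes are invented; only the `v = 2.7β + 8.3` mapping and the structure of terms 1-, 2-, 4-, 5- come from the paper, and the timeout term 3- is omitted for brevity):

```python
def target_speed(beta):
    """Driver-type beta in [-1, 1] sets the nominal speed: v = 2.7*beta + 8.3."""
    return 2.7 * beta + 8.3

def reward(beta, speed, collided, too_close, w=0.5, b=1.0):
    """Sketch of the affine dependence w*beta + b shared by the sub-rewards;
    only the close-following penalty (5-) is independent of beta."""
    scale = w * beta + b
    r = -0.1 * scale                # 1- time penalty at every timestep
    r += 0.1 * scale * (speed > 0)  # 2- reward for non-zero speed
    if collided:
        r -= 10.0 * scale           # 4- collision penalty
    if too_close:
        r -= 1.0                    # 5- does not depend on beta
    return r
```

A single policy conditioned on `β` then covers the whole family of planning objectives.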
- The goal is not to explicitly infer
β
fromobservation
s, as it is done by thebelief tracker
of somePOMDP
solvers. - Where to inject
β
?-
"We want ego’s
driver-type
information to only affect the encoding of its ownstate
, not the state of the other agents." -
"We use a two-layer perceptron with ReLU nonlinearities to embed the scalar variable
β
and add the output to the encoding of ego’sstate
." [Does that mean that the single scalar `β` goes alone through two FC layers?]
-
- To derive adjustable drivers, one could use
counterfactual reasoning
withk-levels
for instance.-
"It however uses self-play to train the policy and while this approach is reasonable for highway merging, the competition is likely to result in high collision rates in busy urban intersections such as ours."
-
-
-
About the
attention
-based architecture.- The permutation-invariance and size independence can be achieved by combining a
sum
followed by someaggregation
operator such asaverage pooling
.- The authors show the limitation of using a
sum
. Instead of preferring amax pooling
, the authors suggest usingattention
: -
"Observe however that the summation assigns the same weight to all elements in the input. As we see in our experiments, a value function using this
DeepSet
architecture is likely to be distracted by agents that do not inform the optimal action."
-
"An
attention
module is an elegant way for thevalue function
to learnkey
,query
,value
embeddings that pay moreattention
to parts of the input that are more relevant to the output (for decision making)." Set transformer
.-
"The set-transformer in
MIDAS
is an easy, automatic way to encode variable-sizedobservation
vectors. In this sense, our work is closest to “Social Attention”, (Leurent & Mercat, 2019), which learns to influence other agents based on road priority and demonstrates results on a limited set of road geometries."
-
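The difference between sum-pooling and attention-pooling can be sketched with a single-query softmax attention (far simpler than the paper's set transformer; the query and encodings are invented):

```python
import math

def sum_pool(encodings):
    """DeepSet-style aggregation: every agent gets the same weight."""
    return [sum(e[i] for e in encodings) for i in range(len(encodings[0]))]

def attention_pool(query, encodings):
    """Attention aggregation: weights depend on relevance (dot-product
    with a query), so distracting agents can be down-weighted."""
    scores = [sum(q * k for q, k in zip(query, e)) for e in encodings]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [x / z for x in exps]
    return [sum(w * e[i] for w, e in zip(weights, encodings))
            for i in range(len(encodings[0]))]

relevant = [1.0, 0.0]
distractor = [0.0, 1.0]
pooled = attention_pool(query=[5.0, 0.0], encodings=[relevant, distractor])
```

Both aggregations are permutation-invariant, but only the attention version can ignore agents that do not inform the optimal action.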
-
observation
space (not very clear).-
"Their observation vector contains the locations of all agents within a Euclidean distance of
10m
and is created in an ego-centric coordinate frame." - The list of
waypoints
to reach agents-specific goal locations is allegedly also part of theobservation
. - No orientation? No previous poses?
-
-
Binary
action
.-
"Control
actions
of all agents areukt
∈ {0
,1
} which correspond tostop
andgo
respectively." - No information about the
transition
/dynamics
model.
-
-
Tricks for off-policy RL, here
DQN
.-
"The
TD2
objective can be zero even if the value function is not accurate because the Bellman operator is only a contraction in theL∞
norm, not theL2
norm." - Together with
double DQN
, andduelling DQN
, the authors proposed two variants:1-
Thenet 1
(local) selects the action withargmax
fornet 2
(target
ortime-lagged
, whose weights are copied fromnet 1
at fixed periods) and vice-versa.-
"This forces the first copy, via its time-lagged parameters to be the evaluator for the second copy and vice-versa; it leads to further variance reduction of the target in the
TD
objective."
-
2-
Duringaction selection
,argmax
is applied on the average of theq
-values of both networks.
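A sketch of the two variants with toy q-vectors standing in for the two network copies (`net 1` local, `net 2` time-lagged); the numbers are invented:

```python
def argmax(qs):
    return max(range(len(qs)), key=lambda a: qs[a])

def td_target(q1_next, q2_next, reward, gamma=0.99):
    """Variant 1: net 1 selects the argmax action, the time-lagged net 2
    evaluates it (and vice-versa), reducing the variance of the target."""
    a_star = argmax(q1_next)
    return reward + gamma * q2_next[a_star]

def select_action(q1, q2):
    """Variant 2: at action-selection time, argmax over the AVERAGE
    of the two copies' q-values."""
    avg = [(x + y) / 2 for x, y in zip(q1, q2)]
    return argmax(avg)

# Toy next-state q-values for 3 actions:
q1_next, q2_next = [1.0, 3.0, 2.0], [0.5, 1.0, 4.0]
target = td_target(q1_next, q2_next, reward=1.0)
action = select_action(q1_next, q2_next)
```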
-
-
Evaluation.
- Non-ego agents drive using an Oracle policy that has full access to trajectories of nearby agents. Here it is rule-based (
TTC
- Scenarios are diverse for training (no `collision` scenario during testing. [Why?]):
  - `generic`: Initial and goal locations are uniformly sampled.
  - `collision`: The ego will collide with at least one other agent in the future if it does not stop at an appropriate timestep.
  - `interaction`: At least `2` other agents will arrive at a location simultaneously with the ego car.
"Ego cannot do well in
interaction
episodes unless it negotiates with other agents." -
"We randomize over the number of agents,
driver-types
, agent IDs, road geometries, and add small perturbations to their arrival time to construct1917
interaction
episodes." -
"Curating the dataset in this fashion aids the reproducibility of results compared to using random seeds to initialize the environment."
-
-
"[Robustness] At test time, we add
Bernoulli
noise of probability0.1
to theactions
of other agents to model the fact that driving policies of other agents may be different from each other." - Performance in the simulator is evaluated based on:
1-
Thetime-to-finish
which is the average episode length.2-
Thecollision-
,timeout-
andsuccess rate
which refer to the percentage of episodes that end with the corresponding status (the three add up to1
).
-
"To qualitatively compare performance, we prioritize
collision rate
(an indicator forsafety
) over thetimeout rate
andtime-to-finish
(which indicateefficiency
). Performance of theOracle planner
is reported over4
trials. Performance of the trained policy is reported across4
random seeds."
"An end-to-end learning of driving strategies based on DDPG and imitation learning"
Click to expand
A small set of collected expert demonstrations is used to train an IL agent while pre-training the DDPG (offline RL ). Then, the IL agent is used to generate new experiences, stored in the M1 buffer. The RL agent is then trained online ('self-learning') and the generated experiences are stored in M2 . During this training phase, the sampling from M1 is progressively reduced. The decay of the sampling ratio is automated based on the return . Source. |
Authors: Zou, Q., Xiong, K., & Hou, Y.
- I must say the figures are low quality and some sections are not well-written. But I think the main idea is interesting to report here.
- Main inspiration:
Deep Q-learning from Demonstrations
(Hester et al. 2017) atDeepMind
.-
"We present an algorithm, Deep Q-learning from Demonstrations (
DQfD
), that leverages small sets of demonstration data to massively accelerate the learning process even from relatively small amounts of demonstration data and is able to automatically assess the necessary ratio of demonstration data while learning thanks to a prioritized replay mechanism."
-
- Motivations:
- Improve the training efficiency of model-free off-policy
RL
algorithms.-
[issue with
DDPG
] "The reason whyRL
converges slowly is that it requires constant exploration from the environment. The data in the early experience pool (s
,a
,r
,s'
) are all low-reward data, and the algorithm cannot always be trained with useful data, resulting in a slow convergence rate." -
[issue with
behavioural cloning
] "The performance of pureIL
depends on the quality and richness of expert data, and cannot be self-improved."
-
-
[main idea] "Demonstration data generated by
IL
are used to accelerate the learning speed ofDDPG
, and then the algorithm is further enhanced through self-learning."
- Using
2
experience replay buffers.1-
Collect a small amount of expert demonstrations.-
"We collect about
2000
sets of artificial data from theTORCS
simulator as expert data and use them to train a simpleIL
network. Because there is less expert data,Dagger
(Dataset Aggregation) is introduced in the training method, so that the network can get the best results with the least data."
-
2-1.
Generate demonstration data using theIL
agent. And store them in theexpert pool
.2-2.
Meanwhile, pre-train theDDPG
algorithm on the demonstration data. It isoffline
, i.e. the agent trains solely without any interaction with the environment.3-
After the pre-training is completed, theDDPG
algorithm starts exploration. It stores the exploration data in theordinary pool
.- Experiences are sampled from the
2
pools. -
"We record the maximum value of total reward
E
whenIL
produced demonstration data for each round, and use it as the threshold to adjust thesampling ratio
[..] when the algorithm reaches an approximate expert level, the value of theexpert/ordinary
ratio will gradually decrease."
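The two-pool sampling with a decaying expert ratio can be sketched as follows (the multiplicative decay rule is a simple illustration; the paper automates the adjustment from the recorded return threshold `E`):

```python
import random

def sample_batch(expert_pool, ordinary_pool, expert_ratio, batch_size, rng):
    """Draw a batch mixing expert (IL-generated) and ordinary (exploration)
    transitions according to the current expert/ordinary ratio."""
    n_expert = round(batch_size * expert_ratio)
    batch = [rng.choice(expert_pool) for _ in range(n_expert)]
    batch += [rng.choice(ordinary_pool) for _ in range(batch_size - n_expert)]
    return batch

def update_ratio(expert_ratio, episode_return, threshold_E, decay=0.9):
    """Once the agent reaches the (approximate) expert-level return E,
    progressively decrease the share of expert samples."""
    return expert_ratio * decay if episode_return >= threshold_E else expert_ratio

rng = random.Random(0)
ratio = 0.8
batch = sample_batch(["e1", "e2"], ["o1", "o2"], ratio, batch_size=10, rng=rng)
ratio = update_ratio(ratio, episode_return=120.0, threshold_E=100.0)
```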
- Results
-
[training time] "Experimental results show that the
DDPG-IL
algorithm is3
times faster than the ordinaryDDPG
and2
times faster than the single experience poolDDPG
based on demonstration data." - In addition,
DDPG-IL
shows the best performance.
-
"High-Speed Autonomous Drifting with Deep Reinforcement Learning"
-
[
generalization
,action smoothing
,ablation
,SAC
,CARLA
]
Click to expand
The goal is to control the vehicle to follow a trajectory at high speed (>80 km/h ) and drift through manifold corners with large side slip angles (>20° ), like a professional racing driver. The slip angle β is the angle between the direction of the heading and the direction of speed vector. The desired heading angle is determined by the vector field guidance (VFG ): it is close to the direction of the reference trajectory when the lateral error is small. Source. |
Top-left: The generalization capability is tested by using cars with different kinematics and dynamics. Top-right: An action smoothing strategy is adopted for stable control outputs. Bottom: the reward function penalizes deviations from reference states in terms of distance , direction , and slip angle . The speed factor v is used to stimulate the vehicle to drive fast: if v is smaller than 6 m/s , the total reward is halved as a punishment. Source. |
The proposed SAC -based approach as well as the three baselines can follow the reference trajectory. However, SAC achieves a much higher average velocity (80 km/h ) than the baselines. In addition, it is shown that the action smoothing strategy can improve the final performance by comparing SAC-WOS and SAC . Despite the action smoothing , the steering angles of DDPG are also shaky. Have a look at the steering gauges! Source. |
Authors: Cai, P., Mei, X., Tai, L., Sun, Y., & Liu, M.
-
Motivations:
1-
Learning-based.- The car dynamics during transient drift (high
speed
>80km/h
and `slip angle`
>20°
as opposed to steady-state drift) is too hard to model precisely. The authors claim it should rather be addressed by model-free learning methods.
2-
Generalization.- The drift controller should generalize well on various
road structures
,tire friction
andvehicle types
.
-
MDP
formulation:state
(42
-dimensional):- [Close to
imitation learning
] It is called "error-basedstate
" since it describes deviations from the referenced drift trajectories (performed by an experienced driver with a Logitech G920
). - These deviations relate to the
location
,heading angle
,velocity
andslip angle
.-
[Similar to
D
inPID
] "Time derivatives of the error variables, such asd(ey)/d(t)
, are included to provide temporal information to the controller."
-
- The
state
also contains the laststeering
andthrottle
commands. Probably enabling consistent action selection in theaction smoothing
mechanism.
action
:- The
steering
is limited to a smaller range of [−0.8
,0.8
] instead of [−1
,1
] to prevent rollover. -
"Since the vehicle is expected to drive at high speed, we further limit the range of the
throttle
to [0.6
,1
] to prevent slow driving and improve training efficiency."
- The
action
smoothing against shaky control output.-
"We impose continuity in the `action`, by constraining the change of output with the deployed action in the previous step: `a[t] = K1 * a_net[t] + K2 * a[t-1]`."
-
-
Algorithms:
1-
DQN
can only handle the discreteaction
space: it selects among5
*10
=50
combinations, without theaction smoothing
strategy.2-
DDPG
is difficult to converge due to the limited exploration ability caused by its deterministic character.3-
[Proposed] Soft actor-critic (SAC
) offers a better convergence ability while avoiding the high sample complexity: Instead of only seeking to maximize the lifetime rewards,SAC
seeks to also maximize the entropy of the policy (as a regularizer). This encourages exploration: the policy should act as randomly as possible [encourage uniform action probability] while being able to succeed at the task.4-
TheSAC-WOS
baseline does not have anyaction smoothing
mechanism (a[t]
=K1
.a_net[t]
+K2
.a[t-1]
) and suffers from shaky behaviours.
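The `action smoothing` that separates `SAC` from `SAC-WOS` can be sketched as a first-order low-pass filter (the `K1`, `K2` values below are assumptions, not the paper's):

```python
def smooth(a_net, a_prev, k1=0.3, k2=0.7):
    """a[t] = K1 * a_net[t] + K2 * a[t-1]: blend the fresh network output
    with the previously deployed action (K1, K2 chosen here for illustration)."""
    return k1 * a_net + k2 * a_prev

# A maximally shaky network output gets low-pass filtered:
raw = [1.0, -1.0, 1.0, -1.0]
a, deployed = 0.0, []
for a_net in raw:
    a = smooth(a_net, a)
    deployed.append(a)
```

Keeping the last `steering` and `throttle` commands in the `state` (as noted above) lets the policy anticipate this filtering.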
-
Curriculum learning:
-
"Map (
a
) is relatively simple and is used for the first-stage training, in which the vehicle learns some basic driving skills such as speeding up by applying large values of throttle and drifting through some simple corners. Maps (b
-f
) have different levels of difficulty with diverse corner shapes, which are used for further training with the pre-trained weights from map (a
). The vehicle can use the knowledge learned from map (a
) and quickly adapt to these tougher maps, to learn a more advanced drift technique." - Is the replay buffer built from training with map (
a
) reused? What about the weighting parameterα
of the entropy termH
in the objective function?
-
-
Robust
training
for generalization:- The idea is to expose different
road structures
andcar models
during training, i.e. make theMDP
environmentstochastic
. -
"At the start of each episode, the
tire friction
andvehicle mass
are sampled from the range of [3.0
,4.0
] and [1.7t
,1.9t
] respectively." -
[Evaluation.] "Note that for each kind of vehicle, the referenced drift trajectories are different in order to meet the respective physical dynamics." [How are they adjusted?]
- Benefit of
SAC
:-
[from
BAIR
blog] "Due to entropy maximization at training time, the policy can readily generalize to these perturbations without any additional learning."
-
-
Ablation study for the
state
space.-
"Can we also provide less information during the training and achieve no degradation in the final performance?"
- Findings:
1-
Necessity ofslip angle
information (instate
andreward
) duringtraining
.-
[Need for
supervision
/expert demonstration
] "Generally, accurateslip angles
from expert drift trajectories are indeed necessary in the training stage, which can improve the final performance and the training efficiency."
-
2-
Non-degraded performance with a rough and easy-to-access reference trajectory duringtesting
. Making it less dependent on expert demonstrations.
-
"Trajectory based lateral control: A Reinforcement Learning case study"
-
[
generalization
,sim2real
,ablation
,DDPG
,IPG CarMaker
]
Click to expand
The task is to predict the steering commands to follow the given trajectory on a race track. Instead of a sum , a product of three deviation terms is proposed for the multi-objective reward . Not clear to me: which WP is considered for the deviation computation in the reward ? Source. |
Testing a vehicle with different drive dynamics on an unseen track. Source. |
Authors: Wasala, A., Byrne, D., Miesbauer, P., O’Hanlon, J., Heraty, P., & Barry, P.
-
Motivations:
1-
Generalization in simulation.- The learnt agent should be able to complete at
test
-time unseen and more complex tracks using different unseen vehicle models. -
"How can RL agents be trained to handle diverse scenarios and adapt to different car models?"
- The learnt policy is tested on vehicles with different
dimensions
,weights
,power-trains
,chassis
,transmission
andaerodynamics
.
2-
sim-to-real
transfer.-
"Is training solely in a simulated environment usable in a real car without any additional training / fine-tuning on the car itself?"
-
"[Other works] Those that did implement live vehicle testing, required further tuning/training in order to make the jump from simulation and were constrained to low speeds between
10km∕h
and40km∕h
."
-
3-
Ablation study.-
"What are the most useful (essential)
state
–space parameters?"
-
-
Task: predict the
steering
commands to follow the given trajectory on a race track.-
"[Need for learning-based approach.] These traditional controllers [
MPC
,PID
] require large amounts of tuning in order to perform at an acceptable level for safe autonomous driving." - Path-velocity decomposition:
- The longitudinal control (
acceleration
andbraking
) is handled by some "baseline driver". -
"Our experiments show that trying to control all three actuation signals with a single network makes it very difficult for the agent to converge on an optimal policy."
-
"[The decomposition] helped to improve
credit assignment
, a common problem in the field involving the agent trying to assign whataction
is responsible for thereward
the agent is receiving."
-
-
MDP
formulation:state
:10
waypoints spaced evenly5m
apart.- Several ego vehicle parameters such as the
wheel speed
, andengine rpm
.
discount factor
set to1
.- Exploration and termination.
- When the car drives
off-road
or when a highyaw acceleration
is measured, the episode is terminated.
reward
- Multi-objective:
efficiency
,safety
andcomfort
. - Instead of a
sum
, here aproduct
of three deviation terms:speed
(ratio),angle
(normalized ratio) anddistance
(normalized ratio).
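The multiplicative shaping can be sketched as follows (the exact normalizations are not given, so the mapping of deviations to [0, 1] is an assumption):

```python
def product_reward(speed_ratio, angle_dev, dist_dev):
    """Product of three normalized terms: a speed ratio in [0, 1] and two
    deviation terms mapped so that 1.0 = on reference, 0.0 = at the limit.
    Unlike a weighted sum, a single bad term collapses the whole reward."""
    angle_term = max(0.0, 1.0 - angle_dev)
    dist_term = max(0.0, 1.0 - dist_dev)
    return speed_ratio * angle_term * dist_term

on_track = product_reward(1.0, 0.0, 0.0)
half_speed = product_reward(0.5, 0.0, 0.0)
off_track = product_reward(1.0, 0.0, 1.0)  # large lateral error zeroes everything
```

The product prevents the agent from trading one objective for another, e.g. earning speed reward while off-centre.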
- Algorithm:
DDPG
was preferred overPPO
since it isoff-policy
: better sampling efficiency.
- Simulator:
- The commercial vehicle dynamics simulation platform
IPG CarMaker
.
-
Ablation study in the
state
space.1-
Parameters can be overwritten at test time.- During inference,
wheel speed
andengine RPM
are replaced with0.5
(the median of the normalized range [0
,1
]). -
"Surprisingly, the agent was still able to drive successfully with no evident side effects being displayed."
2-
Parameters can be removed during training.-
"However, once these features [
wheel speed
andengine RPM
] were removed, the agent was no longer able to converge." -
[Intuition - I think there must be something else] "Similar to training wheels on a bicycle, these features helped to stabilize the training and allow the agent to see patterns, that are otherwise more difficult to detect."
-
-
Generalization.
1-
Driving on unseen and more complextracks
.- The stability of the agent’s policy began to deteriorate as it was exposed to increasingly complex
states
outside of itstraining
scope.
2-
Driving at higher speeds (120
−200 km∕h
), while being trained on a top speed of80 km∕h
.-
"Once we re-scaled the normalization thresholds of the
state
–space to match the increased range of velocity, the agent was able to handle even greater speeds." -
"Re-normalizing the
state
space to match the range of the initial trainingstate
space played a key role in the agent's ability to adapt to new and unseen challenges."
-
3-
Driving with unseen and varying car models.-
"The generalizability across multiple vehicle models could be improved by adding a vehicle descriptor to the
state
–space and training with a wider range of vehicle models."
-
-
Transfer to a live test vehicle.
- Challenges:
1-
Port the trainedKeras
model toC++
, with frugally-deep.2-
Build thestate
from the AD stack.-
"
Engine rpm
was the only element of thestate
–space we could not reproduce in the AD stack, since we wanted our controller to only utilize information that any other controller had available. We replaced theengine rpm
with the median value of the normalized range ([0
,1
]), i.e.0.5
."
-
3-
Deal with latency betweencommands
andactuation
.- Actually a big topic!
- Addressed with
action
holding. -
[Also, about the decision frequency:] "The created trajectory in our AD stack was updated every 100 ms, as opposed to the
CarMaker
simulation where it was updated each simulation step, i.e. continuously."
- Findings:
-
"[Unseen
state
s and differenttransition
function] We found that moving from simulation to the real-world without additional training poses several problems. Firstly, our training simulation did not contain any sloped roads, weather disturbances (e.g.wind
), or inaccuracies of sensor measurements used to represent thestate
–space."
-
-
"While the ride comfort was not ready for a production vehicle yet, it achieved speeds of over
60 km∕h
while staying within lane and taking turns."
- Challenges:
  - I really like the `lessons learned` section.
  - `1-` Generalization.
    - "We believe that the cause of the agent's high `generalization` capabilities stems from how the agent was trained [`state`, `reward`, `scenarios`], as opposed to what algorithm was used for learning."
    - "The `scenario` used for `training` the agent is paramount to its `generalization` potential."
  - `2-` Problem of oscillation.
    - "The authors combat this by using an `action` smoothing method that constrains the maximum change allowed between `actions`. This constraint helps reduce the oscillation of the agent and promotes a smoother control policy."
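A sketch of such an `action` smoothing constraint (my own illustration; the `max_delta` bound is made up and the action is assumed to be a scalar in `[-1, 1]`):

```python
def smooth_action(prev_action, raw_action, max_delta=0.1):
    """Constrain the change between two consecutive actions to +/- max_delta."""
    delta = max(-max_delta, min(max_delta, raw_action - prev_action))
    return prev_action + delta

# An oscillating raw policy output becomes a slow, bounded ramp:
prev, smoothed = 0.0, []
for raw in [1.0, -1.0, 1.0, -1.0]:
    prev = smooth_action(prev, raw)
    smoothed.append(round(prev, 2))
```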
  - `3-` Catastrophic forgetting.
    - "Seemingly without explanation, the agent would forget everything it had learned and degraded into a policy that would immediately drive the agent off the track."
    - Solution: greatly increase the size of the replay buffer.
      - This meant the agent was able to sample from a much wider distribution of experiences and still see examples of driving in poor conditions, such as being off-centre.
    - A similar idea could be to maintain two buffers: a `safe` and a `non-safe` one, as in `(Baheri et al., 2019)`.
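The two-buffer idea of `(Baheri et al., 2019)` could look like this (my own sketch; the capacity and the mixing ratio are arbitrary choices):

```python
import random

class DualReplayBuffer:
    """Keep `safe` and `non-safe` transitions in separate buffers, so rare
    unsafe experiences (e.g. collisions, off-centre driving) are never crowded
    out, and mix both in every mini-batch."""
    def __init__(self, capacity=10000, unsafe_ratio=0.25):
        self.safe, self.unsafe = [], []
        self.capacity, self.unsafe_ratio = capacity, unsafe_ratio

    def add(self, transition, is_unsafe):
        buf = self.unsafe if is_unsafe else self.safe
        buf.append(transition)
        if len(buf) > self.capacity:
            buf.pop(0)                      # FIFO eviction per buffer

    def sample(self, batch_size):
        n_unsafe = min(int(batch_size * self.unsafe_ratio), len(self.unsafe))
        batch = random.sample(self.unsafe, n_unsafe)
        batch += random.sample(self.safe, min(batch_size - n_unsafe, len(self.safe)))
        return batch
```

Even with the unsafe buffer holding only a tiny fraction of all experiences, every update still sees a fixed share of them, which is the point of the separation.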
  - `4-` Buffer initialization.
    - "During our experiments, we noticed that pre-populating the replay buffer before training accelerated the convergence."
  - `5-` `Loss` in the `Q`-net is not converging.
    - As noted in this post:
      - This is not unusual for `RL` and does not indicate anything is wrong.
        - `1)` As the agent gets better at playing, estimating the reward does get more difficult (because it's no longer always `0`).
        - `2)` As the `reward` gets higher, and the average episode length gets longer, the amount of variance in the `reward` can also get larger, so it's challenging even to prevent the loss from increasing.
        - `3)` A third factor is that the constantly changing policy poses a "moving-target" problem for the `Q-network`.
    - "While the `DNN`'s `loss` plot gives an indication of whether or not the agent is learning, it does not reflect the quality of the policy the agent has learned, i.e. an agent may learn a bad policy well."
  - `6-` `action` repeats.
    - The agent repeats the same `action` for a given number of consecutive steps without re-evaluating the `state` space.
    - "We could accelerate the training and address the latency issues."
    - Is the agent aware of that? For instance, by penalizing `action` changes?
  - `7-` Poor reproducibility.
    - Other references for `RL debugging`:
"Reinforcement Learning based Control of Imitative Policies for Near-Accident Driving"
Click to expand
Illustration of rapid phase transitions: when small changes in the critical states – the ones we see in near-accident scenarios – require dramatically different actions of the autonomous car to stay safe. Source. |
I must say I am a bit disappointed by the collision rate results. The authors mention safety a lot, but their approaches crash every third trial on the unprotected turn. The too-conservative agent TIMID gets zero collisions. Is it then fair to claim ''Almost as Safe as Timid''? In such cases, what could be needed is an uncertainty-aware agent, e.g. a POMDP with information-gathering behaviours. Source. |
One of the five scenarios: in this unprotected left turn, a truck occludes the oncoming ado car. Bottom left: the two primitive policies (aggressive and timid) are first learnt by imitation. Then they are used to train a high-level policy with RL, which selects at each timestep ts (larger than the primitive timestep) which primitive to follow. Bottom right: while AGGRESSIVE achieves higher completion rates for low time limits, it cannot improve further as the limit increases, because of collisions. Source. |
About phase transitions: H-REIL usually chooses the timid policy in areas with a collision risk, while staying aggressive at other locations when it is safe to do so. Baselines: π-agg, resp. π-timid, has been trained only on aggressive, resp. timid, rule-based demonstrations with IL. π-IL was trained on the mixture of both. A pity that no pure RL baseline is presented. Source. |
Authors: Cao, Z., Bıyık, E., Wang, W. Z., Raventos, A., Gaidon, A., Rosman, G., & Sadigh, D.
-
Motivation:
- Learn (no
rule-based
) driving policies in near-accident scenarios, i.e. where quick reactions are required, beingefficient
, whilesafe
. - Idea: decompose this complicated task into two levels.
- Learn (no
-
Motivations for a hierarchical structure:
1-
The presence of rapid phase transitions makes it hard forRL
andIL
to capture the policy because they learn a smooth policy acrossstates
.-
"Phase transitions in autonomous driving occur when small changes in the critical
states
– the ones we see in near-accident scenarios – require dramatically differentactions
of the autonomous car to stay safe." - Due to the non-smooth
value
function, anaction
taken in one state may not generalize to nearbystates
. - During training, the algorithms must be able to visit and handle all the critical states individually, which can be computationally inefficient.
- How to model the rapid phase transition?
- By switching from one
driving mode
to another. -
[The main idea here] "Our key insight is to model phase transitions as optimal switches, learned by
RL
, between differentmodes
of driving styles, each learned throughIL
."
- By switching from one
-
2-
To achieve full coverage,RL
needs to explore the full environment whileIL
requires a large amount of expert demonstrations covering allstates
.- Both are prohibitive since the
state
-action
space in driving is continuous and extremely large.- One solution to improve data-efficiency: Conditional imitation learning (
CoIL
). it extendsIL
with high-level commands and learns a separateIL
model for each command. High-level commands are required at test time, e.g. the direction at an intersection. Instead of depending on drivers to provide commands, the authors would like to learn the optimal mode-switching policy.
- One solution to improve data-efficiency: Conditional imitation learning (
-
"The mode switching can model rapid phase transitions. With the reduced action space and fewer time steps, the high-level
RL
can explore all thestates
efficiently to addressstate
coverage." -
"
H-REIL
framework usually outperformsIL
with a large margin, supporting the claim that in near-accident scenarios, training a generalizableIL
policy requires a lot of demonstrations."
- Both are prohibitive since the
-
Hierarchical RL
enables efficient exploration for the higher level with a reducedaction
space, i.e. goal space, while makingRL
in the lower level easier with an explicit and short-horizon goal.
-
Motivations for combining
IL
+RL
, instead of singleHRL
orCoIL
:1-
For the low-level policy:- Specifying
reward
functions forRL
is hard. -
"We emphasize that
RL
would not be a reasonable fit for learning the low-level policies as it is difficult to define thereward
function." -
"We employ
IL
to learn low-level policiesπi
, because each low-level policy sticks to one driving style, which behaves relatively consistently acrossstates
and requires little rapid phase transitions."
- Specifying
2-
For the high-level policy:-
"
IL
does not fit to the high-level policy, because it is not natural for human drivers to accurately demonstrate how to switch driving modes." -
"
RL
is a better fit since we need to learn to maximize thereturn
based on areward
that contains a trade-off between various terms, such as efficiency and safety. Furthermore, the action space is now reduced from a continuousspace
to a finite discretespace
[the conditional branches]." - Denser
reward
signals: settingts > 1
reduces the number of time steps in an episode and makes the collision penalty, which appears at most once per episode, less sparse.
-
-
Hierarchical
reinforcement
andimitation
learning (H-REIL
).1-
One high-level (meta-) policy, learned byRL
that switches between differentdriving modes
.- Decision: which low-level policy to use?
- Goal: learn a mode switching policy that maximizes the
return
based on a simpler pre-definedreward
function.
2-
Multiple low-level policiesπi
, learned byIL
: one perdriving mode
.- Imitate drivers with different characteristics, such as different
aggressiveness
levels. - They are "basic" and realize relatively easier goals.
-
"The low-level policy for each
mode
can be efficiently learned withIL
even with only a few expert demonstrations, sinceIL
is now learning a much simpler and specific policy by sticking to one driving style with little phase transition."
- Imitate drivers with different characteristics, such as different
-
How often are decisions taken?
  - `500ms` = timestep of the high level. `100ms` = timestep of the low level.
  - There is clearly a trade-off:
    - `1-` The high level should run at a low frequency (`action` abstraction).
    - `2-` But not too low, since it should be able to react and switch quickly.
- Maybe the
timid
IL
primitive could take over before the end of the500ms
if needed. - Or add some
reward
regularization to discourage changing modes as long as it is not very crucial.
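The two-level timing can be sketched as follows (my own illustration; with `ts = 5`, one high-level decision covers five `100ms` low-level steps, i.e. `500ms`; all three callables are placeholders for the learned components):

```python
def h_reil_rollout(high_level, low_level, env_step, obs, n_ll_steps, ts=5):
    """Timing sketch of the H-REIL hierarchy: the RL meta-policy picks a
    driving mode every `ts` low-level steps, and the selected IL primitive
    outputs a control at every step."""
    modes, mode = [], None
    for t in range(n_ll_steps):
        if t % ts == 0:                 # high-level decision, 1/ts frequency
            mode = high_level(obs)      # e.g. "timid" or "aggressive"
        action = low_level(mode, obs)   # the chosen IL primitive acts
        obs = env_step(action)
        modes.append(mode)
    return modes
```

This also makes the trade-off above concrete: a larger `ts` shrinks the number of high-level decisions per episode, but delays any switch to the `timid` primitive by up to `ts` low-level steps.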
-
Two
driving modes
:1-
timid
: drives in a safe way to avoid all potential accidents. It slows down whenever there is even a slight risk of an accident.2-
aggressive
: favoursefficiency
oversafety
. "It drives fast and frequently collides with the ado car."-
"Since humans often do not optimize for other nuanced metrics, such as
comfort
, in a near-accident scenario and the planning horizon of our high-level controller is extremely short, there is a limited amount of diversity that different modes of driving would provide, which makes having extra modes unrealistic and unnecessary in our setting." -
"This intelligent mode switching enables
H-REIL
to drive reasonably under different situations: slowly and cautiously under uncertainty, and fast when there is no potential risk."
-
About the low-level
IL
task.Conditional imitation
:- All the policies share the same feature extractor.
- Different branches split in later layers for
action
prediction, where each corresponds to onemode
. - The branch is selected by external input from high-level
RL
.
- Each scenario is run with the ego car following a hand-coded policy, with two settings:
difficult
andeasy
.- Are there demonstrations of collisions? Are
IL
agents supposed to imitate that? How can they learn to recover from near-accident situations? -
"The
difficult
setting is described above where the ado car acts carelessly or aggressively, and is likely to collide with the ego car." - These demonstrations are used to learn the
aggressive
andtimid
primitive policies.
- Are there demonstrations of collisions? Are
- Trained with
COiLTRAiNE
. - Number of episodes collected per
mode
, for imitation:CARLO
:80.000
(computationally lighter).CARLA
:100
(it includes perception data).
-
About the high-level
RL
task:POMDP
formulation.reward
-
"It is now much easier to define a
reward
function because the ego car already satisfies some properties by following the policies learned from expert demonstrations. We do not need to worry aboutjerk
[ifts
large], because the experts naturally give low-jerk demonstrations." 1-
Efficiency term:Re
is negative in every time step, so that the agent will try to reach its destination as quickly as possible.- Ok, but how can it scale to continuous (non-episodic) scenarios, such as real-world driving?
2-
Safety term:Rs
gets an extremely large negative value if a collision occurs.
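A sketch of such a two-term `reward` (my own illustration; the magnitudes of `Re` and `Rs` are hypothetical, the text above only describes their signs):

```python
def high_level_reward(collided, step_cost=-0.05, collision_penalty=-100.0):
    """Efficiency term Re: a small negative value at every step, so reaching
    the destination quickly maximizes the return.  Safety term Rs: an
    extremely large negative value on collision."""
    r_e = step_cost
    r_s = collision_penalty if collided else 0.0
    return r_e + r_s

# A slow but safe episode (40 steps) beats a fast one ending in a crash:
safe_return = sum(high_level_reward(False) for _ in range(40))
crash_return = sum(high_level_reward(False) for _ in range(9)) + high_level_reward(True)
```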
-
training
environment:CARLO
.- No detail about the
transition
model.
-
About
observation
spaces, for both tasks:- In
CARLO
:positions
andspeeds
of the ego car and the ado car, if not occluded, perturbed with Gaussian noise. - In
CARLA
: Same but with front-viewimage
.- How to process the image?
- Generate a binary image using an object detection model.
- Only the bounding boxes are coloured white. It provides information of the ado car more clearly and alleviates the environmental noise.
- How to process the image?
- How can the agents be trained if the
state
space varies? - Are frames stacked, as represented on the figure? Yes, one of the authors told me
5
are used.
- In
-
About
CARLO
simulator to train faster.CARLO
stands forCARLA
- Low Budget. It is less realistic but computationally much lighter thanCARLA
.-
"While
CARLO
does not provide realistic visualizations other than two-dimensional diagrams, it is useful for developing control models and collecting large amounts of data. Therefore, we useCARLO
as a simpler environment where we assume perception is handled, and so we can directly use the noisy measurements of other vehicles’speeds
andpositions
(if not occluded) in addition to thestate
of the ego vehicle."
-
Some concerns:
- What about the car dynamics in
CARLO
?-
[same
action
space] "For bothCARLO
andCARLA
, the control inputs for the vehicles arethrottle
/brake
andsteering
." CARLO
assumes point-mass dynamics models, while the model of physics engine ofCARLA
is much more complex with non-linear dynamics!- First, I thought agents were trained in
CARLO
and tested inCARLA
. But the transfer is not possible because of the mismatch in `dynamics`
andstate
spaces.- But apparently training and testing are performed individually and separately in both simulators. Ok, but only one set of results is presented. I am confused.
-
- Distribution shift and overfitting.
- Evaluation is performed on scenarios used during
training
. - Can the system address situations it has not been trained on?
- Evaluation is performed on scenarios used during
- Safety!
- What about the car dynamics in
"Safe Reinforcement Learning for Autonomous Lane Changing Using Set-Based Prediction"
-
[
2020
] [📝] [ 🎓TU Munich
] -
[
risk estimation
,reachability analysis
,action-masking
]
Click to expand
action masking for safety verification using set-based prediction and sampling-based trajectory planning. Top-left: A braking trajectory with maximum deceleration is appended to the sampled trajectory. The ego vehicle never follows this braking trajectory , but it is utilized to check if the vehicle is in an invariably safe state at the end of its driving trajectory. Source. |
Authors: Krasowski, H., Wang, X., & Althoff
-
Previous work:
"High-level Decision Making for Safe and Reasonable Autonomous Lane Changing using Reinforcement Learning"
, (Mirchevska, Pek, Werling, Althoff, & Boedecker, 2018). -
Motivations:
- Let a safety layer guide the
exploration
process to forbid (mask
) high-levelactions
that might result in a collision and to speed up training.
- Let a safety layer guide the
-
Why is it called "set-based prediction"?
- Using reachability analysis, the
set
of future occupancies of each surrounding traffic participant and the ego car is computed. -
"If both
occupancy sets
do not intersect for all consecutive time intervals within a predefined time horizon and if the ego vehicle reaches an invariably safe set, a collision is impossible." - Two-step prediction:
1-
The occupancies of the surrounding traffic participants are obtained by using TUM's toolSPOT
= "Set-Based Prediction Of Traffic Participants".-
[
action
space] "High-levelactions
for lane-changing decisions:changing to the left lane
,changing to the right lane
,continuing in the current lane
, and staying in the current lane by activating a safe adaptive cruise control (ACC
)." SPOT
considers the physical limits of surrounding traffic participants and constraints implied by traffic rules.
-
2-
The precise movement is obtained by a sampling-based trajectory planner.
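The masking step can be sketched as follows (my own simplification: occupancies are reduced to 1D longitudinal intervals per time interval, whereas `SPOT` computes full 2D occupancy sets):

```python
def intervals_overlap(a, b):
    """Do two closed 1D intervals (min, max) overlap?"""
    return a[0] <= b[1] and b[0] <= a[1]

def is_action_safe(ego_occupancy, others_occupancy):
    """The ego occupancy (driving trajectory + appended full-braking part)
    must not intersect any predicted occupancy set in any time interval."""
    for ego_occ, others in zip(ego_occupancy, others_occupancy):
        if any(intervals_overlap(ego_occ, occ) for occ in others):
            return False
    return True

def mask_unsafe_actions(candidates, others_occupancy):
    """Keep only the high-level actions whose planned trajectory is verified
    as collision-free; the masked ones are never explored."""
    return [action for action, occupancy in candidates.items()
            if is_action_safe(occupancy, others_occupancy)]
```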
- Using reachability analysis, the
-
Other approaches:
-
"One approach is to execute the planned trajectories if they do not collide with a traffic participant according to its prediction. The limitation is that collisions still happen if other traffic participants’ behaviour deviates from their prediction."
- Reachability analysis verifies the safety of planned trajectories by computing all possible future motions of obstacles and checking whether they intersect with the occupancy of the ego vehicle.
-
"Since computing the exact reachable sets of nonlinear systems is impossible, reachable sets are over-approximated to ensure safety."
-
-
-
What if, after
masking
, all actions are verified as unsafe? Can safety be guaranteed?-
"To guarantee safety, we added a verified fail-safe planner, which holds available a safe action that is activated when the agent fails to identify a safe action." [not clear to me]
- Apparently, the lane is kept and an
ACC
module is used.
-
-
MDP
formulation: Episodic task.-
"We terminate an episode if the time horizon of the current traffic scenario is reached, the goal area is reached, or the ego vehicle collides with another vehicle."
- Furthermore, the
distance to goal
is contained inside thestate
. - Not clear to me: How can this work in long highway journeys? By setting successive goals? How can
state
values be consistent? Should not the task becontinuous
rather thanepisodic
?
-
-
About safe
RL
.- Safe
RL
approaches are distinguished by approaches that:1-
Modify theoptimization criterion
.2-
Modify theexploration process
."
-
"By modifying the
optimality objective
, agents behave more cautious than those trained without a risk measure included in the objective; however, the absence of unsafe behaviors cannot be proven. In contrast, by verifying the safety of theaction
and excluding possible unsafeactions
, we can ensure that the exploration process is safe."
- Safe
-
Using real-world highway dataset (
highD
).-
"We generated tasks by removing a vehicle from the recorded data and using its start and the final
state
as the initialstate
and the center of the goal region, respectively." -
[Evaluation] "We have to differentiate between collisions for which the ego vehicle is responsible and collisions that occur because no interaction between traffic participants was considered due to prerecorded data."
-
-
Limitations.
1-
Here, safeactions
are determined by set-based prediction, which considers all possible motions of traffic participants.-
"Due to the computational overhead for determining safe actions, the computation time for training safe agents is
16
times higher than for the non-safe agents. The average training step for safe agents takes5.46s
and0.112s
for non-safe agents." -
"This significant increase in the training time is mainly because instead of one trajectory for the selected action, all possible trajectories are generated and compared to the predicted occupancies of traffic participants."
-
  - `2-` Distribution shift.
    - "The agents trained in `safe` mode did not experience dangerous situations with high penalties during training and cannot solve them in the `non-safe` test setting. Thus, the safety layer is necessary during deployment to ensure safety."
  - `3-` The interaction between traffic participants is essential.
    - "Although the proposed approach guarantees safety in all scenarios, the agent drives more cautiously than normal drivers, especially in dense traffic."
"Safe Reinforcement Learning with Mixture Density Network: A Case Study in Autonomous Highway Driving"
-
[
2020
] [📝] [ 🎓West Virginia University
] -
[
risk estimation
,collision buffer
,multi-modal prediction
]
Click to expand
Multimodal future trajectory predictions are incorporated into the learning phase of RL algorithm as a model lookahead: If one of the future states of one of the possible trajectories leads to a collision, then a penalty will be assigned to the reward function to prevent collision and to reinforce to remember unsafe states. Otherwise, the reward term penalizes deviations from the desired speed , lane position , and safe longitudinal distance to the lead traffic vehicle. Source. |
Author: Baheri, A.
-
Previous work:
"Deep Q-Learning with Dynamically-Learned Safety Module: A Case Study in Autonomous Driving"
(Baheri et al., 2019), detailed on this page too. -
Motivations:
- Same ideas as the previous work.
1-
Speed up the learning phase.2-
Reduce collisions (no safety guarantees though).
-
Ingredients:
- "Safety" is improved via
reward
shaping, i.e. modification of the optimization criterion, instead of constraining exploration (e.g.action masking
,action shielding
- Two ways to classify a `state` as risky:
  - `1-` Heuristic (rule-based). From a minimum relative `gap` to a traffic vehicle, based on its relative `velocity`.
  - `2-` Prediction (learning-based). Model lookaheads (prediction / rollouts) are performed to assess the risk of a given `state`.
    - [Why learnt?] "Because heuristic safety rules are susceptible to deal with unexpected behaviors particularly in a highly changing environment". [Well, here the scenarios are generated in a simulator, that is ... heuristic-based]
- Contrary to
action masking
, "bad behaviours" are not discarded: they are stored in acollision
-buffer, which is sampled during theupdate
phase.
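The heuristic variant could be as simple as this (my own sketch; `time_margin` and `min_gap` are invented parameters, the paper's exact rule is not given here):

```python
def is_state_risky(gap, relative_velocity, time_margin=2.0, min_gap=5.0):
    """Risky if the gap to the traffic vehicle is below a minimum that grows
    with the closing speed (relative_velocity > 0 means closing in)."""
    closing_speed = max(0.0, relative_velocity)
    required_gap = min_gap + time_margin * closing_speed
    return gap < required_gap

# Closing in at 10 m/s from 20 m behind: required gap = 5 + 2 * 10 = 25 m.
```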
- "Safety" is improved via
-
How to predict a set of possible trajectories?
- Mixture density
RNN
(MD-RNN
). - It has been offline trained (supervised learning).
-
"The central idea of a
MDN
is to predict an entire probability distribution of the output(s) instead of generating a single prediction. TheMD-RNN
outputs aGMM
for multimodal future trajectory predictions, in which each mixture component describes a certain driving behavior."
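A minimal `MDN` head, stripped of the `RNN` part (my own pure-Python sketch of the idea: one linear layer mapping features to `GMM` parameters, with a softmax for the mixture weights and an exp to keep the standard deviations positive):

```python
import math

def mdn_head(features, layer, n_mixtures):
    """`layer` is a list of (weight_row, bias) pairs, 3 * n_mixtures of them:
    the first third parameterizes the mixture weights (softmax), the second
    third the means, and the last third the log standard deviations."""
    logits = [sum(w * f for w, f in zip(w_row, features)) + b
              for w_row, b in layer]
    z = [math.exp(p) for p in logits[:n_mixtures]]
    pi = [p / sum(z) for p in z]                 # mixture weights, sum to 1
    mu = logits[n_mixtures:2 * n_mixtures]       # one mean per component
    sigma = [math.exp(s) for s in logits[2 * n_mixtures:]]  # always > 0
    return pi, mu, sigma
```

Each component of the returned `GMM` can then stand for one behaviour mode (e.g. `keep lane` vs. `cut in`), which is what makes the prediction multimodal.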
- Mixture density
"Reinforcement Learning with Uncertainty Estimation for Tactical Decision-Making in Intersections"
-
[
2020
] [📝] [ 🎓Chalmers University
] [ 🚗Volvo
,Zenuity
] -
[
uncertainty-aware
,continuous episodes
]
Click to expand
Confidence of the recommended actions in a DQN are estimated using an ensemble method. Situation that are outside of the training distribution and events rarely seen while training can be detected. Middle: The reward associated to non-terminated state s is function of the jerk (comfort) and scaled so that its accumulation until timeout reaches -1 if no jerk is applied. Source. |
Authors: Hoel, C.-J., Tram, T., & Sjöberg, J.
-
Previous work: "Tactical Decision-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation" (Hoel, Wolff, & Laine, 2020), detailed on this page too.
-
Motivation:
- Same idea as the previous work (ensemble
RPF
method used to estimate the confidence of the recommendedactions
) but addressing intersections instead of highways scenarios.
- Same idea as the previous work (ensemble
-
Similar conclusions:
- Uncertainty for situations that are outside of the training distribution can be detected, e.g. with the approaching vehicle driving faster than during training.
-
"Importantly, the method also indicates high uncertainty for rare events within the training distribution."
-
Miscellaneous:
- The threshold
c-safe
is still manually selected. Automatically updating its value during training would be beneficial. state
similar to "Learning When to Drive in Intersections by Combining Reinforcement Learning and Model Predictive Control", (Tram, Batkovic, Ali, & Sjöberg, 2019), detailed on this page too.action
space more abstract: {take way
,give way
,follow car {1, ... , 4}
}, relying on anIDM
for the implementation.- The timestep has been reduced from
1s
to0.04s
. Better to react!
- The threshold
-
Learning interactions:
- A
1d-CNN
structure is used for the input describing the surrounding (and interchangeable) vehicles.
"Applying the same weights to the input that describes the surrounding vehicles results in a better performance."
- A
-
How to learn
continuous
tasks while being trained onepisodic
ones (withtimeout
)?-
"If the current
policy
of the agent decides to stop the ego vehicle, anepisode
could continue forever. Therefore, atimeout
time is set toτmax = 20 s
, at which the episode terminates. The last experience of such an episode is not added to the replay memory. This trick prevents the agent to learn that an episode can end due to atimeout
, and makes it seem like an episode can continue forever, which is important, since the terminatingstate
due to the time limit is not part of theMDP
." -
"The
state
space, described above, did not provide any information on where in an episode the agent was at a given time step, e.g. if it was in the beginning or close to the end. The reason for this choice was that the goal was to train an agent that performed well inhighway
driving of infinite length. Therefore, thelongitudinal position
was irrelevant. However, at the end of a successful episode, the future discountedreturn
was0
. To avoid that the agent learned this, the last experience was not stored in the experience replay memory. Thereby, the agent was tricked to believe that the episode continued forever. [(C. Hoel, Wolff, & Laine, 2018)]"
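The trick can be sketched as follows (my own illustration of the quoted idea):

```python
def store_episode(replay_memory, transitions, ended_by_timeout):
    """If the episode only ended because of the time limit (which is not part
    of the MDP), drop the last transition, so the agent never observes a
    terminal state caused by the timeout and treats the task as continuing."""
    if ended_by_timeout:
        transitions = transitions[:-1]   # discard the artificial terminal step
    replay_memory.extend(transitions)
    return replay_memory

# A 5-step episode cut off by the timeout contributes only 4 transitions:
memory = store_episode([], [(t, "action", 0.0) for t in range(5)], True)
```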
-
"Development of A Stochastic Traffic Environment with Generative Time-Series Models for Improving Generalization Capabilities of Autonomous Driving Agents"
-
[
2020
] [📝] [ 🎓Istanbul Technical University
] -
[
generalisation
,driver model
]
Click to expand
The trajectory generator uses the Social-GAN architecture with a 0.8s horizon. Bottom-right: state vector of the agent trained with Rainbow-DQN . Bottom-right: reward function: hard crash refers to direct collisions with other vehicles whereas the soft crash represents dangerous approaches (no precise detail). Source. |
Authors: Ozturk, A., Gunel, M. B., Dal, M., Yavas, U., & Ure, N. K.
- Previous work: "Automated Lane Change Decision Making using Deep Reinforcement Learning in Dynamic and Uncertain Highway Environment" (Alizadeh et al., 2019)
- Motivation:
- Increase generalization capabilities of a
RL
agent by training it in a non-deterministic and data-driven traffic simulator.-
"Most existing work assume that surrounding vehicles employ rule-based decision-making algorithms such as
MOBIL
andIDM
. Hence the traffic surrounding the ego vehicle always follow smooth and meaningful trajectories, which does not reflect the real-world traffic where surrounding vehicles driven by humans mostly execute erroneous manoeuvres and hesitate during lane changes." -
"In this work, we develop a data driven traffic simulator by training a generative adversarial network (
GAN
) on real life trajectory data. The simulator generates randomized trajectories that resembles real life traffic interactions between vehicles, which enables training theRL
agent on much richer and realistic scenarios."
-
- Increase generalization capabilities of a
- About
GAN
for trajectory generations:-
"The
generator
takes the real vehicle trajectories and generate new trajectories. Thediscriminator
classifies the generated trajectories as real or fake." - Two ways to adapt
GAN
architecture to time-series data:1-
Convert the time series data into a 2D array and then perform convolution on it.2-
[done here] Develop a sequence to sequence encoder and decoderLSTM
network.
NGSIM
seems to have been used for training. Not detailed.
-
- To generate interaction-aware trajectories:
- Based on
Social-GAN
.-
"A social pooling is introduced where a
LSTM
Encoder encodes all the vehicles' position in a relative manner to the rest of the vehicles then a max pooling operation is performed at the hidden states of the encoder; arriving at a socially aware module."
-
- Alternatives could be to include graph-based or convolution-based models to extract interaction models.
- Based on
"From Simulation to Real World Maneuver Execution using Deep Reinforcement Learning"
-
[
sim2real
,noise injection
,train-validation-test
,D-A3C
,conditional RL
,action repeat
]
Click to expand
Environments for training (top) differ from those used for validation and testing (bottom): Multi-environment System consists in four training roundabouts in which vehicles are trained simultaneously, and a validation environment used to select the best network parameters based on the results obtained on such scenario. The generalization performance is eventually measured on a separate test environment. Source. |
A sort of conditional-learning methods: an aggressiveness level is set as an input in the state and considered during the reward computation. More precisely: α = (1−aggressiveness) . During the training phase, aggressiveness assumes a random value from 0 to 1 kept fixed for the whole episode. Higher values of aggressiveness should encourage the actor to increase the impatience; consequently, dangerous actions will be less penalized. The authors note that values of aggressiveness sampled outside the training interval [0, 1] produce consistent consequences to the agent conduct, intensifying its behaviour even further. The non-visual channel can also be used to specify the desired cruising speed. Source1 Source2. |
Source. |
Authors: Capasso, A. P., Bacchiani, G., & Broggi, A.
- Motivations:
1-
Increase robustness. In particular generalize to unseen scenarios.- This relates to the problem of overfitting for
RL
agents. -
"Techniques like random starts and sticky actions are often un-useful to avoid overfitting."
- This relates to the problem of overfitting for
2-
Reduce the gap between synthetic and real data.- Ambition is to deploy a system in the real world even if it was fully trained in simulation.
- Ingredients:
1-
Multipletraining
environments.2-
Use of separatedvalidation
environment to select the best hyper-parameters.3-
Noise injection to increase robustness.
- Previous works:
1
Intelligent Roundabout Insertion using Deep Reinforcement Learning [🎞️] about the need for a learning-based approach.-
"The results show that a rule-based (
TTC-
based) approach could lead to long waits since its lack of negotiation and interaction capabilities brings the agent to perform the entry only when the roundabout is completely free".
-
2
Microscopic Traffic Simulation by Cooperative Multi-agent Deep Reinforcement Learning about cooperative multi-agent:-
"Agents are collectively trained in a multi-agent fashion so that cooperative behaviors can emerge, gradually inducing agents to learn to interact with each other".
- As opposed to rule-based behaviours for
passive
actors hard coded in the simulator (SUMO
,CARLA
). -
"We think that this multi-agent learning setting is captivating for many applications that require a simulation environment with intelligent agents, because it learns the joint interactive behavior eliminating the need for hand-designed behavioral rules."
-
- Two types of
agent
,action
, andreward
.1-
Active agents learn to enter a roundabout.- The length of the episode does not change since it ends once the insertion in the roundabout is completed.
- High-level
action
in {Permitted
,Not Permitted
,Caution
}
2-
Passive agents learn to drive in the roundabout.- Low-level
action
in {accelerate
(1m/s2
),keep the same speed
,brake
(−2m/s2
).} - The length of the episode changes!
-
"We multiply the
reward
by a normalization factor, that is the ratio between the path length in which the rewards are computed, and the longest path measured among all the training scenarios"
-
- Low-level
- They interact with each other.
- The density of passive agents on the roundabout can be set in {
low
,medium
,high
}.
- The density of passive agents on the roundabout can be set in {
- The abstraction levels differ: What are the timesteps used? What about action holding?
- About the hybrid
state
:- Visual channel:
4
images84
x84
, corresponding to a square area of50
x50
meters around the agent:- Obstacles.
- Ego-Path to follow.
- Navigable space.
- Stop line: the position where the agent should stop if the entry cannot be made safely.
- Non-visual channel:
- Measurements:
- Agent speed.
- Last action performed by the vehicle.
- Goal specification:
-
"We coupled the visual input with some parameters whose values influence the agent policy, inducing different and tuneable behaviors: used to define the goal for the agent."
- Target speed.
Aggressiveness
in the manoeuvre execution:- A parameter fed to the net, which is taken into account in the reward computation.
- In previous works, the authors tuned this
aggressiveness level
as theelapsed time ratio
: time and distance left.
- I see that as a conditional-learning technique: one can set the level of aggressiveness of the agent during testing.
-
- In previous works, frames were stacked, so that speed can be inferred.
- About robustness via noise injection:
- Perception noise:
position
,size
andheading
as well as errors in the detection of passive vehicles (probablyfalse positive
andfalse negative
). - Localisation noise: generate a new path of the active vehicle perturbing the original one using Cubic Bézier curves.
- Multi-env and
train-validate-test
:-
"Multi-environment System consists in four
training
roundabouts in which vehicles are trained simultaneously, and avalidation
environment used to select the best network parameters based on the results obtained on such scenario." - The generalization performance is eventually measured on a further
test
environment, which does not have any active role during the training phase.
-
- About the simulator.
- Synthetic representations of real scenarios are built with the Cairo graphics library.
- About the
RL
algorithm.- Asynchronous Advantage Actor Critic (
A3C
):- Reducing the correlation between experiences, by collecting experiences from agents running in parallel environment instances. An alternative to the commonly used experience replay.
- This parallelization also increases the time and memory efficiency.
- Delayed A3C (
D-A3C
):- Keep parallelism: different environment instances.
- Make interaction possible: several agents in each environment, so that they can sense each other.
- Reduce the synchronization burden of
A3C
: -
"In
D-A3C
, the system collects the contributions of each asynchronous learner during the whole episode, sending the updates to the global network only at the end of the episode, while in the classicA3C
this exchange is performed at fixed time intervals".
- About frame-skipping technique and action repeat (previous work).
-
"We tested both with and without action repeat of
4
, that is repeating the last action for the following4
frames (repeated actions are not used for computing the updates). It has been proved that in some cases action repeat improves learning by increasing the capability to learn associations between temporally distant (state
,action
) pairs, giving toaction
s more time to affect thestate
- However, the use of repeated actions brings a drawback: it diminishes the rate at which agents interact with the environment.
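The action-repeat technique described above can be sketched as a minimal environment wrapper (a toy illustration, not the authors' code; as in the paper, the repeated frames would be excluded from the learning updates):

```python
class ActionRepeatWrapper:
    """Repeat each selected action for `repeat` environment steps."""
    def __init__(self, env, repeat: int = 4):
        self.env = env
        self.repeat = repeat

    def step(self, action):
        total_reward, done = 0.0, False
        for _ in range(self.repeat):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:  # stop repeating if the episode ends early
                break
        return obs, total_reward, done
```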
-
"Delay-Aware Multi-Agent Reinforcement Learning"
Click to expand
Real cars exhibit action delay. Standard delay-free MDP s can be augmented to Delay-Aware MDP (DA-MDP ) by enriching the observation vector with the sequence of previous action . Source1 Source2. |
Combining multi-agent and delay-awareness. In the left-turn scenario, agent s decide the longitudinal acceleration based on the observed position and velocity of other vehicles. They are positively rewarded if all of them successfully finish the left turn and penalized if any collision happens. Another cooperative task is tested: driving out of the parking lot. Source. |
Authors: Chen, B., Xu, M., Liu, Z., Li, L., & Zhao, D.
- Motivations:
1-
Deal withobservation
delays andaction
delays in theenvironment
-agent
interactions ofMDP
s.- It is hard to transfer a policy learnt with standard delay-free MDPs to real-world cars because of actuator delays.
-
"Most
DRL
algorithms are evaluated in turn-based simulators likeGym
andMuJoCo
, where theobservation
,action
selection and actuation of the agent are assumed to be instantaneous."
-
- Ignoring the delay of agents violates the Markov property and results in
POMDP
s, with historical actions as hidden states.- To retrieve the Markov property, a Delay-Aware
MDP
(DA-MDP
) formulation is proposed here.
- To retrieve the Markov property, a Delay-Aware
2-
One option would be to gomodel-based
, i.e. learning the transition dynamic model. Here the authors prefer to staymodel-free
.-
"The control community has proposed several methods to address the delay problem, such as using Smith predictor [24], [25], Artstein reduction [26], [27], finite spectrum assignment [28], [29], and H∞ controller."
-
3-
Apply to multi-agent problems, using the Markov game (MG
) formulation.-
"Markov game is a multi-agent extension of
MDP
with partially observable environments." - Application: vehicles trying to cooperate at an unsignalized intersection.
-
- About delays:
- There exist two kinds:
-
"For simplicity, we will focus on the
action
delay in this paper, and the algorithm and conclusions should be able to generalize to systems withobservation
delays."
-
-
"The delay of a vehicle mainly includes:"
- Actuator delay.
-
"The actuator delay for vehicle powertrain system and hydraulic brake system is usually between
0.3
and0.6
seconds."
-
- Sensor delay.
-
"The delay of sensors (cameras, LIDARs, radars, GPS, etc) is usually between
0.1
and0.3
seconds."
-
- Time for decision making.
- Communication delay, in
V2V
.
-
"Consider a velocity of
10 m/s
, the0.8
seconds delay could cause a position error of8 m
, which injects huge uncertainty and bias to the state-understanding of the agents." - Main idea:
observation
augmentation withaction
sequence buffer to retrieve the Markov property.- The authors show that a Markov game (
MG
) with multi-step action delays can be converted to a regularMG
by state augmentation (delay awareMG
=DA-MG
).- Proof by comparing their corresponding
Markov Reward Processes
(MRP
s). - Consequence: instead of solving
MG
s with delays, we can alternatively solve the correspondingDA-MGs
directly withDRL
.
- Proof by comparing their corresponding
- The input of the policy now consists of:
- The current information of the environment. E.g.
speeds
andpositions
. - The planned action sequence of length
k
that will be executed in the next steps.
- The agents interact with the environment not directly but through an action buffer.
- The state vector is augmented with an action sequence being executed in the next
k
steps wherek
∈
N
is the delay duration. - The dimension of
state
space increases, sinceX
becomesS
×Ak
. a(t)
is the action taken at timet
but executed at timet + k
due to thek
-step action delay.- Difference between select an action (done by the
agent
) and execute anaction
(done by theenvironment
).
- Difference between select an action (done by the
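The action-buffer augmentation described above can be sketched as an environment wrapper (my toy illustration, not the authors' implementation): the agent observes the current state plus the `k` actions already selected but not yet executed, and the environment applies the oldest buffered action at each step.

```python
from collections import deque

class DelayAwareWrapper:
    """DA-MDP sketch: augment the observation with the buffer of k pending actions."""
    def __init__(self, env, k: int, initial_action=0.0):
        self.env = env
        self.k = k
        # actions selected earlier, waiting to be applied by the environment
        self.buffer = deque([initial_action] * k, maxlen=k)

    def step(self, selected_action):
        executed_action = self.buffer.popleft()  # the k-step-old action is applied now
        self.buffer.append(selected_action)
        state, reward, done = self.env.step(executed_action)
        # augmented observation space: X = S x A^k
        return (state, tuple(self.buffer)), reward, done
```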
- About delay sensitivity:
- Trade-off between delay-unawareness and complexity for the learning algorithm.
-
"When the delay is small (here less than
0.2s
), the effect of expanding state-space on training is more severe than the model error introduced by delay-unawareness".
- Papers about action delay in
MDP
s:- Learning and planning in environments with delayed feedback (Walsh, Nouri, Li & Littman, 2008) ->
model-based
predictions. - At human speed: Deep reinforcement learning with action delay (Firoiu, Ju, & Tenenbaum, 2018) ->
model-based
predictions. The delayed system was reformulated as an augmented MDP problem. - Real-time reinforcement learning (Ramstedt & Pal, 2019) -> solve the
1
-step delayed problem.
- Learning and planning in environments with delayed feedback (Walsh, Nouri, Li & Littman, 2008) ->
- About multi-agent
RL
:-
"The simplest way is to directly train each agent with single-agent
RL
algorithms. However, this approach will introduce the non-stationary issue." - Centralized training and decentralized execution.
- Centralized
Q-function
s: The centralized critic conditions on global information (globalstate
representation and theactions
of all agents).-
"The non-stationary problem is alleviated by centralized training since the transition distribution of the environment is stationary when knowing all agent actions."
-
- Decentralized
policy
for each agent: The decentralized actor conditions only on private observation to avoid the need for a centralized controller during execution.
- Centralized
-
"Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems"
-
[
2020
] [📝] [ 🎓UC Berkeley
] -
[
offline RL
]
Click to expand
Offline reinforcement learning algorithms are reinforcement learning algorithms that utilize previously collected data, without additional online data collection. The term ''fully off-policy '' is sometimes used to indicate that no additional online data collection is performed. Source. |
Authors: Levine, S., Kumar, A., Tucker, G., & Fu, J.
Note: This theoretical tutorial could also have been part of sections on model-based RL
and imitation learning
.
- Motivations:
- Make
RL
data-driven, i.e. forget about exploration and interactions and utilize only previously collected offline data.-
"Fully offline
RL
framework is enormous: in the same way that supervised machine learning methods have enabled data to be turned into generalizable and powerful pattern recognizers (e.g., image classifiers, speech recognition engines, etc.), offlineRL
methods equipped with powerful function approximation may enable data to be turned into generalizable and powerful decision making engines, effectively allowing anyone with a large enough dataset to turn this dataset into a policy that can optimize a desired utility criterion."
-
- Enable the use of
RL
for safety critical applications.-
"In many settings, the online interaction is impractical, either because data collection is expensive (e.g., in robotics, educational agents, or healthcare) and dangerous (e.g., in autonomous driving, or healthcare)."
- This could also remove the need for "simulation to real-world transfer" (
sim-2-real
), which is difficult: -
"If it was possible to simply train policies with previously collected data, it would likely be unnecessary in many cases to manually design high-fidelity simulators for simulation-to-real-world transfer."
-
- Make
- About the "data-driven" aspect.
- Because the formulation resembles the standard supervised learning problem statement, techniques and lessons learnt from supervised learning can be considered.
-
"Much of the amazing practical progress that we have witnessed over the past decade has arguably been driven just as much by advances in datasets as by advances in methods. In real-world applications, collecting large, diverse, representative, and well-labeled datasets is often far more important than utilizing the most advanced methods." I agree!
- Both
RobotCar
andBDD-100K
are cited as large video datasets containing thousands of hours of real-life driving activity.
- One major challenge for
offline RL
formulation:-
"The fundamental challenge with making such counterfactual queries [given data that resulted from a given set of decisions, infer the consequence of a different set of decisions] is distributional shift: while our function approximator (policy, value function, or model) might be trained under one distribution, it will be evaluated on a different distribution, due both to the change in visited states for the new policy and, more subtly, by the act of maximizing the expected return."
- This makes
offline RL
differ from Supervised learning methods which are designed around the assumption that the data is independent and identically distributed (i.i.d.
).
-
- About applications to
AD
:offline RL
is potentially a promising tool for enabling safe and effective learning in autonomous driving.-
"Model-based
RL
methods that employ constraints to keep the agent close to the training data for the model, so as to avoid out-of-distribution inputs as discussed in Section5
, can effectively provide elements of imitation learning when training on driving demonstration data."
-
- The work of (Rhinehart, McAllister & Levine, 2018)
"Deep Imitative Models for Flexible Inference, Planning, and Control"
, is mentioned:- It tries to combine the benefits of
imitation learning
(IL
) andgoal-directed planning
such asmodel-based RL
(MBRL
). -
"Indeed, with the widespread availability of high-quality demonstration data, it is likely that effective methods for offline
RL
in the field of autonomous driving will, explicitly or implicitly, combine elements ofimitation learning
andRL
."
"Deep Reinforcement Learning for Intelligent Transportation Systems: A Survey"
-
[
2020
] [📝] [ 🎓University of South Florida
] -
[
literature review
]
Click to expand
The review focuses on traffic signal control (TSC ) use-cases. Some AD applications are nonetheless shortly mentioned. Source. |
Authors: Haydari, A., & Yilmaz, Y.
"Tactical Decision-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation"
Click to expand
The main idea is to use an ensemble of neural networks with additional randomized prior functions (RPF ) to estimate the uncertainty of decisions. Each member estimates Q(s, a) in a sum f + βp . Note that the prior p nets are initialized with random parameters θˆk that are kept fixed. Source. |
Bottom: example of a situation outside of the training distribution. Before seeing the other vehicle, the policy chooses to maintain its current speed. As soon as the stopped vehicle is seen, the uncertainty cv becomes higher than the safety threshold. The agent then chooses the fallback action brake hard early enough and manages to avoid a collision. The baseline DQN agent also brakes when it approaches the stopped vehicle, but too late. Source. |
Authors: Hoel, C.-J., Wolff, K., & Laine, L.
-
Motivations:
1-
Estimate an uncertainty for each (s
,a
) pair when computing theQ(s,a)
, i.e. express some confidence about the decision inQ
-based algorithms.2-
Use this metric together with some safety criterion to detect situations that are outside of the training distribution.- Simple
DQN
can cause collisions if the confidence of the agent is not considered: -
"A fundamental problem with these methods is that no matter what situation the agents are facing, they will always output a decision, with no information on the uncertainty of the decision or if the agent has experienced anything similar during its training. If, for example, an agent that was trained for a one-way highway scenario would be deployed in a scenario with oncoming traffic, it would still output a decision, without any warning."
-
"The DQN algorithm returns a
maximum-likelihood
estimate of theQ-values
. But gives no information about the uncertainty of the estimation. Therefore collisions occur in unknown situations".
3-
And also leverage this uncertainty estimation to better train and transfer to real-world.-
"The uncertainty information could be used to guide the training to situations that the agent is currently not confident about, which could improve the sample efficiency and broaden the distribution of situations that the agent can handle."
-
If an agent is trained in a simulated environment and later deployed in real traffic, the uncertainty information could be used to detect situations that need to be added to the simulated world.
-
-
Previous works:
- Automated Speed and Lane Change Decision Making using Deep Reinforcement Learning (Hoel, Wolff, & Laine, 2018).
- Combining Planning and Deep Reinforcement Learning in Tactical Decision Making for Autonomous Driving (Hoel, Driggs-Campbell, Wolff, Laine, & Kochenderfer, 2019) analysed in my
IV19
report. - Tactical decision-making for autonomous driving: A reinforcement learning approach. Hoel's 2019 thesis.
-
How to estimate uncertainty:
1-
Statistical bootstrapping (sampling).- The risk of an
action
is represented as the variance of the return (Q(s, a)
) when taking that action. The variance is estimated using an ensemble of models: -
"An ensemble of models is trained on different subsets of the available data and the distribution that is given by the ensemble is used to approximate the uncertainty".
- Issue: "No mechanism for uncertainty that does not come from the observed data".
2-
Ensemble-RPF
.- Same idea, but a randomized untrainable prior function (
RPF
) is added to each ensemble member.- "untrainable": The
p
nets are initialized with random parametersθˆk
that are kept fixed.
- "untrainable": The
- This is inspired by the work of DeepMind: Randomized prior functions for deep reinforcement learning (Osband, Aslanides, & Cassirer, 2018).
- Efficient parallelization is required since
K
nets need to be trained instead of one.
-
About the
RPF
:- Each of the
K
ensemble members estimatesQ(s, a)
in a sumf + βp
, whereβ
balances the importance of the prior function. - Training (generate + explore): One member is sampled. It is used to generate an experience
(s, a, r, s')
that is added to each ensemble buffer with probabilityp-add
. - Training (evaluate + improve): A minibatch
M
of experiences is sampled from each ensemble buffer and the trainable network parameters of the corresponding ensemble member are updated.
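The ensemble members with randomized prior functions can be sketched as follows (a toy numpy stand-in: random two-layer nets instead of the paper's architecture, and the training loop is omitted; only `f` would be trained while `p` stays frozen):

```python
import numpy as np

def make_net(rng, n_in, n_actions, hidden=32):
    """A tiny random two-layer network returning a Q-value per action."""
    w1 = rng.standard_normal((n_in, hidden)) * 0.1
    w2 = rng.standard_normal((hidden, n_actions)) * 0.1
    return lambda s: np.tanh(s @ w1) @ w2

class RPFMember:
    """One ensemble member estimating Q(s, a) = f(s)[a] + beta * p(s)[a]."""
    def __init__(self, rng, n_in, n_actions, beta=1.0):
        self.f = make_net(rng, n_in, n_actions)  # trainable part
        self.p = make_net(rng, n_in, n_actions)  # random prior, kept fixed
        self.beta = beta

    def q_values(self, state):
        return self.f(state) + self.beta * self.p(state)
```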
-
About the "coefficient of variation"
cv
(s
,a
):- It is used to estimate the agent's uncertainty of taking different actions from a given state.
- It represents the relative standard deviation, defined as the ratio of the
standard deviation
to themean
. - It indicates how far
(s, a)
is from the training distribution.
-
At inference (during testing):
action
s with a level of uncertaintycv(s, a)
that exceeds a pre-defined threshold are prohibited.- If no
action
fulfils the criteria, a fallback actiona-safe
is used. - Otherwise, the selection is done maximizing the mean
Q-value
. - The
DQN
-baseline,SUMO
-baseline and proposedDQN
-ensemble-RPF
are tested on scenarios outside of the training distributions.-
"The
ensemble RPF
method both indicates a high uncertainty and chooses safe actions, whereas theDQN
agent causes collisions."
-
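The uncertainty-aware selection described above can be sketched as follows (my reading of the criterion; the threshold and the epsilon guard are placeholder values):

```python
import numpy as np

def select_action(q_ensemble, cv_threshold, fallback_action):
    """q_ensemble: array of shape (K, n_actions), one row per ensemble member.
    Actions whose coefficient of variation cv = std / |mean| exceeds the
    threshold are prohibited; if none remains, the fallback action is used;
    otherwise the allowed action with the highest mean Q-value is chosen."""
    mean = q_ensemble.mean(axis=0)
    std = q_ensemble.std(axis=0)
    cv = std / (np.abs(mean) + 1e-8)  # epsilon avoids division by zero
    allowed = np.flatnonzero(cv <= cv_threshold)
    if allowed.size == 0:
        return fallback_action
    return int(allowed[np.argmax(mean[allowed])])
```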
-
About the
MDP
:state
: relative positions and speeds.action
:- longitudinal: {
-4
,−1
,0
,1
}m/s2
. - lateral: {
stay in lane
,change left
,change right
}. - The fallback action
a-safe
is set tostay in lane
and apply−4 m/s2
time
:- Simulation timestep
∆t
=1s
. Ok, they want high-level decisions. Many things can happen within 1s
though! How can it react properly? - A lane change takes
4s
to complete. Once initiated, it cannot be aborted. I see that as a de-bouncing method.
reward
:v/v-max
, in[0, 1]
, encouraging the agent to overtake slow vehicles.-10
for collision / off-road (when changing lane when already on the side).done
=True
.-10
if the behaviour of the ego vehicle causes another vehicle to emergency brake, or if the ego vehicle drives closer to another vehicle than a minimum time gap.done
=False
- This reminds me of the idea of (Kamran, Lopez, Lauer, & Stiller, 2020) to add a risk estimate to the reward function to offer more frequent signals.
-1
for each lane change. To discourage unnecessary lane changes.
-
One quote about the
Q
-neural net:-
"By applying
CNN
layers and amax pooling
layer to the part of the input that describes the surrounding vehicles, the output of the network becomes independent of the ordering of the surrounding vehicles in the input vector, and the architecture allows a varying input vector size."
-
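The quoted architecture trick can be illustrated with a minimal permutation-invariant encoder (a numpy stand-in for the CNN plus max-pooling layers; weights and sizes are arbitrary):

```python
import numpy as np

def surrounding_vehicles_encoder(vehicles, w_shared):
    """Apply the same weights to every surrounding vehicle's feature vector
    (equivalent to a 1D convolution with kernel size 1), then max-pool over
    vehicles: the output is independent of their ordering and count."""
    per_vehicle = np.maximum(0.0, vehicles @ w_shared)  # shared layer + ReLU
    return per_vehicle.max(axis=0)                      # max pooling over vehicles
```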
-
One quote about hierarchy in decision-making:
-
"The decision-making task of an autonomous vehicle is commonly divided into
strategic
,tactical
, andoperational
decision-making, also callednavigation
,guidance
andstabilization
. In short,tactical
decisions refer to high level, often discrete, decisions, such as when to change lanes on a highway."
-
-
One quote about the
k-Markov
approximation:-
"Technically, the problem is a
POMDP
, since the ego vehicle cannot observe the internal state of the driver models of the surrounding vehicles. However, thePOMDP
can be approximated as anMDP
with ak-Markov
approximation, where the state consists of the lastk
observations." - Here the authors define full observability within
200 m
.
-
-
Why is it called Bayesian
RL
?- Originally used in
RL
to create efficientexploration
methods. - Working with an ensemble gives probability distribution for the
Q
function. - The
prior
is introduced, herep
. - What is the posterior? The resulting
f+βp
. - How could the likelihood be interpreted?
- Originally used in
"Risk-Aware High-level Decisions for Automated Driving at Occluded Intersections with Reinforcement Learning"
Click to expand
The state considers the topology and include information about occlusions. A risk estimate is computed for each state using manually engineered rules: is it possible for the ego car to safely stop/leave the intersection? Source. |
Centre-up: The main idea is to punish risky situations instead of only collision failures. Left: Other interesting tricks are detailed: To deal with a variable number of vehicles, to relax the Markov assumption, and to focus on the most important (closest) parts of the scene. Centre-down: Also, the intersection is described as a region (start and end ), as opposed to just a single crossing point. Source. |
Source. |
Authors: Kamran, D., Lopez, C. F., Lauer, M., & Stiller, C.
-
Motivations:
1-
Scenario: deal with occluded areas.2-
Behaviour: Look more human-like, especially for creeping, paying attention to the risk.-
"This [
RL
with sparse rewards] policy usually drives fast at occluded intersections and suddenly stops instead of having creeping behavior similar to humans at risky situations."
-
3-
Policy: Find a trade-off betweenrisky
andover-cautious
.-
"Not as overcautious as the rule-based policy and also not as risky as the collision-based
DQN
approach"
-
-
reward
: The main idea is to compromise betweenrisk
andutility
.- Sparse
reward
functions ignoringrisk
have been applied in multiple previous works:1-
Penalize collisions.2-
Reward goal reaching.3-
Give tiny penalties at each step to favour motion.
-
"We believe that such sparse reward can be improved by explicitly providing risk measurements for each (
state
,action
) pair during training. The totalreward
value for the current state will then be calculated as the weighted average ofrisk
andutility
rewards." utility
reward: about the speed of ego vehicle.risk
reward: Assuming a worst-case scenario. Two safety conditions between the ego vehicle and each related vehicle are defined.1-
Can the ego-car safely leave the intersection before the other vehicle can reach it?2-
Can the ego-car safely stop to yield before the stop line?- They are not just binary. Rather continuous, based on time computations.
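One possible reading of these two continuous safety conditions, as a worst-case time comparison (the exact formulas are not given in my notes, so the function, its parameters and the margin below are assumptions for illustration):

```python
def risk_free(d_ego_exit, v_ego, d_other, v_other, d_stop, a_brake, t_margin=1.0):
    """Worst-case check for one intersecting vehicle:
    1- can the ego safely leave the conflict zone before the other vehicle
       (assumed to drive at the lane's maximum allowed speed) can reach it, or
    2- can the ego still safely stop before the stop line?"""
    t_ego_exit = d_ego_exit / max(v_ego, 0.1)    # time for ego to clear the zone
    t_other_reach = d_other / max(v_other, 0.1)  # worst-case arrival of the other
    can_leave = t_ego_exit + t_margin < t_other_reach
    can_stop = d_stop > (v_ego ** 2) / (2.0 * a_brake)  # braking-distance check
    return can_leave or can_stop
```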
-
state
.- A vector of measurements that can be used by a rule-based or a learning-based policy.
- The map topology is strongly considered.
- As opposed to grid-based representations which, in addition, require more complex nets.
- About occlusion:
-
"For each intersecting lane
L.i
, the closest point to the conflict zone which is visible by perception sensors is identified and its distance along the lane to the conflict point will be represented asdo.i
. The maximum allowed velocity for each lane is also mapped and will be represented asvo.i
."
-
- About varying number of vehicles:
- The
5
most-important interacting vehicles are considered, based on some distance metric.
- About the discretization.
- Distances are binned in a non-linear way. The resolution is higher in close regions: Using
sqrt
(x
/d-max
) instead ofx
/d-max
.
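The non-linear encoding mentioned above is simple to verify: the square root grows faster near zero, so a metre of distance change near the ego vehicle moves the encoded value more than a metre far away.

```python
import numpy as np

def normalize_distance(x, d_max):
    """Encode a distance with sqrt(x / d_max) instead of x / d_max,
    giving higher resolution to close regions of the scene."""
    return np.sqrt(x / d_max)
```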
- About the temporal information:
5
frames are stacked. Withdelta-t
=0.5s
.- Probably to relax the
Markov
assumption. In particular, the "real" car dynamics suffer from delay (
is not immediate).
-
action
.- High-level actions define target speeds:
stop
: full stop with maximum deceleration.drive-fast
: reach5m/s
.drive-slow
: reach1m/s
.
- Hierarchical pipeline:
-
"Such high-level
action
space helps to learn decision-making instead of speed control for the automated vehicle. The trajectory planner andcontrol
module will take care of low-level control commands for the ego vehicle in order to follow these high-level actions. Therefore, high-level decisions can be updated with smaller rate (2Hz
) that improves the quality of learning and makes the policy to only focus on finding optimal behaviors instead of low level-control."
-
-
About robustness:
- Challenging scenarios are implemented to test the robustness of the different approaches:
- Dense traffic.
- Severe occlusion.
- Sensor noise.
- Short sensor range.
- Decisions must be interaction-aware (first case) and uncertainty-aware (partial observability).
-
"The rule-based policy is very conservative and fails for test scenarios with short sensor range (
40m
)."
- Challenging scenarios are implemented to test the robustness of the different approaches:
"Learning Robust Control Policies for End-to-End Autonomous Driving from Data-Driven Simulation"
-
[
2020
] [📝] [📝] [🎞️] [] [ 🎓MIT
] [ 🚗Toyota Research Institute
,NVIDIA
] -
[
sim2real
,data-driven simulation
,robustness
,VISTA
]
Click to expand
Top left: one figure is better than many words. New possible trajectories are synthesized to learn virtual agent control policies. Bottom and right: robustness analysis compared to two baselines using a model-based simulator (CARLA ): domain randomization and domain adaptation and one real-world imitation learning approach. Interventions marked as red dots. Source. |
VISTA stands for 'Virtual Image Synthesis and Transformation for Autonomy'. One main idea is to synthesize perturbations of the ego-agent's position, to learn to navigate a some worse-case scenarios before cruising down real streets. Source. |
Source. |
Authors: Amini, A., Gilitschenski, I., Phillips, J., Moseyko, J., Banerjee, R., Karaman, S., & Rus, D.
-
Motivations:
1-
Learn a policyend-to-end
in a simulator that can successfully transfer into the real-world.- It is one of the rare works that go out of the lab to actually test on a road.
2-
Achieve robustness in the transfer, recovering from complex, near crash driving scenarios.- As stated, training AD requires
robustness
,safety
andrealism
.
- As stated, training AD requires
3-
Avoid explicit supervision where ground truth human control labels are used.- Supervised learning methods such as behavioural cloning is not an option.
- Instead, a single human collected trajectory is used.
-
"Preserving photorealism of the real world allows the virtual agent to move beyond imitation learning and instead explore the space using reinforcement learning."
-
About the data-driven simulator:
-
"[for
end2end
] Simulator must address the challenges of photorealism, real-world semantic complexities, and scalable exploration of control options, while avoiding the fragility of imitation learning and preventing unsafe conditions during data collection, evaluation, and deployment." Option 0
: no simulator, i.e.open-loop
replay.- One can use e.g. supervised behavioural cloning.
- One problem: real-world datasets do not include many hazardous edge cases.
- To generate realistic images for lateral control failure cases, a
viewpoint augmentation
technique can be used, taking advantage of the large field of view.
"Augmenting learning with views from synthetic side cameras is the standard approach to increase robustness and teach the model to recover from off-center positions on the roads."
Option 1
: model-based simulator.- For instance
CARLA
. -
"While tremendous effort has been placed into making the
CARLA
environment as photorealistic as possible, a simulation gap still exists. We found that end-to-end models trained solely inCARLA
were unable to transfer to the real-world." - The authors implement two techniques with additional
data viewpoint augmentation
to increase its robustness in the real-world:Domain Randomization
.Domain Adaptation
: training the policy with both simulated and real images, i.e. sharing the latent space.
- For instance
Option 2
: data-driven simulator.VISTA
synthesizes photorealistic and semantically accurate local viewpoints using a small dataset of human collected driving trajectories.- Motion is simulated in
VISTA
and compared to the human’s estimated motion in the real world:- From a
steering curvature
andvelocity
executed duringdt
,VISTA
can generate the next observation, used foraction
selection by the agent. - The process repeats.
- For more details about scene reconstruction, this post gives a good reformulation of the paper.
-
-
About
end-to-end
policy and theRL
training:input
: raw image pixels.- No temporal stacking is performed.
-
"We considered controllers that act based on their current perception without memory or recurrence built in."
-
output
: probability distribution for the desired curvature of motion (lateral control only), hence treated as a continuous variable.-
"Note that curvature is equal to the inverse turning radius [
m−1
] and can be converted to steering angle at inference time using a bike model."
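The quoted conversion can be sketched with a kinematic bicycle model, where tan(delta) = L * kappa for wheelbase L and curvature kappa (the wheelbase value below is my assumption, not from the paper):

```python
import math

def curvature_to_steering(curvature, wheelbase=2.8):
    """Convert a path curvature (inverse turning radius, in 1/m) to a
    front-wheel steering angle (in rad) with a kinematic bicycle model."""
    return math.atan(wheelbase * curvature)
```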
-
reward
: only the feedback from interventions in simulation. i.e. very sparse reward signals.- This is as opposed to behaviour cloning where the agent has knowledge of how the human drove in that situation.
training
: Policy Gradient.- Happy to see that one can achieve such impressive results with an algorithm very far from the latest policy-gradient complex methods.
-
About robustness:
- The idea is to synthesize perturbations of the ego-agent's position.
- I.e. during training, generate new observations by applying translations and rotations to the views of the recorded trajectory.
- Robustness analysis is performed for recovery from near-crash scenarios.
-
"All models showed greater robustness to recovering from translations than rotations since rotations required significantly more aggressive control to recover with a much smaller room of error."
VISTA
successfully recovered over2×
more frequently than the next best: NVIDIA'sIMIT-AUG
, which is based on imitation learning with augmentation.
-
- Future work:
VISTA
could synthesize other dynamic obstacles in the environment such as cars and pedestrians.
"Interpretable Multi Time-scale Constraints in Model-free Deep Reinforcement Learning for Autonomous Driving"
Click to expand
Bottom-left: While learning the Q function , another quantity is estimated: JπH represents the amount of constraint violations within horizon H when following the current policy πk . It is used to define safe action sets for action selection. This is one of the ideas of the Multi Time-scale Constrained DQN proposed to solve the constrained MDP . Bottom-right: example illustrating the need for long-term predictions/considerations in constrained MDPs. Here state s6 is marked as unsafe and has to be avoided. The one-step action masking cannot guarantee optimality: at the point of decision it can only choose the path leading to s10 with a non-optimal return of +0.5 . Source. |
Authors: Kalweit, G., Huegle, M., Werling, M., & Boedecker, J.
-
Motivations: how to perform model-free
RL
while considering constraints?- Previous works are limiting:
1-
Reward shaping: creating a weighted combination of different objectives in thereward
signal.-
"We add weighted penalties for
lane changes
and for not driving on theright lane
."
-
2-
Lagrangian optimization: including constraints in the loss function.-
"The constrained
MDP
is converted into an equivalent unconstrained problem by making infeasible solutions sub-optimal."
-
3-
Safe Policy Extraction: filter out unsafe actions (action masking
).-
"Restrict the
action
space during policy extraction, masking out all actions leading to constraint violations."
-
- The first two require parameter tuning, which is difficult in multi-objective tasks (trade-off
speed
/constraint
violations). - The last one does not deal with multi-step constraints.
-
Here, I noted three main ideas:
1-
Separateconstraints
considerations from thereturn
optimization.-
"Instead of a weighted combination of the different objectives, agents are optimizing one objective while satisfying constraints on expectations of auxiliary costs."
-
2-
Define theconstraints
with different time scales.-
"Our approach combines the intuitive formulation of constraints on the short-term horizon as in
model-based
approaches with the robustness of amodel-free RL
method for the long-term optimization."
-
3-
Do not wait for policy extraction to mask actions.- Modify the
DQN
output to jointly estimateQ
-values and the constraint relatedcomfort
-values. - Instead, apply masking directly during the
Q
-update to learn the optimalQ
-function.
-
About the task:
- Highway lane changing.
-
"The discrete
action
space includes three actions:keep lane
,perform left lane change
andperform right lane change
." -
"Acceleration and maintaining safe-distance to the preceding vehicle are controlled by a low-level
SUMO
controller."
-
- There is a primary objective: drive as close as possible to a desired velocity.
- And there are multiple constraints:
safety
(change lane only if lane available and free),keep-right
andcomfort
(within the defined time span, only a maximum number of lane changes is allowed).
-
About the constrained Markov Decision Process (
CMDP
) framework:- The
MDP
is extended with twoaction
sets: 1-
Set of safeaction
s for a single-step constraint: fully known.2-
Set of expected safeaction
s of a multi-step constraintJπ
: function of policy.- For instance, limit the amount of lane changes over
Horizon=5
(10s
). - They are modelled as expectations of auxiliary costs:
- The idea is to introduce
constraint
signals for each transition (similar toreward
signals). - And estimate the cumulative sum when following a policy (similar to the
Return
).-
"Typically, long-term dependencies on constraints are represented by the expected sum of discounted or average
constraint
signals." - Here only the next
H
steps are considered (truncated estimation) and no discounting is applied (improving the interpretability of constraints).
-
- This quantity is estimated from experience.
-
"we jointly fit
Q-function
and multi-step constraint-valuesJh
in one function approximator."
-
- The set of expected safe actions for the constraint can then be defined as {
a ∈ A
|JH(st, a)
≤
threshold
}. JπH
represents the amount of constraint violations within horizonH
when following the current policyπk
.
-
About
off-policy
learning.
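The masked update (idea `3-`) can be sketched as follows — a toy illustration with assumed action names, thresholds and values, not the paper's code: the function approximator jointly outputs `Q`-values and truncated, undiscounted constraint-values `J`, and the `Q`-target maximises only over the safe action set of the next state.

```python
# Toy sketch of the Multi Time-scale Constrained DQN idea (all values assumed):
# jointly estimated Q-values and H-step constraint-values J, with masking
# applied directly during the Q-update rather than only at policy extraction.

ACTIONS = ["keep_lane", "left_change", "right_change"]
H = 5                 # multi-step constraint horizon (paper example: 5 steps = 10 s)
THRESHOLD = 1.0       # assumed bound on expected H-step constraint violations

def safe_actions(j_values, threshold=THRESHOLD):
    """Safe set: actions whose expected H-step constraint violations stay below the threshold."""
    return [a for a in ACTIONS if j_values[a] <= threshold]

def masked_q_target(reward, gamma, q_next, j_next):
    """Q-learning target where the max is restricted to the safe action set of the next state."""
    safe = safe_actions(j_next)
    if not safe:                      # fallback: no safe action -> plain max over all actions
        safe = ACTIONS
    return reward + gamma * max(q_next[a] for a in safe)

# toy next-state estimates: the lane change is attractive (high Q) but violates the constraint
q_next = {"keep_lane": 0.5, "left_change": 2.0, "right_change": 0.1}
j_next = {"keep_lane": 0.0, "left_change": 3.0, "right_change": 0.5}

target = masked_q_target(reward=1.0, gamma=0.9, q_next=q_next, j_next=j_next)
```

Note how the constraint-violating `left_change` is excluded from the bootstrap even though it has the highest `Q`-value, which is exactly what one-step action masking at extraction time cannot guarantee.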
"Review, Analyse, and Design a Comprehensive Deep Reinforcement Learning Framework"
Click to expand
The API is based on three core concepts: policy network , network configuration , and learner . Besides, the monitor is used to manage multiple learners (if multi-threading is used) and collect any data from the learners during training while factory components can offer higher abstraction. No need to reinvent the wheel: Plugins are gateways that extract learners from other libraries such as PyTorch , Tensorflow , Keras and plug them into our proposed framework. Source. |
Source. |
Source. |
Authors: Nguyen, N. D., Nguyen, T. T., Nguyen, H., & Nahavandi, S.
FruitAPI
: Yet anotherRL
framework?- Whatever! Some useful reminders about
RL
concepts and good insights about how software can be structured forRL
applications.
- Motivations:
-
"Our ultimate goal is to build an educational software platform for deep
RL
."- For instance, they plan to develop a GUI application that can configure the neural network, modify the learner, and visualize the
RL
workflow.
1-
The API should work with any deep learning libraries such asPyTorch
,Tensorflow
,Keras
, etc.-
"Instead of implementing a lot of deep
RL
algorithms, we provide a flexible way to integrate existing deep RL libraries by introducingplugins
. Plugins extract learners from other deep RL libraries and plug intoFruitAPI
."
-
2-
Support different disciplines inRL
such as multiple objectives, multiple agents, and human-agent interaction.
-
- Mentioned frameworks:
TensorForce
.-
"It is an ambitious project that targets both industrial applications and academic research. The library has the best modular architecture we have reviewed so far. However, the framework has a deep software stack (“pyramid” model) that includes many abstraction layers. This hinders novice readers from prototyping a new deep RL method."
-
OpenAI Baselines
.RLLib
. For scalableRL
. Built on top ofray
.RLLab
. Nowgarage
. A toolkit for reproducible reinforcement learning research.Keras-RL
.Chainer
.
"Learning hierarchical behavior and motion planning for autonomous driving"
-
[
2020
] [📝] [ 🎓Zhejiang University
,Arizona State University
]
Click to expand
To achieve transfer, the authors develop a sharable representation for input sensory data across simulation platforms and real-world environment. Left: static. Right: dynamic information. Source. |
Left - The behaviour layer makes a decision based on the current observation, e.g. lane change , while the motion planning layer yields the trajectory to complete this decision. More precisely, the output of the PPO actor-network is the distribution over the behaviour s. The motion planner is conditioned on this high-level decision. If you are also wondering: traffic light state and speed limit are probably represented since they are part of the road profile (static information) - surprising to limit a 3-lane highway at 30 mps? Source. |
Authors: Wang, J., Wang, Y., Zhang, D., Yang, Y., & Xiong, R.
-
Motivations:
1-
Jointly optimize both thebehaviour
andmotion
planning modules. I.e. do not treat them independently.- Methods that use
RL
to decide low-level commands such asthrottle
andsteering
often lead to weak tactical decision-making. -
"To demonstrate a similar level of driving as the conventional modular pipeline, a SINGLE
RL
network has to model theperception
,behavior
andmotion planner
altogether. Given the low dimensional control commands as supervision inIL
or handcrafted rewards inRL
, it is likely that the network only learns the route tracking with limited tactical decision-making ability."
2-
Apply learning-based methods (model-freeRL
andIL
) to derive the decision policies.3-
Address the issues of sampling complexity and sparse reward in model freeRL
.-
"Reward engineering for
RL
is challenging, especially for long-horizon tasks."
-
4-
Do not use any "explicit handcrafted representation" for traffic situation.-
"[not clear to me] We follow the idea of
HBMP
in order to learn a representation of traffic situation implicitly and improve the tactical driving in learning-based solution." -
"[is it what they mean by 'not explicit handcrafted representation'?] Note that no cyclist recognition is utilized in the whole pipeline, but only occupied map directly extracted from lidar data."
-
-
One concept: hierarchical behaviour and motion planning (
HBMP
).- The
behaviour
layer makes a decision based on the current observation, e.g.lane change
, while the motion planning
layer yields the trajectory to complete this decision.
"our idea is to introduce the hierarchical modeling of
behavior
andmotion
in the conventional modular pipeline to the learning-based solution."
-
About the size of search spaces in
hierarchical RL
:- When coupling the two
action
spaces (behaviour
andmotion
), the search complexity becomes a challenge, especially in the long-horizon driving task. TheHRL
is therefore re-formulated:-
"[solution:]
RL
is applied to solve an equivalentbehavior
learning problem whosereward
s are assigned by thecost
from themotion planner
." -
"The key insight is assigning the optimal costs from low-level motion planner, i.e. a classical sampling-based planner, as
reward
s for learning behavior." - The ablation study shows the importance of
hierarchical
modelling to reduce theaction
space.
-
-
Improving the sampling complexity and sparse reward issues:
1-
The above-mentioned Hierarchy: implement a temporal abstraction for action
s.-
"To reduce the complexity, behaviour planning is introduced to restrict the search space."
-
"As the high-level
behaviour
bk
is a tactical decision for a longer horizon thancontrol
command, eachbk
is fixed for a time interval". - The low-level policy is conditioned on
bk
:ut
=πbu
(xt
,bk
- The behaviour planner chooses a
behaviour
abk
in {speed up
,speed down
,change left
,change right
,keep lane
}. - A Behaviour-Conditioned motion planner produces a trajectory for this behaviour, implemented by a subsequent
PID
controller.- Here it is sampling-based, i.e. optimal without training.
- The
cost
includes terms aboutspeed deviation
,lateral deviation
to some reference path as well asdistances to obstacles
.
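The "key insight" above — assigning the optimal cost of the sampling-based motion planner as the reward for the high-level behaviour — might look like this. The cost weights and the toy trajectories are illustrative assumptions; only the three cost terms come from the paper.

```python
# Illustrative sketch of HBMP-style reward assignment: the behaviour-level
# reward is the negative optimal (minimum) cost over the motion planner's
# candidate trajectories. All numbers and weights are assumptions.

def trajectory_cost(traj, v_ref, w_speed=1.0, w_lat=0.5, w_obs=2.0):
    """Cost terms from the paper: speed deviation, lateral deviation, distance to obstacles."""
    speed_dev = sum(abs(v - v_ref) for (v, d_lat, d_obs) in traj)
    lat_dev = sum(abs(d_lat) for (v, d_lat, d_obs) in traj)
    obs_pen = sum(1.0 / max(d_obs, 0.1) for (v, d_lat, d_obs) in traj)
    return w_speed * speed_dev + w_lat * lat_dev + w_obs * obs_pen

def behaviour_reward(candidate_trajs, v_ref):
    """Reward for a behaviour = negative optimal cost over its sampled trajectories."""
    return -min(trajectory_cost(t, v_ref) for t in candidate_trajs)

# two toy candidate trajectories: (speed, lateral offset, obstacle distance) per step
keep_lane = [(10.0, 0.0, 20.0), (10.0, 0.0, 20.0)]
change_left = [(10.0, 1.5, 5.0), (10.0, 0.5, 8.0)]

r_keep = behaviour_reward([keep_lane], v_ref=10.0)
r_change = behaviour_reward([change_left], v_ref=10.0)
```

This gives the behaviour-level learner a dense signal at every decision, instead of a sparse terminal reward.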
-
2-
Good policy initialization.
"We initialize the
RL
policy network trained onCARLA
usingIL
-based policy (DAgger
) trained onSUMO
by presenting a sharable sensory data representation, further acceleratingRL
training". - The key ingredient for the transfer between the two simulators is the
state
representation, among other the deformed grid mapMt
.
-
-
How to deal with sparse rewards?
-
"Obviously, such
reward
design [sparse rewards] gives extremely sparse guidance only in the final stage, thus lots of trials may have almost the same rewards, causing the inefficiency ofRL
solver." - The above-mentioned Hierarchy: add
dense reward
s. - The low-level motion planner offers denser rewards than relying on the far-away
reward
oftermination state
s.
-
-
Generalization to real-world.
- Again, the sharable representation for input sensory data across simulation platforms and real-world environment is key for transfer.
- It consists of
1-
thevehicle-centric grid map
for dynamic information-
"Note that no detection and recognition is utilized to parse the sensory data but only geometric transform calculation."
-
2-
theroad profile
that contains static information:- The existence and direction of left and right lanes.
- The next traffic lights and the distance to the stop line of intersection.
- The current speed limit.
"Safe Reinforcement Learning for Autonomous Vehicles through Parallel Constrained Policy Optimization"
-
[
2020
] [📝] [ 🎓Tsinghua University
,University of Michigan
] [ 🚗Toyota
] -
[
safe RL
]
Click to expand
Risk-Actor-Critic: Similar to the critic which estimates the expectation of the value function (cumulated reward s), the risk network estimates the sum of coming risk signals, i.e. signals received when moving to unsafe states. In this case PCPO considers this risk function and bounds the expected risk within predefined hard constraints. Source. |
Note that what is wanted is approximate safety during the TRAINING phase - not just the testing one. PCPO is compared to CPO and PPO . All three methods can eventually learn a safe lane-keeping policy, however, the vehicle deviates from the lane multiple times during the learning process of the PPO . Source. |
Authors: Wen, L., Duan, J., Eben, Li, S., Xu, S., & Peng, H.
-
Motivations:
1-
Ensure the policy is safe in the learning process.-
"Existing RL algorithms are rarely applied to real vehicles for two predominant problems: (1) behaviors are unexplainable, (2) they cannot guarantee safety under new scenarios."
-
"Safety is the most basic requirement for autonomous vehicles, so a training process that only looks at
reward
, and not potential risk, is not acceptable."
-
2-
Improve the convergence speed.
-
About
safe RL
:-
"Observing safety constraints and getting high rewards are adversarial."
safe RL
could be divided into being strictly safe and approximately safe.- Two approaches:
1-
Modifying the optimization criterion.1-a.
maxi-min
: consider a policy to be optimal if it has the maximum worst-casereturn
.1-b.
risk-sensitive
: includes the notion of risk and return variance in the long termreward
maximization objective, e.g. probability of entering an error state.1-c.
constrained
: ensure that the expectation of return is subject to one or more constraints, e.g. the variance of thereturn
must not exceed a given threshold.
2-
Modifying the exploration process.2-a.
Incorporating external knowledge.- E.g. post-posed shielding: the shield monitors the agent's action and corrects them if the chosen action causes a violation.
2-b.
Risk directed exploration.- Encourage the agent to explore controllable regions of environment by introducing
risk
metric as an exploration bonus.
- Encourage the agent to explore controllable regions of environment by introducing
-
-
About Parallel Constrained Policy Optimization (
PCPO
). Three ingredients:1-
Constrained policy optimization (CPO
).-
"Besides the
reward
signalr
, the vehicle will also observe a scalar risk signal˜r
at each step." - It is designed by human experts and is usually assigned a large value when the vehicle moves into an unsafe state.
- Similar to the value function
Vπ
that estimates the discounted cumulated rewards, the risk function J˜(π) of policy π
estimates the sum of coming risks.- Similar to the
critic
net, arisk
net is introduced.
- The idea of
CPO
:-
"Bound this expected risk within predefined hard constraints."
-
-
2-
Trust-region constraint.- Monotonic improvement condition can be guaranteed only when the policy changes are not very large.
- The author go through the derivation of the surrogate objective function with importance sampling.
3-
Synchronous parallel agents learning.- Multiple agents (
4
cars in the experiment) are exploring different state spaces in parallel. - It helps to reduce the correlation and increase the coverage of all collected samples, which increases the possibility of finding feasible states - i.e. better exploration.
-
About the
MDP
formulation for the intersection-crossing problem:S
= {l1
,v1
,l2
,v2
,l3
,v3
}, wherel
denotes the distance of the vehicle to the middle point of its track, andv ∈ [6, 14] (m/s)
denotes thevelocity
."A
= {a1
,a2
,a3
}, wherea ∈ [−3, 3] (m/s2)
.-
"We adopt a stochastic policy, the output of which is the
mean
andstandard deviation
of the Gaussian distribution."
-
reward
model:- The agents receive a reward of
10
for every passing vehicle, as well as a reward of-1
for every time step and an additional reward of10
for terminal success. - The agents are given a risk of
50
when a collision occurs.
- That means "collision" is not part of the
reward
. Rather, it appears in therisk
.- Ok, but how can it be ensured to be safe during training if it has never encountered any collision and does not even know what it means (no "model")? This is not clear to me.
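A toy sketch of this twin reward/risk bookkeeping: the discount factor and the episode values are assumptions; only the `50`-on-collision risk signal and the reward terms (`+10` per passing vehicle, `-1` per step, `+10` on success) come from the paper.

```python
# Illustrative PCPO-style bookkeeping (not the paper's implementation): the
# risk return is accumulated separately from the reward return, and a policy
# update is only acceptable if the expected risk stays below a hard bound.

def discounted_sum(signals, gamma):
    """Discounted cumulated sum, computed backwards for stability."""
    total = 0.0
    for s in reversed(signals):
        total = s + gamma * total
    return total

def pcpo_update_allowed(risk_signals, gamma, risk_limit):
    """Accept the update only if the (estimated) risk return respects the constraint."""
    j_risk = discounted_sum(risk_signals, gamma)
    return j_risk <= risk_limit

# toy episode: a risk signal of 50 is emitted at the collision step (as in the paper)
rewards = [10, -1, -1, 10]       # +10 per passing vehicle, -1 per step, +10 on success
risks_safe = [0, 0, 0, 0]
risks_crash = [0, 0, 50, 0]

ret = discounted_sum(rewards, gamma=0.99)
```

Keeping "collision" out of the `reward` and inside the `risk` signal is what allows the constraint to be a hard bound rather than one more weighted penalty.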
"Deep Reinforcement Learning for Autonomous Driving: A Survey"
-
[
2020
] [📝] [ 🎓ENSTA ParisTech
,National University of Ireland
] [ 🚗Navya
,Valeo
] -
[
review
]
Click to expand
Source. |
Authors: Ravi Kiran, B., Sobh, I., Talpaert, V., Mannion, P., Sallab, A. A. Al, Yogamani, S., & Pérez, P.
- Not too much to report. A rich literature overview and some useful reminders about general
RL
concepts.- Considering the support of serious industrial companies (
Navya
,Valeo
), I was surprised not to see any section about "RL
that works in reality".
- Miscellaneous notes about decision-making methods for
AD
:- One term: "holonomic" (I often met this term without any definition)
-
"A robotic agent capable of controlling
6
-degrees of freedom (DOF
) is said to be holonomic, while an agent with fewer controllableDOF
s than its totalDOF
is said to be non-holonomic." - Cars are non-holonomic (addressed by
RRT
variants but not byA*
or Dijkstra
ones).
-
- About
optimal control
(I would have mentioned planning
rather thanmodel-based RL
):-
"
Optimal control
andRL
are intimately related, whereoptimal control
can be viewed as a model-basedRL
problem where the dynamics of the vehicle/environment are modeled by well defined differential equations."
-
- About
GAIL
to generate a policy, as an alternative toapprenticeship learning
(IRL
+RL
):-
"The theory behind
GAIL
is an equation simplification: qualitatively, ifIRL
is going from demonstrations to a cost function andRL
from a cost function to a policy, then we should altogether be able to go from demonstration to policy in a single equation while avoiding the cost function estimation."
-
- Some mentioned challenges:
- Validation and safety.
- Sample efficiency and training stability in general.
MDP
definition, e.g. space discretization or not, level of abstraction.- Simulation-reality gap (domain transfer), mentioning domain adaptation solutions:
feature
-levelpixel
-levelreal-to-sim
: adapting the real camera streams to the synthetic modality, so as to map the unfamiliar or unseen features of real images back into the simulated style.
- Reward shaping. Mentioning intrinsic rewards:
-
"In the absence of an explicit reward shaping and expert demonstrations, agents can use intrinsic rewards or intrinsic motivation."
curiosity
: the error in an agent’s ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model.
-
- Reproducibility. Multiple frameworks are mentioned:
- OpenAI Baselines.
- TF-Agents.
- Coach by Intel AI Lab.
- rlpyt by bair.berkeley.
- bsuite by DeepMind.
"Interpretable End-to-end Urban Autonomous Driving with Latent Deep Reinforcement Learning"
-
[
2020
] [📝] [ 🎓UC Berkeley
,Tsinghua University
] -
[
probabilistic graphical models
,MaxEnt RL
,mid-to-end
,CARLA
]
Click to expand
The RL problem is formulated with a probabilistic graphical model (PGM ). zt represents for the latent state, i.e. the hidden state. xt is the observation sensor inputs. at is the action. Ot denotes the optimality variable. The authors introduce a mask variable mt and learn its probability of emission conditioned on z . Source. |
The mask is a semantic representation of the scene, helpful for interpretation of the perception part (not directly applicable to understand the decision ). The learnt transition function (from z to z' conditioned on a ) can be used for prediction. Source. |
Authors: Chen, J., Li, S. E., & Tomizuka, M.
-
Motivations:
1-
Offer more interpretability to end-to-end model-freeRL
methods.2-
Reduce the sample complexity of model-freeRL
(actually not really addressed).
-
Properties of the environment:
1-
High-dimensional observations:camera
frame andlidar
point cloud.2-
Time-sequence probabilistic dynamics.3-
Partial observability.
-
One first idea: formulate the
RL
problem as a probabilistic graphical model (PGM
):- A binary random variable (
Ot
), called optimality variable, is introduced to indicate whether the agent is acting optimally at time stept
- It turns out that its conditional probability is exponentially proportional to the one-step reward:
p
(Ot=1
|zt
,at
) =exp
(r
(zt
,at
)) - The stochastic exploration strategy has therefore the form of a
Boltzmann
-like distribution, with theQ
-function acting as the negative energy.
- Then, the task is to make sure that the most probable trajectory corresponds to the trajectory from the optimal policy.
-
"
MaxEnt RL
can be interpreted as learning aPGM
."MaxEnt RL
is then used to maximize the likelihood of optimality variables in thePGM
.- More precisely, the original log-likelihood is maximized by maximizing the
ELBO
(c.f.variational methods
).
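Since the exploration strategy "has the form of a `Boltzmann`-like distribution, with the `Q`-function acting as the negative energy", here is a minimal numeric rendering of that claim. The toy `Q`-values, the action names and the temperature parameter are assumptions for illustration.

```python
# Toy Boltzmann-like exploration distribution: action probabilities are
# proportional to exp(Q / temperature). Values below are assumptions.
import math

def boltzmann_policy(q_values, temperature=1.0):
    """Action probabilities proportional to exp(Q / temperature)."""
    m = max(q_values.values())        # subtract max for numerical stability
    exps = {a: math.exp((q - m) / temperature) for a, q in q_values.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

probs = boltzmann_policy({"left": 1.0, "straight": 2.0, "right": 0.0})
```

Higher `Q` means lower "energy", hence higher probability — but every action keeps non-zero mass, which is the exploration part.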
-
About
MaxEnt RL
:- The standard
RL
is modified by adding an entropy regularization termH
(π
) =−log
(π
) to the reward. MaxEnt RL
has a stochastic policy by default, thus the policy itself includes the exploration strategy.- Intuitively, the goal is to learn a policy that acts as randomly as possible (encouraging uniform action probability) while is still aiming at succeeding at the task.
- About soft actor critic (
SAC
):- This is an example of
MaxEnt RL
algorithm: it incorporates the entropy measure of the policy into the reward to encourage exploration. - Why "soft"?
- For small values of
Q
, the approximationV
=log
(exp
(Q
)) ~=max
(Q
) is loose and the maximum is said soft, leading to an optimisticQ
-function.
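A tiny numeric check of the "soft" maximum, with made-up `Q`-values: `V` = `log` `Σ` `exp`(`Q`) always upper-bounds `max`(`Q`), and the bound is loose precisely when the `Q`-values are close to each other — which is why the resulting `Q`-function is called optimistic.

```python
# Illustrative check (assumed Q-values): the soft maximum V = log sum exp(Q)
# vs. the hard maximum max(Q). Tight for well-separated values, loose otherwise.
import math

def soft_value(qs):
    """Soft maximum used in soft-actor-critic-style value estimates."""
    m = max(qs)                       # subtract max for numerical stability
    return m + math.log(sum(math.exp(q - m) for q in qs))

qs_separated = [5.0, 0.0, -3.0]
qs_close = [0.1, 0.0, -0.1]

v_sep = soft_value(qs_separated)
v_close = soft_value(qs_close)
```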
-
One reference about the duality
RL
/PGM
andMaxEnt RL
:- "Reinforcement learning and control as probabilistic inference: Tutorial and review" by (Levine, 2018).
"In this article, we will discuss how a generalization of the
RL
or optimal control problem, which is sometimes termedMaxEnt RL
, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and variational inference in the case of stochastic dynamics." -
Another idea: add a second generated / emitted variable.
- So far, the
observation
variable is generated/emitted from thehidden state
. Here observations arecamera
andlidar
scans. - Here, a second variable is emitted ("decoded" is the term used by the authors): a semantic bird-eye mask, noted
mt
. - It contains semantic meanings of the environment in a human understandable way:
- Drivable areas and lane markings.
Waypoints
on the desired route.- Detected objects (bounding boxes) and their history.
- Ego bounding box.
- It makes me think of the image-like scene representations used in
mid-to-end
approaches. E.g.chaufferNet
. -
"At training time we need to provide the ground truth labels of the mask, but at test time, the mask can be decoded from the latent state, showing how the system is understanding the environment semantically."
-
One note:
- The authors call "
latent
variable" what is sometimes referred to ashidden state
inPGM
(e.g. inHMM
).
-
Why "joint" model learning?
- The policy is learned jointly with the other
PGM
models.1-
policy
:π
(at
|zt
) - I would have conditioned it on theobservation
variable since thehidden
state is by definition not observable.2-
inference
: of the current latent state:p
(zt+1
|x1:t+1
,a1:t
).3-
latent dynamics
:p
(zt+1
|zt
,at
). This is then used for prediction.4-
generative model
forobservation
:p
(xt
|zt
), i.e. the emission of probability from the latent space to theobservation
space.5-
generative model
formask
:p
(mt
|zt
), i.e. the generation of the semantic mask from thehidden state
, to provide interpretability.
-
Why call it "sequential"?
-
"Historical high-dimensional raw observations [
camera
andlidar
frames] are compressed into this low-dimensional latent space with a "sequential" latent environment model." - Actually I am not sure. Because they learn the transition function and are therefore able to estimate how the hidden state evolves.
-
-
One limitation:
- This intermediate representation shows how the model understands the scene.
-
"[but] it does not provide any intuition about how it makes the decisions, because the driving policy is obtained in a
model-free
way." - In this context,
model-based
RL
is deemed as a promising direction. - It reminds me the distinction between
learn to see
(controlled by the presentedmask
) andlearn to decide
.
"Hierarchical Reinforcement Learning for Decision Making of Self-Driving Cars without Reliance on Labeled Driving Data"
-
[
2020
] [📝] [ 🎓Tsinghua University
] -
[
hierarchical RL
,async parallel training
,actuator command increments
]
Click to expand
Hierarchical-RL with async parallel training is applied to improve the training efficiency and the final performance. It can be seen as a temporal / action abstraction, although here both high and low levels work at the same frequency. The master policy is trained once all the sub-policies have been trained. Note that only 4 surrounding cars are considered. Source. |
Each sub-policy (manoeuvre policies) has its specific state representation since not all information is relevant for each sub-task. The lateral and longitudinal actions are decoupled only for the drive-in-lane sub-policy. Noise is added to the state to form the observation used by the agent, to improve robustness. Source. |
Authors: Duan, J., Li, S. E., Cheng, B., Luo, Y., & Li, K.
-
Motivations:
1-
Perform the fullvertical
+horizontal
decision-making stack withRL
.RL
as opposed tobehavioural cloning
. Hence the title "without reliance on labeled driving data".-
[
vertical
] "Current works ofRL
only focus onlow-level
motion control layer. It is hard to solve driving tasks that require many sequential decisions or complex solutions without considering the
maneuver." - [
horizontal
] The control should be done in both thelateral
andlongitudinal
directions. - One idea could be to use a single policy, but that may be hard to train, since it should learn both
low-level
drive controls together withhigh level
strategic decisions.-
"The single-learner
RL
fails to learn suitable policies in these two cases because thestate
dimension is relatively high. Therefore,lane-change
maneuver requires more exploration and is more complicated thandriving-in-lane
maneuver, which is too hard for a single car-learner."
-
2-
Improve the training efficiency and final performance of aRL
agent on a complex task.- Here the complexity comes, among others, from the long horizon of the sequential problem.
3-
Focus on two-lane highway driving.- It would be interesting to extend the idea to urban intersections.
- Main ingredients:
action
abstraction: hierarchicalRL
.- Training multiple agents in parallel: asynchronous parallel
RL
(APRL
)
-
Architecture:
1-
master
policy.- The high-level manoeuvre selection.
2-
manoeuvre
policies.- The low-level motion control.
- Input: The
state
representation differs for each sub-policy. - Output: Actuator commands.
- Each policy contains a
steer
policy network (SP-Net
) and anaccelerate
policy network (AP-Net
). They can be coupled or not.
-
What manoeuvres?
1-
driving-in-lane
.-
"It is a combination of many behaviors including
lane keeping
,car following
andfree driving
- In this case, the lateral and longitudinal states, rewards and actions are independent: path-speed decomposition.
-
2-
left lane change
.3-
right lane change
.
-
action masking
in themaster
policy.- Not all manoeuvres are available in every
state
throughout driving.- E.g., the
left-lane-change
manoeuvre is only available in the right lane.
-
"Thus, we predesignate a list of available maneuvers via the observations at each step. Taking an unavailable maneuver is considered an error. The self-driving car should filter its maneuver choices to make sure that only legal maneuvers can be selected."
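The maneuver filtering described above can be sketched as follows. The lane indexing, the lane count and the preference scores are assumptions for illustration; the paper's example (left-lane-change only available in the right lane of a two-lane road) is what the toy check reproduces.

```python
# Illustrative maneuver masking for the master policy: unavailable maneuvers
# are filtered out before selection. Lane 0 is the leftmost lane (assumption).

MANEUVERS = ["driving_in_lane", "left_lane_change", "right_lane_change"]

def available_maneuvers(lane, n_lanes=2):
    """Legal maneuvers given the current lane index."""
    avail = ["driving_in_lane"]
    if lane > 0:
        avail.append("left_lane_change")
    if lane < n_lanes - 1:
        avail.append("right_lane_change")
    return avail

def select_maneuver(preferences, lane):
    """Pick the highest-preference maneuver among the legal ones only."""
    legal = available_maneuvers(lane)
    return max(legal, key=lambda m: preferences[m])

prefs = {"driving_in_lane": 0.2, "left_lane_change": 0.9, "right_lane_change": 0.5}
choice_left_lane = select_maneuver(prefs, lane=0)    # left change is illegal here
choice_right_lane = select_maneuver(prefs, lane=1)   # left change is legal here
```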
-
action
space for themanoeuvre
policies.-
"Previous
RL
studies on autonomous vehicle decision making usually take thefront wheel angle
andacceleration
as the policy outputs. This is unreasonable because the physical limits of the actuators are not considered." -
"To prevent large discontinuity of the control commands, the outputs of maneuver policy
SP-Net
andAP-Net
arefront wheel angle
increment andacceleration
increment respectively."
-
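The increment-based outputs could be integrated like this — all limit values below are assumptions, not the paper's: the policy emits a front wheel angle *increment* and an acceleration *increment*, which are rate-limited and then clipped to the actuators' physical limits.

```python
# Illustrative integration of SP-Net / AP-Net style increment outputs.
# All limits are assumed values; the point is the rate + absolute clipping.

MAX_STEER = 0.6        # rad, assumed absolute steering limit
MAX_ACCEL = 3.0        # m/s^2, assumed absolute acceleration limit
MAX_D_STEER = 0.05     # rad per step, assumed rate limit
MAX_D_ACCEL = 0.5      # m/s^2 per step, assumed rate limit

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def apply_increments(steer, accel, d_steer, d_accel):
    """Integrate the policy's increments, enforcing rate and absolute limits."""
    d_steer = clip(d_steer, -MAX_D_STEER, MAX_D_STEER)
    d_accel = clip(d_accel, -MAX_D_ACCEL, MAX_D_ACCEL)
    return (clip(steer + d_steer, -MAX_STEER, MAX_STEER),
            clip(accel + d_accel, -MAX_ACCEL, MAX_ACCEL))

# a huge requested jump is turned into a smooth, bounded change
steer, accel = apply_increments(steer=0.0, accel=2.8, d_steer=1.0, d_accel=1.0)
```

This is what prevents the "large discontinuity of the control commands" that direct angle/acceleration outputs would allow.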
-
Which level is trained first?
1-
Each sub-policy is first trained. Each having their specificstate
representation.2-
Then, themaster
policy is learned to choose the maneuver policy to be executed in the currentstate
.-
"The results of asynchronous parallel reinforcement learners (
APRL
) show that themaster
policy converges much more quickly than themaneuver
policies."
-
-
Robustness.
- Gaussian noise is added to some
state
variables before it is observed by the car-learner. -
"We assume that the noise of all
states
in practical application isM
times the noise used in the training, so the real sensing noise obeysU
(−M ∗ Nmax
,M ∗ Nmax
). To assess the sensitivity of theH-RL
algorithm tostate
noise, we fix the parameters of previously trainedmaster
policy andmaneuver
policies and assess these policies for differentM
. [...] The policy is less affected by noise whenM ≤ 7
."
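The noise model for this sensitivity study can be sketched as follows; the state values and `Nmax` below are illustrative, only the `U`(`−M ∗ Nmax`, `M ∗ Nmax`) form and the scaling factor `M` come from the paper.

```python
# Illustrative robustness check: the evaluation-time sensing noise is assumed
# to be M times the training noise, drawn uniformly per state variable.
import random

def observe(state, n_max, m=1, rng=random):
    """Noisy observation of each state variable, U(-m*n_max, m*n_max) per entry."""
    return [x + rng.uniform(-m * n_max, m * n_max) for x in state]

rng = random.Random(0)                 # seeded for reproducibility
state = [10.0, 0.5, -1.2]
obs_small = observe(state, n_max=0.1, m=1, rng=rng)
obs_large = observe(state, n_max=0.1, m=7, rng=rng)   # paper: robust up to M <= 7
```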
-
Nets.
32
asynchronous parallel car-learners.n-step
return withn=10
.- "Back-propagation after
n=10
forward steps by explicitly computing10-step
returns."
- "Back-propagation after
ELU
activation for hidden layers.- Only
FC
layers.
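The explicitly computed `10-step` return mentioned above might look like this — a minimal sketch with a toy reward sequence and an assumed bootstrap value:

```python
# Illustrative n-step return (n = 10 in the paper): sum the first n discounted
# rewards, then bootstrap from the value estimate n steps ahead.

def n_step_return(rewards, bootstrap_value, gamma, n=10):
    """Sum of the first n discounted rewards plus the discounted bootstrap."""
    g = 0.0
    for k, r in enumerate(rewards[:n]):
        g += (gamma ** k) * r
    g += (gamma ** min(len(rewards), n)) * bootstrap_value
    return g

# toy check with gamma = 1 for easy arithmetic: 10 * 1.0 + 5.0 = 15.0
g = n_step_return(rewards=[1.0] * 10, bootstrap_value=5.0, gamma=1.0, n=10)
```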
-
Do the two levels operate at the same frequency?
- Apparently yes: "the system frequency is
40Hz
". - How to deal then with changing objectives coming from the
master
policy?
- Apparently yes: "the system frequency is
"Risk-Aware Autonomous Driving via Reinforcement Learning"
-
[
2019
] [📝] [] [ 🎓University of Prague
] -
[
safe RL
,risk directed exploration
,constrained optimization
,behavioural cloning
,GAIL
]
Click to expand
The goal is to reduce risk during training, as opposed to offering safety guarantees for deployment. First, an initial policy is derived from safe demonstrations (policy initialization techniques with BC and GAIL ). Then, risk -aware RL is performed either by restricting the exploration (Q-learning with risk-directed exploration ) or by modifying the objective (Policy Gradient with variance constraint ). Source. |
The goal of the agent is to overtake as many cars as possible while avoiding an accident. At each step, agent can turn left or turn right , speed up or slow down or do nothing . Source. |
Visualization of the policies derived by imitation learning (BC and GAIL ) for the policy initialization. States are embedded in a lower dimension using UMAP algorithm and colored by an action which agent or teacher takes in this state , Source. |
Author: Petrova, O.
-
Main motivation:
- Reduce some notion of
risk
during TRAINING, as opposed to offeringsafety
guarantees for deployment.
-
Main ideas:
1-
Reducerisk
related to initial exploration by deriving an initial policy from safe
demonstrations.-
"Safe
RL
is impossible without prior knowledge of the environment or building an internal belief about it."
-
2-
After initialization, reducerisk
by:- Either constraining the exploration
- Or by adding a risk term to the
RL
objective.
-
About
risk
,uncertainty
andsafe RL
:- Safe
RL
=RL
where some notion ofrisk
is considered into the process of decision making. risk
metrics (not very clear to me).-
"The real-world environment, in which autonomous car is acting, is naturally stochastic. It means that decision making is always performed under
uncertainty
.Uncertainty
about the outcome of anaction
serves as a source ofrisk
." -
"The most commonly used
risk
metrics in robotics are theexpected
cost andworst-case
metrics. Theexpected
cost corresponds torisk
neutrality while theworst-case
assessment corresponds to extremerisk
aversion." -
[Policy Improvement through Safe
RL
(PI-SRL
)] "Therisk
of anaction
with respect to policyπ
is defined as the probability that thestate
sequence generated by the execution of policyπ
, terminates in an errorstate
."
-
-
The authors lists some aspects of
safe RL
:1-
Safe exploration.- During
training
. 1.1-
Some initialization is needed to safely start exploring.1.2-
Then some "teacher" must be available to guide the exploration.
2-
Robustness.- After
training
. - C.f. Distributional shift = an inability of an agent to behave robustly in a slightly different environment.
- For instance when an agent is trained in some simulator and then deployed to the real environment.
-
"The problem can be addressed by using more robust models or combinations of them. Potentially related areas include change and anomaly detection, hypothesis testing and transfer learning."
3-
risk
-sensitivity.action
s are selected based on their estimatedutilities
.- These can include some notions of
risk
.
- To reduce the chance of reaching bad states,
risk-to-go
functions can be used.- This
risk-to-go
function is based onρ
, a one-step coherent conditional risk measure, and onc
(s
,a
), a cost resulting from taking actiona
in states
- It is estimated through simulation-based approximate value iteration.
-
Phase
1
: Policy initialization.-
"Simply speaking, an agent does not know how and where to go. This problem can be mitigated using the proper
policy
initialization andsafe
space exploration." - Two approaches:
1-
Behavioural Cloning (BC
)2-
Generative Adversarial Imitation Learning (GAIL
)
- Four metrics:
collision rate
,episode return
,episode length
,return per step
. - The training set, with
24,000
(s
,a
) pairs generated by manual control, is not balanced. Actiondo nothing
is over-represented.GAIL
only uses a subset of actions in its strategy.-
[Take-away] "
GAIL
framework needs a greater amount of data in comparison toBC
, and additionally, for successful learning, the training set must coverstate
space densely."
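As a toy illustration of this initialization phase, behavioural cloning can be reduced to counting demonstrated actions per state. The inverse-frequency reweighting against the over-represented `do nothing` action is my own assumption, not the paper's method:

```python
from collections import Counter, defaultdict

def clone_policy(demos):
    """Toy behavioural cloning on (state, action) pairs.
    Per-state action counts are reweighted by the inverse global action
    frequency, to counter an over-represented action such as 'do nothing'."""
    action_freq = Counter(a for _, a in demos)      # global class imbalance
    per_state = defaultdict(Counter)
    for s, a in demos:
        per_state[s][a] += 1
    # for each state, pick the action with the highest reweighted count
    return {s: max(counts, key=lambda a: counts[a] / action_freq[a])
            for s, counts in per_state.items()}
```

With a plain majority vote, `do nothing` would win everywhere; the reweighting lets rarer corrective actions surface.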
-
-
Phase
2
: After initialization. Two approaches:1-
Q
-learning withrisk
-directed exploration.- Idea:
actions
are selected based on their expected "risk-adjusted"utility
, composed of:- The (commonly-used) action value
Q
(a
,s
). - Some function of the
return
distributionG
:ρ
(G
(s
,a
)).- E.g.
CVaRα
is the expected value ofreturn
in the worstα%
of cases: pessimism should lead to more conservativeaction
s. - Here
ρ
is the entropy.
- E.g.
- The (commonly-used) action value
-
"The high-entropy
return
indicates highuncertainty
about futurereturn
, which helps to avoidrisky
actions
. It penalizesreturn
distributions with highentropy
and comparably low expected value." - Limitation: it requires a model of environment to estimate the
n=3
-stepreturn
distributionG
(s
,a
).
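A minimal sketch of this risk-adjusted selection, with a discretized return distribution and an assumed trade-off weight `beta` (not the paper's exact formulation):

```python
import math

def entropy(probs):
    """Shannon entropy of a discretized return distribution G(s, a)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def risk_adjusted_action(q_values, return_dists, beta=0.5):
    """Select argmax_a [ Q(s,a) - beta * H(G(s,a)) ]: return distributions
    with high entropy (high uncertainty about future return) are penalized."""
    return max(q_values,
               key=lambda a: q_values[a] - beta * entropy(return_dists[a]))
```

With `beta = 0` this degenerates to the usual greedy choice; a larger `beta` makes the agent prefer slightly lower but more certain returns.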
2-
Policy Gradient withreturn variance
constraint.- Ideas:
- The
return variance
is used as a measure ofrisk
. - It is a constrained optimization problem where policy parameters are updated using an estimated gradient with a
variance
penalty.
- Take away:
-
"The results of the experiment point out that the
return variance
might not be the bestrisk
metric for this task since its minimization does not affect the collision rate." -
"The constrained algorithm prefers more passive policies. It overtakes less frequently, which in theory should lead to a safer policy, but due to the lack of experience, its policy still has the same collision rate."
-
-
Side notes: about
(Hans et al., 2008)
.- Ingredients:
- A safety function to produce some "
state
safety degree". - A
backup policy
that is able to lead the controlled system from a criticalstate
back to a safe one.
-
"It estimates the minimal reward
r-min
that would be observed when executing anaction
and afterward following thebackup policy
(min-reward
estimation)." -
"The
policy
is obtained by solving an altered Bellman optimality equation that does not maximize the expected sum ofrewards
, but the minimalreward
to come." - The backup policy is activated when the agent reaches an unknown
state
during exploration and thus is unable to choose a safeaction
.
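The altered Bellman recursion can be sketched on a toy deterministic MDP (the dict-based transition model and the optimistic initialization are assumptions of mine, kept minimal to stay self-contained):

```python
def min_reward_q(transitions, n_iter=50):
    """Value iteration that propagates the MINIMAL reward-to-come instead of
    the expected sum: Q(s,a) = min(r, max_a' Q(s',a')).
    `transitions[s][a] = (reward, next_state)`; states absent from the dict
    are terminal. Optimistic initialization lets the min-backup settle."""
    q = {s: {a: float("inf") for a in acts} for s, acts in transitions.items()}
    for _ in range(n_iter):
        for s, acts in transitions.items():
            for a, (r, s2) in acts.items():
                # minimal reward along this action, then the safest continuation
                q[s][a] = min(r, max(q[s2].values())) if s2 in transitions else r
    return q
```

The greedy policy w.r.t. this Q maximizes the worst reward to come, which is exactly the risk-averse criterion described above.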
"Worst Cases Policy Gradients"
-
[
safe RL
,distributional RL
,risk-sensitive
]
Click to expand
The critic predicts a distribution of the expected value function instead of a single scalar. Both the actor and critic take risk-tolerance α as input during training, which allows the learned policy to operate with varying levels of risk after training. Lower α values improved robustness dramatically. Top-right: the uncertainty can be quantified from the variance of the return distribution. Source. |
A related work (Bernhard and Knoll 2019): Working with distributions instead of mean expectation can offer uncertainty-aware action selection. Source. |
Authors: Tang, Y. C., Zhang, J., & Salakhutdinov, R.
-
One sentence:
-
"Our policies can be adjusted dynamically after deployment to select risk-sensitive
actions
."
-
-
Motivations:
1-
Models the uncertainty of the future.2-
Learn policies with risk-tolerance parameters.-
"The learned policy can map the same state to different
actions
depending on the propensity for risk." α
=0.0 for risk-aware.α
=1.0 for risk-neutral.
-
-
Going beyond standard
RL
.-
"Standard
RL
maximizes for the expected (possibly discounted) futurereturn
. However, maximizing for averagereturn
is not sensitive to the possible risks when the futurereturn
is stochastic, due to the inherent randomness of the environment not captured by the observablestate
." -
"When the
return
distribution has high variance or is heavy-tailed, finding apolicy
which maximizes the expectation of the distribution might not be ideal: a high variancepolicy
(and therefore higher risk) that has higherreturn
in expectation is preferred over low variancepolicies
with lower expectedreturns
. Instead, we want to learn more robust policies by minimizing long-tail risks, reducing the likelihoods of bad outcomes."
-
-
Why "Worst Cases Policy Gradients" (
WCPG
)?-
"However, for risk-averse learning, it is desirable to maximize the expected worst cases performance instead of the average-case performance."
- You do not care about the average expected return. You care about what the best return will be under the worst possible scenario.
-
-
About safe RL. Two categories:
-
"The first type is based on the modification of the exploration process to avoid unsafe exploratory actions. Strategies consist of incorporating external knowledge or using risk-directed explorations."
- The second type modifies the optimality criterion used during training.
-
"Our work falls under this latter category, where we try to optimize our policy to strike a balance between pay-off and avoiding catastrophic events."
-
-
-
About conditional
Value-at-Risk
(CVaR
)-
"Working under the assumption that the future
return
is inherently stochastic, we first start by modeling its distribution. The risk of various actions can then be computed from this distribution. Specifically, we use the conditionalValue-at-Risk
as the criterion to maximize." -
"
WCPG
optimizes forCVaR
indirectly by first using distributional RL techniques to estimate the distribution ofreturn
and then computeCVaR
from this distribution."- Intuitively,
CVaR
represents the expectedreturn
should we experience the bottomα
-percentile of the possible outcomes.
- Intuitively,
- How to compute the loss for the critic?
- The scalar version with
L2
is replaced by the Wasserstein distance between the predicted and the expected distribution.
- The scalar version with
- How to model the distribution?
- The authors assume a Normal distribution. Therefore only the
mean
andvariance
are to be predicted by the critic.
-
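Under this Gaussian assumption, CVaR has a closed form. A sketch of it (my own, using only the standard normal quantile and density):

```python
import math
from statistics import NormalDist

def gaussian_cvar(mu, sigma, alpha):
    """CVaR_alpha of a Normal(mu, sigma) return distribution: the expected
    return over the worst alpha-fraction of outcomes,
    mu - sigma * phi(z_alpha) / alpha."""
    if sigma == 0.0:
        return mu                        # deterministic return
    z = NormalDist().inv_cdf(alpha)      # alpha-quantile of N(0, 1)
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return mu - sigma * phi / alpha
```

This is why predicting only the `mean` and `variance` is enough for the critic: the risk criterion follows analytically.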
"Driving Reinforcement Learning with Models"
-
[
2019
] [📝] [] [ 🎓University College Dublin
,Imperial College London
,Salerno University
] -
[
planning
,MPC
,environment model
]
Click to expand
When a model of the environment is available (e.g. when the ball comes toward the ego paddle), planning is performed. Otherwise, a model-free RL agent is used. The RL agent maximizes the expected return , while imitating the MPC when it is used. This improves the sampling efficiency, while bringing some guarantees. Source. |
Authors: Rathi, M., Ferraro, P., & Russo, G.
-
Motivations:
1-
Model-freeRL
is sample inefficient and lacks (e.g.safety
) guarantees while learning.2-
planning
can efficiently solve aMDP
. But thetransition
andreward
models must be known.3-
In many real-world systems, tasks can be broken down into a set of functionalities and, for some of these, a mathematical model might be available.-
[Observation] "In many applications, such as applications requiring physical interactions between the agent and its environment, while a full model of the environment might not be available, at least parts of the model, for e.g. a subset of the
state
space, might be known/identifiable."
-
-
Main idea:
- Use a
planning
(here aMPC
) when a mathematical model is available andQ-learning
otherwise. MPC
is not only used foraction
selection. Even ifMPC
is used,RL
makes a prediction and is penalized if predictions differ. This accelerates the learning phase of theRL
part by:1-
Driving thestate
-space exploration ofQ-learning
.2-
Tuning its rewards.
- Use a
-
At first sight, it seems that the learning-agent simultaneously maintains two goals:
1-
Maximize the expected sum of discountedrewards
:RL
part.2-
Imitate theMPC
:behavioural cloning
.- If the supervised imitation is too strict, the performances degrade:
-
"Simulations show that, while the
MPC
component is important for enhancing the agent’s defence, too much influence of this component on theQ-L
can reduce the attack performance of the agent (this, in turn, is essential in order to score higher points)."
-
-
About
MPC
:-
"At each time-step, the algorithm computes a control
action
by solving an optimization problem having as constraint the dynamics of the system being controlled. In addition to the dynamics, other system requirements (e.g.safety
orfeasibility
requirements) can also be formalized as constraints of the optimization problem."
-
-
Example:
pong
.- During the defence phase,
MPRL
used itsMPC
component since this phase is completely governed by the physics of the game and by the moves of the ego-agent. - During the attack phase,
MPRL
used itsQ-L
component.-
"Indeed, even if a mathematical model describing the evolution of the position of the ball could be devised, there is no difference equation that could predict what our opponent would do in response (as we have no control over it)."
-
- The
MPC
takes over when the paddle is not correctly posed to receive the ball, ensuring that the ego-agent does not lose.-
"Intuitively, the paddle was moved by the
MPC
component when the ball was coming towards theMPRL
paddle and, at the same time, the future vertical position of the ball (predicted via the model) was far from the actual position of the agent’s paddle."
-
- [
MPC
<RL
for attack] "WhileMPC
allows the agent to defend, it does not allow for the learning of an attack strategy to consistently obtain points."
- During the defence phase,
-
How can it be applied to decision-making for AD?
- The task of driving is a partially observable and partially controllable problem:
1-
When surrounded by vehicles, the ego-action
does not have full control over the state-transition
. Hence aRL
agent can be used to cope with uncertainty.2-
But when the car is driving alone, a rule-based controller suchPID
/MPC
could be used, since the ego-car dynamic is fully known (deterministic transitions).- The
planning
part would alleviate the RL
exploration, accelerating the learning process, but this does not provide any safety guarantees.
- The task of driving is a partially observable and partially controllable problem:
"End-to-end Reinforcement Learning for Autonomous Longitudinal Control Using Advantage Actor Critic with Temporal Context"
-
[
2019
] [📝] [ 🎓University of Surrey
] [ 🚗Jaguar Land Rover
] -
[
sampling efficiency
,switching actions
]
Click to expand
In the reward function, the time headway term encourages the agent to maintain a headway close to 2s , while the headway-derivative term rewards the agent for taking actions which bring it closer to the ideal headway. Source. |
Using recurrent units in the actor net leads to a smoother driving style and maintains a closer headway to the 2s target. Source. |
Authors: Kuutti, S., Bowden, R., Joshi, H., Temple, R. De, & Fallah, S.
- Motivations for "headway-keeping", i.e. longitudinal control, using model-free
RL
:1-
Address inherent sampling inefficiency.2-
Address common jerky driving behaviours, i.e. aim at smoother longitudinal trajectories.-
"Without any temporal context given, the agent has to decide the current action without any consideration for the previous actions, sometimes leading to rapid switching between the
throttle
andbrake
pedals."
-
- The task: keep a
2s
time headway from the lead vehicle inIPG CarMaker
.- The
state
consists in:ego-speed
ego-acc
delta speed to leader
time headway to leader
- Personal note: Since the
longitudinal
control of the ego-car has no influence on the speed of the leading vehicle, thetransition
function of thisMDP
should be stochastic.
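The two reward terms mentioned in the figure caption (a time-headway term and a headway-derivative term) could be sketched as follows; the shapes and weights are assumptions, not the paper's values:

```python
def headway_reward(headway, prev_headway, target=2.0, w_h=1.0, w_d=0.5):
    """Illustrative headway-keeping reward: a headway term peaking at the
    2 s target, plus a derivative term rewarding actions that move the
    agent closer to that ideal headway."""
    r_headway = -w_h * abs(headway - target)
    r_derivative = w_d * (abs(prev_headway - target) - abs(headway - target))
    return r_headway + r_derivative
```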
- One idea for sampling efficiency: "Proxy Network for Vehicle Dynamics".
- For training, the simulator was constrained to run in real time (timestep =
40ms
). At the same time, model-free methods require many samples. - One idea is to learn the "ego-dynamics", i.e. one part of the
transition
function.input
= [ego-speed
,ego-acc
,previous pedal action
,current pedal action
,road coefficient of friction
]output
=ego-speed
at the next time-step.
- The authors claim that this model (derived with supervised learning) can be used to replace the simulator when training the
RL
agent:-
"The training was completed using the proxy network in under
19
hours." - As noted above, this
proxy net
does not capture the full transition. Therefore, I do not understand how it can substitute the simulator and "self-generate" samples, except if assuming constant speed of the leading vehicle - which would boil down to sometarget-speed-tracking
task instead.- In addition, one could apply
planning
techniques instead oflearning
.
-
- One idea against jerky action switching:
- The authors add
16
LSTM
units to the actor network. - Having a recurrent cell provides a temporal context and leads to smoother predictions.
- The authors add a
- How to deal with continuous actions?
1-
The actor network estimates the action meanµ
and varianceσ
.2-
This is transformed into a Gaussian probability distribution, from which the control action is then sampled.
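These two steps can be sketched as follows (the [-1, 1] pedal range and the clipping are assumptions):

```python
import random

def sample_action(mu, sigma, low=-1.0, high=1.0):
    """Sample the continuous control action from N(mu, sigma), the
    distribution parameterized by the actor's outputs, then clip it
    to the assumed pedal range."""
    return max(low, min(high, random.gauss(mu, sigma)))
```

A small `sigma` makes the policy nearly deterministic; a large one encourages exploration.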
- About the "Sequenced" experience replay:
- The
RL
agent is trained off-policy, i.e. the experience tuples used for updating the policy were not collected from the currently-updated policy, but rather drawn from a replay buffer.
, i.e. the experience tuples used for updating the policy were not collected from the currently-updated policy, but rather drawn from a replay buffer. - Because the
LSTM
has an internal state, experience tuples <s
,a
,r
,s'
> cannot be sampled individually. -
"Since the network uses
LSTMs
for temporal context, these minibatches are sampled as sequenced trajectories of experiences."
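A minimal sketch of this sequenced sampling (the sequence length and naming are assumptions):

```python
import random

def sample_sequences(buffer, batch_size, seq_len):
    """Sample minibatches as contiguous slices of experience, so the LSTM
    internal state can be rebuilt along each sequence, instead of drawing
    independent <s, a, r, s'> tuples."""
    starts = [random.randrange(len(buffer) - seq_len + 1)
              for _ in range(batch_size)]
    return [buffer[i:i + seq_len] for i in starts]
```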
- About the "Trauma memory":
- A second set of experiences is maintained and used during training.
- This "Trauma memory" stores trajectories which lead to collisions.
- The ratio of
trauma memory
samples to experiencereplay samples
is set to1/64
.
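Mixing the two memories at the stated `1/64` ratio could look like this (the exact sampling details are assumptions):

```python
import random

def mixed_batch(replay, trauma, batch_size=64, trauma_ratio=1 / 64):
    """Minibatch mixing ordinary replay samples with 'trauma' samples
    (trajectories that ended in collision) at the given ratio."""
    n_trauma = min(len(trauma), max(1, round(batch_size * trauma_ratio)))
    batch = random.sample(replay, batch_size - n_trauma)
    batch += random.sample(trauma, n_trauma)
    return batch
```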
"Multi-lane Cruising Using Hierarchical Planning and Reinforcement Learning"
Click to expand
The high-level behavioural planner (BP ) selects a lane while the underlying motion planner (MoP ) select a corridor . Source. |
The decision-making is hierarchically divided into three levels. The first two (BP and MoP ) are learning-based. The keep-lane and switch-lane tasks are achieved using a shared MoP agent. The last module that decides of low-level commands such as throttle and steering is left rule-based since it is car-specific. Source. |
Authors: Rezaee, K., Yadmellat, P., Nosrati, M. S., Abolfathi, E. A., Elmahgiubi, M., & Luo, J.
- Motivation:
- Split the decision-making process based on different levels of abstraction. I.e. hierarchical planning.
- About the hierarchical nature of "driving" and the concept of "symbolic punctuation":
-
"Different from regular robotic problems, driving is heavily symbolically punctuated by signs and rules (e.g.
lane markings
,speed limit signs
,fire truck sirens
,turning signals
,traffic lights
) on top of what is largely a continuous control task." - In other words,
higher level
decisions on discrete state transitions should be coordinated withlower level
motion planning and control in continuous state space.
-
- About the use of learning-based methods for low-level control:
- The authors mentioned, among other, the work of
Mobileye
, whereRL
is used in the planning phase to model the vehicle’sacceleration
given results from the prediction module.-
"Given well-established controllers such as
PID
andMPC
, we believe that learning-based methods are more effective in the high and mid-level decision making (e.g.BP
andMotion Planning
) rather than low-level controllers."
-
- The authors mentioned, among other, the work of
- Another concept: "
skill
-based" planning:- It means that planning submodules are specialized for a driving sub-task (e.g.
lane keeping
,lane switching
). - The authors introduce a road abstraction: a
lane
(selected by theBP
) is divided into (5
)corridors
(selected by theMoP
).- Corridor selection is equivalent to selecting a path among a set of predefined paths.
- It means that planning submodules are specialized for a driving sub-task (e.g.
- Proposed structure:
1-
The Behavioural planner (BP
) outputs a high-level decision in {keep lane
,switch to the left lane
,switch to the right lane
}.- Does it also output a target speed? Apparently not: a separate module sets some
target speed set-point
based on theBP
desire and the physical boundaries, such as leading cars, or any interfering objects on the road. Not clear to me.
- Also, some target speed??? No, apparently a separated module sets some
2-
The Motion planners (MoP
) outputs atarget corridor
and atarget speed
.3-
A separate, non-learnt trajectory controller converts that into low-level commands (throttle
and steering
)
- About the high-level
Option
: How to definetermination
?-
"The typical solution is to assign a fixed expiration time to each option and penalize the agent if execution time is expired."
- According the authors,
BP
should not wait until its command gets executed (since any fixed lifetime forBP
commands is dangerous).-
"BP should be able to update its earlier decisions, at every time step, according to the new states."
-
-
- How to train the two modules?
-
"We design a coarse-grained reward function and avoid any fine-grained rules in our reward feedback."
- The reward function of the
MoP
depends on theBP
objective:- Reward
+1
is given to theMoP
agent if being in the middle corridor (ok, but is the target lane considered?? Not clear to me),AND
:EITHER
the speed of the ego vehicle is within a threshold of theBP
target speed,OR
the minimum front gap is within a threshold of the safe distanced
(computed based onTTC
).
- The
MoP
is first trained with randomBP
commands (target lanes are sampled every20s
). Then the BP agent is trained: it gets positive rewards for driving above a threshold while being penalized at each lane change.
agent is trained: its gets positive rewards for driving above a threshold while being penalized at each lane change. - Main limitation:
- Training is done separately (both
BP
andMoP
agents cannot adapt to each other, hence the final decision might be sub-optimal).
-
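The described MoP reward could be sketched as follows; the thresholds and the middle-corridor index are assumptions, not values from the paper:

```python
def mop_reward(corridor, ego_speed, target_speed, front_gap, safe_gap,
               speed_tol=1.0, gap_tol=2.0, mid_corridor=2):
    """+1 iff the agent sits in the middle corridor AND either tracks the
    BP target speed or keeps the front gap close to the safe (TTC-based)
    distance; 0 otherwise (a coarse-grained reward, as the authors intend)."""
    speed_ok = abs(ego_speed - target_speed) <= speed_tol
    gap_ok = abs(front_gap - safe_gap) <= gap_tol
    return 1.0 if corridor == mid_corridor and (speed_ok or gap_ok) else 0.0
```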
- Another advantage: alleged easy
sim-to-real
transfer.-
"In practice, low-level policies may result in oscillatory or undesirable behaviours when deployed on real-world vehicles due to imperfect sensory inputs or unmodeled kinematic and dynamic effects."
-
"Our state-action space abstraction allows transferring of the trained models from a simulated environment with virtually no dynamics to the one with significantly more realistic dynamics without a need for retraining."
-
-
"Learning to Drive using Waypoints"
Click to expand
The communication of navigation goals at intersection is not done using high-level commands (c.f. conditional RL ), but rather by giving the PPO agent a list of predefined waypoints to follow. Source. |
Authors: Agarwal, T., Arora, H., Parhar, T., Deshpande, S., & Schneider, J.
- One idea (alternative to
conditional learning
):-
"Past approaches have used a higher-level planner that directs the agent using high-level commands on turning. Instead of this, we propose to use trajectory waypoints to guide navigation, which are readily available in real world autonomous vehicles."
- Using such a predefined path probably constrains applications to single-lane scenarios, and relies on an up-to-date HD map.
-
"We acknowledge that both the baseline methods use higher level navigation features and RGB images in contrast to richer low-level waypoint features and simpler semantically segmented images used in our approach."
-
- The agent policy network (
PPO
) takes two inputs:1-
The bottleneck embedding of an auto-encoder applied on semantically segmented images.2-
The comingn
waypoints to follow. They are2m
-spaced and are extracted from an offline map.- Both are independant of the realistism of the simulator it has trained on. One could therefore expect the approach to transfer well to real-world.
"Social Attention for Autonomous Decision-Making in Dense Traffic"
Click to expand
Weights in the encoding linear layers are shared between all vehicles. Each encoding contains individual features and has size dx . For each head in the stack, different linear projections (Lq , Lk , Lv ) are applied to them. The results of the projections are keys and values (plus a query for the ego-agent). Based on the similarity between the ego-query q0 and the keys , an attention matrix is built. This matrix should select the subset of vehicles that are important in the current context. It is multiplied with the concatenation of the individual value features , and then passed to a decoder where the results from all heads are combined. The outputs are the estimated q-values . Source. |
Example with a stack of two heads. Both direct their attention to incoming vehicles that are likely to collide with the ego-vehicle. Visualization of the attention matrix : the ego-vehicle is connected to every vehicle by a line whose width is proportional to the corresponding attention weight. The green head only watches vehicles coming from the left, while the blue head restricts itself to vehicles in the front and right directions. Source. |
Authors: Leurent, E., & Mercat, J.
- Question: About the
MDP
state
(or representation of driving scene): how surrounding vehicles can be represented? - Motivations / requirements:
1-
Deal with a varying number of surrounding vehicles (problematic with function approximation which often expects constant-sized inputs).2-
The driving policy should be permutation-invariant (invariant to the ordering chosen to describe them).3-
Stay accurate and compact.
- Current approaches:
- List of features representation (fails at
1
and2
).Zero-padding
can help for varying-size inputs.
- Spatial grid representation (suffers from accuracy-size trade-off).
- List of features representation (fails at
- One concept: "Multi-head social attention mechanism".
- The
state
may contain many types of information. But the agent should only pay attention to vehicles that are close or conflict with the planned route.-
"Out of a complex scene description, the model should be able to filter information and consider only what is relevant for decision."
-
- About "attention mechanism":
-
"The attention architecture was introduced to enable neural networks to discover interdependencies within a variable number of inputs".
- For each head, a stochastic matrix called the
attention matrix
is derived. - The visualisation of this attention matrix brings interpretability.
-
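A single-head, projection-free sketch of this ego-attention (identity projections stand in for Lq/Lk/Lv so the toy stays self-contained):

```python
import math

def ego_attention(ego, others, d_k=4):
    """The ego feature vector emits the single query; every vehicle (ego
    included) provides a key and a value. Attention weights come from a
    softmax over scaled dot products; the output is the weighted sum of
    the value vectors -- permutation-invariant and variable-size by design."""
    vehicles = [ego] + others
    logits = [sum(qi * ki for qi, ki in zip(ego, v)) / math.sqrt(d_k)
              for v in vehicles]
    m = max(logits)
    exp = [math.exp(l - m) for l in logits]
    weights = [e / sum(exp) for e in exp]   # one row of the attention matrix
    out = [sum(w * v[i] for w, v in zip(weights, vehicles))
           for i in range(len(ego))]
    return weights, out
```

Note how the function accepts any number of surrounding vehicles and returns the same output under any ordering of `others` with permuted weights, which is exactly requirements 1- and 2- above.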
"End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances"
Click to expand
An encoder is trained to predict high-level information (called affordances ). The RL agent does not use directly them as input state but rather one layer before (hence implicit affordances). This compact representation offers benefits for interpretability, and for training efficiency (lighter to save in the replay buffer). A command {follow lane , turn left /right /straight , change lane left /right } for direction at intersection and lane changes is passed to the agent via a conditional branch. Source. |
Augmentation is needed for robustness and generalization (to address the distribution mismatch - also in IL ) (left). Here, the camera is moved around the autopilot. One main finding is the benefit of using adaptive target speed in the reward function (right). Source. |
Authors: Toromanoff, M., Wirbel, E., & Moutarde, F.
-
One quote:
-
"A promising way to solve both the data efficiency (particularly for DRL) and the black box problem is to use privileged information as auxiliary losses, also coined "affordances" in some recent papers."
-
-
One idea: The
state
of theRL
agent is made of affordance features produced by a separate network.1-
An encoder is trained in a supervised way to predict high-level information, using a stack of images as input.
1-
Traffic light state (binary classification).2-
semantic segmentation.
- Affordances include:
- Semantic segmentation maps (ground-truth available directly in CARLA).
- Distance and state of the incoming traffic light.
- Whether the agent is at an intersection or not.
- Distance from the middle of the lane.
- Relative rotation to the road.
- Why "implicit" affordance?
-
"We coined this scheme as "implicit affordances" because the RL agent implicit affordances because the RL agent do not use the explicit predictions but have only access to the implicit features (i.e the features from which our initial supervised network predicts the explicit affordances)."
- Hence multiple affordances are predicted in order to help the supervised training (similar to
auxiliary learning
).
-
- This feature extractor is frozen while training the
RL
agent.
- Two main losses for this supervised phase:
2-
These predicted affordances serve as input to a model-freeRL
agent.- The dueling network of the
Rainbow- IQN Ape-X
architecture was removed (very big in size and no clear improvement). - Training is distributed:
-
"CARLA is too slow for RL and cannot generate enough data if only one instance is used."
-
- Despite being limited to discrete actions (
4
forthrottle
/brake
and9
-24
forsteering
), it shows very good results.-
"We also use a really simple yet effective trick: we can reach more fine-grained discrete actions by using a bagging of multiple [here
3
consecutive] predictions and average them."
-
- Frequency of decision is not mentioned.
- The dueling network of the
- Benefits of this architecture:
- Affordance features are far lighter than images, which enables the use of a replay buffer (off-policy).
- All the more, since the images are bigger than usual (
4
288
x288
frames are concatenated, compared to "classical" single84
x84
frames) in order to capture the state of traffic lights.
- All the more, since the images are bigger than usual (
- This decomposition also brings some interpretable feedback on how the decision was taken (affordances could also be used as input of a rule-based controller).
- This
2
-stage approach reminds me of the"learn to see"
/"learn to act"
concept of (Chen et al. 2019) in"Learning by Cheating"
.-
"As expected, these experiments prove that training a large network using only
RL
signal is hard".
-
-
About "model-free"
RL
.- As opposed to "model-based", e.g. (Pan et al. 2019) where a network is also trained to predict high-level information.
- But these affordances rather relate to the transition model, such as probability of collision or being off-road in the near futures from a sequence of observations and actions.
-
About the reward function:
- It relies on
3
components:1-
Desiredlateral position
.- To stay in the middle of the lane.
2-
Desiredrotation
.- To prevent oscillations near the center of lane.
3-
Desiredspeed
, which is adaptive.-
"When the agent arrives near a red traffic light, the desired speed goes linearly to
0
(the closest the agent is from the traffic light), and goes back to maximum allowed speed when it turns green. The same principle is used when arriving behind an obstacle, pedestrian, bicycle or vehicle." - The authors find that without this adaptation, the agent fails totally at braking for both cases of red traffic light or pedestrian crossing.
-
"Hierarchical Reinforcement Learning Method for Autonomous Vehicle Behavior Planning"
-
[
2019
] [📝] [ 🎓Carnegie Mellon
] [ 🚗General Motors
] -
[
Hierarchical RL
]
Click to expand
Driving is treated as a task with multiple sub-goals. A meta controller choose an option among STOP AT STOP-LINE , FOLLOW FRONT VEHICLE . Based on this sub-goal, a low-level controller decides the THROTTLE and BRAKE action. Source. |
The state is shared across different hierarchical layers (option and action ). An attention model extracts the relevant information for the current sub-goal . Source. |
Authors: Qiao, Z., Tyree, Z., Mudalige, P., Schneider, J., & Dolan, J. M.
- Motivations:
1-
Have an unique learning-based method (RL
) for behavioural decision-making in an environment where the agent must pursue multiple sub-goals.2-
Improve convergence speed and sampling efficiency of traditionalRL
approaches.- These are achieved with hierarchical
RL
, by reusing the different learnt policies across similar tasks.
- About the hierarchical structure: concepts of "Actions" and "Options".
1-
First, a high-level option, seen as asub-goal
, is selected by some "meta-controller", depending on the scenario:Option_1
=STOP AT STOP-LINE
.Option_2
=FOLLOW FRONT VEHICLE
.
2-
Based on the selectedsub-goal
, an action network generates low-level longitudinal "actions" aboutthrottle
andbrake
- This continues until the next sub-goal is generated by the meta-controller.
- About the
state
:- The authors decide to share one state set for the whole hierarchical structure.
- For each sub-goal, only relevant information is extracted using an attention mechanism:
-
"An attention model is applied to define the importance of each state element
I
(state
,option
) with respect to each hierarchical level and sub-goal". - For instance:
-
"When the ego car is approaching the front vehicle, the attention is mainly focused on
dfc
/dfs
." (1
-
distance_to_the_front_vehicle
/
safety_distance
) -
"When the front vehicle leaves without stopping at the stop-line, the ego car transfers more and more attentions to
ddc
/dds
during the process of approaching the stop-line." (1
-
distance_to_stop_sign
/
safety_distance
)
-
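The quoted importance terms of the form (`1` - `distance` / `safety_distance`) can be sketched as follows (the clipping to [0, 1] is an assumption):

```python
def attention_importance(distance, safety_distance):
    """Importance of a state element, e.g. 1 - dfc/dfs for the front
    vehicle or 1 - ddc/dds for the stop-line: it grows as the ego car
    gets closer to the corresponding object."""
    return max(0.0, min(1.0, 1.0 - distance / safety_distance))
```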
- About the reward function:
- The authors call that "hybrid reward mechanism":
- An "
extrinsic
meta reward" for theoption
-level. - An "
intrinsic
reward" for theaction
-level, conditioned on the selectedoption
.
- An "
- For instance, the terms related to the
target_speed
in the intrinsic reward will be adapted depending if the meta controller defines ahighway driving
or anurban driving
sub-goal. - At first sight, this reward function embeds a lot of parameters.
- One could say that the parameter-engineering effort of rule-based approaches has been transferred to reward-shaping.
- The authors call that "hybrid reward mechanism":
- One idea: "Hierarchical Prioritized Experience Replay" (HPER).
  - When updating the weights of the two networks, transitions {s, o, a, ro, ra, s'} are considered.
  - But "If the output of the option-value network o is chosen wrongly, then the success or failure of the corresponding action-value network is inconsequential to the current transition."
  - In other words, for a fair evaluation of the action, it would be nice to have the correct option.
  - One solution consists in setting priorities when sampling from the replay buffer:
    - At the option-level, priorities are based directly on the error (the TD-error of the optionNet, similar to the traditional PER).
    - At the lower level, priorities are based on the difference between the errors coming from the two levels, i.e. the difference between the TD-error of the actionNet and the TD-error of the optionNet.
- About the simulator: VIRES VTD.
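A minimal sketch of how these two priority rules could be computed; the function and argument names (td_option, td_action) are hypothetical, only the two formulas come from the summary above:

```python
def hper_priorities(td_option, td_action, eps=1e-6):
    """Toy version of the 'Hierarchical Prioritized Experience Replay' rules.

    td_option: TD-error of the option-level (meta) network for a transition.
    td_action: TD-error of the action-level network for the same transition.
    """
    # Option-level: prioritize directly on the TD-error (as in classic PER).
    p_option = abs(td_option) + eps
    # Action-level: prioritize on the difference between the two errors,
    # so an action is not blamed for a wrongly chosen option.
    p_action = abs(td_action - td_option) + eps
    return p_option, p_action
```

With td_option = 2.0 and td_action = 0.5, the action-level priority is driven by the 1.5 gap between the two errors rather than by the raw action error.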
"DeepRacer: Educational Autonomous Racing Platform for Experimentation with Sim2Real Reinforcement Learning"
Click to expand
sim2real experiment for model-free RL end-to-end (from monocular camera to low-level controls) track following using a 1/18th scale car. Source. |
Comparison of tools to perform RL sim2real applications. Source. |
Authors: Balaji, B., Mallya, S., Genc, S., Gupta, S., Dirac, L., Khare, V., Roy, G., Sun, T., Tao, Y., Townsend, B., Calleja, E., Muralidhara, S. & Karuppasamy, D.
- What?
  - "DeepRacer is an experimentation and educational platform for sim2real RL."
- About the POMDP and the algorithm:
  - Observation: 160x120 grayscale front view.
  - Action: 10 discretized values: 2 levels for throttle and 5 for steering.
  - Timestep: One action per observation. At 15 fps (average speed 1.6 m/s).
  - model-free (that is important) PPO to learn the policy.
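A toy enumeration of such a 2x5 discretized action space; the paper only gives the counts, so the concrete throttle levels and steering angles below are assumptions:

```python
from itertools import product

# 2 throttle levels x 5 steering angles = 10 discrete actions.
# The concrete values are assumptions; only the counts come from the paper.
THROTTLE_LEVELS = [0.5, 1.0]              # normalized throttle
STEERING_ANGLES = [-30, -15, 0, 15, 30]   # degrees

ACTIONS = list(product(THROTTLE_LEVELS, STEERING_ANGLES))

def action_from_index(idx):
    """Map a discrete policy output (0..9) to a (throttle, steering) pair."""
    return ACTIONS[idx]
```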
- Tools:
  - Amazon RoboMaker: An extension of the ROS framework with cloud services, to develop, test and deploy this robot software.
  - Amazon SageMaker: The Amazon cloud platform to train and deploy machine learning models at scale using the Jupyter Notebook.
  - Amazon S3 (Simple Storage Service) to save and store the neural network models.
  - [OpenAI Gym] interface between the agent and the env.
  - Intel Coach RL framework for easy experimentation.
  - Intel OpenVINO, a toolkit for quick development of vision-based applications, to convert the Tensorflow models to an optimized binary.
  - Gazebo robotics simulator.
  - ROS for the communication between the agent and the simulation.
  - ODE (Open Dynamics Engine) to simulate the laws of physics using the robot model.
  - Ogre as graphics rendering engine.
  - Redis, an in-memory database, as a buffer to store the experience tuples <obs, action, reward, next_obs>.
- About
sim2real
:- The policy is learnt (+tested) from simulations ("replica track") and then transferred (+tested) to real world.
-
"The entire process from training a policy to testing in the real car takes
< 30
minutes." - The authors mention several approaches for
sim2real
:- Mixing
sim
andreal
experiences during training:- E.g. learn features from a combination of simulation and real data.
- Mix expert demonstrations with simulations.
- Model-based dynamics transfer:
- Assess simulation bias.
- Learn model ensembles.
- Perform calibration: here, they have to match the robot model to the measured dimensions of the car.
- Domain randomization: simulation parameters are perturbed during training to gain robustness.
- Add adversarial noise.
- Observation noise, such as random colouring.
- Action noise (up to
10%
uniform random noise tosteering
andthrottle
). - Add "reverse direction of travel" each episode (I did not understand).
Privileged learning
:- E.g. learn semantic segmentation at the same time, as an auxiliary task.
- Mixing
- They note that (I thought they were using grayscale?):
  - "Random colour was the most effective method for sim2real transfer."
- About the distributed rollout mechanism, to improve the sampling efficiency:
  - "We introduce a training mechanism that decouples RL policy updates with the rollouts, which enables independent scaling of the simulation cluster."
  - As I understand, multiple simulations are run in parallel, with the same dynamics model.
  - Each worker sends its experience tuples (not the gradient) to the central Redis buffer where updates of the shared model are performed.
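A minimal in-process sketch of this decoupling: parallel workers would each push experience tuples (not gradients) to a central store (Redis in the paper); here a plain Python list stands in for that store, and all names are illustrative:

```python
import random

class CentralBuffer:
    """Toy stand-in for the central Redis experience buffer."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.storage = []

    def push(self, obs, action, reward, next_obs):
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)                  # drop the oldest tuple
        self.storage.append((obs, action, reward, next_obs))

    def sample(self, batch_size):
        """The trainer samples batches independently of the rollout rate."""
        return random.sample(self.storage, batch_size)

def rollout_worker(buffer, n_steps, worker_id):
    """Each simulation worker fills the buffer, decoupled from policy updates."""
    for t in range(n_steps):
        buffer.push(obs=(worker_id, t), action=0, reward=0.0,
                    next_obs=(worker_id, t + 1))
```

Because workers only produce tuples, the number of simulators can be scaled independently of the single training process consuming the buffer.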
"Autonomous Driving using Safe Reinforcement Learning by Incorporating a Regret-based Human Lane-Changing Decision Model"
-
[
2019
] [📝] [ 🎓Michigan State University, Clemson University
] -
[
safe RL
,action masking
,motivational decision model
,regret theory
,CARLA
]
Click to expand
Every time after the RL agent chooses an action, the supervisor checks whether this action is safe or not. It does that by predicting the trajectories of other cars using a regret decision model. Source. |
The two actions (keep lane and change lane ) have different probabilities of occurrence, and different harms (costs) (I ) that can be formulated as utilities (II ), expressed with physical variables (III ). Affordance indicators are used in the world representation. Source. |
Authors: Chen, D., Jiang, L., Wang, Y., & Li, Z.
- Main motivation for
safeRL
for two-lane highway scenarios:-
"We found that when using conventional
RL
, about14.5%
training epochs ended with collisions. [...] Collisions cause training unstable as the RL algorithm needs to reset the simulation whenever collisions happen."
-
- One idea: A "safety supervisor" evaluates the consequences of an ego-action and decides whether it is "safe" using a prediction model.
- "What is predicted?"
- The
lane-changing
decisions of other cars.
- The
- "How?"
- The approach belongs to the motivational methods, hence explainable (as opposed to the data-driven methods).
- A "regret decision model" is developed:
  - "Risks in driving have two dimensions: harm (= costs) and probabilities."
  - A probability (relating to the time-to-collision), a weighting effect function (non-linear function of the probability), a cost function, a utility (cost normalized by the collision cost) and a regret (non-linear function of the utility) are defined.
  - The (deterministic) decision is based on the resulting net advantage (~weighted regret) of option C (change) over option K (keep lane).
- "Can we draw a parallel with how human drivers reason?"
-
"Not all the drivers have experienced collisions, but all of them can perceive the threat of a potential collision. It is plausible to assume that the surrounding driver perceives the threat of the approaching vehicle to be proportional to its kinematic energy." Hence somehow proportional to
speed²
.
-
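The chain probability -> weighting -> utility -> regret -> weighted comparison can be sketched as follows; the chain structure comes from the summary above, but every concrete functional form and constant below is an assumption for illustration:

```python
import math

COLLISION_COST = 10.0  # normalizing constant for utilities (an assumption)

def collision_probability(ttc, ttc_scale=3.0):
    """Map time-to-collision (s) to a probability in (0, 1): shorter TTC -> higher risk."""
    return math.exp(-ttc / ttc_scale)

def weighting(p, gamma=0.7):
    """Non-linear probability weighting (Prelec-style form, an assumption)."""
    return math.exp(-(-math.log(p)) ** gamma)

def regret_fn(utility, alpha=2.0):
    """Non-linear function of the normalized utility."""
    return math.expm1(alpha * utility)

def weighted_regret(ttc, harm):
    utility = harm / COLLISION_COST          # cost normalized by collision cost
    return weighting(collision_probability(ttc)) * regret_fn(utility)

def decide(ttc_change, harm_change, ttc_keep, harm_keep):
    """Deterministic choice: pick the option with the smaller weighted regret."""
    if weighted_regret(ttc_change, harm_change) < weighted_regret(ttc_keep, harm_keep):
        return "C"  # change lane
    return "K"      # keep lane
```

With equal harms, a long TTC for the lane change and a short TTC for keeping the lane, the model prefers option C, matching the intuition that the riskier option carries the larger weighted regret.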
- "What does safe mean here?"
  - Short-horizon rollouts are performed.
    - 1- The ego trajectory is anticipated based on the action candidate.
    - 2- Trajectories are obtained for all other cars using the predictor and their current speeds.
  - The action is considered safe if the ego agent stays on the road and if its trajectory does not intersect (in the spatial-temporal domain) with any other trajectory.
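The spatial-temporal intersection test above can be sketched in a few lines; representing each rolled-out trajectory as one (x, y) pose per timestep and using a fixed safety radius are assumptions for illustration:

```python
import math

def is_safe(ego_traj, other_trajs, safety_radius=2.0):
    """Two trajectories 'intersect' only if the vehicles are close at the SAME
    timestep, i.e. in space and time.

    ego_traj / each entry of other_trajs: list of (x, y), index = timestep.
    """
    for other in other_trajs:
        for (ex, ey), (ox, oy) in zip(ego_traj, other):
            if math.hypot(ex - ox, ey - oy) < safety_radius:
                return False  # collision predicted in the rollout -> unsafe
    return True
```

A vehicle crossing the ego path at a different timestep would not trigger the check, which is exactly why the test is spatial-temporal and not purely spatial.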
- What if the action is deemed unsafe?
  - 1- The supervisor replaces the identified risky action with a safe option (probably based on the model).
    - This avoids terminating the episode and starting a new one during training, and it obviously improves safety during testing.
  - 2- An experience tuple <state, action, r_collision, * (no next-action)> is created and added to the experience replay buffer (together with safe tuples). This will be sampled during the update phase.
    - It reminds me of the idea of another work by the authors ("Deep Q-Learning with Dynamically-Learned Safety Module: A Case Study in Autonomous Driving" - detailed below) where two distinct buffers (safe and collision) are maintained and equally sampled.
- "What values for the model parameters?"
- Ideally, these values should be inferred for each observed car (they are parts of the hidden state). For instance, within a
POMDP
formulation, using a belief tracker. - Here, a single set of parameters is derived via
max-likelihood
based on human demonstration.
- Ideally, these values should be inferred for each observed car (they are parts of the hidden state). For instance, within a
- "What is the impact on training?"
  - It is faster and more stable (almost constant improving rate) compared to conventional RL.
- Another idea: hierarchical decision-making:
- The
RL
agent first selects "manoeuvre-like" decisions among {decelerating
,cruising
,accelerating
} and {lane change
,lane keeping
}. - Its decision is then implemented by a low-level controller (
PID
).-
"This hierarchical design greatly reduces the training time compared to methods using agents to output control signals directly."
-
- The
- One related previous work (about regret decision model):
- "A human-computer interface design for quantitative measure of regret theory" - (Jiang L. & Wang Y., 2019).
"Learning Resilient Behaviors for Navigation Under Uncertainty Environments"
-
[
2019
] [📝] [🎞️] [🎞️] [ 🎓Fuzhou University, University of Maryland, University of Hong Kong
] [ 🚗Baidu
] -
[
uncertainty estimation
,uncertainty-aware policy
,SAC
]
Click to expand
The confidence in the prediction (left) is used as an uncertainty estimate. This estimate impacts the decision (μ = mean of the steering action distribution) of the agent. Source. |
The variance of the steering action distribution (behavioural uncertainty ) is not estimated by the agent itself, but rather built by a simple mapping-function from the environmental uncertainty estimated by the prediction module. Source. |
Authors: Fan, T., Long, P., Liu, W., Pan, J., Yang, R., & Manocha, D.
- One motivation: derive an uncertainty-aware + model-free + RL policy.
  - "Uncertainty-aware":
    - The core idea is to forward the observation uncertainty to the behaviour uncertainty (i.e. the uncertainty of the action distribution), in order to boost exploration during training.
  - Model-free as opposed to model-based RL methods.
    - In model-based methods, the collision probability and uncertainty prediction can [explicitly] be formulated into the risk term for an MPC to minimize.
    - Here, the action selection is directly output by a net, based on raw laser data (i.e. not from some MPC that would require challenging parameter tuning).
  - RL as opposed to IL (where uncertainty estimation has already been applied, by recycling techniques of supervised learning).
    - Besides, in IL, it is difficult to learn policies that actively avoid uncertain regions.
- The proposed framework consists in
3
parts:1-
A prediction net:- Inputs: A sequence of laser scans and velocity data.
- Outputs:
1-1.
The motion prediction of surrounding environments (this seems to be trajectories).1-2.
Some associated uncertainty: "we expect to measure the environmental uncertainty by computing the confidence of the prediction results."
- It is trained with supervised learning with recorded trajectories. The loss function therefore discourages large uncertainty while minimizing prediction errors.
2-
A policy net:- Inputs:
2-1.
The two estimates (uncertainty
andmotion information
) of the predictor.2-2.
The currentlaser scan
, the currentvelocity
and the relativegoal position
.
- Outputs: The
mean
* of the action distribution (steering
). - It is trained with model-free
RL
, using theROS
-basedStage
simulator.
- Inputs:
- 3- A parallel NON-LEARNABLE mapping function estimates the variance of that action distribution directly from the environmental uncertainty.
  - Input: The "predicted" environmental uncertainty. (I would have called it observation uncertainty).
  - Output: The variance of the action distribution (steering).
  - About the mapping: The uncertainty predictions are weighted according to the distance of each laser point:
    - "We consider that the closer the laser point, the higher the impact on the action."
  - Again, the variance of the action distribution is not learnt!
    - It reminds me of the work of (Galias, Jakubowski, Michalewski, Osinski, & Ziecina, 2019) where best results are achieved when the policy outputs both the mean and variance.
    - Here, the policy network should learn to adaptively generate the mean value according to the variance value (capturing the environment uncertainty), e.g. exhibit more conservative behaviours in the case of high environmental uncertainty.
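A small sketch of such a non-learnable mapping; the paper only states that closer laser points weigh more, so the inverse-distance weighting and the clipping bounds below are assumptions:

```python
def action_variance(uncertainties, distances, var_min=0.05, var_max=1.0):
    """Map per-laser-point environmental uncertainties to a steering variance.

    uncertainties[i]: predicted uncertainty of laser point i.
    distances[i]: range (m) of laser point i from the ego vehicle.
    """
    weights = [1.0 / max(d, 1e-3) for d in distances]  # closer -> larger weight
    total = sum(weights)
    var = sum(w * u for w, u in zip(weights, uncertainties)) / total
    return min(max(var, var_min), var_max)             # keep the variance bounded
```

A highly uncertain nearby point then dominates the resulting variance, while the same uncertainty far away barely moves it.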
- About the
RL
method:SAC
= Soft Actor-Critic.- The above defined mapping forwards environmental uncertainties to the action variance.
- The idea is then to encourage the agent to reduce this action variance (
distribution entropy
) in order to obtain some "uncertainty-averse" behaviour. SAC
is appropriate for that:
- The idea is then to encourage the agent to reduce this action variance (
-
"The key idea of
SAC
is to maximize theexpected return
andaction entropy
together instead of the expected return itself to balance the exploration and exploitation."SAC
trains a stochastic policy with entropy regularization, and explores in an on-policy way.- My interpretation:
1-
The agent is given (non-learnable) anaction variance
from the uncertainty mapping.2-
This impacts its objective function.3-
It will therefore try to decrease this uncertainty of action distribution and by doing so will try to minimize the environmental uncertainty.4-
Hence more exploration during the training phase.
- Similar to the
ε
-greedy annealing process inDQNs
, thetemperature
parameter is decayed during training to weight between the two objectives (entropy
of policy distribution andexpected return
).SAC
incorporates theentropy
measure of the policy into the reward to encourage exploration, i.e. the agent should act as randomly as possible [encourage uniform action probability] while it is still able to succeed at the task.
- The above defined mapping forwards environmental uncertainties to the action variance.
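The entropy-regularized trade-off quoted above can be illustrated for a 1-D Gaussian steering policy; the numbers and the closed-form entropy are standard, but their use here is a toy illustration, not the paper's implementation:

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian policy (e.g. over steering)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def soft_objective(expected_return, sigma, alpha):
    """SAC-style objective: expected return + alpha * entropy.

    A large action variance is rewarded early in training (exploration);
    annealing the temperature alpha shifts the balance back to exploitation.
    """
    return expected_return + alpha * gaussian_entropy(sigma)
```

With alpha = 0 the objective reduces to the plain expected return, which is exactly the annealing endpoint described above.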
- Bonus (not directly connected to their contributions): How to model uncertainties in
DNN
s?-
"The
aleatoric
uncertainty (data uncertainty) can be modelled by a specific loss function for the uncertainty term in the network output". -
"The
epistemic
uncertainty (i.e. model uncertainty) can be captured by the Monte-Carlo Dropout (MC
-Dropout) technique" - dropout can be seen as a Bayesian approximation.
-
"Deep Q-Learning with Dynamically-Learned Safety Module: A Case Study in Autonomous Driving"
-
[
2019
] [📝] [ 🎓University of Michigan
,West Virginia University
] [ 🚗Ford
] -
[
DQN
]
Click to expand
The affordance indicator refers to the MDP state. It has length 20 and represents the spatio-temporal information of the nearest traffic vehicles. The agent controls the discrete acceleration (maintain, accelerate, brake, and hard brake) and selects its lane (keep lane, change to right, and change to left). Source. |
Two purposes: 1- Accelerates the learning process without inhibiting meaningful exploration. 2- Learn to avoid accident-prone states. Note that collisions still occur, but less often. Source. |
Authors: Baheri, A., Nageshrao, S., Kolmanovsky, I., Girard, A., Tseng, E., & Filev, D.
- Main motivation: guide the exploration process when learning a DQN-based driving policy by combining two safety modules (rule-based and learning-based):
  - A handcrafted safety module.
    - A heuristic rule ensures a minimum relative gap to a traffic vehicle.
  - A dynamically-learned safety module.
    - 1- It first predicts future states within a finite horizon.
    - 2- And then determines if one of the future states violates the handcrafted safety rule.
- One idea for DQN: Two buffers are used to store experience (off-policy learning). They are equally sampled in the update stage.
  - 1- Safe buffer: As in classic DQN.
  - 2- Collision buffer: The TD-target only consists in the observed reward (i.e. the s is a terminal state).
    - An experience tuple ends in this collision buffer:
      - 1- If the handcrafted safety module considers the action is not safe, or if the action leads to a static collision in the next observed state.
      - 2- If the dynamically-learned safety module detects a dynamic collision for any future predicted states. In that case, an additional negative term is assigned to the reward.
- Although the sampling is uniform in each buffer, the use of two buffers can be seen as some variant of prioritized experience replay (PER) where the sampling is biased to expose the agent to critical situations.
- Contrary to action masking, the "bad behaviours" are not discarded (half the batch is sampled from the collision buffer).
  - The agent must therefore generalize over the states that could lead to undesirable states.
- Relation to model-based RL:
  - The transition function of the MDP is explicitly learnt by the RNN (mapping <a, s> to s').
  - The dynamically-learned safety module incorporates a model lookahead. But prediction is not used for explicit planning.
    - Instead, it determines whether the future states lead to undesirable behaviours and consequently adapts the reward function on-the-fly.
- Related works:
  - "Autonomous Highway Driving using Deep Reinforcement Learning" - (Nageshrao, Tseng, & Filev, 2019).
    - Basically, the same approach is presented.
    - "Training a standard DDQN agent without explicit safety check could not learn a decent policy and always resulted in collision. [...] Even with continuous adaptation the mean number safety trigger never converges to zero."
    - Since the NN function approximation can potentially choose a non-safe action, the agent should be augmented with some checker that detects safety violations.
    - The authors introduce the idea of a "short horizon safety checker".
    - If the original DDQN action choice is deemed unsafe, then the safety check replaces it with an alternate "safe" action, in this case relying on an IDM controller.
    - This technique enhances learning efficiency without inhibiting meaningful exploration.
  - For additional contents on "Safe-RL in the context of autonomous vehicles", the first author @AliBaheri has written this github project.
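The two-buffer replay scheme described above can be sketched as follows; the class and method names are hypothetical, only the half-and-half sampling of a safe and a collision buffer comes from the summary:

```python
import random

class DualReplayBuffer:
    """Experiences land in a 'safe' or a 'collision' buffer; each update batch
    is drawn half-and-half, so the agent keeps seeing critical situations
    (a crude, uniform-within-buffer form of prioritized replay)."""
    def __init__(self):
        self.safe, self.collision = [], []

    def add(self, transition, is_collision):
        (self.collision if is_collision else self.safe).append(transition)

    def sample(self, batch_size):
        half = batch_size // 2
        return (random.sample(self.safe, half)
                + random.sample(self.collision, batch_size - half))
```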
"Simulation-based reinforcement learning for autonomous driving"
-
[
2019
] [📝] [🎞️] [🎞️] [ 🎓Universities of Warsaw and Jagiellonian
] [ 🚗deepsense.ai
] -
[
sim-to-real
,end-to-end
,CARLA
]
Click to expand
Neural architecture of the policy function trained with PPO : the RGB image is concatenated with its semantic segmentation. Randomisation is performed to prevent over-fitting and increase sampling-efficiency. It is also worth mentioning the high-level navigation command that is provided to guide the agent when approaching intersections. Source. |
Several options for producing the std of the steering distribution. Best results are achieved when the policy outputs both mean and std . The left screenshot illustrates that shaped rewards (as opposed to sparse rewards where rewards are only given at the goal state) can bias learning and lead to un-intended behaviours: to make the agent stay close to the centre line, the authors originally penalized the gap in X , Y but also Z coordinates. ''Due to technical reasons our list of lane-centre positions was actually placed above the road in the Z axis. This resulted in a policy that drives with two right side wheels placed on a high curb, so its elevation is increased and distance to the centre-line point above the ground is decreased''. Source-1 Source-2. |
Authors: Galias, C., Jakubowski, A., Michalewski, H., Osinski, B., & Ziecina, P.
- One goal: learn the continuous
steering
control of the car to stay on its lane (no consideration of traffic rules) in anend-to-end
fashion usingRL
.- The
throttle
is controlled by aPID
controller with constant speed target. - In addition, the simulated environment is static, without any moving cars or pedestrians.
- The authors want to test how good a policy learnt in simulation can transfer to real-world. This is sometimes called
sim-to-real
.- For this second motivation, this reminds me the impressive
Wayve
's work "Simulation Training, Real Driving".
- For this second motivation, this reminds me the impressive
- The
- How to model the standard deviation parameter of the continuous steering action distribution?
  - It could be set to a constant value or treated as an external learnable variable (detached from the policy).
  - But the authors found that letting the policy control it, as for the mean, gave the best results.
    - It allows the policy to adjust the degree of exploration on a per-observation basis.
    - "An important implementation detail was to enforce an upper boundary for the standard deviation. Without such a boundary the standard deviation would sometime explode and never go down below a certain point (the entropy of the policy climbs up), performing poorly when deployed on real-world cars."
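A minimal sketch of such a bounded policy head: the net outputs raw values for both mean and std, and the std is squashed into a fixed range so it cannot explode. The squashing functions and the bound values are assumptions:

```python
import math

STD_MIN, STD_MAX = 0.01, 0.5  # enforced boundaries (values are assumptions)

def steering_distribution(mean_raw, std_raw):
    """Map raw network outputs to (mean, std) of a Gaussian over steering."""
    mean = math.tanh(mean_raw)                   # steering command in [-1, 1]
    frac = 1.0 / (1.0 + math.exp(-std_raw))      # sigmoid in (0, 1)
    std = STD_MIN + (STD_MAX - STD_MIN) * frac   # upper (and lower) boundary
    return mean, std
```

However large the raw std output grows, the resulting std saturates at STD_MAX, which is exactly the upper boundary the quote argues for.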
- An interesting variant:
end-to-mid
, i.e. do not directly predict raw control commands.- In another work, the task was not to directly predict the
CARLA
steering
command, but rather some target waypoint on the road, and "outsource" the steering control task to some external controller. -
"Given a target waypoint, low-level steering of the driving wheel is executed in order to reach this point. In simulation, it is realized by a
PID
controller while in the case of the real car, we use a proprietary control system. The action space is discrete - potential waypoints are located every5
degrees between−30
and30
, where0
is the current orientation of the vehicle."
- In another work, the task was not to directly predict the
- Two ideas: The use of
3
sources of observations. And the inclusion of segmentation mask to thestate
(input) of theRL
net:1-
ARGB
image.- It is concatenated by its semantic segmentation: it passes through a previous-learnt segmentation network.
- This can be thought as an intermediate human-designed or learned representations of the real world.
- As said, the
seg-net
has been learnt before with supervised learning.- But it could also be (further) trained online, at the same time as the policy, leading to the promising concept of
auxiliary learning
. - This has been done for instance by (Tan, Xu, & Kong, 2018), where a framework of
RL
with an image semantic segmentation network is developed to make the whole model adaptable to reality.
- But it could also be (further) trained online, at the same time as the policy, leading to the promising concept of
2-
A high-level navigation command to guide the agent when approaching an intersection.- In {
LANE FOLLOW
,GO STRAIGHT
,TURN RIGHT
,TURN LEFT
}. - This is inspired by (Codevilla et al., 2017).
- In {
3-
The currentspeed
andacceleration
- The authors tried to also include the information about the last action.
  - Without success: the car was rapidly switching between extreme left and extreme right.
  - In other words, the steering was controlled in a pulse-width-modulation-like manner.
- One idea to promote generalisation and robustness: Randomisation.
- It should also prevent overfitting and improve sample efficiency.
- Some randomizations are applied to visual camera input.
- An additional loss term is introduced to check if the policy outputs a similar distribution for the perturbed and unperturbed images.
- Some perturbations are also applied to the car dynamics.
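One plausible form for such a consistency term, assuming Gaussian policy outputs: penalize the KL divergence between the distributions produced for the clean and the randomized image. The KL formula is standard; its use here as the exact loss is an assumption:

```python
import math

def kl_gaussian(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, std_p) || N(mu_q, std_q) ) for 1-D Gaussians."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mu_p - mu_q) ** 2) / (2 * std_q ** 2) - 0.5)

def consistency_loss(clean_out, perturbed_out):
    """clean_out / perturbed_out: (mean, std) for the original vs randomized image.
    Zero when the policy ignores the perturbation, positive otherwise."""
    return kl_gaussian(*clean_out, *perturbed_out)
```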
- One idea for debug: generate saliency map.
- The idea is to find which region of the image has the most impact in the prediction of the steering command.
- This can be done by blurring different patches of the input image, i.e. removing information from that patch, and measuring the output difference.
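The occlusion procedure above fits in a few lines; here "blurring" is simplified to zeroing a patch, and the tiny 4x4 image and pixel-reading model are toys:

```python
def saliency_map(model, image, patch=2):
    """Occlude one patch at a time and record how much the output changes.

    model: any function image -> scalar (e.g. predicted steering).
    image: 2-D list of floats.
    """
    base = model(image)
    h, w = len(image), len(image[0])
    scores = []
    for top in range(0, h, patch):
        row_scores = []
        for left in range(0, w, patch):
            occluded = [row[:] for row in image]      # deep copy of the image
            for i in range(top, min(top + patch, h)):
                for j in range(left, min(left + patch, w)):
                    occluded[i][j] = 0.0              # "remove" the information
            row_scores.append(abs(model(occluded) - base))
        scores.append(row_scores)
    return scores

# Toy example: a "model" that only looks at the top-left pixel.
model = lambda img: img[0][0]
image = [[1.0] * 4 for _ in range(4)]
```

Only the patch covering the pixel the model actually uses gets a non-zero saliency score, which is the debugging signal the authors are after.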
- Some good ideas mentioned for future works:
- To improve the driving stability, try to focus the training on fragments of scenarios with the highest uncertainty. c.f. concepts of Bayesian
RL
. - Go to
mid-to-end
, using an intermediate representation layer, for example a2D
-map or a bird's-eye view. e.g.ChauffeurNet
- also detailed on this page. - To further improve the sampling efficiency, model-based methods could be integrated. c.f. "Wayve Simulation Training, Real Driving"
- To improve the driving stability, try to focus the training on fragments of scenarios with the highest uncertainty. c.f. concepts of Bayesian
"Dynamic Interaction-Aware Scene Understanding for Reinforcement Learning in Autonomous Driving"
-
[
2019
] [📝] [ 🎓University of Freiburg
] [ 🚗BMW
] -
[
feature engineering
,graph neural networks
,interaction-aware networks
,SUMO
]
Click to expand
Some figures:
In the proposed DeepSet approach, embeddings are first created depending on the object type (using φ1 for vehicles and φ2 for lanes), forming the encoded scene . They are 'merged' only in a second stage to create a fixed vector representation. Deep Set can be extended with Graph Convolutional Networks when combining the set of node features to capture the relations - interaction - between vehicles. Source. |
Authors: Huegle, M., Kalweit, B., Werling, M., & Boedecker, J.
- Two motivations:
  - 1- Deal with an arbitrary number of objects or lanes.
    - The authors acknowledge that a fixed-size state is enough for scenarios like highway driving, where interactions with the direct neighbours of the agent are most important.
    - But they also note that a variable-length list can be very important in certain situations such as urban driving.
    - To deal with the variable-length dynamic input set X-dyn, their encodings are just summed.
      - This makes the Q-function permutation invariant w.r.t. the order of the dynamic input and independent of its size.
  - 2- Model interactions between objects in the scene representations.
    - The structure of Graph Convolutional Networks (GCN) is used for that. All node features are combined by the sum.
    - "Graph Networks are a class of neural networks that can learn functions on graphs as input and can reason about how objects in complex systems interact."
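The sum-pooling idea behind Deep Sets can be sketched without any learning machinery: a shared embedding phi per object, a sum over the set, then a post-pooling transform rho. The linear phi/rho and the 2-D features are toy assumptions:

```python
def phi(obj):
    """Shared per-object embedding (toy: fixed linear combination of features)."""
    return [0.5 * obj[0] + 0.1 * obj[1], obj[0] - obj[1]]

def rho(pooled):
    """Post-pooling transform to the final fixed-size representation."""
    return [sum(pooled), pooled[0] - pooled[1]]

def deep_set(objects):
    """Sum-pooling makes the output independent of input order and size."""
    pooled = [sum(e) for e in zip(*(phi(o) for o in objects))]
    return rho(pooled)
```

Because addition is commutative, reordering the surrounding vehicles cannot change the encoded scene, which is the permutation invariance claimed above.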
- Baselines:
-
"Graph-Q is compared to two other interaction-aware
Q
-learning algorithms, that use input modules originally proposed for supervised vehicle trajectory prediction."- Convolutional Social Pooling (
SocialCNN
) is using a grid-map: "a social tensor is created by learning latent vectors of all cars by an encoder network and projecting them to a grid map in order to learn spatial dependencies". - Vehicle Behaviour Interaction Networks (
VBIN
) imposes working with a fixed number of cars since the embedding vectors are just concatenated, i.e. not summarizing as in the Deep Sets approach.
- Convolutional Social Pooling (
- A built-in
SUMO
rule-based controller is also used for comparison.
-
- Previous works:
Dynamic input for deep reinforcement learning in autonomous driving
- detailed below.- Introducing the idea of
Deep Set
.
- Introducing the idea of
High-level Decision Making for Safe and Reasonable Autonomous Lane Changing using Reinforcement Learning
- detailed below.- How to ensure safety when working with a
DQN
? - The concept of
action masking
is applied, i.e. the technique of not exposing to the agent dangerous or non-applicable actions during the action selection.
- How to ensure safety when working with a
"Driving in Dense Traffic with Model-Free Reinforcement Learning"
-
[
2019
] [📝] [simulator
] [MPC
] [ML
] [🎞️] [ 🎓UC Berkeley
,Carnegie Mellon
] [ 🚗Honda
] -
[
SISL
,PPO
,MPC
,merging scenarios
]
Click to expand
Some figures:
Source. |
The occupancy-grid-like observation space is divided into 4 channels, each containing 3 lanes. An ego-vehicle specific feature vector is also considered. The authors use policy-gradient Proximal Policy Optimisation - PPO - method and decided not to share parameters between the actor and the critic. Source. |
In another work, the authors try to incorporate an RNN as a prediction model into an MPC controller, leading to a reliable, interpretable, and tunable framework which also contains a data-driven model that captures interactive motions between drivers. Source. |
Authors: Saxena, D. M., Bae, S., Nakhaei, A., Fujimura, K., & Likhachev, M.
- One motivation: learn to perform comfortable merge into dense traffic using model-free RL.
- Dense traffic situations are difficult: traditional rule-based models fail entirely.
- One reason is that in heavy traffic situations vehicles cannot merge into a lane without cooperating with other drivers.
- Model-free means it does not rely on driver models of other vehicles, or even on predictions about their motions. No explicit model of inter-vehicle interactions is therefore needed.
- Model-free also means that natively, safety cannot be guaranteed. Some masking mechanisms (called "overseer") are contemplated for future work.
- Dense traffic situations are difficult: traditional rule-based models fail entirely.
- One idea for merging scenarios:
-
Many other works "only accommodate a fixed merge point as opposed to the more realistic case of a finite distance for the task (lane change or merge) as in our work."
-
- One idea to adapt
IDM
to dense scenarios:-
"
IDM
is modified to include astop-and-go
behaviour that cycles between a non-zero and zero desired velocity in regular time intervals. This behaviour is intended to simulate real-world driving behaviours seen in heavy-traffic during rush-hour."
-
- One idea about action space:
-
"Learning a policy over the acceleration and steering angle of a vehicle might lead to jerky or oscillatory behaviour which is undesirable. Instead, we train our network to predict the time derivatives of these quantities, i.e. jerk
j
and steering rateδ˙
. This helps us maintain a smooth signal over the true low-level control variables." - The policy for jerk and steering rate is parameterised as Beta distributions (allegedly "this makes training more stable as the policy gradients are unbiased with respect to the finite support of
β
-distributions").
-
- One of their related works used as baseline:
- "Cooperation-Aware Lane Change Maneuver in Dense Traffic based on Model Predictive Control with Recurrent Neural Network" from (Bae et al., 2019).
-
"A
RNN
generates predictions for the motions of neighbouring vehicles based on a history of their observations. These predictions are then used to create safety constraints for anMPC
optimisation."1
- For the prediction part:- A
Social-GAN
(socially acceptable trajectories) is used to predict the other vehicles’ interactive motions that are reactive to the ego vehicle’s actions. - Other prediction modules are tested, such as a constant velocity model.
-
"When all drivers are cooperative, all three prediction models can lead to successful lane change. That is, the imprecision of predictions on drivers’ interactive motions is not critical when the drivers are very cooperative, since the drivers easily submit space to other vehicles, even with rough control inputs resulting from inaccurate motion predictions. This, however, is no longer valid if the drivers are aggressive."
- A
2
- For the MPC part:- A heuristic algorithm based on Monte Carlo simulation along with a roll-out is used to deal with the non-convexity of the problem (some constraints are non-linear).
- To reduce the search complexity during the sampling phase, the action space is adapted. For instance only steering angles to the left are considered when turning left.
-
"This
SGAN
-enabled controller out-performs the learning-based controller in the success rate, (arguably) safety as measured by minimum distances, and reliability as measured by variances of performance metrics, while taking more time to merge".
"Behavior Planning at Roundabouts"
Click to expand
Some figures:
Using recurrent units in a DQN and considering the action history. Source. |
Hidden modes framework: decomposing the non-stationary environment into multiple stationary environments, where each mode is an MDP with distinct dynamics. Source. |
Author: Khurana, A.
- One idea: using recurrent nets (hence
D
R
DQN
) to integrate history in order to cope with partial observability.- Two
LSTM
layers, (considering15
past observations) was added afterDQN
layers. - They also try to include the action history of the agent, leading to
A
DRQN
. But this does not modify the score.
- Two
- Another idea: several context-based action spaces:
- Decision in roundabouts consists in:
- Merging -
action_space
= {go
,no go
} - Traversal -
action_space
= {acc
,decelerate
,change-left
,change-right
,cv
} - Exit -
action_space
= {acc
,decelerate
,change-left
,change-right
,cv
}
- Merging -
- Justification for using discrete action space: behavioural planning happens on a slower time scale than motion planning or trajectory control.
- This reminds me of some works on hierarchical RL (e.g.
Options
framework).
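The three context-based action spaces above can be captured in a small lookup table; a minimal sketch (the structure and function name are mine, the action tokens are the ones listed above):

```python
# Context-dependent discrete action spaces for the three roundabout phases.
ACTION_SPACES = {
    "merging": ["go", "no go"],
    "traversal": ["acc", "decelerate", "change-left", "change-right", "cv"],
    "exit": ["acc", "decelerate", "change-left", "change-right", "cv"],
}

def legal_actions(phase):
    """Return the discrete action set available in the current phase."""
    return ACTION_SPACES[phase]
```

Switching the action set with the driving context is what relates this design to hierarchical RL: a higher-level module picks the phase, a lower-level policy picks within it.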
- Another idea: Curriculum learning
-
"Each model is first trained without any other interacting vehicle so that it learns the most optimal policy and later with other vehicles with random initialization. In later stages, an additional bonus reward is given to merging and traversal if they lead to successful exit to enable long-term consistent behaviours."
-
- One note about the
POMDP
formulation:-
"This also enables us to integrate planning and prediction into a single problem, as the agent learns to reason about its future."
- I am a little bit confused by their formulation of the
PO
MDP
.- I would have expected some hidden parameters to be defined and some belief on them to be tracked, as often done for AD, e.g. the intention of other participants.
- Instead, the
partial observability
refers here to the uncertainty in perception: "there is a 0.2
probability that a car present in the agent’s perception field is dropped in the observation". - This imperfect state estimation encourages robustness.
-
- One note about model-free RL:
- Using RL seems relevant to offer generalization in complex scenarios.
- But as noted by the authors: "the rule-based planner outperforms others in the case of a single-lane roundabout as there’s no scope for lane change."
- One addressed problem: "non-stationary" environments.
- A single policy learned on a specific traffic density may perform badly on another density (the dynamic of the world modelled by the
MDP
changes over time). - The goal is to generalize across different traffic scenarios, especially across different traffic densities.
- One idea is to decompose the non-stationary environment into multiple stationary environments, where each mode is an MDP with distinct dynamics: this method is called
Hidden modes
.- How to then switch between modes? The authors proposed to use external information (Google Maps could for instance tell ahead if traffic jams occur on your planned route).
- But as the number of discrete modes increases, the
hidden-mode
method suffers from oscillations at the boundaries of the mode transitions.
- Thus the second idea is to consider one single model: this method is called
Traffic-Conditioned
.- Some continuously varying feature (ratio of
velocity of other vehicles
totarget speed
) is used. It should be representative of the non-stationary environment.
- Some continuously varying feature (ratio of
- One quote about the relation of hidden-mode formulation to hierarchical RL:
-
"For generalization, the hidden-mode formulation can also be viewed as a hierarchical learning problem where one
MDP
/POMDP
framework selects the mode while the other learns the driving behaviour given the mode".
-
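A minimal sketch of the traffic-conditioned feature discussed above (the mean-ratio form below is my assumption; the paper may define the ratio differently):

```python
def traffic_condition(other_velocities, target_speed):
    """Continuous feature summarizing the non-stationary traffic:
    mean ratio of surrounding vehicles' velocities to the ego target speed.
    Appended to the state so one single policy can condition on it."""
    if not other_velocities:
        return 1.0  # assumption: treat an empty neighbourhood as free flow
    return sum(v / target_speed for v in other_velocities) / len(other_velocities)
```

Because the feature varies continuously, it avoids the oscillations that discrete hidden-mode switching suffers from at mode boundaries.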
"Reinforcement Learning Approach to Design Practical Adaptive Control for a Small-Scale Intelligent Vehicle"
-
[
2019
] [📝] [ 🎓Chongqing University
] -
[
non-deep RL
,online learning
,model-based RL
]
Click to expand
Authors: Hu, B., Li, J., Yang, J., Bai, H., Li, S., & Sun, Y.
- What I like in this paper:
- The authors make sure they understand tabular RL methods, e.g. the difference between on-policy
SARSA(1)
,SARSA(λ)
and off-policyQ-learning
, before going to deep RL.-
"Compared with
Q-learning
, which can be described as greedy, bold, and brave,SARSA(1)
is a conservative algorithm that is sensitive to control errors."
-
- They include a model-based algorithm (
Dyna-Q
) in their study. This seems promising when training directly in real world, where the sampling efficiency is crucial. - They claim RL methods bring advantages compared to PID controllers in term of adaptability (generalization - i.e. some PID parameters appropriate for driving on a straight road may cause issues in sharp turns) and burden of parameter tuning.
- They consider the sampling efficiency (better for model-based) and computational time per step (better for
1
-stepTD
methods than forSARSA(λ)
).-
"
Q-learning
algorithm has a poorer converging speed than theSARSA(λ)
andDyna-Q
algorithm, it balances the performance between the converging speed, the final control behaviour, and the computational complexity."
-
- Obviously, this remains far from real and serious AD. But this paper gives a good example of application of basic RL methods.
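For readers unfamiliar with Dyna-Q, here is a minimal tabular sketch of the idea (my own simplification, not the authors' exact setup): each real transition is used both for a direct Q-learning update and to fit a deterministic model, which is then replayed for a few planning steps - this is where the sample efficiency comes from.

```python
import random
from collections import defaultdict

class DynaQ:
    """Minimal tabular Dyna-Q sketch."""
    def __init__(self, actions, alpha=0.1, gamma=0.95, n_planning=5, seed=0):
        self.q = defaultdict(float)   # (state, action) -> value
        self.model = {}               # (state, action) -> (reward, next_state)
        self.actions = actions
        self.alpha, self.gamma, self.n = alpha, gamma, n_planning
        self.rng = random.Random(seed)

    def _update(self, s, a, r, s2):
        # Off-policy Q-learning update with a max over next actions.
        best_next = max(self.q[(s2, b)] for b in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

    def learn(self, s, a, r, s2):
        self._update(s, a, r, s2)      # 1- direct RL on the real transition
        self.model[(s, a)] = (r, s2)   # 2- model learning
        for _ in range(self.n):        # 3- planning with simulated experience
            (ps, pa), (pr, ps2) = self.rng.choice(list(self.model.items()))
            self._update(ps, pa, pr, ps2)
```

The `n_planning` knob trades computational time per step against the number of real samples needed, which is exactly the trade-off the authors study.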
"Learning When to Drive in Intersections by Combining Reinforcement Learning and Model Predictive Control"
-
[
2019
] [📝] [ 🎓Chalmers University
] [ 🚗Zenuity
] -
[
model-free RL
,MPC
,Q-Masking
,POMDP
]
Click to expand
Two figures:
Sort of feedback loop in the hierarchical structure: the MPC notifies via the reward signal if the decision is feasible, safe and comfortable. Source Left - Source Right. |
This work applies the path-velocity decomposition and focuses on the longitudinal control. Three intentions are considered: aggressive take-way , cautious (slows down without stopping), and passive give-way . Source. |
Authors: Tram, T., Batkovic, I., Ali, M., & Sjöberg, J.
- Main idea: hierarchy in learnt/optimized decision-making.
- A high-level decision module based on RL uses the feedback from the MPC controller in the reward function.
- The MPC controller is also responsible for handling the comfort of passengers in the car by generating a smooth acceleration profile.
- Previous works:
- "Learning Negotiating Behavior Between Cars in Intersections using Deep Q-Learning" - (Tram, Batkovic, Ali, & Sjöberg, 2019)
- "Autonomous Driving in Crossings using Reinforcement Learning" - (Jansson & Grönberg, 2017)
- In particular they reused the concept of "actions as Short Term Goals (
STG
)". e.g. keep set speed or yield for crossing car instead of some numerical acceleration outputs.- This allows for comfort on actuation and safety to be tuned separately, reducing the policy selection to a classification problem.
- The use of such abstracted / high-level decisions could be a first step toward hierarchical RL techniques (
macro-action
andoption
framework).
- Another contribution consists in replacing the Sliding Mode (
SM
) controller used previously by anMPC
, allegedly to "achieve safer actuation by using constraints".- The intention of all agents is implemented with a
SM
controller with various target values.
- The intention of all agents is implemented with a
- I find it valuable to have details about the training phase (not all papers do that). In particular:
- The normalization of the input features in [
-1
,1
]. - The normalization of the reward term in [
-2
,1
]. - The use of equal weights for inputs that describe the state of interchangeable objects.
- Use of a
LSTM
, as an alternative to a DQN with stacked observations. (Findings from (Jansson & Grönberg, 2017)).
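The feature normalization into [-1, 1] can be sketched as follows (per-feature bounds are assumed known; clipping out-of-range measurements is my addition):

```python
def normalize(x, lo, hi):
    """Map a raw feature from [lo, hi] to [-1, 1], clipping outliers."""
    x = min(max(x, lo), hi)
    return 2.0 * (x - lo) / (hi - lo) - 1.0
```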
- Additional notes:
- The main benefits of the combination seem to be about avoiding over-conservative behaviours while improving the "sample efficiency" of the model-free RL approach.
- Such an approach looks particularly relevant (in terms of success rate and collision-to-timeout ratio [
CTR
]) for complex scenarios, e.g.2
-crossing scenarios. - For simple cases, the performance stays close to the baseline.
- Such approach looks to be particularly relevant (in term of success rate and collision-to-timeout ratio [
- The reliance (and the burden) on an appropriate parametrisation inherent to rule-based has not disappeared and the generalisation seems limited:
- "Since MPC uses predefined models, e.g. vehicle models and other obstacle prediction models, the performance relies on their accuracy and assumptions."
- The problem is formulated as a
POMDP
.- Honestly, from a first read, I did not find how
belief tracking
is performed. Maybe something related to the internal memory state of the LSTM cells?
- Honestly, from a first read, I did not find how
- Sadly the simulator seems to be home-made, which makes reproducibility tricky.
- The main benefits of the combination seem to be about avoiding over conservative behaviours while improving the "sampling-efficient" of the model-free RL approach.
- One quote about
Q-masking
, i.e. the technique of not exposing to the agent dangerous or non-applicable actions during the action selection.-
"Q-masking helps the learning process by reducing the exploration space by disabling actions the agent does not need to explore."
- Hence the agent does not have to explore these options, while ensuring a certain level of safety (but this requires another rule-based module 😊 ).
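A minimal sketch of Q-masking at action-selection time (names are mine): the argmax is simply restricted to the subset of actions the rule-based module allows.

```python
import math

def masked_argmax(q_values, allowed):
    """Select the best action among the allowed indices only.
    Masked (dangerous or non-applicable) actions can never be picked,
    and never need to be explored."""
    best_a, best_q = None, -math.inf
    for a, q in enumerate(q_values):
        if a in allowed and q > best_q:
            best_a, best_q = a, q
    return best_a
```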
-
"Cooperation-Aware Reinforcement Learning for Merging in Dense Traffic"
-
[
2019
] [📝] [] [ 🎓Stanford
] [ 🚗Honda
] -
[
POMDP
,offline RL
,value-based RL
,interaction-aware decision making
,belief state planning
]
Click to expand
One figure:
Source. |
Authors: Bouton, M., Nakhaei, A., Fujimura, K., & Kochenderfer, M. J.
- One idea: offline belief state RL to solve dense merging scenarios modelled as a POMDP.
- A belief updater explicitly maintains a probability distribution over the driver cooperation levels of other cars.
- Another idea: projection of the merging vehicle on the main lane. This reduces the problem to one dimension, allowing for IDM. Similar to the abstraction presented by R. Regele.
- One term: "C-IDM": Cooperative Intelligent Driver Model.
- It builds on IDM to control the longitudinal acceleration of the other vehicle and adds a cooperation level
c
∈
[0
,1
]. - Non-cooperative vehicles (
c=0
) are blind to the merging ego car and follow the vanilla IDM, while cooperative vehicles will yield (c=1
).
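A sketch of one plausible C-IDM implementation (my reading, not necessarily the paper's exact rule): a non-cooperative driver (`c=0`) reacts only to its real leader, a cooperative one (`c=1`) also yields to the projected merging vehicle, and intermediate `c` interpolates between the two. All parameter values below are assumptions.

```python
import math

def idm_accel(v, v_lead, gap, v0=15.0, T=1.5, s0=2.0, a_max=1.5, b=2.0):
    """Vanilla IDM longitudinal acceleration."""
    s_star = s0 + v * T + v * (v - v_lead) / (2 * math.sqrt(a_max * b))
    return a_max * (1 - (v / v0) ** 4 - (max(s_star, 0.0) / gap) ** 2)

def c_idm_accel(c, v, v_front, gap_front, v_merge, gap_merge):
    """C-IDM sketch: blend between ignoring and yielding to the merging car."""
    a_ignore = idm_accel(v, v_front, gap_front)
    a_yield = min(a_ignore, idm_accel(v, v_merge, gap_merge))
    return (1 - c) * a_ignore + c * a_yield
```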
- Another term: "Burn-in time"
- When creating a new configuration, vehicles and their parameters are drawn from distributions and then move according to driver models.
- The idea is to let the simulation run between
10s
and 20s
before spawning the ego car, to allow the initial state to converge to a more realistic situation.
- Another term: "Curriculum Learning": The idea is to train the agent by gradually increasing the difficulty of the problem. In this case the traffic density.
- Two take-aways (similar to what I identified at IV19)
"Previous works has shown that only relying on deep RL is not sufficient to achieve safety. The deployment of those policies would require the addition of a safety mechanism."
"Using deep reinforcement learning policies to guide the search of a classical planner (MCTS) may be a promising direction."
"Interaction-aware Decision Making with Adaptive Behaviors under Merging Scenarios"
-
[
multi agent RL
,interaction-aware decision making
,curriculum learning
,action masking
]
Click to expand
One figure:
Note: the visibility of each agent is assumed to be 100m in front and back, with 0.5m /cell resolution, for both its current lane (obs_cl ) and the other lane (obs_ol ). Source. |
Authors: Hu, Y., Nakhaei, A., Tomizuka, M., & Fujimura, K.
-
One term: "IDAS": interaction-aware decision making with adaptive strategies.
- The main goal is to generate manoeuvres which are safe but less conservative than rule-based approaches such as IDM and/or FSM.
- The idea is to learn how to negotiate with other drivers, or at least consider interactions in the decision process.
-
One idea: use multi-agent RL (MARL) to consider interactions between the multiple road entities.
- In particular, the agent receives rewards for its personal objective as well as for its contribution to the team’s "success" (multi-agent credit assignment).
-
One idea: a masking mechanism prevents the agent from exploring states that violate common sense of human judgment (c.f. RSS) and increase the learning efficiency.
- This idea of adding rule-based constraints to a RL policy has been applied in many works. Recently in Wang, J. et al. for instance where prediction is also considered.
- Here, masking is not only based on safety, but also considers vehicle kinematics and traffic rules.
- A remaining question is where to apply the masking: either before the action selection (exposing only a subset of feasible actions), or after (penalizing the agent if it takes a forbidden action).
-
One quote (on the way to transfer to the real world):
"A motion control module will convert the discretized acceleration of the behaviour planner into continuous acceleration by applying algorithms like MPC at a higher frequency (100Hz)".
"Microscopic Traffic Simulation by Cooperative Multi-agent Deep Reinforcement Learning"
-
[
2019
] [📝] [ 🎓University of Parma
] [ 🚗VisLab
] -
[
multi-agent A3C
,off-policy learning
]
Click to expand
One figure:
Source. |
Authors: Bacchiani, G., Molinari, D., & Patander, M.
- One idea: "parallel actor-learners": to decorrelate the sequence of experiences used to update the policy network, a common approach is to sample <
s
,a
,r
,s'
> tuples from a memory buffer (experience replay).- Here, a multiple-agent setting is used instead: each agent acts in a different instance of the environment (hence diverse experiences) and sends its updates asynchronously to a central network.
- Another idea: "hybrid state representation": coupling some grid-like representation (
path to follow
,obstacles
,navigable space
) with a vector of explicit features (e.g.target speed
,distance to goal
,elapsed time ratio
).- This combination offers a generic scene representation (i.e. independent of the number of vehicles) while allowing for tuning explicit goals (
target speeds
) in the state. - Such hybrid representation seems popular, as identified at IV19).
- This combination offers a generic scene representation (i.e. independent of the number of vehicles) while allowing for tuning explicit goals (
- Other ideas:
- Stacking the
n=4
most recent views to capture the evolution of the scene (e.g. relative speeds). - Action repeat technique for temporal abstraction to stabilize the learning process (c.f. "frame skip").
- Stacking the
- One concept: "Aggressiveness tuning". Together with the
target speed
, theelapsed time ratio
(ETR
) feature is used to tune the aggressiveness of the car:-
"
ETR
Values close to1
will induce the agent to drive faster, in order to avoid the predicted negative return for running out of time. Values close to0
will tell the driver that it still has much time, and it is not a problem to yield to other vehicles."
-
"Dynamic Input for Deep Reinforcement Learning in Autonomous Driving"
-
[
feature engineering
,off-policy learning
,DQN
,SUMO
]
Click to expand
- One diagram is better than 100 words:
By summing all the dynamic terms (one per surrounding vehicle), the input keeps a constant size. Source. |
Authors: Huegle, M., Kalweit, G., Mirchevska, B., Werling, M., & Boedecker, J.
- One goal: find a flexible and permutation-invariant representation to deal with variable sized inputs (variable number of surrounding vehicles) in high-level decision making in autonomous lane changes, using model-free RL.
- Four representations are considered:
- Relational Grid (fixed-size vector)
- Occupancy grid (fixed-size grid processed by some
CNN
) - Set2Set-Q (some
RNN
combined with an attention mechanism to create a set representation which is permutation invariant w.r.t. the input elements) - DeepSet-Q (proposed approach)
- They are used as inputs to train and evaluate DQN agents.
- Four representations are considered:
- One finding:
- "
Deep Sets
were able to outperform CNN and recurrent attention approaches and demonstrated better generalization to unseen scenarios".
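A minimal sketch of the Deep-Sets idea (generic, not the exact DeepSet-Q architecture: `phi` and `rho` would be small neural networks there): each surrounding vehicle is embedded independently, the embeddings are sum-pooled, and the pooled vector is mapped to the final encoding. The sum makes the representation permutation-invariant and independent of the number of vehicles.

```python
def deep_set_encode(items, phi, rho, emb_dim):
    """Permutation-invariant encoding of a variable-size set of feature vectors."""
    pooled = [0.0] * emb_dim
    for x in items:
        for i, e in enumerate(phi(x)):  # embed each element independently
            pooled[i] += e              # order-independent sum pooling
    return rho(pooled)
```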
- One quote: about the use of policy-based (as opposed to value-based), on-policy (as opposed to off-policy), model-free RL (here
PPO
).
"Due to the higher demand for training and the non-trivial application to autonomous driving tasks of
on-policy
algorithms, we switched tooff-policy
Q-learning."
"Seeking for Robustness in Reinforcement Learning : Application on Carla Simulator"
Click to expand
- Some background:
n -step TD learning. Source. |
Authors: Jaâfra, Y., Laurent, J.-L., Deruyver, A., & Naceur, M. S.
- One related work: reading this paper reminded me of one conclusion of the 2019 CARLA AD Challenge:
"Robust open-source AV stacks are not a commodity yet: No public AV stack has solved the challenge yet."
- One idea: use an actor-critic architecture with multi-step returns (
n
-stepA2C
) to "achieve a better robustness".- The introduction of a critic aims at reducing the variance of the gradient of policy-based methods.
- As illustrated in the above figure, in value-based methods, the TD-target of a critic can be computed in several ways:
- With bootstrapping, using the current estimate for the next state
s'
:1
-step TD - low variance but biased estimate ... - ... Up to considering all the steps in the trajectory until termination:
Monte Carlo
- high variance but unbiased estimate. - In between are multi-step returns critics (
MSRC
). Obviously a trade-off between bias/variance.
- With bootsrapped, using the current estimate for the next state
- Some limitations:
- The
MDP
is not very detailed, making reproduction and comparison impossible.- For instance, the
action
space allegedly contains3
discrete driving instructions [steering
,throttle
, andbrake
], but not concrete number is given. - The same applies to the
state
andreward
function: no quantitative description. - Based on their text, I can assume the authors were participating to the 2019 CARLA AD challenge. Maybe track
1
. But again, information about the town/scenario is missing.
- For instance, the
- No real comparison is performed: it should be feasible to use the built-in rule-based agent present in CARLA as a baseline.
- Why not also supply a video of the resulting agent?
- The
- One additional work: Meta-Reinforcement Learning for Adaptive Autonomous Driving, (Jaafra, Luc, Aline, & Mohamed, 2019) [pdf] [poster].
- The idea is to use multi-task learning to improve generalization capabilities for an AD controller.
- As detailed, "Meta-learning refers to learn-to-learn approaches that aim at training a model on a set of different but linked tasks and subsequently generalize to new cases using few additional examples".
- In other words, the goal is to find an optimal initialization of parameters, to then quickly adapt to a new task through a few standard gradient-descent steps (few-shot generalization).
- A gradient-based meta-learner inspired from Model-Agnostic Meta-Learning (
MAML
- Finn et al., 2017) is used. - RL performance in non-stationary environments and generalisation in AD are interesting topics. But no clear benefit is demonstrated, and the above limitations apply also here.
"High-level Decision Making for Safe and Reasonable Autonomous Lane Changing using Reinforcement Learning"
-
[
2018
] [📝] [ 🎓University of Freiburg
] [ 🚗BMW
] -
[
Q-Masking
,RSS
]
Click to expand
One figure:
Source. |
Authors: Mirchevska, B., Pek, C., Werling, M., Althoff, M., & Boedecker, J.
- It relates to the
RSS
andQ-masking
principles.- The learning-based algorithm (
DQN
) is combined with arule-based
checker to ensure that only safe actions are chosen at any time. - A
Safe Free Space
is introduced.- For instance, the agent must keep a safe distance from other vehicles so that it can stop without colliding.
- The learning-based algorithm (
- What if the
Safe Free Space
is empty?-
"If the action is considered safe, it is executed; if not, we take the second-best action. If that one is also unsafe, we stay in the current lane."
-
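The fallback logic in that quote can be sketched as follows (`is_safe` stands in for the rule-based Safe Free Space check; all names are mine):

```python
def choose_action(q_values, is_safe, stay_in_lane_action):
    """Pick the best-Q action if the safety checker accepts it, else the
    second-best; if both are unsafe, fall back to staying in the lane."""
    ranked = sorted(range(len(q_values)), key=lambda a: q_values[a], reverse=True)
    for a in ranked[:2]:                 # best, then second-best
        if is_safe(a):
            return a
    return stay_in_lane_action           # last resort: keep the current lane
```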
- About the
PELOPS
simulator:- It has been developed between [ 🚗
fka
(ZF
) ] and [ 🚗BMW
]. - In future works (see above), they switch to an open source simulator:
SUMO
.
- It has been developed between [ 🚗
"Safe Reinforcement Learning on Autonomous Vehicles"
Click to expand
The prediction module masks undesired actions at each time step. Source. |
Here a related patent from the authors 🔒 😉. Source. |
Authors: Isele, D., Nakhaei, A., Fujimura, K.
- Previous works about learning control with
DQN
s at diverse intersections:- "Navigating occluded intersections with autonomous vehicles using deep reinforcement learning" - (Isele et al., 2017)
- "Transferring Autonomous Driving Knowledge on Simulated and Real Intersections" - (Isele et al., 2017)
- How to make model-free RL "safe"? Two options are mentioned (both required expert knowledge):
- 1- Modifying the reward function (requires careful tuning).
- 2- Constraining exploration (e.g.
action masking
,action shielding
).-
"The methods can completely forbid undesirable states and are usually accompanied by formal guarantees".
- In addition, the learning efficiency can be increased (fewer states to explore).
-
- For additional contents on "Safe-RL in the context of autonomous vehicles", one could read this github project by @AliBaheri.
- Here: The
DQN
is augmented with some action-masking mechanism.- More precisely, a prediction model is used:
- The predicted position of the ego car is compared against the predicted position of all other traffic cars (called
forward predictions
). - If an overlap of the regions is detected, the action is marked as unsafe.
- Else, the agent is allowed to freely explore the safe state space, using traditional model-free RL techniques.
- The predicted position of the ego car is compared against the predicted position of all other traffic cars (called
- Note: this illustrates the strong relation between
prediction
andrisk assessment
.
- More precisely, a prediction model is used:
- One challenge: Ensuring "safety" not just for the next step, but for the whole trajectory.
-
"To ensure that the agent never takes an unsafe action, we must check not only that a given action will not cause the agent to transition to an unsafe state in the next time step, but also that the action will not force the agent into an unsafe state at some point in the future."
-
"Note that this is closely related to the credit assignment problem, but the risk must be assigned prior to acting".
- This made me think of tree search techniques, where a path is explored until its terminal node.
- To cope with the exponential complexity, the authors proposed some approximation to restrict the exploration space.
- One of them being the
temporal abstraction
for actions (see this video series for a quick introduction). - The idea of this so-called
option
orintentions
framework, is to distinguish betweenlow-level
andhigh-level
actions-
"This can be thought of as selecting an open-loop high-level decision followed by subsequent bounded closed-loop low-level corrections."
- For a given time horizon, the trajectories described with these options are way shorter, hence reducing the size of the state set that is to be checked.
-
- This leads to the definition of a functional local-state (I did not understand all the details) including some variance term:
-
"The variance acts as a bound that encompasses the variety of low-level actions that produce similar high-level actions. Additionally, we will use the variance to create safety bounds".
-
- One of them being the
-
- One remark: similar to the "collision checker" for path planning, I can imagine that this prediction module becomes the computational bottleneck of the framework.
"Automating Vehicles by Deep Reinforcement Learning Using Task Separation with Hill Climbing"
-
[
2017
] [📝] [ 🎓IMT Lucca
] -
[
stochastic policy search
,gradient-free RL
,policy-gradient RL
,reward shaping
]
Click to expand
Source. |
Author: Plessen, M. G.
-
One remark: to be honest, I find this publication not very easy to understand. But it raises important questions. Here are some take-aways.
-
One term:
(TSHC)
= Task Separation with Hill Climbing- Hill Climbing has nothing to do with the gym MountainCar env.
- It rather refers to a gradient-free optimization method: the parameters are updated based on greedy local search.
- For several reasons, the author claims derivative-free optimization methods are simpler and more appropriate for his problem, compared to policy-gradient RL optimization methods such as
PPO
andDDPG
where tuned-parameters are numerous and sparse rewards are propagating very slowly.
- The idea of Task Separation concerns the main objective of the training phase: "encode many desired motion primitives (training tasks) in a neural network", hoping for generalisation when exposed to new tasks.
- It is said to serve for exploration in optimization: each task leads to a possible region with locally optimal solution, and the best solution among all identified locally optimal solutions is selected.
- Hill Climbing has nothing to do with the gym MountainCar env.
-
One concept:
sparse reward
.- Reward shaping is an important problem when formulating the decision-making problem for autonomous driving using a (PO)MDP.
- The reward signal is the main signal used for the agent to update its policy. But if it only receives positive reward when reaching the goal state (i.e. sparse reward), two issues appear:
- First, it will take random actions until, by chance, it gets some non-zero reward. Depending on how long it takes to get these non-zero rewards, it might take the agent extremely long to learn anything.
- Secondly, because nonzero rewards are seen so rarely, the sequence of actions that resulted in the reward might be very long, and it is not clear which of those actions were really useful in getting the reward. This problem is known as credit assignment in RL. (Explanations are from here).
- Two options are considered in this work:
- "Rich reward signals", where a feedback is provided at every time step (
r
becomes function oft
). - "Curriculum learning", where the learning agent is first provided with simpler examples before gradually increasing complexity.
- "Rich reward signals", where a feedback is provided at every time step (
- After trials, the author claims that no consistent improvement could be observed with these two techniques, adding that the design of both rich reward signals and "simple examples" for curriculum learning are problematic.
- He rather kept working with sparse rewards (maximal sparse rewards), but introduced some "virtual velocity constraints" to speed up the training.
-
I like the points he made concerning feature selection for the state, i.e. how to design the state
s(t)
of the MDP.- He notes that
s(t)
must always relate the current vehicle state with reference to a goal state.- In other words, one should use relative features for the description of the position and velocity, relative to their targets.
- In addition,
s(t)
should also consider the past and embed a collection of multiple past time measurements.- That seems sound. But this would indicate that the "Markov property" in the MDP formulation does not hold.
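The two recommendations (goal-relative features, embedded history) can be sketched together as follows (all names are mine; the history length is an assumption):

```python
from collections import deque

class StateBuilder:
    """Build s(t) from goal-relative features plus a stacked history."""
    def __init__(self, history=4):
        self.buffer = deque(maxlen=history)

    def update(self, pos, speed, goal_pos, goal_speed):
        # Features are expressed relative to the goal state.
        self.buffer.append((goal_pos - pos, goal_speed - speed))
        # Pad with the oldest entry until the history is full, then flatten.
        frames = list(self.buffer)
        while len(frames) < self.buffer.maxlen:
            frames.insert(0, frames[0])
        return [x for frame in frames for x in frame]
```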
"Deep Learning of Robotic Tasks without a Simulator using Strong and Weak Human Supervision"
-
[
IL initialization
,BC
,reward induction
,safe RL
,action masking
,supervision
,end-to-end
]
Click to expand
Top-left: Stage 3 learns to regress the reward function demonstrated by the supervisor, the goal being to drive on the second lane only. These dense signals are used during Stage 5 for RL . Note that defining a reward signal with e.g. distances from the roadsides or the angle with respect to the road is impossible without explicit image processing, since no access to the internal state of a simulator is possible. Right and bottom: In Stage 4 a risk function is regressed. It is used to limit the exploration of the RL agent to reduce the number of accidents, mostly in the initial stage. This accelerates its training. Source. |
Initializing the RL Q -net with some layers of a learnt BC policy net significantly accelerates the RL training. It extends the imitation strategy to previously unseen areas of the state space, without losing time exploring irrelevant parts. In addition, a learnt risk function is used to mask dangerous actions , which also speeds up the training. Finally, a reward function is learnt from demonstrations. Bottom-left: Impact of the reward signal's quality on the agent’s overall performance during the RL stage: the reward model R-0.2-lane was trained with only 20% of the dataset. It struggles to correctly generalize the human instructor's knowledge to unseen states . Source. |
The task is to drive in a designated lane only (the second lane from the right). The acceleration is controlled by an external module with a 50km/h target. Source. |
Authors: Hilleli, B., & El-Yaniv, R.
-
Main motivations:
1-
end-to-end
: perform a complex task, using only raw visual signals.- No image processing or other hand-crafted features.
2-
Train robotic agents:- ... in the absence of realistic simulator for real-world environments.
- ... but where some (human) instructor can offer some kind of "weak" supervision.
-
"The proposed method enables in principle the creation of an agent capable of performing tasks that the instructor cannot perform herself."
3-
AccelerateRL
learning and reach better performance.- Done by considering
safety
during training. -
"The integration of the safety module into the
RL
allows us to improve the agent’s exploration of thestate
space and speed up the learning, thus avoiding the pitfalls of random exploration policies such as e-greedy, where the agent wastes a lot of time on unimportant regions of thestate
space that the optimal policy would never have encountered." - Note: No safety guarantee is provided for deployment.
- Done by considering
4-
Assume no access to the internal state of the simulator.- One can only access raw images.
- It is impossible to query other information such as exact
position
/speed
to compute somereward
function for instance.
-
Main idea:
- Deploy
RL
after aIL
stage. - It compensates for the noisy demonstrations and extends the imitation strategy to previously unseen areas of the
state
space. -
"Initializing the
Q
-network’s parameters with those learned in theIL
stage increased the learning rate at the beginning of theRL
, and significantly reduced the number of critical driving mistakes made by the agent (compared to randomly initializing theQ
-network)."
- Deploy
-
About
RL
/BC
:RL
alone is limited.- Constructing an accurate simulator is extremely difficult.
-
[
reward
shaping is hard] "The handcrafting of areward
function can be a complicated task, requiring experience and a fair amount of specific domain knowledge."
IL
alone is limited.- Mainly by the quality and diversity of the observed demonstration.
-
About the task:
- Simulator: Assetto Corsa
- Task: steer the race car.
- The
acceleration
is controlled by an external module that ensures50km/h
- The agent must drive in the second lane only.
- The
- Metrics: accumulated average
reward
.-
"Without a natural performance evaluation measure for driving skills and without access to the internal state of the game, our performance evaluation procedure was based on the accumulated average
reward
achieved by the agent." - It reminds me Goodhart's law:
-
"When a measure becomes a target, it ceases to be a good measure."
-
- No safety metrics?
- At least counting the off-lane situations would have been interesting.
-
state
= a stack ofRGB
front view images.-
"In all our experiments, each input instance consisted of two sequential images with a
0.5
-second gap between them." -
"Pixels corresponding to the speed indicator were set to zero in order to guarantee that the agent doesn’t use the speed information during the learning process."
-
action
- steering commands of
left
/right
using keyboard keys.
- steering commands of
dt
-
"The agent must always act under
70 ms
latency (from the time the image is captured to the time the corresponding steering key is pressed). During this gap, the previous key continues to be pressed until a new one is received. This added difficulty can be avoided when using a simulator."
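The `state` construction above (two frames `0.5 s` apart, speed-indicator pixels zeroed) can be sketched as follows. This is only an illustration, not the paper's code: the frame size and the speed-indicator box are assumed placeholders, not the game's real coordinates.

```python
import numpy as np

# Illustrative location of the speedometer pixels (rows, cols) - an assumption.
SPEED_BOX = (slice(200, 220), slice(10, 60))

def make_state(frame_t_minus_half_s, frame_t):
    """Stack two RGB frames captured 0.5 s apart into a (2, H, W, 3) state,
    zeroing the speed-indicator region so the agent cannot read its speed."""
    state = np.stack([frame_t_minus_half_s, frame_t]).astype(np.float32)
    state[:, SPEED_BOX[0], SPEED_BOX[1], :] = 0.0  # hide speed information
    return state
```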
-
-
How do humans learn to drive a car or a bike?
1-
Observe driving scenes.- Try to extract relevant information required to successfully perform the task.
-
"The unsupervised learning phase starts long before her formal driving lessons, and typically includes great many hours where the student observes driving scenes while sitting as a passive passenger in a car driven by her parents during her childhood."
2-
Observe (and possibly copy) how other drivers drive.3-
Improve under supervision and learn to evaluate which situation is good or bad.-
"While the student is performing the task, the instructor provides real-time feedback and the student improves performance by both optimizing a
policy
as well as learning thefeedback
function itself."
-
4-
Improve without supervision.-
"The student continues to teach herself (without the instructor), using both the
reward
function previously induced by the instructor and futurereward
signals from the environment." - Probably the horn of surroundings cars can serve as additional
reward
signals.
-
-
Inspired from this, the presented framework consists of five elements:
-
Phase
1
: unsupervised feature learning.- Goal: learn an informative low-dimensional representation of the raw image.
- External (high level) features could be added to accelerate the learning.
- For instance road boundaries are relevant to the driving task and can be easily extracted from the images.
-
"Such engineered features that are based on domain expertise should definitely be used whenever they exist to expedite learning and/or improve the final performance."
-
Phase
2
supervisedIL
:BC
.- Leverage: the low-dimensional representation
F0(s)
learnt at phase1
. - Human work:
70,000
samples, equivalent to2
hours of human driving, are collected. 1-
The first goal is to generate an initial agent policyπ0
capable of operating in the environment without too much risk.2-
The second goal is to improve the low-dimensional feature representation (learned in the previous unsupervised stage) and generate a revised representationF1(s)
, which is more relevant/informative for the task.
-
Phase
3
supervisedreward
induction.- Leverage: low-dim representation learnt in the previous stages.
-
"The
reward
function is not learned from the raw states visited by the learner, but from some assumed to be known, low-dimensional feature representation of them."
-
- Human work:
- Drive the car while intentionally alternating between:
- edge conditions (i.e., driving on, or even outside road boundaries)
- safe driving.
- Label raw image pixels with {
bad
,good
}.- It will correspond to {
-1
,+1
}reward
signals.
- It will correspond to {
- Drive the car while intentionally alternating between:
- Task:
- A net
Sθ
maps a gamestate
into a real number in the range [−1
,+1
]. -
"Corresponding to the amount of
risk
in thatstate
, where the value+1
is assigned to the safest possiblestate
and the value−1
is assigned to the riskiest possiblestate
."
- A net
- Motivation:
- Train subsequent
RL
faster. -
"Since we don’t have access to the internal state of a simulator, defining a
reward
signal using thestate
parameters of the car (distances from the roadsides, angle with respect to the road, etc.) is impossible without explicit image processing." - These dense
reward
signals make aRL
problem easier to learn.
- Train subsequent
- Idea:
-
"It receives an image and outputs a number in the range [
−1
,1
] that indicates the instantaneous reward of being in thatstate
. We call this method "reward
induction" since thereward
function is induced (in aSL
manner) from a finite set of labeled examples obtained from a human instructor. This method is related to a technique known asreward
shaping, where an additionalreward
signal is used to guide the learning agent."
-
- Note: the supervisor does not indicate what
action
should be done.- Rather she estimates whether each
state
is desirable or not, which should be easier. - Therefore the agent should be able to perform a task that the supervisor herself cannot achieve.
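The phase-3 "`reward` induction" idea above can be sketched as a tiny regressor with a `tanh` output in [`-1`, `+1`], fitted on instructor labels {`bad`: `-1`, `good`: `+1`}. This is a minimal stand-in, not the paper's conv net `Sθ`: the linear model, feature meaning, and hyper-parameters are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_reward_net(features, labels, lr=0.1, epochs=500):
    """Fit S(x) = tanh(w.x + b) by gradient descent on squared error,
    so the induced reward lies in (-1, +1) like the paper's risk score."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        pred = np.tanh(features @ w + b)
        grad = (pred - labels) * (1.0 - pred ** 2)  # d(0.5*(pred-y)^2)/dz
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b

# Toy data (invented): 1st feature ~ distance to road boundary; states far
# from the boundary are labelled 'good' (+1), the others 'bad' (-1).
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] > 0.0, 1.0, -1.0)
w, b = train_reward_net(X, y)

def induced_reward(x):
    return float(np.tanh(x @ w + b))
```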
-
Phase
4
supervised safety module construction.safety module
=safety function
+safe policy
- Human work:
- Demonstrate
safe
/unsafe
actions. - Label each
state
as {safe
,unsafe
}.
- Demonstrate
- Motivation: learn subsequent
RL
faster, by reducing the number of accidents mostly in the initial stage. - Task:
- Learn a "safe"
policy
viaBC
. Already done in stage2
. - Learn a safety network to classify the
state
space into two classes: {safe
,unsafe
}. Same architecture as thereward
net.
- Learn a "safe"
- How is the
safety module
used?- Two main approaches for integrating the safety concept into the
RL
algorithms:1-
Transforming the optimization criteria to include a notion ofrisk
(e.g.,variance of return
,worst-outcome
).2-
Modifying the agent’s exploration process while utilizing prior knowledge of the task to avoid riskystates
.
- Here, the second approach is used. A sort of
action masking
:-
"During the
RL
training, if the safety network labels the currentstate
asunsafe
, the control is taken from the agent and given to a pre-trainedsafe
policy until the agent is out of danger."
-
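The override quoted above can be sketched as a small control loop: a learned binary classifier decides whether the agent's policy or the pre-trained `safe` policy picks the `action`. The names, the lateral-offset toy state, and the `1.5 m` threshold below are illustrative assumptions, not the paper's values.

```python
def act(state, agent_policy, safe_policy, safety_classifier):
    """Return (action, overridden): hand control to the safe (BC) policy
    whenever the safety network labels the current state as 'unsafe'."""
    if safety_classifier(state) == "unsafe":
        return safe_policy(state), True   # safe policy takes over
    return agent_policy(state), False     # normal RL exploration

# Toy instantiation: state = lateral offset from the lane centre (metres).
agent = lambda s: "explore"
safe = lambda s: "steer_to_centre"
classifier = lambda s: "unsafe" if abs(s) > 1.5 else "safe"
```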
-
Phase
5
RL
.- Leverage: learnt
BC
policy, learntreward
function, learntsafety module
. - Motivation:
- Improve that
BC
policy that may suffer from distributional shift, i.e. it does not know how to react tostates
it has never seen.
- Improve that
-
"We also note that stages
3
and5
can be iteratively repeated several times to improve final performance."
-
About
BC
initialization: How to use the initialized policy netPθ
?- On the one hand, this
IL
net contains informative state representation. - On the other, the
Q
-netQθ
net should outputaction
-values. - Architecture and initialization:
-
"We trained a
Q
-network, denoted byQθ
, that has the same architecture asPθ
except for the final soft-max nonlinearity, which was removed. TheQ
-network’s parameters were initialized from those of the policy network except for the final layer’s parameters, which were initialized from a uniform distribution [−0.001
,+0.001
]."
-
- Two stages:
1-
Policy evaluation.Pθ
is driving.- Only the last two fully-connected layers of
Qθ
are updated. -
"This step can be viewed as learning a critic (and more precisely, only the last two layers of the critic’s network) to estimate an
action-value
function while fixing the acting policy."
2-
RL
is started usingQθ
.- All of its parameters can be updated.
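The initialization quoted above can be sketched with plain parameter dicts: copy the `BC` policy net's weights, and re-initialize only the final layer (whose softmax is dropped) from a uniform distribution in [`−0.001`, `+0.001`]. Layer names and shapes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_q_from_policy(policy_params):
    """Build Q-net parameters from the IL policy net: same trunk,
    fresh last layer drawn from U(-0.001, +0.001)."""
    q_params = {name: w.copy() for name, w in policy_params.items()}
    head = q_params["fc_out"]  # layer that fed the (removed) softmax
    q_params["fc_out"] = rng.uniform(-0.001, 0.001, size=head.shape)
    return q_params

# Toy policy net (invented shapes); 3 discrete steering actions.
policy = {"conv1": rng.normal(size=(8, 4)),
          "fc_hidden": rng.normal(size=(4, 16)),
          "fc_out": rng.normal(size=(16, 3))}
q_net = init_q_from_policy(policy)
```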
-
Limitations:
- Multiple manual demonstration and labelling tasks are required:
- Demonstrate good behaviours.
- Demonstrate dangerous and recovery manoeuvres.
- Classify
safe
/unsafe
situations.
- No surrounding cars.
- No
throttle
/brake
control.
"A Comprehensive Survey on Safe Reinforcement Learning"
-
[
2015
] [📝] [ 🎓University Carlos III of Madrid
] -
[
optimization criteria
,safe exploration/exploitation
,risk sensitivity
,teacher advice
,model uncertainty
]
Click to expand
Note: S/a means that the method is applied to domains with continuous or large state space and discrete and small action space. Source. |
Source. |
Authors: Garcia, J., & Fernandez, F.
-
Personal notes:
- While reading, try to classify these two commonly-used approaches in
AD
:- Add a high penalty for unsafe
states
. - Apply rule-based logics to mask unsafe
actions
.
- When trying to design a
RL
agent, ask the questions:- What
objective
should be optimized? - How to quantify / detect
risk
?
-
Remaining open questions for me:
- Are there safety guarantees?
- How to avoid overly conservative behaviours?
- What about
POMDP
s? - How to deal with large continuous
states
such as images? - Can "teach advice" also be used during deployment? E.g.
action-masking
- How do "solvers" differ from those used for unconstrained problems?
-
Definition:
-
"
Safe RL
can be defined as the process of learning policies that maximize the expectation of the return in problems in which it is important to ensure reasonable system performance and/or respectsafety constraints
during thelearning
and/ordeployment
processes." - For
AD
,RL
agents are probably trained in a simulator. Therefore:training
: Crashes can happen.deployment
: No crash should happen!-
"If the goal is to obtain a
safe
policy
at the end, without worrying about the number of dangerous or undesirable situations that occur during thelearning
process, then theexploratory
strategy used to learn thispolicy
is not so relevant from asafety
point of view."
-
-
-
Although most approaches are model-free
RL
, ideas and challenges for model-basedRL
are also mentioned.-
"Having such a [learnt] model is a useful entity for
Safe RL
: it allows the agent to predict the consequences ofactions
before they are taken, allowing the agent to generate virtual experience." [but how to make sure the learnt model is correct?] - How to safely explore the relevant parts of the
state
/action
spaces to build up a sufficiently accurate dynamics model from which to derive a good policy?- For instance by learning the
dynamics
model from teachersafe
demonstrations. - But what when the agent sees a
state
it as never encountered? I.e.covariate shift
.
-
-
Two big classes:
1-
Modify the optimality criterion with somesafety
term- For instance by including the probability of visiting error
states
. -
[Issue] "The transformation of the optimization criterion produces a distortion in the
action values
and the true long term utility of theactions
are lost." -
[Issue] "It is difficult to find an optimization objective which correctly models our intuition of
risk
awareness. InSection 3.1
, most of the approaches are updated based on a conservative criterion and the resulting policy tends to be overly pessimistic."
2-
Modify the exploration process to avoid risk situations through:- ... Either the incorporation of external knowledge,
- ... or the guidance of a risk metric.
- This is especially important when
safety
is required duringtraining
.-
"The use of random exploration would require an undesirable
state
to be visited before it can be labelled as undesirable."
-
- Rule-based approaches can be used in parallel. E.g.
action-masking
.
- Interesting idea: combine both families!
-
About the concept of
risk
:- A
policy
optimal for the vanillaMDP
, i.e. maximizing theexpected return
, is said "risk
-neutral". - The authors associate
risk
to the uncertainty of the environment, i.e. its stochastic nature.-
"A robust
MDP
deals with uncertainties in parameters; that is, some of the parameters, namely,transition
probabilities, of theMDP
are not known exactly." - E.g. based on the variance of the return or its worst possible outcome.
- E.g. based on the level of knowledge of a particular state, or on the difference between the highest and lowest
Q-values
- To me, these rather express the confidence of an agent.
- But an agent can be wrong, being 100% sure that a bad action is good.
- I would have rather expected the
risk
to be related to the probability for something bad to happen in the future ifa
is taken ins
.- This can for instance rely on some notion of
reachability
and some partition ofstate
space into: - A
safe
sub-space. - A
unsafe
sub-space.
-
- Distinction for
risk
detection:1-
Immediate risk: if detection is based only on the currentstate
, it may be too late to react.2-
Long-term risk: some risk function,ρπ
(s
), ensures the selection ofsafe
actions
, preventing long-term risk situations once therisk
function is correctly approximated.
- How to estimate
risk
when starting from scratch?-
"Teacher advice techniques can be used to incorporate prior knowledge, thus mitigating the effects of immediate
risk
situations until therisk function
is correctly approximated."
-
- What is not clear to me is how the agent can learn this
risk function
.-
"The mechanism for
risk
detection should be automatic and not based on the intuition of a human teacher."
-
-
Class
1
: what for extended optimization criteria?-
1-
Maximize the worst-casereturn
.- Here, the goal is to deal with
uncertainties
:- ... Either the inherent uncertainty related to the stochastic nature of the system,
- ... And/or the parameter uncertainty related to some of the parameters of the
MDP
are not known exactly (c.f. model-basedRL
).
- It can lead to over-conservative behaviours.
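The worst-case criterion of point `1-` can be contrasted with the risk-neutral one on a toy action table (all numbers invented): the expectation prefers the risky manoeuvre, the maximin criterion does not.

```python
# Each action maps to its possible returns under environment stochasticity.
outcomes = {
    "overtake": [10.0, 10.0, -10.0],  # higher mean, catastrophic tail
    "keep_lane": [2.0, 2.0, 2.0],     # modest but certain return
}

def argmax_by(table, score):
    """Pick the action whose outcome list maximizes the given score."""
    return max(table, key=lambda action: score(table[action]))

risk_neutral = argmax_by(outcomes, lambda rs: sum(rs) / len(rs))  # expectation
worst_case = argmax_by(outcomes, min)                             # maximin
```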
-
2-
Maximize an objective function that includes arisk
termω
.- For instance the linear combination
Eπ(R) − β*ω
, or the exponential utility function:log[Eπ(exp[βR])] / β
.- Where
ω
could also reflect the probability of entering into an errorstate
.
- The "risk sensitivity" scalar parameter
β
allows the desired level ofrisk
to be controlled.β=0
impliesrisk
neutrality.
- The
risk
ω
is often associated with thevariance
of thereturn
.-
"Higher
variance
implies more instability and, hence, morerisk
." - But there are several limitations when using the
return variance
as a measure ofrisk
.
-
- But for
AD
it does not make much sense to describerisk
by thevariance
of thereturn
. Think of the tails of the distribution.
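The two objectives of point `2-` can be illustrated numerically. With `β < 0` the exponential utility is risk-averse (the sign convention is the usual one, an assumption here since the survey leaves it implicit); the return samples below are made up, with equal means but different tails.

```python
import numpy as np

def mean_variance(returns, beta):
    """Linear combination E[R] - beta * Var[R] (variance as the risk term w)."""
    return np.mean(returns) - beta * np.var(returns)

def exponential_utility(returns, beta):
    """log(E[exp(beta * R)]) / beta; beta -> 0 recovers the expectation E[R]."""
    return np.log(np.mean(np.exp(beta * np.asarray(returns)))) / beta

steady = [1.0, 1.0, 1.0, 1.0]   # same mean...
risky = [4.0, 4.0, 0.0, -4.0]   # ...but a heavy downside tail
```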
-
3-
Maximize theexpected return
, under some constraints.- It puts constraints on the
policy
space. -
"We want to maximize the expectation of the
return
while keeping other types ofexpected utilities
lower than some given bounds." - Soft or hard constraints?
-
"The previous constraint is a hard constraint that cannot be violated, but other approaches allow a certain admissible
chance
of constraint violation. Thischance
-constraint metric,P(E(R) ≥ α) ≥ (1 − ε)
." [Here theconstraint
is on thereturn
, but could be on theprobability of collision
] - Leading to
chance
-constrainedMDP
s.
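An empirical check of such a chance constraint (here applied per-episode, on the `return` itself, which is a simplification of the quoted metric) can be sketched as follows; the returns, `α`, and `ε` are invented.

```python
def satisfies_chance_constraint(episode_returns, alpha, epsilon):
    """True iff the return reaches the floor alpha in at least a
    (1 - epsilon) fraction of episodes: P(R >= alpha) >= 1 - epsilon."""
    ok = sum(1 for r in episode_returns if r >= alpha)
    return ok / len(episode_returns) >= 1.0 - epsilon

# Toy episode returns: one bad episode out of ten.
returns = [10.2, 9.8, 10.5, 3.1, 10.0, 9.9, 10.1, 10.3, 9.7, 10.4]
```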
-
-
-
Class
2.1
: How theexploration
process can be modified by including prior knowledge of the task?-
1-
Provide initial knowledge, used to bootstrap the learning algorithm.- To obtain for instance an initial
value function
. - Or perform transfer learning, i.e. using a
policy
trained on a similar task. -
"The learning algorithm is exposed to the most relevant regions of the
state
andaction
spaces from the earliest steps of the learning process, thereby eliminating the time needed in randomexploration
for the discovery of these regions." -
[limitation] "The
exploration
process following the initial training phase can result in visiting newstates
for which the agent has no information on how to act."
-
2-
Derive a policy from a finite set of demonstrations.-
"All (
state
-action
) trajectories seen so far are used to learn a model from the system’sdynamics
. For this model, a (near-)optimalpolicy
is to be found using anyRL
algorithm." - It relates to
- off-line
RL
- model-based
RL
. - Learning from Demonstration (
LfD
), with inverseRL
and behavioural cloning.
- How should the agent act when it encounters a
state
for which no demonstration exists?- See "teacher advice" techniques below. Basically it needs some life-long mentoring.
-
-
3-
Provide teacher advice, e.g.,safe
actions
.-
[Motivation] "However, while furnishing the agent with initial knowledge helps to mitigate the problems associated with random exploration, this initialization alone is not sufficient to prevent the undesirable situations that arise in the subsequent explorations undertaken to improve learner ability. An additional mechanism is necessary to guide this subsequent exploration process in such a way that the agent may be kept far away from catastrophic
states
." -
It generally includes four main steps:
(i)
Request or receive the "advice".- Either the agent can explicitly ask for help. As a "joker" / "ask for help" approach. Using some confidence parameter to detect risky situations, e.g. when all
actions
have similarQ-values
. - Or the teacher can interrupt the agent learning execution at any point, e.g. to prevent catastrophic situations. Close to
action masking
.
- Either the agent can explicitly ask for help. As a "joker" / "ask for help" approach. Using some confidence parameter to detect risky situations, e.g. when all
(ii)
Convert advice into a usable form.- The teacher does not necessarily use the same input as the agent.
- For instance for simple
action masking
, knowing about the distance to the leading vehicle can be enough to imposebraking
when driving too close. - In (Bouton et al., 2019) a probabilistic model checker is derived prior to learning a
policy
.- It computes the probability of reaching the goal safely for each (
state
,action
) pair. - And is subsequently queried to mask actions.
- Issue: requires discrete states and full knowledge of the
dynamics
.
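The (Bouton et al., 2019)-style masking described above can be sketched as a lookup: a precomputed table of P(reach goal safely | `state`, `action`) is queried and actions below a threshold `λ` are masked. The table entries, state/action names, and the fallback rule are invented for illustration.

```python
# Assumed precomputed model-checker output: P(safe | state, action).
P_SAFE = {
    ("close_to_leader", "accelerate"): 0.2,
    ("close_to_leader", "keep_speed"): 0.7,
    ("close_to_leader", "brake"): 0.99,
}

def allowed_actions(state, actions, lam=0.95):
    """Mask actions whose safety probability is below lam; if all are
    masked, fall back to the least-unsafe action."""
    mask = [a for a in actions if P_SAFE[(state, a)] >= lam]
    return mask or [max(actions, key=lambda a: P_SAFE[(state, a)])]
```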
(iii)
Integrate the reformulated advice into the agent’s knowledge base.(iv)
Judge the value of advice.-
"The teacher is giving advice on
policies
rather thanQ-values
, which is a more natural way for humans to present advice (e.g., when the nearest taker is at least8
meters, holding the ball is preferred to passing it)."
-
-
It assumes the availability of a teacher for the learning agent, who is not necessarily an expert in the task.
- Following the advice of the teacher can lead the learner to promising parts of the
state
space, which also helps for safe exploration.
-
It does not modify the
objective
.-
"The
Safe RL
algorithm should incorporate a mechanism to detect and react to immediaterisk
by manifesting different risk attitudes, while leaving the optimization criterion untouched."
-
-
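The "joker" / "ask for help" trigger of step `(i)` can be sketched as a confidence test on the `Q-values`: the learner requests teacher advice when the best and second-best actions are nearly tied. The gap threshold below is an illustrative knob, not a value from the survey.

```python
def needs_advice(q_values, min_gap=0.05):
    """True when Q-values are too similar to discriminate between actions,
    i.e. the top-two action values differ by less than min_gap."""
    top_two = sorted(q_values, reverse=True)[:2]
    return (top_two[0] - top_two[1]) < min_gap
```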
-
-
Class
2.2
: How to do risk-directed exploration?- For instance the agent can be encouraged to seek controllable regions of the environment.