"Imitation Learning for Autonomous Highway Merging with Safety Guarantees"
The learnt BC policy is executed, except when the ego car drives too close to another car (deemed an unsafe situation); in that case a rule-based 'safe' policy takes over. Source.
Dark blue = learnt cars. Light blue = cars from the dataset. Note there is no cooperation because of open-loop testing: the other traffic participants do not react to the ego actions. Source.
Author: D’Alessandro, E.

- A Master Thesis.

- Motivations:
  - 1- Ensure some safety constraints while imitating an expert.
    - Imitating a safe expert cannot be enough.
    - Safety must be enforced as an external factor.
  - 2- As a Master thesis, explain the concepts: `RL`, `BC`, max-margin `IRL`, max-entropy `IRL`, `GAIL`.

- The `state-action` visitation distribution (occupancy measure) `µ(π, d0)`.
  - Bellman flow constraints.
  - Linear Programming.
  - "It is also possible to formulate an `MDP` problem as a linear programming problem [`primal`] that can be translated into its `dual` formulation."
- Linear Programming Apprenticeship Learning (`LPAL`):
  - "The key idea is to find, among all the feasible occupancy measures in the feasibility set `K`, the one whose `policy` has the `value` function that is as close as possible to the expert's `value` function."
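The Bellman flow constraints define the feasibility set `K` of occupancy measures over which `LPAL` searches. A minimal pure-Python sketch on a made-up toy 2-state, 2-action MDP (transitions, policy and `γ` are illustrative values, not from the thesis): it computes the normalised discounted occupancy measure of a fixed policy by rolling the state distribution forward, then checks that it satisfies the flow constraint.

```python
# Toy 2-state, 2-action MDP. We compute the (normalised) occupancy measure
#   µ(s,a) = (1-γ) Σ_t γ^t P(s_t=s, a_t=a)
# of a fixed policy, then verify the Bellman flow constraint
#   Σ_a µ(s,a) = (1-γ)·d0(s) + γ·Σ_{s',a'} P(s|s',a')·µ(s',a')
# which is exactly what defines the feasibility set K used by LPAL.

GAMMA = 0.9
S, A = 2, 2
# P[s][a][s2] = transition probability, d0 = initial state distribution
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
d0 = [1.0, 0.0]
# pi[s][a] = action probabilities of the evaluated policy
pi = [[0.6, 0.4], [0.2, 0.8]]

def occupancy(P, d0, pi, gamma=GAMMA, n_steps=2000):
    mu = [[0.0] * A for _ in range(S)]
    d = list(d0)
    discount = 1.0
    for _ in range(n_steps):
        for s in range(S):
            for a in range(A):
                mu[s][a] += (1 - gamma) * discount * d[s] * pi[s][a]
        # one step of the Markov chain induced by pi
        d = [sum(d[s] * pi[s][a] * P[s][a][s2]
                 for s in range(S) for a in range(A))
             for s2 in range(S)]
        discount *= gamma
    return mu

mu = occupancy(P, d0, pi)

# check the Bellman flow constraint at every state
for s in range(S):
    lhs = sum(mu[s][a] for a in range(A))
    rhs = (1 - GAMMA) * d0[s] + GAMMA * sum(
        P[s2][a][s] * mu[s2][a] for s2 in range(S) for a in range(A))
    assert abs(lhs - rhs) < 1e-9
```

LPAL would optimise over all `µ` satisfying these (linear) constraints; here we only verify that a policy-induced occupancy measure is a feasible point of `K`.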
- Main idea:
  - A rule-based `policy` takes over at test time if the current `state` is deemed "unsafe".
  - It prevents the visitation of `unsafe` `states` and limits the risk of cascading errors.
    - "When an `unsafe` situation arises and the cascading-error issue is more likely to lead the agent to catastrophic scenarios, those `actions` are modified so as to assist the agent in its recovery towards `safe` `states`."
- `MDP`: which `horizon` for a `ramp merging` task?
  - [`no`] Infinite horizon: it never ends, e.g. controlling the traffic at an intersection.
    - Merging ends (`success`/`fail`) at some point.
  - [`no`] Finite horizon: the time horizon `T` is finite, fixed and predetermined.
    - Merging can be done quickly or can take longer.
    - The terminal time is unknown before running the `policy`.
  - [`yes`] Episodic: the time window cannot be predefined, but it only extends over a finite number of steps until some terminal `state` is reached.
- Why `BC`?
  - Advantages:
    - 1- Here, the `transition` model of the environment is unknown.
      - "... unlike other methods such as `distribution matching` (`LPAL`) and every form of `IRL` that involves solving a forward `RL` problem as an inner loop."
    - 2- Extremely simple and efficient, since it mainly consists in applying supervised learning to a set of expert trajectories.
    - 3- It completely skips the `reward` function recovery step, which carries the ambiguity problem.
  - Drawbacks:
    - 1- A large and exhaustive set of demonstrations is required.
    - 2- Cascading errors may lead the agent to a `state` never encountered by the expert, i.e. no demonstration is available to recover from.
- Env:
  - Based on the `NGSIM` dataset.
  - `state`: ?
    - From the code, the net expects a vector of size `10`, describing the relative `position` and `speed` to the leader and to the yielder.
  - `action`: a pair (`d_to_left`, `d_forward`) of displacements.
  - Open-loop: `position`, `speed` and `acceleration` of the other cars are kept unchanged.
    - I.e. they consider the original real car that is now replaced by the ego agent.
    - Therefore there is no interaction or reaction of the other vehicles.
    - "To compensate for the lack of cooperation, the `NGSIM` dataset could be used only for training, while implementing a lane-following algorithm to control the surrounding cars. It would generate an adjustable traffic situation, where all the vehicles react to the presence of the merging cars."
  - `dt`: what is the duration between two decisions?
- About the "safe" `policy`:
  - It takes over when the distance to a surrounding car is below a certain threshold.
  - Depending on where this closest car is, relative to the ego car:
    - `right`: impossible, since the ego car is on the merging ramp.
    - `left`:
      - If the gap is very small: the vehicle is driven away from the left lane.
      - Otherwise: the manoeuvre is not aborted but paused, preventing the car from moving further laterally.
    - `in front`: slow down.
    - `behind`: speed up.
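A minimal sketch of such a rule-based override. The thresholds, scaling factors and function names are my own illustrative assumptions, not the thesis' exact values; only the case distinction follows the rules above.

```python
# Rule-based "safe" policy override (illustrative sketch, assumed parameters).
# It replaces the learnt BC action when the closest surrounding car is below
# a distance threshold, depending on where that car is relative to ego.

def safe_policy(closest_car_side, gap, bc_action,
                trigger_dist=10.0, tiny_gap=2.0):
    """Return the action to execute. `bc_action` is a (d_to_left, d_forward) pair."""
    if gap >= trigger_dist:
        return bc_action                       # safe enough: keep the learnt policy
    d_to_left, d_forward = bc_action
    if closest_car_side == "left":
        if gap < tiny_gap:
            return (-abs(d_to_left), d_forward)  # drive away from the left lane
        return (0.0, d_forward)                  # pause the lateral manoeuvre
    if closest_car_side == "front":
        return (d_to_left, d_forward * 0.5)      # slow down
    if closest_car_side == "behind":
        return (d_to_left, d_forward * 1.5)      # speed up
    raise ValueError("'right' is impossible: ego is on the merging ramp")

# the manoeuvre is paused, not aborted, when a car is close on the left
assert safe_policy("left", 5.0, (0.5, 8.0)) == (0.0, 8.0)
assert safe_policy("front", 4.0, (0.5, 8.0)) == (0.5, 4.0)
```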
- Results:
  - `35` disengagements for `280` merging manoeuvres: an activation rate of `12.5%`.
  - No accident, but over-conservative:
    - "`3` of the `280` cases result in a `failed` merging, meaning that the car does not manage to merge before the off-ramp."
-
"PILOT: Planning by Imitation Learning and Optimisation"
-
[
2020
] [📝] [ 🎓University of Oxford
,University of Edinburgh
] [ 🚗FiveAI
] -
[
optimisation-based planner
,coords transform
,DAgger
,runtime
]
Main idea: speed up a two-stage motion planner by replacing the first, expensive optimisation with a net trained to imitate it. Hence the name PILOT = Planning by Imitation Learning and Optimisation. Note the use of a transform and inverse transform to solve the problem in the reference-path-based coordinate frame. Source.
Authors: Pulver, H., Eiras, F., Carozza, L., Hawasly, M., Albrecht, S., & Ramamoorthy, S.
-
Previous non-learning-based work:

- An Optimization-based Motion Planner for Safe Autonomous Driving, (Eiras et al., 2020) 🎞️.
- Motion planning problem solved in two stages:
  - 1- Solve a linearised version.
    - Mixed-Integer Linear Programming (`MILP`) approximation, solved with `Gurobi`.
    - "Considering a linear dynamical system `xk+1 = F∆t(xk, uk)` under a non-holonomic vehicle model and mixed-integer collision avoidance constraints."
    - "Instead of the quadratic terms, the `cost` function is approximated by the `| · |` operator."
  - 2- Use this first solution to initialize a second, non-linear optimizer.
    - Nonlinear Programming (`NLP`) formulation, solved with `IPOPT`.
    - "This second optimisation stage ensures that the output trajectory is smooth and feasible, while maintaining safety guarantees."
- "We show that initializing the `NLP` stage with the `MILP` solution leads to better convergence, lower costs, and outperforms a state-of-the-art Nonlinear Model Predictive Control (`NMPC`) baseline in both progress and comfort metrics."
- Motivations:
  - 1- The `MILP` initialization is time-consuming.
    - The idea is to perform imitation learning offline to imitate this expensive-to-run optimisation-based planner.
    - "`PILOT` shows a clear advantage in runtime efficiency (savings of `∼78%` and `84%` respectively) with no significant deterioration in solution quality (drop of `∼3%` and `4%` respectively)."
  - 2- Still guarantee the satisfaction of safety and comfort requirements.
    - I.e. keep the `NLP` as a second stage.
    - "A constrained optimisation stage ensures, online, that the output is feasible and conforms to various constraints related to road rules, safety and passenger comfort."
    - What if the proposal of the "cloner" is bad?
      - "Most of them suggest `end-to-end` learning approaches that do not include an optimisation stage at runtime, which has the potential to create suboptimal solutions if the network yields a poor result."
  - Both stages share the same objective.
    - "To guarantee the safety of the output of the system, we employ an efficient optimisation step that uses the same objective function as the expert but benefits from the network output as a warm-start."
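The warm-start principle can be illustrated on a toy 1-D non-convex objective. Here a cheap coarse first stage (a grid search, standing in for the `MILP` approximation, or for PILOT's network) initialises a local second stage (gradient descent, standing in for the `NLP`) that shares the same objective; everything below is an illustrative analogy, not the paper's solvers.

```python
# Toy illustration of the two-stage idea: a coarse, cheap first stage provides
# a warm-start for a local second-stage optimiser sharing the SAME objective.
# (Grid search and gradient descent stand in for MILP/Gurobi and NLP/IPOPT.)

def cost(x):
    # non-convex objective: two basins, the better one near x ≈ -1
    return (x ** 2 - 1) ** 2 + 0.3 * x

def grad(x, eps=1e-6):
    # numerical gradient, good enough for the sketch
    return (cost(x + eps) - cost(x - eps)) / (2 * eps)

def stage1_coarse(lo=-2.0, hi=2.0, n=41):
    # cheap global approximation of the objective on a coarse grid
    grid = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    return min(grid, key=cost)

def stage2_refine(x0, lr=0.05, steps=500):
    # local refinement warm-started at x0
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

x_warm = stage2_refine(stage1_coarse())   # warm-started
x_cold = stage2_refine(2.0)               # poor initialisation: wrong basin

# the warm-started run reaches the better basin
assert cost(x_warm) < cost(x_cold)
```

The point mirrors the paper's claim: the second stage alone converges to whatever basin its initialisation lies in, so a good (or learnt) first stage improves both convergence and final cost.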
- About `safety`: the authors mention the diversity of possible definitions:
  - 1- Maintaining the autonomous system inside a safe subset of possible future `states`.
  - 2- Preventing the system from breaking domain-specific constraints.
  - 3- Exhibiting behaviours that "match" the safe behaviour of an expert.
- About `behavioural cloning`:
  - In-the-loop Dataset Aggregation (`DAgger`).
  - The expert planner can be the expensive `MILP` stage, or the `CARLA` `autopilot` can be queried (no human expert).
  - The expert is not better-informed.
- `input`:
  - 1- A `BEV` including the ego vehicle, other road users and the relevant features of the static layout.
    - "Similar to the input representation in previous works, e.g. `ChauffeurNet`, with the exception that in our case the static layout information is repeated on all channels."
  - 2- The `route` plan as a `reference path`.
    - In case of multiple lanes, e.g. on highways, what is the reference? The rightmost lane?
  - 3- Predicted traces for all road users, provided by a prediction module.
    - The world `state`, together with the `predictions`, is projected into a `reference path`-based coordinate frame.
  - "[Hybrid aggregation] Additional information of the planning problem that is not visualised in the top-down images (such as the `initial speed` of the ego vehicle) is appended as scalar inputs, along with the flattened `CNN` output, to the first dense layer of the network."
- `output`:
  - A trajectory `ρ` in the `reference path` coordinate frame.
    - It is then transformed back to the global coordinate frame by the inverse transform.
  - [`smoothing`] "To enforce smoothness in the output, an alternative is to train the network to produce parameters of a smooth function family over time, which can then be sampled to obtain `ρ`."
  - What about the time duration between two positions?
  - How do they get the expert's time-stamped positions from `CARLA`? The autopilot only produces one-step `actions`, not `WP` sequences.
"Learning Scalable Self-Driving Policies for Generic Traffic Scenarios"
- [`BEV`, `VAE`, `GAT`, `closed-loop evaluation`, `generalization`, `CARLA`]
The core idea is to encode the BEV with a VAE, and then to incorporate this latent vector, together with state vectors (e.g. locations and speeds), into GNNs to model the complex interactions among road agents. The resulting state is finally fed to a policy net that clones demonstrations and produces high-level commands (target speed vT and course angle θT), implemented by a PID controller. Source.
Authors: Cai, P., Wang, H., Sun, Y., & Liu, M.
- Motivations:
  - 1- Derive a `policy` that can generalize to multiple scenarios.
    - As opposed to most works, which address one particular scenario such as `T-intersections`.
      - A state machine could aggregate the scattered individual cases.
    - "Current learning models have not been well designed for scalable self-driving in a uniform setup."
    - "We propose a graph-based deep network to achieve unified and scalable self-driving in diverse dynamic environments."
  - 2- Obey traffic rules such as speed limits and traffic lights.
    - As opposed to works that neglect the environmental structures.
      - "... which is not that important for indoor robot navigation in restricted areas, but is non-negligible for outdoor self-driving problems where traffic rules should be strictly obeyed."
  - 3- Open-loop evaluation of `BC` is not always a good indicator.
    - "As reported in `ChauffeurNet` and Exploring the limitations of behavior cloning for autonomous driving, the offline evaluation of driving models may not reflect the real driving performance."
    - For instance, an `MLP`-based model has smaller or similar prediction errors than `GNN`-based methods, but its closed-loop performance is much worse.
    - "We conduct `19,200` episodes to thoroughly evaluate `8` driving models (over `7,500` km)."
- Key idea: `mid-to-mid` behavioural cloning.
  - The driving scene is encoded as a semantic `BEV`, as in `ChauffeurNet`.
  - Benefits of a `BEV`:
    - 1- It can cope with an arbitrary number of objects.
    - 2- It can include environmental structures, such as the drivable area and lane markings.
      - "Very helpful because it provides valuable structural priors on the motion of surrounding road agents. For example, vehicles normally drive on lanes rather than on sidewalks. Another benefit is that vehicles need to drive according to traffic rules, such as not crossing solid lane markings."
    - 3- `sim-2-real`: there is no domain difference between the simulation and the real world, so the policy transfer problem can be alleviated.
- Output:
  - "We adopt a mid-level output representation indicating the target `speed` `vT` and course angle `θT`, rather than direct vehicle control commands (e.g. `steering` and `throttle`)."
  - 1- First, the `policy` regresses two scalars:
    - `κv` in [`0`, `1`]
    - `κc` in [`−1`, `1`]
  - 2- Then the target control values can be computed:
    - `vT` = `vlim` × `κv`
    - `θT` = `90°` × `κc`
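The mapping from the regressed scalars to the controller targets is a direct rescaling; a minimal sketch (degrees for the course angle, and `v_lim` in whatever unit the speed limit uses):

```python
def target_commands(kappa_v, kappa_c, v_lim):
    """kappa_v in [0, 1], kappa_c in [-1, 1] -> (target speed, course angle in deg)."""
    assert 0.0 <= kappa_v <= 1.0 and -1.0 <= kappa_c <= 1.0
    return v_lim * kappa_v, 90.0 * kappa_c

# e.g. half the speed limit, and a course angle a quarter of the way to -90°
assert target_commands(0.5, -0.5, 50.0) == (25.0, -45.0)
```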
- Should the `BEV` be used directly as an input for the "cloner" net?
  - No, it is too big!
    - "Such high dimensionality makes it not only hard to learn good policies in data-scarce tasks, but also suffer from over-fitting problems. Therefore, a lower-dimensional embedding for the multi-channel `BEV` input is needed for us to train a high-performance driving policy."
  - Rather, some compressed version of it.
    - A variational autoencoder (`VAE`) is trained to:
      - 1- Reconstruct the input `BEV`.
      - 2- Make sure that the latent space is regular enough, with a standard normal distribution `N(0, I)`.
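The second objective is the standard VAE KL regulariser towards `N(0, I)`; for a diagonal Gaussian encoder it has a well-known closed form, sketched here in pure Python:

```python
# KL divergence between the encoder's diagonal Gaussian N(µ, σ²) and the
# prior N(0, I) — the standard VAE identity:
#   KL = 0.5 · Σ_i (µ_i² + σ_i² − 1 − log σ_i²)
import math

def kl_to_standard_normal(mu, sigma):
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

# a latent code matching the prior pays no penalty...
assert kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]) == 0.0
# ...while any deviation is penalised (KL is always >= 0)
assert kl_to_standard_normal([1.0, -0.5], [0.5, 2.0]) > 0.0
```

Minimising this term (alongside the `BEV` reconstruction loss) is what keeps the latent space regular.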
- Input of the cloning `policy`: a fusion of:
  - 1- Processed (`MLP`) `ego-route`.
    - "The first waypoint is the closest waypoint in `Gf` to the current vehicle location, and the distance of every two adjacent points is `0.4` m."
    - "Note the redundancy of route information in `G` and `BEV` is meaningful, where the `state` vector here provides more specific waypoint locations for the vehicle to follow, while the `route` mask of `BEV` can indicate if there are any obstacles on conflicting lanes of planned routes, as well as encode traffic light information."
  - 2- Processed (`MLP`) ego-motion vector `m`:
    - Current control command (`steering`, `throttle` and `brake`).
    - `speed limit`.
    - Difference between this `speed limit` and the current `speed`.
    - Cross-track error and `heading angle` error.
    - Two binary indicators of whether the neighbouring line is crossable (a `broken` line) or not (a `solid` line).
  - 3- Processed (`MLP`) graph (two `GAT` layers) fed with:
    - The context embedding, i.e. the `BEV` encoded by the `VAE`.
    - Processed (`MLP`) `node` information for each vehicle: `location`, distance to the ego vehicle, `yaw` angle, `velocity`, `acceleration` and `size`.
    - "We are interested in the result `ho1` of the first `node`, which represents the influence on the ego vehicle."
- `BC` task:
  - Dataset: `260` episodes = `7.6` hours = `150` km.
  - `L1` loss in terms of `vT` and `θT`.
    - Weighting between the two?
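A sketch of such a weighted `L1` objective on a single sample; since the paper's relative weighting is not stated here, the weights `w_v` and `w_theta` are placeholders (equal by default):

```python
# L1 loss on the two regressed targets (vT, θT); weights are illustrative.
def l1_loss(pred, target, w_v=1.0, w_theta=1.0):
    (v_p, th_p), (v_t, th_t) = pred, target
    return w_v * abs(v_p - v_t) + w_theta * abs(th_p - th_t)

assert l1_loss((10.0, 5.0), (12.0, 2.0)) == 5.0   # |10-12| + |5-2|
```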
- About graph convolutional networks (`GCN`s):
  - "They generalize the `2D` convolution on grids to graph-structured data. When training a `GCN`, a fixed adjacency matrix is commonly adopted to aggregate feature information of neighboring `nodes`."
  - "Graph attention network (`GAT`) is a `GCN` variant which aggregates node information with weights learned in a self-attention mechanism. Such adaptiveness of `GAT` makes it more effective than `GCN` in graph representation learning."
  - `GAT` is used here to model the interaction among road agents during driving, and is composed of multiple graph layers.
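The `GAT` aggregation principle can be sketched as follows; a toy dot-product scorer stands in for the learned attention MLP, so this is not the paper's exact layer, only the mechanism (softmax-normalised, input-dependent neighbour weights instead of a fixed adjacency matrix):

```python
# Minimal GAT-style aggregation for one node (illustrative sketch).
import math

def gat_aggregate(node_feats, neighbours):
    """Aggregate neighbour features of node 0 with attention weights."""
    h0 = node_feats[0]
    # attention scores: here a plain dot product stands in for the learned scorer
    scores = [sum(a * b for a, b in zip(h0, node_feats[j])) for j in neighbours]
    # softmax over the neighbourhood
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    alphas = [e / z for e in exp]
    # attention-weighted sum of neighbour features
    dim = len(h0)
    out = [sum(alphas[k] * node_feats[j][d] for k, j in enumerate(neighbours))
           for d in range(dim)]
    return out, alphas

feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # node 0 = ego vehicle
h_out, alphas = gat_aggregate(feats, neighbours=[1, 2])

assert abs(sum(alphas) - 1.0) < 1e-12           # weights form a distribution
assert alphas[0] > alphas[1]                    # node 1 is more similar to ego
```

In a `GCN` the weights would be fixed by the adjacency matrix; here they adapt to the node features, which is the adaptiveness the quote refers to.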
- To sum up, the ingredients of `DiGNet` ("driving in graphs"):
  - 1- Semantic bird's-eye view (`BEV`) images to model road topologies and traffic rules.
    - `7` channels of `80`×`140`:
      - HD map: `drivable area` and lane markings (`solid` and `broken` lines).
      - `ego-route`.
      - Objects: `ego-vehicle`, other `vehicles` and `pedestrians`.
      - No `traffic light`?
  - 2- Variational auto-encoder (`VAE`) to extract effective and interpretable environmental features.
  - 3- Graph attention networks (`GAT`) to model the complex interactions among traffic agents.
"Driving Through Ghosts: Behavioral Cloning with False Positives"
- [`uncertainty`, `false positive`, `high recall`, `noise injection`]
The observation is represented in a bird's-eye-view grid, where detections are coded as pairs: each dimension represents an object or feature type together with the respective estimated confidence. To build the grid, ghosts are added based on a sampled confidence value. Source.
A threshold on the confidence scores can be hand-crafted, but this leads to conservative or inconsistent behaviours. The Soft BEV agent, on the other hand, remains invariant to the noise in the input features: it is able to ignore ghosts yet stop before real vehicles, without the need for tuning. Source.
Authors: Bühler, A., Gaidon, A., Cramariuc, A., Ambrus, R., Rosman, G., & Burgard, W.
- Motivations:
  - 1- Reduce conservative behaviour in uncertain situations, while still acting safely.
    - In particular, better deal with `false positives` without the need for thresholds.
  - 2- Focus on uncertain `inputs`, i.e. imperfect perception, as opposed to dynamics or intention uncertainties for instance.
    - "We focus on critical perception mistakes in the form of detection `false positives`, as this is typically the main failure mode of safe systems biased towards negligible `false negative` rates (as `false negatives` are irrecoverable in `mediated perception` approaches)."
  - 3- Aim at robustness.
    - In particular, cope with the fact that the `noise distribution` during `training` may differ from the `test` one.
    - "In the case of imitation learning, perceptual errors at training time can lead to learning difficulties, as expert demonstrations might be inconsistent with the world state perceived by the vehicle."
    - "We therefore show that our `Soft BEV` model is able to maintain the same level of safety despite not knowing the true underlying noise distribution, and even when this distribution diverges from the training distribution, which is the more realistic scenario."
- Behavioural cloning.
  - Challenge: it operates over a different input space.
    - The expert has access to the ground truth, while the robot may suffer from false-positive perceptions.
    - "A key challenge lies in potential inconsistencies between observations `o` and the true state `s`, for instance in the presence of false positives in `o`."
- How to represent/encode the perceptual uncertainty?
  - Probabilistic bird's-eye-view (`Soft BEV`) grid.
  - Why `soft`?
    - Because `observations` are pairs: the object presence is not binary, but has a `confidence value` in [`0`; `1`].
    - "We model observations `o = (ˆs, c)` as pairs of estimated perceptual states `ˆs` and black-box `confidence values` `c` in [`0`; `1`] for each `state` variable."
  - About the training noise:
    - "We do not make explicit assumptions about the distribution of `ˆs` w.r.t. the true state `s`. Instead, we assume the perception system is tuned for high recall, i.e. that all critical state variables are (noisily) captured in the estimated state. This comes at the potential expense of false positives, but corresponds to the practical setup where safety requirements are generally designed to avoid partial observability issues, as false negatives are hard to recover from."
    - "We use two truncated `Normal` distributions (values between `0` and `1`) to simulate the confidence score generated by a generic perception stack, and approximate the perception system's confidence estimate on its failure modes."
    - "After sampling true-positive confidence values for the actual objects, we sample ghost objects in the vicinity of the ego agent with a probability of `10%` and sample a confidence value for this false-positive object as well."
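A sketch of this noise-injection recipe. The means and standard deviations of the two truncated Normals are illustrative guesses; only the two-distribution structure and the 10% ghost probability come from the paper.

```python
# Training-noise injection sketch (assumed parameters for the two modes).
import random

def truncated_normal(rng, mu, sigma, lo=0.0, hi=1.0):
    # simple rejection sampling to truncate the Normal to [lo, hi]
    while True:
        x = rng.gauss(mu, sigma)
        if lo <= x <= hi:
            return x

def add_confidences(objects, rng, p_ghost=0.10):
    """Attach a confidence to each real object; maybe inject one ghost."""
    obs = [(obj, truncated_normal(rng, 0.8, 0.15)) for obj in objects]
    if rng.random() < p_ghost:
        obs.append(("ghost", truncated_normal(rng, 0.4, 0.2)))
    return obs

rng = random.Random(42)
obs = add_confidences(["car", "pedestrian"], rng)

assert all(0.0 <= c <= 1.0 for _, c in obs)     # confidences are valid
assert len(obs) in (2, 3)                       # a ghost may or may not appear
```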
- Two agents:
  - 1- The `BEV` agent is trained on ground-truth, noise-free data, meaning it has no notion of uncertainty.
  - 2- The `Soft BEV` agent is trained on noisy data, and its input is the `Soft BEV` grid.
- Test setup:
  - NoCrash CARLA Benchmark. Unfortunately for an old version: `carla 0.8.4`. A non-official version for `0.9.6` is available.
  - Metrics include the `time to completion`.
  - How to assess the `average speed` of the agent in scenarios where traffic lights or jams require the agent to stop?
    - "To reduce the influence of traffic lights, we only consider `velocities` above `1 m/s` when computing the `average velocity`. The reason for this is that traffic lights act as a binary gate and essentially either increase variance significantly, or eliminate the influence of speed difference along the routes. Similarly, we only consider `accelerations` with an absolute value above zero."
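This metric filtering is easy to sketch: velocities below the 1 m/s threshold (waiting at red lights or in jams) are excluded before averaging, so the metric reflects actual driving speed rather than traffic-light gating.

```python
def average_moving_velocity(velocities, threshold=1.0):
    """Average velocity over samples above `threshold` (in m/s)."""
    moving = [v for v in velocities if v > threshold]
    return sum(moving) / len(moving) if moving else 0.0

# a long stop at a red light does not drag the average down
assert average_moving_velocity([0.0, 0.0, 0.0, 8.0, 10.0]) == 9.0
```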
  - About the testing noise:
    - "To model temporal consistency of the false positives we employ a survival model. In other words, a false positive is created with probability `pFP`, and it will be present in the consecutive frames with a decaying probability. In our experiments, we set the decay rate to `0.8`, thus allowing false positives to typically survive `2-3` frames."
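A sketch of this survival model under one plausible reading of "decay rate 0.8" (the per-frame survival probability is multiplied by 0.8 each frame); under this assumption a ghost indeed typically lives a few frames, matching the paper's "2-3 frames" remark.

```python
# Survival model for false positives (illustrative interpretation).
import random

def ghost_lifetime(rng, p0=0.8, decay=0.8):
    """Number of frames a false positive is present, creation frame included."""
    frames, p = 1, p0
    while rng.random() < p:   # survives into the next frame with prob p
        frames += 1
        p *= decay            # the survival probability decays each frame
    return frames

rng = random.Random(0)
lifetimes = [ghost_lifetime(rng) for _ in range(20000)]
mean = sum(lifetimes) / len(lifetimes)
assert 2.0 < mean < 3.5       # typically 2-3 frames, as in the paper
```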
"Action-based Representation Learning for Autonomous Driving"
- [`representation learning`, `affordances`, `self-supervised`, `pre-training`]
The so-called direct perception approach uses affordances to select low-level driving commands. Examples of affordances include "is there a car in my lane within 10 m?" or the relative angle between the car's heading and the lane. This approach offers interpretability. The question here is "how to efficiently extract these affordances?". For this supervised classification/regression task, the encoder is first pre-trained on another (proxy) task involving learning from action demonstrations (e.g. behavioural cloning). The intuition is that, for a learnt BC model able to take good driving decisions most of the time, the relevant information should have been captured in its encoder. Source.
Different self-supervised learning tasks based on action prediction can be used to produce the encoder. Another idea is to use models trained with supervised learning, for instance ResNet, whose encoder performs worse. Probably because of the synthetic images? Source.
Four affordances are predicted from images. They represent the explicit detection of hazards involving pedestrians and vehicles, respecting traffic lights and considering the heading of the vehicle within the current lane. PID controllers convert these affordances into throttle, brake and steering commands. Source.
Authors: Xiao, Y., Codevilla, F., Pal, C., & López, A. M.
- One sentence:
  - "Expert demonstrations can act as an effective `action`-based representation learning technique."
- Motivations:
  - 1- Leverage driving demonstration data, which can be easily obtained by simply recording the `actions` of good drivers.
  - 2- Be more interpretable than pure `end-to-end` imitation methods.
  - 3- Be less annotation-dependent, i.e. rely preferably on self-supervision and try to reduce data supervision (i.e. human annotation).
    - "Our method uses much less densely annotated data and does not use dataset aggregation (`DAgger`)."
- Main idea:
  - Use both manually annotated data (`supervised learning`) and expert demonstrations (`imitation learning`) to learn to extract `affordances` from images (`representation learning`).
  - This combination is beneficial compared to taking each approach separately:
    - 1- Pure `end-to-end` imitation methods, such as `behavioural cloning`, could be used to directly predict `control` actions (`throttle`, `brake`, `steering`).
      - "In this pure data-centered approach, the supervision required to train deep `end-to-end` driving models does not come from human annotation; instead, the vehicle's state variables are used as self-supervision (e.g. `speed`, `steering`, `acceleration`, `braking`) since these can be automatically collected from fleets of human-driven vehicles."
      - But it lacks interpretability, can have difficulty dealing with spurious correlations, and training may be unstable.
    - 2- Learning to extract the `affordances` from scratch would not be very efficient.
      - One could use a pre-trained backbone such as `ResNet`. But it does not necessarily capture the information of the driving scene that is relevant for making decisions.
      - "Action-based pre-training (`Forward`/`Inverse`/`BC`) outperforms all the other reported pre-training strategies (e.g. `ImageNet`). However, we see that the action-based pre-training is mostly beneficial to help on reliably estimating the vehicle's relative angle with respect to the road."
- About `driving affordances` and the `direct perception` approach:
  - "A different paradigm, conceptually midway between pure modular and end-to-end driving ones, is the so-called `direct perception` approach, which focuses on learning deep models to predict driving `affordances`, from which an additional controller can maneuver the AV. In general, such `affordances` can be understood as a relatively small set of interpretable variables describing events that are relevant for an agent acting in an environment. Driving `affordances` bring interpretability while only requiring weak supervision, in particular, human annotations just at the image level (i.e. not pixel-wise)."
  - Four `affordances` used by a rule-based controller to select `throttle`, `brake` and `steering` commands:
    - 1- Pedestrian hazard (`bool`): is there a pedestrian in our lane at a distance lower than `10 m`?
    - 2- Vehicle hazard (`bool`): is there a vehicle in our lane at a distance lower than `10 m`?
    - 3- Red traffic light (`bool`): is there a traffic light in red affecting our lane at a distance lower than `10 m`?
    - 4- Relative heading angle (`float`): the relative angle of the longitudinal vehicle axis with respect to the lane in which it is navigating.
  - Extracting these `affordances` from the sensor data is called "representation learning".
  - Main idea here:
    - Learn to extract these `affordances` (supervised learning), with the encoder being pre-trained on the task of `end-to-end` driving, e.g. `BC`.
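A hedged sketch of how a controller on top of the four affordances could look. The paper uses PID controllers; the braking rule, the proportional gain and the "cruise" throttle value below are my own illustrative choices, not the paper's.

```python
# Rule-based controller over the four affordances (illustrative sketch).
def controller(pedestrian_hazard, vehicle_hazard, red_light,
               rel_heading_deg, k_p=0.05):
    """Map the four affordances to (throttle, brake, steering)."""
    if pedestrian_hazard or vehicle_hazard or red_light:
        return 0.0, 1.0, 0.0                 # any hazard within 10 m: brake
    steering = -k_p * rel_heading_deg        # P-term steering back to the lane
    return 0.5, 0.0, steering                # otherwise cruise

assert controller(False, False, True, 0.0) == (0.0, 1.0, 0.0)   # red light: stop
assert controller(False, False, False, 10.0) == (0.5, 0.0, -0.5)
```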
- Two stages with two datasets: a self-supervised dataset and a weakly-supervised dataset.
  - 1- Train an `end-to-end` driving model (e.g. `BC`) from (e.g. expert) demonstrations.
    - The final layers for `action` prediction are discarded.
    - Only the first part, i.e. the encoder, is used for the second stage.
    - It is called "self-supervised" since it does not require external manual annotations.
  - 2- Use this pre-trained encoder, together with a multi-layer perceptron (`MLP`), to predict `affordances`.
    - The pre-training (stage `1`) is beneficial because all the relevant information for driving should have been extracted by the encoder.
    - It is called "weakly-supervised" since it only requires image-level `affordance` annotations, i.e. not pixel-level ones.
- Why is it called an "`action`-based" method?
  - So far, I mentioned `behaviour cloning` (`BC`) as a learning method that focuses on predicting the control `actions` and whose learnt encoder can be reused.
  - Another instance: `inverse dynamics` models.
    - "Predicting the next states of an agent, or the action between state transitions, yields useful representations."
  - [Is it a good idea to use random `actions`?] "We show that learning from expert data in our approach leads to better representations compared to training `inverse dynamics` models. This shows that expert driving data (i.e. coming from human drivers) is an important source for representation learning."
"Learning to drive by imitation: an overview of deep behavior cloning methods"
- [`2020`] [📝] [🎓 `University of Moncton`]
- [`overview`]
Some simulators and datasets for supervised learning of end-to-end driving. Source.
Instead of just single front-view camera frames (top and left), other sensor modalities can be used as inputs, for instance event-based cameras (bottom-right). Source.
The temporal evolution of the scene can be captured by considering a sequence of past frames. Source.
Other approaches also address the longitudinal control (top and right), while some try to exploit intermediate representations (bottom-left). Source.
Authors: Ly, A. O., & Akhloufi, M.
- Motivation:
  - An overview of current state-of-the-art deep `behaviour cloning` methods for lane-stable driving.
  - [No `RL`] "By `end-to-end`, we mean supervised methods that map raw visual inputs to low-level (`steering` angle, `speed`, etc.) or high-level (driving `path`, driving `intention`, etc.) actuation commands using almost only deep networks."
- Five classes of methods:
  - 1- Pure imitative methods that make use of vanilla `CNNs` and take standard camera frames only as input.
    - The loss can be computed using the Mean Squared Error (`MSE`) between the predictions and the `steering` labels.
    - "Recovery from mistakes is made possible by adding synthesized data during training via simulations of car deviation from the center of the lane."
    - "Data augmentation was performed using a basic viewpoint transformation with the help of the `left` and `right` cameras."
  - 2- Models that use other types of perceptual sensors, such as `event-based` or `fisheye` cameras.
    - "A more realistic label augmentation is achieved with the help of the wide range of captures from the front fisheye camera compared to previous methods using shearing with side (right and left) cameras."
    - "`Event`-based cameras consist of independent pixels that record intensity variation in an asynchronous way. Thus, giving more information in a time interval than traditional video cameras where changes taking place between two consecutive frames are not captured."
  - 3- Methods that consider previous driving history to estimate future driving commands.
  - 4- Systems that predict both `lateral` and `longitudinal` control commands.
    - "It outputs the vehicle `curvature` instead of the `steering angle` as generally found in the literature, which is justified by the fact that `curvature` is more general and does not vary from vehicle to vehicle."
  - 5- Techniques that leverage the power of `mid-level` representations for transfer learning, or give more explanation with regard to the taken actions.
    - "The motivation behind using a `VAE` architecture is to automatically mitigate the bias issue, which occurs because the driving scenes in the datasets generally do not have the same proportions. In previous methods, this issue is solved by manually reducing the over-represented scenes such as `straight driving` or `stops`."
- Some take-aways:
  - "Models making use of non-standard cameras or intermediate representations are showing a lot of potential in comparison to pure imitative methods that take conventional video frames as input."
  - "The diversity in metrics and datasets used for reporting the results makes it very hard to strictly weigh the different models against each other."
  - Explainability and transparency of the taken decisions are important.
    - "A common approach in the literature is to analyse the pixels that lead to the greatest activation of neurons."
"Advisable Learning for Self-driving Vehicles by Internalizing Observation-to-Action Rules"
- [`2020`] [📝] [🎓 `UC Berkeley`]
- [`attention`, `advisability`]
Authors: Kim, J., Moon, S., Rohrbach, A., Darrell, T., & Canny, J.
-
Related PhD thesis: Explainable and Advisable Learning for Self-driving Vehicles, (Kim. J, 2020).

- Motivation:
  - An `end-to-end` model should be explainable, i.e. provide easy-to-interpret rationales for its behaviour:
    - 1- Summarize the visual observations (input) in natural language, e.g. "light is red".
      - `Visual attention` is not enough; verbalizing is needed.
    - 2- Predict an appropriate action response, e.g. "I see a pedestrian crossing, so I stop".
      - I.e. justify the decisions that are made and explain why they are reasonable in a human-understandable manner, i.e., again, in natural language.
    - 3- Predict a control signal accordingly.
      - The command is conditioned on the predicted high-level action command, e.g. "maintain a slow speed".
      - The output is a sequence of waypoints, hence `end-to-mid`.
- About the dataset:
  - Berkeley `DeepDrive-eXplanation` (`BDD-X`) dataset (by the first author).
  - Together with camera front-views and the IMU signal, the dataset provides:
    - 1- Textual descriptions of the vehicle's actions: what the driver is doing.
    - 2- Textual explanations for the driver's actions: why the driver took that action, from the point of view of a driving instructor.
      - For instance the pair: ("the car slows down", "because it is approaching an intersection").
"Feudal Steering: Hierarchical Learning for Steering Angle Prediction"
-
[
2020
] [📝] [ 🎓Rutgers University
] [ 🚗Lockheed Martin
] -
[
hierarchical learning
,temporal abstraction
,t-SNE embedding
]
Click to expand
Feudal learning for steering prediction. The worker decides the next steering angle conditioned on a goal (subroutine id ) determined by the manager. The manager learns to predict these subroutine ids from a sequence of past states (brake , steer , throttle ). The ground truth subroutine ids are the centres of centroids obtained by unsupervised clustering. They should contain observable semantic meaning in terms of driving tasks. Source. |
Authors: Johnson, F., & Dana, K.
-
Note: Although terms and ideas from hierarchical reinforcement learning (
HRL
) are used, noRL
is applied here! -
Motivation: Temporal abstraction.
- Problems in
RL
: delayed rewards and sparse credit assignment. - Some solutions: intrinsic rewards and temporal abstraction.
- The idea of temporal abstraction is to break down the problem into more tractable pieces:
-
"At all times, human drivers are paying attention to two levels of their environment. The first level goal is on a finer grain: don’t hit obstacles in the immediate vicinity of the vehicle. The second level goal is on a coarser grain: plan actions a few steps ahead to maintain the proper course efficiently."
-
- Problems in
-
The idea of feudal learning is to divide the task into:
1-
A manager network.- It operates at a lower temporal resolution and produces
goal
vectors that it passes to the worker network. - This
goal
vector should encapsulate a temporally extended action called asubroutine
,skill
,option
, ormacro-action
. - Input: Sequence of previous
steering
. - Output:
goal
.
- It operates at a lower temporal resolution and produces
2-
A worker network: conditioned on thegoal
decided by the manager.- Input:
goal
decided by the manager,previous own prediction
, sequence offrames
. - Output:
steering
.
- Input:
- The
subroutine ids
(manager net) and thesteering angle
prediction (worker net) are jointly learnt.
-
What are the ground truth
goal
used to train the manager?- They are ids of the centres of centroids formed by clustering (unsupervised learning) all the training data:
1-
Data:Steering
,braking
, andthrottle
data are concatenated everym=10
time steps to make a vector of length3m=30
.2-
Encoding: projected in at-SNE
2d
-space.3-
Clustering:K-means
.- The
2d
-coordinates of centroids of the clusters are thesubroutine ids
, i.e. the possiblegoals
.- How do they convert the
2d
-coordinates into a single scalar?
- How do they convert the
-
"We aim to classify the
steering
angles into their temporally abstracted subroutines, also calledoptions
ormacro-actions
, associated with highway driving such asfollow the sharp right bend
,bumper-to-bumper traffic
,bear left slightly
."
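The windowing + clustering pipeline above can be sketched in a few lines. This is a minimal pure-Python sketch, not the authors' code: the paper additionally embeds the `30`-d windows with `t-SNE` into `2d` before clustering, which is skipped here for brevity, and all function names are illustrative.

```python
import random

def make_windows(steer, brake, throttle, m=10):
    """Concatenate steering, braking and throttle over m consecutive
    time steps into vectors of length 3*m (30 for m=10)."""
    windows = []
    for i in range(0, len(steer) - m + 1, m):
        windows.append(list(steer[i:i+m]) + list(brake[i:i+m]) + list(throttle[i:i+m]))
    return windows

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; the resulting centroids play the role of the
    subroutine ids (goals) the manager chooses among."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster emptied
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids
```

In the paper, each cluster centre is then expected to carry a semantic driving meaning (e.g. "follow the sharp right bend"); here the centroids are just vectors.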
-
What are the decision frequencies?
- The manager considers the last
10
actions to decide thegoal
. - It seems like a smoothing process, where a window is applied?
- It should be possible to achieve that with a recurrent net, shouldn't it?
-
About
t-SNE
:-
"
t
-Distributed Stochastic Neighbor Embedding (t-SNE
) is an unsupervised, non-linear technique primarily used for data exploration and visualizing high-dimensional data. In simpler terms,t-SNE
gives you a feel or intuition of how the data is arranged in a high-dimensional space [fromtowardsdatascience
]." - Here it is used as an embedding space for the driving data and as the subroutine ids themselves.
-
"A Survey of End-to-End Driving: Architectures and Training Methods"
-
[
2020
] [📝] [ 🎓University of Tartu
] -
[
review
,distribution shift problem
,domain adaptation
,mid-to-mid
]
Click to expand
Left: example of end-to-end architecture with key terms. Right: difference between open-loop and closed-loop evaluation. Source. |
Authors: Tampuu, A., Semikin, M., Muhammad, N., Fishman, D., & Matiisen, T.
-
A rich literature overview and some useful reminders about general
IL
andRL
concepts with focus toAD
applications.- It constitutes a good complement to the "Related trends in research" part of my video "From RL to Inverse Reinforcement Learning: Intuitions, Concepts + Applications to Autonomous Driving".
-
I especially like the structure of the document: It shows what one should consider when starting an
end-to-end
/IL
project forAD
:- I have just noted here some ideas I find interesting. In no way an exhaustive summary!
-
1-
Learning methods: working withrewards
(RL
) or withlosses
(behavioural cloning
).- About
distribution shift problem
inbehavioural cloning
:-
"If the driving decisions lead to unseen situations (not present in the training set), the model might no longer know how to behave".
- Most solutions try to diversify the training data in some way - either by collecting or generating additional data:
data augmentation
: e.g. one can place two additional cameras pointing forward-left and forward-right and associate the images with commands toturn right
andturn left
respectively.data diversification
: addition of temporally correlated noise and synthetic trajectory perturbations. Easier on "semantic" inputs than on camera inputs.on-policy learning
: recovery annotation andDAgger
. The expert provides examples how to solve situations the model-driving leads to. Also "Learning by cheating" by (Chen et al. 2019).balancing the dataset
: by upsampling the rarely occurring angles, downsampling the common ones or by weighting the samples.-
"Commonly, the collected datasets contain large amounts of repeated traffic situations and only few of those rare events."
- The authors claim that only the joint distribution of
inputs
andoutputs
defines the rarity of a data point. -
"Using more training data from
CARLA
Town1
decreases generalization ability inTown2
. This illustrates that more data without more diversity is not useful." - Ideas for augmentation can be taken from the field of
supervised Learning
where it is already a largely-addressed topic.
-
-
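The augmentation and balancing tricks listed above can be illustrated with a small sketch. The steering offset, bin edges and helper names below are my own assumptions for illustration, not values from the survey:

```python
import random

def augment_side_cameras(samples, offset=0.2):
    """Hypothetical three-camera trick: a left (resp. right) view is
    relabelled with a corrected command steering back towards the
    centre. `offset` is an assumed, tunable correction."""
    out = []
    for frame, steer in samples:
        out.append((frame, steer))                      # centre camera
        out.append((frame + "_left", steer + offset))   # steer back to the right
        out.append((frame + "_right", steer - offset))  # steer back to the left
    return out

def balance_by_upsampling(samples, n_bins=5, seed=0):
    """Upsample rare steering bins so every non-empty bin contributes
    the same number of samples (steering assumed in [-1, 1])."""
    rng = random.Random(seed)
    bins = {}
    for s in samples:
        b = min(int((s[1] + 1.0) / 2.0 * n_bins), n_bins - 1)
        bins.setdefault(b, []).append(s)
    target = max(len(v) for v in bins.values())
    balanced = []
    for v in bins.values():
        balanced.extend(v)
        balanced.extend(rng.choices(v, k=target - len(v)))
    return balanced
```

Downsampling the common near-zero angles, or weighting samples in the loss, are alternatives to this upsampling sketch.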
- About
RL
:- Policies can be first trained with IL and then fine-tuned with RL methods.
-
"This approach reduces the long training time of RL approaches and, as the RL-based fine-tuning happens online, also helps overcome the problem of IL models learning off-policy".
- About
domain adaptation
andtransfer
from simulation to real world (sim2real
).- Techniques from
supervised
learning, such asfine tuning
, i.e. adapting the driving model to the new distribution, are rarely used. - Instead, one can instead adapt the incoming data and keep the driving model fixed.
- A first idea is to transform real images into simulation-like images (the opposite - generating real-looking images - is challenging).
- One can also extract the semantic segmentation of the scene from both the real and the simulated images and use it as the input for the driving policy.
- Techniques from
- About
-
2-
Inputs.- In short:
Vision
is key.Lidar
andHD-map
are nice to have but expensive / tedious to maintain.- Additional inputs from independent modules (
semantic segmentation
,depth map
,surface normals
,optical flow
andalbedo
) can improve the robustness.
- About the
inertia problem
/causal confusion
when for instance predicting the nextego-speed
.-
"As in the vast majority of samples the current [observed] and next speeds [to be predicted] are highly correlated, the model learns to base its speed prediction exclusively on current speed. This leads to the model being reluctant to change its speed, for example to start moving again after stopping behind another car or a at traffic light."
-
- About
affordances
:-
"Instead of parsing all the objects in the driving scene and performing robust localization (as modular approach), the system focuses on a small set of crucial indicators, called
affordances
."
-
-
3-
Outputs.-
"The outputs of the model define the level of understanding the model is expected to achieve."
- Also related to the
time horizon
:-
"When predicting instantaneous low-level commands, we are not explicitly forcing the model to plan a long-term trajectory."
-
- Three types of predictions:
3-1
Low-level commands.-
"The majority of end-to-end models yield as output the
steering angle
andspeed
(oracceleration
andbrake
commands) for the next timestep". - Low-level commands may be car-specific. For instance vehicles answer differently to the same
throttle
/steering
commands.-
"The function between steering wheel angle and the resulting turning radius depends on the car's geometry, making this measure specific to the car type used for recording."
-
-
[About the regression loss] "Many authors have recently optimized
speed
andsteering
commands usingL1
loss (mean absolute error,MAE
) instead ofL2
loss (mean squared error,MSE
)".
-
3-2
Future waypoints or desired trajectories.- This higher-level output modality is independent of car geometry.
3-3
Cost map, i.e. information about where it is safe to drive, leaving the trajectory generation to another module.
- About multitask learning and auxiliary tasks:
- The idea is to simultaneously train a separate set of networks to predict for instance
semantic segmentation
,optical flow
,depth
and other human-understandable representations from the camera feed. -
"Based on the same extracted visual features that are fed to the decision-making branch (main task), one can also predict
ego-speed
,drivable area
on the scene, andpositions
andspeeds
of other objects". - It offers more learning signals - at least for the shared layers.
- And can also help understand the mistakes a model makes:
-
"A failure in an auxiliary task (e.g. object detection) might suggest that necessary information was not present already in the intermediate representations (layers) that it shared with the main task. Hence, also the main task did not have access to this information and might have failed for the same reason."
-
-
-
4-
Evaluation: the difference betweenopen-loop
andclosed-loop
.4-1
open-loop
: like insupervised
learning:- one question = one answer.
- Typically, a dataset is split into training and testing data.
- Decisions are compared with the recorded actions of the demonstrator, assumed to be the ground-truth.
4-2
closed-loop
: like in decision processes:- The problem consists in a multi-step interaction with some environment.
- It directly measures the model's ability to drive on its own.
- Interesting facts: Good
open-loop
performance does not necessarily lead to good driving ability inclosed-loop
settings.-
"Mean squared error (
MSE
) correlates withclosed-loop
success rate only weakly (correlation coefficientr = 0.39
), soMAE
,quantized classification error
orthresholded relative error
should be used instead (r > 0.6
for all three)." - About the
balanced-MAE
metric foropen-loop
evaluation, which correlates better withclosed-loop
performance than simpleMAE
.-
"
Balanced-MAE
is computed by averaging the mean values of unequal-length bins according tosteering angle
. Because most data lies in the region around steering angle0
, equally weighting the bins grows the importance of rarely occurring (higher)steering angle
s."
-
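That balanced-MAE description translates into a short metric function. A minimal sketch, assuming steering angles in [-1, 1]; the bin edges below are illustrative, not the ones used in the survey:

```python
def balanced_mae(pred, true, bin_edges=(-1.0, -0.2, -0.05, 0.05, 0.2, 1.0)):
    """Average of per-bin MAEs, with bins defined on the true steering
    angle. Weighting the (unequal-length) bins equally boosts the
    importance of rare, large steering angles."""
    bins = [[] for _ in range(len(bin_edges) - 1)]
    for p, t in zip(pred, true):
        for j in range(len(bin_edges) - 1):
            if bin_edges[j] <= t <= bin_edges[j + 1]:
                bins[j].append(abs(p - t))
                break
    per_bin = [sum(b) / len(b) for b in bins if b]  # skip empty bins
    return sum(per_bin) / len(per_bin)
```

With a plain MAE, a model that is accurate only around zero steering looks good; the per-bin averaging above penalizes it.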
-
-
5-
Interpretability:5-1
Either on thetrained
model ...-
"Sensitivity analysis aims to determine the parts of an input that a model is most sensitive to. The most common approach involves computing the gradients with respect to the input and using the magnitude as the measure of sensitivity."
VisualBackProp
: which input pixels influence the cars driving decision the most.
-
5-2
... or already duringtraining
.-
"
visual attention
is a built-in mechanism present already when learning. Where to attend in the next timestep (the attention mask), is predicted as additional output in the current step and can be made to depend on additional sources of information (e.g. textual commands)."
-
-
About
end-to-end
neural nets and humans:-
"[
StarCraft
,Dota 2
,Go
andChess
solved withNN
]. Many of these solved tasks are in many aspects more complex than driving a car, a task that a large proportion of people successfully perform even when tired or distracted. A person can later recollect nothing or very little about the route, suggesting the task needs very little conscious attention and might be a simple behavior reflex task. It is therefore reasonable to believe that in the near future anend-to-end
approach is also capable to autonomously control a vehicle."
-
"Efficient Latent Representations using Multiple Tasks for Autonomous Driving"
-
[
2020
] [📝] [ 🎓Aalto University
] -
[
latent space representation
,multi-head decoder
,auxiliary tasks
]
Click to expand
The latent representation is enforced to predict the trajectories of both the ego vehicle and other vehicles in addition to the input image, using a multi-head network structure. Source. |
Authors: Kargar, E., & Kyrki, V.
- Motivations:
1-
Reduce the dimensionality of thefeature representation
of the scene - used as input to someIL
/RL
policy.- This is to improve most
mid-to-x
approaches that encode and process a vehicle’s environment as multi-channel and quite high-dimensional bird view images. ->
The idea here is to learn anencoder-decoder
.- The latent space has size
64
(way smaller than common64 x 64 x N
bird-views).
- This is to improve most
2-
Learn alatent representation
faster / with fewer data.- A single head decoder would just consider
reconstruction
. ->
The idea here is to use have multiple heads in the decoder, i.e. make prediction of multiple auxiliary application relevant factors.-
"The multi-head model can reach the single-head model’s performance in
20
epochs, one-fifth of training time of the single-head model, with full dataset." -
"In general, the multi-heal model, using only
6.25%
of the dataset, converges faster and performs better than the single-head model trained on the full dataset."
- A single head decoder would just consider
3-
Learn apolicy
faster / with fewer data.
- Two components to train:
1-
Anencoder-decoder
learns to produce a latent representation (encoder
) coupled with a multiple-prediction-objective (decoder
).2-
Apolicy
use the latent representation to predict low-level controls.
- About the
encoder-decoder
:inputs
: bird-view image containing:- Environment info, built from
HD Maps
andperception
. - Ego trajectory:
10
past poses. - Other trajectory:
10
past poses. - It forms a
256 x 256
image, which is resized to64 x 64
before being fed into the models
- Environment info, built from
outputs
: multiple auxiliary tasks:1-
Reconstruction head: reconstructing the input bird-view image.2-
Prediction head:1s
-motion-prediction for other agents.3-
Planning head:1s
-motion-prediction for the ego car.
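The multi-head decoder boils down to summing per-head losses computed from one shared latent code. A toy sketch, assuming squared-error heads; the head names and weights are illustrative, not the paper's exact losses:

```python
def multi_head_loss(latent, decoders, targets, weights):
    """Total loss = weighted sum of per-head squared errors, every head
    reading the same shared latent vector (reconstruction, motion
    prediction for other agents, planning for the ego car)."""
    total = 0.0
    for name, decode in decoders.items():
        prediction = decode(latent)
        head_loss = sum((p - t) ** 2 for p, t in zip(prediction, targets[name]))
        total += weights[name] * head_loss
    return total
```

Each extra head adds a learning signal for the shared encoder, which is why the multi-head model can converge with a fraction of the data.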
- About the
policy
:- In their example, the authors implement
behaviour cloning
, i.e. supervised learning to reproduce the decision ofCARLA autopilot
. 1-
steering
prediction.2-
acceleration
classification -3
classes.
- How to deal with the unbalanced dataset?
- First, the authors note that no manual labelling is required to collect training data.
- But the recorded
steering
angle is zero most of the time - leading to a highly imbalanced dataset. - Solution (no further detail):
-
"Create a new dataset and balance it using sub-sampling".
-
"Robust Imitative Planning : Planning from Demonstrations Under Uncertainty"
-
[
2019
] [📝] [ 🎓University of Oxford
,UC Berkeley
,Carnegie Mellon University
] -
[
epistemic uncertainty
,risk-aware decision-making
,CARLA
]
Click to expand
Illustration of the state distribution shift in behavioural cloning (BC ) approaches. The models (e.g. neural networks) usually fail to generalize and instead extrapolate confidently yet incorrectly, resulting in arbitrary outputs and dangerous outcomes. Not to mention the compounding (or cascading) errors, inherent to the sequential decision making. Source. |
Testing behaviours on scenarios such as roundabouts that are not present in the training set. Source. |
Above - in their previous work, the authors introduced Deep imitative models (IM ). The imitative planning objective is the log posterior probability of a state trajectory, conditioned on satisfying some goal G . The state trajectory that has the highest likelihood w.r.t. the expert model q (S given φ ; θ ) is selected, i.e. maximum a posteriori probability (MAP ) estimate of how an expert would drive to the goal. This captures any inherent aleatoric stochasticity of the human behaviour (e.g., multi-modalities), but only uses a point-estimate of θ , thus q (s given φ ;θ ) does not quantify model (i.e. epistemic ) uncertainty. φ denotes the contextual information (3 previous states and current LIDAR observation) and s denotes the agent’s future states (i.e. the trajectory). Bottom - in this work, an ensemble of models is used: q (s given φ ; θk ) where θk denotes the parameters of the k -th model (neural network). An aggregation operator is applied on the posterior p(θ given D ). The previous work is one example of that, where a single θi is selected. Source. |
To save computation and improve runtime to real-time, the authors use a trajectory library: they perform K-means clustering of the expert plans from the training distribution and keep 128 of the centroids, allegedly reducing the planning time by a factor of 400 . During optimization, the trajectory space is limited to only that trajectory library. It makes me think of templates sometimes used for path-planning. I also see that as a way to restrict the search in the trajectory space, similar to injecting expert knowledge about the feasibility of car trajectories. Source. |
Estimating the uncertainty is not enough. One should then forward that estimate to the planning module. This reminds me of an idea of (McAllister et al., 2017) about the key benefit of propagating uncertainty throughout the AV framework. Source. |
Authors: Tigas, P., Filos, A., Mcallister, R., Rhinehart, N., Levine, S., & Gal, Y.
-
Previous work:
"Deep Imitative Models for Flexible Inference, Planning, and Control"
(see below).- The idea was to combine the benefits of
imitation learning
(IL
) andgoal-directed planning
such asmodel-based RL
(MBRL
).- In other words, to complete planning based on some imitation prior, by combining generative modelling from demonstration data with planning.
- One key idea of this generative model of expert behaviour: perform context-conditioned density estimation of the distribution over future expert trajectories, i.e. score the "expertness" of any plan of future positions.
- Limitations:
- It only uses a point-estimate of
θ
. Hence it fails to capture epistemic uncertainty in the model’s density estimate. -
"Plans can be risky in scenes that are out-of-training-distribution since it confidently extrapolates in novel situations and lead to catastrophes".
- It only uses a point-estimate of
- The idea was to combine the benefits of
-
Motivations here:
1-
Develop a model that captures epistemic uncertainty.2-
Estimating uncertainty is not a goal in itself: one also needs to provide a mechanism for taking low-risk actions that are likely to recover in uncertain situations.
aleatoric
andepistemic
uncertainty should be taken into account in the planning objective. - This reminds me the figure of (McAllister et al., 2017) about the key benefit of propagating uncertainty throughout the AV framework.
- I.e. both
-
One quote about behavioural cloning (
BC
) that suffers from state distribution shift (co-variate shift
):-
"Where high capacity parametric models (e.g. neural networks) usually fail to generalize, and instead extrapolate confidently yet incorrectly, resulting in arbitrary outputs and dangerous outcomes".
-
-
One quote about model-free
RL
:-
"The specification of a reward function is as hard as solving the original control problem in the first place."
-
-
About
epistemic
andaleatoric
uncertainties:-
"Generative models can provide a measure of their uncertainty in different situations, but robustness in novel environments requires estimating
epistemic uncertainty
(e.g., have I been in this state before?), where conventional density estimation models only capturealeatoric uncertainty
(e.g., what’s the frequency of times I ended up in this state?)."
-
-
How to capture uncertainty about previously unseen scenarios?
- Using an ensemble of density estimators and aggregate operators over the models’ outputs.
-
"By using demonstration data to learn density models over human-like driving, and then estimating its uncertainty about these densities using an ensemble of imitative models".
-
- The idea is to take the disagreement between the models into consideration and inform planning.
-
"When a trajectory that was never seen before is selected, the model’s high
epistemic
uncertainty pushes us away from it. During planning, the disagreement between the most probable trajectories under the ensemble of imitative models is used to inform planning."
-
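The ensemble aggregation can be sketched compactly. This is a minimal sketch, not the authors' code: candidates are scored under every model and aggregated pessimistically (worst-case log-likelihood), one of several possible aggregation operators; names are illustrative.

```python
def robust_plan(candidates, ensemble_log_probs, aggregate=min):
    """Score each candidate plan under every model of the ensemble,
    then aggregate with a pessimistic operator: plans the models
    disagree on (high epistemic uncertainty) get a low robust score."""
    def robust_score(plan):
        return aggregate(log_prob(plan) for log_prob in ensemble_log_probs)
    return max(candidates, key=robust_score)
```

A plan that one model likes but another finds very unlikely is thus rejected in favour of a plan all models agree is plausibly expert-like.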
"End-to-end Interpretable Neural Motion Planner"
-
[
2019
] [📝] [ 🎓University of Toronto
] [ 🚗Uber
] -
[
interpretability
,trajectory sampling
]
Click to expand
The visualization of 3D detection, motion forecasting as well as learned cost-map volume offers interpretability. A set of candidate trajectories is sampled, first considering the geometrical path and then the speed profile. The trajectory with the minimum learned cost is selected. Source. |
Authors: Zeng W., Luo W., Suo S., Sadat A., Yang B., Casas S. & Urtasun R.
-
Motivation is to bridge the gap between the
traditional engineering stack
and theend-to-end driving
frameworks.1-
Develop a learnable motion planner, avoiding the costly parameter tuning.2-
Ensure interpretability in the motion decision. This is done by offering an intermediate representation.3-
Handle uncertainty. This is allegedly achieved by using a learnt, non-parametric cost function.4-
Handle multi-modality in possible trajectories (e.gchanging lane
vskeeping lane
).
-
One quote about
RL
andIRL
:-
"It is unclear if
RL
andIRL
can scale to more realistic settings. Furthermore, these methods do not produce interpretable representations, which are desirable in safety critical applications".
-
-
Architecture:
Input
: raw LIDAR data and a HD map.1st intermediate result
: An "interpretable" bird’s eye view representation that includes:3D
detections.- Predictions of future trajectories (planning horizon of
3
seconds). - Some spatio-temporal cost volume defining the goodness of each position that the self-driving car can take within the planning horizon.
2nd intermediate result
: A set of diverse physically possible trajectories (candidates).- They are
Clothoid
curves being sampled. First building thegeometrical path
. Then thespeed profile
on it. -
"Note that
Clothoid
curves can not handle circle and straight line trajectories well, thus we sample them separately."
- They are
Final output
: The trajectory with the minimum learned cost.
-
Multi-objective:
1-
Perception
Loss - to predict the position of vehicles at every time frame.- Classification: Distinguish a vehicle from the background.
- Regression: Generate precise object bounding boxes.
2-
Planning
Loss.-
"Learning a reasonable cost volume is challenging as we do not have ground-truth. To overcome this difficulty, we minimize the
max-margin
loss where we use the ground-truth trajectory as a positive example, and randomly sampled trajectories as negative examples." - As stated, the intuition behind is to encourage the demonstrated trajectory to have the minimal cost, and others to have higher costs.
- The model hence learns a cost volume that discriminates good trajectories from bad ones.
-
"Learning from Interventions using Hierarchical Policies for Safe Learning"
- [
2019
] [📝] [ 🎓University of Rochester, University of California San Diego
] - [
hierarchical
,sampling efficiency
,safe imitation learning
]
Click to expand
The main idea is to use Learning from Interventions (LfI ) in order to ensure safety and improve data efficiency, by intervening on sub-goals rather than trajectories. Both top-level policy (that generates sub-goals) and bottom-level policy are jointly learnt. Source. |
Authors: Bi, J., Dhiman, V., Xiao, T., & Xu, C.
- Motivations:
1-
Improve data-efficiency.2-
Ensure safety.
- One term: "Learning from Interventions" (
LfI
).- One way to classify the "learning from expert" techniques is to use the frequency of expert’s engagement.
High frequency
-> Learning from Demonstrations.Medium frequency
-> learning from Interventions.Low frequency
-> Learning from Evaluations.
- Ideas of
LfI
:-
"When an undesired state is detected, another policy is activated to take over actions from the agent when necessary."
- Hence the expert overseer only intervenes when it suspects that an unsafe action is about to be taken.
-
- Two issues:
1-
LfI
(as forLfD
) learn reactive behaviours.-
"Learning a supervised policy is known to have 'myopic' strategies, since it ignores the temporal dependence between consecutive states".
- Maybe one option could be to stack frames or to include the current speed in the
state
. But that makes the state space larger.
-
2-
The expert only signals after a non-negligible amount of delay.
- One idea to solve both issues: Hierarchy.
- The idea is to split the policy into two hierarchical levels, one that generates
sub-goals
for the future and another that generatesactions
to reach those desired sub-goals. - The motivation is to intervene on sub-goals rather than trajectories.
- One important parameter:
k
- The top-level policy predicts a sub-goal to be achieved
k
steps ahead in the future. - It represents a trade-off between:
- The ability for the
top-level
policy to predict sub-goals far into the future. - The ability for the
bottom-level
policy to follow it correctly.
- The ability for the
- The top-level policy predicts a sub-goal to be achieved
- One question: How to deal with the absence of ground-truth sub-goals?
- One solution is "Hindsight Experience Replay", i.e. consider an achieved goal as a desired goal for past observations.
- The authors present additional interpolation techniques.
- They also present a
Triplet Network
to train goal-embeddings (I did not understand everything).
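The hindsight trick above fits in one function. A minimal sketch under the assumption that states double as goals (the paper additionally learns goal-embeddings via a Triplet Network):

```python
def hindsight_relabel(trajectory, k):
    """In the absence of ground-truth sub-goals, treat the state the
    agent actually reached k steps later as the sub-goal the top-level
    policy 'intended' at time t (hindsight relabelling)."""
    return [(trajectory[t], trajectory[t + k])
            for t in range(len(trajectory) - k)]
```

The resulting (state, sub-goal) pairs give supervised targets for the top-level policy, while the bottom-level policy is trained to reach the relabelled sub-goals.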
"Urban Driving with Conditional Imitation Learning"
Click to expand
The encoder is trained to reconstruct RGB , depth and segmentation , i.e. to learn scene understanding. It is augmented with optical flow for temporal information. As noted, such representations could be learned simultaneously with the driving policy, for example, through distillation. But for efficiency, this was pre-trained (Humans typically also have ~30 hours of driver training before taking the driving exam. But they start with huge prior knowledge). Interesting idea: the navigation command is injected at multiple locations of the control part. Source. |
Driving data is inherently heavily imbalanced, where most of the captured data will be driving near-straight in the middle of a lane. Any naive training will collapse to the dominant mode present in the data. No data augmentation is performed. Instead, during training, the authors sample data uniformly across lateral and longitudinal control dimensions. Source. |
Authors: Hawke, J., Shen, R., Gurau, C., Sharma, S., Reda, D., Nikolov, N., Mazur, P., Micklethwaite, S., Griffiths, N., Shah, A. & Kendall, A.
- Motivations:
1-
Learn bothsteering
andspeed
via Behavioural Cloning.2-
Use raw sensor (camera) inputs, rather than intermediate representations.3-
Train and test on dense urban environments.
- Why "conditional"?
- A route command (e.g.
turn left
,go straight
) resolves the ambiguity of multi-modal behaviours (e.g. when coming at an intersection). -
"We found that inputting the command multiple times at different stages of the network improves robustness of the model".
- Some ideas:
- Provide wider state observability through multiple camera views (single camera disobeys navigation interventions).
- Add temporal information via optical flow.
- Another option would be to stack frames. But it did not work well.
- Train the primary shared encoders and auxiliary independent decoders for a number of computer vision tasks.
-
"In robotics, the
test
data is the real-world, not a static dataset as is typical in mostML
problems. Every time our cars go out, the world is new and unique."
-
- One concept: "Causal confusion".
- A good video about Causal Confusion in Imitation Learning showing that "access to more information leads to worse generalisation under distribution shift".
-
"Spurious correlations cannot be distinguished from true causes in the demonstrations. [...] For example, inputting the current speed to the policy causes it to learn a trivial identity mapping, making the car unable to start from a static position."
- Two ideas during training:
- Using flow features to make the model use explicit motion information without learning the trivial solution of an identity mapping for speed and steering.
- Add random noise and use dropout on it.
- One alternative is to explicitly maintain a causal model.
- Another alternative is to learn to predict the speed, as detailed in "Exploring the Limitations of Behavior Cloning for Autonomous Driving".
- Output:
- The model decides on a "motion plan", i.e. not directly the low-level control?
- Concretely, the network gives one prediction and one slope, for both
speed
andsteering
, leading to two parameterised lines.
- Two types of tests:
1-
Closed-loop (i.e. go outside and drive).- The number and type of safety-driver interventions.
2-
Open-loop (i.e., evaluating on an offline dataset).- The weighted mean absolute error for
speed
andsteering
.- As noted, this can serve as a proxy for real world performance.
- The weighted mean absolute error for
-
"As discussed by [34] and [35], the correlation between
offline
open-loop
metrics andonline
closed-loop
performance is weak."
- About the training data:
- As stated, they are two levers to increase the performance:
1-
Algorithmic innovation.2-
Data.
- For this
IL
approach,30
hours of demonstrations. -
"Re-moving a quarter of the data notably degrades performance, and models trained with less data are almost undriveable."
- As stated, they are two levers to increase the performance:
- Next steps:
- I find the results already impressive. But as noted:
-
"The learned driving policies presented here need significant further work to be comparable to human driving".
-
- Ideas for improvements include:
- Add some predictive long-term planning model. At the moment, it does not have access to long-term dependencies and cannot reason about the road scene.
- Learn not only from demonstration, but also from mistakes.
- This reminds me of the concept of
ChauffeurNet
about "simulate the bad rather than just imitate the good".
- This reminds me the concept of
- Continuous learning: Learning from corrective interventions would also be beneficial.
- The last point goes in the direction of adding learning signals, which was already done here:
  - Imitation of human expert drivers (`supervised` learning).
  - Safety-driver intervention data (`negative reinforcement` learning) and corrective actions (`supervised` learning).
  - Geometry, dynamics, motion and future prediction (`self-supervised` learning).
  - Labelled semantic computer vision data (`supervised` learning).
  - Simulation (`supervised` learning).
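The open-loop metric mentioned above (weighted mean absolute error for `speed` and `steering`) can be sketched in a few lines. A minimal sketch: the paper does not specify its weighting scheme here, so per-sample weights are an assumption, as is the function name.

```python
import numpy as np

def weighted_mae(pred, target, weights):
    """Weighted mean absolute error, an open-loop proxy metric for
    speed/steering predictions on an offline dataset.

    pred, target : per-sample predictions and ground-truth values
    weights      : per-sample weights (assumed; the exact scheme is not given)
    """
    pred, target, weights = map(np.asarray, (pred, target, weights))
    return np.sum(weights * np.abs(pred - target)) / np.sum(weights)
```

As the quoted warning says, a low value of such an offline metric correlates only weakly with online closed-loop performance.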
"Application of Imitation Learning to Modeling Driver Behavior in Generalized Environments"
Click to expand
The IL models were trained on a straight road and tested on roads with high curvature. PS-GAIL is effective only while surrounded by other vehicles, while the RAIL policy remained stably within the bounds of the road thanks to the additional reward terms included in the learning process. Source. |
Authors: Lange, B. A., & Brannon, W. D.
- One motivation: Compare the robustness (domain adaptation) of three `IL` techniques.
- One take-away: This student project gives a good overview of the different `IL` algorithms and why they emerged.
  - Imitation Learning (`IL`) aims at building an (efficient) policy using some expert demonstrations.
BC
) is a sub-class ofIL
. It treatsIL
as a supervised learning problem: a regression model is fit to thestate
/action
space given by the expert.-
Issue of distribution shift: "Because data is not infinite nor likely to contain information about all possible
state
/action
pairs in a continuousstate
/action
space,BC
can display undesirable effects when placed in these unknown or not well-known states." -
"A cascading effect is observed as the time horizon grows and errors expand upon each other."
-
- Several solutions (not exhaustive):
  - `1-` `DAgger`: Ask the expert what should be done in some encountered situations, thus iteratively enriching the demonstration dataset.
  - `2-` `IRL`: Human driving behaviour is not modelled inside a policy, but rather captured in a reward/cost function.
    - Based on this reward function, an (optimal) policy can be derived with classic `RL` techniques.
    - One issue: It can be computationally expensive.
  - `3-` `GAIL` (I still need to read more about it):
    - "It fits distributions of states and actions given by an expert dataset, and a cost function is learned via Maximum Causal Entropy `IRL`."
    - "When the `GAIL`-policy driven vehicle was placed in a multi-agent setting, in which multiple agents take over the learned policy, this algorithm produced undesirable results among the agents."
    - `PS-GAIL` is therefore introduced for multi-agent driving models (agents share a single policy learnt with `PS-TRPO`).
      - "Though `PS-GAIL` yielded better results in multi-agent simulations than `GAIL`, its results still led to undesirable driving characteristics, including unwanted trajectory deviation and off-road duration."
    - `RAIL` offers a fix for that: the policy-learning process is augmented with two types of reward terms:
      - Binary penalties: e.g. collision and hard braking.
      - Smoothed penalties: "applied in advance of undesirable actions with the theory that this would prevent these actions from occurring".
      - I see that technique as a way to incorporate knowledge.
- About the experiment:
- The three policies were originally trained on the straight roadway: cars only consider the lateral distance to the edge.
- In the "new" environment, a road curvature is introduced.
- Findings:
-
"None of them were able to fully accommodate the turn in the road."
  - `PS-GAIL` is effective only while surrounded by other vehicles.
  - The smoothed reward augmentation helped `RAIL`, but it was too late to avoid going off-road (the car is already driving too fast and does not dare a `hard brake`
which is strongly penalized). - The reward function should therefore be updated (back to reward engineering 😅), for instance adding a harder reward term to prevent the car from leaving the road.
-
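The two `RAIL`-style reward terms (binary penalties for discrete events, smoothed penalties applied in advance) can be sketched as a reward-augmentation function. This is only an illustration of the idea: all constants (`margin`, penalty magnitudes) and the function name are my own choices, not values from the paper.

```python
def rail_style_augmented_reward(base_reward, collided, hard_brake,
                                dist_to_road_edge, margin=2.0,
                                binary_penalty=-100.0, max_smooth_penalty=-10.0):
    """Sketch of RAIL-style reward augmentation.

    - Binary penalties fire on discrete events (collision, hard braking).
    - A smoothed penalty ramps up as the car approaches the road edge,
      i.e. *in advance of* the undesirable off-road event.
    """
    r = base_reward
    if collided:
        r += binary_penalty
    if hard_brake:
        r += binary_penalty
    if dist_to_road_edge < margin:
        # linearly ramp from 0 (at the margin) to max_smooth_penalty (at the edge)
        r += max_smooth_penalty * (1.0 - dist_to_road_edge / margin)
    return r
```

The smoothed term is what distinguishes `RAIL` from plain binary shaping: the agent is discouraged *before* the catastrophic event, not only once it happens.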
"Learning by Cheating"
- [ `2019` ] [📝] [] [ 🎓 `UT Austin` ] [ 🚗 `Intel Labs` ]
- [ `on-policy supervision`, `DAgger`, `conditional IL`, `mid-to-mid`, `CARLA` ]
Click to expand
The main idea is to decompose the imitation learning (IL ) process into two stages: 1- Learn to act. 2- Learn to see. Source. |
mid-to-mid learning: Based on a processed bird’s-eye view map , the privileged agent predicts a sequence of waypoints to follow. This desired trajectory is eventually converted into low-level commands by two PID controllers. It is also worth noting how this privileged agent serves as an oracle that provides adaptive on-demand supervision to train the sensorimotor agent across all possible commands. Source. |
Example of privileged map supplied to the first agent. And details about the lateral PID controller that produces steering commands based on a list of target waypoints. Source. |
Authors: Chen, D., Zhou, B., Koltun, V. & Krähenbühl, P
- One motivation: decompose the imitation learning (`IL`) process into two stages:
  - Direct `IL` (from expert trajectories to vision-based driving) conflates two difficult tasks:
    - `1-` Learning to see.
    - `2-` Learning to act.
- One term: "Cheating".
  - `1-` First, train an agent that has access to privileged information:
    - "This privileged agent cheats by observing the ground-truth layout of the environment and the positions of all traffic participants."
    - Goal: The agent can focus on learning to act (it does not need to learn to see, because it gets direct access to the environment’s state).
  - `2-` Then, this privileged agent serves as a teacher to train a purely vision-based system (abundant `supervision`).
    - Goal: Learning to see.
  - `1-` First agent (`privileged` agent):
    - Input: A processed `bird’s-eye view map` (with ground-truth information about lanes, traffic lights, vehicles and pedestrians) together with a high-level `navigation command` and the `current speed`.
    - Output: A list of waypoints the vehicle should travel to.
      - Hence a `mid-to-mid` learning approach.
    - Goal: Imitate the expert trajectories.
    - Training: Behaviour cloning (`BC`) from a set of recorded expert driving trajectories.
      - Augmentation can be done offline, to facilitate generalization.
      - The agent is thus placed in a variety of perturbed configurations to learn how to recover.
        - E.g. facing the sidewalk or placed on the opposite lane, it should find its way back onto the road.
  - `2-` Second agent (`sensorimotor` agent):
    - Input: A monocular `RGB` image, the current `speed`, and a high-level `navigation command`.
    - Output: A list of waypoints.
    - Goal: Imitate the privileged agent.
- One idea: "White-box" agent:
  - The internal state of the `privileged` agent can be examined at will.
    - Based on that, one could test different high-level commands: "What would you do now if the command was [`follow-lane`] [`go left`] [`go right`] [`go straight`]?"
  - This relates to `conditional IL`: all conditional branches are supervised during training.
- Another idea: "online learning" and "on-policy supervision":
  - "'On-policy' refers to the sensorimotor agent rolling out its own policy during training."
    - Here, the decisions of the second agent are directly implemented (`closed-loop`).
    - And an oracle is still available for the newly encountered situations (hence `on-policy`), which also accelerates the training.
    - This is an advantage of using a simulator: it would be difficult/impossible in the physical world.
  - Here, the second agent is first trained `off-policy` (on expert demonstrations) to speed up the learning (`offline BC`), and only then goes `on-policy`:
    - "Finally, we train the sensorimotor agent `on-policy`, using the privileged agent as an oracle that provides adaptive on-demand supervision in any state reached by the sensorimotor student."
    - The `sensorimotor` agent can thus be supervised on all its waypoints and across all commands at once.
  - It resembles the Dataset Aggregation technique of `DAgger`:
    - "This enables automatic `DAgger`-like training in which supervision from the privileged agent is gathered adaptively via online rollouts of the sensorimotor agent."
- About the two benchmarks:
- Another idea: Do not directly output low-level commands.
- Instead, predict waypoints and speed targets.
- And rely on two `PID` controllers to implement them.
  - `1-` "We fit a parametrized circular arc to all waypoints using least-squares fitting and then steer towards a point on the arc."
  - `2-` "A longitudinal `PID` controller tries to match a target velocity as closely as possible [...] We ignore negative throttle commands, and only brake if the predicted velocity is below some threshold (`2 km/h`)."
"Deep Imitative Models for Flexible Inference, Planning, and Control"
- [ `2019` ] [📝] [🎞️] [🎞️] [] [ 🎓 `Carnegie Mellon University`, `UC Berkeley` ]
- [ `conditional IL`, `model-based RL`, `CARLA` ]
Click to expand
The main motivation is to combine the benefits of IL (to imitate some expert demonstrations) and goal-directed planning (e.g. model-based RL ). Source. |
φ represents the scene. It consists of the current lidar scan , previous states in the trajectory as well as the current traffic light state . Source. |
From left to right: Point , Line-Segment and Region (small and wide) Final State Indicators used for planning. Source. |
Comparison of features and implementations. Source. |
Authors: Rhinehart, N., McAllister, R., & Levine, S.
- Main motivation: Combine the benefits of `imitation learning` (`IL`) and `goal-directed planning` such as `model-based RL` (`MBRL`).
  - Especially to generate interpretable, expert-like plans with offline learning and no reward engineering.
  - Neither `IL` nor `MBRL` can do so alone.
  - In other words, it performs planning based on an imitation prior.
- One concept: "Imitative Models".
  - They are "probabilistic predictive models able to plan interpretable expert-like trajectories to achieve new goals".
  - As for `IL` -> use expert demonstrations:
    - It generates expert-like behaviours without reward-function crafting.
    - The model is learnt "offline", which also means it avoids costly online data collection (contrary to `MBRL`).
    - It learns a model of desirable behaviour (as opposed to the model of merely possible behaviour learnt by `MBRL`).
  - As for `MBRL` -> use planning:
    - It achieves new goals (goals that were not seen during training). Therefore, it avoids the theoretical drift shortcomings (distribution shift) of vanilla behavioural cloning (`BC`).
    - It outputs an (interpretable) `plan` to reach them at test-time, which `IL` cannot.
    - It does not need goal labels for training.
- Binding `IL` and `planning`:
  - The learnt `imitative model` `q(S|φ)` can generate trajectories that resemble those that the expert might generate.
    - These manoeuvres do not have a specific goal. How to direct our agent to goals?
  - General tasks are defined by a set of goal variables `G`.
    - At test time, a route planner provides waypoints to the imitative planner, which computes expert-like paths for each candidate waypoint.
    - The best plan is chosen according to the planning objective (e.g. prefer routes avoiding potholes) and provided to a low-level `PID`-controller in order to produce `steering` and `throttle` actions.
  - In other words, the derived plan (list of set-points) should be:
    - Maximizing the similarity to the expert demonstrations (term with `q`).
    - Maximizing the probability of reaching some general goals (term with `P(G)`).
  - How to represent goals?
    - `dim=0` - with points: `Final-State Indicator`.
    - `dim=1` - with lines: `Line-Segment Final-State Indicator`.
    - `dim=2` - with areas (regions): `Final-State Region Indicator`.
-
How to deal with traffic lights?
  - The concept of a `smart waypointer` is introduced.
  - "It removes far waypoints beyond `5` meters from the vehicle when a red light is observed in the measurements provided by CARLA."
  - "The planner prefers closer goals when obstructed, when the vehicle was already stopped, and when a red light was detected [...] The planner prefers farther goals when unobstructed and when green lights or no lights were observed."
-
About interpretability and safety:
  - "In contrast to black-box one-step `IL` that predicts controls, our method produces interpretable multi-step plans accompanied by two scores. One estimates the plan’s `expertness`, the second estimates its probability to achieve the goal."
    - The `imitative model` can produce an expert probability density function (`PDF`), hence offering superior interpretability compared to one-step `IL` models.
    - It is able to score how likely a trajectory is to come from the expert.
    - The probability to achieve a goal is based on some "Goal Indicator methods" (using "Goal Likelihoods"). I must say I did not fully understand that part.
- The safety aspect relies on the fact that experts were driving safely and is formalized as a "plan reliability estimation":
-
"Besides using our model to make a best-effort attempt to reach a user-specified goal, the fact that our model produces explicit likelihoods can also be leveraged to test the reliability of a plan by evaluating whether reaching particular waypoints will result in human-like behavior or not."
- Based on this idea, a classification is performed to recognize safe and unsafe plans, based on the planning criterion.
-
-
-
About the baselines:
- Obviously, the proposed approach is compared to the two methods it aims at combining.
- About `MBRL`:
  - `1-` First, a `forward dynamics model` is learnt from the observed expert data.
    - It does not imitate the expert's preferred actions, but only models what is physically possible.
  - `2-` The model is then used to plan a reachability tree through the free space up to the waypoint while avoiding obstacles:
    - Playing with the `throttle` action, the search expands each state node and retains the `50` closest nodes to the target waypoint.
    - The planner finally opts for the lowest-cost path that ends near the goal.
  - "The task of evoking expert-like behavior is offloaded to the reward function, which can be difficult and time-consuming to craft properly."
- About `IL`: It uses Conditional terms on States, leading to `CILS`.
  - `S` for `state`: Instead of emitting low-level control commands (`throttle`, `steering`), it outputs set-points for some `PID`-controller.
  - `C` for `conditional`: To navigate at intersections, waypoints are classified into one of several directives: {`Turn left`, `Turn right`, `Follow Lane`, `Go Straight`}.
  - This is inspired by "End-to-end driving via conditional imitation learning" - (Codevilla et al. 2018) - detailed below.
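The binding of imitation prior and goal likelihood can be sketched as a plan-selection rule: pick the candidate trajectory maximizing `log q(s|φ) + log P(G|s)`. A toy sketch, where the two scoring functions are stand-ins for the paper's learned density and goal indicators:

```python
import numpy as np

def plan_with_imitative_model(candidate_plans, log_q, log_goal_likelihood):
    """Pick the best plan  s* = argmax_s  log q(s|phi) + log P(G|s).

    candidate_plans     : list of trajectories (arrays of waypoints)
    log_q               : imitation prior, scores how expert-like a plan is
    log_goal_likelihood : scores how likely the plan reaches the goal
    """
    scores = [log_q(s) + log_goal_likelihood(s) for s in candidate_plans]
    best = int(np.argmax(scores))
    return candidate_plans[best], scores[best]
```

With, say, a Gaussian goal indicator around a waypoint and a smoothness-based stand-in prior, the rule trades off "reach the goal" against "stay expert-like", which is exactly the two-term objective described above.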
"Conditional Vehicle Trajectories Prediction in CARLA Urban Environment"
Click to expand
Some figures:
End-to-Mid approach: 3 inputs with different levels of abstraction are used to predict the future positions on a fixed 2s-horizon of the ego vehicle and the neighbours. The ego trajectory is then implemented by an external PID controller - Therefore, not end-to-end. Source. |
The past 3D-bounding boxes of the road users in the current reference are projected back in the current camera space. The past positions of ego and other vehicles are projected into some grid-map called proximity map . The image and the proximity map are concatenated to form context feature vector C . This context encoding is concatenated with the ego encoding, then fed into branches corresponding to the different high-level goals - conditional navigation goal . Source. |
Illustration of the distribution shift in imitation learning. Source. |
VisualBackProp highlights the image pixels which contributed the most to the final results - Traffic lights and their colours are important, together with highlights lane markings and curbs when there is a significant lateral deviation. Source. |
Authors: Buhet, T., Wirbel, E., & Perrotton, X.
- Previous works:
- "Imitation Learning for End to End Vehicle Longitudinal Control with Forward Camera" - (George, Buhet, Wirbel, Le-Gall, & Perrotton, 2018).
- "End to End Vehicle Lateral Control Using a Single Fisheye Camera" (Toromanoff, M., Wirbel, E., Wilhelm, F., Vejarano, C., Perrotton, X., & Moutarde, F. 2018).
- One term: "End-To-Middle".
  - It is opposed to "End-To-End", i.e. it does not output "end" control signals such as `throttle` or `steering`, but rather some desired trajectory, i.e. a mid-level representation.
    - Each trajectory is described by two polynomial functions (one for `x`, the other for `y`); the network therefore has to predict a vector (`x0`, ..., `x4`, `y0`, ..., `y4`) for each vehicle.
    - The desired ego-trajectory is then implemented by an external controller (`PID`). Therefore, not `end-to-end`.
  - Advantages of `end-to-mid`: interpretability for the control part + less to be learnt by the net.
  - This approach is also an instance of "Direct perception":
    - "Instead of commands, the network predicts hand-picked parameters relevant to the driving (distance to the lines, to other vehicles), which are then fed to an independent controller."
  - Small digression: if the raw perception measurements were first processed to form a mid-level input representation, the approach would be said `mid-to-mid`. An example is ChauffeurNet, detailed on this page as well.
- About Ground truth:
  - The expert demonstrations do not come from human recordings but rather from the `CARLA` autopilot.
  - `15` hours of driving in `Town01` were collected.
  - As for human demonstrations, no annotation is needed.
- One term: "Conditional navigation goal".
- Together with the RGB images and the past positions, the network takes as input a navigation command to describe the desired behaviour of the ego vehicle at intersections.
- Hence, the future trajectory of the ego vehicle is conditioned by a navigation command.
    - If the ego-car is approaching an intersection, the goal can be `left`, `right` or `cross`; otherwise the goal is to `keep lane`.
    - That means `lane-change` is not an option.
-
"The last layers of the network are split into branches which are masked with the current navigation command, thus allowing the network to learn specific behaviours for each goal".
- Three ingredients to improve vanilla end-to-end imitation learning (`IL`):
  - `1-` Mix of `high`- and `low`-level input (i.e. hybrid input):
    - Both raw signal (images) and partial environment abstraction (navigation commands) are used.
  - `2-` Auxiliary tasks:
    - One head of the network predicts the future trajectories of the surrounding vehicles.
      - It differs from the primary task, which should decide the `2s`-ahead trajectory for the ego car.
      - Nevertheless, this secondary task helps: "Adding the neighbours prediction makes the ego prediction more compliant to traffic rules."
    - This refers to the concept of "Privileged learning":
      - "The network is partly trained with an auxiliary task on a ground truth which is useful to driving, and on the rest is only trained for IL."
  - `3-` Label augmentation:
    - The main challenge of `IL` is the difference between train and online-test distributions. This is due to the difference between:
      - `Open-loop` control: decisions are not implemented.
      - `Closed-loop` control: decisions are implemented, and the vehicle can end in a state absent from the train distribution, potentially causing "error accumulation".
    - Data augmentation is used to reduce the gap between train and test distributions.
      - Classical randomization is combined with `label augmentation`: data similar to failure cases is generated a posteriori.
- Three findings:
  - "There is a significant gap in performance when introducing the augmentation."
  - "The effect is much more noticeable on complex navigation tasks." (Errors accumulate quicker.)
  - "Online test is the real significant indicator for `IL` when it is used for active control." (The common offline evaluation metrics may not be correlated to the online performance.)
- Baselines:
  - Conditional Imitation Learning (`CIL`): "End-to-end driving via conditional imitation learning" video - (Codevilla, F., Müller, M., López, A., Koltun, V., & Dosovitskiy, A. 2017).
    - `CIL` produces instantaneous commands.
  - Conditional Affordance Learning (`CAL`): "Conditional affordance learning for driving in urban environments" video - (Sauer, A., Savinov, N. & Geiger, A. 2018).
    - `CAL` produces "affordances" which are then given to a controller.
- One word about the choice of the simulator.
  - A possible alternative to CARLA could be DeepDrive or the `LGSVL` simulator developed by the Advanced Platform Lab at the LG Electronics America R&D Centre. This looks promising.
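The trajectory output described above (two polynomial functions per vehicle, evaluated over the fixed `2s` horizon) can be sketched as follows. The coefficient layout (lowest degree first) and the sampling resolution are assumptions for illustration:

```python
import numpy as np

def eval_trajectory(x_coef, y_coef, horizon=2.0, n=20):
    """Evaluate a trajectory parameterised by two degree-4 polynomials
    x(t) = sum_i x_coef[i] * t**i and y(t) likewise, over a fixed horizon.

    Returns an (n, 2) array of future positions, which an external PID
    controller would then track (coefficient layout is my assumption).
    """
    t = np.linspace(0.0, horizon, n)
    x = np.polyval(list(x_coef)[::-1], t)  # np.polyval wants highest degree first
    y = np.polyval(list(y_coef)[::-1], t)
    return np.stack([x, y], axis=1)
```

Predicting 10 coefficients per vehicle instead of dense waypoints is one reason the `end-to-mid` output is compact and easy to feed to a classical controller.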
"Uncertainty Quantification with Statistical Guarantees in End-to-End Autonomous Driving Control"
Click to expand
One figure:
The trust or uncertainty in one decision can be measured based on the probability mass function around its mode. Source. |
The measures of uncertainty based on mutual information can be used to issue warnings to the driver and perform safety / emergency manoeuvres. Source. |
As noted by the authors: while the variance can be useful in collision avoidance, the wide variance of HMC causes a larger proportion of trajectories to fall outside of the safety boundary when a new weather is applied. Source. |
Authors: Michelmore, R., Wicker, M., Laurenti, L., Cardelli, L., Gal, Y., & Kwiatkowska, M
- One related work:
  - NVIDIA’s `PilotNet` [`DAVE-2`], where expert demonstrations are used together with supervised learning to map from images (front camera) to steering commands.
  - Here, human demonstrations are collected in the CARLA simulator.
- One idea: use distribution in weights.
  - The difference with `PilotNet` is that the neural network applies the "Bayesian" paradigm, i.e. each weight is described by a distribution (not just a single value).
    - The Bayesian controller may be uncertain about the steering angle to apply (e.g. a `2`-tail or `M`-shape distribution).
- Another option would be to simply select the mean value of the distribution, which aims straight at the obstacle.
- The motivation of this work is based on that example: "derive some precise quantitative measures of the BNN uncertainty to facilitate the detection of such ambiguous situation".
- The Bayesian controller may be uncertain on the steering angle to apply (e.g. a
- The difference with
- One definition: "real-time decision confidence".
- This is the probability that the BNN controller is certain of its decision at the current time.
  - The notion of trust can therefore be introduced: the idea is to compute the probability mass in an `ε`-ball around the decision `π`(`observation`) and classify it as certain if the resulting probability is greater than a threshold.
    - It reminds me of the concept of `trust-region` optimisation in RL.
    - In extreme cases, all actions are equally distributed, `π`(`observation`) has a very high variance, the agent does not know what to do (no trust) and will randomly sample an action.
- How to get these estimates? Three Bayesian inference methods are compared:
  - Monte Carlo dropout (`MCD`).
  - Mean-field variational inference (`VI`).
  - Hamiltonian Monte Carlo (`HMC`).
- What to do with this information?
-
"This measure of uncertainty can be employed together with commonly employed measures of uncertainty, such as mutual information, to quantify in real time the degree that the model is confident in its predictions and can offer a notion of trust in its predictions."
  - I did not know about "mutual information" and liked the Wikipedia explanation of the link of `MI` to `entropy` and `KL-div`.
    - I am a little bit confused: in what I read, the `MI` is a function of two random variables. What are they here? The authors rather speak about the uncertainty exhibited by the predictive distribution.
- Depending on the uncertainty level, several actions are taken:
  - `mutual information warnings` slow down the vehicle.
  - `standard warnings` slow down the vehicle and alert the operator of a potential hazard.
  - `severe warnings` cause the car to safely brake and ask the operator to take back control.
-
- Another definition: "probabilistic safety", i.e. the probability that a BNN controller will keep the car "safe".
- Nice, but what is "safe"?
  - It all relies on the assumption that the expert demonstrations were all "safe", and measures how much of the trajectory belongs to this "safe set".
- I must admit I did not fully understand the measure on "safety" for some continuous trajectory and discrete demonstration set:
- A car can drive with a large lateral offset from the demonstration on a wide road while being "safe", while a thin lateral shift in a narrow street can lead to an "unsafe" situation.
- Not to mention that the scenario (e.g. configuration of obstacles) has probably changed in-between.
- This leads to the following point with an interesting application for scenario coverage.
- One idea: apply changes in scenery and weather conditions to evaluate model robustness.
- To check the generalization ability of a model, the safety analysis is re-run (offline) with other weather conditions.
- As noted in conclusion, this offline safety probability can be used as a guide for active learning in order to increase data coverage and scenario representation in training data.
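The `ε`-ball "real-time decision confidence" described above can be sketched from posterior-predictive samples (e.g. stochastic forward passes with MC dropout). The `ε` and threshold values below are my own choices, not the paper's:

```python
import numpy as np

def decision_confidence(samples, decision, eps=0.05, threshold=0.9):
    """Estimate the real-time decision confidence of a BNN controller.

    samples  : 1-D array of steering angles drawn from the posterior
               predictive (e.g. repeated stochastic forward passes)
    decision : the angle the controller would actually apply
    Returns (probability mass within the eps-ball, certain?).
    """
    mass = float(np.mean(np.abs(samples - decision) <= eps))
    return mass, mass >= threshold
```

On a unimodal posterior the mass around the mean is close to 1 (trust); on the `M`-shaped posterior of the obstacle example, the mass around the mean collapses and a warning/hand-over policy can be triggered.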
"Exploring the Limitations of Behavior Cloning for Autonomous Driving"
- [ `distributional shift problem`, `off-policy data collection`, `CARLA`, `conditional imitation learning`, `residual architecture`, `reproducibility issue`, `variance caused by initialization and sampling` ]
Click to expand
One figure:
Conditional Imitation Learning is extended with a ResNet architecture and Speed prediction (CILRS ). Source. |
Authors: Codevilla, F., Santana, E., Antonio, M. L., & Gaidon, A.
- One term: “CILRS” = Conditional Imitation Learning extended with a ResNet architecture and Speed prediction.
- One Q&A: How to include in E2E learning information about the destination, i.e. to disambiguate imitation around multiple types of intersections?
  - Add a high-level `navigational command` (e.g. take the next right, left, or stay in lane) to the tuple <`observation`, `expert action`> when building the dataset.
- One idea: learn to predict the ego speed (`mediated perception`) to address the inertia problem stemming from causal confusion (biased correlation between low speed and no acceleration: when the ego vehicle is stopped, e.g. at a red traffic light, the probability that it stays static is indeed overwhelming in the training data).
  - A good video about Causal Confusion in Imitation Learning shows that "access to more information leads to worse generalisation under distribution shift".
- Another idea: The off-policy (expert) driving demonstration is not produced by a human, but rather generated from an omniscient "AI" agent.
- One quote:
"The more common the vehicle model and color, the better the trained agent reacts to it. This raises ethical challenges in automated driving".
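The speed-prediction auxiliary task can be sketched as a combined training loss: besides imitating the action, the network must also regress the current ego speed from the image, so it cannot rely solely on the low-speed/no-acceleration shortcut. This is a sketch of the idea only; the loss weight and the use of squared error are my assumptions, not the paper's exact choices:

```python
import numpy as np

def cilrs_style_loss(pred_action, true_action, pred_speed, true_speed, lam=0.05):
    """Imitation loss with a speed-prediction auxiliary term,
    in the spirit of CILRS (lam value is my choice).

    total = action_imitation_loss + lam * speed_prediction_loss
    """
    action_loss = np.mean((np.asarray(pred_action) - np.asarray(true_action)) ** 2)
    speed_loss = np.mean((np.asarray(pred_speed) - np.asarray(true_speed)) ** 2)
    return action_loss + lam * speed_loss
```

The auxiliary term acts as a regularizer against causal confusion: a network that ignores the visual evidence of being stopped can no longer minimize the total loss.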
"Conditional Affordance Learning for Driving in Urban Environments"
- [ `2018` ] [📝] [🎞️] [🎞️] [] [ 🎓 `CVC, UAB, Barcelona` ] [ 🚗 `Toyota` ]
- [ `CARLA`, `end-to-mid`, `direct perception` ]
Click to expand
Some figures:
Examples of affordances, i.e. attributes of the environment which limit the space of allowed actions. A1 , A2 and A3 are predefined observation areas. Source. |
The presented direct perception method predicts a low-dimensional intermediate representation of the environment - affordance - which is then used in a conventional control algorithm. The affordance is conditioned for goal-directed navigation, i.e. before each intersection, it receives an instruction such as go straight , turn left or turn right . Source. |
The feature maps produced by a CNN feature extractor are stored in a memory and consumed by task-specific layers (one affordance has one task block). Every task block has its own specific temporal receptive field - it decides how much of the memory it needs. This figure also illustrates how the navigation command is used as switch between trained submodules. Source. |
Authors: Sauer, A., Savinov, N., & Geiger, A.
- One term: "Direct perception" (`DP`):
  - The goal of `DP` methods is to predict a low-dimensional intermediate representation of the environment, which is then used in a conventional control algorithm to manoeuvre the vehicle.
  - In this regard, `DP` could also be said `end-to-mid`. The mapping to learn is less complex than `end-to-end` (from raw input to controls).
  - `DP` is meant to combine the advantages of two other commonly-used approaches: modular pipelines (`MP`) and `end-to-end` methods such as imitation learning (`IL`) or model-free `RL`.
  - Ground-truth affordances are collected using `CARLA`. Several augmentations are performed.
- Related work on affordance learning and direct perception:
  - `Deepdriving`: Learning affordance for direct perception in autonomous driving, by (Chen, Seff, Kornhauser, & Xiao, 2015).
    - `Deepdriving` works on highways.
    - Here, the idea is extended to urban scenarios (with traffic signs, traffic lights, junctions), considering a sequence of images (not just one camera frame) for temporal information.
- One term: "Conditional Affordance Learning" (`CAL`):
  - "Conditional": The actions of the agent are conditioned on a high-level command given by the navigation system (the planner) prior to intersections. It describes the manoeuvre to be performed, e.g. `go straight`, `turn left`, `turn right`.
  - "Affordance": Affordances are one example of `DP` representation. They are attributes of the environment which limit the space of allowed actions. Only `6` affordances are used for `CARLA` urban driving:
    - `Distance to vehicle` (continuous).
    - `Relative angle` (continuous and conditional).
    - `Distance to centre-line` (continuous and conditional).
    - `Speed Sign` (discrete).
    - `Red Traffic Light` (discrete - binary).
    - `Hazard` (discrete - binary).
    - The `Class Weighted Cross Entropy` is the loss used for discrete affordances, to put more weight on rare but important occurrences (`hazard` occurs rarely compared to `traffic light red`).
  - "Learning": A single neural network trained with multi-task learning (`MTL`) predicts all affordances in a single forward pass (~`50ms`). It only takes a single front-facing camera view as input.
- About the controllers: The path-velocity decomposition is applied. Hence two controllers are used in parallel:
  - `1-` `throttle` and `brake`:
    - Based on the predicted affordances, a state is assigned in a rule-based manner among: `cruising`, `following`, `over limit`, `red light`, and `hazard stop` (all mutually exclusive).
    - Based on this state, the longitudinal control signals are derived, using `PID` or predefined threshold values.
    - It can handle traffic lights, speed signs and smooth car-following.
    - Note: The Supplementary Material provides detailed insights on controller tuning (especially `PID`) for `CARLA`.
  - `2-` `steering` is controlled by a Stanley Controller, based on two conditional affordances: `distance to centreline` and `relative angle`.
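A hedged sketch of the rule-based longitudinal logic described above; the state thresholds and the P-only gain are illustrative stand-ins for the paper's tuned PID, not its actual values:

```python
def longitudinal_state(aff):
    """Map predicted affordances to one mutually exclusive driving state."""
    if aff["hazard"]:
        return "hazard_stop"
    if aff["red_light"]:
        return "red_light"
    if aff["speed"] > aff["speed_limit"]:
        return "over_limit"
    if aff["distance_to_vehicle"] < 35.0:    # [m] illustrative threshold
        return "following"
    return "cruising"

def throttle_brake(state, aff, kp=0.1):
    """Derive (throttle, brake); a P-term only, for brevity (the paper uses PID)."""
    if state == "hazard_stop":
        return 0.0, 1.0                       # full brake
    if state in ("red_light", "over_limit"):
        return 0.0, 0.3                       # predefined moderate braking
    target = aff["speed_limit"] if state == "cruising" else aff["lead_speed"]
    error = target - aff["speed"]
    return max(0.0, kp * error), max(0.0, -kp * error)
```

The states are mutually exclusive by construction: the first matching rule wins, with `hazard_stop` taking priority.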
- One idea: I am often wondering what timeout I should set when testing a scenario with
CARLA
. The author computes this time based on the length of the pre-defined path (which is actually easily accessible):-
"The time limit equals the time needed to reach the goal when driving along the optimal path at
10 km/h
"
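This timeout rule is easy to reproduce (path length assumed in metres):

```python
def timeout_seconds(optimal_path_length_m, speed_kmh=10.0):
    """Time limit = time to reach the goal along the optimal path at 10 km/h."""
    return optimal_path_length_m / (speed_kmh / 3.6)   # km/h -> m/s

# e.g. a 500 m optimal route gives a 180 s time limit.
```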
- Another idea: Attention Analysis.
- For a better understanding of how affordances are constructed, the attention of the
CNN
is visualized using gradient-weighted class activation maps (Grad-CAMs
). - This "visual explanation" reminds me of another technique used in
end-to-end
approaches,VisualBackProp
, that highlights the image pixels which contributed the most to the final results.
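As a reminder of the Grad-CAM mechanics (a generic sketch, not the authors' code): channel activation maps from the last convolutional layer are weighted by their global-average-pooled gradients, then passed through a ReLU:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Minimal Grad-CAM over (C, H, W) activations/gradients of the last
    conv layer, taken for the output of interest."""
    weights = gradients.mean(axis=(1, 2))             # GAP over spatial dims
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0.0)                        # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalise to [0, 1]
    return cam

rng = np.random.default_rng(0)
acts = rng.random((8, 4, 4))      # toy activations
grads = rng.random((8, 4, 4))     # toy gradients
heatmap = grad_cam(acts, grads)   # (4, 4) attention map
```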
- Baselines and results:
- Compared to
CARLA
-based Modular Pipeline (MP
), Conditional Imitation Learning (CIL
) and Reinforcement Learning (RL
),CAL
particularly excels in generalizing to the new town.
- Where to provide the high-level navigation conditions?
- The authors find that "conditioning in the network has several advantages over conditioning in the controller".
- In addition, in the net, it is preferable to use the navigation command as switch between submodules rather than an input:
-
"We observed that training specialized submodules for each directional command leads to better performance compared to using the directional command as an additional input to the task networks".
"Variational Autoencoder for End-to-End Control of Autonomous Driving with Novelty Detection and Training De-biasing"
-
[
VAE
,uncertainty estimation
,sampling efficiency
,augmentation
]
Click to expand
Some figures:
One particular latent variable ŷ is explicitly supervised to predict steering control. Another interesting idea: augmentation is based on domain knowledge - if a model trained on centre-view images is given a left-view image, it should predict a correction to the right. Source. |
For each new image, empirical uncertainty estimates are computed by sampling from the variables of the latent space. These estimates lead to the D statistic that indicates whether an observed image is well captured by our trained model, i.e. novelty detection. Source. |
In a subsequent work, the VAE is conditioned onto the road topology. It serves multiple purposes such as localization and end-to-end navigation. The routed or unrouted map given as additional input goes toward the mid-to-end approach where processing is performed and/or external knowledge is embedded. Source. See this video for temporal evolution of the predictions. |
Authors: Amini, A., Schwarting, W., Rosman, G., Araki, B., Karaman, S., & Rus, D.
- One issue raised about vanilla
E2E
:- The lack of a measure of confidence associated with the prediction.
- The lack of interpretation of the learned features.
- Having said that, the authors present an approach to both understand and estimate the confidence of the output.
- The idea is to use a Variational Autoencoder (
VAE
), taking benefit of its intermediate latent representation which is learnt in an unsupervised way and provides uncertainty estimates for every variable in the latent space via their parameters.
- One idea for the
VAE
: one particular latent variable is explicitly supervised to predict steering control.- The loss function of the
VAE
has therefore3
parts:- A
reconstruction
loss:L1
-norm between the input image and the output image. - A
latent
loss:KL
-divergence between the latent variables and a unit Gaussian, providing regularization for the latent space. - A
supervised latent
loss:MSE
between the predicted and actual curvature of the vehicle’s path.
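The three terms above can be combined schematically as follows; the λ weights are placeholders and the KL term uses the standard closed form for a diagonal Gaussian against a unit Gaussian (the paper's exact weighting is not reproduced here):

```python
import numpy as np

def vae_loss(x, x_rec, mu, logvar, curv_pred, curv_true,
             lam_kl=1.0, lam_sup=1.0):
    """Reconstruction (L1) + KL regularisation + supervised-latent (MSE)."""
    recon = np.abs(x - x_rec).mean()                       # L1 reconstruction
    # KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian, averaged
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    supervised = np.mean((curv_pred - curv_true) ** 2)     # curvature MSE
    return recon + lam_kl * kl + lam_sup * supervised
```

A perfect reconstruction with a unit-Gaussian latent (`mu = 0`, `logvar = 0`) and exact curvature yields a loss of zero, as expected from each term.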
- One contribution: "Detection of novel events" (which have not been sufficiently trained for).
- To check if an observed image is well captured by the trained model, the idea is to propagate the
VAE
’s latent uncertainty through the decoder and compare the result with the original input. This is done by sampling (empirical uncertainty estimates). - The resulting pixel-wise expectation and variance are used to compute a sort of loss metric
D
(x
,ˆx
) whose distribution for the training-set is known (approximated with a histogram). - The image
x
is classified asnovel
if this statistic is outside of the95th
percentile of the training distribution and the prediction can finally be "untrusted to produce reliable outputs". -
"Our work presents an indicator to detect novel images that were not contained in the training distribution by weighting the reconstructed image by the latent uncertainty propagated through the network. High loss indicates that the model has not been trained on that type of image and thus reflects lower confidence in the network’s ability to generalize to that scenario."
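Once the statistic D(x, ˆx) is available, the 95th-percentile test itself is only a few lines (a generic sketch; the training statistics below are synthetic):

```python
import numpy as np

def fit_novelty_threshold(train_d_stats, percentile=95.0):
    """Store the chosen percentile of D(x, x_hat) seen on the training set."""
    return np.percentile(train_d_stats, percentile)

def is_novel(d_stat, threshold):
    """An image is flagged as novel when its statistic exceeds the threshold."""
    return d_stat > threshold

# Synthetic stand-in for the histogram of D over the training set.
train_d = np.random.default_rng(1).normal(1.0, 0.2, size=10_000)
thr = fit_novelty_threshold(train_d)
```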
- A second contribution: "Automated debiasing against learned biases".
- As for the novelty detection, it takes advantage of the latent space distribution and the possibility of sampling from the most representative regions of this space.
- Briefly said, the idea is to increase the proportion of rarer datapoints by dropping over-represented regions of the latent space to accelerate the training (sampling efficiency).
- This debiasing is not manually specified beforehand but based on learned latent variables.
- One reason to use single frame prediction (as opposed to
RNN
):-
"Note that only a single image is used as input at every time instant. This follows from original observations where models that were trained end-to-end with temporal information (
CNN
+LSTM
) are unable to decouple the underlying spatial information from the temporal control aspect. While these models perform well on test datasets, they face control feedback issues when placed on a physical vehicle and consistently drift off the road."
- One idea about augmentation (also met in the Behavioral Cloning Project of the Udacity Self-Driving Car Engineer Nanodegree):
-
"To inject domain knowledge into our network we augmented the dataset with images collected from cameras placed approximately
2
feet to the left and right of the main centre camera. We correspondingly changed the supervised control value to teach the model how to recover from off-centre positions."
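This augmentation can be sketched as follows; the correction value of `0.2` is a commonly used tunable choice in the Udacity project, not a number from the paper:

```python
def augment_side_camera(samples, correction=0.2):
    """Turn (centre, left, right) camera frames into extra training pairs.

    samples: iterable of dicts with 'center', 'left', 'right' images and
             a 'steering' label recorded for the centre camera.
    The left view gets a corrective steer to the right (+), and vice versa.
    """
    out = []
    for s in samples:
        out.append((s["center"], s["steering"]))
        out.append((s["left"],   s["steering"] + correction))
        out.append((s["right"],  s["steering"] - correction))
    return out
```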
- One note about the output:
-
"We refer to steering command interchangeably as the road curvature: the actual steering angle requires reasoning about road slip and control plant parameters that change between vehicles."
- Previous and further works:
- "Spatial Uncertainty Sampling for End-to-End control" - (Amini, Soleimany, Karaman, & Rus, 2018)
- "Variational End-to-End Navigation and Localization" - (Amini, Rosman, Karaman, & Rus, 2019)
- One idea: incorporate some coarse-grained roadmaps with raw perceptual data.
- Either unrouted (just containing the drivable roads).
Output
= continuous probability distribution over steering control. - Or routed (target road highlighted).
Output
= deterministic steering control to navigate.
- How to evaluate the continuous probability distribution over steering control given the human "scalar" demonstration?
-
"For a range of
z
-scores over the steering control distribution we compute the number of samples within the test set where the true (human) control output was within the predicted range."
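Assuming a Gaussian predicted steering distribution, this evaluation reduces to counting how many human labels fall inside mu ± z·sigma (a hedged sketch with toy numbers):

```python
import numpy as np

def coverage(mu, sigma, human, z):
    """Fraction of test samples whose human steering label falls inside
    the predicted interval mu +/- z * sigma."""
    inside = np.abs(human - mu) <= z * sigma
    return inside.mean()

# Toy per-sample predictions (mean, std) and human "scalar" demonstrations.
mu = np.array([0.0, 0.1, -0.2])
sigma = np.array([0.05, 0.05, 0.05])
human = np.array([0.02, 0.25, -0.21])
```

Sweeping `z` traces out a calibration curve: a well-calibrated model covers roughly the nominal Gaussian mass at each z-score.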
- About the training dataset:
25 km
of urban driving data.
"ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst"
Click to expand
Two figures:
Different layers composing the mid-level representation . Source. |
Training architecture around ChauffeurNet with the different loss terms, that can be grouped into environment and imitation losses. Source. |
Authors: Bansal, M., Krizhevsky, A., & Ogale, A.
- One term: "mid-level representation"
- The decision-making task (between
perception
andcontrol
) is packed into one single "learnable" module.- Input: the representation divided into several image-like layers:
Map features
such as lanes, stop signs, cross-walks...;Traffic lights
;Speed Limit
;Intended route
;Current agent box
;Dynamic objects
;Past agent poses
.- Such a representation is generic, i.e. independent of the number of dynamic objects and independent of the road geometry/topology.
- I discuss some equivalent representations seen at IV19.
- Output: a driving trajectory, i.e. the future poses recurrently predicted by the introduced
ChauffeurNet
model.
- Input: the representation divided into several image-like layers:
- This architecture lays between E2E (from
pixels
directly tocontrol
) and fully decomposed modular pipelines (decomposingplanning
in multiple modules). - Two notable advantages over E2E:
- It alleviates the burdens of learning perception and control:
- The desired trajectory is passed to a controls optimizer that takes care of creating the low-level control signals.
- Not to mention that different types of vehicles may possibly utilize different control outputs to achieve the same driving trajectory.
- Perturbations and input data from simulation are easier to generate.
- It alleviates the burdens of learning perception and control:
- The decision-making task (between
- One key finding: "pure imitation learning is not sufficient", despite the
60
days of continual driving (30 million
examples).- One quote about the "famous" distribution shift (deviation from the training distribution) in imitation learning:
"The key challenge is that we need to run the system closed-loop, where errors accumulate and induce a shift from the training distribution."
- The training data does not have any real collisions. How can the agent efficiently learn to avoid them if it has never been exposed during training?
- One solution consists in exposing the model to non-expert behaviours, such as collisions and off-road driving, and in adding extra loss functions.
- Going beyond vanilla cloning.
- Trajectory perturbation: Expose the learner to synthesized data in the form of perturbations to the expert’s driving (e.g. jitter the midpoint pose and heading).
- One idea for future works is to use more complex augmentations, e.g. with RL, especially for highly interactive scenarios.
- Past dropout: to prevent using the history to cheat by just extrapolating from the past rather than finding the underlying causes of the behaviour.
- Hence the concept of tweaking the training data in order to “simulate the bad rather than just imitate the good”.
- Trajectory perturbation: Expose the learner to synthesized data in the form of perturbations to the expert’s driving (e.g. jitter the midpoint pose and heading)
- Going beyond the vanilla imitation loss.
- Extend imitation losses.
- Add environment losses to discourage undesirable behaviour, e.g. measuring the overlap of predicted agent positions with the non-road regions.
- Use imitation dropout, i.e. sometimes favour the environment loss over the imitation loss.
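A minimal sketch of how "imitation dropout" could be implemented; the dropout probability and the simple additive combination are illustrative assumptions, not the paper's exact schedule:

```python
import random

def total_loss(imitation_loss, environment_loss, dropout_p=0.5, rng=random):
    """Imitation dropout: with probability dropout_p the imitation term is
    switched off, so the environment losses (collision, off-road, ...)
    dominate the gradient for that training step."""
    w_imit = 0.0 if rng.random() < dropout_p else 1.0
    return w_imit * imitation_loss + environment_loss
```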
"Imitating Driver Behavior with Generative Adversarial Networks"
-
[
2017
] [📝] [] [ 🎓Stanford
] -
[
adversarial learning
,distributional shift problem
,cascading errors
,IDM
,NGSIM
,rllab
]
Click to expand
Some figures:
The state consists of 51 features divided into 3 groups: The core features include hand-picked features such as Speed , Curvature and Lane Offset . The LIDAR-like beams capture the surrounding objects in a fixed-size representation independent of the number of vehicles. Finally, 3 binary indicator features identify when the ego vehicle encounters undesirable states - collision , drives off road , and travels in reverse . Source. |
As for common adversarial approaches, the objective function in GAIL includes some sigmoid cross entropy terms. The objective is to fit ψ for the discriminator. But this objective function is non-differentiable with respect to θ. One solution is to optimize πθ separately using RL. But what reward function should be used? In order to drive πθ into regions of the state-action space similar to those explored by the expert πE, a surrogate reward ˜r is generated from D_ψ based on samples, and TRPO is used to perform a policy update of πθ. Source. |
Authors: Kuefler, A., Morton, J., Wheeler, T., & Kochenderfer, M.
- One term: the problem of "cascading errors" in behavioural cloning (
BC
).BC
, which treatsIL
as a supervised learning problem, tries to fit a model to a fixed dataset of expert state-action pairs. In other words,BC
solves a regression problem in which the policy parameterization is obtained by maximizing the likelihood of the actions taken in the training data.- But inaccuracies can lead the stochastic policy to states that are underrepresented in the training data (e.g., an ego-vehicle edging towards the side of the road). And datasets rarely contain information about how human drivers behave in such situations.
- The policy network is then forced to generalize, and this can lead to yet poorer predictions, and ultimately to invalid or unseen situations (e.g., off-road driving).
- "Cascading Errors" refers to this problem where small inaccuracies compound during simulation and the agent cannot recover from them.
- This issue is inherent to sequential decision making.
- As found by the results:
-
"The root-weighted square error results show that the feedforward
BC
model has the best short-horizon performance, but then begins to accumulate error for longer time horizons." -
"Only
GAIL
(and of courseIDM
+MOBIL
) are able to stay on the road for extended stretches."
- One idea:
RL
provides robustness against "cascading errors".RL
maximizes the global, expected return on a trajectory, rather than local instructions for each observation. Hence more appropriate for sequential decision making.- Also, the reward function
r
(s_t
,a_t
) is defined for all state-action pairs, allowing an agent to receive a learning signal even from unusual states. And these signals can establish preferences between mildly undesirable behaviour (e.g., hard braking) and extremely undesirable behaviour (e.g., collisions).- In contrast,
BC
only receives a learning signal for those states represented in a labelled, finite dataset. - Because handcrafting an accurate
RL
reward function is often difficult,IRL
seems promising. In addition, the imitation (via the recovered reward function) extends to unseen states: e.g. a vehicle that is perturbed toward the lane boundaries should know to return toward the lane centre.
- Another idea: use
GAIL
instead ofIRL
:-
"IRL approaches are typically computationally expensive in their recovery of an expert cost function. Instead, recent work has attempted to imitate expert behaviour through direct policy optimization, without first learning a cost function."
- "Generative Adversarial Imitation Learning" (
GAIL
) implements this idea:-
"Expert behaviour can be imitated by training a policy to produce actions that a binary classifier mistakes for those of an expert."
-
"
GAIL
trains a policy to perform expert-like behaviour by rewarding it for “deceiving” a classifier trained to discriminate between policy and expert state-action pairs."
-
- One contribution is to extend
GAIL
to the optimization of recurrent neural networks (GRU
in this case).
-
- One concept: "Trust Region Policy Optimization".
- Policy-gradient
RL
optimization with "Trust Region" is used to optimize the agent's policyπθ
, addressing the issue of training instability of vanilla policy-gradient methods.-
"
TRPO
updates policy parameters through a constrained optimization procedure that enforces that a policy cannot change too much in a single update, and hence limits the damage that can be caused by noisy gradient estimates."
-
- But what reward function to apply? Again, we do not want to do
IRL
. - Some "surrogate" reward function is empirically derived from the discriminator. Although it may be quite different from the true reward function optimized by expert, it can be used to drive
πθ
into regions of the state-action space similar to those explored byπE
.
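One common form of such a surrogate reward (the convention here, with D the predicted probability of "expert", is an assumption; papers differ in sign conventions) is ˜r(s, a) = -log(1 - D_ψ(s, a)):

```python
import math

def surrogate_reward(d_out, eps=1e-8):
    """Reward the policy for state-action pairs the discriminator
    mistakes for expert data; d_out = D(s, a), probability of 'expert'.
    The eps clamp avoids log(0) when the discriminator saturates."""
    return -math.log(max(1.0 - d_out, eps))

# The more expert-like the pair looks (d_out -> 1), the larger the reward.
```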
- One finding: Should previous actions be included in the state
s
?-
"The previous action taken by the ego vehicle is not included in the set of features provided to the policies. We found that policies can develop an over-reliance on previous actions at the expense of relying on the other features contained in their input."
- But on the other hand, the authors find:
-
"The
GAIL
GRU
policy takes similar actions to humans, but oscillates between actions more than humans. For instance, rather than outputting a turn-rate of zero on straight road stretches, it will alternate between outputting small positive and negative turn-rates". -
"An engineered reward function could also be used to penalize the oscillations in acceleration and turn-rate produced by the
GAIL
GRU
".
-
- Some interesting interpretations about the
IDM
andMOBIL
driver models (resp. longitudinal and lateral control).- These commonly-used rule-based parametric models serve here as baselines:
-
"The Intelligent Driver Model (
IDM
) extended this work by capturing asymmetries between acceleration and deceleration, preferred free road and bumper-to-bumper headways, and realistic braking behaviour." -
"
MOBIL
maintains a utility function and 'politeness parameter' to capture intelligent driver behaviour in both acceleration and turning."