"A Survey on Probabilistic Planning and Temporal Scheduling with Safety Guarantees"
- [2020] [📝] [🎓 KU Leuven] - [(PO)MDP, probabilistic safety guarantees, chance constrained, safe-reachability objective]
Authors: Vermaelen, J., Dinh, H. T., & Holvoet, T.
- Motivations:
  - A review.
  - 1- Besides the main objective of a traditional (PO)MDP, i.e. maximizing the expected return, try to enforce [probabilistic] safety guarantees.
  - 2- Give illustrative examples for each concept.
    - Here the authors consider the deployment of a drone for inspection tasks on industrial sites.
- Note:
  - As I understand, an explicit model of the environment is required.
  - Hence no model-free RL seems directly applicable.
- Robust (PO)MDP.
  - Ideas:
    - The transition model is not perfect and may not be known exactly.
    - The derived policy and the expected return can be very sensitive to small changes in the transition probabilities.
    - "An uncertainty set τ(a,s) can be used to denote the uncertain transitions, rather than using known, fixed probabilities. Furthermore, probabilistic information regarding the unknown parameters can be considered, for example, using confidence regions. The resulting policy has to attain the highest worst-case performance over that confidence region."
  - Example:
    - The transition model of the MDP says that the drone's Lift action has a known and fixed probability of 0.8 to succeed and 0.2 to fail.
    - These probabilities were estimated from data gathered over previous flights.
    - These parameters become representative after sufficient flights, but some level of uncertainty remains.
    - A Robust (PO)MDP takes the confidence in the transition model, here the gathered flight data, into account.
  - Limitation:
    - It does not achieve explicit safety guarantees.
- Constrained: C-(PO)MDP.
  - Idea:
    - Constraints on secondary objectives.
    - "Apart from the traditional cumulative reward that is to be maximized, n other objectives are expressed, related to the n respective costs. On these secondary objectives, constraints are applicable."
    - "Such constraints can limit expected costs or utilization of certain resources to a maximum threshold."
  - Example 1:
    - Each action has a specific power consumption.
    - Secondary objective: the total power consumption.
    - Constraint: limit the expected power usage to a threshold smaller than the capacity of the battery.
  - Example 2:
    - Each action requires some time to execute.
    - Secondary objective: the total duration.
    - Constraint: limit the expected duration of the agent's total execution.
  - "It turns the problem into a constrained optimization problem. The resulting policy is still a policy that maximizes the overall reward, yet obeys [well, no hard constraints!] the posed constraints."
  - Limitations:
    - Simply assigning a high cost to states that correspond to a low battery level provides no guarantees.
    - The planner simply tries to avoid these states.
    - Lowering the penalty does not solve the problem either: the planner can switch from being too conservative to too risky.
- Chance Constrained: CC-(PO)MDP.
  - Why "chance"? Probabilistic constraints = chance constraints.
    - The objective is still to maximize the expected cumulative reward.
    - But the probability ("chance") of violating safety constraints is bounded.
  - Example: p(collision) should stay below 0.01.
    - "We can demand that the expected probability of the UAV showing such unsafe behavior remains below some arbitrarily low threshold."
  - Limitation:
    - "It is important to remark that an accurate model of the environment of the UAV and the effects of its actions is required to obtain precise guarantees."
    - In particular, the transition probabilities might not be straightforward to estimate but have to be known precisely.
  - "It has been observed that the C-MDP is more general than the CC-MDP. Any CC-MDP problem can be reduced to a C-MDP problem."
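To make the difference concrete, here is a minimal Monte Carlo sketch (my own toy example, not from the survey): a C-MDP bounds an expected cumulative cost, while a CC-MDP bounds the probability of the unsafe event itself, independently of how the cost function is scaled. All numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_episode_cost(p_crash=0.008, crash_cost=100.0, horizon=20):
    """Toy episode: each step incurs a small running cost; a crash adds a large one."""
    cost = 0.0
    for _ in range(horizon):
        cost += 1.0                      # running cost (e.g. time or energy)
        if rng.random() < p_crash:       # unsafe event
            return cost + crash_cost
    return cost

costs = np.array([sample_episode_cost() for _ in range(10_000)])

# C-MDP-style constraint: bound the EXPECTED cumulative cost.
print("E[cost] <= 30 ?", costs.mean() <= 30.0)

# CC-MDP-style constraint: bound the PROBABILITY of the unsafe event itself,
# e.g. p(crash) <= 0.01. Crashed episodes are those exceeding the crash-free maximum.
print("p(crash) <= 0.01 ?", (costs > 20.0).mean() <= 0.01)
```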
- Path Constrained: PC-(PO)MDP.
  - Motivation: combine a probabilistic model-checking aspect with a planning approach.
  - Idea:
    - Constraints on consecutive states.
    - Using Probabilistic Computation Tree Logic (PCTL), for instance.
  - Example:
    - In each possible path, no landing procedure should be initiated before all environment checks have been run successfully.
    - This temporal logic formula must hold with a probability of at least 0.99.
- Safe-Reachability Objectives: SRO-(PO)MDP.
  - Comparison with C-(PO)MDPs and CC-(PO)MDPs:
    - Their policies maximize the agent's expected cumulative reward while bounding an expected cost or risk.
    - They typically assign large reward values to goal states, to attract the agent, but there is no hard guarantee of reaching the goal state.
  - Motivation: satisfy a safe-reachability objective in all possible execution paths, that is, including worst-case scenarios.
  - Idea: enforce a double guarantee regarding:
    - 1- The behaviour of the UAV: the probability of visiting an unsafe state is always kept below some threshold.
    - 2- Its mission: a goal state is eventually reached with a probability above a certain threshold.
  - Example:
    - The drone reaches its destination with a minimum probability, while the probability of crashing is bounded.
- Risk Sensitive: RS-(PO)MDP.
  - Comparison: traditional (PO)MDPs minimize the cost itself.
  - Idea: risk = "possibility of something bad happening".
    - The policy should maximize the probability of the cumulative cost being below a certain predefined threshold.
  - Example:
    - Maximize the probability that "p(collision) stays below 0.01".
"Vehicle Control with Prediction Model Based Monte-Carlo Tree Search"
- [2020] [📝] [🎓 Seoul National University] - [highway, combine learning + planning, learnt transition model]
A prediction model is learnt from the NGSIM dataset and used as the transition model to estimate the next states of the nodes and the corresponding rewards. This contrasts with conventional MCTS formulations where the transition model is rule-based. Source.
Learning and planning are combined. An RL agent is first trained offline. Then, its value net helps during the evaluation stage while its policy net guides the selection phase of the MCTS. The trajectories sampled during the MCTS online rollouts are used to fine-tune the value net. Source.
Authors: Ha, T., Cho, K., Cha, G., Lee, K., & Oh, S.
- Motivations:
  - 1- Improve the transition model that predicts the next state from the current state and current action.
    - "In conventional MCTS methods, they assume that the next state of the agent can be calculated from a pre-defined deterministic model, or sampled by a self-play method."
    - Here the model is learnt offline.
  - 2- Combine learning and planning.
    - By using a pre-trained RL agent.
- state:
  - 1- An image. It represents the bird's-eye view of the scene.
  - 2- A vector. It represents the position and speed of the ego car.
  - 3- A set of vectors. It represents the positions and speeds of the neighbouring vehicles.
- action:
  - 35 combinations of predefined (acceleration, steer) pairs.
- How to predict the transitions?
  - A prediction network is trained on the NGSIM dataset.
    - Called "Multi-Agent Joint Trajectory Prediction Model".
    - "Deep Predictive Autonomous Driving Using Multi-Agent Joint Trajectory Prediction and Traffic Rules" - also detailed on this page.
    - Max number of surrounding vehicles = 6.
- How can a pre-trained RL agent be used to improve the MCTS?
  - It offers a value and a policy function.
    - 1- The value function evaluates the expanded nodes during the evaluation step.
    - 2- The policy function guides the selection phase of MCTS (see the sketch below).
      - A modified UCT: argmax is applied to Q + c*π*U.
        - π is the policy function.
        - The upper bound U is a function of the (node, action) visitation counter: a larger number of trials should give a smaller bound.
        - Here c=200. The weighting depends on the magnitude of Q, which depends on the range of r, with here γ=0.8.
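A minimal sketch of such a prior-guided selection rule. The exact form of U is not given in these notes, so a PUCT-like bound decaying with the visit count is assumed here; the `Node` container is hypothetical, an illustration rather than the authors' exact formula.

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    actions: list
    q: dict       # running mean return per action
    n: dict       # visit count per action
    prior: dict   # policy-net probability per action

def select_action(node: Node, c: float = 200.0) -> int:
    """Prior-guided UCT: argmax over Q(s,a) + c * pi(a|s) * U(s,a)."""
    n_total = sum(node.n.values())
    def score(a):
        # The bound U shrinks with repeated visits of the (node, action) pair,
        # so rarely tried actions with a high prior get explored.
        u = math.sqrt(n_total + 1.0) / (1.0 + node.n[a])
        return node.q[a] + c * node.prior[a] * u
    return max(node.actions, key=score)

# Toy usage with 3 of the 35 (acceleration, steer) actions:
node = Node(actions=[0, 1, 2],
            q={0: 0.5, 1: 0.4, 2: 0.0},
            n={0: 10, 1: 2, 2: 0},
            prior={0: 0.2, 1: 0.3, 2: 0.5})
print(select_action(node))  # -> 2 (unvisited action with a high prior)
```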
- How can online MCTS improve the pre-trained RL agent?
  - It is said above that the policy net guides the selection.
    - Trajectories are collected (from one node up to a termination node).
  - The value net can be fine-tuned using these sampled trajectories.
    - I.e. Monte Carlo methods, which are unbiased.
    - And this is on-policy since the experiences come from the policy net.
- Once the tree is constructed (here after 200 loops), how is the final decision made?
  - "In most MCTS algorithms, the final action is selected with the highest value of the Q-value, the visit count N, or a combination of them, Q−U. In this paper, we adopt the last method."
- Other detail for the selection phase:
  - Breadth-first search at the root node: the selection method first selects not-yet-visited actions until all the actions from the root node have been searched at least once.
- Benchmark:
  - It would have been interesting to use a version of IDM to model the reactions of the other vehicles.
"Driving Maneuvers Prediction Based Autonomous Driving Control by Deep Monte Carlo Tree Search"
- [2020] [📝] [🎓 University of Chengdu, University of Texas, San Diego State University] - [AlphaGo, end-to-end, combining planning + learning]
The MCTS search is assisted by two networks: a value net is learnt from the collected experiences and used for the expansion and evaluation steps; a second net learns to predict the transitions, enabling rollouts independently of the simulator. Source.
The expansion and simulation steps of the vanilla MCTS are merged, relying on a network that predicts the action probabilities and associated values based on the current state. Source.
Compared to AlphaGo, the evaluation of an expanded node is done using the value net only. No rollout with some fast rollout policy pπ is performed. Source.
How is the prior action probability used during the selection phase? Left: proposed approach. Right: AlphaGo. Source1 Source2.
Authors: Chen, J., Zhang, C., Luo, J., Xie, J., & Wan, Y.
- Close to model-based RL / learning-based planning, it uses the ideas of AlphaGo.
- Motivations:
  - 1- end-to-end: predict steering from the front-view image on a racing track.
    - Using the Udacity Self-driving Simulator (USS) and Torcs.
  - 2- Improve training efficiency, compared to model-free RL (DQN, DDPG) and imitation learning (IL).
    - "The training losses of [the proposed] AVP network, DQN, DDPG and IL converge at the training step 40,000, 80,000, 70,000 and 30,000, respectively. Hence, the AVP network improves 50.0% compared to DQN, 42.9% compared to DDPG in training efficiency and only loses 33.3% in training efficiency compared to IL." [Probably because AVP should predict the next state for all actions, not only the optimal ones.]
  - 3- Interpretability: predict manoeuvres by reconstructing multiple possible trajectories.
- Main idea: MCTS with some learnt modules.
  - "We employ deep-MCTS based on the asynchronous policy and value MCTS algorithm (APV-MCTS from AlphaGo), which contains three steps, by merging the expansion and simulation steps."
  - No info about the timestep.
- What transition model? How to predict s'?
  - A first net predicts the next state of the vehicle based on a given control action and the current state.
    - I.e. grey-scale image to grey-scale image prediction.
  - Loss: distance between images, e.g. L2.
- How to guide the tree expansion and make the evaluation step more efficient?
  - 1- By estimating a value function V(s) to complement/avoid rollouts.
  - 2- By estimating the action selection probabilities (i.e. a prior policy), used both for the initialization of new nodes and during the selection phase.
    - A second net predicts the action selection probabilities pt and the value vt of the current state ςt.
    - P(n) is the prior selection probability of node n.
- Metrics:
  - Stability of the driving trajectory: the mean deviation from the track centre (MDC).
  - Stability of the driving control: the mean continuity error (MCE).
  - No consideration of efficiency? E.g. speed.
- About AlphaGo:
  - Neural networks are used to reduce the effective depth and breadth of the search tree:
    - 1- Evaluating positions using a value network.
    - 2- Sampling actions using a policy network.
    - "First, the depth of the search is reduced by state evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s) that predicts the outcome from state s."
    - "Second, the breadth of the search is reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s."
  - Three learning steps:
    - [policy learnt from BC] "Supervised learning (SL) policy network, directly from expert human moves."
    - [policy improved by self-play] "Next, we train a reinforcement learning (RL) policy network that improves the SL policy network by optimizing the final outcome of games of self-play."
    - "Finally, we train a value network that predicts the winner of games played by the RL policy network against itself."
  - expansion phase:
    - The leaf state sL is processed just once by the SL policy network. The output probabilities are stored as prior probabilities P for each legal action.
      - This impacts the action selection: argmax [Q(s) + u(s)], where the bonus term u is proportional to that prior probability but decays with repeated visits, to encourage exploration.
    - The leaf node is evaluated in two very different ways, combined by a weighted sum (see the sketch below):
      - 1- By the value network vθ(sL).
      - 2- By the outcome of a random rollout played out until terminal step T using the fast rollout policy pπ.
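A minimal sketch of AlphaGo's mixed leaf evaluation; `v_theta` and `fast_rollout` are stand-ins for the two learnt components, and the mixing weight λ=0.5 is the value reported in the AlphaGo paper. As noted above, the paper discussed here drops the rollout term and keeps only the value net.

```python
def evaluate_leaf(s_leaf, v_theta, fast_rollout, lam=0.5):
    """AlphaGo-style leaf evaluation: mix the value-net estimate with the
    outcome z of one fast rollout: V(sL) = (1 - lam) * v_theta(sL) + lam * z."""
    z = fast_rollout(s_leaf)  # play out to a terminal state with p_pi
    return (1.0 - lam) * v_theta(s_leaf) + lam * z

# Toy usage with dummy stand-ins for the two learnt components:
print(evaluate_leaf("s", v_theta=lambda s: 0.2, fast_rollout=lambda s: 1.0))  # 0.6
```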
"Autonomous Driving at Intersections: A Critical-Turning-Point Approach for Planning and Decision Making"
- [2020] [📝] [🎓 Cogdrive lab at University of Waterloo] - [CTP, ABT, unprotected left turn, information gathering]
Left-turning behaviours at an unsignalized intersection can be modelled using the "Critical Turning Points" model (CTP). Before the CTP, the ego vehicle creeps forward to collect information and assess the situation. After the CTP, it turns and accelerates. The action space is therefore made 2-dimensional (acceleration and CTP). This improves the driving efficiency compared to just controlling the acceleration on a fixed pre-defined path, where the ego vehicle stops and waits for the critical zone to clear. Here, if one of the paths is blocked, it can keep moving forward and take the sharp turn at the next CTP. Source.
The nodes in the tree represent the beliefs. Each node is formed by a set of particles, where every particle is associated with a quadruple (s, a, o, r) obtained from previous episodes. The edge that connects the nodes b and b' in the tree is formed by an (action-observation) pair, meaning that a particle in the belief b performs an action a and reaches belief b' (i.e. b' = transition(b,a,o)), which perceives an observation o. The size of the tree, and thus the complexity of the search, depends on the discretization of the observation space. Here, it is parametrized with an accuracy variable κa. Source.
Author: Shu, K.
- A master thesis, with very good explanations about POMDP solvers!
- Related work: "Autonomous Driving at Intersections: A Critical-Turning-Point Approach for Left Turns" (Shu et al., 2020). See below.
- Motivations:
  - 1- Address left turns at unsignalized intersections.
    - "Left turning when the oncoming vehicles are not using turn signals is one of the most common and challenging tasks for planning at unsignalized intersections."
  - 2- Consider the unknown intentions of the oncoming vehicle to make less conservative decisions.
  - 3- Ensure real-time performance.
    - "An observation accuracy of 1 and a tree depth of 4 or 5 [2 or 2.5s] would be an ideal tree size for good performance in this test scenario."
    - "Complexity is influenced by the number of CTPs: the more CTPs that are used, the more candidate paths would be generated to be taken into account to find an appropriate behavior. [...] Here, 4 CTPs would be appropriate."
- About the architecture:
  - 1- Offline: select proper candidate paths (yes, multiple) for the particular intersection.
  - 2- Online: select one path and decide the speed on it.
    - "With the generated candidate left-turning paths, the left-turn problem is formulated as a POMDP problem."
    - "POMDPs provide a way to make decisions on the best sequences of actions to perform when the agent cannot fully observe the states of the environment; in our case, it is the intention of the surrounding [opposing] vehicles."
- How to model behaviours at intersections?
  - With a "Critical Turning Point" model (CTP).
  - From a dataset, the author finds that left-turn behaviours at intersections can be divided into two stages:
    - 1- A creeping forward phase, when the driver assesses the safety of the environment.
    - 2- An intense steering and acceleration phase, when the driver feels confident to drive through the potential collision area.
  - The physical point where the vehicle transitions from 1- to 2- is called a critical turning point (CTP).
    - Multiple CTPs exist for one intersection.
    - They can be extracted from or validated by naturalistic driving data, using the yaw rate, rather than the longitudinal speed, as the major criterion.
- How to avoid over-conservative decisions?
  - Enable information gathering by allowing slight modifications around the "turning" path, instead of just controlling the longitudinal acceleration on one fixed pre-defined path.
  - "Without the use of CTPs, the path of the ego vehicle is fixed; thus, when the Critical Zone is blocked by the oncoming vehicle, the ego vehicle stops and waits for the critical zone to clear before any acceleration can be executed."
  - "However, with the use of CTPs, if one of the paths is blocked, the vehicle has the option to keep moving forward and take the sharp turn at the next CTP."
- How to estimate intention?
  - With belief tracking, using particle filtering in the belief tree.
- About the POMDP formulation:
  - state:
    - 1- Travelled distance on the pre-defined path, in the Frenet–Serret frame.
    - 2- speed along that path.
    - 3- intention of the other vehicles, which includes going straight, making right or left turns.
  - observation:
    - Only the position and speed are observable.
    - The intention (route to follow) is unknown to the ego vehicle and must be estimated.
    - [Discretization of the space, sketched below] "An accuracy variable κa for the observation model is introduced. The states are multiplied by the accuracy variable first, then rounded to the closest integer and finally divided by κa. By tuning κa, the width of the search tree can be adjusted."
  - action:
    - 1- acceleration along the pre-defined path.
      - Discrete in [-4 m/s², 4 m/s²] with an interval of 1 m/s².
    - 2- Left-turning instruction - Boolean.
      - "When this parameter is set to 1, the ego vehicle will shift from the creeping forward phase to the sharp left-turning phase at the next CTP ahead."
  - reward:
    - (-) Crashing and going backward.
    - (+) Reaching the goal.
    - (+) Moving at the reference speed.
    - (+) Selecting the proper path.
    - "Finally, the key reward that enables the vehicle to select the appropriate path is set as Rm = 10y − 50x², which encourages the vehicle to move closer to the goal on both the x- and y-axis."
  - transition model:
    - All the surrounding vehicles are moving on their pre-defined paths with an uncertain acceleration/deceleration which follows a normal distribution.
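The κa observation discretization translates directly into code; a minimal sketch (function and variable names are mine):

```python
def discretize(x: float, kappa_a: float) -> float:
    """Observation discretization as quoted above: multiply by the accuracy
    variable, round to the closest integer, then divide by it again.
    A larger kappa_a keeps more distinct observation values (wider tree)."""
    return round(x * kappa_a) / kappa_a

# Toy usage: with kappa_a = 1, positions 12.3 and 12.4 fall into the same
# observation branch; with kappa_a = 10 they do not.
print(discretize(12.3, 1.0), discretize(12.4, 1.0))    # 12.0 12.0
print(discretize(12.3, 10.0), discretize(12.4, 10.0))  # 12.3 12.4
```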
- About the decision frequency:
  - "[Due to] the rapidly changing nature of the intersection scenarios [...] the frequency of decision-making needs to be as high as possible."
  - Here, working at 2Hz.
- About the POMDP online solver:
  - "Among all the MCTS methods, the Adaptive Belief Tree (ABT) is implemented. This solver uses MCTS to generate a policy tree, then the action with the highest reward is executed and the belief is updated using a particle filter. After that, instead of discarding the entire policy tree as the classical POMCP online solver does, ABT only identifies and trims the parts of the policy tree that are influenced by the change of the model (e.g. change of action space, transition model, observation model, etc.), then revises the influenced part of the policy tree and improves it if there is still time left in that time cycle. Since most of the policy tree is saved after each time step, the search efficiency is boosted."
"POMDP Autonomous Vehicle Visibility Reasoning"
Left: the MODIA framework consists of two small POMDP decision problems (DPs) that are solved offline: one to deal with another car at an intersection, another to deal with an obstacle. When vehicles are perceived online, DPs are instantiated as decision components (DCs). DCs recommend an action at specific arbitration points along the route, with conflicts resolved by an executor arbitration function (e.g., take the safest action). Virtual vehicles, imagined just outside of the field of view, also instantiate DCs to allow reasoning about possible imperceptible vehicles. Right (from MODIA): the policy seems to be derived iteratively, with α-vectors. This is possible since only 128 states are defined. Source.
Authors: Wray, K. H., Lange, B., Jamgochian, A., Witwicki, S. J., Kobashi, A., Hagaribommanahalli, S., & Ilstrup, D.
- Motivations:
  - 1- Decision-making under limited visibility.
    - "To maximize safety, AVs must reason about these 'known unknowns' [they are capable of detecting their own limited visibility] and intelligently make decisions for when to go, stop, or edge forward slowly for visibility when entering an occluded T-intersection or passing an obstacle in the road."
    - "POMDPs provide a powerful model for sequential decision-making under limited visibility, sensor noise, and other forms of uncertainty / known sensor limitations, through their probabilistic model of observing other vehicles. They enable belief-based reasoning about a vehicle's existence even if it has never actually been perceived by perception."
  - 2- Scalability: multiple vehicles and different scenarios.
    - [issue] "Single monolithic POMDPs for AV decision-making define a state space for up to some maximum number of possible vehicles."
  - 3- Real world. Unfortunately, no video is provided.
- MODIA: for scalability.
  - From (Wray, Witwicki, & Zilberstein, 2017): "Online Decision-Making for Scalable Autonomous Systems". Slides.
  - It is a framework for multiple online decision-components with interacting actions.
  - The main idea is scene decomposition, i.e. dividing the problem into canonical decision-making subproblems and solving each of them separately.
    - This idea has already been applied in multiple papers, including patents: "RL with scene decomposition for navigating complex environments" from Honda and Stanford (related paper).
  - "In MODIA, a collection of possible decision-problems (DPs), known a priori, are instantiated online and executed as decision-components (DCs), unknown a priori."
  - Two decision-making problems (DPs) are considered:
    - DP.1: negotiating with a single vehicle at a T-intersection.
    - DP.2: passing a single obstacle [not detailed here??].
  - Multiple instances of each DP are created:
    - For each perceived (or virtual) vehicle, a DC is instantiated as a copy of one of the two original DPs, including its policy that has been derived offline.
    - Each created DC makes decision proposals. An executor arbitration function aggregates these proposals to produce the one action to be performed.
      - "Here we consider an executor arbitration function that selects the most conservative recommendation. That is, stop is preferred to edge, which is preferred to go."
    - At each time step, each DC [j] (instantiated from DP [i]) obtains its belief and arbitration point from the monitor M [i].
- The two DPs are formulated as POMDPs.
  - They are solved offline.
  - How is belief tracking done?
    - b is updated to b' with: b'(s') = η · O(a, s', ω) · Σs [T(s, a, s') · b(s)].
  - How is the policy derived? No detail here. But from the MODIA paper (see the sketch below):
    - "Since Vπ is piecewise linear and convex, we describe it using sets of α-vectors Γ = {α1, ..., αr}, with each αi = [αi(s1), ..., αi(sn)]T and αi(s) denoting the value of state s ∈ S. The objective is to find the optimal policy π∗ that maximizes V, denoted as V∗. Given an initial belief b0, V∗ can be iteratively computed for a time step t, expanding beliefs at each update, resulting in belief b."
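A minimal numpy sketch of this belief update and of evaluating a belief against a set of α-vectors; the array-shape conventions are my own.

```python
import numpy as np

def belief_update(b, a, omega, T, O):
    """Discrete Bayesian belief update, matching the formula above:
    b'(s') = eta * O[a, s', omega] * sum_s T[s, a, s'] * b(s).
    T has shape (S, A, S); O has shape (A, S, Omega)."""
    b_next = O[a, :, omega] * (b @ T[:, a, :])
    return b_next / b_next.sum()  # eta: normalization constant

def value_of_belief(b, alphas):
    """Value of a belief under an alpha-vector set: V(b) = max_i <alpha_i, b>."""
    return max(float(alpha @ b) for alpha in alphas)

# Toy usage: 2 states, 1 action, 2 observations.
T = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])   # T[s, a, s']
O = np.array([[[0.8, 0.2], [0.3, 0.7]]])     # O[a, s', omega]
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, omega=1, T=T, O=O))
print(value_of_belief(b, [np.array([0.0, -10.0]), np.array([-1.0, -1.0])]))
```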
- They share the same action space:
  - {stop, edge forward, go} decisions, which control the motion along the trajectory.
  - How can the ego car decide to overtake an obstacle, as illustrated by DC4 in the figure, if the trajectory is fixed and the decisions only control the longitudinal motion?
  - "The low-level trajectory control uses these points as an input to its constrained-optimization continual planner."
- state for DP.1: discrete and abstracted!
  - 1- Ego location: {before-stop, at-stop, before-gap, at-gap, goal, terminal}. I don't understand what gap means here.
  - 2- Other's location: {before-stop, at-stop, before-gap, at-gap, goal, empty}.
  - 3- Ego time at the location: {short, long}.
  - 4- Existence of a gap when the AV arrives: {yes, no}. I am confused by the inclusion of the temporality. What happens if the ego car waits and the gap disappears? Is it still true?
  - In total 6*6*2*2 = 144 states (the figure above says 128). Either way, this can be stored in a table and does not require a function approximator.
  - The high abstraction of the state space can be problematic here:
    - 1- Precision is required, especially when driving close to other objects.
    - 2- Some transitions may become impossible, depending on the timestep used.
- observation (I must say I don't understand their relevance. Shouldn't it be the locations with noise added, for instance?):
  - 1- Whether the ego car successfully moved: {yes, no}. Why "successfully"?
  - 2- Whether the other vehicle successfully moved: {yes, no}.
  - 3- Whether a gap is detected: {yes, no}.
- reward for DP.1:
  - 0 for the goal state, −1000 for any other terminal state, and −1 for all other (s, a) pairs.
  - That makes DP.1 episodic.
- transition function for DP.1:
  - I don't understand. More details needed (especially about "etc.")! What distributions, for instance?
  - "It multiplies parameterized probabilities of quantifiable events within the state-action space including: (1) the AV and/or other vehicle moving forward, (2) time incrementing, (3) entering a terminal state, (4) the other vehicle slowing down for a crossing AV, (5) the gap's existence toggling based on other vehicle movement, etc."
- observation function for DP.1:
  - [More details would have been appreciated] "It multiplies parameterized probabilities including: (1) localizing correctly within the AV's location state factor, (2) correctly detecting the other vehicle that does exist, (3) correctly matching the other vehicle to its location state factor, (4) observing the terminal or goal state, (5) detecting the gap correctly based on predictions, etc."
- What about the timestep?
  - If too large, the ego car cannot react. If too small, some transitions between states are impossible and some states become absorbing.
- What about DP.2? It is not detailed.
- How to deal with occlusion?
  - Virtual vehicles are instantiated.
  - "We create virtual vehicles at the edge of the detectable visibility range along all lanes. For example, at a T-intersection, two virtual-vehicle DCs will always exist, one for each incoming lane."
"Improving Automated Driving through Planning with Human Internal States"
- [2020] [📝] [🎓 FAU Erlangen] [🚗 AUDI] - [ABT, TAPIR]
Top-left: scenario: the agent cannot recognize whether the obstacle is traversable (road blocked, obstacle exists, ep=1; or not, ep=0) until getting closer. The position of the obstacle is described by a continuous variable. Bottom-left: distribution used for the observation model: getting closer increases the probability of a true positive measurement, but also of false positives. Very close to the obstacle, it gets detected and correctly classified with high probability. Middle: example of a search in a simplified belief tree with two actions: the distance of the potential object is known, reducing the problem to a binary case with a uniform initial belief (50/50). Right: example of trajectories obtained from the online search: the action that maximizes the approximated Q-function is selected at each step. Here, future rewards are not discounted: γ=1. Source.
Given a tree, how to compute the Q(b, a) values? Left: estimate the expected return by averaging over all episodes passing through this (b, a) tuple. This leads to a conservative behaviour because the outer sum also includes episodes that act sub-optimally (unnecessarily crash into the obstacle), since ABT tries all actions when hitting a new belief. Therefore, the authors switched to the max-option implemented in TAPIR (right). It does not include suboptimal episodes: taking action a and acting optimally thereafter. Source.
Left: influence of the c-UCT parameter balancing exploration / exploitation. Stochasticity in the observation emission and in the tree construction makes results vary, so 50 simulations are performed each time. Middle: illustration of the depletion problem: because ABT only keeps particles that took the same action and received the same observation, particles inherently get lost. This can be seen at 20s: the agent receives its first positive measurement, and at this point hardly any particles are close to the observed position. Therefore, emergency resampling is started, which creates particles close to the observed location (in red). This impacts the efficiency as the tree has to be re-calculated almost from scratch. Note the discretisation with the initial uniform distribution. Right: discretization is proposed to deal with the continuous observation space. Source.
The expected behaviour is to slow down in front of the potential obstacle and, once certain that there is no threat, accelerate again. A too low UCT-factor leads to too optimistic (exploitation) behaviour: it can plan for up to 20s but ignores certain dangerous branches of the tree. Increasing the UCT factor makes it more conservative, but a too high c-UCT may perform too much exploration, leading to a too short horizon and also crashes (unable to react). Top-right: conservative estimation of the Q-values. Source.
Authors: Bey, H., Tratz, M., Sackmann, M., & Lange, A.
- Motivations:
  - 1- Detail how the POMDP online solver ABT works.
  - 2- Using a simple example, illustrate the "difficulties that have to be overcome when trying to define and solve a real-world POMDP problem".
    - Well, "real-world" means playing with the algorithm and defining the POMDP spaces and models.
      - But no consideration of real driving or real cars: no action delay, accelerations as actions, and a 1s time-step.
- Solver-specific pitfalls and impact of parameters:
  - 1- The UCT-factor, balancing exploration versus exploitation.
  - 2- Estimation of the Q-value function.
    - Continuous hidden states and observations make it hard.
  - 3- Particle depletion in the particle filter.
  - 4- Degenerated tree and space discretization.
  - 5- Planning horizon and heuristic / default policy.
- About the scenario:
  - "The expected behavior is for the vehicle to slow down until it is certain about the obstacle's state and then to either stop or accelerate again."
- About ABT (Kurniawati and Yadav, 2016). ABT = Adaptive Belief Tree.
  - It is implemented in the "Toolkit for Approximating and Adapting POMDP Solutions in Real Time" (TAPIR) software [:octocat:].
  - Family of online + sampling-based solvers:
    - [approximation] "For more difficult problems it is often desired to reach a good enough solution in finite time rather than spending much more time in search of the optimum."
    - The belief (probability distribution over possible states) is represented by a set of sampled states within a particle filter.
    - They are anytime-capable, meaning that they improve the policy as long as they are given time.
    - ABT differs from the Determinized Sparse Partially Observable Tree (DESPOT) and Partially Observable Monte-Carlo Planning (POMCP) in that it is able to keep the tree and only modify it if the model changes (hence Adaptive), whereas the others have to start from scratch.
- About particle depletion (especially with continuous states):
  - "Particles explored with a different action at the first step or having received wrong observations are lost. Eventually, after several time steps no particle is left that fits the actual observation."
  - "The reason being, that the ABT does not perform regular resampling as it tries to keep the subtree, while moving or generating new particles would mean to lose their episodes. This behaviour is opposed to the DESPOT solver which uses an independent conventional particle filtering including importance resampling."
  - [solution] "ABT offers a default implementation to replenish particles. It tries to 'force' particles from the previous belief into the new one by always choosing the action actually performed and 'hoping' to receive the correct observation, until a minimum number of particles (1000) is reached at the new root."
- About the infinite branching factor (in continuous spaces):
  - "Whenever a new observation is made, a new observation branch and subsequent belief are opened up. In case of continuous observations this would mean that each particle lands in its own belief, degenerating the tree just after one step."
  - Solutions (see the sketch below):
    - 1- Progressive widening: limits the number of available observations at each action node, forcing episodes to run deeper into the tree.
    - 2- Space discretization: two observations are considered equal if a certain distance function falls below a pre-set threshold. In that case, the episode is routed into the same belief.
      - Small threshold -> fine-grained -> many belief nodes -> shallow search tree.
      - Larger threshold -> coarse-grained -> fewer observation branches -> deeper trees. But risk:
        - "If a belief stretches over a large section, there may be trajectories passing through that belief and not crashing, even though the obstacle lies within that section. [...] The belief may receive a good value estimate, making it attractive, even though it is dangerous."
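A minimal sketch of the distance-based discretization (solution 2- above), in a 1-D observation space; names and setting are my own illustration.

```python
def route_observation(obs: float, node_branches: list, threshold: float) -> float:
    """Reuse an existing observation branch if the new observation is closer
    than the threshold, otherwise open a new branch. The returned value is the
    representative observation of the branch the episode is routed into."""
    for branch_obs in node_branches:
        if abs(obs - branch_obs) < threshold:
            return branch_obs          # route the episode into the same belief
    node_branches.append(obs)          # open a new observation branch
    return obs

# Toy usage: with threshold 0.5, observations 10.1 and 10.3 share one belief.
branches = []
for o in [10.1, 10.3, 11.2]:
    print(route_observation(o, branches, threshold=0.5))
# -> 10.1, 10.1, 11.2
```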
- A heuristic function as a first estimate for a belief value:
  - Can be seen as the default policy (as opposed to the tree policy) in MCTS, used for the rollouts in the simulation step.
  - Examples:
    - 1- The heuristic simply returns zero.
      - If the rewards are constructed to be always negative, this leads to optimistic estimates.
    - 2- An IDM model is used.
      - "This helps to detect beliefs that inevitably crash into the obstacle and should be avoided; the horizon is artificially prolonged."
- Directions for future works:
  - 1- Speed up the search algorithm by parallelizing it. For instance, "HyP-DESPOT: A hybrid parallel algorithm for online planning under uncertainty" (Cai et al., 2018).
  - 2- Better handle continuous observations.
"Improving Automated Driving through Planning with Human Internal States"
- [2020] [📝] [🎓 Stanford] - [MCTS-DPW, Q-MDP, internal state]
Bottom-left: surrounding cars are modelled with parametric IDM and MOBIL models. These parameters are called the internal state and are not observable. The agent uses a belief tracker to estimate them. Top: the action space has size 10: 9 combinations of {change-left, change-right, stay} x {accelerate, maintain, slow-down}, and one dynamically determined braking action, computed based on the speed and position of the vehicle ahead. Source.
The authors compare different solvers based on their Pareto fronts in this multi-objective problem. All approaches are variants of Monte Carlo tree search (MCTS) with double progressive widening (DPW). An approximation consists in assuming that the parameters of the internal state are correlated (middle). In this case, only the "type" of behaviour {Timid, Normal, Aggressive} must be estimated, making the belief tracking and inference simpler: dim=1 compared to dim=8. Source.
Two kinds of robustness are considered for the internal state estimation: how much the parameters are correlated, and how diverse their values can be (domain size). Source.
Authors: Sunberg, Z., & Kochenderfer, M.
- Motivations:
  - "Our hypothesis is that planning techniques that consider internal states such as intentions and dispositions of other drivers can simultaneously improve safety and efficiency."
- About uncertainties:
  - 1- Epistemic uncertainty: can be reduced through learning.
  - 2- Aleatory uncertainty: cannot be reduced with any kind of knowledge.
  - "The internal state (e.g., intentions and aggressiveness) of other drivers and road users can only be indirectly inferred." Internal-state uncertainty can thus be said to be epistemic.
  - "POMDP planners model human drivers' internal states as epistemic uncertainty, while the MDP methods only consider aleatory uncertainty."
- About (PO)MDP solvers:
  - [Only approximate planning approaches are considered] "Though POMDPs are very powerful in terms of expression, even the class of finite-horizon POMDPs is PSPACE-complete, indicating that it is unlikely that efficient general exact algorithms for large problems will be discovered."
  - Only online solutions are considered here, and all the approaches use variants of Monte Carlo tree search (MCTS) with double progressive widening (DPW) - as opposed to value iteration methods, for instance.
    - Main idea of MCTS-DPW: restrict the search to relevant regions of the tree (see the sketch below).
    - "DPW further focuses computation by considering only a limited, but gradually increasing, number of sampled states or observations."
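The widening test itself is simple; here is a sketch of the standard DPW criterion in its generic form (Couëtoux et al., 2011), not necessarily the paper's exact parametrization.

```python
def dpw_allows_new_child(n_visits: int, n_children: int, k: float = 1.0,
                         alpha: float = 0.5) -> bool:
    """Double-progressive-widening criterion: a new child (sampled state or
    observation) may be added to a node only while the child count stays below
    k * N^alpha. Otherwise an existing child is revisited, which bounds the
    branching factor and lets it grow slowly with the visit count N.
    k and alpha are tuning knobs."""
    return n_children <= k * n_visits ** alpha

# Toy usage: after 100 visits with k=1, alpha=0.5, at most ~10 children.
print(dpw_allows_new_child(n_visits=100, n_children=5))   # True
print(dpw_allows_new_child(n_visits=100, n_children=12))  # False
```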
- About the driver models:
  - [Transition function in the (PO)MDP formulation] "The T function is implicitly defined by a generative model that consists of a state transition function, F(·), and a stochastic noise process."
  - The parametric IDM and MOBIL models are used (an IDM sketch follows this list).
    - Example of parameters: the MOBIL parameter p in [0, 1] is the politeness factor, which represents how much the driver values allowing other vehicles to increase their acceleration.
  - Their parameters are used to define the internal state.
  - Limitations:
    - "The primary weakness of this investigation is the model of the other drivers. Since the IDM and MOBIL models were developed to simulate large-scale traffic flow, simulations with these models may not be accurate. Incorporating models learned from data would further validate the conclusions drawn here."
    - The authors also point out that these models neglect the dynamic intentions of other drivers.
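For reference, a sketch of the standard IDM acceleration law; the parameter values below are common textbook defaults, not the paper's. In the paper, several of these parameters (e.g. desired speed v0, desired time gap T) form the hidden internal state to be estimated.

```python
import math

def idm_acceleration(v, v_lead, gap, v0=30.0, T=1.5, a_max=1.4, b=2.0,
                     delta=4.0, s0=2.0):
    """Intelligent Driver Model (IDM), standard formulation:
      a  = a_max * [1 - (v/v0)^delta - (s*/gap)^2]
      s* = s0 + max(0, v*T + v*(v - v_lead) / (2*sqrt(a_max*b)))
    v: own speed, v_lead: leader speed, gap: bumper-to-bumper distance."""
    dv = v - v_lead
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)

# Toy usage: closing in on a much slower leader only 20 m ahead
# -> strongly negative acceleration (hard braking).
print(idm_acceleration(v=25.0, v_lead=15.0, gap=20.0))
```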
- About the correlation of the parameter distributions in the internal states:
  - The internal state consists of many parameters (from IDM and MOBIL): e.g. desired deceleration, desired speed, desired time gap, politeness factor...
  - 1- All parameters could be assumed perfectly correlated.
    - An aggressiveness class can be defined, {Timid, Normal, Aggressive}, with defined values for each parameter.
    - In this case, a particle consists of only a single value: dimension = 1.
  - 2- Parameters could be assumed uncorrelated and must be estimated jointly.
    - A particle consists of the values of all model parameters: dimension = 8.
  - 3- A partial correlation is also investigated.
  - Robustness analysis 1: simulation models with varying levels of correlation are tested:
    - "When the parameters are fully correlated, all of the parameters are easy to estimate by observing only a few, so there is not a significant performance gap between MSM, QMDP, and POMCPOW."
    - "On the other hand, when the parameters are uncorrelated, the QMDP planner performs much better than the certainty equivalence planner [Mean state MDP], and POMCPOW performs much better than QMDP."
  - Robustness analysis 2: the domain from which the parameters are drawn is expanded by a variable factor.
    - For instance, the width of the interval (e.g. [0.5s, 2.5s]) from which the marginal distribution of the desired time gap parameter is drawn is varied.
    - "Robustness to inaccuracy in the parameter domain is one-sided: when the true domain is larger than that assumed by the planners, performance is adversely affected, but when the true domain is smaller, there is no degradation."
  - Conclusion for human-driver modelling:
    - "Simple solvers are enough if most human driver behaviour can be correlated with easily measurable quantities."
    - "Otherwise, need more advanced planners that carry the uncertainty further into the future."
- About the observation filtering:
  - The belief at a given time consists of:
    - 1- The exactly known physical state.
    - 2- A collection of particles, along with associated weights, representing possible internal states.
      - The online estimation of the internal state θ (the driver model parameters) is accomplished with a particle filter.
  - To update the belief when an action is taken (see the sketch below):
    - 1- New particles are sampled with probability proportional to their weights.
    - 2- Sampled noise values and the state transition function are used to generate new states.
    - 3- The new weights are determined by approximating the conditional probability of the particle given the observation.
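A minimal sketch of these three steps on a toy scalar internal state; `propagate` and `likelihood` are hypothetical stand-ins for the problem-specific transition function F(·) plus noise, and for p(observation | particle).

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, weights, propagate, likelihood, obs):
    """One update of the internal-state particle filter, following the
    three steps listed above."""
    n = len(particles)
    # 1- Resample proportionally to the weights.
    idx = rng.choice(n, size=n, p=weights / weights.sum())
    particles = particles[idx]
    # 2- Propagate through the (noisy) transition function.
    particles = propagate(particles)
    # 3- Re-weight by the observation likelihood.
    weights = likelihood(particles, obs)
    return particles, weights

# Toy usage: estimate a scalar parameter observed with Gaussian noise.
parts = rng.uniform(0.0, 4.0, size=500)     # e.g. desired time gap
w = np.ones(500)
for obs in [2.1, 1.9, 2.0]:
    parts, w = particle_filter_step(
        parts, w,
        propagate=lambda p: p + rng.normal(0.0, 0.05, size=p.shape),
        likelihood=lambda p, o: np.exp(-0.5 * ((p - o) / 0.3) ** 2),
        obs=obs)
print((parts * w / w.sum()).sum())          # posterior mean, close to 2.0
```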
- Approximate planning approaches:
  - 1 + 2: Baselines, representing ways to force the epistemic uncertainty of the POMDP into an MDP formulation.
    - 1- Normal: all drivers' parameters are assumed to be known and identical.
      - No belief tracking is needed: an MDP is formulated.
      - It is over-optimistic since it assumes to know the internal states of the other drivers.
      - Over-confident: it is able to reach the goal a large proportion of the time, but it causes many safety violations.
    - 2- Naive MDP: the internal states are considered as random variables, independent at each timestep, i.e. assumed to be simply aleatoric.
      - No belief tracking can be done: an MDP with a state consisting only of the physical state is formulated.
      - It is conservative since it plans pessimistically, assuming it can learn nothing new about the drivers.
      - Over-cautious: it can attain a high level of safety, but it is never able to meet the goal more than 80% of the time.
  - 3 + 4 + 5: Online estimation of the internal state θ using a particle filter.
  - 3 + 4: Passive online learning. Optimistic assumption, hence systematically suboptimal.
    - "These approximations are useful in many domains because they are much easier to compute than the full POMDP solution since they require only the solution to the fully observable MDP."
    - There is no incentive for learning about the state, and hence these policies will not take costly actions for "active learning" / "information gathering".
    - 3- Mean state MDP.
      - At each timestep, a fully observable MDP is constructed using the mean internal state values: s = E(s∼b)[s].
      - "This approach is sometimes referred to as certainty equivalence control."
      - It is over-confident (achieving a high success rate, but sacrificing safety) because it plans without any internal-state uncertainty.
    - 4- QMDP: a hypothetical problem with partial observability at the current step that subsequently becomes fully observable (see the sketch after this list).
      - "Another natural approach to finding Q functions for POMDPs is to make use of the Q values of the underlying MDP. That is, we can temporarily ignore the observation model and find the Q-MDP(s, a) values for the MDP consisting of the transitions and rewards only." [(Littman et al., 1994)]
      - "With the Q-MDP values in hand, we can treat all the Q-MDP values for each action as a single linear function and estimate the Q value for a belief state b. This estimate amounts to assuming that any uncertainty in the agent's current belief state will be gone after the next action." [(Littman et al., 1994)]
      - The selected action maximizes the expected Q-MDP value for the current belief: Q(b, a) = Σs b(s)·Q-MDP(s, a) (a single linear function).
      - Here, the Q-MDP values are estimated through MCTS-DPW instead of value iteration.
      - It performs better than the mean state MDP because it considers samples from the entire estimated internal-state distribution when planning.
    - 5- POMCPOW.
      - "Since full Bayesian belief updates are computationally expensive, the POMCPOW algorithm extends MCTS to include approximate beliefs represented by weighted particle collections that are gradually improved as the tree is searched."
      - Since the vehicle does not have to take costly information-gathering actions to accomplish its goal, POMCPOW only outperforms QMDP in certain cases.
  - 6- An omniscient upper-bound planner.
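A minimal sketch of the QMDP action selection described in 4- above, with toy numbers; in the paper the Q-MDP values come from MCTS-DPW rather than a table.

```python
import numpy as np

def qmdp_action(belief, q_mdp):
    """QMDP action selection, matching the formula above:
    Q(b, a) = sum_s b(s) * Q_MDP(s, a), then argmax over actions.
    `q_mdp` has shape (S, A)."""
    q_b = belief @ q_mdp          # expected Q-MDP value per action
    return int(np.argmax(q_b))

# Toy usage: 2 internal states (timid / aggressive driver), 2 actions.
q_mdp = np.array([[ 1.0, -5.0],   # state 0: action 0 is safe
                  [-2.0,  3.0]])  # state 1: action 1 pays off
print(qmdp_action(np.array([0.9, 0.1]), q_mdp))  # -> 0
print(qmdp_action(np.array([0.2, 0.8]), q_mdp))  # -> 1
```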
- How to compare approaches on this multi-objective problem?
  - Two criteria and terms in the reward function:
    - 1- Safety: any situation in which any human-driven or autonomous vehicle has to brake hard to avoid a collision is marked "unsafe".
    - 2- Efficiency: accomplishing a goal with minimum resource use (time).
  - "At first, it may seem that safety is a strictly higher priority than efficiency, but consumers will not sacrifice efficiency without limit."
  - Multi-objective comparison with a Pareto frontier.
    - reward function: a weighted sum of the competing goals creates a single objective function.
      - 1- Positive reward for reaching the target lane within the distance limit.
      - 2- Negative reward for hard brakes and slow velocity.
    - Different weight values are tested: λ in {0.5, 1, 2, 4, 8}.
    - Pareto fronts are approximated by connecting the resulting Pareto points with straight lines.
    - "Conclusions about different algorithms can be reached by comparing the resulting curves generated by those algorithms."
"Efficient Uncertainty-aware Decision-making for Autonomous Vehicles Using Guided Branching"
- [2020] [📝] [🎓 Hong Kong University] - [guided branching]
The Domain-specific Closed-loop Policy Tree (DCP-Tree) provides a guided branching mechanism in the action space. For efficiency, this semantic-level policy tree is updated based on the previous best policy. Each node in the policy tree is a finite-horizon semantic behaviour of the ego car. Starting from the ongoing action, each policy sequence contains at most one change of action in one planning cycle. Compared to MPDM, this offers the possibility of changing the decision within the planning horizon while, like humans, not frequently switching the driving policy back and forth in a single decision cycle. Source.
Decision-making using domain-specific expert knowledge to guide branching, inspired by MPDM. Branching in the action domain is guided by the DCP-Tree. Branching in the intention domain is done by the CFB mechanism, to pick out risky hidden intentions of nearby vehicles. Source.
Authors: Zhang, L., Ding, W., Chen, J., & Shen, S.
- Motivations: propose a simpler alternative to existing online POMDP solvers (DESPOT, ABT, POMCP), with a focus on efficiency.
  - "The key idea is utilizing domain-specific expert knowledge to guide the branching in both (1) action and (2) intention space."
  - The goal is to consider as few branches as possible: the most critical ones, potentially leading to risky outcomes.
- Two related works:
  - "Multipolicy Decision-Making for Autonomous Driving via Changepoint-based Behavior Prediction"
  - "MPDM: Multipolicy decision-making in dynamic, uncertain environments for autonomous driving"
  - MPDM is extended here and used as a benchmark.
    - "MPDM approximates the POMDP process into the (1) closed-loop simulation of predefined (2) semantic-level driving policies (e.g., LC, LK, etc.). The incorporation of domain knowledge greatly accelerates the problem-solving."
      - (1) The exploration of the state space is guided by simple closed-loop controllers, i.e. domain knowledge.
      - (2) Semantic-level policies instead of traditional "state"-level actions such as discretized accelerations or velocities.
    - "One major limitation of MPDM is that the semantic-level policy of the ego vehicle is not allowed to change in the planning horizon."
  - Differences with the original MPDM:
    - 1- The policy of the ego vehicle is allowed to change in the planning horizon, according to the DCP-Tree (see the sketch below).
    - 2- Focused branching is applied to pick out the risky scenarios, even given totally uncertain behaviour predictions, which enhances the safety of the framework.
      - "The CFB mechanism is applied to pick out risky hidden intentions of nearby vehicles and achieves guided branching in intention space."
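A minimal reconstruction of the DCP-Tree branching rule from the description above (my own code, not the authors'): every policy sequence starts from the ongoing semantic action and changes at most once within the planning horizon.

```python
def dcp_tree_sequences(ongoing: str, policies: list, horizon: int) -> list:
    """Enumerate semantic-level policy sequences with at most ONE change of
    action within the planning horizon, starting from the ongoing policy."""
    sequences = [[ongoing] * horizon]            # keep the same policy
    for t in range(1, horizon):                  # switch once, at step t
        for p in policies:
            if p != ongoing:
                sequences.append([ongoing] * t + [p] * (horizon - t))
    return sequences

# Toy usage: 3 semantic policies, horizon of 4 steps.
for seq in dcp_tree_sequences("LK", ["LK", "LCL", "LCR"], horizon=4):
    print(seq)
# 1 + 3*2 = 7 sequences instead of 3^4 = 81 unconstrained ones.
```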
- About the formulation:
  - [hidden part of the state] The intention about lateral behaviours, in {LK, LCL, LCR}.
  - [belief tracker] A "rule-based lightweight belief tracking module" generates a probability distribution over these intentions. Not much detail is given about this module.
  - [transition model] Two driver models: an intelligent driver model and a pure pursuit controller.
  - [action] Longitudinal and lateral semantic-level decisions.
- Some terms:
  - MPDM = multipolicy decision-making.
  - EUDM = efficient uncertainty-aware decision-making.
  - DCP-Tree = domain-specific closed-loop policy tree.
  - CFB = conditional focused branching.
"Autonomous Driving at Intersections: A Critical-Turning-Point Approach for Left Turns"
- [2020] [📝] [🎓 University of Waterloo, Tsinghua University] [🚗 Tencent] - [intention-aware motion planning, ABT]
The ego vehicle updates its belief about the route intention of the oncoming vehicle. At the start, actions share the same pattern. But when the left-turn intention becomes highly likely, an acceleration action is performed (top-left). In the straight-intention case, the future path is blocked by the oncoming traffic (right). Instead of stopping and then waiting at a standstill, the ego vehicle slows down to creep forward, targeting a CTP farther away. Source.
Authors: Shu, K., Yu, H., Chen, X., Chen, L., Wang, Q., Li, L., & Cao, D.
- Motivations:
  - Replicate human-like, efficient behaviour for a left turn at an unsignalized intersection, trading off between safety and conservatism, without explicit hand-written rules:
    - "Before merging into the intersection, the ego vehicle drives into the intersection with high speed. Then it decelerates to a lower speed and creeps forward, considering the potential of collision with the oncoming vehicle while waiting for the oncoming vehicle's intention to become certain. The ego vehicle then performs more confident actions on a proper route when the intention of the oncoming vehicles becomes clear."
- Architecture:
  - 1- [high-level] Path generation to define CTPs. Offline.
  - 2- [low-level] CTP selection and speed planning on the chosen path. Online.
- One term: "critical turning point" (CTP).
  - "The starting points where the vehicle makes hard steering (sharp turning) are identified and extracted as 'turn points'."
  - They are computed based on the critical zone extraction (CZE), which is generated from the road geometry.
  - Path candidates are then generated from the CTPs and sent to the lower-level planner for path selection and speed planning.
  - [Benefits of CTPs]
    - "Our candidate paths planned with CTPs give the ego vehicle an option to keep moving forward to reach the next CTP when one of the paths is blocked by the oncoming traffic."
    - "The proposed method spends about 1.5s less time passing through the intersection than the one that does not use CTPs."
    - "When the oncoming vehicle is driving at a higher speed, the shortest and most aggressive path is chosen, since the waiting time is shorter."
- About the POMDP formulation:
  - [hidden part of the state] Oncoming vehicles have unknown intentions: either straight or left-turn.
    - The other vehicle is assumed to ignore the ego car; the uncertainty is about the chosen route.
    - Other approaches also include the yield reaction of the others, e.g. {stopping, hesitation, normal, aggressive}.
  - [observation] speeds and positions in a Cartesian coordinate system.
    - I think an orientation would also be useful to infer which route is taken + give a meaning to the speed scalar.
  - [action] It is 2-dimensional:
    - 1- (speed planning) The ego acceleration along the current path, in [-4 m/s², 4 m/s²] with a step of 1 m/s².
    - 2- (path / CTP selection) A left-turn Boolean variable which "conveys sharp turn instructions".
      - I understand it as a change in the selected path: where to leave the initial lane.
  - [transition model] All vehicles are assumed to move at constant speed on predefined routes.
  - [solver] Adaptive Belief Tree (ABT).
    - "Instead of trimming the entire policy tree after an action is selected, the ABT algorithm only modifies the parts of the tree that are influenced by the updated belief after executing the selected action."
"Integrating Planning and Interpretable Goal Recognition for Autonomous Driving"
- [2020] [📝] [🎞️] [🎓 University of Edinburgh] [🚗 FiveAI] - [goal recognition, intention-aware motion planning, inverse planning, MCTS]
Example of a scenario (note: left-hand drive) where prediction based on goal recognition can inform the planning. It enables a less conservative behaviour (entering the intersection earlier) while offering interpretability. Source.
The main ideas are to couple prediction and planning, to infer the goals followed by the other vehicles, and to use high-level abstractions of manoeuvres via macro actions. Source.
The ego agent updates its belief about the goal followed by the other vehicles (left). As noted below, the ablation study (right) raises questions about what really offers benefits for the time-efficiency of the driving policy. Source.
Authors: Albrecht, S. V., Brewitt, C., Wilhelm, J., Eiras, F., Dobre, M., & Ramamoorthy, S.
- Motivations:
  - 1- Improve the anticipation ability, and hence the efficiency, of the driving behaviour at urban intersections.
  - 2- Provide intuitive interpretations of the predictions to justify the ego agent's decisions.
  - 3- Keep the computational cost low.
- The main ingredients to achieve that are:
  - Coupling planning and prediction.
  - Inferring the goals (here specifying target locations) of the other vehicles.
  - Using high-level manoeuvres and macro actions.
- Two assumptions:
  - 1- Each vehicle seeks to reach some (unknown) goal location from a set of possible goals, and behaves rationally by driving optimally to achieve its goals.
    - Note: for the rationality part, the definition of the likelihood p(trajectory | goal-i) still allows for a degree of deviation.
  - 2- At any time, each vehicle is executing one manoeuvre among a finite set: lane-follow, lane-change-left/right, turn-left/right, give-way, stop.
- About goad recognition:
- The idea is to recognise the goals of other vehicles, in order to perform rational inverse planning, i.e. made better (better informed) decisions.
-
"We must reason about why – that is, to what end – the vehicle performed its current and past manoeuvres, which will yield clues as to its intended goal."
-
"[Limitation of the optimality assumption] An important future direction is to account for human irrational biases."
- How to plan ego-action? By leveraging recognition and prediction.
- Both
goal probabilities
andtrajectory predictions
are used to inform a Monte Carlo Tree Search (MCTS
) algorithm.
- How to reduce the search depth? Using
macro actions
.-
"To keep the required search depth shallow and hence efficient, both
inverse planning
andMCTS
plan over macro actions." - In hierarchical
RL
,macro actions
are sometimes defined by a tuple <applicability condition
,termination condition
,primitive policy
>. - Here (I am a little bit confused):
- Each manoeuvre also specifies applicability conditions (
lane-change-left
is only applicable if there is a lane in same driving direction on the left of the vehicle) and termination conditions. -
"Macro actions concatenate one or more manoeuvres, [and] automatically set the parameters in manoeuvres [e.g.
driving distance
forlane-follow
orlane-id
s ingive-way
] based on context information (usually road layout)". - But no underlying primitive policy is used:
-
"
Macro actions
as used in this work do not define a hierarchy of decomposable actions; they simply define sequences of actions."
- By reasoning on a high level of abstraction,
macro actions
offer a temporal abstraction that relieves the planner and ensures a low computational cost (a minimal sketch of such macro actions follows).
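A minimal sketch, under my reading above, of manoeuvres with applicability/termination conditions concatenated into a macro action. All names are hypothetical, and, as in the paper, no primitive policy is attached:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Manoeuvre:
    name: str
    applicable: Callable[[dict], bool]   # e.g. "a lane exists on the left"
    terminated: Callable[[dict], bool]   # e.g. "lane change completed"

@dataclass
class MacroAction:
    # A macro action simply concatenates manoeuvres; parameters would be
    # filled in from road-layout context in the real system.
    manoeuvres: List[Manoeuvre]

    def applicable(self, state: dict) -> bool:
        # A macro is applicable if its first manoeuvre is applicable.
        return self.manoeuvres[0].applicable(state)

lane_change_left = Manoeuvre(
    "lane-change-left",
    applicable=lambda s: s["has_left_lane"],
    terminated=lambda s: s["on_target_lane"],
)
exit_left = MacroAction([lane_change_left])
print(exit_left.applicable({"has_left_lane": True, "on_target_lane": False}))
```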
-
- Main steps of the approach:
1-
Goal Generation: Only consider locations that are reachable.2-
Manoeuvre Detection: Compute the posterior probabilities over each goal for each manoeuvre.3-
Inverse Planning: UsingA*
search over macro actions, derive an optimal plan.4-
Trajectory Prediction: Predict multiple plausible trajectories for a given vehicle and goal, rather than a single optimal trajectory.
- Hence accounting for the multi-modal nature of the future prediction: given the same context, the future may vary.
- It assumes that trajectories which are closer to optimal are more likely (a minimal sketch of this idea follows).
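A minimal sketch of such a Boltzmann-rational goal posterior, with made-up goals, costs and prior (not the paper's exact likelihood): goals whose optimal plan better explains the observed trajectory get a higher posterior, while the softmax tolerates some deviation from strict optimality.

```python
import math

def goal_posterior(observed_cost, optimal_cost, prior, beta=1.0):
    # observed_cost[g]: cost of the observed trajectory completed toward goal g
    # optimal_cost[g]:  cost of the optimal trajectory to goal g
    weights = {
        g: prior[g] * math.exp(-beta * (observed_cost[g] - optimal_cost[g]))
        for g in prior
    }
    z = sum(weights.values())
    return {g: w / z for g, w in weights.items()}

prior = {"exit-right": 0.5, "straight": 0.5}   # uniform prior over goals
posterior = goal_posterior(
    observed_cost={"exit-right": 4.2, "straight": 7.0},
    optimal_cost={"exit-right": 4.0, "straight": 4.5},
    prior=prior,
)
print(posterior)  # the goal that better explains the observation dominates
```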
- About
closed-loop
/open-loop
forward simulation:-
"The ego vehicle’s motion always uses closed-loop mode, while other vehicles can be simulated in either closed-loop or open-loop mode."
closed-loop
simulation: it uses a combination of proportional control and adaptive cruise control (ACC
), based onIDM
.open-loop
simulation: no automatic distance keeping is used.
- The vehicle's position and velocity are directly set as specified in the trajectory.
-
- About the experiment settings:
-
"For each [of the four] scenario, we generate
100
instances with randomly offset initial longitudinal positions (offset ∼ [−10
,+10
]m
) and initial speed sampled from range [5
,10
]m/s
for each vehicle including ego vehicle." - Frequency of
MCTS
:1 Hz
. - Number of simulations:
D
=
30
. - Maximum search depth:
d-max
=
5
. - Prior probabilities for achievable goals:
uniform
.
-
- Benefits of the approach:
- One of the baselines implements
MCTS
without goal recognition: the prediction instead assumes constant-velocity lane-following.
- This candidate suffers from a limited prediction horizon, but still performs well in terms of driving time required to complete the scenario.
- Another baseline also uses
CV
models, together with a "conservative give-way manoeuvre" which waits until all oncoming vehicles on priority lanes have passed.- Hence no
MCTS
. - This one is not able to infer goals and anticipate behaviours, preventing it from safely entering the road earlier, for instance when a car is detected to be exiting the roundabout.
- Based on this ablation study, it is not clear to me what improves the efficiency of the driving policy:
1-
Is it the consideration of goals?2-
Or just the coupling ofprediction
+planning
which gets rid of the conservative "wait until clear" condition?
"Point-Based Methods for Model Checking in Partially Observable Markov Decision Processes"
-
[
2020
] [📝] [] [ 🎓Stanford University
] [ 🚗Honda
] -
[
probabilistic guarantees
,safety checkers
,POMDP
,SARSOP
]
Click to expand
POMDP quantitative model checker. Three parts are followed: 1- Creation of a product POMDP . 2- Reduction to reachability (From LTL Satisfaction to Reachability). 3- Solving the reachability problem. Source: author provided - taken during the IV19 SIPD workshop - see my report here. |
A reachability problem can be interpreted as a planning problem where the goal is to reach the set B . In LTL terminology, the temporal operator F means 'eventually'. Source. |
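As a side note, a minimal sketch of this reduction on a toy, fully observable MDP. All states, actions and probabilities below are made up, not from the paper; the point is that the maximum probability of reaching `B` equals the optimal value of a reward-maximization problem where entering `B` pays `1` once and `B` (like failure) is absorbing:

```python
STATES = ["s0", "s1", "bad", "goal"]
ACTIONS = ["a0", "a1"]
B = {"goal"}

# T[s][a] = list of (next_state, probability); "bad"/"goal" are absorbing.
T = {
    "s0": {"a0": [("s1", 0.8), ("bad", 0.2)], "a1": [("s0", 1.0)]},
    "s1": {"a0": [("goal", 0.9), ("bad", 0.1)], "a1": [("s0", 1.0)]},
}

def max_reach_probability(n_iter=100):
    v = {s: (1.0 if s in B else 0.0) for s in STATES}  # terminal values
    for _ in range(n_iter):
        for s in T:  # back up only the non-absorbing states
            v[s] = max(sum(p * v[sp] for sp, p in T[s][a]) for a in ACTIONS)
    return v

print(max_reach_probability())  # v["s0"] = max P(eventually reach goal | s0)
```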
Authors: Bouton, M., Tumova, J., & Kochenderfer, M. J.
-
Motivations:
1-
Synthesize policies that satisfy a linear temporal logic (LTL
) formula in aPOMDP
, i.e. makePOMDP
policies exhibit guarantees on their performance.- "Policy synthesis" means that some good policy is derived, as opposed to just the evaluation of a given policy (computation of the probability of satisfaction for an objective).
2-
Scale to larger problems than previousbelief
-state
techniques (note that only finite discretestate
spaces are considered here).- For instance,
Norman et al.
addressed the problem ofbelief
-state
planning withLTL
specifications by discretizing the belief space and formulating anMDP
over this space. - But when the
state
space has more than a few dimensions, discretizing thebelief
space becomes intractable.
- For instance,
-
About model checking:
- 1-
Quantitative
model checking: Compute the maximum probability of satisfying a desired logical formula (and compute the associatedbelief
-state
policy). - 2-
Qualitative
model checking: Find a policy satisfying the formula with probability1
. - It makes me think of the strict action masking methods and masking approaches that consider statistical model checking, such as probabilistic reachable sets.
- 1-
-
About
LTL
formulas:LTL
is used as a language to specify the objective of the problem.- Examples:
¬A
U
B
means "avoid stateA
and reach stateB
" (safe-reachability objective).G
¬C
means "the agent must never visit stateC
" (the temporal operatorG
means "globally").
-
About "reachability problems":
-
"[the goal is to] Compute the maximum probability of reaching a given set of
state
s."
-
-
About "labelling functions" for states of the
POMDP
in the context ofLTL
formulas:- The
labels
are atomic propositions that evaluate to true or false at a givenstate
. - A labelling function maps each
state
of the environment to the set of atomic propositions holding in thatstate
. -
"We do not assume that the labels constituting the
LTL
formula are observable. The agent should infer the labels from the observations." - Concretely, the agent cannot observe whether it has reached an end component or not, but the
belief state
characterizes the confidence on whether or not it is in an end component. Therefore, it maintains a belief on both thestate
of the environment and thestate
of the automaton.
- The
-
One major idea: formulate "reachability problems" (quantitative model checking problem) as reward maximization problems.
-
"We show that the problem of finding a policy maximizing the satisfaction of the objective can be formulated as a reward maximization problem. This consideration allows us to benefit from efficient approximate
POMDP
solvers, such asSARSOP
." - In other words, a reachability problem can be interpreted as a planning problem where the goal is to reach the set
B
(the set of states where the propositional formula expressed byB
holds true). - For instance, the reward function gives
1
ifs
inB
.
-
-
Steps of the approach:
1-
Creation of aproduct POMDP
.-
"We define a new
POMDP
such that solving the original quantitative model checking problem reduces to a reachability problem in this model." - The new
POMDP
is calledproduct POMDP
:- The
state
space is the Cartesian product of the state space of the originalPOMDP
and the deterministic rabin automaton (DRA
, representing theLTL
formula).-
"The construction of the
product POMDP
can be interpreted as a principled way to augment thestate
space in order to account for temporal objective." -
"For formulas involving only a single until (
U
) or eventually (F
) temporal operators, the problem can be directly expressed as a reachability problem and does not require a state space augmentation".
-
- A new
transition
function is also defined, using the fact that anyLTL
formula can be represented by a deterministic Rabin automaton (resulting in a finite state machine).
- The
-
2-
Reduction to reachability (i.e. go fromLTL
satisfaction to reachability).- Solving the original quantitative model checking problem reduces to a reachability problem in this
product POMDP
model.- Reaching a
state
in this set guarantees the satisfaction of the formula.
- Reaching a
- What is to be done:
First
find the "end components".Then
identify the successstate
s.
- The computation of the maximal end components is one of the two bottlenecks of the presented approach (together with the choice of the planning algorithms).
- Solving the original quantitative model checking problem reduces to a reachability problem in this
3-
Solving the reachability problem.- Here, the
state
uncertainty will play a role (distinguishingMDP
s fromPOMDP
s).
- Here, the
-
About the solver used:
SARSOP
.- The idea is to restrict the
policy
space (hence an approximation), usingalpha vector
s.alpha vector
s are|state space|
-dimensional vectors defining a linear function over thebelief
space.- They are used to represent both the
policy
and thevalue function
.
- Hence, they can serve to approximate the quantitative model checking problem and not only the policy synthesis problem (a small numerical sketch follows).
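A minimal numerical sketch of alpha vectors; sizes and numbers are illustrative. The value function is represented as a maximum over linear functions of the belief, and each alpha vector carries an action:

```python
import numpy as np

# Each alpha vector is |state space|-dimensional; the value of a belief b is
# the maximum dot product over the set, which is piecewise linear and convex.
alphas = np.array([
    [1.0, 0.0],   # value if we commit to the plan behind action "brake"
    [0.2, 0.8],   # value if we commit to the plan behind action "cross"
])
actions = ["brake", "cross"]

def value_and_action(belief):
    scores = alphas @ belief          # linear in the belief
    best = int(np.argmax(scores))
    return scores[best], actions[best]

b = np.array([0.6, 0.4])  # belief over two hidden states
print(value_and_action(b))
```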
- About point-based value iteration (
PBVI
) algorithms.- This is a family of
POMDP
solvers that involves applying a Bellman backup (hence "value iteration") to a set ofalpha vectors
in order to approximate the optimal value function. - Why it is said "point-based"?
- Vanilla value iteration (
VI
) algorithms cannot scale forPOMDP
s. - In
PBVI
algorithms, thebelief
space is sampled. - An
alpha vector
associated to each belief point is then computed to approximate the value function at that point.
- Vanilla value iteration (
- This is a family of
- What is the specificity of
SARSOP
?- It stands for "Successive Approximations of the Reachable Space under Optimal Policies".
- It relies on a tree search to explore the
belief
space.- It maintains upper and lower bounds on the value function, which are used to guide the search close to optimal trajectories (i.e. only exploring relevant regions).
- In other words, it focuses on regions that can be reached from the initial belief point under optimality conditions.
- This makes
SARSOP
one of the most scalable offlinePOMDP
planners.
- The idea is to restrict the
-
Another major idea: use the upper and lower bounds of
SARSOP
to estimate the probability of satisfaction of theLTL
formula.- In
PBVI
algorithms, convergence guarantees are offered, specified in upper and lower bound on the value function (e.g. one can control the convergence of the value function by controlling the depth of the tree inSARSOP
). -
"For a given precision parameter, we can directly translate the bounds on the value function in the
product POMDP
in terms of probability of success for our problem of quantitative model checking". - The user can specify the precision parameter.
- In
"Self-Driving Under Uncertainty - aa228 Course Project"
-
[
2020
] [📝] [] [ 🎓Stanford University
] -
[
POMCPOW
,JuliaPOMDP
]
Click to expand
On the graphical model, it can be seen that only the position of the obstacle b-xy is not fully observable. One contribution is the use of a second reward model to help. It first encourages high speed. It also linearly rewards approaching the goal, to address the limitation of depth-bounded online tree search methods. Source. |
Author: Shved, P.
- A university project completed for the aa228 - Decision Making under Uncertainty course at Stanford.
- The scenario is a simple straight road with a single moving obstacle (whose behaviour is non-adversarial and does not change in response to the actions taken by the ego-agent).
- Results were not very brilliant, but the author proposes a good analysis:
-
"We find that with certain fine-tuning,
POMCPOW
produces driving decisions that result in mission completion albeit sub-optimally. However, for more complicated scenarios, the quality of planning decisions degrades and the agent gets stuck without completing the mission."
-
- Initial motivation:
- Evaluate the robustness of the planner, i.e. study its performance while varying the perception precision.
- That means the
observation
model is stochastic (Gaussian) while thetransition
model is kept deterministic.
- Some elements about the
POMDP
formulation:- The
state
space is continuous and not discretized. - The task is episodic: A
state
is terminal if1-
(either) The end of the road is reached.2-
(or) A collision occurs.
- The
- One idea: use a second
reward
model.1-
The first reward model evaluates the driving behaviours: rewarding reaching the goal, penalizing collisions and encouraging progress.-
"However, the solvers used were unable to attain good results on that model directly. We augment this reward model and use model
R2
."
-
2-
The second was added after testing the online search and serves two purposes:-
"In the challenge model, the large positive reward for mission completion is spread out across the state space, and we reward high velocity."
- Concretely, it adds a term proportional to
r-mission-completion
*dist-to-goal
.
-
- About the
POMCPOW
solver:-
"It is a fixed-depth tree search algorithm that uses Double Progressive Widening to explore continuous
actions
space, and observation binning to constrain the continuousobservation
space." - This is an online solver, which may not be the best option for this simple scenario:
-
"Inability to learn from prior experience and produce an offline policy which results in unnecessary computation in simple situations."
-
-
- Another idea: use two rollout policies for the tree search.
-
"The default priors (e.g. random rollouts) were not sufficient to achieve high rewards."
-
"We addressed this problem by using two rollout policies:
always brake
andalways maintain
, comparing the expectedrewards
, and choosing theaction
produced by the highest policy."
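A minimal sketch of this two-rollout-policy trick, with a toy generative model standing in for the real simulator (rewards, names and dynamics are made up):

```python
import random

def simulate(state, action):
    # Toy generative model: "always_maintain" progresses but risks a crash.
    if action == "always_maintain":
        return state, 1.0 if random.random() > 0.1 else -100.0
    return state, 0.0  # "always_brake": safe but no progress

def rollout_return(action, state, depth=10):
    # Accumulate reward by repeatedly applying the policy's fixed action.
    return sum(simulate(state, action)[1] for _ in range(depth))

def best_rollout_action(state):
    # Compare the expected returns of the two fixed rollout policies and
    # keep the action of the better one, instead of a random rollout.
    candidates = ("always_brake", "always_maintain")
    return max(candidates, key=lambda a: rollout_return(a, state))

print(best_rollout_action(state={"speed": 5.0}))
```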
-
- Findings. Key ingredients for the method to work:
1-
Tuning of the reward function.2-
Carefully craft the rollout policy for the tree search.3-
Spreading the rewards across the world.- It all results in a lot of manual tuning:
-
"We find that the agent behavior depends more on the domain knowledge embedded into the policy algorithm that on the sensor quality, but detailed exploration of this effect is left to future work."
-
"Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model"
-
[
combine planning + learning
,model-based RL
,internal representations
]
Click to expand
As for the variants of alphaGo with self-play (no supervised learning from human demonstrations), two nets are used to guide the MCTS search: policy and value function . In return, they can be trained using the observed trajectories and the MCTS decisions. The main differences with the alphaGo variants are that (1) the transition model is not available. Instead it is learnt (g function). And (2) the search is performed in the hidden state space, not in the state space, to be efficient when working on large states such as images. The hidden state should capture only the aspects that are relevant for planning in the MCTS , i.e. to make good estimations of the reward , value and policy . Source. |
Whereas AlphaZero has only one neural network (prediction ), MuZero needs three (prediction , dynamics , representation ). In AlphaZero , moving between states in the MCTS tree is simply a case of asking the environment. MuZero doesn't have this luxury, so it needs to build its own dynamics model. Source1 Source2. |
Authors: Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., … Silver, D.
-
Why it can be useful for
autonomous driving
:- The underlying
dynamics
is unknown and difficult to model. - Assuming reaction models like
IDM
/rule-based
may not always be valid. - The
state
space can be large (end-to-end
) or if manually defined, not all its components may be relevant.
- The underlying
-
Motivations:
1-
Combine abilities ofplanning
algorithms (good atchess
andGo
) andmodel-free RL
algorithms (good at visually complexAtari
games).- Compared to
chess
, thedynamics
of visual games is hard to predict. - In
chess
andGo
, it is hard to say what the opponent will do at each step. But all the allowedactions
can be listed in a tree whosetransitions
are deterministic.
- Compared to
2-
Do not try to exactly reconstruct thestates
with the learntdynamics
model.Planning
is not done instate
space. Rather in ahidden state
space.- As explained in this brilliant video:
- You only predict what you need (
hidden state
) to obtain thevalues
,reward
andpolicy
needed for planning in theMCTS
- You are not learning with a simulator, but rather with a learnt
dynamics
model on thehidden state
.
- You only predict what you need (
-
About model-based
RL
:-
"
Model-based RL
aims to address this issue by first learning a model of the environment'sdynamics
, and thenplanning
with respect to the learned model." - Difficulty to learn the exact
dynamics
:-
"This tripartite separation between
representation learning
,model learning
, andplanning
is potentially problematic since the agent is not able to optimize itsrepresentation
ormodel
for the purpose of effectiveplanning
, so that, for example modeling errors may compound duringplanning
."
-
-
-
Similarities to
AlphaGo Zero
'sMCTS
:- The improved
policy
targets are generated by anMCTS
search.- These targets are used during training to make the
policy
net imitate theMCTS
decisions.
- These targets are used during training to make the
- The improved
value
targets are generated by playing the game orMDP
.- These observed
returns
are used to update thevalue
net.
- These observed
- The improved
-
Differences to
AlphaGo Zero
'sMCTS
:- No deterministic
transition
/dynamics
function is used.- Instead, one is learnt.
planning
is not performed in thestate
space (e.g. images for atari), but rather in somehidden state
space.- Also:
-
"We allow for long episodes with discounting and intermediate rewards by bootstrapping
n
steps into the future from the search value."
-
- No deterministic
-
About the
dynamics
/transition
function:- Input: a previous hidden state
sk−1
and a candidate actionak
. - Output: an immediate reward
rk
and a new hidden statesk
. - It looks similar to common model-based
RL
, except that it works withhidden states
instead ofstates
. - Hence, the task is not to predict the next
state
images.-
"There is no requirement for its
transition
model to match realstates
in the environment." -
"Planning at pixel-level granularity is not computationally tractable in large scale problems."
-
- Input: a previous hidden state
-
About
hidden states
:-
"The
hidden states
are free to representstate
in whatever way is relevant to predicting current and futurevalues
andpolicies
." - In other words, the agent can invent, internally, the rules or dynamics that lead to most accurate
planning
. -
"Unlike traditional approaches to
model-based RL
, this internal statesk
has no semantics of environmentstate
attached to it – it is simply thehidden state
of the overall model, and its sole purpose is to accurately predict relevant, future quantities:policies
,values
, andrewards
."
-
-
Three functions are learnt to enable
planning
in theMCTS
:- A
representation
functionh
.- Convert a stack of the last
32
observations
into an initialhidden state
.
- Convert a stack of the last
- A
dynamics
functiong
.- Estimate the next
hidden state
and the immediatereward
.
- Estimate the next
- A
prediction
functionf
.- Estimate the
value
and thepolicy
distribution approximations to guide the search. - Here, these
value
andpolicy
nets are taking thehidden state
as an input.
- Estimate the
- The model resulting from their concatenation is noted
μ
=mu
, hence the namemuZero
.
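For intuition only, a minimal PyTorch sketch of the three functions with toy linear layers; the sizes and single-layer architecture are illustrative assumptions, not the paper's networks:

```python
import torch
import torch.nn as nn

HIDDEN, N_ACTIONS, OBS = 64, 4, 96

h = nn.Linear(OBS, HIDDEN)                      # representation: obs -> s0
g = nn.Linear(HIDDEN + N_ACTIONS, HIDDEN + 1)   # dynamics: (s, a) -> (s', r)
f = nn.Linear(HIDDEN, N_ACTIONS + 1)            # prediction: s -> (policy, v)

def plan_one_step(obs, action_id):
    s = h(obs)                                   # initial hidden state
    a = torch.zeros(N_ACTIONS)
    a[action_id] = 1.0                           # one-hot encode the action
    out = g(torch.cat([s, a]))
    s_next, reward = out[:-1], out[-1]           # roll the hidden state forward
    pv = f(s_next)
    return reward, pv[:-1], pv[-1]               # reward, policy logits, value

r, logits, v = plan_one_step(torch.randn(OBS), action_id=1)
```

Note that nothing forces `s` to reconstruct the observation: it only has to support good predictions of reward, policy and value.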
-
"Mirrored
MDP
":-
"The main idea of these methods is to construct an
abstract MDP
model such thatplanning
in theabstract MDP
is equivalent toplanning
in the real environment." -
"It mirrors the structure of an
MDP
model that computes the expectedreward
andstate
transition for a givenstate
andaction
." - When training, the sampled trajectory is aligned into the hidden space and the
dynamics
model is used to roll forward.- The
actions
taken by theMCTS
in the trajectories serve as a ground truth for the learntpolicy
model. - The observed immediate
rewards
as ground truths for the learntreward
function. - The observed sum of
discounted rewards
as a ground truth for the learntvalue
function (TD-learning
,Monte Carlo
orn
-step).
- The
-
-
As noted by this brilliant post,
MuZero
shares ideas withWorld Models
(D. Ha & Schmidhuber, 2018).- Both create internal representations of the environment that exist only within the agent and are used to imagine possible futures in order to train a model to achieve a goal.
1-
Embedding the current observation:MuZero
: therepresentation
network.WorldModels
: avariational autoencoder
.
2-
Modelling the imagined environment and itstransition
function:MuZero
: thedynamics
network.WorldModels
: a recurrent neural network.
3-
Selecting anaction
:MuZero
:MCTS
guided by a learntprediction
network.WorldModels
: an evolutionary process to evolve an optimal action controller.
"Context and Intention Aware Planning for Urban Driving"
-
[
2019
] [📝] [🎞️] [ 🎓National University of Singapore
,MIT
] -
[
interaction-aware planning
,LSTM-prediction
,DESPOT
]
Click to expand
Modular and hierarchical architecture. The planner is road context - and intention -aware. Source. |
Scenario showing the benefit of integrating intention inference into planning: after inferring the exo-vehicles’ intentions of driving to the right lane, the ego-car immediately changes to the left lane which saves travelling time and avoids near collisions. Source. |
The authors tried their approach on real scenarios, which is quite rare in the field! Contrary to the simulation, measurements are not optimal: the inaccurate perception module fails to assign the second and third cars to their correct lanes. In addition, the turn signal of the first car is ignored, which could have led to an accident. It would be relevant to consider this turn signal to infer intention . A second idea would be to consider occlusion. Source. |
Authors: Meghjani, M., Luo, Y., Ho, Q. H., Cai, P., Verma, S., Rus, D., & Hsu, D.
-
Motivation:
- Show the benefit, for prediction, planning and decision making in urban driving, of using:
1-
Road contextual information.- This reduces the uncertainties in
intention
and trajectory predictions of exo-vehicles. - It also reduces the computation cost of the
POMDP
planner for the ego-vehicle: the search is assisted by pruning invalid actions and shaping the rewards.
- This reduces the uncertainties in
2-
Intention inference forbehaviour
andtrajectory
prediction.- The interaction with multiple exo-vehicles is considered by the planner.
3-
Long-term planning.
-
[About the scenarios] "We argue that the
lane merging
problem is much harder than the intersection case, because exo-vehicles have a lot more freedom. [...] Long-term interactions are often required."
- Show the benefit, for prediction, planning and decision making in urban driving, of using:
-
About intention inference and trajectory prediction.
- The context-aware prediction model decouples
intention
andtrajectory
predictions:1-
Theintention
is predicted by a neural network (mentioned below).2-
Thetrajectory
of predicted intention is obtained based on polynomial fitting and extrapolating the real-time vehicle state.- The
time-to-collision
model (TTC
) is used to predict the longitudinal speed. And hence embed interaction in the prediction.
- The
- This video describes the context-aware prediction module.
- This generative model is used in the
POMDP
transition function, for forward simulations.
- The context-aware prediction model decouples
-
About the decision making hierarchical architecture.
- Input: observed trajectories of exo-vehicles.
- Output: high-level lateral
action
for ego-vehicle. - Three components:
1-
Road context database.- The context include the
number of lanes
, thedirection
of the lanes, thelane width
, and thepositions
of thestart
andend
points of each lane.
- The context include the
2-
LSTM
intention predictor.-
"We formalize the
intention
as the high-level action, i.e., the action oflane-keeping
,left-lane-changing
, orright-lane-changing
." - The input includes:
changes
in lateral and longitudinalpose
, flags indicating ifright lane
andleft lane
exist, as well aslateral distance
from the centre of the lane. - The output of the net is a
belief
, i.e., probability distribution, over threeintention
classes:lane-keeping
,right lane-changing
, andleft lane changing
.- The use of a net contrasts with particle filters for
belief
tracking, which are prone to computationally expensive Bayesian belief updates.
- The use of a net contrasts with particle filters for
-
3-
POMDP
high-level planner.- It determines long-term high-level
action
s of the ego-vehicle under uncertainties of exo-vehicles' intentions.-
"The
belief
tree search takes as input the currentbelief
and thestate
of all vehicles and performs Monte Carlo simulations for the uncertain future to provide the largest rewarding high-levelaction
."
-
-
[About the solver] "The key idea of
DESPOT
is to search abelief
tree underK
sampled scenarios only, which greatly reduces computational complexity, making it an efficient solver for ourPOMDP
model."
- It determines long-term high-level
-
About the
POMDP
formulation.state
space:- The
intention
: hidden variables. - The
road contextual information
. - The
pose
(x
,y
,θ
) of each vehicle. - A
4
-time-step history of the pastposes
, representing1s
.
- The
action
space:- {
LANE-KEEP
,LEFT-LANE-CHANGE
,RIGHT-LANE-CHANGE
} for the next time-step. -
"We further prune the forbidden
actions
in different lanes with the help of road contextual information."
- {
transition
function:- The above-mentioned trajectory predictor is used to predict its next-step
pose
, given itsintention
and theroad contextual information
. - A Gaussian noise is added (no value reported).
- The above-mentioned trajectory predictor is used to predict its next-step
-
About the baselines:
1-
Reactive controller.- It reacts based on the comparison between
headway distances
with adistance threshold
.
- It reacts based on the comparison between
2-
Greedy controller.- It chooses to drive in the lane that is shortest to the destination lane at each time step regardless of exo-vehicles.
3-
SimMobilityST
.- A rule-based algorithm used in
SimMobility
simulator.
- A rule-based algorithm used in
-
About the criteria for performance comparison:
- For safety: the
collision rate
. - For efficiency: the
success rate
andtravel time
. - For smoothness: the number of
lane changes
per 100 meters. - A timeout is also implemented.
- For safety: the
-
Limitations:
1-
The intention prediction module considers the exo-vehicles independently, ignoring the influence of interactions between them.2-
No feedback loop: The high-levelaction
s provided by the planner are decoupled from the low-level control. It is possible that low-level controller cannot execute the high-levelaction
given by the planner.- In particular, the planner suffers from decision switches between
lane-change
andkeep-lane
. Unstable high-level commands may be difficult for the low-level controller to implement.
- In particular, the planner suffers from decision switches between
"Belief State Planning for Autonomous Driving: Planning with Interaction, Uncertain Prediction and Uncertain Perception"
-
[
2019
] [📝] [ 🎓KIT
] [ 🚗BMW
] -
[
ABT
,information gathering
,occlusion
]
Click to expand
Various sources of uncertainty and how to model the scene when dealing with occlusion. The variable g is a Boolean, indicating whether there is a car (g=1 ) in the occluded area. It is estimated by the belief tracker . Ψ denotes the edge of the field of view on the path of a phantom vehicle. It is part of the observation and used for the transition model. The idea is to sample if the phantom vehicle is detected or not. The probability of this sampling is proportional to the revealed occluded area during the transition from state (t ) to state (t+1 ). I admit I did not fully understand. Source. |
Hierarchy: The behaviour planner (A* -based or POMDP -based) generates an optimal plan first, but under different optimization criteria than the trajectory planner. The result of the behaviour planner is a policy , not a trajectory! Different plans can be retrieved from it, depending on the scene evolution. In a second step, the most probable trajectory in the generated policy of the POMDP is retrieved and optimized by a local trajectory planning algorithm on a shorter horizon. Source. |
The rollout stage of the MCTS uses a deterministic A* that solves a simplified (deterministic) problem, followed by constant velocity rollouts. Thus, new belief nodes can be quickly initialized with an estimated value (heuristic). Note that observations (measurements) come at 1Hz . In between, the tree is ''recycled'' and produces 4 reference trajectories . Right: observation clustering. If a new observation arrives, the solver tries to match it to one of the existing observation clusters. Source. |
Top: prediction , manoeuvre selection and trajectory planning are done separately. It assumes that the probabilistic future behaviour of the other agents is independent of the (future) ego-decisions, which may work well for simple scenarios. Bottom: In MCTS , an external prediction algorithm is not needed as the other agents are simulated stepwise ahead as part of a forward simulation. One challenge is to define these transition models though. Bottom-right: considering possible future observation s (measurements ) leads to a closed-loop policy planner that respects not only the current belief state but also the most likely future scenarios. This enables less-conservative behaviours, for instance by postponing decisions until more information is collected. Source. |
The deterministic formulation. It is solved with an A* graph search on a receding horizon. Three types of Event are defined to populate the s -t cost map. Source. |
The stochastic formulation. Here for the intersection scenario. The belief tracker uses observations to estimate the current route (hidden intention ) followed by the other vehicles and maintain belief about the possible futures. This leads to less conservative behaviours. Source. |
The ego car maintains multiple hypotheses wrt. the path the other car is following and prepare multiple plans (in a policy ) to react accordingly. The observations are aggregated over time to estimate the probability of each path (belief tracking ). Source. |
Author: Hubmann, C.
-
PhD Thesis.
-
One idea: combine
MCTS
with a deterministicA*
roll-out heuristic for fast convergence to the optimal policy. -
Motivations:
1-
Uncertainty-aware planner.- Since the current
state
is not fully known, abelief
over state is maintained fromobservations
. It is described byb
(s
), i.e. the probability of being in a certain states
.
- Since the current
2-
Interaction-aware planner.-
"Modeling intertwined
prediction
andplanning
allows to consider the reaction of other agents to the trajectory of the autonomous car."
-
3-
"Globally optimal", [complete], anytime and online planner.-
[Real time capability] "This is possible by extending state of the art solvers with domain specific heuristics which allows to focus on promising branches in an otherwise intractable graph search."
-
4-
Rely neither on manually designed logic rules, nor on hand-selectedmanoeuvres
.- This would become infeasible in complex urban environments.
5-
Avoid over-conservatism of rule-based planners.- The idea is to consider possible future
observation
s (measurements
) during planning in this sequential decision-making problem. - This leads to a
closed-loop
planner which generates apolicy
over an uncertainbelief
space, offering:- The ability to postpone (conservative) decisions, waiting for more information to be gathered.
- The policy contains reactive plans for possible future
observations
, i.e. measurements of the uncertain behaviour of the other agents.
- The policy contains reactive plans for possible future
- The ability to take
action
s in order to reduce uncertainty, i.e. activeinformation gathering
.- E.g. step laterally off the predefined path to increase the
field of view
in occluded scenarios (lateral exploration
).
- E.g. step laterally off the predefined path to increase the
- The ability to postpone (conservative) decisions, waiting for more information to be gathered.
-
"By considering possible future
observations
explicitly, the algorithm is able to predict in what ways the currentbelief
state may change in the future. This enables the postponing of decisions, such as merging before or after another vehicle, as the algorithm is able to predict that futureobservations
will lead to a less uncertain prediction." -
"This behavior (also known as
information gathering
) is the result of the policy because theobservation
model has simulated, that the next measurements will lead to a less uncertainbelief
state. Because of theobservation
model, it can even infer at what point in time thebelief
becomes less uncertain and approach the intersection accordingly."
- The idea is to consider possible future
-
About
manoeuvres
.-
"While a precise definition of a
maneuver
does not exist, it is often compared to the mathematical concept ofhomotopies
. Two continuous trajectories are in the samehomotopy
class if a continuous, collision-free projection exists that transforms one trajectory to the other one. For example, two different trajectories, one overtaking an obstacle on the right side and one on the left side lie in differenthomotopies
." - Many works define beforehand a set of
manoeuvres
. For each, trajectories are planned and then evaluated with a cost function.- The author wants to avoid this a priori enumeration of
manoeuvres
which constrains the solution space and may prevent global and optimal behaviours. -
"While the approach of only considering a limited amount of predefined
maneuvers
is feasible on highways, this may become intractable in urban environments due to the high amount of varying topological situations and corresponding maneuver possibilities."
- The author wants to avoid this a priori enumeration of
-
-
What types of uncertainty?
- Uncertain perception:
1-
Noisy sensor measurements2-
Occlusions
- Uncertain prediction:
3-
Unknown intentions of other drivers4-
Unknown driver models for other drivers, including interaction capabilities
- Uncertain perception:
-
How to plan under uncertainty?
-
"The problem of optimally considering the uncertainty of future
states
can be addressed by planning in the space ofpolicies
instead of in the space oftrajectories
." - A
policy
is generated over abelief
state and optimizes the expectedreward
, starting from an initialbelief
state.
-
-
About
open-loop
/close-loop
planners.1-
open-loop
motion planning algorithms do not consider future (possible)measurements
which arrive during the execution of the planned motion.-
"Respecting every possible prediction leads to safe but potentially conservative behavior. Another possibility is to consider only the most likely prediction(s)."
-
"The
open-loop
planner has to slow down immediately as it is not able to incorporate futureobservations
in theplanning
phase. This means, that the planner reacts to both possible future situations simultaneously."
-
2-
closed-loop
motion planning, on the other hand, allows the planner to consider possible future observations in the planning stage.
"The policy contains
2
plans about how to react to theobservation
which arrives att=1
." -
"On the contrary, the
POMDP
planner is able to reason about both possible scenarios. This results in a policy that postpones the decision ofcrossing
vs.braking
to a future point in time when more observations have been recorded."
-
-
How to make interaction-aware planning? By not separating
planning
andprediction
.-
"A common approach is to separate
prediction
andplanning
. In this case, all the trajectories of the other agents are predicted first. Given the predicted trajectories, a maneuver is selected for the autonomous car (by the behavior layer) and a correspondent trajectory is planned by thetrajectory
planner." - A separation would assume that the probabilistic future behaviour of the other agents is independent of the (future) ego-decisions.
- It would work for simple scenarios (e.g. highway).
- But not for urban situations where interaction must be explicitly considered.
- How to deal with the uncertainty in prediction, when separation is used?
- Frequent replanning is one option. Especially when the prediction is unimodal.
- In
MCTS
, an external prediction algorithm is not needed as the other agents are simulated stepwise ahead as part of a forward simulation. One challenge is to define thesetransition
models, usually one for each of the different possible manoeuvres. For instanceIDM
.
-
-
Making the problem simpler: the "path-velocity decomposition" assumption.
- It is made for most scenarios.
- Except for the strongly coupled problems
lane change
andocclusions
which require combined longitudinal and lateral optimization.
-
What if the
transition
model is deterministic and thestate
fully observable?- Then the planning problem is simpler and can be represented as a graph:
vertices
=state
sedges
=action
s
- An
A*
graph search on a receding horizon derives the longitudinalacceleration
.- The path to follow is given by another module, based on variational methods, i.e. local convex optimization around a reference path.
- The
A*
algorithm uses a heuristic to speed up the graph search by truncating non-promising branches early.- Here the heuristic is based on the idea of Inevitable Collision States (
ICS
). -
"When a new
state
is generated, it is tested for being anICS
. If this is the case, the remaining estimated costs are at least the collision costs."
- Here the heuristic is based on the idea of Inevitable Collision States (
- The traffic rules and the predicted behaviours of other agents are represented into a spatio-temporal cost map.
- Only the most likely prediction is considered for each agent.
- The uncertainty in the (unimodal) prediction is addressed by frequent replanning (
10Hz
).
- Information of possible future
observations
is ignored. Hence open-loop.
- Then the planning problem is simpler and can be represented as a graph:
-
About the
cost
definition for theA*
search.- Three types of events are defined:
1-
static
. It prohibits the ego vehicle from traversing a certainposition
on its path during atime
interval.
dynamic
. The prohibitedposition
istime
dependent. For instance leading car.- Also traffic lights, otherwise the
green
andred
phases are assumed to last forever. -
"During a
yellow
phase, the legal length of theyellow
phase is used to predict the traffic signal switch, such thatt-start
is the predicted start of thered
phase andt-end
is set to infinity. That way, the algorithm implicitly handles the decision to pass or not to pass a (recently switched) traffic light."
- Also traffic lights, otherwise the
3-
area
. It prohibits the ego vehicle to enter certain areas when they cannot be left again.
- Three types of events are defined:
-
How to penalize deviation from the
desired speed
in thereward
?- The
desired speed
depends on the road’scurvature
and the current legalspeed limit
. - Too high speeds are punished quadratically.
- Too low speeds are punished linearly.
- This allows lower velocities during decelerating upon
events
such as red traffic lights. -
"Punishing lower velocities in a linear way motivates the planner to drive with the desired velocity but allows for slower solutions (e.g. because of a temporarily occupied lane)."
- This allows lower velocities during decelerating upon
-
How to ensure consistent decisions over time and prohibit jumps in behaviour?
- The current desired
state
on the reference trajectory is considered during replanning, instead of the measuredstate
. -
"Instead of planning from the actual, measured state
x-meas(t0)
the currently desired statex-des(t0)
, retrieved from the previous planning step, is used as the start statex-start
."
- The current desired
-
What if the
transition
model is stochastic and thestate
partially observable?- Modelling with
POMDP
- Challenges: curse of history and curse of dimensionality.
-
"One of the main difficulties in solving
POMDPs
is the continuousbelief
state." - The
observation
space is continuous. But observation clustering is performed to construct the tree. -
During the simulation of the
belief
tree, theobservations
which are following a certainaction
must be clustered into a discrete number of possibleobservations
. This is the case as the structure of a tree can only be generated when a discrete number ofobservations
exist. - If a new
observation
arrives, it is tried to match it on one of the existingobservation
clusters.
-
- Modelling with
-
POMDP
formulations (multiple since different scenarios are considered ).-
action
.- A discrete set. Describing
longitudinal acceleration
.
- A discrete set. Describing
-
state
.longitudinal position
andspeed
on a path.- Hidden parts which cannot be measured directly but only inferred over time can be, depending on the scenario:
1-
[Crossing
] Thepath
followed by the other vehicle, i.e. itsintention
.2-
[Merging
]Willingness
for yielding.-
"The variable
m
is used to describe the friendliness of the other driver i.e. if he will react by yielding (m=1
) to a merge attempt or not (m=0
)."
-
3-
[Occlusion
] A Boolean, indicating whether there is a car in the occluded area. But no information about itsposition
?
-
observation
.position
as (x, y
) instead of (s
, [lat-offset
],path
) sincepath
is not observable.- no heading? this could be useful to estimate the path it follows
- [
Occlusion
] TheFoV
on every lane.
-
observation
model.- Noise is added to the true
position
andspeed
. - [
Occlusion
] For every potential phantom vehicle, an observation is also generated. Ok, but whatposition
is given?
- Noise is added to the true
-
transition
model.- To convert
acceleration
toposition
: simple physical models (point-mass object with1st
order integration). - To derive the
action
of other agents. Depending on the scenario:IDM
is used to compute thelongitudinal acceleration
.- Potentially extended with interaction-based term, when paths intersect in the near future.
-
"The
acceleration
of the other vehicles is additionally perturbed by use of Gaussian noise to represent the model uncertainty."
- A learnt classifier predicts the probability of
yielding
to a possible merge attempt /ignoring
and simply following the existing front vehicles.
- To convert
-
reward
- Considering
collision
,comfort
anddesired speed
.
- Considering
-
-
How to represent phantom vehicles [
occlusion
]?- The idea is to represent all the possible vehicle configurations in the occluded area by one reachable set:
-
"As it is infeasible to describe all possible vehicle configurations in occlusions by particular
states
, the idea is to describe all possible configurations on a occluded lane by oneset
." - For each occluded lane, a
phantom
vehicle is placed at the start of the field of viewFoV
. - It has infinite length and drives faster than the speed limit (i.e. worst-case).
-
-
"If every possible, occluded vehicle configuration would be represented explicitly, a certain subset of these configurations would be discovered, when the
FoV
is expanded." -
"Nonetheless, in this work, the idea is to represent all these configurations by one reachable set. Instead of splitting the set into many discretized subsets, the idea is to sample if the
phantom
vehicle is detected or not. The probability of this sampling is proportional to the revealed occluded area during the transition fromstate
(t
) tostate
(t+1
)." [not very clear to me]
- The idea is to represent all the possible vehicle configurations in the occluded area by one reachable set:
-
How the
belief
state can be estimated over time?
- [
Intersection
scenario] A Bayes classifier with2
-dimensional feature (position
andspeed
) computes the probability of each vehicle being on a certain router
.- The likelihood term includes
Euclidean distances
to the considered path.
- The likelihood term includes
- How is it done for the other two scenarios? I did not understand.
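A minimal sketch of such a recursive Bayesian estimation of the followed route, assuming Gaussian likelihoods on the distance to each candidate path; all numbers and names are illustrative, not the thesis' exact classifier:

```python
import math

ROUTES = ["straight", "turn_left"]
belief = {r: 0.5 for r in ROUTES}  # uniform prior over routes

def likelihood(dist_to_path, sigma=1.0):
    # Smaller Euclidean distance to a path -> higher likelihood of that route.
    return math.exp(-0.5 * (dist_to_path / sigma) ** 2)

def bayes_update(belief, dist_to_each_path):
    posterior = {r: belief[r] * likelihood(dist_to_each_path[r]) for r in ROUTES}
    z = sum(posterior.values())
    return {r: p / z for r, p in posterior.items()}

# The observed car is 0.3 m from the straight path, 2.5 m from the left-turn path.
belief = bayes_update(belief, {"straight": 0.3, "turn_left": 2.5})
print(belief)  # mass concentrates on "straight" as evidence accumulates
```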
-
How
POMDP
s can be solved?1-
Offline.-
"The
value
function of aPOMDP
is always piece-wise linear and convex in thebelief
." - The optimal value function over the continuous
belief
is represented by a set of alpha-vectors. - Point-based solvers, such as
SARSOP
,PBVI
,HSVI
.- Why "point-"?
-
"The idea of point-based algorithms is to overcome that problem by backing up the
value
function only for a discrete set of chosenbelief-points
."
-
2-
Online: with an online graph/tree search.- Traditional graph-searches are not appropriate due to the non-deterministic nature of the
transition
model. - Monte Carlo Tree Search (
MCTS
) is one option: it combines a deterministic tree search with random sampling. - For instance,
Adaptive Belief Tree
(ABT
) usesMCTS
, optimized forPOMDP
s, where the search is done in abelief
tree.-
"The combination of a smart selection method (e.g. the
UCT
algorithm) and the fast estimation of futurerewards
by a sufficiently good defaultpolicy
allows to reduce the search space drastically, which gives the algorithm its online capabilities." - Another characteristic of
ABT
: reuse previous parts of the tree.
-
- Here the
ABT
implementation "Toolkit for Approximating and Adapting POMDP Solutions in Real Time" (TAPIR
) ([:octocat:]) is used to solve thePOMDP
online.
- Traditional graph-searches are not appropriate due to the non-deterministic nature of the
-
How to make the search in the
belief
tree faster?- By initializing the
value
of newbelief
nodes with optimized roll-outs.- A heuristic is used to estimate the
value
function of the nodes. -
"The heuristic value is calculated by solving a deterministic, simplified problem online as soon as a new
belief
state is encountered." - Here using the previously developed
A*
graph-search.
- A heuristic is used to estimate the
- More precisely:
1-
Optimization of the simplified (deterministic
) problem is used for the first three steps ...2-
... Followed by a constant velocity roll-out until the optimization horizon.
- By initializing the
-
What decision frequency and tree depth, i.e.
planning horizon
?-
[On the one hand ...] "The more time is used for sampling of episodes, the better the policy is approximated."
-
[On the other hand ...] "The earlier the
policy
is sent to the trajectory layer, the less delay is introduced between the sensor measurements and the corresponding policy." - Here, the algorithm returns a decision every
200ms
(5Hz
). -
"While the solution is optimized for
200ms
, a step size of∆t=1s
is used to construct the tree to allow for a planning horizon of8-10s
." [i.e.depth
~8-10
]
-
-
Future works:
1-
ConstrainedPOMDPs
for safety guarantees.2-
Combininglearning
andplanning
: The heuristic that guides the online search could be learnt offline.
"SUMMIT: A Simulator for Urban Driving in Massive Mixed Traffic"
Click to expand
SUMMIT has been developed to simulate realistic dense, unregulated urban traffic for heterogeneous agents at any worldwide locations. Source. |
Source. |
The motion model used is Context-GAMMA that applies velocity -space optimization under kinematic (e.g. non-holonomic motion of car), geometric (collision avoidance with nearby agents) and context-related constraints to generate sophisticated, unregulated crowd behaviors. Source. |
Source. |
Authors: Cai P., Lee Y., Luo Y., & Hsu D.
-
"Driving in unregulated, crowded urban environments, like in uncontrolled roads or unsignalized intersections in less-developed countries remains an open problem."
-
Motivation for a new simulator:
- Simulates realistic dense, unregulated urban traffic for heterogeneous agents at any worldwide locations.
- Driving simulators already exist. But here the idea is:
- To scale up from complex interactions between agents to crowded urban scenes.
- Not to be restricted to predefined maps.
- To simulate large crowds, but with the quality of interactions.
- To closely represent real-world scenarios and generate high-fidelity interactive data.
-
How to work on any worldwide locations?
- With
OpenStreetMaps
. SUMMIT
relies onSUMO
to automatically convertOSM
maps to lane networks.-
"
SUMMIT
fetches real-world maps from theOpenStreetMap
, and constructs two topological graphs: alane network
for vehicles, and asidewalk network
for pedestrians."
- With
-
How to generate complex and realistic crowd interactions? What crowd behaviour model?
- The authors present a model:
Context-GAMMA
, built on top fromGAMMA
(Luo & Cai, 2019) (video), a "General Agent Motion Prediction Model". - It uses
velocity
-space optimization to generate sophisticated, unregulated crowd behaviours. -
"
GAMMA
assumes that each traffic agent optimizes its velocity based on the navigation goal, while being constrained by kinematic constraints (e.g. non-holonomic motion of car) and geometric constraints (collision avoidance with nearby agents)." Context-GAMMA
introduces a third constraint and another objective, related to the road context.
"We suggest that algorithms should leverage the road contexts to help long-term planning."
-
- The authors present a model:
-
How to generate realistic data?
-
"
SUMMIT
is based onCARLA
to benefits from its desirable features such as high-fidelity physics, realistic rendering, weather control, and rich sensors".
-
-
What decision-making module?
- A context-aware
POMDP
is proposed and compared toTTC
-based reactive system. Two components: 1-
The belief tracker for the hiddenstate
variables.-
"At each time step, it uses a motion model to compute the likelihood of transitions and observations, and updates the posterior belief using the Bayes rule."
-
2-
The online solverhyp-DESPOT
that takes the current belief and computes the optimal driving action.- About the
state
space:- observable ego (continuous):
position
,speed
, andheading
. - observable other (discrete):
position
,speed
- No information about the discretisation.
- hidden other:
type
: An agent can be eitherdistracted
, thus not interacting with the ego-vehicle, or beattentive
, thus cooperatively avoid collision with the ego-vehicle.intention
wrt. a set of path candidates.
-
"We assume that the ego-vehicle can observe its own
state
and discretized values of the observablestate
s of exo-agents. The hiddenstate
s of exo-agents can only be inferred and modelled withbelief
s."
- observable ego (continuous):
- About the
action
space:- Path-velocity decomposition:
-
"We restrict the
POMDP
to compute the acceleration along the intended path, while the steering angle is generated using a pure-pursuit algorithm."
-
- {
ACC
,MAINTAIN
,DEC
}.
- Path-velocity decomposition:
- About the
transition
model = motion model:-
"
distracted
traffic agents are assumed to track their intended path with the current speed". -
"
attentive
traffic agents also tend to follow the sampled path, but usePORCA
, an interactive collision avoidance model that is similar toGAMMA
but considerably simpler, to generate the actual local motion". - To model stochastic transitions of exo-agents, their motion is perturbed by Gaussian noises on the displacement.
-
- About the
reward
model:- Multi-objective:
safety
,efficiency
, andsmoothness
of driving.
- Multi-objective:
- A context-aware
-
Baseline and evaluation:
-
"We compare
Context-GAMMA
with a reactive model that moves agents along lane center-curves and uses time-to-collision (TTC
) to calculate the vehicle’s speed." - Criteria:
average speed
of traffic agents.congestion factor
defined as the percentage of agents being jammed in the crowd. (Jammed agents are removed from the crowd after being stationary for5s
.)
- Findings:
- The congestion factor of the
TTC
-controlled traffic grows quickly with the simulation time, indicating that agents fail to coordinate with each other. -
"We thus conclude that sophisticated long-term planning is important for driving in unregulated traffic, and
Context-POMDP
establishes a reference for future crowd-driving algorithms."
- The congestion factor of the
-
"Crossing of Road Intersections : Decision-Making Under Uncertainty for Autonomous Vehicles"
Click to expand
The author calls for a probabilistic framework to reason and make decision, due to the inherent perception uncertainty and behaviour (interaction ) uncertainty. Also, learning-based methods are avoided due to their susceptibility to over-fit if the dataset is not balanced. Source. |
The author prefers probabilistic methods, in order to deal with uncertainties while trying to offer some interpretability. The navigation module outputs a single action to be implemented. Another option would have been to deliver some policy which could be followed for several steps, limiting inconsistent transitions (especially for comfort) and favouring long-horizon reasoning. Source. |
The intended manoeuvre is inferred based on observed speed , acceleration and heading - no position term - and will be treated as an observation in the POMDP . Source. |
As noted in my report of IV19 , risk assessment can be performed by comparing the expected behaviour (expectation ) to the inferred behaviour (intention ), i.e. what should be done in the situation and what is actually observed. A discrepancy can detect some misinterpretation of the scene. Source. |
The problem is formulated as a POMDP . Source. |
Decomposition of the probabilistic transition function. Only the longitudinal control via discrete acceleration is considered. The state x consists of physical and behavioural parts. In particular, it includes the behaviour expectation for each vehicle, i.e. what should be done according to the traffic rules. It also comes with a behavioural intention , which is the inferred manoeuvre followed by the observed vehicle. intention continuation is used to describe the transition about intention , while gap acceptance models are used for the transition about expected behaviour. Finally, note that the selected acceleration action only influences the physical term of the ego vehicle. Source. |
One contribution is called Functional Discretisation . So-called motion patterns are stored within an HD-map as polygons delimiting the intersection entrance and crossing zones. These discrete crossing and merging zones are not manually defined but learnt from simulated vehicle trajectories. The continuous intersection space is therefore divided into non-uniform discrete areas. Top-right: three crossing scenarios are considered, with different pairs of priorities. Source. |
The trust KPI is based on the time gap , i.e. the delta in predicted time of when each vehicle will reach the crossing point. This should be ''maintained'' over 4s throughout the approach. Hence the use of ''temporal'' logic. The unsafe stop KPI states that the vehicle should never stand still within the unsafe area. Source. |
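A minimal, illustrative reading of the trust KPI described in the caption above; the thresholds, sampling rate and window semantics are hypothetical assumptions, not the thesis' exact definition:

```python
def time_gaps(t_ego, t_other):
    """Delta in predicted arrival times at the crossing point, per step."""
    return [abs(a - b) for a, b in zip(t_ego, t_other)]

def sustained(gaps, dt=0.5, min_gap=2.0, hold=4.0):
    # The gap must stay above min_gap for at least one 4 s window
    # (one possible reading of "maintained over 4s").
    window = int(hold / dt)  # number of consecutive samples to check
    return any(
        all(g >= min_gap for g in gaps[i:i + window])
        for i in range(len(gaps) - window + 1)
    )

ego = [5.0 - 0.4 * k for k in range(12)]    # predicted arrival times (s)
other = [8.0 - 0.3 * k for k in range(12)]
print(sustained(time_gaps(ego, other)))      # True: gap held over >= 4 s
```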
Author: Barbier M.
- What?
- A PhD thesis.
- Motivation:
- Interaction-aware and uncertainty-aware handling of a signed intersection.
- How to capture and deal with interaction in the decision?
- The intended manoeuvre is inferred (
behavioural classification
) and subsequently treated as anobservation
. - By comparing it with the expected manoeuvre, the agent should determine how to interact with the other vehicle.
- The intended manoeuvre is inferred (
- About the behavioural classification.
lateral
part: {Turn right
,Turn left
,Go straight
}.longitudinal
part: {Stop
,Yield
,Cross
}.- Six features are used:
max
andmin
speed.max
andmin
acceleration- Maximum
right
andleft
deviation from the mean heading angle.
- I would be afraid that maximum and minimum values could come from outliers and would rather have worked with quantiles (e.g.
10%
and90%
).
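A small numerical illustration of this concern, with made-up measurements:

```python
import numpy as np

# min/max features are sensitive to a single outlier, while quantiles are
# a far more robust summary of the observed speed profile.
speeds = np.array([8.0, 8.1, 8.2, 8.1, 8.3, 8.2, 8.0, 8.1, 8.2, 25.0])
print(speeds.min(), speeds.max())       # 8.0 25.0 (max jumps to the outlier)
print(np.quantile(speeds, [0.1, 0.9]))  # ~[8.0, 10.0]: much less sensitive
```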
- About risk assessment:
  - "The `intended` manoeuvre represents what the driver is doing, whereas the `expected` manoeuvre represents what the situation requires."
  - One idea: compute the difference between what the other driver IS DOING (called `intention`) and what the driver SHOULD DO (called `expectation`).
    - The `expected` behaviour can be derived e.g. according to priority rules or to gap acceptance models.
  - This discrepancy is useful for risk assessment since it can detect some misinterpretation of the scene:
    - Situations where `intention` and `expectation` do not match could result in a risky interaction.
  - By penalizing states with a large difference, the reward function incorporates feedback about the interaction and encourages the agent to select actions that reduce this risk. See the sketch below.
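A minimal sketch (mine, not the thesis' reward function) of how such a discrepancy can enter the reward as a penalty term:

```python
# Penalize states where the inferred intention deviates from the expectation.
def interaction_penalty(intention, expectation, weight=1.0):
    return 0.0 if intention == expectation else -weight

reward = 2.0                                      # e.g. a progress term
reward += interaction_penalty("Cross", "Yield")   # mismatch detected -> -1.0
```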
- About the (large) search horizon:
  - "The configuration with `γ = 0.85` and `C = 30` is chosen as it meets these characteristics. The discount value results in a search horizon of `12` seconds". See the back-of-envelope snippet below.
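One back-of-envelope way to link a discount factor to an effective horizon (my sketch, not the thesis' derivation; the `0.65 s` timestep is purely illustrative): take the depth at which the discounted weight of a reward drops below some fraction `ε`.

```python
import math

def effective_depth(gamma, eps=0.05):
    """Depth d such that gamma**d < eps, i.e. deeper rewards weigh less than eps."""
    return math.ceil(math.log(eps) / math.log(gamma))

steps = effective_depth(0.85)   # -> 19 steps
print(steps, "steps ->", steps * 0.65, "s at a 0.65 s timestep (illustrative)")
```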
- About the online `POMDP` solver: `POMCP`.
  - `action continuation` is used as the rollout policy.
  - "[One could] include imitation learning to initiate `V`(`ha`) with knowledge obtained by looking at human drivers."
  - "A memory could be used to initialize the value function from previous exploration, accelerating the search for the optimal policy."
- How to evaluate the decision-making framework?
  - The author argues that evaluation should be decorrelated from the reward crafting, hence having separate `KPI`s:
    - The reason is that systems that use their performance indicators in their value estimation are likely to overfit.
    - "Goodhart's law stating that 'when a metric is used as a target, it ceases to be a good metric'".
  - Another idea to avoid reward hacking: the reward function is designed with multiple objectives: a trade-off between `performances`, `risks` and `interactions`.
- How to decide the threshold in the `KPI`s?
  - Statistical Model Checking is applied to vary the bound of the KPIs.
    - Bounded Linear Temporal Logic (`BLTL`) allows one to state conditions that will eventually be true.
    - The author works for instance with the `probability of crossing the intersection in less than a given time`. See the sketch below.
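A minimal sketch of statistical model checking for such a bounded property (my illustration; the simulator stand-in and the numbers are made up): estimate the probability by Monte Carlo over simulated runs, then compare it to the varied bound.

```python
import random

T_MAX = 10.0   # bound of the BLTL-style property "crossed in less than T_MAX s"
random.seed(0)

def simulate_crossing_time():
    return random.gauss(8.0, 2.0)   # stand-in for one closed-loop simulation

runs = [simulate_crossing_time() < T_MAX for _ in range(10_000)]
p_hat = sum(runs) / len(runs)
print(f"P(cross in < {T_MAX}s) ~ {p_hat:.3f}")
```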
- About `ENABLE-S3`:
  - The author uses the validation and verification architecture of this European project.
  - From the `ENABLE-S3` website: "Pure simulation cannot cover physics in detail due to its limitations in modelling and computation. Real-world tests are too expensive, too time consuming and potentially dangerous. Thus, `ENABLE-S3` aims at developing an innovative solution capable of combining both worlds in an optimized manner [...] and facilitate the market introduction of automated systems in Europe."
"DESPOT-α: Online POMDP Planning With Large State And Observation Spaces"
-
[
2019
] [📝] [ 🎓National University Of Singapore
] -
[
POMDP
,online solver
,DESPOT
,parallelization
,large observation space
]
Click to expand
Unlike standard belief tree, some observation branches are removed in a DESPOT . Source. |
Top - Illustration of the particle divergence problem: when the observation space is large, particles quickly diverge into separate belief nodes in the belief tree, each of which contains only a single particle. This causes over-optimistic behaviours. Bottom - In a DESPOT-α, each node has the same number of particles as the root of the tree, and weighting is performed based on the observations. This prevents an over-optimistic evaluation of the value of the belief. Source. |
Authors: Garg, N. P., Hsu, D., & Lee, W. S.
- Previous work: "Determinized Sparse Partially Observable Tree" (`DESPOT`) by (Ye, Somani, Hsu & Lee, 2017).
- About `DESPOT`:
  - Why `Partially Observable`?
    - As the `state` is not fully observable, the agent must reason with (and maintain) `belief`s, which are probability distributions over the `state`s given the history `h`.
    - The `belief` is a sufficient statistic that contains all the information from the history of `action`s and `observation`s (`a1`, `z1`, `a2`, `z2`, ..., `at`, `zt`).
    - "By reasoning in `belief` space, `POMDP`s are able to maintain a balance between exploration and exploitation and hence provide a principled framework for [sequential] decision making under uncertainty."
  - Why `Tree`?
    - Because a search tree of histories is constructed, online.
    - The "belief tree search" aspect has to do with the `online` nature of the solver (as opposed to `offline` methods that compute an approximately optimal policy over the entire `belief` space, prior to execution):
      - "At each time step, it plans locally and chooses an optimal `action` for the current `belief` only, by performing lookahead search in the neighborhood of the current `belief`. It then executes the chosen `action` immediately."
    - "Many `POMDP` solvers do online planning by doing forward search from the current `belief`, constructing a tree which branches each time an `action` is required, and also each time an `observation` may be observed".
      - Each node implicitly represents a `belief`.
        - "Implicitly" since it contains a particle set that approximates the `belief`. This contrasts with other approaches that explicitly represent the `belief` as a probability distribution over the `state` space, e.g. with exact updates using Bayes' theorem.
        - Planning is only performed from the current `belief`, which is the root node.
      - Each node branches into `|A|` `action` edges.
      - Each action edge further branches into `|Z|` observation edges.
    - A `DESPOT` is built through trials, consisting of `exploration` and `backup` on sampled scenarios.
  - Why `Determinized`?
    - Because the search is focused on a set of randomly sampled "scenarios" that are sampled a priori.
      - A set of random numbers is generated in advance, as soon as the first belief is given.
        - As I understood, they are called `scenarios` ("abstract simulation trajectories").
      - "A small number of sampled scenarios is sufficient to give a good estimate of the true value of any policy."
      - These determinized scenarios make `DESPOT` differ from `POMCP`, which performs `MCTS` on a belief tree using `UCT`.
    - Here is my interpretation:
      - Imagine you are playing a game where your motion relies on the outcome of some dice, e.g. `Monopoly` or the `game of snakes and ladders`.
        - `Option 1`: at each timestep, you roll the dice and move accordingly.
        - `Option 2`: before starting, you roll the dice `x` times. You then put the dice away and start playing: at each timestep, you move according to the `i`-th generated number.
      - Here, these generated numbers (`scenarios`) are used to decide the `noise` injected in the evaluation of the two models used for the tree expansion: the `measurement` and `transition` functions. See the sketch just after this block.
        - That means it is known in advance, before starting to build the tree, that the `n`-th `belief`-leaf will be generated from the `measurement` function using the `n`-th sampled number as `noise` parameter.
    - "Like `DESPOT`, `DESPOT-α` uses the "particle belief approximation" and searches a determinized sparse belief tree".
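Here is a minimal sketch of that interpretation (my illustration, not the authors' code; the 1D dynamics and quantized measurement are made up): the `K × DEPTH` random numbers are rolled once, and the generative model becomes a deterministic function of them.

```python
import numpy as np
from statistics import NormalDist

K, DEPTH = 20, 10                          # number of scenarios, max search depth
rng = np.random.default_rng(seed=0)
scenarios = rng.uniform(size=(K, DEPTH))   # rolled once, before building the tree

def deterministic_step(state, action, phi):
    """Transition + measurement, with the pre-sampled number phi playing the dice."""
    noise = NormalDist(0.0, 0.1).inv_cdf(phi)   # uniform number -> Gaussian noise
    next_state = state + action + noise         # hypothetical 1D dynamics
    observation = round(next_state, 1)          # hypothetical quantized measurement
    return next_state, observation

# Expanding scenario k at depth d always reuses scenarios[k, d]:
s, z = deterministic_step(state=0.0, action=1.0, phi=scenarios[3, 0])
```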
- Why `Sparse`?
  - It is related to the question: how to represent a `belief`?
    - `DESPOT` represents the `belief` as a set of particles (particles are sampled `state`s), as for `POMCP`.
      - This enables it to overcome the issue of a large `state` space.
  - "While a standard belief tree captures the execution of all policies under all possible scenarios, a `DESPOT` captures the execution of all policies under a set of sampled scenarios."
  - Because some observation branches are removed, a `DESPOT` can be viewed as a sparse approximation of the standard belief tree:
    - The tree contains all the action branches, but only the observation branches under the sampled scenarios.
    - This also implies that `DESPOT` does not perform belief updates over the entire `state` space (addressing the `curse of dimensionality`).
  - In other words, a `DESPOT` is structurally similar to standard belief trees, but contains only the `belief` nodes reachable under the `K` sampled scenarios. See the worked numbers below.
    - Size of `SparseSampling`: `|A|^D . C^D` (sparse because only `C` observations are sampled for each action branch, and `D` is the depth).
    - Size of `DESPOT`: `|A|^D . K` (for `K` sampled scenarios).
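A quick arithmetic check of the two growth rates, with made-up values for `|A|`, `C`, `K` and `D`:

```python
A, C, K, D = 3, 20, 500, 5
print(A**D * C**D)   # SparseSampling: |A|^D * C^D = 777,600,000 nodes
print(A**D * K)      # DESPOT:         |A|^D * K   =     121,500 nodes
```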
- Additional notes about `DESPOT`:
  - Motivations: address two curses.
    - `1-` Curse of dimensionality: the `state` space, and correspondingly the dimensionality of the `belief`, grows exponentially with the number of `state` variables.
    - `2-` Curse of history: the `belief` tree grows exponentially with the `depth`.
    - `DESPOT` (as for `POMCP`) breaks the two curses through `sampling`:
      - "It alleviates the `curse of dimensionality` by sampling `state`s from a `belief` and alleviates the `curse of history` by sampling `observation`s."
  - `DESPOT` contains all the main ideas for online planning via belief tree search:
    - `1-` Heuristic search: the tree is incrementally constructed under the guidance of a heuristic.
    - `2-` Branch-and-bound pruning: `upper bounds` (computed from state-based heuristics) and `lower bounds` (computed from default policies) on the value at each `belief` node are used to prune suboptimal subtrees.
      - Note that the gap between the `lower bound` and the `upper bound` can represent the uncertainty at the `belief` node.
    - `3-` Monte Carlo sampling: only a randomly sampled subset of observation branches is explored at each node.
  - Regularization:
    - Since many scenarios are not sampled, and because the chosen policy optimizes for the sampled scenarios, it can happen that the policy does not perform well.
    - Regularization can be used to address that overfitting.
- More about "`1- heuristic search`": search guidance based on the value function.
  - "To make sure that even the partially constructed tree is able to compute a good policy, heuristics based on `upper bound` and `lower bound` on the value of `belief` nodes are used to guide the search".
  - Note that this requires the computation of the `value` of `belief` nodes: `V(b)`.
  - How to estimate the value? Using `α`-vectors.
- One concept: `α`-vectors.
  - One important property:
    - "The value function of a `POMDP` can be approximated arbitrarily well by a convex piece-wise linear function of the `belief`".
    - `V(b) = max over α [Σ over s b(s).α(s)]`
  - "An `α`-vector is associated with a conditional plan and, for each `state` `s`, captures the reward of executing the plan starting from `state` `s`."
  - Note that the number of components in an `α`-vector corresponds to the number of states and hence can be exponentially large.
  - In `DESPOT-α`, `α`-vectors are efficiently approximated to reduce computation, and used to approximate the `lower bound` on the value of `belief` nodes. See the sketch below.
  - Hence the name `Determinized Sparse Partially Observable Tree` `With α-Vector Update`.
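The convex piece-wise-linear property above is easy to state in code. A minimal `numpy` sketch (mine, not the paper's implementation) of the max-over-α inner product, and of its weighted-particle variant used when the belief is a particle set:

```python
import numpy as np

def value_of_belief(belief, alpha_vectors):
    """belief: (S,) probabilities; alpha_vectors: (N, S), one row per plan."""
    return np.max(alpha_vectors @ belief)      # max_alpha sum_s b(s) * alpha(s)

def value_of_particle_belief(weights, alpha_at_particles):
    """weights: (K,) normalized particle weights; alpha_at_particles: (N, K)."""
    return np.max(alpha_at_particles @ weights)  # weighted sum over particles
```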
- Main motivation for `DESPOT-α`:
  - Address the `particle divergence` problem to scale to large `observation` spaces.
  - "When the `observation` space is large, particles quickly diverge into separate `belief` nodes in the belief tree, each of which contains only a single particle."
  - The uncertainty can be underestimated by the derived policy, leading to poor and over-optimistic actions.
- Main idea of `DESPOT-α`:
  - To prevent the over-optimistic evaluation of the value of the `belief`, the idea is to keep a constant number of particles, and weight them (as for `POMCPOW` and `PFT-DPW` that extend `POMCP`).
  - "Instead of propagating only the particles producing the same observation to the child of a `belief`-`action` node, we propagate all the particles to the child nodes and update the weights of particles according to relative likelihood of observation `p`(`z`|`s`,`a`)."
    - See the weighting sketch below.
  - This is similar to `particle filters`.
- New issue: when computing the heuristics, propagating each particle to every child `belief` node impacts the computational efficiency.
  - "Always having `C` child `belief` nodes prevents over optimistic evaluation of value of `belief` but also makes the tree size (`C.|A|`)`^D`".
- Solution (not in `POMCPOW` and `PFT-DPW`):
  - Share the value function calculation among different (but similar) `belief` nodes, by grouping observations together.
    - "We can merge the observations, when the value of the resulting `belief`s is maximized by the same `α`-vector."
    - "We can use `α-vectors` to share the computation done for one trial among "sibling" `belief` nodes for improving `lower bounds`".
  - This leads to the concept of "sibling `belief` nodes": nodes which differ from each other only in the last observation.
    - "We are implicitly grouping `belief`s whose values are maximised by same `α`-vector by sharing `α`-vectors between sibling `belief` nodes."
    - "As sibling `belief` nodes share the same set of scenarios with different weights, `α`-vector calculated for one `belief` node can be used to calculate approximate lower bound for the sibling `belief` nodes by simply doing an inner product of weights of the particles and the `α`-vector".
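A minimal `numpy` sketch (mine) of that weighting step, with hypothetical likelihood values: all `K` particles survive in every child node and only their weights change, proportionally to `p(z|s,a)`.

```python
import numpy as np

def reweight(weights, obs_likelihoods):
    """weights: (K,) parent weights; obs_likelihoods: (K,) values of p(z | s_k, a)."""
    w = weights * obs_likelihoods
    return w / w.sum()

w_parent = np.full(4, 0.25)
p_z_given_s = np.array([0.9, 0.1, 0.5, 0.5])   # hypothetical likelihoods
print(reweight(w_parent, p_z_given_s))          # -> [0.45, 0.05, 0.25, 0.25]
```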
- To sum up - Contributions:
  - `1-` Sample a fixed number of observations for each action branch, like in `sparse sampling`, while still using determinized scenarios like `DESPOT` (it still contains only the `observation` branches reachable by the sampled scenarios).
  - `2-` Introduce a particle approximation of the `α`-vector to improve the efficiency of online policy search.
  - `3-` Further speed up the search by leveraging the `CPU` and `GPU` parallelization introduced in `HyP-DESPOT`.
    - Here the `K` particles can be expanded in parallel, which is efficient since each node contains all the particles.
"Risk-Aware Reasoning for Autonomous Vehicles"
-
[
2019
] [📝] [ 🎓Khalifa University, Abu Dhabi
] -
[
risk-bounded planning
,chance constraint
,POMDP
,hierarchical planning
]
Click to expand
Architecture to deal with uncertainty and produce risk-aware decisions. Probabilistic vehicle motions are modelled using Probabilistic Flow Tubes (PFT). These PFTs, learnt from demonstrated trajectories, represent a sequence of probabilistic reachable sets, and are used to calculate the risk of collision. This risk quantification serves in the CC-POMDP formulation of the short-horizon planner, where the ego-agent should plan the best sequence of actions while respecting a bound on the probability of collision. Uncertainty is also propagated in the higher modules of the hierarchical planning, where Temporal Plan Networks with Uncertainty (STNUs) are used to derive short-term objectives. Source. |
Authors: Khonji, M., Dias, J., & Seneviratne, L.
- One remark: not too many details are given about the implementation, but it is interesting to read a reformulation of concepts met in other works.
- One related work:
- Several ideas (`RAO*`, `PFT`, `CC-POMDP`) reminded me of the work of (Huang, Hong, Hofmann, & Williams, 2019) - Online Risk-Bounded Motion Planning for Autonomous Vehicles in Dynamic Environments - detailed further above.
  - The first author has actually been collaborating with this research group.
- One idea: hierarchical planning.
  - The uncertainty-aware decision-making task is decomposed between a `high-level` planner, a `short-horizon` planner and some `MPC`-based precomputed and learned manoeuvre trajectories.
  - Three levels of actions are distinguished for the `short-horizon` planner:
    - `Micro Actions` are primitive actions, e.g. `accelerate`, `decelerate`, `maintain`.
    - `Manoeuvre Actions` are sequences of micro actions, e.g. `merge left`, `merge right`.
    - `Macro Actions` are sequences of manoeuvre actions, e.g. `pass the front vehicle`, `go straight until next intersection`.
- One concept: "chance constraint" optimization.
  - Some measure of uncertainty (e.g. about perception, about unknown intention, about control) is available to the `short-horizon` planner.
  - The goal is to solve the optimization problem (as for vanilla `POMDP` formulations), i.e. find the optimal sequence of ego-vehicle actions, while ensuring that the probability of meeting a certain constraint (e.g. on a too-small gap, or on collisions) remains above a certain level. See the sketch below.
    - In other words, and contrary to strict constrained optimization, here there is a bound on the probability of violating constraints.
    - The policymaker can set the desired level of conservatism in the plan.
  - The authors mention `RAO*`. This is a solver for "chance-constrained POMDPs" (`CC-POMDP`).
    - During the search, it uses heuristics to quickly detect and prune overly-risky policy branches.
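To make the "chance" part concrete, here is a minimal, self-contained sketch (mine, not the paper's `RAO*` solver): among candidate plans, keep those whose estimated collision probability stays within the risk budget, then pick the one with the best expected return.

```python
DELTA = 0.01   # risk budget set by the "policymaker"

candidates = [
    # (plan name, expected return, estimated probability of collision)
    ("aggressive merge", 10.0, 0.050),
    ("standard merge",    8.0, 0.008),
    ("stay in lane",      5.0, 0.001),
]

feasible = [c for c in candidates if c[2] <= DELTA]
best = max(feasible, key=lambda c: c[1])
print(best[0])   # -> "standard merge": best return among risk-feasible plans
```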
"Tactical decision-making for autonomous driving: A reinforcement learning approach"
-
[
2019
] [📝] [ 🎓Chalmers University
] [ 🚗Volvo
] -
[
POMDP
,MCTS
]
Click to expand
Some figures:
MCTS is especially beneficial when it is necessary to plan relatively far into the future. Source. |
The RL -learnt neural network predicts two values used to guide the search. Source. |
Treat surrounding vehicles as interchangeable objects using CNN layers. Source. |
Comparison of sampling efficiency - need for domain knowledge and computational speed should also be considered. Source. |
Author: Hoel, C.-J.
- Three related works corresponding to three proposed approaches (all
RL
-related):- 1- Genetic algorithm (policy based RL) to train a rule-based driver model (each chromosome encodes a rule-based driver model via a set of instructions).
- "An evolutionary approach to general-purpose automated speed and lane change behavior" - (Hoel et al. 2018).
- 2- DQN.
- "Automated Speed and Lane Change Decision Making using Deep Reinforcement Learning" - (Hoel et al. 2018).
- 3- Combining planning (
MCTS
) and learning (model-free RL
).- The neural network predicts two values used to guide the search:
- The value function
V
(state
). - The policy
p
(state
,action
) for each possibleaction
.
- The value function
- This time, a transition function (or generative model)
G
(s
,a
) is considered and used during theSIMULATION
phase of the search. - "Combining Planning and Deep Reinforcement Learning in Tactical Decision Making for Autonomous Driving" - (Hoel et al. 2019).
- I have analysed this paper in this section of my summary of
IV19
.
- I have analysed this paper in this section of my summary of
- The neural network predicts two values used to guide the search:
- 1- Genetic algorithm (policy based RL) to train a rule-based driver model (each chromosome encodes a rule-based driver model via a set of instructions).
- One remark about the `POMDP` formulation:
  - Only the physical parts of the state (`position` and `speed`) are observed.
  - The parameters of the surrounding drivers, which are assumed to behave according to the `IDM`/`MOBIL` models, are not directly accessible by the ego-agent.
  - A particle filter is used to estimate them (belief state estimation).
- One idea: treat surrounding vehicles as interchangeable objects using `CNN` layers.
  - Using CNN layers with `max-pooling` creates a translational invariance between the vehicles. See the sketch below.
  - "The output is independent on the ordering of the vehicles in the input vector, and it also removes the problem of specifying a fixed input vector size, which instead can be made larger than necessary and padded with dummy values for the extra slots"
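A minimal `numpy` sketch (mine, standing in for the paper's `CNN` + `max-pooling` layers) of why the output ignores the ordering of the vehicle slots: the same weights are applied to every slot before pooling.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))          # shared weights: 4 features/vehicle -> 8 hidden

def encode(vehicles):                # vehicles: (n_slots, 4), padded with dummies
    h = np.maximum(vehicles @ W, 0)  # the same "1x1 conv" applied to every slot
    return h.max(axis=0)             # max-pooling across the vehicle dimension

x = rng.normal(size=(5, 4))
assert np.allclose(encode(x), encode(x[::-1]))   # slot order does not matter
```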
- About `sampling efficiency`, `domain knowledge` and the trade-off of `speed` vs. `generality`:
  - The `GA` agent requires much domain knowledge in the form of handcrafted features (the form of the instructions).
  - The `DQN` agent requires between `2` and `3` orders of magnitude less driving time than the `GA` agent.
  - "The `MCTS`/`NN` agent requires the most domain knowledge, since it needs a generative model `G` of the environment, a belief state estimator, and possibly knowledge on how to prune actions that lead to collisions."
- Results:
  - The baseline is a rule-based approach built with `IDM` and `MOBIL` driver models (also used in the generative model and to simulate other vehicles).
  - "All methods outperform the baseline `IDM`/`MOBIL` model by taking decisions that allows the vehicle to navigate through traffic between `5%` and `10%` faster."
  - `MCTS` is especially beneficial when it is necessary to plan relatively far into the future (e.g. the highway exit case).
"WiseMove: A Framework for Safe Deep Reinforcement Learning for Autonomous Driving"
-
[
2019
] [📝] [] [ 🎓University of Waterloo
] -
[
MCTS
,options framework
,LTL
,hierarchical decision making
,POMDP
]
Click to expand
One figure:
Source. |
Authors: Lee, J., Balakrishnan, A., Gaurav, A., & Feb, L. G.
- One related work: the presented approach reminds me of the work of Paxton, C., Raman, V., Hager, G. D., & Kobilarov, M.
- One term: "WiseMove": the presented options-based modular safe DRL framework.
- The modular, or hierarchical, aspect comes from the option framework. Sometimes called macro-actions.
- For more on Hierarchical RL, check out this
thegradient.pub
post. - The idea is to decompose the decision by working with temporal abstracted actions (e.g. slow down, turn left) on a high-level (like a behaviour planner).
- Each of these so called options rely on low-level primitive policies that implement their manoeuvres (similar to a geometrical trajectory optimizer).
- One idea: LTL formalism is used to check the validity of high-level decisions.
- An option is defined by (1) an underlying primitive policy, but also by (2) an initial condition and (3) a terminal condition.
  - For instance, the option `take-over` is available only if a vehicle is on my lane and a second lane exists. The manoeuvre is finished when I arrive in front of the other vehicle.
  - I like to think of it as another sort of masking mechanism. See the sketch below.
- Here, these conditions are expressed as hand-crafted rules in an LTL-like syntax.
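A minimal sketch (mine, with hypothetical field names) of an option as such a triple, and of the masking effect of the initial condition:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    name: str
    policy: Callable[[dict], float]     # (1) low-level primitive policy
    can_start: Callable[[dict], bool]   # (2) initial condition
    is_done: Callable[[dict], bool]     # (3) terminal condition

take_over = Option(
    name="take-over",
    policy=lambda s: 1.0,                                   # placeholder command
    can_start=lambda s: s["lead_vehicle"] and s["n_lanes"] >= 2,
    is_done=lambda s: s["gap_to_lead"] < 0.0,               # now in front of it
)

state = {"lead_vehicle": True, "n_lanes": 2, "gap_to_lead": 30.0}
available = [o for o in [take_over] if o.can_start(state)]  # action masking
```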
- One remark: I think we are currently missing open-source simulators that offer OpenAI `gym`-like APIs for training and testing RL approaches for decision making.
SUMO
have been developed. - For instance @LucasAlegre, @bstriner, @SaloniDash7, @sycdlcrain or FLOW which looks promising since it keeps being developed.
- Here, the author of
WiseMove
release anenv
python module (together withverifier
,options
andbackends
) that should fulfil this function.
- Several interfaces to
- Another remark: Combining learning [RL] and planning [(MC) tree search] is an idea I find very promising.
- Here, the safest next option is selected based on the stochastic look-aheads performed by the MCTS (safety check).
- In return, the options effectively reduce the number of decisions needed to reach any depth in the tree (sampling efficiency).
"A Simulation-Based Reinforcement Learning Approach for Long-Term Maneuver Planning in Highway Traffic Scenarios"
-
[
2019
] [📝] [ 🎓Technische Universität Darmstadt
] [ 🚗Opel
] -
[
combining learning/planning
,hierarchical/modular decision making
,POMDP
,SUMO
]
Click to expand
- One diagram is better than 100 words:
The term action comprises a lateral manoeuvre decision and a set speed request. Source. |
Authors: Augustin, D., Schucker, J., Tschirner, J., Hofmann, M., & Konigorski, L.
- One remark: I like the hierarchy and modularity of the approach.
- Especially the fact that the
action
stays high-level (speed desire
andhigh-level manoeuvre
), as opposed tosteering angle
andthrottle
commands that are often used in RL.
- Especially the fact that the
- One promising tool: FLOW
FLOW
is a Python library that interfaces the RL libraries RLlib and rllab withSUMO
. It has been developed and is supported by UC Berkeley.- It has not been used many times (because of the lack of Windows support?). Instead, many research using
SUMO
develop their own interface, which makes comparison and reproduction difficult. - A few recent
FLOW
-based works can be mentioned though:- "Simulation to scaled city: zero-shot policy transfer for traffic control via autonomous vehicles" by (Jang et al., 2018)
- "Benchmarks for reinforcement learning in mixed-autonomy traffic" by (Vinitsky et al., 2018)
"Towards Human-Like Prediction and Decision-Making for Automated Vehicles in Highway Scenarios"
-
[
MCTS
,online POMDP
,POMCP
,progressive widening
,SUMO
]
Click to expand
Note: the planning part of this thesis takes advantage of the prediction approaches and the driver models referenced in previous sections.
- In particular, the derived predictive model is used for both the belief update (instead of often-used particle filters and IMM filters) and as a generative model (for forward simulations) of the POMDP.
- The value estimations in the tree search are based on the learnt driver model and the long-term prediction methods previously referenced.
Some figures:
Comparison of recent POMDP-based planning modules. Source. |
Construction of the tree search with belief updates and model-based rollouts. Source. |
Source. |
Author: Sierra Gonzalez, D.
-
The author targets some "human-like tactical planning".
- The POMDP formulation is ideal since it considers uncertainty in the
controls
,states
, and theintentions
of the traffic participants. - The idea is to include estimation of intentions for long-term anticipation.
- The POMDP formulation is ideal since it considers uncertainty in the
-
One idea: about the rollout policy used for the construction of the search tree.
- One option is to use a random rollout policy.
- Here, the previously-derived models are used to predict approximately the long-term development of traffic scenes.
-
Another idea: adapt the combination of model-based and manoeuvre-estimation-based predictions, depending on how far the rollout looks into the future.
"As we go deeper into the history tree (that is, into the future), the observed dynamics of the targets at the root node become less relevant and so we rely increasingly in the model to predict the behaviour of the obstacles."
-
Other related works:
- The Hierarchical architecture of (Sonu, E., Sunberg, Z., & Kochenderfer, M. J. (2018). "Exploiting Hierarchy for Scalable Decision Making in Autonomous Driving").
- The
Double Progressive Widening (DPW)
orprogressive unpruning
of (Sunberg & Kochenderfer, 2017) to deal with continuous observations. - The general approach of (Bouton, M., Cosgun, A., & Kochenderfer, M. J. (2017). "Belief state planning for autonomously navigating urban intersections"). The main difference is the substitution of the Interacting Multiple Model (
IMM
) with DBN-based model to consider the interactions between vehicles.
-
One quote about the (relative) benefits of POMDP formulations:
"There have not been significative differences between the decisions taken by the proposed POMDP planner and the reactive SUMO model. This is due to the fact that neither of those scenes truly required to analyse the long-term consequences of a maneuver".
"Value Sensitive Design for Autonomous Vehicle Motion Planning"
-
[
2018
] [📝] [ 🎓Stanford University
] [ 🚗Ford
] -
[
POMDP
,QMDP
]
Click to expand
The POMDP policy better deals with uncertainties in the detection of the pedestrian. It accounts for the possible transitions between the detected and not-detected cases, leading to smoother actions across the state space. Source. |
Authors: Thornton, S. M., Lewis, F. E., Zhang, V., Kochenderfer, M. J., & Christian Gerdes, J.
- Motivation:
- Apply the
VSD
methodology / formalism to the problem of speed control for the scenario of an occluded pedestrian crosswalk.
- Apply the
- About Value Sensitive Design (
VSD
):-
"[Wikipedia]: A theoretically grounded approach to the design of technology that accounts for human values in a principled and comprehensive manner."
- In (
PO
)MDP
s, engineers account for some "human values" in the design of the reward function. - Values are converted to specifications:
safety
: harm reduction or collision avoidance.legality
: care when approaching a crosswalk.mobility
: efficiency.smoothness
: comfort.
- Stakeholders are also identified:
- the AV and its occupants.
- the obstructing vehicle parked.
- the pedestrian potentially crossing the street.
- the authority of traffic laws.
-
- About the
POMDP
formulation:- The
belief
of a pedestrian crossing is tracked with some Bayesian filter.- The pedestrian detection is a Boolean value because the pedestrian is either crossing or not.
-
"There is
observation
uncertainty for the pedestrian crossing with a false positive of5%
for detecting and a false positive of5%
for not detecting the pedestrian, which captures sensor uncertainty. -
"When the pedestrian is detected, there is a
90%
probability the pedestrian will continue to be detected at the next time step. When the pedestrian is not detected, then there is a50%
chance he or she will continue to not be detected, which captures the uncertainty due to the occlusion.
- The
QMDP
solver fromJuliaPOMDP
is used.-
"Although
QMDP
assumes that at the next time step the state will be fully observable, it is well suited for this problem because the actions are not information gathering, meaning the actions do not directly reduce the uncertainty of the scenario." - I do not agree with that statement: information gathering is key in this scenario to resolve the ambiguity in the detection.
-
state
andaction
spaces are discrete.- But "continuousness" is maintained using "multilinear grid interpolations" for the state transitions, as in (Davies 1997).
- The
- Benefits of
POMDP
:-
"A stochastic optimization problem can account for modeled uncertainty present in the driving scenario while balancing the identified values through the objective function."
- The baseline on the other hand is reactive:
if
detection,then
decelerate to stop at crosswalk.else
target the desired velocity with a proportional controller.
- For this scenario where the detection is key but uncertain, one needs to anticipate `transitioning` from one set of logic to the other.
- When the pedestrian is not detected: the baseline is efficient, but not safe.
- This rule-based dichotomy makes the baseline control have full speed when the pedestrian appears and prevents it from legally yielding to the pedestrian.
-
"Decision Making Under Uncertainty for Urban Driving"
-
[
2018
] [📝] [] [ 🎓Stanford
] -
[
POMDP
,MCTS
,julia
,probabilistic risk assessment
,value iteration
]
Click to expand
One figure:
Comparing the vanilla POMCP and proposed safe variant of it. Source. |
Authors: Weingertner, P., Autef, A., & Le Cleac’h, S.
- One algorithm:
POMCP
.- Presented in 2010,
POMCP
is an extension of the traditional MCTS algorithm to POMDP. - Together with
DESPOT
,POMCP
is an often-used POMDP online solver.
- Presented in 2010,
- One term: "observation class".
  - Different extensions of `POMCP` and `DESPOT` have been proposed. In the presented approach, the goal is to work with continuous observations, while ensuring safety.
  - The idea is to limit the number of observation nodes in the tree by grouping observations based on some utility function.
    - This utility function should not be confused with the offline-learnt value function representing the probability of collision.
  - The safe clusterization of observations can be based on the smallest `TTC` or the smallest distance to other participants. See the sketch below.
- One idea: guide the online graph search using offline methods to improve safety.
  - This is based on the work of (Bouton, Karlsson, et al., 2019), where offline `VI` (value iteration) is used to compute `P_collision`(`s`, `a`).
  - This safety criterion is then used to limit the set of safe available actions.
  - In the presented work, the authors reason over the `belief` instead of the `state`.
- Another idea: use a Kalman Filter (instead of some particle filter) as the belief updater.
- One quote:
"Using an online method based on sparse sampling may lead to safety issues. Rare events with critical consequences may not be sampled leading to sub-optimal and potentially dangerous decisions."
- One promising tool: POMDPs.jl
POMDPs.jl
is an interface for defining, solving, and simulatingMDPs
andPOMDPs
on discrete and continuous spaces. It has been developed and is supported by a team from Stanford.
- Two ideas for future works:
- In their repo, the authors suggest combining learning (e.g. model-free RL used as a heuristic and/or for rollout) with planning (MCTS), mentioning the success of AlphaGo Zero.
- Another improvement concerns the transition model for the observed vehicles. Instead of
CV
(constant velocity) models, one could assume surrounding vehicles are following a driver model (e.g.IDM
) and the task would be to infer its hidden parameters.
"Baidu Apollo EM Motion Planner"
Click to expand
Top-left: hierarchical framework: lane-level trajectories are generated in parallel and eventually compared to decide on lane-changes. They can be passive (e.g. when the default lane is blocked by an obstacle) or non-passive (requests triggered by the routing module for the purpose of reaching the final destination). For each lane, both path and speed optimizations are iteratively solved in the Frenet frame using a combination of dynamic programming for rough but feasible decision-proposals (E-step) and spline-based quadratic programming for smoothing and optimization (M-step). Source. |
Bottom: note the path <->speed communication: 1- The speed profile from the last cycle is used to evaluate the dynamic obstacle interactions , for instance to estimate the time-to-collision with oncoming and low-speed dynamic obstacles in the path optimizer. 2- The generated path is sent to the speed optimizer to compute an optimal speed profile. Source. |
Top: Search within a grid is performed using dynamic programming (DP ). In addition to the boundary constraint and the dynamic constraints (acceleration , jerk limits and monotonicity ), the generated path shall match the ego car’s initial lateral position and derivatives. Bottom: Since all constraints are linear with respect to spline parameters, a quadratic programming solver (QP ) can be used to solve the problem very fast. Source. |
Scenario -based planning was introduced in v3.5 . Source1 Source2. |
Authors: Fan, H., Zhu, F., Liu, C., Zhang, L., Zhuang, L., Li, D., Zhu, W., Hu, J., Li, H. & Kong, Q.
-
Motivations:
-
"This planner targets safety and ride experience with a multilane, path-speed iterative, traffic rule and decision combined design."
- Avoid
state
-based descriptions used inhand-tuning
decisions (tuneable but not scalable) andmodel-based
decisions (e.g. data-driven-tunedFSM
).-
"It is true that a heavy-decision-based algorithm, or heavily rule-based algorithm, is easily understood and explained. The disadvantages are also clear: it may be trapped in corner cases (while its frequency is closely related to the complexity and magnitude of the number of rules) and not always be optimal."
-
- Real-world industrial applications are targeted, as opposed to theoretical simulator-based research experiments. Especially regarding `safety`.
"We aim to provide a trajectory with at least an
8s
or200m
meter motion planning trajectory." -
"In the case of an emergency, the system could react within
100ms
, compared with a300ms
reaction time for a normal human driver."
-
-
-
About
path-velocity
decomposition and Frenet
frame.-
"Many autonomous driving motion planning algorithms are developed in
Frenet
frames with time (SLT
) to reduce the planning dimension with the help of a reference line. Finding the optimal trajectory in aFrenet
frame is essentially a3D
constrained optimization problem." - Instead of a direct
3D
optimization, the problem is converted into two2D
problems:path
optimization andspeed
optimization.
-
-
Why
EM
?- It is based on an
EM
-type iterative algorithm:-
"[Wikipedia] It alternates between performing an expectation (
E
) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M
) step, which computes parameters maximizing the expected log-likelihood found on theE
step." - Motivations for two-step optimization:
-
"Finding a best
path
/speed
profile on theSL
/ST
graph is a non-convex optimization problem." -
"In
Apollo
EM
-planner, we make decisions prior to providing a smooth trajectory. The decision process is designed to make on-road intentions clear and reduce the search space for finding the optimal trajectory."
-
-
- Here:
1-
E-path
. Based on the information projected in theFrenet
frame (SL
= arclengthS
vsLateral
gap), a smoothpath
is generated.- Search within a grid is performed using dynamic programming (
DP
) to reach a rough resolution.-
"Some necessary pruning based on vehicle dynamic constraints is also applied to accelerate the process."
-
- The
DP
results are used to generate a convex domain and guide the spline-based quadratic programming (QP
).- This solution can provide obstacle decisions such as
nudge
,yield
andovertake
.
- This solution can provide obstacle decisions such as
- Search within a grid is performed using dynamic programming (
2-
M-path
. Find a feasible smoothpath
solution in this region.QP
is used to search for the optimal solution in the convex region supplied by theDP
step, balancing between following the guidance line and smoothness (heading
,curvature
andderivative of curvature
).QP
requires a decision from theDP
step, such asnudge from the left
,nudge from the right
,follow
, orovertake
, to generate its constraint.-
"For both the
path
andspeed
optimizers, we find that piecewisequintic
polynomials are good enough. The spline generally contains3
to5
polynomials with approximately30
parameters. The quadratic programming problem has a relatively small objective function but large number of constraints [boundary constraints
anddynamic feasibility
]. Consequently, an active setQP
solver is good for solving the problem." - The result calculated in the last cycle is re-used as a hot start.
- Average solving time:
3ms
.
3-
E-speed
. Obstacles are projected on the station-time graph (ST
= arclengthS
vsTime
) and find a first solution withDP
.- Since the piecewise linear
speed
profile cannot satisfy dynamic requirements, the splineQP
step is needed to fill this gap.
- Since the piecewise linear
4-
M-speed
. Find a feasible smoothspeed
solution in this region.-
"The spline
QP
speed step includes three parts:cost functional
,linearized constraint
and splineQP
solver."
-
- It is based on an
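A toy, self-contained illustration (mine, not Apollo code) of the two-step pattern on a speed profile: a greedy feasibility-driven pass stands in for the `DP` proposal, and a polynomial least-squares fit clipped to stay near the proposal stands in for the spline `QP` smoothing.

```python
import numpy as np

T, v_grid = 8, np.linspace(0.0, 10.0, 21)          # horizon steps, speed grid
v_limit = np.array([10, 8, 6, 6, 4, 6, 8, 10.0])   # per-step limits (obstacles)

# "E-step": coarse proposal on the grid, under accel and boundary constraints.
rough, v = [], 8.0
for t in range(T):
    candidates = v_grid[np.abs(v_grid - v) <= 2.0]   # bounded speed jumps
    candidates = candidates[candidates <= v_limit[t]]
    v = candidates.max()                             # prefer fast & feasible
    rough.append(v)

# "M-step": smooth the rough profile, then clip back into a +/- 1 m/s corridor.
coeffs = np.polyfit(np.arange(T), rough, deg=3)
smooth = np.clip(np.polyval(coeffs, np.arange(T)),
                 np.array(rough) - 1.0, np.array(rough) + 1.0)
print(np.round(smooth, 2))
```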
-
How to tune a
cost
functional that can adapt to different scenarios?- As the scenario becomes more complicated, tuning to improve the motion planner performance becomes increasingly difficult.
- One idea is to learn these parameters from human demonstrated driving data:
auto-tuning
.
-
About
Apollo
:- Features are progressively added from the
GPS
waypoint following inv1.0
(2017) to urban driving inv5.5
(2020). - Scenario-based planning e.g.
follow-lane
,intersection
,park
,emergency
, was introduced inv3.5
:-
"We felt the need to move to a more modular, scenario specific and wholistic approach for planning its trajectory. In this approach, each driving use case is treated as a different driving scenario. This is useful because an issue now reported in a particular scenario can be fixed without affecting the working of other scenarios as opposed to the previous versions."
-
v5.5
focuses oncurb-to-curb
autonomous driving on urban roads, with aPark-and-go
scenario, useful in situations likecurb
-side delivery or passenger pick-up or drop-off.- License: Apache-2.0.
- Features are progressively added from the
"Neural network modeling for steering control of an autonomous vehicle"
-
[
2017
] [📝] [ 🎓Johns Hopkins University
] [🚗zoox
] -
[
model-based
,transition function
,parameter identification
,residual learning
,physics function
,steering system
,power steering software
,RNN
]
Click to expand
The low-level controller must compute a steering_torque to track a reference {steering_angle , steering_rate } sequence. Using a PID alone leads to poor performances. The idea is to augment it with a so-called 'feed-forward steering torque', estimated by different methods. Source. |
Main idea: learn the 'steering_torque -> steering_angle ' transition model and use it inside MPC to track a reference trajectory. The RNN does not predict this mapping from scratch. Rather focus on the residual dynamics not modeled by a physics function, capturing the effects of road-tire interactions, power steering logic, and steering system dynamics. Source. |
Authors: Garimella, G., Funke, J., Wang, C., & Kobilarov, M.
-
In short: Low-level control performed with planning, where the
transition
function is learnt. -
Motivations:
1-
Steering actuator dynamics are difficult to model due to the integrated proprietary power steering control module.2-
For safety and interpretability reasons, prefer a multi-layered model-based approach, as opposed toend-to-end
.3-
Tracking thesteering_angle
by controlling the steering torque is hard for a singlePID
.- The idea is to augment the command with a steering torque that can be computed in different ways.
-
"The goal of introducing the feedforward steering torque is to improve the system response and reduce tracking error without increasing
PID
gains."
-
Layered architecture:
1-
Some decision-making module computes a reference path to follow.2-
This desired path is converted to some reference "steering trajectory" (sequence of {steering_angle
,steering_rate
}) by aMPC
.3-
This desired "steering trajectory" is tracked by aPID
by commanding thesteering torque
.- The idea is to augment the
PID
with some feed-forward steering torque model.
- The idea is to augment the
-
Several models to compute the feed-forward steering torque.
- All performed online.
- Output: feed-forward steering torque. To support the
PID
. - Input: a reference
steering angle
andforward velocity
. 0-
No additional torque (justPID
).- Bad. Because of larger errors the integrator performs much of the tracking task.
- It is limited by the time delays inherent in the system.
1-
Lookup table.- Marginal improvement, since the maneuvers are highly dynamic.
2-
Inverting some "first principles model".- Best that look-up but cannot model the road-wheel interactions correctly at low velocities.
-
"This is likely because at lower velocities, the power steering system provides larger assist, and that effect is difficult to model directly without knowing the underlying power steering software algorithm."
3-
Solving a N-MPC problem.- But for that a
transition
function is required. - Solution: this function is learnt (offline).
- But for that a
-
How to model the steering dynamics?
state
:x
- steering angle
δ
. - steering rate
δ˙
. - lateral velocity
vy
. - yaw rate
φ
.
- steering angle
command
:u
- applied steering torque
τs
. - longitudinal speed
vx
.
- applied steering torque
- Discrete-time `model`: nonlinear function `f`.
  - `x_t+1 = f(x_t, u_t)`
  - `f` can be derived:
    - `1-` From a "first principles" approach.
    - `2-` Using a non-parametric model such as a neural network.
-
1-
Baseline:f
derived from first principles, i.e. from theory.torque
= (moment of inertia
).(angular acceleration
)- Here, the steering
torque
is a composition of:- Coulomb friction from the steering rack.
- Jacking torque due to camber angle.
- Damping caused by rigid body dynamics.
- Self-aligning torque: produced due to tire deformation when the steering wheel moves against the tire thread.
- At small slip angles, it is proportional to the lateral force applied to the front tire.
- Power steering torque.
- Its logic is unknown: proprietary software.
- The power steering torque is modeled as:
g
(τs
,vx
).τs
τs
= torque sensor input.g
= function parametrized by5
parameters.
- A lot of unknown parameters:
vx
and theφ
are propagated using the bicycle model.- The lateral force applied to the front tire is estimated by a tire model.
- From here, the
4
weights and the5
parameters of functiong
are missing.- They are learnt with least-squares regression.
-
2-
RNN
:f
as a non-parametric model:- Main idea: augment the
RNN
model with a physics function block and predict residuals.- It improves the gradient flow of the network.
-
"The neural network layers predict the residual dynamics not modeled by the physics function."
- Example of residuals captured by the net: the effects of road-tire interactions, power steering logic, and steering system dynamics.
- Composed of several units of "neural network blocks" stacked in time.
- Each "block" is divided into:
- A known physics function:
  - `δ_t+1 = δ_t + k1 . δ˙_t`
  - `δ˙_t+1 = 0`
  - `φ˙_t+1 = vx . tan(k2 . δ_t)`, with `k2` the inverse of the wheelbase of the car.
  - `vy_t+1 = 0`
- A stack of
N=2
FC layers.
- A known physics function:
- Each "block" is divided into:
- Input: previous output + new command (
δ
,δ'
). - Loss:
-
"The state deviations are weighed using a diagonal matrix
P
to enforce a uniform scale across the deviations. The matrixP
is usually chosen as the inverse covariance of the sensor measurements."
-
- About the initialization:
- Divergence is noticed for random initialization of parameters. Especially when propagating the dynamics recursively.
- The optimization is thus performed in two stages:
1-
Learn1
-step predictions only.2-
Optimize over the entire trajectory segment.
- Training:
20,000
trajectory segments with a0.5
-second horizon.- Mini-batch gradient descent with a batch size of
200
samples. - Ablation study: The addition of a second layer along with physics function improves the performance of prediction significantly.
- What is done with this learnt
transition
model? Used inMPC
to track.- The
MPC
derives the optimal trajectoryτ∗
at each time step. - Only one element is used, together with the
PID
:- Either the first one (classical
MPC
) - Or some future stamped steering torque if there are delays in sending the torque command.
- Either the first one (classical
- The
- Main idea: augment the
"On Monte Carlo Tree Search and Reinforcement Learning"
-
[
2017
] [📝] [ 🎓Universities of Ljubljana and Essex
] -
[
RL
,MCTS
,learning
,planning
]
Click to expand
Four parameters introduced in a TD-Tree Search (TDTS ) algorithm related to forgetting , first-visit updating , discounting and initial bias . Source. |
Author: Vodopivec, T., Samothrakis, S., & Ster, B.
-
Goal: The authors aim at promoting a "unified view of
learning
,planning
, andsearch
".- First, the difference
planning
/learning
is discussed.- It depends on the source of experience: simulated / real interaction.
- Then, sample-based (RL and MCTS) search algorithms can all be described as a combination of:
- 1- A learning algorithm, = how the estimates get updated from gathered experience.
- 2- A control policy, = how actions get selected.
- 3- And a representation policy = how is the underlying representation model adapted.
- First, the difference
-
Some
RL
/MCTS
similarities:- They are somehow connected to MDP formulations.
- Both cope with the exploration-exploitation dilemma.
- Both are value-based and share the concepts of:
- Policy evaluation:
backpropagation
phase - Policy improvement:
selection
phase.
- Policy evaluation:
- Both exhibits anytime behaviours.
-
Two major
RL
/MCTS
differences:- RL methods do not recognize the playout phase, i.e. a separate policy for non-memorized (i.e., non-represented) parts of space.
- In MCTS it is often called the "default policy".
- It would be beneficial to have a more complex default policy (where expert knowledge could be included).
- They present two different memorization approaches and two different approximation accuracies:
RL
methods based on value function approximation (e.g. NN) can weakly describe the whole state space.- The
RL
theory should acknowledge a non-represented (i.e., non-memorized) part of the state space, i.e. the part that is not described (estimated) by the representation model at a given moment.
- The
MCTS
algorithms initially approximate only a part of the state space (with high accuracy).- Therefore
MCTS
maintains the distinction between a memorized and non-memorized part of the state space. - The state-space representation is changed online: it is an "adaptive (incremental) representation method".
- Indeed “incomplete” representations can be beneficial: it might be better to accurately approximate the relevant part of the space and less accurately (or not at all) the remaining part.
- Therefore
- RL methods do not recognize the playout phase, i.e. a separate policy for non-memorized (i.e., non-represented) parts of space.
-
Contribution: Based on this comparison, the authors introduce a framework called "temporal-difference tree search" (
TDTS
) which aims at combining the advantages of both RL and MCTS methods.- How it extends classic MCTS methods:
- Replace MC backups with bootstrapping backups.
- This is the idea of
TD search
: do not wait for the end of the trajectory to backup but instead update state value estimates based on previous estimates.TD
errors are decayed (γ
) and accumulated (λ
). It boils down to standardUCT
ifλ
=
γ
=
1
.
- The motivation is to reduce the variance of the estimates, I guess similarly to the introduction of a
TD
-based Critic in Actor-Critic methods. - This is done using "Eligibility Traces" ("traces" because it tracks which states were previously visited and gives them credit, i.e. "eligibility"), as in
n
-step SARSA.- The authors note that the eligibility trace decay rate
λ
can be hard to tune.
- The authors note that the eligibility trace decay rate
- This is the idea of
- Replace MC backups with bootstrapping backups.
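In code, the contrast between the two backup rules looks like this (my sketch; `alpha` is a learning rate): the TD backup updates a node from the next node's current estimate instead of waiting for the final return of the trajectory.

```python
def mc_backup(value, n_visits, total_return):
    """Monte Carlo: average the full returns observed through this node."""
    return value + (total_return - value) / n_visits

def td_backup(value, reward, next_value, gamma=1.0, alpha=0.1):
    """Bootstrapping: move towards reward + discounted next estimate."""
    return value + alpha * (reward + gamma * next_value - value)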
- How it extends classic RL methods:
- Contrary to
TD search
methods,TDTS
uses:- Some
playout
andexpansion
phases.- It has some representation policy for incremental or adaptive representation of the non-memorized part of the state space.
- A
playout
policy andplayout
values (as opposed to the already existingcontrol
policy).- The idea is to replace the missing value estimates in the non-memorized part of the state space with "playout values".
- These values can be regarded as a placeholder (entry point) for expert knowledge.
- Some
- Contrary to
- All in all: recreate the four MCTS iteration phases:
- (1) Selection – control in the memorized part of the search space.
- (2) Expansion – changing the representation (adaptive incremental model).
- (3) Playout – control in the non-memorized part of the search space.
- (4) Backpropagation – updating the value estimates with bootstrap.
TDTS
is applied toUCT
(UCB
selection policy in Tree search), leading toSarsa
-UCT
(λ
).
- How it extends classic MCTS methods:
-
One takeaway: as identified in previous summaries, one idea (e.g. in AlphaGo) is to combine:
- Prior knowledge (value estimates from the RL-pre-trained network),
- And online feedback (
MC
evaluations based on playouts).
-
Two terms I learnt:
- "Transposition tables":
- As I understand, it plays the role of generalisation in function approximation: two similar states should have similar values.
- It originates from search of the game tree, where it is possible to reach a given position in more than one way. These are called
transpositions
- The transposition table is a kind of cache: on encountering a new position, the program checks the table to see whether the state has already been analysed. If yes, the stored value can be used instead of calculating it (which would require expanding a subtree).
- "afterstates":
- I understand it as a third metric to quantify the "value":
- V(
s
): state value of being in states
and following policyπ
. - V(
s'
): afterstate value of arriving at states'
and thereafter following policyπ
. - Q(
s
,a
): action value of being in states
, taking actiona
and thereafter following policyπ
.
- V(
- This is used in the presented Sarsa-UCT(λ) algorithm.
- I understand it as a third metric to quantify the "value":
- "Transposition tables":
"Online decision-making for scalable autonomous systems"
-
[
2017
] [📝] [🎞️ (slides)] [ 🎓University of Massachusetts
] [ 🚗Nissan
] -
[
POMDP
,scaling
,scene decomposition
]
Click to expand
Two solvers have been developed offline : one to deal with one vehicle (Decision-Problem P1 ), and another that can deal with one pedestrian (P2 ). DP s are instantiated for each detected entity: here two cars and one pedestrian. Therefore, at each timestep, three recommendations are issued. The most conservative one is kept and implemented (here stop ). Source. |
The proposed solution offers a number of advantages over the direct use of a massive monolithic POMDP for planning and learning. First, it remains tractable, growing linearly in the number of decision-making problems encountered. Secondly, its component-based form simplifies the design and analysis and offers easier interpretability. Source. |
The authors consider that urban deployment of AVs requires mid-level decision-making. Hence both state and action are rather abstract. Source. |
Author: Wray, K. H., Witwicki, S. J., & Zilberstein, S.
- Motivation:
- Scale!
- More precisely, provide a general solution to intersection decision-making that can cope with complex real-world scenarios, with varying number of lanes and number of entities (e.g.
car
andpedestrian
).
- One option: a single (big) policy.
-
"A single all-encompassing
POMDP
with these state factors quickly becomes utterly infeasible, and will vary greatly among intersections." - Instead, the authors suggest decomposing the problem.
-
- Main idea:
  - `1-` Decompose the scene into canonical simple scenarios.
  - `2-` Solve each instance separately with solvers that have been learnt offline. It leads to one recommendation per instance.
  - `3-` Aggregate all the recommendations using some (conservative) "aggregation operator".
    - It could have been `min()` if the action were about the `acceleration`.
    - Here, the action is more abstract (`go`, `stop`) and the operator uses lexicographic preference for the safest action (a concept called `LEAF`): `stop` > `go`. See the sketch below.
- About the `POMDP` `offline` solver:
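A minimal sketch (mine, with hypothetical stand-in policies) of the decompose / solve / aggregate loop with the conservative `stop` > `go` preference:

```python
PRIORITY = {"stop": 0, "go": 1}           # lower = safer, preferred by LEAF

def aggregate(recommendations):
    return min(recommendations, key=lambda a: PRIORITY[a])

policy_car = lambda obs: "go" if obs["gap"] > 20 else "stop"    # stand-ins for
policy_ped = lambda obs: "stop" if obs["crossing"] else "go"    # offline policies

scene = [policy_car({"gap": 35}),          # one recommendation per entity
         policy_car({"gap": 12}),
         policy_ped({"crossing": False})]
print(aggregate(scene))                    # -> "stop": most conservative wins
```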
"Planning under Uncertainties for Autonomous Driving on Urban Road"
-
[
POMDP
,DESPOT
,urban driving
,intention estimation
]
Click to expand
The author prefers the reaction -based over the goal+motion -based motion intention estimation. It relies on deviation measurement between the observed behaviour and some reference vehicle state. Source. |
The agent maintains a belief over the motion intention of the other vehicles to decide of the longitudinal discrete action in {accelerate , maintain speed , decelerate }. The motion intention is in {stopping , hesitation , normal , aggressive }. Source. |
Source. |
The observation model can be factorized. Then, the emission probability depends on both the hidden intention and the physical state. Question: can someone explain the decomposition to me, and what to take for the marginal probability? Source. |
Construction of the transition model: The speed action of the other vehicles depends on both their motion intention and the correlations with the other vehicles. Source. |
In the POMDP , the transition model for the other vehicle works as followed: an acceleration action is computed based on both its motion intention and its correlations with the other vehicles, which means that it wants to avoid a potential collision with any other vehicles. A reference state rc (containing among others a reference speed vref that the observed vehicle should follow) is also derived - more precisely sampled - which enables the computation of an expected action for each observed vehicle. Source. |
A risk-aware motion planning algorithm CC-RRT*-D is proposed to address the internal motion uncertainty. It leverages the RRT* algorithm for space exploring and utilized the chance-constrained approach to evaluate the trajectories' collision risks. In each planning loop, the decision-maker will call the motion planner first with a goal being provided. It then will proceed to decide a proper acceleration command and select which trajectory to commit. The control frequency is set at 1Hz while the planning task time is fixed to 0.1s . Source. |
The action space is extended to coordinate the speed and steering control. While choosing the acceleration, the agent selects either the reference path or a path proposed by the motion planner. Source. |
Author: Wei L.
-
The quote in this PhD thesis:
-
"As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality." - Albert Einstein, 1922
-
-
Motivations / main ideas:
- Propose an interaction-aware and uncertainty-aware decision-making approach for urban scenarios.
- Have a general and scalable method, with a focus on real-time applicability.
- Cope with both `internal motion uncertainty` and `external situation uncertainty`.
- Consider both the `road context` and the obstacle vehicle's `motion intention` in the decision-making.
-
About motion intention estimation:
- Motivation: cope with the lack of an `intention` sensor.
    - "The most popular strategy is to define the `motion intention` as their hidden destinations."
    - "Instead of relying on the hidden goals, this study aims at employing the obstacle vehicle’s reaction to model the `motion intention`."
    - Compared to the hidden-goal method, employing the reactions to model the `motion intention` is more general and scalable: the measurement of deviation is not restricted to any specific region of interest (`ROI`).
- How to model the reaction?
    - Using a deviation measurement of the observed vehicle state from the corresponding reference vehicle state.
    - For that purpose, some `reference state` containing the reference speed `vref` that the observed vehicle should follow is derived.
- What are the intentions for crossing in urban driving?
    - Four intentions are abstracted from human driving behaviours; the evidence is based on the `speed deviation` (a minimal classification sketch follows this list):
        1- `Stopping` intention: the obstacle vehicle's speed is close to zero, while the `reference speed` is much higher.
        2- `Hesitating` intention: the obstacle vehicle's speed is lower than the `reference speed` but not close to zero.
        3- `Normal` intention: the obstacle vehicle's speed matches the `reference speed` well.
        4- `Aggressive` intention: the obstacle vehicle's current speed is much higher than the `reference speed`.
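- The sketch below only makes the four categories concrete with hard thresholds; note that the thesis actually maintains a belief over the intentions rather than a hard label, and the thresholds here are hypothetical:

```python
# Illustrative hard-threshold version of the four intention categories.
# eps, low and high are hypothetical tuning constants, not from the thesis.
def classify_intention(v_obs, v_ref, eps=0.5, low=0.8, high=1.2):
    """Map the deviation of the observed speed from the reference speed
    to one of the four abstracted motion intentions."""
    if v_obs < eps and v_ref > 2 * eps:
        return "stopping"     # near zero while the reference speed is much higher
    if v_obs < low * v_ref:
        return "hesitation"   # clearly below the reference, but not stopped
    if v_obs <= high * v_ref:
        return "normal"       # matches the reference speed well
    return "aggressive"       # much higher than the reference speed

assert classify_intention(v_obs=0.2, v_ref=8.0) == "stopping"
assert classify_intention(v_obs=9.9, v_ref=8.0) == "aggressive"
```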
-
About the `POMDP` formulation:
- The task is not episodic but continuing: there is no termination `state` such as "exit of intersection" in other approaches.
    - That makes this method very general, addressing `junctions`, `leader-following` and `roundabout` scenarios with a single `POMDP` formulation.
    - What is the discount rate?
- `state`:
    - It contains:
        1- Physical (or metric) information: `pose` and `speed` (`x`, `y`, `θ`, `v`), which is assumed to be fully observed.
        2- Intention information: `i`, which is non-observable. Obviously this is not applicable to the ego-vehicle.
    - The author explains that the `reference state` (containing the `reference speed`) should not be part of the `state`:
        - "The road context can be employed as a reference knowledge, and is consequently excluded from the vehicle state".
- `action`:
    - The action space is about longitudinal control and is made discrete: `accelerate`, `decelerate` or `maintain current speed`.
        - Values are set to `±0.25 m/s²` in simulation and `±0.5 m/s²` in the real-world experiment.
    - "The `steering` control of the ego vehicle, on the other hand, is accomplished by tracking the reference routes closely."
    - The author proposes a modification to include the lateral control as well, e.g. for `overtaking` or `lane-changing` manoeuvres, together with a strategy to properly coordinate the `speed` and `steering` control (see the sketch after this block).
        - "An improperly extended action space can break down the real-time applicability of our situation-aware decision making algorithm. As an alternative, the trajectory selection strategy is employed in this study".
        - The `steering action` space includes two trajectory candidates to be selected: [`Commit`, `NewPlan`]; the agent decides whether or not the trajectory planned by the motion planner should be committed in each planning loop.
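- A sketch of what this extended action space could look like in one planning loop. The joint product space and the planner interface are my guesses at the mechanism described above, not the thesis implementation:

```python
from itertools import product

# Extended discrete action space: longitudinal command x trajectory selection.
SPEED_ACTIONS = ["accelerate", "maintain current speed", "decelerate"]
STEERING_ACTIONS = ["Commit", "NewPlan"]  # keep committed trajectory vs adopt new plan

EXTENDED_ACTIONS = list(product(SPEED_ACTIONS, STEERING_ACTIONS))  # 6 joint actions

def select_trajectory(steering_action, committed, proposed):
    """Coordinate steering with the motion planner's proposal in a planning loop."""
    return proposed if steering_action == "NewPlan" else committed

# Example: the agent decelerates while keeping the already-committed path.
speed_a, steer_a = "decelerate", "Commit"
trajectory = select_trajectory(steer_a, committed="ref_path", proposed="planner_path")
print(speed_a, trajectory)  # -> decelerate ref_path
```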
- `transition model`:
    - The behaviour of other vehicles is driven by both the road context and their motion intentions.
    - An `action` about the acceleration is derived for each vehicle (a minimal sampling sketch follows below).
        - It depends on both the motion intention and the correlations with the other vehicles, so as to avoid a potential collision with any other vehicle:
            1- If the collision risk with other vehicles is low, the `action` is mapped from the motion intention only, as `P`(`a` | `intention`).
            2- Otherwise, the obstacle vehicle either follows the inferred motion intention or chooses the `decelerate` action: the higher the speed of the nearest vehicle, the higher the probability of deciding to decelerate.
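- A minimal sketch of this two-case sampling. The distribution `P_A_GIVEN_INTENTION`, the risk threshold and the braking-probability shaping are hypothetical placeholders:

```python
import random

ACTIONS = ["accelerate", "maintain", "decelerate"]

# Hypothetical conditional distribution P(a | intention); each row sums to 1.
P_A_GIVEN_INTENTION = {
    "stopping":   [0.05, 0.15, 0.80],
    "hesitation": [0.10, 0.40, 0.50],
    "normal":     [0.20, 0.60, 0.20],
    "aggressive": [0.60, 0.30, 0.10],
}

def sample_obstacle_action(intention, collision_risk, v_nearest, v_max=15.0):
    """Sample an acceleration action for an observed (obstacle) vehicle."""
    if collision_risk < 0.1:  # case 1: low risk, the intention alone decides
        return random.choices(ACTIONS, weights=P_A_GIVEN_INTENTION[intention])[0]
    # Case 2: the faster the nearest vehicle, the more likely a braking override.
    if random.random() < min(1.0, v_nearest / v_max):
        return "decelerate"
    return random.choices(ACTIONS, weights=P_A_GIVEN_INTENTION[intention])[0]

print(sample_obstacle_action("normal", collision_risk=0.4, v_nearest=12.0))
```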
- `observation model`:
    - "Obstacle vehicles' observation functions are more complicated due to the inclusion of their motion intentions".
    - "Given the `reference speed` `vref(zi)`, the observation function for each `motion intention` `ι ∈ I` is modeled as a Gaussian with mean `m(vref(zi), ι)` and covariance `vref(zi)/σ`." (A sketch of this emission model and the Bayes update it enables follows below.)
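- A sketch of the intention-conditioned Gaussian emission and the resulting belief correction step. The mean function `m()` and `sigma` value are my assumptions, and the prediction step with the transition model is omitted:

```python
import math

INTENTIONS = ["stopping", "hesitation", "normal", "aggressive"]

def m(v_ref, intention):
    """Hypothetical intention-dependent mean of the observed speed."""
    scale = {"stopping": 0.0, "hesitation": 0.6, "normal": 1.0, "aggressive": 1.4}
    return scale[intention] * v_ref

def emission(v_obs, v_ref, intention, sigma=5.0):
    """Gaussian p(v_obs | intention), with variance growing with v_ref."""
    var = v_ref / sigma
    return math.exp(-(v_obs - m(v_ref, intention)) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def correct_belief(belief, v_obs, v_ref):
    """Bayes correction over the hidden intention: b'(i) ~ p(z | i) * b(i)."""
    posterior = {i: emission(v_obs, v_ref, i) * belief[i] for i in INTENTIONS}
    norm = sum(posterior.values())
    return {i: p / norm for i, p in posterior.items()}

belief = {i: 0.25 for i in INTENTIONS}                 # uniform prior
belief = correct_belief(belief, v_obs=1.0, v_ref=8.0)  # slow car, fast reference
print(max(belief, key=belief.get))                     # -> "stopping"
```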
- `reward function`:
    - The reward `R`(`a`, `s`) obtained by executing action `a` in state `s` balances the driving efficiency and safety.
    - What is the weighting between the different terms?
-
Another topic: learning the road context to perform vehicle behaviour analysis.
- The inference over the road context is divided into three stages:
1- The topology learning to model the road network.
2- The motion learning to generalize the typical vehicle motion pattern.
3- The rule learning for traffic rules reasoning.
-
"Given the road context and the vehicle behavior analysis, a situation-aware decision making algorithm was introduced to address the situation uncertainty."
-
The corresponding paper: "Situation-aware decision making for autonomous driving on urban road using online POMDP" - (Liu, Kim, Pendleton, & Ang, 2015).
- It emphasises the need for a proper environment representation, integrating both the road context and the obstacle vehicle's motion intention, leading to the definition of the `Urban Road Situation`:
    - "The road context should comprise not only the road network and traffic rules, but also the typical vehicle motion patterns."
-
- It also explains the intention estimation based on motion rather than target destinations (more general and scalable, because the measurement of deviation is not restricted to any specific `ROI`):
    - "The motion intention is inferred from the vehicle reactions, i.e., the deviation of the observed vehicle states to the reference vehicle behaviors [here the `reference speed` and transition probabilities between regions] that are represented by the road context."
    - "The urban road situation (`URS`) features not only the motion intention and the road context, but also the implicit correlations between them. [...] The motion intention is abstracted as the conditional distribution over the reactions, where the implicit correlations between the road context and motion intention are acknowledged."
-
- It also details the two most complicated components of the `POMDP` - the `transition` and `observation` models - and explains how to learn reference behaviours from data.
- Finally, it compares the approach to classic rule-based policies:
    - The `ROS`-based `Stage` simulator is employed for evaluation.
    - [About the frozen robot problem] "While the reactive approach seems able to maintain a nice success rate, the efficiency is sacrificed. For most of the trials using the reactive approach, the ego vehicle always passively decided to wait, even when the gap between itself and the obstacle vehicles is large enough for safe merging."
"Probabilistic Decision-Making under Uncertainty for Autonomous Driving using Continuous POMDPs"
-
[ `2014` ] [📝] [ 🎓 `KIT` ] - [ `state representation`, `occlusion`, `offline solver` ]
Click to expand
A physical transition model is defined for all vehicles: `p(x' | x, a)`. The ego-action is known. For the other road users, the action is inferred/estimated, using the situational context `Cj`, the map knowledge and an estimated planned route `Pj` (derived from the latter two). This `p(aj | cj, pj)` inference/estimation is motivated by the fact that other drivers usually follow the course of the road and interact with each other and with the ego car. The authors introduce a context-dependent term for the ego-action too: `p(ae | a, ce)`. It should account for interactions with other road users, such as "hold the velocity of the car in front". Should not the policy be responsible for that? Source.
Top: intersection scenario, considering noise in the measurements as well as occlusion. Bottom: instead of an a priori (equidistant) state discretization, the state representation is learnt offline, while performing value iteration. It offers a more appropriate description of the current scene, deducing what is relevant to the decision-making and what is not. For instance, states where the other car is far away can be aggregated, while finer distinctions are required when driving close. Source.
Authors: Brechtel, S., Gindele, T., & Dillmann, R.
-
Motivations:
- Adopt an appropriate state representation to solve a `POMDP`-formulated intersection problem.
    - Equidistant discretization, or any a priori discretization of the `state` space, is to be avoided.
    - Instead, the problem should be formulated and solved directly in its natural continuous space.
    - More precisely, the idea is to learn a discrete representation of the continuous `state` space in order to solve the integrals in continuous POMDPs.
-
`POMDP` formulation:
- The ego-car interacts with `2` other cars at an intersection.
    - Each vehicle is described by `4` continuous variables: its `speed` and global `pose` (`x`, `y`, `ψ`).
    - This leads to a continuous `state` of size `12`.
- Occlusions are detected by checking if there is a direct line of sight to the road user that does not intersect with other objects, which are represented as polygons (see the sketch after this block).
- To be honest, I did not understand where the `occlusion` random variable is used/considered. The authors mention a probability of being occluded. But where/how is that tracked by the belief tracker?
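- A sketch of the line-of-sight test described above, assuming obstacles are given as vertex lists and using a standard segment-segment intersection test (the geometry helpers are mine, not the paper's):

```python
# Occlusion check: a road user is occluded if the sight line from the ego
# vehicle crosses any edge of any obstacle polygon.
def ccw(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_intersect(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2."""
    return (ccw(p1, p2, q1) * ccw(p1, p2, q2) < 0 and
            ccw(q1, q2, p1) * ccw(q1, q2, p2) < 0)

def is_occluded(ego, road_user, polygons):
    """Check the direct line of sight against every polygon edge."""
    for poly in polygons:
        edges = zip(poly, poly[1:] + poly[:1])  # close the polygon
        if any(segments_intersect(ego, road_user, a, b) for a, b in edges):
            return True
    return False

truck = [(2.0, -1.0), (4.0, -1.0), (4.0, 1.0), (2.0, 1.0)]  # axis-aligned box
print(is_occluded((0.0, 0.0), (6.0, 0.0), [truck]))  # True: the truck blocks the view
print(is_occluded((0.0, 0.0), (0.0, 5.0), [truck]))  # False: clear line of sight
```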
-
Learning the state representation:
-
[Idea is to] "Apply a learning algorithm during value iteration in order to only distinguish states if the discrimination is necessary to represent the optimal decision."
-
"The
value function
serves as a mathematically sound basis to deduce what is relevant to the decision-making and what is not. [In discretePOMDP
s,state
s with same value can be aggregated without influencing thepolicy
.]" - The problem-specific compression is refined during the solving process:
- The
POMDP
solver learned a decision tree with442
leaves. 4 million
continuous state samples, collected from1000
simulations, are used to learn the state representation, i.e. the assignment of an index to a state vector (12
continuous values).
- The
- Plotting the histogram of "situations per state index" shows that:
    - Particles that do not need to be differentiated can be aggregated (see the sketch after this list).
- Refinements are performed for states that model critical situations, where accuracy in the information is important.
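- A toy sketch of this value-based aggregation in the discrete case. The tolerance and the (value, greedy action) key are my simplifications; the paper instead grows a decision tree over continuous state samples during value iteration:

```python
# Toy sketch: states whose value estimates (and greedy actions) are
# indistinguishable share one representation index, so only the distinctions
# that matter for the decision are kept.
def aggregate_states(values, greedy_actions, tol=1e-2):
    clusters = []     # list of (representative value, greedy action)
    assignment = {}
    for state, value in values.items():
        action = greedy_actions[state]
        for idx, (v_rep, a_rep) in enumerate(clusters):
            if action == a_rep and abs(value - v_rep) <= tol:
                assignment[state] = idx   # aggregated: same index
                break
        else:                             # no matching cluster: refine
            clusters.append((value, action))
            assignment[state] = len(clusters) - 1
    return assignment

values = {"other car far": 1.000, "other car very far": 1.005, "other car close": 0.2}
actions = {"other car far": "go", "other car very far": "go", "other car close": "stop"}
print(aggregate_states(values, actions))
# -> {'other car far': 0, 'other car very far': 0, 'other car close': 1}
```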
-
-
Some previous work by the authors:
- "Solving Continuous POMDPs: Value Iteration with Incremental Learning of an Efficient Space Representation" - (Brechtel, Gindele, & Dillmann, 2013).
- "Probabilistic MDP-behavior planning for cars" - (Brechtel, Gindele, & Dillmann, 2011).
-
One quote I like:
-
"The
process
[transition
] andobservation
models and thereward
function can be seen as the glue between the spaces, as they define the meaning of the spaces for the decision process."
-