- Questions when developing a predictor:
  - Which application?
    - Who? `pedestrians` or `vehicles`?
    - Where? `highway` or `urban`?
  - Scenarios: generic or specific?
    - E.g. address all types of intersections, or only one specific `T`-intersection?
  - Which prediction horizon?
    - Usually `physics`-based for short term and `goal`-based for longer horizons.
  - Frequency of `observation`s?
    - Maybe need to consider a sequence of `observation`s.
  - How much robustness is needed?
    - Ability to generalize / transfer well in new environments?
  - Uncertainty: how to model multi-modal futures distributions?
    - Uncertainty comes from the stochastic nature of human motion and the non-perfect perception.
    - Be careful with mode-collapse.
  - Context-aware: which information to consider?
    - About the `target agent`, the `static` and the `dynamic` environment.
    - Consider/ignore other agents?
      - How to model interactions?
    - Consider/ignore `map` / `road` structure?
      - Is a `map` available?
  - How many obstacles?
    - How to represent them?
    - Can it scale?
  - How much expert / prior knowledge to inject? What can be learnt?
    - E.g. the dynamics of moving agents.
    - Modern techniques make extensive use of `ML` in order to better estimate context-dependent patterns in real data.
    - Which datasets are available?
  - How constrained is the computational cost?
  - How to obtain a trajectory, i.e. a sequence of positions?
    - One-shot feed-forward: directly predicts a time sequence.
    - Roll-outs: repeat one-step prediction with the forward propagation of the model.
      - But this requires a `transition` function.
  - What `metrics` and test set for evaluation?
    - If possible comparable with other works.
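To make the "roll-outs" option from the list above concrete, here is a minimal sketch. It assumes a constant-velocity `transition` function and a state layout `(x, y, vx, vy)`; both are illustrative choices, not taken from any of the papers below.

```python
def constant_velocity_step(state, dt=0.1):
    """Hypothetical transition function: move the agent at constant velocity."""
    x, y, vx, vy = state
    return (x + vx * dt, y + vy * dt, vx, vy)


def rollout(state, transition, horizon):
    """Build a trajectory by repeating one-step prediction `horizon` times."""
    positions = []
    for _ in range(horizon):
        state = transition(state)      # forward propagation of the model
        positions.append(state[:2])    # keep only the (x, y) position
    return positions


# Three steps of 0.5 s for an agent moving at 2 m/s along x.
trajectory = rollout((0.0, 0.0, 2.0, 0.0),
                     lambda s: constant_velocity_step(s, dt=0.5),
                     horizon=3)
```

A one-shot predictor would instead return the whole list of positions from a single model call, without needing `transition` at all.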
"TNT: Target-driveN Trajectory Prediction"
- [ `2020` ] [📝] [🚗 `Waymo` ]
- [ `multimodal`, `VectorNet`, `goal-based prediction` ]
TNT = target-driven trajectory prediction. The intuition is that the uncertainty of future states can be decomposed into two parts: the target or intent uncertainty, such as the decision between turning left and right; and the control uncertainty, such as the fine-grained motion required to perform a turn. Accordingly, the probabilistic distribution is decomposed by conditioning on targets and then marginalizing over them. Source.
Authors: Zhao, H., Gao, J., Lan, T., Sun, C., Sapp, B., Varadarajan, B., Shen, Y., Shen, Y., Chai, Y., Schmid, C., Li, C., & Anguelov, D.
- One sentence:
  - "Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target `states`."
- Motivations:
  - `1-` Model multimodal futures distributions.
  - `2-` Be able to incorporate expert knowledge.
  - `3-` Do not rely on run-time sampling to estimate trajectory distributions.
- How to model multimodal futures distributions?
  - `1-` Future `modes` are implicitly modelled as `latent` variables, which should capture the underlying `intents` of the agents.
    - Diverse trajectories can be generated by sampling from these implicit distributions.
    - E.g. `CVAE` in `DESIRE`, `GAN` in `SocialGAN`, single-step policy roll-out methods: they are prone to mode collapse too.
    - Issue: using `latent` variables to model intents makes them hard to interpret, which in turn makes it challenging to incorporate expert knowledge.
    - Issue: it often requires test-time sampling [stochastic sampling from the `latent` space] to evaluate probabilistic queries (e.g., “how likely is the agent to turn left?”) and obtain implicit distributions.
    - "Modeling the future as a discrete set of targets does not suffer from mode averaging, which is the major factor that hampers multimodal predictions."
  - `2-` Decompose the trajectory prediction task into subtasks.
    - For instance with `planning`-based prediction:
      - First estimate a Bayesian posterior distribution of destinations.
      - Then use `IRL` to plan the trajectories.
    - Or by decomposing `goal` distribution estimation and `goal`-directed `planning`.
  - `3-` Discretize the output space as intents or with anchors.
    - `IntentNet` [`UBER`]: several common `motion` categories are manually defined (e.g. `left turn` and `lane changes`) and a separate motion predictor is learnt for each `intent`.
    - "`MultiPath` [`Waymo`] and `CoverNet` [`nuTonomy`] chose to quantize the trajectories into `anchors`, where the trajectory prediction task is reformulated into `anchor` selection and offset regression."
    - "Unlike `anchor` trajectories, the targets in `TNT` are much lower dimensional and can be easily discretized via uniform sampling or based on expert knowledge (e.g. HD maps). Hence, they can be estimated more reliably."
- `TNT` = target-driven trajectory prediction.
  - The framework has `3` stages that are trained end-to-end:
    - `1-` `Target` prediction.
      - Estimate a distribution over candidate `targets`, `T` steps into the future, given the encoded `scene context`.
      - The potential future `targets` are modelled via a set of `N` discrete, quantized locations with continuous offsets.
        - "We can see that with regression the performance improved by `0.16m`, which shows the necessity of position refinement from the original target coordinates."
    - `2-` `Target`-conditioned motion estimation.
      - Predict trajectory `state` sequences conditioned on `targets`.
      - Two assumptions:
        - `1-` Future time steps are conditionally independent. Sequential predictions are therefore avoided.
        - `2-` The distribution of the trajectories is unimodal (`normal`) given the target.
    - `3-` Scoring and selection.
      - Estimate the likelihood of each predicted trajectory, taking into account the `context` of all other predicted trajectories.
        - "Our final stage estimates the `likelihood` of full future trajectories `sF`. This differs from the second stage, which decomposes over time steps and `targets`, and from the first stage which only has knowledge of `targets`, but not full trajectories - e.g., a `target` might be estimated to have high `likelihood`, but a full trajectory to reach that target might not."
      - Select a final compact set of trajectory predictions.
        - "This process is inspired by the non-maximum suppression algorithm commonly used for computer vision problems, such as object detection."
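The selection step of stage `3-` can be pictured with a greedy, NMS-like filter. The sketch below is only an illustration of that idea: the distance metric (largest point-wise gap), the `threshold` and the function names are assumptions, not the authors' implementation.

```python
import math


def max_pointwise_distance(traj_a, traj_b):
    """Largest Euclidean gap between time-aligned points of two trajectories."""
    return max(math.dist(p, q) for p, q in zip(traj_a, traj_b))


def select_trajectories(trajectories, scores, k, threshold):
    """Greedily keep the best-scored trajectories, skipping near-duplicates."""
    order = sorted(range(len(trajectories)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Reject a candidate that is too close to an already selected one.
        if all(max_pointwise_distance(trajectories[i], trajectories[j]) > threshold
               for j in kept):
            kept.append(i)
        if len(kept) == k:
            break
    return kept


candidates = [
    [(0.0, 0.0), (1.0, 0.0)],   # go straight
    [(0.0, 0.0), (1.0, 0.1)],   # almost the same as the first one
    [(0.0, 0.0), (0.0, 1.0)],   # turn
]
selected = select_trajectories(candidates, scores=[0.5, 0.4, 0.1], k=2, threshold=0.5)
```

Here the second candidate is dropped because it nearly duplicates the first, so the final set covers two distinct modes.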
- How is the `context` information encoded, i.e. the ego-car's interactions with the environment and the other agents?
  - When the HD map is available: using the hierarchical graph neural network `VectorNet` [`Waymo`].
    - "Polylines are used to abstract the HD map elements `cP` (lanes, traffic signs) and agent trajectories `sP`; a subgraph network is applied to encode each polyline, which contains a variable number of vectors; then a global graph is used to model the interactions between polylines."
  - "If `scene context` is only available in the form of top-down imagery, a `ConvNet` is used as the context encoder."
- How to generate the `targets`, i.e. design the `target space`?
  - The `target space` is approximated by a set of discrete locations. Expert knowledge (such as road topology) can be incorporated there, for instance by sampling `targets` on and around the lanes.
    - "These `targets` are not only grounded in physical entities that are interpretable (e.g. `location`), but also correlate well with `intent` (e.g. a `lane change` or a `right turn`)."
    - `1-` For vehicles: points are uniformly sampled on lane centerlines from the HD map and used as `target` candidates, with the assumption that vehicles never depart far away from lanes.
    - `2-` For pedestrians: a virtual grid is generated around the agent and the grid points are used as target candidates.
      - Grid of range `20m` x `20m` with a grid size of `0.5m` -> `1600` targets are considered and only the best `50` are kept for further processing.
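The pedestrian numbers above can be checked with a small sketch: a `20m` x `20m` area with `0.5m` cells gives `40 x 40 = 1600` candidates. Placing the candidates at cell centres and ranking them with a toy `score` function are assumptions made for illustration; in `TNT` the ranking comes from the learned target predictor.

```python
import math


def grid_targets(center, half_range=10.0, step=0.5):
    """Candidate targets at the cell centres of a square grid around the agent."""
    n = int(round(2 * half_range / step))   # 40 cells per side for 20 m / 0.5 m
    cx, cy = center
    return [(cx - half_range + (i + 0.5) * step,
             cy - half_range + (j + 0.5) * step)
            for i in range(n) for j in range(n)]


def keep_best(targets, score, k=50):
    """Keep the k candidates with the highest score."""
    return sorted(targets, key=score, reverse=True)[:k]


targets = grid_targets(center=(0.0, 0.0))

# Hypothetical score: prefer candidates close to an assumed goal location.
best = keep_best(targets, score=lambda t: -math.dist(t, (3.0, 1.0)))
```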
- What is the ground truth?
  - Not very clear to me how a single demonstration can be used as an oracle: today you `turn left`, but yesterday, with the same `context`, you `turned right`.
  - "The `ground truth score` of each predicted trajectory is defined by its distance to ground truth trajectory `ψ`(`sF`)."
- Results:
  - `TNT` outperforms the state-of-the-art on prediction of:
    - `1-` Vehicles: `Argoverse` Forecasting and `INTERACTION`.
    - `2-` Pedestrians: `Stanford Drone` and an in-house Pedestrian-at-Intersection dataset (`PAID`).
  - For benchmarking, `MultiPath` and `DESIRE` are reimplemented by replacing their `ConvNet` context encoders with `VectorNet`.
"Modeling and Prediction of Human Driver Behavior: A Survey"
- [ `2020` ] [📝] [🎓 `Stanford`, `University of Illinois` ] [🚗 `Qualcomm` ]
- [ `state estimation`, `intention estimation`, `trait estimation`, `motion prediction` ]
Terminology: the problem is formulated as a discrete-time multi-agent partially observable stochastic game (`POSG`). In particular, the internal state can contain the agent’s navigational goals or behavioural traits. Source.
Authors: Brown, K., Driggs-Campbell, K., & Kochenderfer, M. J.
- Motivation:
  - A review and taxonomy of `200` models from the literature on driver behaviour modelling.
    - "In the context of the partially observable stochastic game (`POSG`) formulation, a `driver behavior model` is a collection of assumptions about the `human observation` function `G`, `internal-state update` function `H` and `policy` function `π` (the `state-transition` function `F` also plays an important role in driver-modeling applications, though it has more to do with the vehicle than the driver)."
    - The "References" section is large and good!
  - Models are categorized based on the tasks they aim to address.
    - `1-` `state` estimation.
    - `2-` `intention` estimation.
    - `3-` `trait` estimation.
    - `4-` `motion` prediction.
- The following lists are non-exhaustive (see the tables for full details); they aim to give an overview of the most represented instances:
- `1-` (Physical) `state` estimation.
  - [Algorithm]: approximate recursive Bayesian filters. E.g. `KF`, `PF`, `moving average filter`.
  - "Some advanced `state` estimation models take advantage of the structure inherent in the driving environment to improve filtering accuracy. E.g. `DBN`".
- `2-` (Internal states) `intention` estimation.
  - "`Intention` estimation usually involves computing a probability distribution over a finite set of possible behavior modes - often corresponding to navigational goals (e.g., `change lanes`, `overtake`) - that a driver might execute in the current situation."
  - [Architecture]: `[D]BN`, `SVM`, `HMM`, `LSTM`.
  - [Scope]: `highway`, `intersection`, `lane-changing`, `urban`.
  - [Evaluation]: `accuracy` (classification), `ROC` curve, `F1`, `false positive rate`.
  - [Intention Space] (set of possible behaviour modes that may exist in a driver’s `internal state` - often combined): `lateral` modes (e.g. `lane-change` / `lane-keeping` intentions), `routes` (a sequence of decisions, e.g. `turn right → go straight → turn right again` that a driver may intend to execute), `longitudinal` modes (e.g. `car-following` / `cruising`),
    - joint `configurations`.
    - "`Configuration` intentions are defined in terms of spatial relationships to other vehicles. For example, `intention estimation` for a `merging` scenario might involve reasoning about which gap between vehicles the target car intends to enter. The `intention space` of a car in the other lane might be whether or not to yield and allow the merging vehicle to enter."
  - [Hypothesis Representation] (how to represent uncertainty in the intention hypothesis?): `discrete probability distribution` over possible `intentions`.
    - "In contrast, `point estimate` hypothesis ignores uncertainty and simply assigns a probability of `1` to a single (presumably the most likely) behavior mode."
  - [Estimation / Inference Paradigm]: `single-shot`, `recursive`, `Bayesian` (based on probabilistic graphical models), `black-box`, `game theory`.
    - "`Recursive` estimation algorithms operate by repeatedly updating the intention hypothesis at each time step based on the new information received. In contrast, `single-shot` estimators compute a new hypothesis from scratch at each inference step. The latter may operate over a history of observations, but it does not store any information between successive inference iterations."
    - "Game-theoretic models are distinguished by being `interaction-aware`. They explicitly consider possible situational outcomes in order to compute or refine an intention hypothesis. This `interaction-awareness` can be as simple as pruning intentions with a high probability of conflicting with other drivers, or it can mean computing the Nash equilibrium of an explicitly formulated game with a payoff matrix."
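As an illustration of the `recursive` paradigm with a `discrete probability distribution` hypothesis, here is a minimal Bayesian update over two intentions. The intention names, the likelihood values and the assumption that the intention does not change between steps (no transition model) are all made up for the example.

```python
def recursive_update(belief, likelihoods):
    """One recursive Bayesian step: posterior is proportional to likelihood x prior."""
    unnormalized = {mode: belief[mode] * likelihoods[mode] for mode in belief}
    total = sum(unnormalized.values())
    return {mode: p / total for mode, p in unnormalized.items()}


# Uniform prior over a hypothetical intention space.
belief = {"lane-keeping": 0.5, "lane-change": 0.5}

# Hypothetical observation likelihoods p(observation | intention), one per time step.
observations = [
    {"lane-keeping": 0.4, "lane-change": 0.6},
    {"lane-keeping": 0.2, "lane-change": 0.8},
]
for likelihoods in observations:
    belief = recursive_update(belief, likelihoods)   # only the belief is stored

# A point-estimate hypothesis would keep just the most likely mode.
point_estimate = max(belief, key=belief.get)
```

A `single-shot` estimator would instead recompute the distribution from the whole observation window at every step, keeping nothing between calls.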
- `3-` `trait` estimation.
  - "Whereas `intention` estimation reasons about what a driver is trying to do, `trait estimation` reasons about factors that affect how the driver will do it. Broadly speaking, traits encompass `skills`, `preferences`, and `style`, as well as properties like `fatigue`, `distractedness`, etc."
  - "`Trait` estimation may be interpreted as the process of inferring the “parameters” of the driver’s `policy function π` on the basis of observed driving behavior. [...] `Traits` can also be interpreted as part of the driver’s internal state."
  - [Architecture]: `IDM`, `MOBIL`, `reward` parameters.
  - [Training]: `IRL`, `EM`, `genetic algorithms`, `heuristic`.
  - [Theory]: Inverse `RL`.
  - [Scope]: `car following` (`IDM`), `highway`, `intersection`, `urban`.
  - [Trait Space]: `policy` parameters, `reward` parameters (assuming that drivers are trying to optimize a `cost` function).
    - "Some of the most widely known driver models are simple parametric controllers with tuneable “style” or “preference” `policy` parameters that represent intuitive behavioral `traits` of drivers. E.g. `IDM`."
    - `IDM` traits: `minimum desired gap`, `desired time headway`, `maximum feasible acceleration`, `preferred deceleration`, `maximum desired speed`.
    - "`Reward` function parameters often correspond to the same intuitive notions mentioned above (e.g., `preferred velocity`), the important difference being that they parametrize a `reward` function rather than a closed-loop control `policy`."
  - [Hypothesis Representation] (uncertainty): in almost all cases, the hypothesis is represented by a `point estimate` rather than a `distribution`.
  - [Estimation Paradigm]: `offline` / `online`.
    - "Some models combine the two paradigms by computing a prior distribution `offline`, then tuning it `online`. This tuning procedure often relies on Bayesian methods."
  - [Model Class]: `heuristic`, `optimization`, `Bayesian`, `IRL`, `contextually varying`.
    - "One simple approach to `offline` trait estimation is to set `trait` parameters heuristically. Specifying parameters manually is one way to incorporate expert domain knowledge into models."
    - "In some approaches, `trait` parameters are modeled as contextually varying, meaning that they vary based on the region of the `state` space (the context) or the current behavior mode."
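To show how the `IDM` traits listed above act as `policy` parameters, here is a sketch of the standard Intelligent Driver Model acceleration law. The default values and the acceleration exponent `4` are common textbook choices used here as assumptions; they are not taken from the survey.

```python
import math
from dataclasses import dataclass


@dataclass
class IDMTraits:
    """Driver traits; the default values are illustrative assumptions."""
    min_gap: float = 2.0                    # minimum desired gap [m]
    time_headway: float = 1.5               # desired time headway [s]
    max_acceleration: float = 1.0           # maximum feasible acceleration [m/s^2]
    preferred_deceleration: float = 2.0     # preferred deceleration [m/s^2]
    desired_speed: float = 30.0             # maximum desired speed [m/s]
    exponent: float = 4.0                   # acceleration exponent (assumed)


def idm_acceleration(traits, speed, gap, approach_rate):
    """Longitudinal acceleration of a follower given the gap to its leader.

    `approach_rate` is the follower speed minus the leader speed.
    """
    braking_term = speed * approach_rate / (
        2.0 * math.sqrt(traits.max_acceleration * traits.preferred_deceleration))
    desired_gap = traits.min_gap + max(0.0, speed * traits.time_headway + braking_term)
    free_road = 1.0 - (speed / traits.desired_speed) ** traits.exponent
    return traits.max_acceleration * (free_road - (desired_gap / gap) ** 2)


# A follower at 20 m/s closing in at 5 m/s on a leader only 5 m ahead must brake.
acceleration = idm_acceleration(IDMTraits(), speed=20.0, gap=5.0, approach_rate=5.0)
```

Trait estimation, in this picture, amounts to fitting the fields of `IDMTraits` to observed driving data.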
- `4-` `motion` prediction.
  - "Infer the future physical `states` of the surrounding vehicles".
  - [Architecture]: `IDM`, `LSTM` (and other `RNN` / `NN`), constant acceleration / speed (`CA`, `CV`), `encoder-decoder`, `GMM`, `GP`, `adaptive`, `spline`.
  - [Training]: `heuristic`.
    - "Simple examples include rule-based heuristic control laws like `IDM`. More sophisticated examples include closed-loop policies based on `NN`, `DBN`, and random forests."
  - [Theory]: `RL`, `MPC`, trajectory optimization.
    - "Some `MPC` policy models (including those used within a forward simulation paradigm) fall into the game theoretic category because they explicitly predict the future states of their environment (including other cars) before computing a planned trajectory."
  - [Scope]: `highway`, `car-following` (e.g. using `IDM`), `intersection`, `urban`.
  - [Evaluation]: `RMSE`, `NLL`, `MAE`, `collision rate`.
  - [Vehicle dynamics model]: `linear`, `learned`, `bicycle kinematic`.
    - "Many models in the literature assume linear `state-transition` dynamics. Linear models can be first order (i.e., `output` is `position`, `input` is `velocity`), second order (i.e., `output` is `position`, `input` is `acceleration`), and so forth."
    - "Kinematic models are simpler than dynamic models, but the no-slip assumption can lead to significant modeling errors."
    - "Some `state-transition` models are learned, in the sense that the observed correlation between consecutive predicted `states` results entirely from training on large datasets. Some incorporate an explicit transition model where the parameters are learned, whereas others simply output a full trajectory."
  - [`Scene`-level uncertainty modelling]: `single-scenario` (ignoring multimodal uncertainty at the scene level), `partial scenario`, `multi-scenario` (reason about the different possible scenarios that may follow from an initial traffic scene), `reachable set`.
    - "Some models reason only about a partial scenario, meaning they predict the motion of only a subset of vehicles in the traffic scene, usually under a single scenario."
    - "Some models reason about multimodal uncertainty on the `scene`-level by performing multiple (parallel) rollouts associated with different scenarios."
    - "Rather than reasoning about the likelihood of future `states`, some models reason about reachability. Reachability analysis implies taking a worst-case mindset in terms of predicting vehicle motion."
  - [`Agent`-level uncertainty modelling]: `single deterministic`, `particle set`, `Gaussians`.
  - [Prediction paradigm]:
    - open-loop `independent` trajectory prediction.
      - "Many models operate under the independent prediction paradigm, meaning that they predict a full trajectory independently for each agent in the scene. These approaches are `interaction-unaware` because they are open-loop. Though they may account for interaction between vehicles at the current time `t`, they do not explicitly reason about interaction over the prediction window from `t+1` to the prediction horizon `tf`. [...] Because independent trajectory prediction models ignore interaction, their predictive power tends to quickly degrade as the prediction horizon extends further into the future."
    - closed-loop `forward` simulation.
      - "In the forward simulation paradigm, motion hypotheses are computed by rolling out a closed-loop control policy `π` for each target vehicle."
    - `game theoretic` prediction.
      - "Agents are modeled as looking ahead to consider the possible ramifications of their `actions`. This notion of looking ahead makes game-theoretic prediction models more deeply `interaction-aware` than forward simulation models based on reactive closed-loop control."
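Among the vehicle dynamics models listed under `motion` prediction, the `bicycle kinematic` model is easy to write down. The sketch below uses the common rear-axle formulation with explicit Euler integration; the state layout, the `wheelbase` value and the time step are assumptions chosen for illustration.

```python
import math


def bicycle_step(state, acceleration, steering_angle, wheelbase=2.7, dt=0.1):
    """One Euler step of a kinematic bicycle model (no-slip assumption).

    `state` is (x, y, heading, speed), with the reference point on the rear axle.
    """
    x, y, heading, speed = state
    x += speed * math.cos(heading) * dt
    y += speed * math.sin(heading) * dt
    heading += speed / wheelbase * math.tan(steering_angle) * dt
    speed += acceleration * dt
    return (x, y, heading, speed)


# Drive for one second at 10 m/s with a small constant left steering angle.
state = (0.0, 0.0, 0.0, 10.0)
for _ in range(10):
    state = bicycle_step(state, acceleration=0.0, steering_angle=0.05)
```

Used as the `state-transition` function `F`, such a step function is what a forward-simulation predictor would call inside its roll-out loop.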
"Motion Prediction using Trajectory Sets and Self-Driving Domain Knowledge"
- [ `2020` ] [📝] [🚗 `nuTonomy` ]
- [ `multimodal`, `probabilistic`, `mode collapse`, `domain knowledge`, `classification` ]
Top: The idea of `CoverNet` is to first generate feasible future trajectories, and then classify them. It uses the past states of all road users and an HD map to compute a distribution over a vehicle's possible future states. Bottom-left: the set of trajectories can be reduced by considering the current state and the dynamics: at high speeds, sharp turns are not dynamically feasible for instance. Bottom-right: the contribution here also deals with "feasibility", i.e. tries to reduce the set using domain knowledge. A second loss is introduced to penalize predictions that go off-road. The first loss (cross entropy with closest prediction treated as ground truth) is also adapted: instead of a delta distribution over the closest mode, there is also probability assigned to near misses. Source.
Authors: Boulton, F. A., Grigore, E. C., & Wolff, E. M.
- Related work: `CoverNet`: Multimodal Behavior Prediction using Trajectory Sets, (Phan-Minh, Grigore, Boulton, Beijbom, & Wolff, 2019).
- Motivation:
  - Extend their `CoverNet` by including further "domain knowledge".
    - "Both dynamic constraints and "rules-of-the-road" place strong priors on likely motions."
    - `CoverNet`: Predicted trajectories should be consistent with the current dynamic state.
    - This work: Predicted trajectories should stay on road.
  - The main idea is to leverage the `map` information by adding an auxiliary loss that penalizes off-road predictions.
- Motivations and ideas of `CoverNet`:
  - `1-` Avoid the issue of "mode collapse".
    - The prediction problem is treated as classification over a diverse set of trajectories.
    - The trajectory sets for `CoverNet` are available on the `nuscenes-devkit` GitHub.
  - `2-` Ensure a desired level of coverage of the `state` space.
    - The larger and the more diverse the set, the higher the coverage. One can play with the resolution to ensure coverage guarantees, while pruning of the set improves the efficiency.
  - `3-` Eliminate dynamically infeasible trajectories, i.e. introduce dynamic constraints.
    - Trajectories that are not physically possible are not considered, which limits the set of reachable states and improves the efficiency.
    - "We create a dynamic trajectory set based on the current `state` by integrating forward with our dynamic model over diverse control sequences."
- Two losses:
  - `1-` Moving beyond "standard" `cross-entropy` loss for classification.
    - What is the ground truth trajectory? Obviously, it is not part of the set.
    - One solution: designate the closest one in the set.
      - "We utilize cross-entropy with positive samples determined by the element in the `trajectory set` closest to the actual ground truth in minimum average of point-wise Euclidean distances."
      - Issue: This will penalize the second-closest trajectory just as much as the furthest, since it ignores the geometric structure of the trajectory set.
    - Another idea: use a weighted cross-entropy loss, where the `weight` is a function of distance to the ground truth.
      - "Instead of a `delta` distribution over the closest mode, there is also probability assigned to `near misses`."
      - A threshold defines which trajectories are "close enough" to the ground truth.
    - This weighted loss is adapted to favour mode diversity:
      - "We tried an "Avoid Nearby" weighted cross entropy loss that assigns weight of `1` to the closest match, `0` to all other trajectories within `2` meters of ground truth, and `1/|K|` to the rest. We see that we are able to increase mode diversity and recover the performance of the baseline loss."
      - "Our results indicate that losses that are better able to enforce mode diversity may lead to improved performance."
  - `2-` Add an auxiliary loss for off-road predictions.
    - This helps learn domain knowledge, i.e. partially encode "rules-of-the-road".
    - "This auxiliary loss can easily be pretrained using only `map` information (e.g., `off-road` area), which significantly improves performance on small datasets."
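The labelling used by the first loss can be sketched as follows: pick the element of the trajectory set closest to the ground truth (mean point-wise Euclidean distance), then derive the "Avoid Nearby" weights quoted above. Measuring "within `2` meters" with the same mean distance is an assumption of this sketch, and how the weights enter the loss is left out because the notes do not specify it.

```python
import math


def mean_pointwise_distance(trajectory, ground_truth):
    """Average Euclidean distance between time-aligned points."""
    return sum(math.dist(p, q) for p, q in zip(trajectory, ground_truth)) / len(ground_truth)


def distances_to_ground_truth(trajectory_set, ground_truth):
    return [mean_pointwise_distance(t, ground_truth) for t in trajectory_set]


def avoid_nearby_weights(distances, radius=2.0):
    """Weight 1 for the closest mode, 0 for other modes within `radius`, 1/|K| otherwise."""
    closest = min(range(len(distances)), key=distances.__getitem__)
    weights = []
    for i, d in enumerate(distances):
        if i == closest:
            weights.append(1.0)
        elif d <= radius:
            weights.append(0.0)        # near miss: not penalized
        else:
            weights.append(1.0 / len(distances))
    return weights


trajectory_set = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0)],
    [(0.0, 5.0), (1.0, 5.0)],
]
ground_truth = [(0.0, 0.2), (1.0, 0.2)]
distances = distances_to_ground_truth(trajectory_set, ground_truth)
positive_class = min(range(len(distances)), key=distances.__getitem__)
weights = avoid_nearby_weights(distances)
```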
- Related works for `predictions` (we want multimodal and probabilistic trajectory predictions):
  - Input, i.e. encoding of the scene:
    - "State-of-the-art motion prediction algorithms now typically use `CNNs` to learn appropriate features from a `birds-eye-view` rendering of the scene (map and road users)."
    - Graph neural networks (`GNNs`) look promising to encode interactions.
    - Here: A `BEV` raster `RGB` image (fixed size) containing `map` information and the past `states` of all objects.
      - It is inspired by the work of `UBER`: Multimodal Trajectory Predictions for Autonomous Driving using Deep Convolutional Networks, (Cui et al., 2018).
  - Output, i.e. representing the possible future motions:
    - `1-` Generative models.
      - They encode choice over multiple actions via sampling latent variables.
      - Issue: multiple trajectory samples or `1`-step policy rollouts (e.g. `R2P2`) are required at inference.
      - Examples: Stochastic policies, `CVAE`s and `GAN`s.
    - `2-` Regression.
      - Unimodal: predict a single future trajectory. Issue: unrealistically averages over behaviours, even when predicting Gaussian uncertainty.
      - Multimodal: distribution over multiple trajectories. Issue: suffers from mode collapse.
    - `3-` Classification.
      - "We choose not to learn an uncertainty distribution over the space. The density of our trajectory sets reduces its benefit compared to the case when there are only a handful of modes."
      - How to deal with a varying number of classes to predict? Not clear to me.
- How to solve `mode collapse` in regression?
  - The authors consider `MultiPath` by `Waymo` (detailed also in this page) as their baseline.
  - A set of anchor boxes can be used, much like in object detection:
    - "This model implements ordinal regression by first choosing among a fixed set of anchors (computed a priori) and then regressing to residuals from the chosen anchor. This model predicts a fixed number of trajectories (`modes`) and their associated probabilities."
  - The authors extend `MultiPath` with dynamically-computed anchors, based on the agent's current `speed`.
    - Again, it makes no sense to consider anchors that are not dynamically reachable.
  - They also found that using one order of magnitude more “anchor” trajectories than Waymo (`64`) is beneficial: better coverage of space via anchors, leaving the network to learn smaller residuals.
- Extensions:
  - As pointed out by this `KIT` Master Thesis offer, the current state of `CoverNet` only has a motion model for cars. Predicting bicycles and pedestrians' motions would be a next step.
  - Interactions are currently ignored.
"PnPNet: End-to-End Perception and Prediction with Tracking in the Loop"
- [ `2020` ] [📝] [🎓 `University of Toronto` ] [🚗 `Uber` ]
- [ `joint perception + prediction`, `multi-object tracking` ]
The authors propose to leverage tracking for the joint perception + prediction task. Source.

Top: One main idea is to make the prediction module directly reuse the scene context captured in the perception features, and also consider the past object tracks. Bottom: a second contribution is the use of an `LSTM` as a sequence model to learn the object trajectory representation. This encoding is jointly used for the tracking and prediction tasks. Source.
Authors: Liang, M., Yang, B., Zeng, W., Chen, Y., Hu, R., Casas, S., & Urtasun, R.
- Motivations:
  - `1-` Perform `perception` and `prediction` jointly, with a single neural network.
    - Therefore it is called "Perception and Prediction": `PnP`.
    - The whole model is also said to be `end-to-end`, because it is `end-to-end` trainable.
      - This contrasts with modular sequential architectures where both the perception output and map information are forwarded to an independent `prediction` module, for instance in a bird’s eye view (`BEV`) raster representation.
  - `2-` Improve `prediction` by leveraging the (past) temporal information (motion history) contained in `tracking` results.
    - In particular, one goal is to recover from long-term object occlusion.
      - "While all these [vanilla `PnP`] approaches share the sensor features for `detection` and `prediction`, they fail to exploit the rich information of actors along the time dimension [...]. This may cause problems when dealing with occluded actors and may produce temporal inconsistency in `predictions`."
    - The idea is to include `tracking` in the loop to improve `prediction` (motion forecasting):
      - "While the `detection` module processes sequential sensor data and generates object detections at each time step independently, the `tracking` module associates these estimates across time for better understanding of object states (e.g., occlusion reasoning, trajectory smoothing), which in turn provides richer information for the `prediction` module to produce accurate future trajectories."
      - "Exploiting motion from explicit object trajectories is more accurate than inferring motion from the features computed from the raw sensor data. [this reduces the prediction error by (`∼6%`) in the experiment]"
    - All modules share computation as there is a single backbone network, and the full model can be trained `end-to-end`.
      - "While previous joint `perception` and `prediction` models make the prediction module another convolutional header on top of the `detection` backbone network, which shares the same features with the `detection` header, in `PnPNet` we put the `prediction` module after explicit object `tracking`, with the object trajectory representation as input."
- How to represent (long-term) trajectories?
  - The idea is to capture both `sensor` observation and `motion` information of actors.
  - "For each object we first extract its inferred `motion` (from past detection estimates) and raw observations (from `sensor` features) at each time step, and then model its dynamics using a recurrent network."
  - [interesting choice] "For angular velocity of ego car we parameterize it as its `cosine` and `sine` values."
  - This trajectory representation is utilized in both `tracking` and `prediction` modules.
- About multi-object `tracking` (`MOT`):
  - There exist two distinct challenges:
    - `1-` The discrete problem of `data association` between previous tracks and current detections.
      - Association errors (i.e., `identity switches`) are prone to accumulate through time.
      - "The `association problem` is formulated as a `bipartite` matching problem so that exclusive `track-to-detection` correspondence is guaranteed. [...] Solved with the `Hungarian algorithm`."
      - "Many frameworks have been proposed to solve the `data association problem`: e.g., Markov Decision Processes (`MDP`), `min-cost flow`, `linear assignment` problem and `graph cut`."
    - `2-` The continuous problem of `trajectory estimation`.
      - In the proposed approach, the `LSTM` representations of associated new tracks are refined to generate smoother trajectories:
        - "For `trajectory refinement`, since it reduces the localization error of online generated perception results, it helps establish a smoother and more accurate motion history."
    - The proposed multi-object tracker solves both problems, therefore it is said to be "`discrete`-`continuous`".
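The `bipartite` matching view of `data association` can be illustrated with a tiny brute-force solver. This is a stand-in, not the `Hungarian algorithm` itself: it enumerates all exclusive assignments, which is only viable for a handful of objects. Using Euclidean distance as the cost is also an assumption; `PnPNet` relies on learned affinity scores.

```python
import math
from itertools import permutations


def associate(tracks, detections):
    """Exclusive track-to-detection matching with minimal total distance.

    Brute force over all assignments; assumes len(detections) >= len(tracks).
    Returns the detection index matched to each track, and the total cost.
    """
    best_cost, best_assignment = math.inf, None
    for assignment in permutations(range(len(detections)), len(tracks)):
        cost = sum(math.dist(tracks[i], detections[j])
                   for i, j in enumerate(assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return list(best_assignment), best_cost


# Last known track positions and new detections (one detection is a new object).
tracks = [(0.0, 0.0), (5.0, 5.0)]
detections = [(5.2, 4.9), (0.1, -0.1), (9.0, 9.0)]
matches, total_cost = associate(tracks, detections)
```

Because each detection index appears at most once in an assignment, exclusive correspondence is guaranteed by construction; unmatched detections (here the third one) would start new tracks.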
"VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation"
Both map features and our sensor input can be simplified into either a point, a polygon, or a curve, which can be approximately represented as polylines, and eventually further split into vector fragments. The set of such vectors forms a simplified abstracted world used to make predictions with less computation than rasterized images encoded with `ConvNets`. Source.

A vectorized representation of the scene is preferred to the combination (rasterized rendering + `ConvNet` encoding). A global interaction graph can be built from these vectorized elements, to model the higher-order relationships between entities. To further improve the prediction performance, a supervision auxiliary task is introduced. Source.
Authors: Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., & Schmid, C.
- Motivations:
  - `1-` Reduce computation cost while offering good prediction performance.
  - `2-` Capture long range context information, for longer horizon prediction.
    - `ConvNets` are commonly used to encode the scene context, but they have limited receptive field.
    - And increasing `kernel size` and `input image size` is not so easy:
      - FLOPs of `ConvNets` increase quadratically with the `kernel size` and `input image size`.
      - The number of parameters increases quadratically with the `kernel size`.
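The quadratic scaling mentioned above follows directly from the cost of one convolutional layer. The sketch counts multiply-accumulate operations under simplifying assumptions (stride `1`, "same" padding, square kernel and image), so "image size" here means the side length in pixels.

```python
def conv_macs(height, width, in_channels, out_channels, kernel_size):
    """Multiply-accumulates of one conv layer (stride 1, 'same' padding assumed)."""
    return height * width * in_channels * out_channels * kernel_size ** 2


def conv_parameters(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size ** 2 + out_channels


base = conv_macs(64, 64, 16, 32, kernel_size=3)
bigger_kernel = conv_macs(64, 64, 16, 32, kernel_size=6)    # 4x the base cost
bigger_image = conv_macs(128, 128, 16, 32, kernel_size=3)   # 4x the base cost
```

Doubling either the kernel side or the image side multiplies the compute by four, which is why enlarging the receptive field of a raster encoder gets expensive quickly.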
- Four ingredients:
  - `1-` A vectorized representation is preferred to the combination (rasterized rendering + `ConvNet` encoding).
  - `2-` A graph network, to model interactions.
  - `3-` Hierarchy, to first encode `map` and `sensor` information, and then learn interactions.
  - `4-` A supervision auxiliary task, in parallel to the prediction task.
- Two kinds of input:
  - `1-` `HD map` information: Structured road context information such as `lane boundaries`, `stop/yield signs`, `crosswalks` and `speed bumps`.
  - `2-` `Sensor` information: Agent trajectories.
- How to encode the scene context information?
  - `1-` Rasterized representation.
    - `Rendering`: in a bird-eye image, with colour-coded attributes. Issue: colouring requires manual specifications.
    - `Encoding`: encode the scene context information with `ConvNets`. Issue: receptive field may be limited.
    - "The most popular way to incorporate highly detailed maps into behavior prediction models is by rendering the map into pixels and encoding the scene information, such as `traffic signs`, `lanes`, and `road boundaries`, with a convolutional neural network (`CNN`). However, this process requires a lot of compute and time. Additionally, processing maps as imagery makes it challenging to model long-range geometry, such as lanes merging ahead, which affects the quality of the predictions."
    - Impacting parameters:
      - Convolutional kernel sizes.
      - Resolution of the rasterized images.
      - Feature cropping:
        - "A larger crop size (`3` vs `1`) can significantly improve the performance, and cropping along observed trajectory also leads to better performance."
  - `2-` Vectorized representation
    - All `map` and `trajectory` elements can be approximated as sequences of vectors.
    - "This avoids lossy rendering and computationally intensive `ConvNet` encoding steps."
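To make the vectorized representation concrete, here is a minimal sketch of how a polyline could be split into vector fragments. The feature layout (start point, end point, polyline id) and the name `polyline_to_vectors` are assumptions chosen for illustration, not the exact encoding used by the authors.

```python
from typing import List, Tuple

Point = Tuple[float, float]
# Hypothetical vector layout: (x_start, y_start, x_end, y_end, polyline_id).
Vector = Tuple[float, float, float, float, int]


def polyline_to_vectors(points: List[Point], polyline_id: int) -> List[Vector]:
    """Split an ordered list of points into consecutive vector fragments."""
    return [
        (x0, y0, x1, y1, polyline_id)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


# A lane boundary sampled at three points yields two vectors.
lane_boundary = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)]
vectors = polyline_to_vectors(lane_boundary, polyline_id=7)
```

The same function would apply to an agent trajectory, with time-stamped positions taking the place of map points.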
- About graph neural networks (`GNN`s), from Rishabh Anand's medium article:
  - `1-` Given a graph, we first convert the `nodes` to recurrent units and the `edges` to feed-forward neural networks.
  - `2-` Then we perform `Neighbourhood Aggregation` (`Message Passing`) for all nodes `n` number of times.
  - `3-` Then we sum over the embedding vectors of all nodes to get graph representation `H`. Here the "Global interaction graph".
  - `4-` Feel free to pass `H` into higher layers or use it to represent the graph’s unique properties! Here to learn interaction models to make prediction.
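The aggregation and readout steps above can be sketched without any learned component. This toy version replaces the recurrent units and feed-forward edge networks with a parameter-free mean over neighbours, which is a simplification for illustration and not the operator used in `VectorNet`; the names `message_passing` and `readout` are hypothetical.

```python
from typing import Dict, List


def message_passing(features: Dict[int, List[float]],
                    neighbours: Dict[int, List[int]],
                    n_rounds: int) -> Dict[int, List[float]]:
    """Each round, every node adds the mean of its neighbours' embeddings to its own."""
    h = {node: list(vec) for node, vec in features.items()}
    for _ in range(n_rounds):
        new_h = {}
        for node, vec in h.items():
            nbrs = neighbours.get(node, [])
            if nbrs:
                mean = [sum(h[m][k] for m in nbrs) / len(nbrs) for k in range(len(vec))]
            else:
                mean = [0.0] * len(vec)
            new_h[node] = [a + b for a, b in zip(vec, mean)]
        h = new_h
    return h


def readout(h: Dict[int, List[float]]) -> List[float]:
    """Sum all node embeddings into a single graph representation H."""
    dim = len(next(iter(h.values())))
    return [sum(vec[k] for vec in h.values()) for k in range(dim)]


# Three nodes on a line graph: 0 - 1 - 2.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbours = {0: [1], 1: [0, 2], 2: [1]}
H = readout(message_passing(features, neighbours, n_rounds=1))
```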
- About hierarchy:
  - `1-` First, aggregate information among vectors inside a polyline, namely `polyline subgraphs`.
    - Graph neural networks (`GNN`s) are used to incorporate these sets of vectors
    - "We treat each vector vi belonging to a polyline Pj as a node in the graph with node features."
    - How to encode attributes of these geometric elements? E.g. `traffic light state`, `speed limit`?
      - I must admit I did not fully understand. But from what I read on medium:
        - Each `node` has a set of features defining it.
        - Each `edge` may connect `nodes` together that have similar features.
  - `2-` Then, model the higher-order relationships among polylines, directly from their vectorized form.
    - Two interactions are jointly modelled:
      - `1-` The interactions of multiple agents.
      - `2-` Their interactions with the entities from road maps.
        - E.g. a car enters an intersection, or a pedestrian approaches a crosswalk.
    - "We clearly observe that adding map information significantly improves the trajectory prediction performance."
- An auxiliary task:
  - `1-` Randomly masking out `map` features during training, such as a `stop sign` at a four-way intersection.
  - `2-` Require the net to complete it.
  - "The goal is to incentivize the model to better capture interactions among nodes."
  - And to learn to deal with occlusion.
  - "Adding this objective consistently helps with performance, especially at longer time horizons".
- Two training objectives:
  - `1-` Main task = Prediction. Future trajectories.
  - `2-` Auxiliary task = Supervision. `Huber loss` between predicted node features and ground-truth masked node features.
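As a reminder of what the auxiliary objective computes, here is a minimal sketch of the Huber loss on a flat list of feature values. The threshold `delta=1.0` and the averaging over features are assumptions; the notes do not state which settings the authors use.

```python
from typing import Sequence


def huber_loss(pred: Sequence[float], target: Sequence[float], delta: float = 1.0) -> float:
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        if r <= delta:
            total += 0.5 * r * r
        else:
            total += delta * (r - 0.5 * delta)
    return total / len(pred)


# One small residual (0.5) and one large residual (3.0).
loss = huber_loss([0.5, 3.0], [0.0, 0.0])
```

The linear regime makes the loss less sensitive to a badly reconstructed masked feature than a squared error would be.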
- Evaluation metrics:
  - The "widely used" Average Displacement Error (`ADE`) computed over the entire trajectories.
  - The Displacement Error at `t` (`DE@ts`) metric, where `t` in {`1.0`, `2.0`, `3.0`} seconds.
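Both metrics reduce to Euclidean distances between predicted and ground-truth positions. The sketch below assumes trajectories are lists of `(x, y)` positions sampled every `dt` seconds, with the first sample at time `dt`; the function names and the sampling convention are assumptions for illustration.

```python
import math
from typing import List, Tuple

Trajectory = List[Tuple[float, float]]


def displacement_errors(pred: Trajectory, truth: Trajectory) -> List[float]:
    """Euclidean distance between predicted and true position at each step."""
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]


def ade(pred: Trajectory, truth: Trajectory) -> float:
    """Average Displacement Error over the entire trajectory."""
    errs = displacement_errors(pred, truth)
    return sum(errs) / len(errs)


def de_at(pred: Trajectory, truth: Trajectory, t_seconds: float, dt: float) -> float:
    """Displacement Error at time t; sample k (0-based) is assumed to be at (k + 1) * dt."""
    index = round(t_seconds / dt) - 1
    return displacement_errors(pred, truth)[index]


dt = 1.0  # assumed sampling period, in seconds
truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
pred = [(1.0, 0.0), (2.0, 3.0), (3.0, 4.0)]
```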
- Performances and computation cost.
  - `VectorNet` is compared to `ConvNets` on the `Argoverse` forecasting dataset, as well as on some `Waymo` in-house prediction dataset.
  - `ConvNets` consume `200+` times more `FLOPs` than `VectorNet` for a single agent: `10.56G` vs `0.041G`. Factor `5` when there are `50` agents per scene.
  - `VectorNet` needs `29%` of the parameters of `ConvNets`: `72K` vs `246K`.
  - `VectorNet` achieves up to `18%` better performance on `Argoverse`.
"Online parameter estimation for human driver behavior prediction"
- [ `2020` ] [📝] [] [ 🎓 `Stanford` ] [ 🚗 `Toyota Research Institute` ]
- [ `stochastic IDM` ]
The vanilla IDM is a parametric rule-based car-following model that balances two forces: the desire to achieve free speed if there were no vehicle in front, and the need to maintain safe separation with the vehicle in front. It outputs an acceleration that is guaranteed to be collision free. The stochastic version introduces a new model parameter σ-IDM. Source.
Authors: Bhattacharyya, R., Senanayake, R., Brown, K., & Kochenderfer, M. J.
- Motivations:
  - `1-` Explicitly model stochasticity in the behaviour of individual drivers.
    - Complex multi-modal distributions over possible outcomes should be modelled.
  - `2-` Provide safety guarantees.
  - `3-` Highway scenarios: no urban intersection.
  - The methods should combine advantages of `rule-based` and `learning-based` estimation/prediction methods:
    - Interpretability.
    - Guarantees on safety (the learning-based model Generative Adversarial Imitation Learning (`GAIL`) used as baseline is not collision free).
    - Validity even in regions of the state space that are under-represented in the data.
    - High expressive power to capture nuanced driving behaviour.
- About the method:
  - "We apply online parameter estimation to an extension of the Intelligent Driver Model IDM that explicitly models stochasticity in the behavior of individual drivers."
  - This rule-based method is online, as opposed for instance to the IDM with parameters obtained by offline estimation, using non-linear least squares.
  - Particle filtering is used for the recursive Bayesian estimation.
  - The derived parameter estimates are then used for forward motion prediction.
- About the estimated parameters (per observed vehicle):
  - `1-` The desired velocity (`v-des`).
  - `2-` The driver-dependent stochasticity on acceleration (`σ-IDM`).
  - They are assumed stationary for each driver, i.e., human drivers do not change their latent driving behaviour over the time horizons.
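To see where the two estimated parameters enter, here is a minimal sketch of the standard IDM acceleration, followed by a stochastic variant. It assumes that the stochastic version samples the acceleration from a Gaussian centred on the IDM output with standard deviation `σ-IDM`; the default values for `a_max`, `b`, `T`, `s0` and `delta` are illustrative placeholders, not the values used in the paper.

```python
import math
import random


def idm_acceleration(v: float, gap: float, dv: float, v_des: float,
                     a_max: float = 1.5, b: float = 2.0, T: float = 1.5,
                     s0: float = 2.0, delta: int = 4) -> float:
    """Vanilla IDM. v: own speed, gap: distance to leader, dv = v - v_leader."""
    # Desired gap: standstill distance + time headway + braking term (clamped at 0).
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    # Free-road term pulls towards v_des, interaction term keeps a safe gap.
    return a_max * (1.0 - (v / v_des) ** delta - (s_star / gap) ** 2)


def stochastic_idm_acceleration(v: float, gap: float, dv: float, v_des: float,
                                sigma_idm: float, rng: random.Random) -> float:
    """Assumed stochastic variant: Gaussian noise around the IDM acceleration."""
    return rng.gauss(idm_acceleration(v, gap, dv, v_des), sigma_idm)


rng = random.Random(0)
accel = stochastic_idm_acceleration(v=20.0, gap=40.0, dv=2.0, v_des=30.0,
                                    sigma_idm=0.3, rng=rng)
```

In the particle filter, each particle would carry its own (`v_des`, `sigma_idm`) pair and be weighted by how well this model explains the observed motion.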
- About the datasets:
  - `NGSIM` for US Highway 101 at `10 Hz`.
  - Highway Drone Dataset (`HighD`) at `25 Hz`.
  - `RMSE` of the position and velocity are used to measure “closeness” of a predicted trajectory to the corresponding ground-truth trajectory.
  - Undesirable events, e.g. collision, going off-the-road, hard braking, that occur in each scene prediction are also considered.
- How to deal with the "particle deprivation problem"?
  - Particle deprivation = particles converge to one region of the state space and there is no exploration of other regions.
  - `Dithering` method = external noise is added to aid exploration of state space regions.
  - From (Schön, Gustafsson, & Karlsson, 2009) in "The Particle Filter in Practice":
    - "Both the `process` noise and `measurement` noise distributions need some dithering (increased covariance). Dithering the `process` noise is a well-known method to mitigate the sample impoverishment problem. Dithering the `measurement` noise is a good way to mitigate the effects of outliers and to robustify the `PF` in general".
  - Here:
    - "We implement dithering by adding random noise to the top `20%` particles ranked according to the corresponding likelihood. The noise is sampled from a discrete uniform distribution with `v-des` `∈` {`−0.5`, `0`, `0.5`} and `σ-IDM` `∈` {`−0.1`, `0`, `0.1`}. (This preserves the discretization present in the initial sampling of particles)."
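The dithering step quoted above can be sketched as follows. Particles are assumed to be `(v_des, sigma_idm)` tuples with one likelihood each; the function name and the absence of any clipping to a valid parameter range are simplifications for illustration.

```python
import random
from typing import List, Tuple

Particle = Tuple[float, float]  # (v_des, sigma_idm)


def dither_top_particles(particles: List[Particle], likelihoods: List[float],
                         rng: random.Random, top_fraction: float = 0.2) -> List[Particle]:
    """Add discrete uniform noise to the top-likelihood particles only."""
    n_top = max(1, int(len(particles) * top_fraction))
    ranked = sorted(range(len(particles)), key=lambda i: likelihoods[i], reverse=True)
    dithered = list(particles)
    for i in ranked[:n_top]:
        v_des, sigma = particles[i]
        # Noise values taken from the quote; they keep particles on the initial grid.
        dithered[i] = (v_des + rng.choice([-0.5, 0.0, 0.5]),
                       sigma + rng.choice([-0.1, 0.0, 0.1]))
    return dithered


particles = [(20.0 + i, 0.5) for i in range(10)]
likelihoods = [i / 10 for i in range(10)]
dithered = dither_top_particles(particles, likelihoods, random.Random(0))
```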
- Future works:
  - Non-stationarity.
  - Combination with a lane changing model such as `MOBIL` to extend to two-dimensional driving behaviour.
"PLOP: Probabilistic poLynomial Objects trajectory Planning for autonomous driving"
- [ `2020` ] [📝] [[🎞️](TO COME)] [ 🚗 `Valeo` ]
- [ `Gaussian mixture`, `multi-trajectory prediction`, `nuScenes`, `A2D2`, `auxiliary loss` ]
The architecture has two main sections: an encoder to synthesize information and the predictor where we exploit it. Note that PLOP does not use the classic RNN decoder scheme for trajectory generation, preferring a single step version which predicts the coefficients of a polynomial function instead of the consecutive points. Also note the navigation command that conditions the ego prediction. Source.
PLOP uses multimodal sensor data input: Lidar and camera. The map is accumulated over the past 2s, so 20 frames. It produces a multivariate Gaussian mixture for a fixed number of K possible trajectories over a 4s horizon. Uncertainty and variability are handled by predicting vehicle trajectories as probabilistic Gaussian Mixture models, constrained by a polynomial formulation. Source.
Authors: Buhet, T., Wirbel, E., & Perrotton, X.
- Motivations:
  - The goal is to predict multiple feasible future trajectories both for the ego vehicle and neighbors through a probabilistic framework.
    - In addition in an `end-to-end` trainable fashion.
  - It builds on a previous work: "Conditional vehicle trajectories prediction in carla urban environment" - (Buhet, Wirbel, & Perrotton, 2019). See analysis further below.
    - The trajectory prediction based on polynomial representation is upgraded from deterministic output to multimodal probabilistic output.
    - It re-uses the navigation command input for the conditional part of the network, e.g. `follow`, `left`, `straight`, `right`.
    - One main difference is the introduction of a new input sensor: Lidar.
    - And adding a semantic segmentation auxiliary loss.
  - The authors also reflect on which metrics are relevant for trajectory prediction:
    - "We suggest to use two additional criteria to evaluate the predictions errors, one based on the most confident prediction, and one weighted by the confidence [how alternative trajectories with non maximum weights compare to the most confident trajectory]."
- One term: "Probabilistic poLynomial Objects trajectory Planning" = `PLOP`.
- I especially like their review on related works about data-driven predictions (section taken from the paper):
  - `SocialLSTM`: encodes the relations between close agents introducing a social pooling layer.
  - Deterministic approaches derived from `SocialLSTM`:
    - `SEQ2SEQ` presents a new `LSTM`-based encoder-decoder network to predict trajectories into an occupancy grid map.
    - `SocialGAN` and `SoPhie` use generative adversarial networks to tackle uncertainty in future paths and augment the original set of samples.
    - `CS-LSTM` extends `SocialLSTM` using convolutional layers to encode the relations between the different agents.
  - `ChauffeurNet` uses a sophisticated neural network with a complex high level scene representation (`roadmap`, `traffic lights`, `speed limit`, `route`, `dynamic bounding boxes`, etc.) for deterministic ego vehicle trajectory prediction.
  - Other works use a graph representation of the interactions between the agents in combination with neural networks for trajectory planning.
  - Probabilistic approaches:
    - Many works like `PRECOG`, `R2P2`, `Multiple Futures Prediction`, `SocialGAN` include probabilistic estimation by adding a probabilistic framework at the end of their architecture producing multiple trajectories for ego vehicle, nearby vehicles or both.
    - In `PRECOG`, `Rhinehart et al.` build a probabilistic model that explicitly models interactions between agents, using latent variables to model the plausible reactions of agents to each other, with a possibility to pre-condition the trajectory of the ego vehicle by a goal.
    - `MultiPath` also reuses an idea from object detection algorithms using trajectory anchors extracted from the training data for ego vehicle prediction.
- About the auxiliary semantic segmentation task.
  - Teaching the network to represent such semantics in its features improves the prediction.
  - "Our objective here is to make sure that in the `RGB` image encoding, there is information about the road position and availability, the applicability of the traffic rules (traffic sign/signal), the vulnerable road users (pedestrians, cyclists, etc.) position, etc. This information is useful for trajectory planning and brings some explainability to our model."
- About interactions with other vehicles.
  - The prediction for each vehicle does not have direct access to the sequence of history positions of others.
  - "The encoding of the interaction between vehicles is implicitly computed by the birdview encoding."
  - The number of predicted trajectories is fixed in the network architecture. `K=12` is chosen.
    - "It allows our architecture to be agnostic to the number of considered neighbors."
- Multi-trajectory prediction in a probabilistic framework.
  - "We want to predict a fixed number `K` of possible trajectories for each vehicle, and associate them to a probability distribution over `x` and `y`: `x` is the longitudinal axis, `y` the lateral axis, pointing left."
  - About the Gaussian Mixture.
    - Vehicle trajectories are predicted as probabilistic Gaussian Mixture models, constrained by a polynomial formulation: The mean of the distribution is expressed using a polynomial of degree `4` of time.
    - "In the end, this representation can be interpreted as predicting `K` trajectories, each associated with a confidence `πk` [mixture weights shared for all sampled points belonging to the same trajectory], with sampled points following a Gaussian distribution centered on (`µk,x,t`, `µk,y,t`) and with standard deviation (`σk,x,t`, `σk,y,t`)."
    - "`PLOP` does not use the classic `RNN` decoder scheme for trajectory generation, preferring a single step version which predicts the coefficients of a polynomial function instead of the consecutive points."
    - This offers a measure of uncertainty on the predictions.
    - For the ego car, the probability distribution is conditioned by the navigation command.
  - About the loss:
    - `negative log-likelihood` over all sampled points of the ground truth ego and neighbour vehicles trajectories.
    - There is also the auxiliary `cross entropy loss` for segmentation.
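The polynomial mean and the negative log-likelihood can be sketched together. This toy version evaluates one observed point under `K` axis-aligned Gaussian components; using a constant standard deviation per component (instead of a time-dependent one) and the dictionary layout of a component are assumptions made to keep the example short.

```python
import math
from typing import Dict, List, Tuple


def poly_mean(coeffs: List[float], t: float) -> float:
    """Degree-4 polynomial of time: c0 + c1*t + c2*t^2 + c3*t^3 + c4*t^4."""
    return sum(c * t ** k for k, c in enumerate(coeffs))


def mixture_nll(point: Tuple[float, float], t: float, components: List[Dict]) -> float:
    """Negative log-likelihood of one observed (x, y) at time t under K components."""
    x, y = point
    density = 0.0
    for c in components:
        mx = poly_mean(c["coeffs_x"], t)
        my = poly_mean(c["coeffs_y"], t)
        zx = (x - mx) / c["sigma_x"]
        zy = (y - my) / c["sigma_y"]
        norm = 2.0 * math.pi * c["sigma_x"] * c["sigma_y"]
        density += c["pi"] * math.exp(-0.5 * (zx * zx + zy * zy)) / norm
    return -math.log(density)


# One trajectory hypothesis driving straight at 1 m/s along x.
straight = {"pi": 1.0, "sigma_x": 1.0, "sigma_y": 1.0,
            "coeffs_x": [0.0, 1.0, 0.0, 0.0, 0.0],
            "coeffs_y": [0.0, 0.0, 0.0, 0.0, 0.0]}
nll = mixture_nll((2.0, 0.0), t=2.0, components=[straight])
```

The training loss would sum this quantity over all sampled ground-truth points of the ego and neighbour trajectories.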
- Some findings:
  - The presented model seems very robust to the varying number of neighbours.
    - Finally, for `5` agents or more, `PLOP` outperforms by a large margin all `ESP` and `PRECOG`, on authors-defined metrics.
    - "This result might be explained by our interaction encoding which is robust to the variations of `N` using only multiple birdview projections and our non-iterative single step trajectory generation."
  - "Using `K = 1` approach yields very poor results, also visible in the training loss. It was an anticipated outcome due to the ambiguity of human behavior."
"Probabilistic Future Prediction for Video Scene Understanding"
- [ `2020` ] [📝] [🎞️] [🎞️ (blog)] [ 🎓 `University of Cambridge` ] [ 🚗 `Wayve` ]
- [ `multi frame`, `multi future`, `auxiliary learning`, `multi-task`, `conditional imitation learning` ]
One main motivation is to supply the Control module (e.g. policy learnt via IL) with a representation capable of modelling probability of future events. The Dynamics module produces such spatio-temporal representation, not directly from images but from learnt scene features. These embeddings, which are used by the Control in order to learn a driving policy, can be explicitly decoded to future semantic segmentation, depth, and optical flow. Note that the stochasticity of the future is modelled with a conditional variational approach that minimises the divergence between the present distribution (what could happen given what we have seen) and the future distribution (what we observe actually happens). During inference, diverse futures are generated by sampling from the present distribution. Source.
There are many possible futures approaching this four-way intersection. Using 3 different noise vectors makes the model imagine different driving manoeuvres at an intersection: driving straight, turning left or turning right. These samples predict 10 frames, or 2 seconds into the future. Source.
The differential entropy of the present distribution is used to characterize how unsure the model is about the future. As we approach the intersection, it increases. Source.
Authors: Hu, A., Cotter, F., Mohan, N., Gurau, C., & Kendall, A.
- Motivations:
  - `1-` Supply the control module (e.g. `IL`) with an appropriate representation for `interaction-aware` and `uncertainty-aware` decision-making, i.e. one capable of modelling probability of future events.
    - Therefore the policy should receive temporal features explicitly trained to predict the future.
      - Motivation for that: It is difficult to learn an effective temporal representation by only using imitation error as a learning signal.
  - Others:
  - `2-` "`multi frame` and `multi future`" prediction.
    - Perform prediction:
      - ... based on multiple past frames (i.e. not a single one).
      - ... and producing multiple possible outcomes (i.e. not deterministic).
    - Predict the stochasticity of the future, i.e. contemplate multiple possible outcomes and estimate the multi-modal uncertainty.
  - `3-` Offer a differentiable / end-to-end trainable system, as opposed to systems that reason over hand-coded representations.
    - I understand it as considering the loss of the `IL` part into the layers that create the latent representation.
  - `4-` Cope with multi-agent interaction situations such as traffic merging, i.e. do not predict the behaviour of each actor in the scene independently.
    - For instance by jointly predicting ego-motion and motion of other dynamic agents.
  - `5-` Do not rely on any `HD-map` to predict the static scene, to stay resilient to `HD-map` errors due to e.g. roadworks.
- `auxiliary learning`: The loss used to train the latent representation is composed of three terms (c.f. motivation `3-`):
  - `future-prediction`: weighted sum of future `segmentation`, `depth` and `optical flow` losses.
  - `probabilistic`: `KL`-divergence between the `present` and the `future` distributions.
  - `control`: regression for future time-steps up to some `Future control horizon`.
- Let's explore some ideas behind these three components.
- `1-` Temporal video encoding: How to build a temporal and visual representation?
  - What should be predicted?
    - "Previous work on probabilistic future prediction focused on trajectory forecasting [DESIRE, Lee et al. 2017, Bhattacharyya et al. 2018, PRECOG, Rhinehart et al. 2019] or were restricted to single-frame image generation and low resolution (64x64) datasets that are either simulated (Moving MNIST) or with static scenes and limited dynamics."
    - "Directly predicting in the high-dimensional space of image pixels is unnecessary, as some details about the appearance of the world are irrelevant for planning and control."
    - Instead, the task is to predict a more complete scene representation with `segmentation`, `depth`, and `flow`, two seconds in the future.
  - What should the `temporal module` process?
    - The temporal model should learn the spatio-temporal features from perception encodings [as opposed to RGB images].
    - These encodings are "scene features" extracted from images by a `Perception` module. They constitute a more powerful and compact representation compared to RGB images.
  - What does the `temporal module` look like?
    - "We propose a new spatio-temporal architecture that can learn hierarchically more complex features with a novel 3D convolutional structure incorporating both local and global space and time context."
  - The authors introduce a so-called `Temporal Block` module for temporal video encoding.
    - These `Temporal Block`s should help to learn hierarchically more complex temporal features. With two main ideas:
    - `1-` Decompose the convolutional filters and play with all possible configurations.
      - "Learning `3D` filters is hard. Decomposing into two subtasks helps the network learn more efficient."
      - "State-of-the-art works decompose `3D` filters into spatial and temporal convolutions. The model we propose further breaks down convolutions into many space-time combinations and context aggregation modules, stacking them together in a more complex hierarchical representation."
    - `2-` Incorporate the "global context" in the features (I did not fully understand that).
      - They concatenate some local features based on `1x1x1` compression with some global features extracted with `average pooling`.
      - "By pooling the features spatially and temporally at different scales, each individual feature map also has information about the global scene context, which helps in ambiguous situations."
2-
Probabilistic prediction: how to generate multiple futures?-
"There are various reasons why modelling the future is incredibly difficult: natural-scene data is rich in details, most of which are irrelevant for the driving task, dynamic agents have complex temporal dynamics, often controlled by unobservable variables, and the future is inherently uncertain, as multiple futures might arise from a unique and deterministic past."
- The idea is that the uncertainty of the future can be estimated by making the prediction probabilistic.
-
"From a unique past in the real-world, many futures are possible, but in reality we only observe one future. Consequently, modelling multi-modal futures from deterministic video training data is extremely challenging."
- Another challenge when trying to learn a multi-modal prediction model:
-
"If the network predicts a plausible future, but one that did not match the given training sequence, it will be heavily penalised."
-
-
"Our work addresses this by encoding the future state into a low-dimensional
future distribution
. We then allow the model to have a privileged view of the future through the future distribution at training time. As we cannot use the future at test time, we train apresent distribution
(using only the current state) to match thefuture distribution
through aKL
-divergence loss. We can then sample from the present distribution during inference, when we do not have access to the future."
-
- To put it another way, two probability distributions are modelled, in a conditional variational approach:
- A present distribution
P
, that represents all what could happen given the past context. - A future distribution
F
, that represents what actually happened in that particular observation.
- A present distribution
-
[Learning to align the
present distribution
with thefuture distribution
] "As the future is multimodal, different futures might arise from a unique past contextzt
. Each of these futures will be captured by the future distributionF
that will pull the present distributionP
towards it." - How to evaluate predictions?
-
"Our probabilistic model should be accurate, that is to say at least one of the generated future should match the ground truth future. It should also be diverse".
- The authors use a diversity distance metric (
DDM
), which measures both accuracy and diversity of the distribution.
-
- How to quantify uncertainty?
- The framework can automatically infer which scenes are unusual or unexpected and where the model is uncertain of the future, by computing the differential entropy of the
present distribution
. - This is useful for understanding edge-cases and when the model needs to "pay more attention".
- The framework can automatically infer which scenes are unusual or unexpected and where the model is uncertain of the future, by computing the differential entropy of the
-
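The two quantities used in this part, the `KL`-divergence between the future and present distributions and the differential entropy of the present distribution, have closed forms when both distributions are diagonal Gaussians. The sketch below makes that assumption; the argument order `KL(F || P)` and the function names are also assumptions rather than details confirmed by these notes.

```python
import math
from typing import Sequence


def kl_diag_gaussians(mu_f: Sequence[float], sigma_f: Sequence[float],
                      mu_p: Sequence[float], sigma_p: Sequence[float]) -> float:
    """KL(F || P) between two diagonal Gaussians, summed over dimensions."""
    return sum(
        math.log(sp / sf) + (sf ** 2 + (mf - mp) ** 2) / (2.0 * sp ** 2) - 0.5
        for mf, sf, mp, sp in zip(mu_f, sigma_f, mu_p, sigma_p)
    )


def differential_entropy(sigma: Sequence[float]) -> float:
    """Differential entropy of a diagonal Gaussian; grows with the spread."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s ** 2) for s in sigma)


# The future distribution is shifted by one standard deviation from the present one.
kl = kl_diag_gaussians(mu_f=[1.0], sigma_f=[1.0], mu_p=[0.0], sigma_p=[1.0])
# A wider present distribution means the model is less sure about the future.
entropy_far = differential_entropy([1.0])
entropy_near_intersection = differential_entropy([2.0])
```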
- `3-` The rich spatio-temporal features explicitly trained to predict the future are used to learn a driving policy.
  - Conditional Imitation Learning is used to learn `speed` and `steering` controls, i.e. regressing to the expert's true control actions {`v`, `θ`}.
    - One reason is that it is immediately transferable to the real world.
  - From the ablation study, it seems to highly benefit from both:
    - `1-` The `temporal features`.
      - "It is too difficult to forecast how the future is going to evolve with a single image".
    - `2-` The fact that these features are capable of `probabilistic predictions`.
      - Especially for multi-agent interaction scenarios.
- About the training set:
  - "We address the inherent dataset bias by sampling data uniformly across lateral and longitudinal dimensions. First, the data is split into a histogram of bins by `steering`, and subsequently by `speed`. We found that weighting each data point proportionally to the width of the bin it belongs to avoids the need for alternative approaches such as data augmentation."
- One exciting future direction:
  - For the moment, the `control` module takes the representation learned from dynamics models. And ignores the predictions themselves.
    - By the way, why are predictions, especially for the ego trajectories, not conditioned on possible actions?
  - It could use these probabilistic embeddings capable of predicting multi-modal and plausible futures to generate imagined experience to train a policy in a model-based `RL`.
  - The design of the `reward` function from the latent space looks challenging at first sight.
"Efficient Behavior-aware Control of Automated Vehicles at Crosswalks using Minimal Information Pedestrian Prediction Model"
- [ `2020` ] [📝] [ 🎓 `University of Michigan`, `University of Massachusetts` ]
- [ `interaction-aware decision-making`, `probabilistic hybrid automaton` ]
The pedestrian crossing behaviour is modelled as a probabilistic hybrid automaton. Source.
The interaction is captured inside a gap-acceptance model: the pedestrian evaluates the available time gap to cross the street and either accepts the gap by starting to cross or rejects the gap by waiting at the crosswalk. Source.
The baseline controller used for comparison is a finite state machine (FSM) with four states. Whenever a pedestrian starts walking to cross the road, the controller always tries to stop, either by yielding or through hard stop. Source.
Authors: Jayaraman, S. K., Robert Jr, L. P., Yang, X. J., Pradhan, A. K., & Tilbury, D. M.
- Motivations:
  - Scenario: interaction with a pedestrian `approaching` / `crossing` / `waiting` at a crosswalk.
  - `1-` A (`1.1`) simple and (`1.2`) interaction-aware pedestrian `prediction` model.
    - That means no requirement of extensive amounts of data.
    - "The crossing model as a hybrid system with a gap acceptance model that required minimal information, namely pedestrian's `position` and `velocity`".
      - It does not require information about pedestrian `actions` or `pose`.
      - It builds on "Analysis and prediction of pedestrian crosswalk behavior during automated vehicle interactions" by (Jayaraman, Tilbury, Yang, Pradhan, & Robert Jr, 2020).
  - `2-` Effectively incorporating these `predictions` in a `control` framework.
    - The idea is to first forecast the position of the pedestrian using a pedestrian model, and then react accordingly.
  - `3-` Be efficient on both `waiting` and `approaching` pedestrian scenarios.
    - Always assuming a `crossing` may lead to over-conservative policies.
    - "[in simulation] only a fraction of pedestrians (`80%`) are randomly assigned the intention to cross the street."
- Why are `CV` and `CA` prediction models not applicable?
  - "At crosswalks, pedestrian behavior is much more unpredictable as they have to wait for an opportunity and decide when to cross."
  - Longer durations are needed.
    - `1-` Interaction must be taken into account.
    - `2-` The authors decide to model pedestrians as a hybrid automaton that switches between discrete actions.
- One term: Behavior-aware Model Predictive Controller (`B-MPC`)
  - `1-` The pedestrian crossing behaviour is modelled as a probabilistic hybrid automaton:
    - Four states: `Approach Crosswalk`, `Wait`, `Cross`, `Walk away`.
    - Probabilistic transitions: using pedestrian's `gap acceptance` - hence capturing interactions.
      - "What is the probability of accepting the current traffic gap?"
      - "Pedestrians evaluate the available time gap to cross the street and either accept the gap by starting to cross or reject the gap by waiting at the crosswalk."
  - `2-` The problem is formulated as a constrained quadratic optimization problem:
    - Cost: `success` (passing the crosswalk), `comfort` (penalize jerk and sudden changes in acceleration), `efficiency` (deviation from the reference speed).
    - Constraints: respect `motion model`, restrict `velocity`, `acceleration`, as well as `jerk`, and ensure `collision avoidance`.
    - Solver: standard quadratic program solver in `MATLAB`.
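The pedestrian automaton can be sketched as a small state machine. The four state names come from the notes above, but everything else here is hypothetical: the logistic form of `gap_acceptance_probability`, its `midpoint_s` and `slope` values, and the exact transition conditions are illustrative choices, not the model fitted by the authors.

```python
import math
import random


def gap_acceptance_probability(time_gap_s: float, midpoint_s: float = 4.0,
                               slope: float = 1.5) -> float:
    """Hypothetical logistic model: larger time gaps are more likely to be accepted."""
    return 1.0 / (1.0 + math.exp(-slope * (time_gap_s - midpoint_s)))


def step(state: str, at_crosswalk: bool, done_crossing: bool,
         time_gap_s: float, rng: random.Random) -> str:
    """One discrete transition of the (assumed) pedestrian hybrid automaton."""
    if state == "Approach Crosswalk":
        if not at_crosswalk:
            return state
        accept = rng.random() < gap_acceptance_probability(time_gap_s)
        return "Cross" if accept else "Wait"
    if state == "Wait":
        accept = rng.random() < gap_acceptance_probability(time_gap_s)
        return "Cross" if accept else "Wait"
    if state == "Cross":
        return "Walk away" if done_crossing else "Cross"
    return "Walk away"


rng = random.Random(0)
next_state = step("Wait", at_crosswalk=True, done_crossing=False,
                  time_gap_s=1000.0, rng=rng)
```

A controller such as the `B-MPC` would roll this automaton forward over its horizon to obtain predicted pedestrian positions for the collision-avoidance constraint.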
- Performances:
  - Baseline controller:
    - Finite state machine (`FSM`) with four states: `Maintain Speed`, `Accelerate`, `Yield`, and `Hard Stop`.
    - "Whenever a pedestrian starts walking to cross the road, the controller always tries to stop, either by `yielding` or through `hard stop`."
    - "The Boolean variable `InCW`, denotes the pedestrian’s crossing activity: `InCW=1` from the time the pedestrian started moving laterally to cross until they completely crossed the `AV` lane, and `InCW=0` otherwise."
    - That means the baseline controller does not react at all to "non-crossing" cases since it never sees the pedestrian crossing laterally.
  - "It can be seen that the `B-MPC` is more aggressive, efficient, and comfortable than the baseline as observed through the higher average velocity, lower average acceleration effort, and lower average jerk respectively."
"Human Motion Trajectory Prediction: A Survey"
- [ `2019` ] [📝] [ 🎓 `Örebro University`, `CMU`, `TU Delft` ] [ 🚗 `Bosch` ]
- [ `survey`, `physics-based`, `pattern-based`, `planning-based`, `manoeuvre recognition`, `goal-informed`, `context-aware`, `contextual cues`, `social force` ]
The focus is on human motion prediction: not on cars. Approaches differ in the way they represent, parametrize, (learn) and solve the task. Modelling approaches can be divided into three families: physics-based, pattern-based, planning-based. Source.
What information to consider? All modelling approaches can be extended with so-called contextual cues of the target agent, static and dynamic environment. Source.
Some motion trajectory datasets. Mainly for pedestrians. Source.
Authors: Rudenko, A., Palmieri, L., Herman, M., Kitani, K. M., Gavrila, D. M., & Arras, K. O.
- About `pedestrian` (not `vehicle`) motion prediction in `AD`:
  - "Most works consider the scenario of the laterally crossing pedestrian, dealing with the question what the latter will do at the curbside: `start walking`, `continue walking`, or `stop walking`."
  - "It is difficult to compare the experimental results, as the datasets are varying (different `timings` of same scenario, different `sensors`, different `metrics`)."
- Prediction methods can be classified in three large categories:
  - `1-` `physics`-based: reactive `sense`-`predict`.
  - `2-` `pattern`-based: `sense`-`learn`-`predict`.
  - `3-` `planning`-based: `sense`-`reason`-`predict`.
    - Agents reason about `intentions` and possible ways to the `goal`.
  - These categories do not form a partition. Example of combinations:
    - `planning`-based using `physics`-based `transition` functions.
    - `physics`-based methods tuned with learned parameters.
    - `planning`-based approaches using `IRL` (`learning` methods) to recover the hidden `reward` function.
    - In general, `combinations` hold a great potential, e.g. use disagreements to detect uncertainty.
- `1-` `Physics`-based methods.
  - Suitable in those situations where:
    - The effect of other agents or the static environment can be modelled as `forces` acting upon an agent.
    - The agent's motion `dynamics` can be modelled by an explicit `transition` function.
  - Limitations:
    - They might not capture well the complexity of the real world.
    - Mainly valid for short prediction horizons.
    - Potential burden of parameter tuning.
  - Advantages:
    - Applicability across multiple domains under mild conditions.
      - "`physics`-based approaches can be readily applied across multiple environments, without the need for training datasets (some data for parameter estimation is useful, though)."
    - Interpretability.
    - Usually simple and fast approximate inference.
  - Idea:
    - "Motion is predicted by forward simulating a set of explicitly defined `dynamics` equations that follow a `physics` inspired model."
    - In `1`-step ahead predictions with first-order `Markov` assumption:
      - A `transition` function (or `dynamics` model) `f` is defined, with some time step `dt`.
      - `next_state` = `f`(`state`, `action`) + `noise`.
      - The task is to estimate (`sense`) `state` and `action`. Inference is made from various estimated or observed `cues`.
      - Extension to consider not just the current `state` (first-order `Markov`), but a history of `states`:
        - "The early work by `Zhu (1991)` uses an autoregressive moving average model as `transition` function of a Hidden Markov Model (`HMM`) to predict occupancy probabilities of moving obstacles over multiple time steps with applications to predictive planning."
    - What to do with this `f`?
      - Use it alone: `CA`, `CV`.
      - Recursive Bayesian filters, e.g. `Kalman` filters.
      - Multiple-model algorithms.
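The one-step formulation above can be sketched as follows. This is a minimal illustration, assuming a constant-velocity (`CV`) model in 2D with additive Gaussian noise on the position. The function names `cv_step` and `rollout`, the state layout `(x, y, vx, vy)` and the noise level are assumptions made for the example; they are not taken from the survey.

```python
import random


def cv_step(state, dt, noise_std=0.0, rng=random):
    """One-step CV transition: next_state = f(state) + noise.

    state is (x, y, vx, vy). The CV model has no action input.
    """
    x, y, vx, vy = state
    # Additive Gaussian noise on the position only (an assumption).
    nx = rng.gauss(0.0, noise_std) if noise_std > 0.0 else 0.0
    ny = rng.gauss(0.0, noise_std) if noise_std > 0.0 else 0.0
    return (x + vx * dt + nx, y + vy * dt + ny, vx, vy)


def rollout(state, dt, n_steps, noise_std=0.0, rng=random):
    """Repeat the one-step prediction to obtain a trajectory."""
    trajectory = []
    for _ in range(n_steps):
        state = cv_step(state, dt, noise_std, rng)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    # Pedestrian at the origin walking at (1.0, 0.5) m/s, predicted over 2 s.
    for predicted in rollout((0.0, 0.0, 1.0, 0.5), dt=0.5, n_steps=4):
        print(predicted)
```

Calling `rollout` several times with `noise_std > 0` gives a crude sample-based picture of how uncertainty grows with the horizon, which is one reason such models are mostly trusted for short-term prediction.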
  - Sub-classes: are multiple `dynamics` motion models considered, or only one?
  - `1-1.` Single model.
    - `kinematic` vs `dynamic` models.
      - Are forces that govern the motion of other agents considered?
      - Rarely. They are not directly observable from sensory data. And `dynamics` models can be very complex.
    - Examples: `CV`, `CA`, Curvilinear motion and Coordinated turn model (`CT`) (assumes constant `turn rate` and `speed` with white noise linear and white noise turn `acceleration`).
    - Miscellaneous:
      - "The `bicycle model` is often used as an approximation to model the vehicle dynamics [rather '`kinematics`' I would have said]."
      - Parameters (e.g. for `IDM`) can be learnt / estimated on-line.
    - How to improve them and achieve longer horizons? By considering the `static` and `dynamic` environment.
      - `1-` Consider `map` information.
        - Simply said: Roads are structured. Cars try to stay on the road. Considering the free / drivable space as constraints is therefore beneficial, e.g. to focus on high-probability regions.
        - Examples:
          - Formalize relevant traffic rules, e.g. pedestrian crossing permission on the green light, as additional motion constraints.
          - "When there are several possible turns at a `node`, i.e. at bifurcations, predictions are propagated along all outgoing `edges`."
          - Reachability-based models.
      - `2-` Consider local interactions between multiple agents.
        - Social force models. E.g. potential field model.
        - Intelligent Driver Model (`IDM`) and `MOBIL`.
        - Group motion models: detecting and considering groups of people who walk together.
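Among the local interaction models listed above, the Intelligent Driver Model gives a closed-form longitudinal acceleration. The sketch below assumes the commonly published form of `IDM`. The parameter values are illustrative defaults chosen for this example, not values from the survey, and the function name `idm_acceleration` is hypothetical.

```python
import math


def idm_acceleration(v, gap, dv, v0=30.0, T=1.5, s0=2.0, a_max=1.0, b=1.5, delta=4.0):
    """Longitudinal acceleration of a follower according to IDM.

    v:   follower speed [m/s]
    gap: bumper-to-bumper distance to the leader [m], must be > 0
    dv:  approach rate, i.e. follower speed minus leader speed [m/s]
    v0:  desired speed, T: time headway, s0: standstill gap,
    a_max: maximum acceleration, b: comfortable deceleration.
    """
    # Desired dynamic gap: standstill distance + headway term + braking term.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    # Free-road term minus interaction term.
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)


if __name__ == "__main__":
    print(idm_acceleration(v=20.0, gap=50.0, dv=0.0))   # following at equal speed
    print(idm_acceleration(v=20.0, gap=10.0, dv=10.0))  # closing in fast: braking
```

Plugged into a one-step `transition` function, this acceleration plays the role of the `action`, which is how such a model extends a plain `CA` / `CV` predictor with awareness of the leading vehicle.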
  - `1-2.` Multi-model: the question is "how to combine individual models?"
    - "Complex agent motion is poorly described by a single dynamical model `f`. Although the incorporation of `map` information and influences from multiple agents render such approaches more flexible, they remain inherently limited."
    - Idea:
      - Define (`1`) and fuse (`2`) different prototypical `motion modes`, each described by a different dynamic regime `f`.
      - Example of `modes`: `CA`, `CV`, `turning`, `crossing` or `stopping` (for pedestrians).
    - Hybrid estimation and Multi-Model (`MM`) methods.
      - Why "hybrid"?
        - "`MM` methods maintain a hybrid system `state` `ξ` = (`x`, `s`) that augments the continuous valued `x` by a discrete-valued `modal` state `s`."
      - Motion `modes` are not observable. Uncertainty must be considered. Beliefs must be maintained.
      - Ingredients:
        - `1-` A `model` set. Fixed or on-line adaptive.
        - `2-` A strategy to deal with the discrete-valued uncertainties, for example, model sequences under a Markov or semi-Markov assumption.
        - `3-` A recursive estimation scheme to deal with the continuous valued components conditioned on the model.
        - `4-` A mechanism to generate the overall best estimate from a fusion or selection of the individual filter.
      - Interactive multiple model filters (`IMM`).
        - "Interactive" because `context`-based interactions are considered.
        - "The `IMM`-filter is a computationally efficient and in many cases well performing suboptimal estimation algorithm for Markovian switching systems. Basically it consists of `3` major steps:"
          - `interaction` (mixing): Compute an estimate of the `state` with respect to all `N` `transition` models and from these estimates compute `N` mixed inputs.
          - `filtering`: E.g. standard `KF` for each model.
          - `combination`: A weighted combination of updated `state` estimates produced by all the filters.
        - E.g. switching between `CV`, constant position (`CP`) and coordinated turn (`CT`).
        - Model `transitions` can be controlled by an intention recognition system.
        - `IMM` can also be used as a belief updater in `POMDP`s.
          - It takes as input a `belief state` and an `observation` and returns the updated `belief state`.
          - E.g. to estimate whether the car is following a `CV` or a `CA` model.
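The discrete part of such a multi-model scheme, i.e. maintaining a belief over unobservable motion `modes` under a Markov switching assumption and then combining per-mode estimates, can be sketched as below. This is not a full `IMM` filter: the mixing of the continuous estimates and the per-model Kalman filters are omitted, and the names `update_mode_belief` and `combine`, the switching matrix and the noise level are assumptions for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def update_mode_belief(belief, transition, likelihoods):
    """Bayes update of the discrete mode probabilities.

    belief[i]:        prior probability of mode i
    transition[i][j]: probability of switching from mode i to mode j
    likelihoods[j]:   likelihood of the new observation under mode j
    """
    n = len(belief)
    # Prediction: propagate the belief through the Markov chain over modes.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correction: weight each mode by how well it explains the observation.
    unnormalised = [predicted[j] * likelihoods[j] for j in range(n)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


def combine(estimates, belief):
    """Weighted combination of the per-mode state estimates."""
    return sum(w * e for w, e in zip(belief, estimates))


if __name__ == "__main__":
    # 1D pedestrian at x = 0 m with speed 1 m/s, one step of dt = 1 s.
    predictions = [1.0, 0.0]  # mode 0: CV, mode 1: CP (constant position)
    observation = 0.9
    likelihoods = [gaussian_pdf(observation, p, 0.3) for p in predictions]
    belief = update_mode_belief([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], likelihoods)
    print(belief, combine(predictions, belief))
```

The same update is what a belief updater inside a `POMDP` would perform on the mode variable: it takes a `belief state` and an `observation` and returns the updated `belief state`.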
    - Other hybrid approaches: stochastic `Reachability` analysis.
      - "Model agents as hybrid systems (with multiple `modes`) and infer agents' future motions by computing stochastic reachable sets."
    - Non-hybrid: Dynamic Bayesian networks (`DBN`).
      - Why is it said "non-hybrid"? Not clear to me.
        - Because it does not focus on the joint (`manoeuvre`, `physical state`) estimation?
        - As I understand, it is used to infer unobserved variables such as `intentions`, `manoeuvre`, or awareness of an oncoming vehicle (head orientation).
      - [Example of `DBN`-based multi-model] "Coupling a set of dynamic systems (i.e. a bank of Kalman filters (`KF`)) with an `HMM`, which is a special case of the `DBNs`."
      - "Infer human future behaviors, a set of `macro-actions` described by a set of `KFs`, based on measured dynamic quantities (i.e. `acceleration`, `torque`)."
- `2-` `Pattern`-based methods.
  - Idea:
    - Learn motion `patterns` from data of observed agent trajectories.
    - "Many `pattern`-based methods treat agents as particles, placed in the field of learned `transitions`, dictating the direction of future motion."
  - `Supervised` or `unsupervised`? Not clear to me. Maybe both are possible.
    - It could be `supervised`: instead of manually defining (`physics`-based) `x` = `x` + `v` * `dt`, one could regress it.
      - Function approximators are fit to data: `NN`, `HMM`, `GP`s.
    - It could be `unsupervised`: manoeuvre recognition using clustering methods on canonical scenarios.
  - Advantages:
    - Ability to discover statistical behavioural patterns: `dynamics` functions are not hand-crafted, but rather approximated from training data.
  - Limitations:
    - Require training data.
    - Generalization capability of the learned model: whether it can be transferred to a different site, especially if the `map` topology changes.
    - "`Pattern`-based approaches tend to be used in non-safety critical applications, where explainability is less of an issue and where the environment is spatially constrained."
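The supervised view mentioned above, where the hand-written `x = x + v * dt` is replaced by a regressed function, can be illustrated with a deliberately tiny example. The sketch assumes a 1D track and a one-parameter linear model fitted by least squares; real `pattern`-based methods use far richer approximators (`NN`, `GP`s). The names `fit_one_step_gain` and `predict_next` are hypothetical.

```python
def fit_one_step_gain(positions, velocities):
    """Least-squares estimate of k in x[t+1] - x[t] ≈ k * v[t].

    With clean data generated by x = x + v * dt, the fitted k recovers dt.
    """
    steps = list(zip(positions, positions[1:], velocities))
    numerator = sum(v * (x_next - x) for x, x_next, v in steps)
    denominator = sum(v * v for _, _, v in steps)
    return numerator / denominator


def predict_next(x, v, k):
    """Learned one-step predictor."""
    return x + k * v


if __name__ == "__main__":
    # Observed track sampled every 0.5 s (the learner is not told the 0.5).
    positions = [0.0, 0.5, 1.5, 3.0]
    velocities = [1.0, 2.0, 3.0]
    k = fit_one_step_gain(positions, velocities)
    print(k, predict_next(3.0, 4.0, k))
```

The fitted parameter happens to be interpretable here only because the model is linear. With a non-parametric or deep approximator, that interpretability is lost.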
  - Two categories, depending on whether the function approximator makes a temporal factorization of the `dynamics`:
  - `2-1.` Sequential methods.
    - Idea: Learn conditional models over time and recursively apply learned `transition` functions for inference.
    - Comparison with many `physics`-based approaches:
      - Similarity: they learn a one-step predictor `s.t+1` = `f`(`s.t−n:t`, `a.t`).
        - `transition` functions of sequential models have the Markovian property.
        - "In order to predict a sequence of `state` transitions (i.e. a trajectory), consecutive one-step predictions are made to compose a single long-term trajectory."
      - Difference: the function, often non-parametric, is learned from statistical `observations`, and its parameters cannot be directly interpreted.
    - Sub-categories, depending on the `time`-input and `space` range:
      - `2-1.1` Local spatial `transition` patterns.
        - Learning local motion patterns, such as probabilities of `transitions` between cells on a grid-map.
      - `2-1.2` Location-independent behavioural patterns.
        - "Unlike the local transition patterns, which are learned and applied for prediction only in a particular environment, location-independent patterns are used for predicting transitions of an agent in the general free space."
      - `2-1.3` Higher-order `Markov` models.
        - Neural networks for time series prediction.
        - Assuming higher-order Markov property.
        - "Such time series-based models are making a natural transition between the first-order `Markovian` methods (e.g. local transition patterns) and non-sequential techniques (e.g. `clustering`-based)."
        - `LSTM`s are very popular.
        - `Social`-`LSTM`, with its `social pooling` layer, learns to predict joint location-independent transitions in continuous spaces.
          - "Since humans are influenced by nearby people, `LSTMs` are connected in the social pooling system, sharing information from the hidden state of the `LSTMs` with the neighbouring pedestrians."
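The simplest of the sequential sub-categories above, local spatial `transition` patterns, amounts to counting cell-to-cell transitions on a grid-map and rolling the learned first-order model forward. A minimal sketch, assuming trajectories are already discretised into `(row, col)` cells; the function names and the greedy roll-out are assumptions made for the example.

```python
from collections import Counter, defaultdict


def learn_cell_transitions(trajectories):
    """Estimate P(next_cell | cell) from observed cell sequences."""
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for cell, next_cell in zip(trajectory, trajectory[1:]):
            counts[cell][next_cell] += 1
    return {
        cell: {nxt: c / sum(counter.values()) for nxt, c in counter.items()}
        for cell, counter in counts.items()
    }


def predict_cells(model, start, n_steps):
    """Greedy roll-out: follow the most likely transition at each step."""
    path = [start]
    for _ in range(n_steps):
        successors = model.get(path[-1])
        if not successors:  # cell never left in the training data
            break
        path.append(max(successors, key=successors.get))
    return path


if __name__ == "__main__":
    observed = [
        [(0, 0), (1, 0), (2, 0)],
        [(0, 0), (1, 0), (1, 1)],
        [(0, 0), (1, 0), (2, 0)],
    ]
    model = learn_cell_transitions(observed)
    print(model[(1, 0)])
    print(predict_cells(model, (0, 0), n_steps=2))
```

The early stop in `predict_cells` makes the main limitation visible: the model is tied to the environment it was learned in and has nothing to say about unseen cells.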
  - `2-2.` Non-sequential methods.
    - Idea:
      - Directly learn a set of full motion patterns from data or a distribution over trajectories.
      - No temporal factorization of the `dynamics`, as opposed to sequential models (i.e. Markov assumption).
      - "Specifying causal constraints, e.g. `Markovian` assumptions for the sequential models or particular functional form for the `physics`-based methods, might be too restrictive or require a very large learning dataset."
      - Mainly unsupervised with clustering-based methods.
        - E.g. with `GPs` or mixture models as cluster centroids representation.
        - Global structure can be imposed on top of a sequential model:
          - [Example] "Cluster recorded trajectories of humans into global motion patterns using the expectation maximization (`EM`) algorithm and build an `HMM` model for each cluster. For prediction, the method compares the observed track with the learned motion patterns, and reasons about which patterns best explain it. Uncertainty is handled by probabilistic mixing of the most likely patterns."
- `3-` `Planning`-based methods.
  - Idea:
    - `Goal`-informed predictions: reason on motion `intent` of rational agents.
    - `1-` Explicitly reason about the agent’s long-term motion `goals`.
    - `2-` Compute `policies` or path hypotheses that enable the agent to reach these `goals`.
    - They are intrinsically `map`- and `obstacle`-aware.
    - Requirements:
      - They work well when `goals` are explicitly defined.
        - "Most `planning`-based methods rely on a given set of `goals`, which makes them unusable or imprecise in a situation where no `goals` are known beforehand, or the number of possible `goals` is too high."
      - They often need a `map` of the environment.
        - Potential `goals` (e.g. exits on a roundabout) in the environment are often used as prior knowledge.
    - "They tend to generate better long-term predictions than the `physics`-based techniques and generalize to new environments better than the `pattern`-based approaches."
  - They solve a sequential decision making problem:
    - "Much of the work uses objective functions that minimize some notion of the total `cost` of a sequence of `actions` (motions), and not just the `cost` of one `action` in isolation."
  - Two categories: Is the `cost` function pre-defined?
  - `3-1.` "yes": Forward planning.
    - Explicit assumption regarding the optimality criteria.
    - `3-1.1` Motion and path planning methods for one single observed agent.
      - Use optimal motion and path planning techniques with a hand-crafted `cost`-function.
      - Examples of classical planning algorithms: `Dijkstra`, Fast Marching Method (`FMM`), optimal sampling-based motion planners, `value` iteration.
      - Improvement: multi-modality.
        - We do not know where the observed pedestrian wants to go.
        - Idea: account for several goals in the environment.
        - E.g. Bayesian framework that exploits the set of path hypotheses to estimate the intended destination.
          - Hypotheses can be generated from a Probabilistic Roadmap (`PRM`).
    - `3-1.2` Multi-agent forward planning.
      - Idea: consider `interactions` between agents in the scene.
      - E.g. Combine global planning (to goals) and local collision avoidance.
      - E.g. cooperative planning. Consider a joint `state`-space that includes all agents.
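The multi-modal improvement above (several candidate goals, a Bayesian estimate of the intended destination) can be sketched with a hand-crafted cost. The example assumes straight-line distance as the path cost and a Boltzmann-style likelihood that penalises the detour the observed motion represents for each goal. This is one common formulation, not necessarily the one used in the works cited by the survey; `goal_posterior` and `beta` are hypothetical names.

```python
import math


def goal_posterior(start, current, goals, prior=None, beta=1.0):
    """Posterior over goals given the start and the current position.

    The likelihood of a goal decays with the extra cost of going
    start -> current -> goal compared with going start -> goal directly.
    """
    prior = prior or {name: 1.0 / len(goals) for name in goals}
    scores = {}
    for name, goal in goals.items():
        detour = (
            math.dist(start, current)
            + math.dist(current, goal)
            - math.dist(start, goal)
        )
        scores[name] = prior[name] * math.exp(-beta * detour)
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


if __name__ == "__main__":
    goals = {"east": (10.0, 0.0), "north": (0.0, 10.0)}
    # The pedestrian has moved 2 m towards the east exit.
    print(goal_posterior(start=(0.0, 0.0), current=(2.0, 0.0), goals=goals))
```

In a real planner the Euclidean distance would be replaced by the cost-to-go returned by `Dijkstra`, `FMM` or `value` iteration on the `map`, so that obstacles shape the posterior.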
  - `3-2.` "no": Inverse planning methods.
    - Idea:
      - Estimate the `reward` function or `action` model (`policy`) from observed trajectories.
    - `3-2.1` Single-agent inverse planning.
      - Inverse optimal control = `IRL`.
      - [`MaxEnt IRL`] "Humans are assumed to be near-optimal decision makers with stochastic policies, learned from observations, which are used to predict motion as a probability distribution over trajectories."
    - `3-2.2` Directly extract a `policy` from the data.
      - No `reward` function is explicitly learnt.
      - `GAIL`: A discriminator tries to distinguish between observations from experts and generated ones.
      - "Differently from `GAIL`, the `deep generative technique` by Rhinehart et al. (2018) adopts a fully differentiable model, which is easy to train without the need of an expensive policy gradient search."
    - `3-2.3` Multi-agent inverse planning.
      - In addition to being rational, here, agents are assumed to be (somehow) cooperative.
      - `DESIRE`: "The method reasons on multi-modal future trajectories accounting for agent interactions, scene semantics and expected reward function, learned using a sampling-based `IRL` scheme. [...] The `RNN` architecture allows incorporation of past trajectory into the inference process, which improves prediction accuracy compared to the standard `IRL`-based techniques."
- About input contextual cues = factors that influence pedestrian motion.
  - `1-` Describing the target agent.
    - [History of recent] `position` and `velocity`.
    - Class in {`pedestrian`, `cyclist`}.
    - `attention` and `awareness` of the robot's presence.
      - "Considering the `head orientation` or full articulated pose of the person may bring valuable insights on the target agent’s immediate `intentions` or their `awareness` of the environment."
    - How to represent uncertainty? `position`s may come with ellipses denoting uncertainty. How to integrate it?
  - `2-` Describing the surrounding `dynamic` agents.
    - Interaction-`unaware`: simple `physics`-based method, e.g. `CV`, `CA` models.
    - `Individual`-aware: `physics`-based methods with social forces or similar local interaction models.
      - Many deep learning methods explicitly model interacting entities. E.g. `social pooling` layers.
      - Learning `joint` motion patterns is a considerably harder task.
        - "Most existing socially-aware methods still assume that all observed people are behaving similarly and that their motion can be predicted by the same model and with the same features."
        - "Most available approaches assume `cooperative` behavior, while real humans might rather optimize personal `goals` instead of joint strategies. In such cases, game-theoretic approaches are possibly better suited for modeling human behavior."
    - `Group`-aware:
      - Capable of implicitly learning intra- and inter-group coherence behaviours.
  - `3-` Describing the static environment.
    - "Humans adapt their behaviors according not only to the movements of the other agents but also to the environment’s shape and structure, making extensive use of its topology to reason on the possible paths to reach the long-term goal."
    - "Due to the high level of structure in the environment, methods in autonomous driving scenarios extensively use available `semantic` information, such as street layout and traffic rules or current state of the traffic lights."
    - (static) `obstacle`-aware and `map`-aware methods.
      - Mainly with occupancy grid maps. Also `semantic` maps.
"Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions"
- [`2019`] [📝] [🎓 `Caltech`] [🚗 `zoox`]
- [`multimodal`, `probabilistic`, `1-shot`]
The idea is to encode a history of world states (both `static` and `dynamic`) and semantic `map` information in a unified, top-down `spatial grid`. This allows the use of a deep convolutional architecture to model entity dynamics, entity interactions, and scene context jointly. The authors found that temporal convolutions achieved better performance and significantly faster training than an `RNN` structure. Source.
Three `1-shot` models are proposed. Top: parametric and continuous. Bottom: non-parametric and discrete (trajectories are here sampled for display from the state probabilities). Source.
Authors: Hong, J., Sapp, B., & Philbin, J.
- Motivations: A prediction method of future distributions of entity `state` that is:
  - `1-` Probabilistic.
    - "A single most-likely point estimate isn't sufficient for a safety-critical system."
    - "Our perception module also gives us `state` estimation uncertainty in the form of covariance matrices, and we include this information in our representation via covariance norms."
  - `2-` Multimodal.
    - It is important to cover a diversity of possible implicit actions an entity might take (e.g., which way through a junction).
  - `3-` One-shot.
    - It should directly predict distributions of future states, rather than a single point estimate at each future timestep.
    - "For efficiency reasons, it is desirable to predict full trajectories (time sequences of `state` distributions) without iteratively applying a recurrence step."
    - "The problem can be naturally formulated as a sequence-to-sequence generation problem. [...] We chose `ℓ=2.5s` of past history, and predict up to `m=5s` in the future."
    - "`DESIRE` and `R2P2` address multimodality, but both do so via `1`-step stochastic policies, in contrast to ours which directly predicts a time sequence of multimodal distributions. Such policy-based methods require both `future roll-out` and `sampling` to obtain a set of possible trajectories, which has computational trade-offs to our `one-shot` feed-forward approach."
- How to model entity interactions?
  - `1-` Implicitly: By encoding them as surrounding dynamic context.
  - `2-` Explicitly: For instance, `SocialLSTM` pools hidden temporal `state` between entity models.
  - Here, all surrounding entities are encoded within a specific tensor.
- Input:
  - A stack of `2d`-top-view-grids. Each frame has `128×128` pixels, corresponding to `50m×50m`.
    - For instance, the dynamic context is encoded in an `RGB` image with unique colours corresponding to each element type.
    - The `state` history of the considered entity is encoded in a stack of binary maps.
      - One could have used only `1` channel and played with the colour to represent the history.
  - "Note that through rendering, we lose the true graph structure of the road network, leaving it as a modeling challenge to learn valid road rules like legal traffic direction, and valid paths through a junction."
    - Can it not just be coded in another tensor?
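The input encoding described above (a `128×128` grid covering `50m×50m`, with the entity's history rendered as a stack of binary maps) can be sketched as below. The grid size and extent come from the notes. Centring the grid on a reference point, the row/column orientation and the function names are assumptions made for the example, not details confirmed by the paper.

```python
GRID_SIZE = 128    # pixels per side
EXTENT_M = 50.0    # metres covered by one side of the grid


def world_to_pixel(x, y, center=(0.0, 0.0)):
    """Map a world position [m] to (row, col), or None if outside the grid.

    Assumption: the grid is centred on `center`, rows grow downwards (-y).
    """
    resolution = EXTENT_M / GRID_SIZE  # about 0.39 m per pixel
    col = int((x - center[0] + EXTENT_M / 2.0) // resolution)
    row = int((center[1] + EXTENT_M / 2.0 - y) // resolution)
    if 0 <= row < GRID_SIZE and 0 <= col < GRID_SIZE:
        return row, col
    return None


def rasterize_history(history, center=(0.0, 0.0)):
    """Render each past position as one binary GRID_SIZE x GRID_SIZE map."""
    frames = []
    for x, y in history:
        grid = [[0] * GRID_SIZE for _ in range(GRID_SIZE)]
        pixel = world_to_pixel(x, y, center)
        if pixel is not None:
            grid[pixel[0]][pixel[1]] = 1
        frames.append(grid)
    return frames


if __name__ == "__main__":
    frames = rasterize_history([(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)])
    print(len(frames), world_to_pixel(1.0, 0.0))
```

A real pipeline would draw oriented bounding boxes rather than single pixels and stack these maps with the rendered `map` and dynamic-context channels before feeding the convolutional network.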
- Output. Three approaches are proposed:
  - `1-` Continuous + parametric representations.
    - `1.1.` A Gaussian distribution is regressed per future timestep.
    - `1.2.` Multi-modal Gaussian Regression (`GMM-CVAE`).
      - A set of Gaussians is predicted by sampling from a categorical latent variable.
      - If not enhanced, this method is naive and suffers from exchangeability and mode collapse.
      - "In general, our mixture of sampled Gaussian trajectories underperformed our other proposed methods; we observed that some samples were implausible."
        - One could have added an auxiliary loss that penalizes off-road predictions, as in the improved version of `CoverNet`.
  - `2-` Discrete + non-parametric representations.
    - Predict occupancy grid maps.
      - A grid is produced for each future modelled timestep.
      - Each grid location holds the probability of the corresponding output `state`.
      - For comparison, trajectories are extracted via some trajectory sampling procedure.
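The notes do not say which trajectory sampling procedure is used to turn the per-timestep occupancy grids into trajectories. As a rough illustration only, the sketch below shows two naive options: taking the most likely cell at each timestep, or sampling each timestep independently. Both are assumptions, and both ignore temporal consistency between consecutive cells.

```python
import random


def most_likely_cell(grid):
    """Return (row, col) of the highest-probability cell."""
    best_cell, best_p = None, -1.0
    for r, row in enumerate(grid):
        for c, p in enumerate(row):
            if p > best_p:
                best_cell, best_p = (r, c), p
    return best_cell


def sample_cell(grid, rng):
    """Draw one cell with probability proportional to its occupancy."""
    cells = [(r, c) for r, row in enumerate(grid) for c in range(len(row))]
    weights = [p for row in grid for p in row]
    return rng.choices(cells, weights=weights, k=1)[0]


def extract_trajectory(grids, rng=None):
    """One cell per future timestep: arg-max if rng is None, else sampled."""
    if rng is None:
        return [most_likely_cell(grid) for grid in grids]
    return [sample_cell(grid, rng) for grid in grids]


if __name__ == "__main__":
    grids = [
        [[0.1, 0.9], [0.0, 0.0]],  # t = 1
        [[0.0, 0.0], [0.2, 0.8]],  # t = 2
    ]
    print(extract_trajectory(grids))
    print(extract_trajectory(grids, rng=random.Random(0)))
```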
- Against non-learnt baselines:
  - "Interestingly, both `Linear` and `Industry` baselines performed worse relative to our methods at larger time offsets, but better at smaller offsets. This can be attributed to the fact that predicting near futures can be accurately achieved with classical physics (which both baselines leverage); more distant future predictions, however, require more challenging semantic understanding."
"Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios"
- [`2019`] [📝] [🎓 `TUM`] [🚗 `BMW`]
- [`probabilistic predictions`, `multi-modality`]
The network produces an `action` distribution for the next time-step. The features are a function of one selected driver's `route intention` (such as turning `left` or `right`) and the `map`. Redundant features can be pruned to reduce the complexity of the model: even with as few as `5` features (framed in blue), it is possible for the network to learn basic behaviour models that achieve lower losses than both baseline recurrent networks. Source.
Right: At each time step, the confidence in the `action` changes. Left: How to compute a loss from the predicted variance if the ground-truth is a point-estimate? The predicted distribution can be evaluated at the ground-truth, forming a likelihood. The negative log-likelihood becomes the objective to minimize. The network can output a high variance if it is not sure, but a regularization term deters it from being too uncertain. Source.
Authors: Schulz, J., Hubmann, C., Morin, N., Löchner, J., & Burschka, D.
- Motivations:
  - `1-` Learn a driver model that is:
    - Probabilistic. I.e. capture multi-modality and uncertainty in the predicted low-level actions.
    - Interaction-aware. Well, here the `actions` of surrounding vehicles are ignored, but their `states` are considered.
    - "Markovian", i.e. that makes `1`-step prediction from the current `state`, assuming independence of previous `state`s / `action`s.
  - `2-` Simplicity + lightweight.
    - This model is intended to be integrated as a probabilistic transition model into sampling-based algorithms, e.g. `particle filtering`.
    - Applications include:
      - `1-` Forward simulation-based interaction-aware `planning` algorithms, e.g. `Monte Carlo tree search`.
      - `2-` Driver intention estimation and trajectory `prediction`, here a `DBN` example.
    - Since samples are plenty, runtime should be kept low. And therefore, nested net structures such as `DESIRE` are excluded.
  - Ingredients:
    - Feedforward net predicting `steering` and `acceleration` distributions.
    - Enable multi-modality by building one `input vector`, and making one prediction, per possible `route`.
- About the model:
  - Input: a set of features built from:
    - One route intention. For instance, the distances of both agents to `entry` and `exit` of the related conflict areas are computed.
    - The map.
    - The kinematic state (`pos`, `heading`, `vel`) of the `2` closest agents.
  - Output: `steering` and `acceleration` distributions, modelled as Gaussian: `mean` and `std` are estimated (`cov` = `0`).
    - Not the next `state`!!
      - "Instead of directly learning a `state` transition model, we restrict the neural network to learn a `2`-dimensional `action` distribution comprising `acceleration` and `steering angle`."
    - Practical implication when building the dataset from real data: the `action`s of observed vehicles are unknown, but inferred using an `inverse bicycle model`.
  - Using the model at run time:
    - `1-` Sample the possible `routes`.
    - `2-` For each route:
      - Start with one `state`.
      - Get one `action` distribution. Note that the uncertainty can change at each step.
      - Sample (`acc`, `steer`) from this distribution.
      - Move to next `state`.
      - Repeat.
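Both the training objective mentioned in the figure caption (negative log-likelihood of the observed `action` under the predicted Gaussian) and the run-time loop listed above can be sketched in a few lines. Everything below is illustrative: `toy_policy` is a hypothetical stand-in for the trained feedforward net, the forward kinematic bicycle model with its `wheelbase` and `dt` values is an assumption, and the regularization term mentioned in the caption is not reproduced because the notes do not specify it.

```python
import math
import random


def gaussian_nll(mean, std, target):
    """Negative log-likelihood of `target` under N(mean, std^2)."""
    return 0.5 * math.log(2.0 * math.pi * std ** 2) + 0.5 * ((target - mean) / std) ** 2


def bicycle_step(state, acc, steer, dt=0.2, wheelbase=2.7):
    """Kinematic bicycle update. state = (x, y, heading, vel)."""
    x, y, heading, vel = state
    x += vel * math.cos(heading) * dt
    y += vel * math.sin(heading) * dt
    heading += vel / wheelbase * math.tan(steer) * dt
    vel = max(0.0, vel + acc * dt)
    return (x, y, heading, vel)


def toy_policy(state, route):
    """Hypothetical stand-in for the network: (mean, std) for acc and steer."""
    steer_mean = {"left": 0.1, "straight": 0.0, "right": -0.1}[route]
    return (0.0, 0.3), (steer_mean, 0.01)


def rollout(policy, state, route, n_steps, rng):
    """Route-conditioned forward simulation by repeated 1-step sampling."""
    trajectory = [state]
    for _ in range(n_steps):
        (acc_mean, acc_std), (steer_mean, steer_std) = policy(state, route)
        acc = rng.gauss(acc_mean, acc_std)        # sample (acc, steer) ...
        steer = rng.gauss(steer_mean, steer_std)
        state = bicycle_step(state, acc, steer)   # ... and move to next state
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    rng = random.Random(0)
    start = (0.0, 0.0, 0.0, 5.0)
    for route in ("left", "straight", "right"):  # one roll-out per route
        print(route, rollout(toy_policy, start, route, n_steps=10, rng=rng)[-1])
    print(gaussian_nll(mean=0.0, std=0.3, target=0.2))
```

Keeping the multi-modality in the outer loop over `routes` is what lets the network itself stay unimodal and cheap, which matters when the model is called once per particle or per tree-search node.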
- Issue with the accumulation of `1`-step predictions to form long-term predictions:
  - As in vanilla imitation learning, it suffers from distribution shift resulting from the accumulating errors.
    - "If this error is too high, the features determined during forward simulation are not represented within the training data anymore."
    - A `DAgger`-like solution could be considered.
- About conditioning on a driver's route intention:
  - Without it, one would have to pack all the road info in the input (how many routes to describe?) and expect multiple trajectories to be produced (how many output heads?). Tricky.
  - Conditioning offers two advantages:
    - "The learning algorithm does not have to cope with the multi-modality induced by different route options. The varying number of possible `routes` (depending on the road topology) is handled outside of the neural network."
    - It also allows defining (and detailing) relevant features along the considered path: upcoming `road curvature` or longitudinal distances to `stop lines`.
  - Limit to this approach (again related to the "off-distribution" issue):
    - "When enumerating all possible routes and running a forward simulation for each of the conditioned models, there might exist route candidates that are so unlikely that they have never been followed in the training data. Thus their features may result in unreasonable actions during inference, as the network only learns what actions are reasonable given a route, but not which routes are reasonable given a situation."
"Learning Predictive Models From Observation and Interaction"
- [`2019`] [📝] [🎞️] [🎓 `University of Pennsylvania`, `Stanford University`, `UC Berkeley`] [🚗 `Honda`]
- [`visual prediction`, `domain transfer`, `nuScenes`, `BDD100K`]
The idea is to learn a latent representation `z` that corresponds to the true `action`. The model can then perform joint training on the two kinds of data: it optimizes the likelihood of the interaction data, for which the `action`s are available, and of the observation data, for which the `action`s are missing. Hence the visual predictive model can predict the next frame `x_{t+1}` conditioned on the current frame `x_t` and the learnt action representation `z_t`. Source.
The visual prediction model is trained using two driving sets: `action`-conditioned videos from Boston and `action`-free videos from Singapore. Frames from both subsets come from the `BDD100K` and `nuScenes` datasets. Source.
Authors: Schmeckpeper, K., Xie, A., Rybkin, O., Tian, S., Daniilidis, K., Levine, S., & Finn, C.
- One concrete industrial use-case:
  - "Imagine that a self-driving car company has data from a fleet of cars with sensors that record both `video` and the driver’s `actions` in one city, and a second fleet of cars that only record dashboard `video`, without `action`s, in a second city."
  - "If the goal is to train an `action`-conditioned model that can be utilized to predict the outcomes of steering `action`s, our method allows us to train such a model using data from both cities, even though only one of them has `action`s."
- Motivations (mainly for robotics, but also AD):
  - Generate predictions for complex tasks and new environments, without costly expert demonstrations.
  - More precisely, learn an `action`-conditioned video predictive model from two kinds of data:
    - `1-` passive observations: [`x0`, `x1` ... `xN`].
      - Videos of another agent, e.g. a human, might show the robot how to use a tool.
      - Observations represent a powerful source of information about the world and how actions lead to outcomes.
      - A learnt model could also be used for `planning` and `control`, i.e. to plan coordinated sequences of actions to bring about desired outcomes.
      - But may suffer from large domain shifts.
    - `2-` active interactions: [`x0`, `a1`, `x1` ... `aN`, `xN`].
      - Usually more expensive.
- Two challenges:
  - `1-` Observations are not annotated with suitable `action`s: e.g. only access to the dashcam, not the `throttle` for instance.
    - In other words, `action`s are only observed in a subset of the data.
    - The goal is to learn from videos without `action`s, allowing the model to leverage videos of agents for which the actions are unknown (unsupervised manner).
  - `2-` Shift in the "embodiment" of the agent: e.g. robots' arms and humans' ones have physical differences.
    - The goal is to bridge the gap between the two domains (e.g., `human arms` vs. `robot arms`).
- What is learnt? `p(x_{c+1:T} | x_{1:c}, a_{1:T})`
  - I.e. prediction of future frames conditioned on a set of `c` context frames and a sequence of actions.
- What tests?
  - `1-` Different environment within the same underlying dataset: driving in `Boston` and `Singapore`.
  - `2-` Same environment but different embodiment: `humans` and `robots` manipulate objects with different arms.
- What is assessed?
  - `1-` Prediction quality (`AD` test).
  - `2-` Control performance (`robotics` test).
"Deep Learning-based Vehicle Behaviour Prediction For Autonomous Driving Applications: A Review"
- [`2019`] [📝] [🎓 `University of Warwick`] [🚗 `Jaguar Land Rover`]
- [`multi-modality prediction`]
The authors propose a new classification of behavioural prediction methods. Only deep learning approaches are considered and `physics`-based approaches are excluded. The criteria are about the `input`, `output` and `deep learning method`. Source.
First criterion is about the input: What is the prediction based on? It is important to capture road structure and interactions while staying flexible in the representation (e.g. describe different types of intersections and work with varying numbers of `target vehicles` and `surrounding vehicles`). Partial observability should be considered by design. Source.
Second criterion is about the output: What is predicted? It is important to propagate the uncertainty from the input and consider multiple options (multi-modality), and therefore to reason with probabilities. Bottom: why multi-modality is important. Source.
Authors: Mozaffari, S., Al-Jarrah, O. Y., Dianati, M., Jennings, P., & Mouzakitis, A.
- One mentioned review (Lefèvre et al.) classifies vehicle (behaviour) prediction models into three groups:
  - `1-` `physics`-based
    - Use dynamic or kinematic models of vehicles, e.g. a constant velocity (`CV`) Kalman Filter model.
  - `2-` `manoeuvre`-based
    - Predict vehicles' manoeuvres, i.e. a classification problem from a defined set.
  - `3-` `interaction`-aware
    - Consider interaction of vehicles in the input.
- About the terminology:
  - "Target Vehicles" (`TV`) are vehicles whose behaviour we are interested in predicting.
  - The others are "Surrounding Vehicles" (`SV`).
  - The "Ego Vehicle" (`EV`) can also be considered as an `SV`, if it is close enough to `TV`s.
- Here, the authors ignore the `physics`-based methods and propose three criteria for comparison:
  - `1-` Input.
    - Track history of `TV` only.
    - Track history of `TV` and `SV`s.
    - Simplified bird’s eye view.
    - Raw sensor data.
  - `2-` Output.
    - Intention `class`: From a set of pre-defined discrete classes, e.g. `go straight`, `turn left`, and `turn right`.
    - Unimodal `trajectory`: Usually the one with the highest likelihood (or the average).
    - Intention-based `trajectory`: Predict the trajectory that corresponds to the most probable intention (first case).
    - Multimodal `trajectory`: Combine the previous ones. Two options, depending on whether the intention set is fixed or dynamically learnt:
      - `static` intention set: predict for each member of the set (an extension to intention-based trajectory prediction approaches).
      - `dynamic` intention set: due to the dynamic definition of manoeuvres, these approaches are prone to converge to a single manoeuvre or to fail to explore all the existing manoeuvres.
  - `3-` In-between (deep learning method).
    - `RNN`s are used because of their temporal feature extracting power.
    - `CNN`s are used for their spatial feature extracting ability (especially with bird’s eye views).
- Important considerations for behavioural prediction:
  - Traffic rules.
  - Road geometry.
  - Multimodality: there may exist more than one possible future behaviour.
  - Interaction.
  - Uncertainty: both `aleatoric` (measurement noise) and `epistemic` (partial observability). Hence the prediction should be probabilistic.
  - Prediction horizon: approaches can serve different purposes based on how far in the future they predict (`short-term` or `long-term` future motion).
- Two methods I would like to learn more about:
  - `social pooling` layers, e.g. used by (Deo & Trivedi, 2019):
    - "A social tensor is a spatial grid around the target vehicle that the occupied cells are filled with the processed temporal data (e.g., `LSTM` hidden state value) of the corresponding vehicle. It contains both the temporal dynamic of vehicles represented and spatial inter-dependencies among them."
  - `graph` neural networks, e.g. (Diehl et al., 2019) or (Li et al., 2019):
    - Graph Convolutional Network (`GCN`).
    - Graph Attention Network (`GAT`).
- Comments:
  - Contrary to the object detection task, there is no benchmark for systematically evaluating previous studies on vehicle behaviour prediction.
    - Urban scenarios are excluded in the comparison since `NGSIM I-80` and `US-101 highway` driving datasets are used.
    - Maybe the `INTERACTION Dataset` could be used.
  - The authors suggest embedding domain knowledge in the prediction, and call for practical considerations (industry-supported research).
    - "Factors such as environment conditions and set of traffic rules are not directly inputted to the prediction model."
    - "Practical limitations such as sensor impairments and limited computational resources have not been fully taken into account."
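To make the quoted "social tensor" idea more concrete, here is a minimal sketch of how such a grid could be filled. It is only an illustration, not the implementation of (Deo & Trivedi, 2019): the function name `build_social_tensor`, the grid shape, the cell size and the axis convention (longitudinal offset first, lateral offset second) are all assumptions of mine.

```python
import numpy as np

def build_social_tensor(target_xy, neighbours_xy, neighbours_h,
                        grid_shape=(13, 3), cell_size=(4.5, 3.7)):
    """Fill a grid centred on the target vehicle with neighbour encodings.

    target_xy:     (2,)   position of the target vehicle [m]
    neighbours_xy: (N, 2) positions of the surrounding vehicles [m]
    neighbours_h:  (N, H) encoded track history (e.g. an LSTM hidden state)
    grid_shape and cell_size are illustrative values, not taken from a paper.
    """
    rows, cols = grid_shape
    tensor = np.zeros((rows, cols, neighbours_h.shape[1]))
    for xy, h in zip(neighbours_xy, neighbours_h):
        dx, dy = xy - target_xy  # longitudinal and lateral offsets
        r = int(np.floor(dx / cell_size[0] + rows / 2))
        c = int(np.floor(dy / cell_size[1] + cols / 2))
        if 0 <= r < rows and 0 <= c < cols:  # vehicles outside the grid are ignored
            tensor[r, c] = h
    return tensor
```

Empty cells stay at zero, so the tensor encodes both which cells are occupied (spatial relation) and what each occupant has been doing (temporal encoding). A `CNN` can then be applied on top of it.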
"Multi-Modal Simultaneous Forecasting of Vehicle Position Sequences using Social Attention"
- [ `2019` ] [📝] [ 🎓 `Ecole CentraleSupelec` ] [ 🚗 `Renault` ]
- [ `multi-modality prediction`, `attention mechanism` ]
Click to expand
Two multi-head attention layers are used to account for social interactions between all vehicles. They are combined with LSTM layers to offer joint, long-range and multi-modal forecasts. Source. |
Source. |
Authors: Mercat, J., Gilles, T., Zoghby, N. El, Sandou, G., Beauvois, D., & Gil, G. P.
- Previous work: `"Social Attention for Autonomous Decision-Making in Dense Traffic"` by (Leurent, & Mercat, 2019), detailed on this page as well.
- Motivations:
  - `1-` `joint` - Considering interactions between all vehicles.
  - `2-` `flexible` - Independent of the number/order of vehicles.
  - `3-` `multi-modal` - Considering uncertainty.
  - `4-` `long-horizon` - Predicting over a long range. Here `5s` on simple highway scenarios.
  - `5-` `interpretable` - E.g. using the social attention coefficients.
  - `6-` `long distance interdependencies` - The authors decide to exclude the spatial grid representations that "limit the zone of interest to a predefined fixed size and the spatial relation precision to the grid cell size".
- Main idea: Stack `LSTM` layers with `social` `multi-head` `attention` layers.
  - More precisely, the model is broken into four parts:
    - `1-` An `Encoder` processes the sequences of all vehicle positions (no information about `speed`, `orientation`, `size` or `blinker`).
    - `2-` A `Self-attention` layer captures interactions between all vehicles using "dot product attention". It has multiple heads, each specializing in a different interaction pattern, e.g. `"closest front vehicle in any lane"`.
    - `3-` A `Predictor`, using `LSTM` cells, forecasts the positions.
      - A second multi-head self-attention layer is placed here.
    - `4-` A final `Decoder` produces sequences of Gaussian mixtures for each vehicle.
      - "What is forecast is not a mixture of trajectory density functions but a sequence of position mixture density functions. There is a dependency between forecasts at time `tk` and at time `tk+1` but no explicit link between the modes at those times."
- Two quotes about multi-modality prediction:
  - "When considering multiple modes, there is a challenging trade-off to find between anticipating a wide diversity of modes and focusing on realistic ones".
  - "`VAE` and `GANs` are only able to generate an output distribution with sampling and do not express a `PDF`".
- Baselines used to compare the presented "Social Attention Multi-Modal Prediction" approach:
  - Constant velocity (`CV`), that uses Kalman filters (hence single modality).
  - Convolutional Social Pooling (`CSP`), that uses convolutional social pooling on a coarse spatial grid. Six mixture components are used.
  - Graph-based Interaction-aware Trajectory Prediction (`GRIP`), that uses a `spatial` and `temporal` graph representation of the scene.
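The "dot product attention" with several heads mentioned above can be sketched in a few lines of `numpy`. This is a generic multi-head self-attention layer, not the authors' code: the function names, the weight shapes and the absence of an output projection are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(h, w_q, w_k, w_v):
    """Scaled dot-product self-attention between vehicles.

    h:             (n_vehicles, d) one encoding per vehicle
    w_q, w_k, w_v: (n_heads, d, d_head) hypothetical projection weights
    Returns the concatenated head outputs and the attention coefficients.
    """
    q = np.einsum("nd,hde->hne", h, w_q)
    k = np.einsum("nd,hde->hne", h, w_k)
    v = np.einsum("nd,hde->hne", h, w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (heads, n, n)
    attention = softmax(scores, axis=-1)  # each row sums to 1
    heads = attention @ v                 # (heads, n, d_head)
    return np.concatenate(list(heads), axis=-1), attention
```

Two of the listed motivations can be read directly from this sketch. The layer is `flexible`: nothing depends on the number of vehicles, and permuting the input rows simply permutes the output rows. It is also `interpretable`: `attention[head, i, j]` tells how much vehicle `i` attends to vehicle `j` for that head.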
"MultiPath : Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction"
- [ `2019` ] [📝] [ 🚗 `Waymo` ]
- [ `anchor`, `multi-modality prediction`, `weighted prediction`, `mode collapse` ]
Click to expand
Source. |
A discrete set of intents is modelled as a set of K=3 anchor trajectories. Uncertainty is assumed to be unimodal given intent (here 3 intents are considered) while control uncertainty is modelled with a Gaussian distribution dependent on each waypoint state of an anchor trajectory. Such an example shows that modelling multiple intents is important. Source. |
Authors: Chai, Y., Sapp, B., Bansal, M., & Anguelov, D.
- One idea: "Anchor Trajectories".
  - "Anchor" is a common idea in `ML`. Concrete applications of "anchor" methods for `AD` include `Faster-RCNN` and `YOLO` for object detection.
    - Instead of directly predicting the size of a bounding box, the `NN` predicts offsets from a predetermined set of boxes with particular height-width ratios. This predetermined set of boxes forms the anchor boxes. (explanation from this page).
  - One could therefore draw a parallel between the sizes of bounding boxes in `Yolo` and the shape of trajectories: they could be approximated with some static predetermined patterns and refined to the current context (the actual task of the `NN` here).
    - "After doing some clustering studies on ground truth labels, it turns out that most bounding boxes have certain height-width ratios." [explanation about Yolo from this page]
    - "Our trajectory anchors are modes found in our training data in state-sequence space via unsupervised learning. These anchors provide templates for coarse-granularity futures for an agent and might correspond to semantic concepts like `change lanes`, or `slow down`." [from the presented paper]
  - This idea also reminds me of the concept of `pre-defined templates` used for path planning.
- One motivation: model multiple intents.
  - This contrasts with the numerous approaches which predict one single most-likely trajectory per agent, usually via supervised regression.
  - The multi-modality is important since prediction is inherently stochastic.
    - The authors distinguish between `intent uncertainty` and `control uncertainty` (conditioned on intent).
  - A Gaussian Mixture Model (`GMM`) distribution is used to model both types of uncertainty.
    - "At inference, our model predicts a discrete distribution over the anchors and, for each anchor, regresses offsets from anchor waypoints along with uncertainties, yielding a Gaussian mixture at each time step."
- One risk when working with multi-modality: directly learning a mixture suffers from issues of "mode collapse".
  - This issue is common in `GAN` where the generator starts producing limited varieties of samples.
  - The solution implemented here is to estimate the anchors a priori before fixing them to learn the rest of the parameters (as for `Faster-RCNN` and `Yolo` for instance).
- Second motivation: weight the several trajectory predictions.
  - This contrasts with methods that randomly sample from a generative model (e.g. `CVAE` and `GAN`), leading to an unweighted set of trajectory samples (not to mention the problem of reproducibility and analysis).
  - Here, a parametric probability distribution is directly predicted: p(`trajectory` | `observation`), together with a compact weighted set of explicit trajectories which summarizes this distribution well.
    - This contrasts with methods that output a probabilistic occupancy grid.
- About the "top-down" representation, structured in a `3d` array:
  - The first `2` dimensions represent spatial locations in the top-down image.
  - "The channels in the depth dimension hold `static` and time-varying (`dynamic`) content of a fixed number of previous time steps."
    - Static context includes `lane connectivity`, `lane type`, `stop lines`, `speed limit`.
    - Dynamic context includes `traffic light states` over the past `5` time-steps.
    - The previous positions of the different dynamic objects are also encoded in some depth channels.
- One word about the training dataset.
  - The model is trained via `imitation learning` by fitting the parameters to maximize the log-likelihood of recorded driving trajectories.
  - "The balanced dataset totals `3.85 million` examples, contains `5.75 million` agent trajectories and constitutes approximately `200 hours` of (real-world) driving."
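The decomposition into `intent uncertainty` (which anchor?) and `control uncertainty` (Gaussian offsets around the anchor waypoints) leads to a simple negative log-likelihood. The sketch below reflects my understanding and rests on assumptions: the ground truth is hard-assigned to its nearest anchor (squared L2 distance), covariances are diagonal, and all names (`multipath_nll`, `anchor_logits`, `log_sigmas`) are mine.

```python
import numpy as np

def multipath_nll(gt, anchors, anchor_logits, offsets, log_sigmas):
    """Negative log-likelihood of one ground-truth trajectory.

    gt:            (T, 2)    recorded future positions
    anchors:       (K, T, 2) fixed anchor trajectories (estimated a priori)
    anchor_logits: (K,)      predicted unnormalised scores over anchors
    offsets:       (K, T, 2) predicted offsets from the anchor waypoints
    log_sigmas:    (K, T, 2) predicted log std-dev (diagonal covariance: assumption)
    """
    # Intent: hard-assign the ground truth to its closest anchor (assumption).
    k = int(np.argmin(((anchors - gt) ** 2).sum(axis=(1, 2))))
    shifted = anchor_logits - anchor_logits.max()
    log_pi = shifted - np.log(np.exp(shifted).sum())  # log-softmax over anchors
    # Control: independent Gaussian per waypoint and per coordinate.
    mean = anchors[k] + offsets[k]
    z = (gt - mean) / np.exp(log_sigmas[k])
    log_gauss = -0.5 * z ** 2 - log_sigmas[k] - 0.5 * np.log(2 * np.pi)
    return -(log_pi[k] + log_gauss.sum()), k
```

Because the anchors are fixed before training, only the weights, offsets and uncertainties are learnt, which is how the approach sidesteps the mode collapse of a freely learnt mixture.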
"SafeCritic: Collision-Aware Trajectory Prediction"
- [ `2019` ] [📝] [ 🎓 `University of Amsterdam` ] [ 🚗 `BMW` ]
- [ `Conditional GAN` ]
Click to expand
The Generator predicts trajectories that are scored against two criteria: the Discriminator (as in GAN) for accuracy (i.e. consistent with the observed inputs) and the Critic (the generator acts as an Actor) for safety. The random noise vector variable z in the Generator can be sampled from N(0, 1) to sample novel trajectories. Source. |
Several features offered by the predictions of SafeCritic : accuracy, diversity, attention and safety. Source. |
Authors: van der Heiden, T., Nagaraja, N. S., Weiss, C., & Gavves, E.
- Main motivation:
  - "We argue that one should take into account `safety`, when designing a model to predict future trajectories. Our focus is to generate trajectories that are not just `accurate` but also lead to minimum collisions and thus are `safe`. Safe trajectories are different from trajectories that try to imitate the ground truth, as the latter may lead to `implausible` paths, e.g, pedestrians going through walls."
  - Hence the trajectory predictions of the Generator are evaluated against multiple criteria:
    - `Accuracy`: The Discriminator checks if the prediction is coherent / plausible with the observation.
    - `Safety`: Some Critic predicts the likelihood of a future dynamic and static collision.
  - A third loss term is introduced:
    - "Training the generator is harder than training the discriminator, leading to slow convergence or even failure."
    - An additional auto-encoding loss to the ground truth is introduced.
      - It should encourage the model to avoid trivial solutions and mode collapse, and should increase the diversity of future generated trajectories.
      - The term `mode collapse` means that instead of suggesting multiple trajectory candidates (`multi-modal`), the model restricts its prediction to only one instance.
- About `RL`:
  - The authors mention several terms related to `RL`; in particular, they try to draw a parallel with `Inverse RL`:
    - "`GANs` resemble `IRL` in that the discriminator learns the cost function and the generator represents the policy."
  - I got the gist of that idea, but I honestly did not understand where it was implemented here. In particular, no `MDP` formulation is given.
- About attention mechanism:
  - "We rely on attention mechanism for spatial relations in the scene to propose a compact representation for modelling interaction among all agents [...] We employ an attention mechanism to prioritize certain elements in the latent state representations."
  - The grid-like scene representation is shared by both the Generator and the Critic.
- About the baselines:
  - I like the "related work" section, which briefly introduces the state-of-the-art trajectory prediction models based on deep learning. `SafeCritic` takes inspiration from some of their ideas, such as:
    - Aggregation of past information about multiple agents in a recurrent model.
    - Use of Conditional `GAN` to offer the possibility to also generate novel trajectories given observations via sampling (standard `GANs` have no encoder).
    - Generation of multi-modal future trajectories.
    - Incorporation of semantic visual features (extracted by deep networks) combined with an attention mechanism.
  - `SocialGAN`, `SocialLSTM`, `Car-Net`, `SoPhie` and `DESIRE` are used as baselines.
  - `R2P2` and `SocialAttention` are also mentioned.
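The Critic has to learn what a "dynamic collision" is, so some collision signal is needed. The sketch below shows one plausible way to compute it for a predicted trajectory. It is my own illustration and not taken from the paper: the function name, the point-mass agents and the `radius` threshold are assumptions.

```python
import numpy as np

def dynamic_collision(pred, others, radius=0.5):
    """Return True if the predicted trajectory gets too close to another agent.

    pred:   (T, 2)    predicted positions of the target agent
    others: (N, T, 2) positions of the other agents at the same time steps
    radius: distance [m] under which two agents are considered colliding (assumption)
    """
    dist = np.linalg.norm(others - pred[None], axis=-1)  # (N, T) pairwise distances
    return bool((dist < radius).any())
```

A static collision check would look similar, with the other agents replaced by occupied cells of the scene representation.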
"A Review of Tracking, Prediction and Decision Making Methods for Autonomous Driving"
- [ `2019` ] [📝] [ 🎓 `University of Iasi` ]
Click to expand
One figure:
Classification of motion models based on three increasingly abstract levels - adapted from (Lefèvre, S., Vasquez. D. & Laugier C. - 2014). Source. |
Authors: Leon, F., & Gavrilescu, M.
- A reference to one white paper: "Safety first for automated driving" 2019 - from Aptiv, Audi, Baidu, BMW, Continental, Daimler, Fiat Chrysler Automobiles, HERE, Infineon, Intel and Volkswagen (alphabetical order). The authors quote some of the good practices about Interpretation and Prediction:
  - Predict only a short time into the future (the further the predicted state is in the future, the less likely it is that the prediction is correct).
  - Rely on physics where possible (a vehicle driving in front of the automated vehicle will not stop in zero time on its own).
  - Consider the compliance of other road users with traffic rules.
- Miscellaneous notes about prediction:
  - The authors point out the need for high-level reasoning (the more abstract the feature, the more reliable it is in the long term), mentioning both "affinity" and "attention" mechanisms.
  - They also call for jointly addressing vehicle motion modelling and risk estimation (criticality assessment).
  - Gaussian Processes are found to be a flexible tool for modelling motion patterns and are compared to Markov Models for prediction.
    - In particular, GP regressions have the ability to quantify uncertainty (e.g. occlusion).
  - "CNNs can be superior to LSTMs for temporal modelling since trajectories are continuous in nature, do not have complicated "state", and have high spatial and temporal correlations".
"Deep Predictive Autonomous Driving Using Multi-Agent Joint Trajectory Prediction and Traffic Rules"
Click to expand
One figure:
The framework consists of four modules: encoder module, interaction module, prediction module and control module. Source. |
Authors: Cho, K., Ha, T., Lee, G., & Oh, S.
- One previous work: "Learning-Based Model Predictive Control under Signal Temporal Logic Specifications" by (Cho & Oh, 2018).
- One term: "robustness slackness" for `STL`-formula.
  - The motivation is to solve dilemma situations (inherent to strict compliance when all rules cannot be satisfied) by disobeying certain rules based on their predicted degree of satisfaction.
  - The idea is to filter out non-plausible trajectories in the prediction step to only consider valid prediction candidates during planning.
  - The filter considers some "rules" such as `Lane keeping` and `Collision avoidance of front vehicle` or `Speed limit` (I did not understand why they are equally considered).
  - These rules are represented by Signal Temporal Logic (`STL`) formulas.
    - Note: `STL` is an extension of Linear Temporal Logic (with boolean predicates and discrete-time) with real-time and real-valued constraints.
  - A metric can be introduced to measure how well a given signal (here, a trajectory candidate) satisfies an `STL` formula.
    - This is called "robustness slackness" and acts as a margin to satisfaction of the `STL`-formula.
  - This enables a "control under temporal logic specification" as mentioned by the authors.
- Architecture
  - Encoder module: The observed trajectories are fed to some `LSTM` whose internal state is used by the two subsequent modules.
  - Interaction module: To consider interaction, all `LSTM` states are concatenated (joint state) together with a feature vector of relative distances. In addition, a CVAE is used for multi-modality (several possible trajectories are generated) and to capture interactions (I did not fully understand that point), as stated by the authors:
    - "The latent variable `z` models inherent structure in the interaction of multiple vehicles, and it also helps to describe underlying ambiguity of future behaviours of other vehicles."
  - Prediction module: Based on the `LSTM` states, the concatenated vector and the latent variable, both future trajectories and margins to the satisfaction of each rule are predicted.
  - Control module: An `MPC` optimizes the control of the ego car, deciding which rules should be prioritized based on the two predicted objects (trajectories and robustness slackness).
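To get a feeling for what a "margin to satisfaction" means, here is a minimal sketch of the standard quantitative semantics of `STL` for two "always" rules. The rule definitions, thresholds and function names are hypothetical examples of mine, not the formulas used in the paper.

```python
import numpy as np

def robustness_always_leq(signal, upper):
    """Robustness of 'always (signal <= upper)': the worst-case margin over time."""
    return float(np.min(upper - np.asarray(signal, dtype=float)))

def robustness_always_geq(signal, lower):
    """Robustness of 'always (signal >= lower)': the worst-case margin over time."""
    return float(np.min(np.asarray(signal, dtype=float) - lower))

def robustness_and(*rhos):
    """Robustness of a conjunction: the least satisfied sub-formula dominates."""
    return min(rhos)

# Hypothetical trajectory candidate: speeds [m/s] and gaps to the front vehicle [m].
speeds = [10.0, 12.0, 13.5]
gaps = [20.0, 15.0, 8.0]
rho_speed = robustness_always_leq(speeds, upper=13.9)  # positive: rule satisfied
rho_gap = robustness_always_geq(gaps, lower=10.0)      # negative: rule violated
rho_all = robustness_and(rho_speed, rho_gap)
```

A positive value means that the rule is satisfied with some margin, a negative one that it is violated. Having one value per rule is what lets a controller decide which rule to relax in a dilemma situation.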
"An Online Evolving Framework for Modeling the Safe Autonomous Vehicle Control System via Online Recognition of Latent Risks"
- [ `2019` ] [📝] [ 🎓 `Ohio State University` ] [ 🚗 `Ford` ]
- [ `MDP`, `action-state transitions matrix`, `SUMO`, `risk assessment` ]
Click to expand
One figure:
Both the state space and the transition model are adapted online, offering two features: prediction about the next state and detection of unknown (i.e. risky ) situations. Source. |
Authors: Han, T., Filev, D., & Ozguner, U.
- Motivation
  - "Rule-based and supervised-learning methods cannot recognize unexpected situations so that the AV controller cannot react appropriately under unknown circumstances."
  - Based on their previous work on RL “Highway Traffic Modeling and Decision Making for Autonomous Vehicle Using Reinforcement Learning” by (You, Lu, Filev, & Tsiotras, 2018).
- Main ideas: Both the state space and the transition model (here discrete state space so transition matrices) of an MDP are adapted online.
  - I understand it as trying to learn the transition model (experience is generated using `SUMO`), hence to some extent going toward model-based RL.
  - The motivation is to assist any AV control framework with a so-called "evolving Finite State Machine" (`e`-`FSM`).
    - By identifying state-transitions precisely, the future states can be predicted.
    - By determining states uniquely (using online-clustering methods) and recognizing the state consistently (expressed by a probability distribution), initially unexpected dangerous situations can be detected.
  - It reminds me of some ideas about risk assessment discussed during IV19: the discrepancy between expected outcome and observed outcome is used to quantify risk (i.e. the surprise or misinterpretation of the current situation).
- Some concerns:
  - "The dimension of transition matrices should be expanded to represent state-transitions between all existing states"
    - What happens when the scenario gets more complex than the presented "simple car-following" and the state space (treated as discrete) becomes huge?
  - In addition, "the total number of transition matrices is identical to the total number of actions".
    - For the simple example alone, the acceleration command was sampled into `17` bins. Continuous action spaces are not an option.
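The two quoted constraints (one matrix per action, every matrix expanded when a state is added) are easy to see in code. The class below is a count-based sketch of my own: the name `EvolvingTransitionModel`, the frequency estimate and the uniform fallback for unvisited pairs are assumptions, and the online clustering that decides when to create a state is left out.

```python
import numpy as np

class EvolvingTransitionModel:
    """One count-based transition matrix per discrete action, growing online."""

    def __init__(self, n_actions):
        self.n_states = 0
        self.counts = [np.zeros((0, 0)) for _ in range(n_actions)]

    def add_state(self):
        """Register a new state: every matrix gains one row and one column."""
        n = self.n_states
        grown = []
        for c in self.counts:
            new = np.zeros((n + 1, n + 1))
            new[:n, :n] = c  # keep what was learnt so far
            grown.append(new)
        self.counts = grown
        self.n_states = n + 1
        return n  # index of the new state

    def observe(self, s, a, s_next):
        self.counts[a][s, s_next] += 1

    def transition_probs(self, s, a):
        row = self.counts[a][s]
        total = row.sum()
        if total == 0:  # unvisited pair: uniform fallback (assumption)
            return np.full(self.n_states, 1.0 / self.n_states)
        return row / total
```

With `17` acceleration bins, adding the `n`-th state touches `17` matrices of size `n × n`, which illustrates the scaling concern raised above.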
"A Driving Intention Prediction Method Based on Hidden Markov Model for Autonomous Driving"
- [ `2019` ] [📝] [ 🎓 `IEEE` ]
- [ `HMM`, `Baum-Welch algorithm`, `forward algorithm` ]
Click to expand
One figure:
Source. |
Authors: Liu, S., Zheng, K., Zhao, L., & Fan, P.
- One term: "mobility feature matrix"
  - The recorded data (e.g. absolute positions, timestamps ...) are processed to form the mobility feature matrix (e.g. speed, relative position, lateral gap in lane ...).
  - Its size is `T × L × N`: `T` time steps, `L` vehicles, `N` types of mobility features.
  - In the discrete characterization, this matrix is then turned into a set of observations using K-means clustering.
  - In the continuous case, mobility features are modelled as Gaussian mixture models (GMMs).
- This work implements HMM concepts presented in my project Educational application of Hidden Markov Model to Autonomous Driving.
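For the discrete characterization, intention prediction with the `forward algorithm` can be sketched as follows: one HMM per intention (its parameters would come from `Baum-Welch`), and the intention whose HMM best explains the observed symbols wins. The function names and the dictionary layout are my own assumptions, not the paper's code.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm for a discrete HMM.

    obs: sequence of observation symbols (ints, e.g. K-means cluster indices)
    pi:  (S,)   initial state distribution
    A:   (S, S) transition matrix, A[i, j] = p(next state j | state i)
    B:   (S, O) emission matrix,   B[i, o] = p(symbol o | state i)
    """
    alpha = pi * B[:, obs[0]]
    log_lik = 0.0
    for o in obs[1:]:
        c = alpha.sum()  # rescale to avoid underflow on long sequences
        log_lik += np.log(c)
        alpha = ((alpha / c) @ A) * B[:, o]
    return log_lik + np.log(alpha.sum())

def classify_intention(obs, models):
    """models: dict mapping an intention name to its (pi, A, B) parameters."""
    scores = {name: forward_log_likelihood(obs, *p) for name, p in models.items()}
    return max(scores, key=scores.get), scores
```

The classifier only compares likelihoods. Adding a prior over intentions would turn the scores into a posterior.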
"Online Risk-Bounded Motion Planning for Autonomous Vehicles in Dynamic Environments"
- [ `2019` ] [📝] [ 🎓 `MIT` ] [ 🚗 `Toyota` ]
- [ `intention-aware planning`, `manoeuvre-based motion prediction`, `POMDP`, `probabilistic risk assessment`, `CARLA` ]
Click to expand
One figure:
Source. |
Authors: Huang, X., Hong, S., Hofmann, A., & Williams, B.
- One term: "Probabilistic Flow Tubes" (`PFT`)
  - A motion representation used in the "Motion Model Generator".
  - Instead of using hand-crafted rules for the transition model, the idea is to learn human behaviours from demonstration.
  - The inferred models are encoded with PFTs and are used to generate probabilistic predictions for both manoeuvre (long-term reasoning) and motion of the other vehicles.
  - The advantage of belief-based probabilistic planning is that it can avoid over-conservative behaviours while offering probabilistic safety guarantees.
- Another term: "Risk-bounded POMDP Planner"
  - The uncertainty in the intention estimation is then propagated to the decision module.
  - Some notion of risk, defined as the probability of collision, is evaluated and considered when taking actions, leading to the introduction of a "chance-constrained POMDP" (`CC-POMDP`).
  - The online solver uses a heuristic-search algorithm, Risk-Bounded AO* (`RAO*`), which takes advantage of the risk estimation to prune the over-risky branches that violate the risk constraints and eventually outputs a plan with a guarantee over the probability of success.
- One quote (this could apply to many other works): "One possible future work is to test our work in real systems".
"Towards Human-Like Prediction and Decision-Making for Automated Vehicles in Highway Scenarios"
- [ `planning-based motion prediction`, `manoeuvre-based motion prediction` ]
Click to expand
Author: Sierra Gonzalez, D.
- Prediction techniques are often classified into three types:
  - `physics-based`
  - `manoeuvre-based` (and `goal-based`).
  - `interaction-aware`
- As I understood, the main idea here is to combine prediction techniques (and their advantages).
  - The driver-models (i.e. the reward functions previously learnt with IRL) can be used to identify the most likely, risk-averse, anticipatory manoeuvres. This is called the `model-based` prediction by the author since it relies on one model.
    - But relying only on driver models to predict the behaviour of surrounding traffic might fail to predict dangerous manoeuvres.
    - As stated, "the model-based method is not a reliable alternative for the short-term estimation of behaviour, since it cannot predict dangerous actions that deviate from what is encoded in the model".
    - One solution is to add a term that represents how the observed movement of the target matches a given maneuver.
    - In other words, to consider the noisy observation of the dynamics of the targets and include this so-called `dynamic evidence` in the prediction.
- Usage:
  - The resulting approach is used in the probabilistic filtering framework to update the belief in the POMDP and in its rollout (to bias the construction of the history tree towards likely situations given the state and intention estimations of the surrounding vehicles).
  - It improves the inference of manoeuvres, reducing the rate of false positives in the detection of `lane change` manoeuvres, and enables the exploration of situations in which the surrounding vehicles behave dangerously (not possible if relying on safe generative models such as `IDM`).
- One quote about this combination:
  - "This model mimics the reasoning process of human drivers: they can guess what a given vehicle is likely to do given the situation (the model-based prediction), but they closely monitor its dynamics to detect deviations from the expected behaviour".
- One idea: use this combination for risk assessment.
  - As stated, "if the intended and expected maneuver of a vehicle do not match, the situation is classified as dangerous and an alert is triggered".
  - This is an important concept of risk assessment I could identify at IV19: a situation is dangerous if there is a discrepancy between what is expected (given the context) and what is observed.
- One term: "Interacting Multiple Model" (`IMM`), used as baseline in the comparison.
  - The idea is to consider a group of motion models (e.g. `lane keeping with CV`, `lane change with CV`) and continuously estimate which of them captures more accurately the dynamics exhibited by the target.
  - The final predictions are produced as a weighted combination of the individual predictions of each filter.
  - `IMM` belongs to the physics-based prediction approaches and could be extended for `manoeuvre inference` (called dynamics matching). It is often used to maintain the beliefs and guide the observation sampling in POMDP.
  - But the issue is that IMM completely disregards the interactions between vehicles.
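The combination of the `model-based` prediction with the `dynamic evidence` can be pictured as a Bayesian product over a discrete set of manoeuvres. The sketch below shows that intuition, not the thesis' exact formulation: the function names and the way the "expected" and "intended" manoeuvres are extracted (simple `argmax`) are my assumptions.

```python
import numpy as np

def fuse_manoeuvre_beliefs(model_based_prior, dynamics_likelihood):
    """Posterior over manoeuvres: prior from the driver model times dynamic evidence."""
    posterior = np.asarray(model_based_prior, dtype=float) * np.asarray(dynamics_likelihood, dtype=float)
    return posterior / posterior.sum()

def is_dangerous(model_based_prior, dynamics_likelihood):
    """Alert when the expected (model-based) and intended (observed) manoeuvres differ."""
    return int(np.argmax(model_based_prior)) != int(np.argmax(dynamics_likelihood))

# Hypothetical numbers for the manoeuvres [lane keeping, lane change]:
prior = [0.9, 0.1]         # the driver model finds a lane change unlikely here
evidence = [0.05, 0.95]    # but the observed lateral motion matches a lane change
posterior = fuse_manoeuvre_beliefs(prior, evidence)
alert = is_dangerous(prior, evidence)
```

In this example, the evidence overrides the prior (the posterior favours the lane change) and the mismatch between expectation and observation raises the alert.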
"Decision making in dynamic and interactive environments based on cognitive hierarchy theory: Formulation, solution, and application to autonomous driving"
- [ `2019` ] [📝] [ 🎓 `University of Michigan` ]
- [ `level-k game theory`, `cognitive hierarchy theory`, `interaction modelling`, `interaction-aware decision making` ]
Click to expand
Authors: Li, S., Li, N., Girard, A., & Kolmanovsky, I.
- One concept: `cognitive hierarchy`.
  - Other drivers are assumed to follow some "cognitive behavioural models", parametrized with a so called "cognitive level" `σ`.
  - The goal is to obtain and maintain belief about `σ` based on observation in order to optimally respond (using an `MPC`).
  - Three levels are considered:
    - level-`0`: driver that treats other vehicles on road as stationary obstacles.
    - level-`1`: cautious/conservative driver.
    - level-`2`: aggressive driver.
- One quote about the "cognitive level" of human drivers:
  - "Humans are most commonly level-1 and level-2 reasoners".
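Maintaining a belief about `σ` can be sketched as a plain Bayesian update over the three levels. This is only an illustration: the update rule actually used by the authors may differ, and the per-level action probabilities (which would come from the level-`k` policies) are made-up numbers here.

```python
import numpy as np

def update_level_belief(belief, action_probs_per_level, observed_action):
    """Bayesian update of the belief over cognitive levels.

    belief:                 (n_levels,) current belief over sigma
    action_probs_per_level: (n_levels, n_actions) p(action | level) in the current state
    observed_action:        index of the action the other driver just took
    """
    likelihood = np.asarray(action_probs_per_level)[:, observed_action]
    posterior = np.asarray(belief, dtype=float) * likelihood
    return posterior / posterior.sum()

# Hypothetical example with actions [yield, merge in front]:
belief = np.ones(3) / 3
action_probs = np.array([[0.8, 0.2],   # level-0
                         [0.5, 0.5],   # level-1
                         [0.1, 0.9]])  # level-2
belief = update_level_belief(belief, action_probs, observed_action=1)
```

After observing the other driver merging in front, the belief shifts toward level-`2`. The `MPC` can then plan against the most likely level or against the whole belief.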
Related works:
- Li, S., Li, N., Girard, A. & Kolmanovsky, I. [2019]. "Decision making in dynamic and interactive environments based on cognitive hierarchy theory, Bayesian inference, and predictive control" [pdf]
- Li, N., Oyler, D., Zhang, M., Yildiz, Y., Kolmanovsky, I., & Girard, A. [2016]. "Game-theoretic modeling of driver and vehicle interactions for verification and validation of autonomous vehicle control systems" [pdf]
  - "If a driver assumes that the other drivers are level-`1` and takes an action accordingly, this driver is a level-`2` driver".
  - Use RL with hierarchical assignment to learn the policy:
    - First, the `π-0` (for level-`0`) is learnt for the ego-agent.
    - Then `π-1` with all the other participants following `π-0`.
    - Then `π-2` ...
  - Action masking: "If a car in the left lane is in a parallel position, the controlled car cannot change lane to the left".
    - "The use of these hard constrains eliminates the clearly undesirable behaviours better than through penalizing them in the reward function, and also increases the learning speed during training"
- Ren, Y., Elliott, S., Wang, Y., Yang, Y., & Zhang, W. [2019]. "How Shall I Drive ? Interaction Modeling and Motion Planning towards Empathetic and Socially-Graceful Driving" [pdf] [code]
Source. |
Source. |