
Prediction and Manoeuvre Recognition

  • Questions when developing a predictor:
    • Which application?
      • Who? pedestrians or vehicles?
      • Where? highway or urban?
    • Scenarios: generic or specific?
      • E.g. address all types of intersections, or only one specific T-intersection?
    • Which prediction horizon?
      • Usually physics-based for short term and goal-based for longer horizons.
    • Frequency of observations?
      • Maybe need to consider a sequence of observations.
    • How much robustness is needed?
      • Ability to generalize / transfer well in new environments?
    • Uncertainty: how to model multi-modal future distributions?
      • Uncertainty comes from the stochastic nature of human motion and from imperfect perception.
      • Be careful with mode-collapse.
    • Context-aware: which information to consider?
      • About the target agent, the static and the dynamic environment.
      • Consider/ignore other agents?
        • How to model interactions?
      • Consider/ignore map / road structure?
        • Is a map available?
    • How many obstacles?
      • How to represent them?
      • Can it scale?
    • How much expert / prior knowledge to inject? What can be learnt?
      • E.g. the dynamics of moving agents.
      • Modern techniques make extensive use of ML in order to better estimate context-dependent patterns in real data.
      • Which datasets are available?
    • How constrained is the computational budget?
    • How to obtain a trajectory, i.e. a sequence of positions?
      • One-shot feed-forward: directly predicts a time sequence.
      • Roll-outs: repeat one-step prediction with the forward propagation of the model.
        • But this requires a transition function.
    • What metrics and test set for evaluation?
      • If possible comparable with other works.
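
The two ways of obtaining a trajectory can be contrasted with a minimal sketch. It assumes a constant-velocity transition function (`cv_step`, a hypothetical name) and a time step `dt`; neither comes from the notes above. In a learnt predictor the one-shot variant would be a network emitting all positions at once; here it is stood in for by the closed-form constant-velocity solution.

```python
from typing import List, Tuple

State = Tuple[float, float, float, float]  # (x, y, vx, vy)
Position = Tuple[float, float]


def cv_step(state: State, dt: float) -> State:
    """Transition function: one constant-velocity step (illustrative assumption)."""
    x, y, vx, vy = state
    return (x + vx * dt, y + vy * dt, vx, vy)


def rollout(state: State, n_steps: int, dt: float) -> List[Position]:
    """Roll-out: repeat the one-step prediction, feeding each state back in."""
    positions: List[Position] = []
    for _ in range(n_steps):
        state = cv_step(state, dt)
        positions.append((state[0], state[1]))
    return positions


def one_shot(state: State, n_steps: int, dt: float) -> List[Position]:
    """One-shot: emit the whole time sequence directly, with no state feedback."""
    x, y, vx, vy = state
    return [(x + vx * k * dt, y + vy * k * dt) for k in range(1, n_steps + 1)]
```

With an exact transition model both variants agree. With a learnt one-step model, roll-out errors compound over the horizon, which is one reason to prefer one-shot outputs when no reliable transition function is available.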

"TNT: Target-driveN Trajectory Prediction"

  • [ 2020 ] [📝] [🚗 Waymo]

  • [ multimodal, VectorNet, goal-based prediction ]

TNT = target-driven trajectory prediction. The intuition is that the uncertainty of future states can be decomposed into two parts: the target or intent uncertainty, such as the decision between turning left and right; and the control uncertainty, such as the fine-grained motion required to perform a turn. Accordingly, the probabilistic distribution is decomposed by conditioning on targets and then marginalizing over them. Source.

Authors: Zhao, H., Gao, J., Lan, T., Sun, C., Sapp, B., Varadarajan, B., Shen, Y., Shen, Y., Chai, Y., Schmid, C., Li, C., & Anguelov, D.

  • One sentence:

    • "Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states."

  • Motivations:

    • 1- Model multimodal future distributions.
    • 2- Be able to incorporate expert knowledge.
    • 3- Do not rely on run-time sampling to estimate trajectory distributions.
  • How to model multimodal future distributions?

    • 1- Future modes are implicitly modelled as latent variables, which should capture the underlying intents of the agents.
      • Diverse trajectories can be generated by sampling from these implicit distributions.
      • E.g. CVAE in DESIRE, GAN in SocialGAN, single-step policy roll-out methods: they are prone to mode collapse too.
      • Issue: modelling intents with latent variables prevents them from being interpreted, which makes it challenging to incorporate expert knowledge.
      • Issue: it often requires test-time sampling [stochastic sampling from the latent space] to evaluate probabilistic queries (e.g., “how likely is the agent to turn left?”) and obtain implicit distributions.
      • "Modeling the future as a discrete set of targets does not suffer from mode averaging, which is the major factor that hampers multimodal predictions."

    • 2- Decompose the trajectory prediction task into subtasks.
      • For instance with planning-based prediction:
        • First estimate a Bayesian posterior distribution of destinations.
        • Then use IRL to plan the trajectories.
      • Or by decomposing goal distribution estimation and goal-directed planning.
    • 3- Discretize the output space as intents or with anchors.
      • IntentNet [UBER]: several common motion categories are manually defined (e.g. left turn and lane changes) and a separate motion predictor is learnt for each intent.
      • "MultiPath [Waymo] and CoverNet [nuTonomy] chose to quantize the trajectories into anchors, where the trajectory prediction task is reformulated into anchor selection and offset regression."

      • "Unlike anchor trajectories, the targets in TNT are much lower dimensional and can be easily discretized via uniform sampling or based on expert knowledge (e.g. HD maps). Hence, they can be estimated more reliably."

  • TNT = target-driven trajectory prediction.

    • The framework has 3 stages that are trained end-to-end:
    • 1- Target prediction.
      • Estimate a distribution over candidate targets, T steps into the future, given the encoded scene context.
      • The potential future targets are modelled via a set of N discrete, quantized locations with continuous offsets.
        • "We can see that with regression the performance improved by 0.16m, which shows the necessity of position refinement from the original target coordinates."

    • 2- Target-conditioned motion estimation.
      • Predict trajectory state sequences conditioned on targets.
      • Two assumptions:
        • 1- Future time steps are conditionally independent. Sequential predictions are therefore avoided.
        • 2- The distribution of the trajectories is unimodal (normal) given the target.
    • 3- Scoring and selection.
      • Estimate the likelihood of each predicted trajectory, taking into account the context of all other predicted trajectories.
        • "Our final stage estimates the likelihood of full future trajectories sF. This differs from the second stage, which decomposes over time steps and targets, and from the first stage which only has knowledge of targets, but not full trajectories — e.g., a target might be estimated to have high likelihood, but a full trajectory to reach that target might not."

      • Select a final compact set of trajectory predictions.
        • "This process is inspired by the non-maximum suppression algorithm commonly used for computer vision problems, such as object detection."
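
The selection step of stage 3 can be sketched as a greedy, NMS-like filter: walk through the trajectories from highest to lowest score and keep one only if it is far enough from everything already kept. This is a minimal sketch, not the authors' implementation. The distance metric (distance between final positions), the threshold `min_dist` and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Trajectory = Sequence[Tuple[float, float]]


def endpoint_distance(a: Trajectory, b: Trajectory) -> float:
    """Distance between final positions (an assumed, simple proxy metric)."""
    return math.dist(a[-1], b[-1])


def select_trajectories(
    trajectories: Sequence[Trajectory],
    scores: Sequence[float],
    k: int,
    min_dist: float,
) -> List[int]:
    """Return indices of at most k trajectories, best score first,
    skipping near-duplicates of trajectories that are already kept."""
    order = sorted(range(len(trajectories)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if len(kept) == k:
            break
        if all(endpoint_distance(trajectories[i], trajectories[j]) > min_dist
               for j in kept):
            kept.append(i)
    return kept
```

In the example below, the second-best trajectory is dropped because it ends 0.5 m from the best one, so the lower-scored but distinct left turn is kept instead.

```python
trajs = [[(0, 0), (10, 0)], [(0, 0), (10, 0.5)], [(0, 0), (0, 10)]]
print(select_trajectories(trajs, scores=[0.5, 0.4, 0.1], k=2, min_dist=1.0))  # [0, 2]
```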

  • How is the context information encoded, i.e. the ego-car's interactions with the environment and the other agents?

    • When the HD map is available: Using the hierarchical graph neural network VectorNet [Waymo].
      • "Polylines are used to abstract the HD map elements cP (lanes, traffic signs) and agent trajectories sP; a subgraph network is applied to encode each polyline, which contains a variable number of vectors; then a global graph is used to model the interactions between polylines."

    • " If scene context is only available in the form of top-down imagery, a ConvNet is used as the context encoder."

  • How to generate the targets, i.e. design the target space?

    • The target space is approximated by a set of discrete locations. Expert knowledge (such as road topology) can be incorporated there, for instance by sampling targets on and around the lanes.
      • "These targets are not only grounded in physical entities that are interpretable (e.g. location), but also correlate well with intent (e.g. a lane change or a right turn)."

    • 1- For vehicle: points are uniformly sampled on lane centerlines from the HD map and used as target candidates, with the assumption that vehicles never depart far away from lanes.
    • 2- For pedestrians: a virtual grid is generated around the agent and the grid points are used as target candidates.
      • Grid of range 20mx20m with a grid size of 0.5m -> 1600 targets are considered and only the best 50 are kept for further processing.
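
The pedestrian target count can be checked with a small sketch: a 20 m side with 0.5 m spacing gives 40 points per axis, hence 40 x 40 = 1600 candidates. Placing the points at cell centres is an assumption made here so that the count matches; the function names and the placeholder scores are hypothetical, since in TNT the scores come from the target prediction network.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def grid_targets(center: Point, side: float = 20.0, cell: float = 0.5) -> List[Point]:
    """Candidate targets at the cell centres of a square grid around the agent."""
    n = round(side / cell)  # 40 cells per axis for the default values
    offsets = [(i + 0.5) * cell - side / 2.0 for i in range(n)]
    cx, cy = center
    return [(cx + dx, cy + dy) for dx in offsets for dy in offsets]


def top_k(targets: Sequence[Point], scores: Sequence[float], k: int = 50) -> List[Point]:
    """Keep the k highest-scoring candidates for further processing."""
    ranked = sorted(zip(scores, targets), key=lambda pair: pair[0], reverse=True)
    return [target for _, target in ranked[:k]]


if __name__ == "__main__":
    candidates = grid_targets((0.0, 0.0))
    # Placeholder scores: prefer targets close to the agent (illustration only).
    fake_scores = [-(x * x + y * y) for x, y in candidates]
    print(len(candidates), len(top_k(candidates, fake_scores)))
```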
  • What is the ground truth?

    • Not very clear to me how a single demonstration can be used as an oracle: today you turn left, but yesterday, in the same context, you turned right.
    • "The ground truth score of each predicted trajectory is defined by its distance to ground truth trajectory ψ(sF)."

  • Results:

    • TNT outperforms the state-of-the-art on prediction of:
    • For benchmark, MultiPath and DESIRE are reimplemented by replacing their ConvNet context encoders with VectorNet.

"Modeling and Prediction of Human Driver Behavior: A Survey"

  • [ 2020 ] [📝] [ 🎓 Stanford, University of Illinois ] [🚗 Qualcomm]

  • [ state estimation, intention estimation, trait estimation, motion prediction ]

Terminology: the problem is formulated as a discrete-time multi-agent partially observable stochastic game (POSG). In particular, the internal state can contain agent’s navigational goals or the behavioural traits. Source.

Authors: Brown, K., Driggs-Campbell, K., & Kochenderfer, M. J.

  • Motivation:

    • A review and taxonomy of 200 models from the literature on driver behaviour modelling.
      • "In the context of the partially observable stochastic game (POSG) formulation, a driver behavior model is a collection of assumptions about the human observation function G, internal-state update function H and policy function π (the state-transition function F also plays an important role in driver-modeling applications, though it has more to do with the vehicle than the driver)."

      • The "References" section is large and good!
    • Models are categorized based on the tasks they aim to address.
      • 1- state estimation.
      • 2- intention estimation.
      • 3- trait estimation.
      • 4- motion prediction.
  • The following lists are non-exhaustive (see the tables in the survey for full details). They only try to give an overview of the most represented instances:

  • 1- (Physical) state estimation.

    • [Algorithm]: approximate recursive Bayesian filters. E.g. KF, PF, moving average filter.
    • "Some advanced state estimation models take advantage of the structure inherent in the driving environment to improve filtering accuracy. E.g. DBN".

  • 2- (Internal states) intention estimation.

    • "Intention estimation usually involves computing a probability distribution over a finite set of possible behavior modes - often corresponding to navigational goals (e.g., change lanes, overtake) - that a driver might execute in the current situation."

    • [Architecture]: [D]BN, SVM, HMM, LSTM.
    • [Scope]: highway, intersection, lane-changing, urban.
    • [Evaluation]: accuracy (classification), ROC curve, F1, false positive rate.
    • [Intention Space] (set of possible behaviour modes that may exist in a driver’s internal state - often combined):
      • lateral modes (e.g. lane-change / lane-keeping intentions),
      • routes (a sequence of decisions, e.g. turn right → go straight → turn right again that a driver may intend to execute),
      • longitudinal modes (e.g. car-following / cruising),
      • joint configurations.
      • "Configuration intentions are defined in terms of spatial relationships to other vehicles. For example, intention estimation for a merging scenario might involve reasoning about which gap between vehicles the target car intends to enter. The intention space of a car in the other lane might be whether or not to yield and allow the merging vehicle to enter."

    • [Hypothesis Representation] (how to represent uncertainty in the intention hypothesis?): discrete probability distribution over possible intentions.
      • "In contrast, point estimate hypothesis ignores uncertainty and simply assigns a probability of 1 to a single (presumably the most likely) behavior mode."

    • [Estimation / Inference Paradigm]: single-shot, recursive, Bayesian (based on probabilistic graphical models), black-box, game theory.
      • "Recursive estimation algorithms operate by repeatedly updating the intention hypothesis at each time step based on the new information received. In contrast, single-shot estimators compute a new hypothesis from scratch at each inference step. The latter may operate over a history of observations, but it does not store any information between successive inference iterations."

      • "Game-theoretic models are distinguished by being interaction-aware. They explicitly consider possible situational outcomes in order to compute or refine an intention hypothesis. This interaction-awareness can be as simple as pruning intentions with a high probability of conflicting with other drivers, or it can mean computing the Nash equilibrium of an explicitly formulated game with a payoff matrix."
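
The difference between a recursive estimator, a discrete distribution hypothesis and a point estimate can be illustrated with a discrete Bayes update over a finite intention space. This is a sketch under simplifying assumptions: the intention names and likelihood values are invented, and the mode-transition (prediction) step of a full recursive filter is omitted.

```python
from typing import Dict


def recursive_update(belief: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """One recursive step: posterior is proportional to prior belief times the
    likelihood of the newest observation under each intention."""
    unnormalised = {mode: belief[mode] * likelihood[mode] for mode in belief}
    total = sum(unnormalised.values())
    return {mode: p / total for mode, p in unnormalised.items()}


def point_estimate(belief: Dict[str, float]) -> Dict[str, float]:
    """Point-estimate hypothesis: probability 1 on the most likely mode."""
    best = max(belief, key=belief.get)
    return {mode: 1.0 if mode == best else 0.0 for mode in belief}


if __name__ == "__main__":
    belief = {"keep_lane": 0.5, "change_left": 0.5}
    # Two successive observations (e.g. lateral drift) that favour a lane change.
    for _ in range(2):
        belief = recursive_update(belief, {"keep_lane": 0.2, "change_left": 0.8})
    print(belief)
    print(point_estimate(belief))
```

Only `belief` is carried between steps, which is what makes the estimator recursive. A single-shot estimator would instead recompute the hypothesis from the whole observation window at every inference step.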

  • 3- trait estimation.

    • "Whereas intention estimation reasons about what a driver is trying to do, trait estimation reasons about factors that affect how the driver will do it. Broadly speaking, traits encompass skills, preferences, and style, as well as properties like fatigue, distractedness, etc."

    • "Trait estimation may be interpreted as the process of inferring the “parameters” of the driver’s policy function π on the basis of observed driving behavior. [...] Traits can also be interpreted as part of the driver’s internal state."

    • [Architecture]: IDM, MOBIL, reward parameters.

    • [Training]: IRL, EM, genetic algorithms, heuristic.

    • [Theory]: Inverse RL.

    • [Scope]: car following (IDM), highway, intersection, urban.

    • [Trait Space]: policy parameters, reward parameters (assuming that drivers are trying to optimize a cost function).

      • "Some of the most widely known driver models are simple parametric controllers with tuneable “style” or “preference” policy parameters that represent intuitive behavioral traits of drivers. E.g. IDM."

      • IDM traits: minimum desired gap, desired time headway, maximum feasible acceleration, preferred deceleration, maximum desired speed.
      • "Reward function parameters often correspond to the same intuitive notions mentioned above (e.g., preferred velocity), the important difference being that they parametrize a reward function rather than a closed-loop control policy."

    • [Hypothesis Representation] (uncertainty): in almost all cases, the hypothesis is represented by a point estimate rather than a distribution.

    • [Estimation Paradigm]: offline / online.

      • "Some models combine the two paradigms by computing a prior distribution offline, then tuning it online. This tuning procedure often relies on Bayesian methods."

    • [Model Class]: heuristic, optimization, Bayesian, IRL, contextually varying.

      • "One simple approach to offline trait estimation is to set trait parameters heuristically. Specifying parameters manually is one way to incorporate expert domain knowledge into models."

      • "In some approaches, trait parameters are modeled as contextually varying, meaning that they vary based on the region of the state space (the context) or the current behavior mode."
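
To make the link between traits and policy parameters concrete, here is a sketch of the standard IDM acceleration law, with the five traits listed above as its parameters. The default values are illustrative assumptions, not values taken from the survey, and the acceleration exponent `exponent` (often set to 4) is an extra parameter of the usual formulation.

```python
import math
from dataclasses import dataclass


@dataclass
class IDMParams:
    """Driver traits expressed as IDM policy parameters (illustrative defaults)."""
    min_gap: float = 2.0         # minimum desired gap s0 [m]
    time_headway: float = 1.5    # desired time headway T [s]
    max_accel: float = 1.0       # maximum feasible acceleration a [m/s^2]
    comfort_decel: float = 2.0   # preferred deceleration b [m/s^2]
    desired_speed: float = 30.0  # maximum desired speed v0 [m/s]
    exponent: float = 4.0        # acceleration exponent delta


def idm_acceleration(speed: float, gap: float, closing_speed: float, p: IDMParams) -> float:
    """IDM acceleration of the follower.

    gap is the bumper-to-bumper distance to the leader [m];
    closing_speed is follower speed minus leader speed [m/s].
    """
    desired_gap = (
        p.min_gap
        + speed * p.time_headway
        + speed * closing_speed / (2.0 * math.sqrt(p.max_accel * p.comfort_decel))
    )
    free_road_term = (speed / p.desired_speed) ** p.exponent
    interaction_term = (desired_gap / gap) ** 2
    return p.max_accel * (1.0 - free_road_term - interaction_term)
```

Trait estimation then amounts to fitting the fields of `IDMParams` to observed car-following data, offline or online.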

  • 4- motion prediction.

    • "Infer the future physical states of the surrounding vehicles".

    • [Architecture]: IDM, LSTM (and other RNN/NN), constant acceleration / speed (CA, CV), encoder-decoder, GMM, GP, adaptive, spline.
    • [Training]: heuristic.
      • "Simple examples include rule-based heuristic control laws like IDM. More sophisticated examples include closed-loop policies based on NN, DBN, and random forests."

    • [Theory]: RL, MPC, trajectory optimization.
      • "Some MPC policy models (including those used within a forward simulation paradigm) fall into the game theoretic category because they explicitly predict the future states of their environment (including other cars) before computing a planned trajectory."

    • [Scope]: highway, car-following (e.g. using IDM), intersection, urban.
    • [Evaluation]: RMSE, NLL, MAE, collision rate.
    • [Vehicle dynamics model]: linear, learned, bicycle kinematic.
      • "Many models in the literature assume linear state-transition dynamics. Linear models can be first order (i.e., output is position, input is velocity), second order (i.e., output is position, input is acceleration), and so forth."

      • "Kinematic models are simpler than dynamic models, but the no-slip assumption can lead to significant modeling errors."

      • "Some state-transition models are learned, in the sense that the observed correlation between consecutive predicted states results entirely from training on large datasets. Some incorporate an explicit transition model where the parameters are learned, whereas others simply output a full trajectory."

    • [Scene-level uncertainty modelling]:
      • single-scenario (ignoring multimodal uncertainty at the scene level),
      • partial scenario,
      • multi-scenario (reason about the different possible scenarios that may follow from an initial traffic scene),
      • reachable set.
      • "Some models reason only about a partial scenario, meaning they predict the motion of only a subset of vehicles in the traffic scene, usually under a single scenario."

      • "Some models reason about multimodal uncertainty on the scene-level by performing multiple (parallel) rollouts associated with different scenarios."

      • "Rather than reasoning about the likelihood of future states, some models reason about reachability. Reachability analysis implies taking a worst-case mindset in terms of predicting vehicle motion."

    • [Agent-level uncertainty modelling]: single deterministic, particle set, Gaussians.
    • [Prediction paradigm]:
      • open-loop independent trajectory prediction.
        • "Many models operate under the independent prediction paradigm, meaning that they predict a full trajectory independently for each agent in the scene. These approaches are interaction-unaware because they are open-loop. Though they may account for interaction between vehicles at the current time t, they do not explicitly reason about interaction over the prediction window from t+1 to the prediction horizon tf. [...] Because independent trajectory prediction models ignore interaction, their predictive power tends to quickly degrade as the prediction horizon extends further into the future."

      • closed-loop forward simulation.
        • "In the forward simulation paradigm, motion hypotheses are computed by rolling out a closed-loop control policy π for each target vehicle."

      • game theoretic prediction.
        • "Agents are modeled as looking ahead to consider the possible ramifications of their actions. This notion of looking ahead makes game-theoretic prediction models more deeply interaction-aware than forward simulation models based on reactive closed-loop control."
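
As an illustration of the "bicycle kinematic" entry among the vehicle dynamics models, here is a minimal sketch of one forward-Euler step of the kinematic bicycle model, referenced at the rear axle. The wheelbase and time step are assumed values. The model embeds the no-slip assumption mentioned in the quote above: the velocity is always aligned with the heading.

```python
import math
from typing import Tuple


def bicycle_step(
    x: float,
    y: float,
    heading: float,
    speed: float,
    steer: float,
    accel: float,
    wheelbase: float = 2.7,  # assumed value [m]
    dt: float = 0.1,         # assumed time step [s]
) -> Tuple[float, float, float, float]:
    """One Euler step of the rear-axle kinematic bicycle model (no tyre slip)."""
    new_x = x + speed * math.cos(heading) * dt
    new_y = y + speed * math.sin(heading) * dt
    new_heading = heading + speed / wheelbase * math.tan(steer) * dt
    new_speed = speed + accel * dt
    return new_x, new_y, new_heading, new_speed


if __name__ == "__main__":
    state = (0.0, 0.0, 0.0, 10.0)
    for _ in range(10):  # 1 s of driving with a slight left steer
        state = bicycle_step(*state, steer=0.05, accel=0.0)
    print(state)
```

Calling such a transition function in a loop, with the controls chosen by a policy π, is the forward simulation paradigm described above.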


"Motion Prediction using Trajectory Sets and Self-Driving Domain Knowledge"

  • [ 2020 ] [📝] [🚗 nuTonomy]

  • [ multimodal, probabilistic, mode collapse, domain knowledge, classification ]

Top: The idea of CoverNet is to first generate feasible future trajectories, and then classify them. It uses the past states of all road users and a HD map to compute a distribution over a vehicle's possible future states. Bottom-left: the set of trajectories can be reduced by considering the current state and the dynamics: at high speeds, sharp turns are not dynamically feasible for instance. Bottom-right: the contribution here also deals with "feasibility", i.e. tries to reduce the set using domain knowledge. A second loss is introduced to penalize predictions that go off-road. The first loss (cross entropy with closest prediction treated as ground truth) is also adapted: instead of a delta distribution over the closest mode, there is also probability assigned to near misses. Source.

Authors: Boulton, F. A., Grigore, E. C., & Wolff, E. M.

  • Related work: CoverNet: Multimodal Behavior Prediction using Trajectory Sets, (Phan-Minh, Grigore, Boulton, Beijbom, & Wolff, 2019).

  • Motivation:

    • Extend their CoverNet by including further "domain knowledge".
      • "Both dynamic constraints and "rules-of-the-road" place strong priors on likely motions."

      • CoverNet: Predicted trajectories should be consistent with the current dynamic state.
      • This work: Predicted trajectories should stay on road.
    • The main idea is to leverage the map information by adding an auxiliary loss that penalizes off-road predictions.
  • Motivations and ideas of CoverNet:

    • 1- Avoid the issue of "mode collapse".
      • The prediction problem is treated as classification over a diverse set of trajectories.
      • The trajectory sets for CoverNet are available in the nuscenes-devkit GitHub repository.
    • 2- Ensure a desired level of coverage of the state space.
      • The larger and the more diverse the set, the higher the coverage. One can play with the resolution to ensure coverage guarantees, while pruning of the set improves the efficiency.
    • 3- Eliminate dynamically infeasible trajectories, i.e. introduce dynamic constraints.
      • Trajectories that are not physically possible are not considered, which limits the set of reachable states and improves the efficiency.
      • "We create a dynamic trajectory set based on the current state by integrating forward with our dynamic model over diverse control sequences."

  • Two losses:

    • 1- Moving beyond "standard" cross-entropy loss for classification.
      • What is the ground truth trajectory? Obviously, it is not part of the set.
      • One solution: designate the closest one in the set.
        • "We utilize cross-entropy with positive samples determined by the element in the trajectory set closest to the actual ground truth in minimum average of point-wise Euclidean distances."

        • Issue: This will penalize the second-closest trajectory just as much as the furthest, since it ignores the geometric structure of the trajectory set.
      • Another idea: use a weighted cross-entropy loss, where the weight is a function of distance to the ground truth.
        • "Instead of a delta distribution over the closest mode, there is also probability assigned to near misses."

        • A threshold defines which trajectories are "close enough" to the ground truth.
      • This weighted loss is adapted to favour mode diversity:
        • "We tried an "Avoid Nearby" weighted cross entropy loss that assigns weight of 1 to the closest match, 0 to all other trajectories within 2 meters of ground truth, and 1/|K| to the rest. We see that we are able to increase mode diversity and recover the performance of the baseline loss."

        • "Our results indicate that losses that are better able to enforce mode diversity may lead to improved performance."

    • 2- Add an auxiliary loss for off-road predictions.
      • This helps learn domain knowledge, i.e. partially encode "rules-of-the-road".
      • "This auxiliary loss can easily be pretrained using only map information (e.g., off-road area), which significantly improves performance on small datasets."
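
The labelling logic behind the first loss can be sketched as follows: the positive class is the set element with the smallest mean point-wise Euclidean distance to the ground truth, and the "Avoid Nearby" variant assigns per-mode weights of 1, 0 or 1/|K|. Two assumptions are made here: the 2 m radius is measured with the same mean point-wise distance, and the boundary is inclusive. How the weights enter the final loss is not detailed in these notes, so the sketch only computes the weights and the baseline cross-entropy. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Trajectory = Sequence[Tuple[float, float]]


def mean_pointwise_distance(a: Trajectory, b: Trajectory) -> float:
    """Average Euclidean distance between time-aligned points."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def closest_mode(trajectory_set: Sequence[Trajectory], ground_truth: Trajectory) -> int:
    """Index of the set element treated as the positive class."""
    dists = [mean_pointwise_distance(t, ground_truth) for t in trajectory_set]
    return min(range(len(dists)), key=dists.__getitem__)


def avoid_nearby_weights(
    trajectory_set: Sequence[Trajectory], ground_truth: Trajectory, radius: float = 2.0
) -> List[float]:
    """Weight 1 for the closest mode, 0 for other modes within `radius`, 1/|K| otherwise."""
    dists = [mean_pointwise_distance(t, ground_truth) for t in trajectory_set]
    best = min(range(len(dists)), key=dists.__getitem__)
    k = len(trajectory_set)
    return [
        1.0 if i == best else (0.0 if d <= radius else 1.0 / k)
        for i, d in enumerate(dists)
    ]


def cross_entropy(logits: Sequence[float], target_index: int) -> float:
    """Baseline loss: softmax cross-entropy against the closest mode."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target_index]
```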

  • Related works for predictions (we want multimodal and probabilistic trajectory predictions):

    • Input, i.e. encoding of the scene:
      • "State-of-the-art motion prediction algorithms now typically use CNNs to learn appropriate features from a birds-eye-view rendering of the scene (map and road users)."

      • Graph neural networks (GNNs) look promising for encoding interactions.
      • Here: A BEV raster RGB image (fixed size) containing map information and the past states of all objects.
    • Output, i.e. representing the possible future motions:
      • 1- Generative models.
        • They encode choice over multiple actions via sampling latent variables.
        • Issue: multiple trajectory samples or 1-step policy rollouts (e.g. R2P2) are required at inference.
        • Examples: Stochastic policies, CVAEs and GANs.
      • 2- Regression.
        • Unimodal: predict a single future trajectory. Issue: unrealistically average over behaviours, even when predicting Gaussian uncertainty.
        • Multimodal: distribution over multiple trajectories. Issue: suffer from mode collapse.
      • 3- Classification.
        • "We choose not to learn an uncertainty distribution over the space. The density of our trajectory sets reduces its benefit compared to the case when there are only a handful of modes."

        • How to deal with varying number of classes to predict? Not clear to me.
  • How to solve mode collapse in regression?

    • The authors consider MultiPath by Waymo (detailed also in this page) as their baseline.
    • A set of anchor boxes can be used, much like in object detection:
    • "This model implements ordinal regression by first choosing among a fixed set of anchors (computed a priori) and then regressing to residuals from the chosen anchor. This model predicts a fixed number of trajectories (modes) and their associated probabilities."

    • The authors extend MultiPath with dynamically-computed anchors, based on the agent's current speed.
      • Again, it makes no sense to consider anchors that are not dynamically reachable.
      • They also found that using one order of magnitude more “anchor” trajectories than Waymo (64) is beneficial: better coverage of space via anchors, leaving the network to learn smaller residuals.
  • Extensions:

    • As pointed out by this KIT Master Thesis offer, the current state of CoverNet only has a motion model for cars. Predicting bicycles and pedestrians' motions would be a next step.
    • Interactions between agents are currently ignored.

"PnPNet: End-to-End Perception and Prediction with Tracking in the Loop"

  • [ 2020 ] [📝] [ 🎓 University of Toronto ] [🚗 Uber]

  • [ joint perception + prediction, multi-object tracking ]

The authors propose to leverage tracking for the joint perception+prediction task. Source.
Top: One main idea is to make the prediction module directly reuse the scene context captured in the perception features, and also consider the past object tracks. Bottom: a second contribution is the use of an LSTM as a sequence model to learn the object trajectory representation. This encoding is jointly used for the tracking and prediction tasks. Source.

Authors: Liang, M., Yang, B., Zeng, W., Chen, Y., Hu, R., Casas, S., & Urtasun, R.

  • Motivations:

    • 1- Perform perception and prediction jointly, with a single neural network.
      • This is why it is called "Perception and Prediction": PnP.
      • The whole model is also said to be end-to-end, because it is end-to-end trainable.
        • This contrasts with modular sequential architectures, where both the perception output and the map information are forwarded to an independent prediction module, for instance in a bird’s eye view (BEV) raster representation.
    • 2- Improve prediction by leveraging the (past) temporal information (motion history) contained in tracking results.
      • In particular, one goal is to recover from long-term object occlusion.
        • "While all these [vanilla PnP] approaches share the sensor features for detection and prediction, they fail to exploit the rich information of actors along the time dimension [...]. This may cause problems when dealing with occluded actors and may produce temporal inconsistency in predictions."

      • The idea is to include tracking in the loop to improve prediction (motion forecasting):
        • "While the detection module processes sequential sensor data and generates object detections at each time step independently, the tracking module associates these estimates across time for better understanding of object states (e.g., occlusion reasoning, trajectory smoothing), which in turn provides richer information for the prediction module to produce accurate future trajectories."

        • "Exploiting motion from explicit object trajectories is more accurate than inferring motion from the features computed from the raw sensor data. [this reduces the prediction error by (∼6%) in the experiment]"

      • All modules share computation as there is a single backbone network, and the full model can be trained end-to-end.
        • "While previous joint perception and prediction models make the prediction module another convolutional header on top of the detection backbone network, which shares the same features with the detection header, in PnPNet we put the prediction module after explicit object tracking, with the object trajectory representation as input."

  • How to represent (long-term) trajectories?

    • The idea is to capture both sensor observation and motion information of actors.
    • "For each object we first extract its inferred motion (from past detection estimates) and raw observations (from sensor features) at each time step, and then model its dynamics using a recurrent network."

    • [interesting choice] "For angular velocity of ego car we parameterize it as its cosine and sine values."

    • This trajectory representation is utilized in both tracking and prediction modules.
  • About multi-object tracking (MOT):

    • There exist two distinct challenges:
      • 1- The discrete problem of data association between previous tracks and current detections.
        • Association errors (i.e., identity switches) are prone to accumulate through time.
        • "The association problem is formulated as a bipartite matching problem so that exclusive track-to-detection correspondence is guaranteed. [...] Solved with the Hungarian algorithm."

        • "Many frameworks have been proposed to solve the data association problem: e.g., Markov Decision Processes (MDP), min-cost flow, linear assignment problem and graph cut."

      • 2- The continuous problem of trajectory estimation.
        • In the proposed approach, the LSTM representation of the newly associated tracks is refined to generate smoother trajectories:
          • "For trajectory refinement, since it reduces the localization error of online generated perception results, it helps establish a smoother and more accurate motion history."

    • The proposed multi-object tracker solves both problems, which is why it is called "discrete-continuous".
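
The discrete association step can be illustrated on a tiny example. The paper solves the bipartite matching with the Hungarian algorithm. The sketch below swaps in a brute-force search over permutations, which finds the same optimum but only scales to a handful of objects. The cost values are invented, a square cost matrix is assumed, and unmatched tracks or detections (new objects, occlusions) are not handled.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def associate(cost: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Exclusive track-to-detection assignment with minimum total cost.

    cost[i][j] is the cost of matching track i with detection j.
    Brute force over all permutations: a stand-in for the Hungarian algorithm.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(enumerate(best))


if __name__ == "__main__":
    # Greedy matching would pair track 0 with detection 0 (cost 1), forcing
    # track 1 onto detection 1 (cost 8). The optimal assignment costs 6.
    print(associate([[1.0, 4.0], [2.0, 8.0]]))  # [(0, 1), (1, 0)]
```

For realistic numbers of objects, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual choice.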

"VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation"

  • [ 2020 ] [📝] [📝] [🎞️] [🚗 Waymo]

  • [ GNN, vectorized representation ]

Both map features and our sensor input can be simplified into either a point, a polygon, or a curve, which can be approximately represented as polylines, and eventually further split into vector fragments. The set of such vectors forms a simplified abstracted world used to make predictions with less computation than rasterized images encoded with ConvNets. Source.
A vectorized representation of the scene is preferred to the combination (rasterized rendering + ConvNet encoding). A global interaction graph can be built from these vectorized elements, to model the higher-order relationships between entities. To further improve the prediction performance, a supervision auxiliary task is introduced. Source.

Authors: Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., & Schmid, C.

  • Motivations:
    • 1- Reduce the computation cost while offering good prediction performance.
    • 2- Capture long-range context information, for longer-horizon prediction.
      • ConvNets are commonly used to encode the scene context, but they have a limited receptive field.
      • And increasing kernel size and input image size is not so easy:
        • FLOPs of ConvNets increase quadratically with the kernel size and input image size.
        • The number of parameters increases quadratically with the kernel size.
    • Four ingredients:
      • 1- A vectorized representation is preferred to the combination (rasterized rendering + ConvNet encoding).
      • 2- A graph network, to model interactions.
      • 3- Hierarchy, to first encode map and sensor information, and then learn interactions.
      • 4- A supervision auxiliary task, in parallel to the prediction task.
  • Two kinds of input:
    • 1- HD map information: Structured road context information such as lane boundaries, stop/yield signs, crosswalks and speed bumps.
    • 2- Sensor information: Agent trajectories.
  • How to encode the scene context information?
    • 1- Rasterized representation.
      • Rendering: in a bird's-eye-view image, with colour-coded attributes. Issue: colouring requires manual specifications.
      • Encoding: encode the scene context information with ConvNets. Issue: receptive field may be limited.
      • "The most popular way to incorporate highly detailed maps into behavior prediction models is by rendering the map into pixels and encoding the scene information, such as traffic signs, lanes, and road boundaries, with a convolutional neural network (CNN). However, this process requires a lot of compute and time. Additionally, processing maps as imagery makes it challenging to model long-range geometry, such as lanes merging ahead, which affects the quality of the predictions."

      • Impacting parameters:
        • Convolutional kernel sizes.
        • Resolution of the rasterized images.
        • Feature cropping:
          • "A larger crop size (3 vs 1) can significantly improve the performance, and cropping along observed trajectory also leads to better performance."

    • 2- Vectorized representation
      • All map and trajectory elements can be approximated as sequences of vectors.
      • "This avoids lossy rendering and computationally intensive ConvNet encoding steps."

    • About graph neural networks (GNNs), from Rishabh Anand's Medium article:
      • 1- Given a graph, we first convert the nodes to recurrent units and the edges to feed-forward neural networks.
      • 2- Then we perform Neighbourhood Aggregation (Message Passing) for all nodes n number of times.
      • 3- Then we sum over the embedding vectors of all nodes to get graph representation H. Here the "Global interaction graph".
      • 4- Feel free to pass H into higher layers or use it to represent the graph’s unique properties! Here to learn interaction models to make prediction.
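The four GNN steps listed above can be sketched in a few lines. This is only a minimal illustration, not VectorNet's network: the recurrent units and per-edge feed-forward networks are swapped for a single shared linear layer with `tanh`, and the names `message_passing`, `graph_readout`, `weight` and `n_rounds` are hypothetical.

```python
import numpy as np

def message_passing(node_feats, adjacency, weight, n_rounds):
    """Run n_rounds of sum-aggregation message passing.

    node_feats: (N, D) node embeddings, adjacency: (N, N) 0/1 matrix,
    weight: (D, D) matrix shared by all rounds (assumption: normally learnt).
    """
    h = node_feats
    for _ in range(n_rounds):
        messages = adjacency @ h               # sum of the neighbours' embeddings
        h = np.tanh((h + messages) @ weight)   # update every node embedding
    return h

def graph_readout(h):
    """Sum over all node embeddings to obtain the graph representation H."""
    return h.sum(axis=0)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 3))                # 4 nodes with 3 features each
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)    # a simple chain graph
W = 0.5 * rng.normal(size=(3, 3))
H = graph_readout(message_passing(feats, adj, W, n_rounds=2))
```

Because the readout is a sum, `H` does not depend on the order in which the nodes are listed, which is the property that makes it usable as a graph-level descriptor.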
  • About hierarchy:
    • 1- First, aggregate information among vectors inside a polyline, namely polyline subgraphs.
      • Graph neural networks (GNNs) are used to incorporate these sets of vectors.
      • "We treat each vector vi belonging to a polyline Pj as a node in the graph with node features."

      • How to encode attributes of these geometric elements? E.g. traffic light state, speed limit?
        • I must admit I did not fully understand. But from what I read on medium:
          • Each node has a set of features defining it.
          • Each edge may connect nodes together that have similar features.
    • 2- Then, model the higher-order relationships among polylines, directly from their vectorized form.
      • Two interactions are jointly modelled:
        • 1- The interactions of multiple agents.
        • 2- Their interactions with the entities from road maps.
          • E.g. a car enters an intersection, or a pedestrian approaches a crosswalk.
          • "We clearly observe that adding map information significantly improves the trajectory prediction performance."
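To make the first level of the hierarchy more concrete, here is a minimal sketch of how a polyline could be split into vector nodes. The feature layout (start point, end point, optional attributes, polyline index) and the function name `polyline_to_vectors` are assumptions for illustration; the notes above do not specify the exact node features.

```python
import numpy as np

def polyline_to_vectors(points, polyline_id, attributes=()):
    """Split a polyline into consecutive vectors, one graph node per vector.

    points: sequence of (x, y) way-points along the polyline.
    attributes: optional numeric attributes shared by the whole polyline
                (hypothetical, e.g. an encoded object type).
    Returns an (n_points - 1, 4 + n_attributes + 1) array with rows
    [x_start, y_start, x_end, y_end, *attributes, polyline_id].
    """
    points = np.asarray(points, dtype=float)
    starts, ends = points[:-1], points[1:]
    n = len(starts)
    attrs = np.tile(np.asarray(attributes, dtype=float), (n, 1))
    ids = np.full((n, 1), float(polyline_id))
    return np.hstack([starts, ends, attrs, ids])

# A lane centreline with three way-points becomes two vector nodes.
lane_nodes = polyline_to_vectors([(0, 0), (1, 0), (2, 1)], polyline_id=3,
                                 attributes=(1.0,))
```

All nodes that share the same polyline index would then form one polyline subgraph, and the aggregated polyline features would become the nodes of the global interaction graph.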

  • An auxiliary task:
    • 1- Randomly masking out map features during training, such as a stop sign at a four-way intersection.
    • 2- Require the net to complete it.
    • "The goal is to incentivize the model to better capture interactions among nodes."

    • And to learn to deal with occlusion.
    • "Adding this objective consistently helps with performance, especially at longer time horizons".

  • Two training objectives:
    • 1- Main task = Prediction. Future trajectories.
    • 2- Auxiliary task = Supervision. Huber loss between predicted node features and ground-truth masked node features.
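A minimal sketch of the auxiliary objective, assuming that masking means zeroing the features of randomly chosen nodes and that the Huber threshold is `delta = 1.0`. Both choices, as well as the helper names `mask_nodes` and `huber`, are illustrative assumptions. The network that completes the masked features is replaced by a dummy prediction.

```python
import numpy as np

def huber(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    err = np.abs(pred - target)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quadratic, linear).mean()

def mask_nodes(node_feats, mask_ratio, rng):
    """Zero out the features of a random subset of nodes.

    Returns the masked copy and the indices of the masked nodes.
    """
    n = node_feats.shape[0]
    n_masked = max(1, int(round(mask_ratio * n)))
    idx = rng.choice(n, size=n_masked, replace=False)
    masked = node_feats.copy()
    masked[idx] = 0.0
    return masked, idx

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4))                 # 5 nodes with 4 features each
masked_feats, masked_idx = mask_nodes(feats, mask_ratio=0.4, rng=rng)
predicted = np.zeros_like(feats[masked_idx])    # stand-in for the network output
aux_loss = huber(predicted, feats[masked_idx])  # added to the prediction loss
```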
  • Evaluation metrics:
    • The "widely used" Average Displacement Error (ADE) computed over the entire trajectories.
    • The Displacement Error at t (DE@ts) metric, where t in {1.0, 2.0, 3.0} seconds.
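Both metrics are simple to compute. The sketch below assumes trajectories stored as `(T, 2)` arrays of positions sampled every `dt` seconds, with the first row corresponding to time `dt`. The sampling period of 0.1 s and the toy trajectories are assumptions for illustration.

```python
import numpy as np

def average_displacement_error(pred, gt):
    """ADE: mean Euclidean distance over all time steps of the trajectory."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def displacement_error_at(pred, gt, t_sec, dt):
    """DE@t: Euclidean distance at the time step closest to t_sec."""
    step = int(round(t_sec / dt)) - 1   # row 0 corresponds to time dt
    return np.linalg.norm(pred[step] - gt[step])

dt = 0.1
times = dt * np.arange(1, 31)                               # 3 s horizon
gt = np.stack([10.0 * times, np.zeros_like(times)], axis=1)
pred = np.stack([10.0 * times, 0.5 * times], axis=1)        # lateral drift

ade = average_displacement_error(pred, gt)
de_at = {t: displacement_error_at(pred, gt, t, dt) for t in (1.0, 2.0, 3.0)}
```

In this toy case the error grows linearly with time, so DE@3s is three times DE@1s while the ADE sits in between.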
  • Performances and computation cost.
    • VectorNet is compared to ConvNets on the Argoverse forecasting dataset, as well as on some Waymo in-house prediction dataset.
    • ConvNets consume 200+ times more FLOPs than VectorNet for a single agent: 10.56G vs 0.041G. The factor drops to about 5 when there are 50 agents per scene.
    • VectorNet needs 29% of the parameters of ConvNets: 72K vs 246K.
    • VectorNet achieves up to 18% better performance on Argoverse.

"Online parameter estimation for human driver behavior prediction"

  • [ 2020 ] [📝] [:octocat:] [🎓 Stanford] [🚗 Toyota Research Institute]

  • [ stochastic IDM ]

The vanilla IDM is a parametric rule-based car-following model that balances two forces: the desire to achieve free speed if there were no vehicle in front, and the need to maintain safe separation with the vehicle in front. It outputs an acceleration that is guaranteed to be collision free. The stochastic version introduces a new model parameter σ-IDM. Source.

Authors: Bhattacharyya, R., Senanayake, R., Brown, K., & Kochenderfer, M. J.

  • Motivations:
    • 1- Explicitly model stochasticity in the behaviour of individual drivers.
      • Complex multi-modal distributions over possible outcomes should be modelled.
    • 2- Provide safety guarantees.
    • 3- Highway scenarios: no urban intersection.
    • The methods should combine advantages of rule-based and learning-based estimation/prediction methods:
      • Interpretability.
      • Guarantees on safety (the learning-based model Generative Adversarial Imitation Learning (GAIL) used as baseline is not collision free).
      • Validity even in regions of the state space that are under-represented in the data.
      • High expressive power to capture nuanced driving behaviour.
  • About the method:
    • "We apply online parameter estimation to an extension of the Intelligent Driver Model IDM that explicitly models stochasticity in the behavior of individual drivers."

    • This rule-based method is online, as opposed for instance to the IDM with parameters obtained by offline estimation, using non-linear least squares.
    • Particle filtering is used for the recursive Bayesian estimation.
    • The derived parameter estimates are then used for forward motion prediction.
  • About the estimated parameters (per observed vehicle):
    • 1- The desired velocity (v-des).
    • 2- The driver-dependent stochasticity on acceleration (σ-IDM).
    • They are assumed stationary for each driver, i.e., human drivers do not change their latent driving behaviour over the time horizons.
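For reference, a minimal sketch of the IDM acceleration and of a stochastic variant in which the acceleration is sampled around the IDM output with standard deviation σ-IDM. The standard IDM formula is used; the default parameter values and the Gaussian form of the noise are assumptions for illustration, not values taken from the paper.

```python
import math
import random

def idm_acceleration(v, v_lead, gap, v_des,
                     a_max=1.5, b=2.0, T=1.5, s0=2.0, delta=4.0):
    """Intelligent Driver Model acceleration.

    v, v_lead: speeds of the follower and the leader [m/s], gap: distance [m],
    v_des: desired (free-road) speed [m/s]. The remaining defaults are
    illustrative: max acceleration, comfortable deceleration, time headway,
    minimum gap and acceleration exponent.
    """
    closing_speed = v - v_lead
    desired_gap = s0 + max(0.0, v * T + v * closing_speed / (2.0 * math.sqrt(a_max * b)))
    free_road_term = (v / v_des) ** delta        # desire to reach v_des
    interaction_term = (desired_gap / gap) ** 2  # need to keep a safe separation
    return a_max * (1.0 - free_road_term - interaction_term)

def stochastic_idm_acceleration(v, v_lead, gap, v_des, sigma_idm, rng):
    """Sample an acceleration around the IDM output (assumed Gaussian noise)."""
    return rng.gauss(idm_acceleration(v, v_lead, gap, v_des), sigma_idm)

rng = random.Random(0)
a_sample = stochastic_idm_acceleration(v=20.0, v_lead=18.0, gap=30.0,
                                       v_des=25.0, sigma_idm=0.3, rng=rng)
```

The two estimated parameters appear directly as `v_des` and `sigma_idm`, which is why they can be inferred online from observed accelerations.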
  • About the datasets:
    • NGSIM for US Highway 101 at 10 Hz.
    • Highway Drone Dataset (HighD) at 25 Hz.
    • RMSE of the position and velocity are used to measure “closeness” of a predicted trajectory to the corresponding ground-truth trajectory.
    • Undesirable events, e.g. collision, going off-the-road, hard braking, that occur in each scene prediction are also considered.
  • How to deal with the "particle deprivation problem"?:
    • Particle deprivation = particles converge to one region of the state space and there is no exploration of other regions.
    • Dithering method = external noise is added to aid exploration of state space regions.
    • From (Schön, Gustafsson, & Karlsson, 2009) in "The Particle Filter in Practice":
      • "Both the process noise and measurement noise distributions need some dithering (increased covariance). Dithering the process noise is a well-known method to mitigate the sample impoverishment problem. Dithering the measurement noise is a good way to mitigate the effects of outliers and to robustify the PF in general".

    • Here:
      • "We implement dithering by adding random noise to the top 20% particles ranked according to the corresponding likelihood. The noise is sampled from a discrete uniform distribution with v-des {−0.5, 0, 0.5} and σ-IDM {−0.1, 0, 0.1}. (This preserves the discretization present in the initial sampling of particles)."
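A minimal sketch of one particle-filter step with the dithering described above. Each particle is a pair (v-des, σ-IDM). The noise sets and the 20% ratio come from the quote; everything else is an assumption for illustration: the Gaussian likelihood on the observed acceleration, the fixed IDM parameters, applying dithering after resampling, and the lower bound on σ-IDM.

```python
import numpy as np

def idm_accel(v, v_lead, gap, v_des, a_max=1.5, b=2.0, T=1.5, s0=2.0):
    """IDM acceleration, vectorised over v_des (illustrative parameters)."""
    s_star = s0 + np.maximum(0.0, v * T + v * (v - v_lead) / (2.0 * np.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v_des) ** 4 - (s_star / gap) ** 2)

def likelihoods(particles, obs_accel, v, v_lead, gap):
    """Gaussian likelihood of the observed acceleration under each particle."""
    mean = idm_accel(v, v_lead, gap, particles[:, 0])
    sigma = particles[:, 1]
    return np.exp(-0.5 * ((obs_accel - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def dither_top_particles(particles, lik, rng, top_frac=0.2):
    """Add discrete uniform noise to the top_frac most likely particles."""
    n_top = max(1, int(top_frac * len(particles)))
    top = np.argsort(lik)[-n_top:]
    out = particles.copy()
    out[top, 0] += rng.choice([-0.5, 0.0, 0.5], size=n_top)
    out[top, 1] += rng.choice([-0.1, 0.0, 0.1], size=n_top)
    out[:, 1] = np.maximum(out[:, 1], 0.1)   # keep sigma positive (assumption)
    return out

def filter_step(particles, obs_accel, v, v_lead, gap, rng):
    """Weight, resample, then dither the particles."""
    lik = likelihoods(particles, obs_accel, v, v_lead, gap)
    idx = rng.choice(len(particles), size=len(particles), p=lik / lik.sum())
    return dither_top_particles(particles[idx], lik[idx], rng)

rng = np.random.default_rng(0)
v_grid, s_grid = np.meshgrid(np.arange(20.0, 30.5, 0.5), np.arange(0.1, 1.05, 0.1))
particles = np.column_stack([v_grid.ravel(), s_grid.ravel()])  # discretized prior
for _ in range(20):
    obs = idm_accel(22.0, 22.0, 40.0, 25.0) + rng.normal(0.0, 0.3)
    particles = filter_step(particles, obs, 22.0, 22.0, 40.0, rng)
v_des_estimate = particles[:, 0].mean()
```

Since the noise values are multiples of the grid spacing used for the initial particles, the dithered particles stay on the same discretization, which matches the remark at the end of the quote.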

  • Future works:
    • Non-stationarity.
    • Combination with a lane changing model such as MOBIL to extend to two-dimensional driving behaviour.

"PLOP: Probabilistic poLynomial Objects trajectory Planning for autonomous driving"

  • [ 2020 ] [📝] [🎞️ (to come)] [ 🚗 Valeo ]

  • [ Gaussian mixture, multi-trajectory prediction, nuScenes, A2D2, auxiliary loss ]

The architecture has two main sections: an encoder to synthesize information and the predictor where we exploit it. Note that PLOP does not use the classic RNN decoder scheme for trajectory generation, preferring a single step version which predicts the coefficients of a polynomial function instead of the consecutive points. Also note the navigation command that conditions the ego prediction. Source.
PLOP uses multimodal sensor data input: Lidar and camera. The map is accumulated over the past 2s, so 20 frames. It produces a multivariate Gaussian mixture for a fixed number of K possible trajectories over a 4s horizon. Uncertainty and variability are handled by predicting vehicle trajectories as probabilistic Gaussian mixture models, constrained by a polynomial formulation. Source.

Authors: Buhet, T., Wirbel, E., & Perrotton, X.

  • Motivations:

    • The goal is to predict multiple feasible future trajectories both for the ego vehicle and neighbors through a probabilistic framework.
      • In addition in an end-to-end trainable fashion.
    • It builds on a previous work: "Conditional vehicle trajectories prediction in carla urban environment" - (Buhet, Wirbel, & Perrotton, 2019). See analysis further below.
      • The trajectory prediction based on polynomial representation is upgraded from deterministic output to multimodal probabilistic output.
      • It re-uses the navigation command input for the conditional part of the network, e.g. follow, left, straight, right.
      • One main difference is the introduction of a new input sensor: Lidar.
      • And adding a semantic segmentation auxiliary loss.
    • The authors also reflect on which metrics are relevant for trajectory prediction:
      • "We suggest to use two additional criteria to evaluate the predictions errors, one based on the most confident prediction, and one weighted by the confidence [how alternative trajectories with non maximum weights compare to the most confident trajectory]."

  • One term: "Probabilistic poLynomial Objects trajectory Planning" = PLOP.

  • I especially like their review on related works about data-driven predictions (section taken from the paper):

    • SocialLSTM: encodes the relations between close agents introducing a social pooling layer.
    • Deterministic approaches derived from SocialLSTM:
      • SEQ2SEQ presents a new LSTM-based encoder-decoder network to predict trajectories into an occupancy grid map.
      • SocialGAN and SoPhie use generative adversarial networks to tackle uncertainty in future paths and augment the original set of samples.
        • CS-LSTM extends SocialLSTM using convolutional layers to encode the relations between the different agents.
      • ChauffeurNet uses a sophisticated neural network with a complex high level scene representation (roadmap, traffic lights, speed limit, route, dynamic bounding boxes, etc.) for deterministic ego vehicle trajectory prediction.
    • Other works use a graph representation of the interactions between the agents in combination with neural networks for trajectory planning.
    • Probabilistic approaches:
      • Many works like PRECOG, R2P2, Multiple Futures Prediction, SocialGAN include probabilistic estimation by adding a probabilistic framework at the end of their architecture producing multiple trajectories for ego vehicle, nearby vehicles or both.
      • In PRECOG, Rhinehart et al. build a probabilistic model that explicitly models interactions between agents, using latent variables to model the plausible reactions of agents to each other, with a possibility to pre-condition the trajectory of the ego vehicle by a goal.
      • MultiPath also reuses an idea from object detection algorithms using trajectory anchors extracted from the training data for ego vehicle prediction.
  • About the auxiliary semantic segmentation task.

    • Teaching the network to represent such semantic in its features improves the prediction.
    • "Our objective here is to make sure that in the RGB image encoding, there is information about the road position and availability, the applicability of the traffic rules (traffic sign/signal), the vulnerable road users (pedestrians, cyclists, etc.) position, etc. This information is useful for trajectory planning and brings some explainability to our model."

  • About interactions with other vehicles.

    • The prediction for each vehicle does not have direct access to the sequence of history positions of others.
    • "The encoding of the interaction between vehicles is implicitly computed by the birdview encoding."

    • The number of predicted trajectories is fixed in the network architecture. K=12 is chosen.
      • "It allows our architecture to be agnostic to the number of considered neighbors."

  • Multi-trajectory prediction in a probabilistic framework.

    • "We want to predict a fixed number K of possible trajectories for each vehicle, and associate them to a probability distribution over x and y: x is the longitudinal axis, y the lateral axis, pointing left."

    • About the Gaussian Mixture.
      • Vehicle trajectories are predicted as probabilistic Gaussian Mixture models, constrained by a polynomial formulation: The mean of the distribution is expressed using a polynomial of degree 4 of time.
      • "In the end, this representation can be interpreted as predicting K trajectories, each associated with a confidence πk [mixture weights shared for all sampled points belonging to the same trajectory], with sampled points following a Gaussian distribution centered on (µk,x,t, µk,y,t) and with standard deviation (σk,x,t, σk,y,t)."

      • "PLOP does not use the classic RNN decoder scheme for trajectory generation, preferring a single step version which predicts the coefficients of a polynomial function instead of the consecutive points."

      • This offers a measure of uncertainty on the predictions.
      • For the ego car, the probability distribution is conditioned by the navigation command.
    • About the loss:
      • Negative log-likelihood over all sampled points of the ground-truth ego and neighbour vehicle trajectories.
      • There is also the auxiliary cross entropy loss for segmentation.
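A minimal sketch of the polynomial Gaussian-mixture output and its negative log-likelihood. It assumes degree-4 polynomials in time for the means, independent Gaussians on x and y, and a mixture evaluated independently at each sampled point with the weights πk shared along a trajectory. This is one possible reading of the description above, and the function names are hypothetical.

```python
import numpy as np

def poly_means(coeffs, times):
    """Evaluate the K polynomial mean trajectories.

    coeffs: (K, 2, 5) coefficients for x and y, lowest degree first.
    times: (T,) sample times. Returns the means with shape (K, T, 2).
    """
    powers = times[:, None] ** np.arange(5)[None, :]     # (T, 5)
    return np.einsum("kdc,tc->ktd", coeffs, powers)

def mixture_nll(gt, weights, means, sigmas):
    """Negative log-likelihood of ground-truth points under the mixture.

    gt: (T, 2), weights: (K,), means and sigmas: (K, T, 2).
    """
    log_gauss = (-0.5 * ((gt[None] - means) / sigmas) ** 2
                 - np.log(sigmas) - 0.5 * np.log(2.0 * np.pi))
    log_comp = np.log(weights)[:, None] + log_gauss.sum(axis=-1)  # (K, T)
    m = log_comp.max(axis=0)                                       # stable log-sum-exp
    log_mix = m + np.log(np.exp(log_comp - m).sum(axis=0))
    return -log_mix.sum()

times = np.linspace(0.5, 4.0, 8)              # 4 s horizon
coeffs = np.zeros((2, 2, 5))
coeffs[0, 0, 1] = 10.0                        # mode 0: straight at 10 m/s
coeffs[1, 0, 1] = 10.0
coeffs[1, 1, 2] = 0.3                         # mode 1: drifting to the left
means = poly_means(coeffs, times)
sigmas = np.full_like(means, 0.5)
weights = np.array([0.7, 0.3])
gt = means[1]                                 # the vehicle actually drifted left
loss = mixture_nll(gt, weights, means, sigmas)
```

With a single mode (K = 1) the same loss reduces to a plain Gaussian regression, which cannot represent both manoeuvres at once.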
  • Some findings:

    • The presented model seems very robust to the varying number of neighbours.
      • Finally, for 5 agents or more, PLOP outperforms both ESP and PRECOG by a large margin, on authors-defined metrics.
      • "This result might be explained by our interaction encoding which is robust to the variations of N using only multiple birdview projections and our non-iterative single step trajectory generation."

    • "Using K = 1 approach yields very poor results, also visible in the training loss. It was an anticipated outcome due to the ambiguity of human behavior."


"Probabilistic Future Prediction for Video Scene Understanding"

  • [ 2020 ] [📝] [🎞️] [🎞️ (blog)] [ 🎓 University of Cambridge ] [ 🚗 Wayve ]

  • [ multi frame, multi future, auxiliary learning, multi-task, conditional imitation learning ]

One main motivation is to supply the Control module (e.g. policy learnt via IL) with a representation capable of modelling the probability of future events. The Dynamics module produces such a spatio-temporal representation, not directly from images but from learnt scene features. These embeddings, which are used by the Control module to learn a driving policy, can be explicitly decoded to future semantic segmentation, depth, and optical flow. Note that the stochasticity of the future is modelled with a conditional variational approach that minimises the divergence between the present distribution (what could happen given what we have seen) and the future distribution (what we observe actually happens). During inference, diverse futures are generated by sampling from the present distribution. Source.
There are many possible futures approaching this four-way intersection. Using 3 different noise vectors makes the model imagine different driving manoeuvres at an intersection: driving straight, turning left or turning right. These samples predict 10 frames, or 2 seconds into the future. Source.
The differential entropy of the present distribution is used to characterize how unsure the model is about the future. It increases as we approach the intersection. Source.

Authors: Hu, A., Cotter, F., Mohan, N., Gurau, C., & Kendall, A.

  • Motivations:

    • 1- Supply the control module (e.g. IL) with an appropriate representation for interaction-aware and uncertainty-aware decision-making, i.e. one capable of modelling probability of future events.
      • Therefore the policy should receive temporal features explicitly trained to predict the future.
        • Motivation for that: It is difficult to learn an effective temporal representation by only using imitation error as a learning signal.
    • Others:
    • 2- "multi frame and multi future" prediction.
      • Perform prediction:
        • ... based on multiple past frames (i.e. not a single one).
        • ... and producing multiple possible outcomes (i.e. not deterministic).
          • Predict the stochasticity of the future, i.e. contemplate multiple possible outcomes and estimate the multi-modal uncertainty.
    • 3- Offer a differentiable / end-to-end trainable system, as opposed to systems that reason over hand-coded representations.
      • I understand it as considering the loss of the IL part into the layers that create the latent representation.
    • 4- Cope with multi-agent interaction situations such as traffic merging, i.e. do not predict the behaviour of each actor in the scene independently.
      • For instance by jointly predicting ego-motion and motion of other dynamic agents.
    • 5- Do not rely on any HD-map to predict the static scene, to stay resilient to HD-map errors due to e.g. roadworks.
  • auxiliary learning: The loss used to train the latent representation is composed of three terms (c.f. motivation 3-):

    • future-prediction: weighted sum of future segmentation, depth and optical flow losses.
    • probabilistic: KL-divergence between the present and the future distributions.
    • control: regression for future time-steps up to some Future control horizon.
  • Let's explore some ideas behind these three components.

  • 1- Temporal video encoding: How to build a temporal and visual representation?

    • What should be predicted?

      • "Previous work on probabilistic future prediction focused on trajectory forecasting [DESIRE, Lee et al. 2017, Bhattacharyya et al. 2018, PRECOG, Rhinehart et al. 2019] or were restricted to single-frame image generation and low resolution (64x64) datasets that are either simulated (Moving MNIST) or with static scenes and limited dynamics."

      • "Directly predicting in the high-dimensional space of image pixels is unnecessary, as some details about the appearance of the world are irrelevant for planning and control."

      • Instead, the task is to predict a more complete scene representation with segmentation, depth, and flow, two seconds in the future.
    • What should the temporal module process?

      • The temporal model should learn the spatio-temporal features from perception encodings [as opposed to RGB images].
      • These encodings are "scene features" extracted from images by a Perception module. They constitute a more powerful and compact representation compared to RGB images.
    • What does the temporal module look like?

      • "We propose a new spatio-temporal architecture that can learn hierarchically more complex features with a novel 3D convolutional structure incorporating both local and global space and time context."

    • The authors introduce a so-called Temporal Block module for temporal video encoding.

      • These Temporal Blocks should help to learn hierarchically more complex temporal features. With two main ideas:
      • 1- Decompose the convolutional filters and play with all possible configurations.
        • "Learning 3D filters is hard. Decomposing into two subtasks helps the network learn more efficient."

        • "State-of-the-art works decompose 3D filters into spatial and temporal convolutions. The model we propose further breaks down convolutions into many space-time combinations and context aggregation modules, stacking them together in a more complex hierarchical representation."

      • 2- Incorporate the "global context" in the features (I did not fully understand that).
        • They concatenate some local features based on 1x1x1 compression with some global features extracted with average pooling.
        • "By pooling the features spatially and temporally at different scales, each individual feature map also has information about the global scene context, which helps in ambiguous situations."

  • 2- Probabilistic prediction: how to generate multiple futures?

    • "There are various reasons why modelling the future is incredibly difficult: natural-scene data is rich in details, most of which are irrelevant for the driving task, dynamic agents have complex temporal dynamics, often controlled by unobservable variables, and the future is inherently uncertain, as multiple futures might arise from a unique and deterministic past."

    • The idea is that the uncertainty of the future can be estimated by making the prediction probabilistic.
      • "From a unique past in the real-world, many futures are possible, but in reality we only observe one future. Consequently, modelling multi-modal futures from deterministic video training data is extremely challenging."

      • Another challenge when trying to learn a multi-modal prediction model:
        • "If the network predicts a plausible future, but one that did not match the given training sequence, it will be heavily penalised."

      • "Our work addresses this by encoding the future state into a low-dimensional future distribution. We then allow the model to have a privileged view of the future through the future distribution at training time. As we cannot use the future at test time, we train a present distribution (using only the current state) to match the future distribution through a KL-divergence loss. We can then sample from the present distribution during inference, when we do not have access to the future."

    • To put it another way, two probability distributions are modelled, in a conditional variational approach:
      • A present distribution P, that represents all that could happen given the past context.
      • A future distribution F, that represents what actually happened in that particular observation.
    • [Learning to align the present distribution with the future distribution] "As the future is multimodal, different futures might arise from a unique past context zt. Each of these futures will be captured by the future distribution F that will pull the present distribution P towards it."

    • How to evaluate predictions?
      • "Our probabilistic model should be accurate, that is to say at least one of the generated future should match the ground truth future. It should also be diverse".

      • The authors use a diversity distance metric (DDM), which measures both accuracy and diversity of the distribution.
    • How to quantify uncertainty?
      • The framework can automatically infer which scenes are unusual or unexpected and where the model is uncertain of the future, by computing the differential entropy of the present distribution.
      • This is useful for understanding edge-cases and when the model needs to "pay more attention".
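The probabilistic part can be illustrated with diagonal Gaussians. The sketch below shows the KL term that pulls the present distribution towards the future distribution, the differential entropy used as an uncertainty measure, and the sampling done at inference. The diagonal-Gaussian parametrization, the direction KL(F || P) and the function names are assumptions for illustration.

```python
import numpy as np

def kl_diag_gaussians(mu_f, sigma_f, mu_p, sigma_p):
    """KL(F || P) between two diagonal Gaussians (assumed direction)."""
    return np.sum(np.log(sigma_p / sigma_f)
                  + (sigma_f ** 2 + (mu_f - mu_p) ** 2) / (2.0 * sigma_p ** 2)
                  - 0.5)

def differential_entropy(sigma):
    """Differential entropy of a diagonal Gaussian, in nats."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * sigma ** 2))

def sample_futures(mu_p, sigma_p, n_samples, rng):
    """Draw latent codes from the present distribution (inference time)."""
    return mu_p + sigma_p * rng.normal(size=(n_samples, len(mu_p)))

rng = np.random.default_rng(0)
mu_present, sigma_present = np.zeros(8), np.ones(8)            # what could happen
mu_future, sigma_future = 0.5 * np.ones(8), 0.3 * np.ones(8)   # what did happen

kl_loss = kl_diag_gaussians(mu_future, sigma_future, mu_present, sigma_present)
uncertainty = differential_entropy(sigma_present)
latents = sample_futures(mu_present, sigma_present, n_samples=3, rng=rng)
```

Each sampled latent code would be decoded into one imagined future, so three samples correspond to the three manoeuvres shown in the intersection example, and a wider present distribution gives a larger entropy.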
  • 3- The rich spatio-temporal features explicitly trained to predict the future are used to learn a driving policy.

    • Conditional Imitation Learning is used to learn speed and steering controls, i.e. regressing to the expert's true control actions {v, θ}.

      • One reason is that it is immediately transferable to the real world.
    • From the ablation study, it seems to highly benefit from both:

      • 1- The temporal features.

        "It is too difficult to forecast how the future is going to evolve with a single image".

      • 2- The fact that these features are capable of probabilistic predictions.
        • Especially for multi-agent interaction scenarios.
    • About the training set:

      • "We address the inherent dataset bias by sampling data uniformly across lateral and longitudinal dimensions. First, the data is split into a histogram of bins by steering, and subsequently by speed. We found that weighting each data point proportionally to the width of the bin it belongs to avoids the need for alternative approaches such as data augmentation."

  • One exciting future direction:

    • For the moment, the control module takes the representation learned from the dynamics model and ignores the predictions themselves.
      • By the way, why are predictions, especially for the ego trajectories, not conditioned on possible actions?
    • It could use these probabilistic embeddings, capable of predicting multi-modal and plausible futures, to generate imagined experience to train a policy with model-based RL.
    • The design of the reward function from the latent space looks challenging at first sight.

"Efficient Behavior-aware Control of Automated Vehicles at Crosswalks using Minimal Information Pedestrian Prediction Model"

  • [ 2020 ] [📝] [ 🎓 University of Michigan, University of Massachusetts ]

  • [ interaction-aware decision-making, probabilistic hybrid automaton ]

The pedestrian crossing behaviour is modelled as a probabilistic hybrid automaton. Source.
The interaction is captured inside a gap-acceptance model: the pedestrian evaluates the available time gap to cross the street and either accepts the gap by starting to cross or rejects the gap by waiting at the crosswalk. Source.
The baseline controller used for comparison is a finite state machine (FSM) with four states. Whenever a pedestrian starts walking to cross the road, the controller always tries to stop, either by yielding or through hard stop. Source.

Authors: Jayaraman, S. K., Robert Jr., L. P., Yang, X. J., Pradhan, A. K., & Tilbury, D. M.

  • Motivations:

    • Scenario: interaction with a pedestrian approaching/crossing/waiting at a crosswalk.
    • 1- A (1.1) simple and (1.2) interaction-aware pedestrian prediction model.
    • 2- Effectively incorporating these predictions in a control framework
      • The idea is to first forecast the position of the pedestrian using a pedestrian model, and then react accordingly.
    • 3- Be efficient on both waiting and approaching pedestrian scenarios.
      • Always assuming a crossing may lead to over-conservative policies.
      • "[in simulation] only a fraction of pedestrians (80%) are randomly assigned the intention to cross the street."

  • Why are CV and CA prediction models not applicable?

    • "At crosswalks, pedestrian behavior is much more unpredictable as they have to wait for an opportunity and decide when to cross."

    • Longer durations are needed.
      • 1- Interaction must be taken into account.
      • 2- The authors decide to model pedestrians as a hybrid automaton that switches between discrete actions.
  • One term: Behavior-aware Model Predictive Controller (B-MPC)

    • 1- The pedestrian crossing behaviour is modelled as a probabilistic hybrid automaton:
      • Four states: Approach Crosswalk, Wait, Cross, Walk away.
      • Probabilistic transitions: using pedestrian's gap acceptance - hence capturing interactions.
        • "What is the probability of accepting the current traffic gap?"

        • "Pedestrians evaluate the available time gap to cross the street and either accept the gap by starting to cross or reject the gap by waiting at the crosswalk."

    • 2- The problem is formulated as a constrained quadratic optimization problem:
      • Cost: success (passing the crosswalk), comfort (penalize jerk and sudden changes in acceleration), efficiency (deviation from the reference speed).
      • Constraints: respect motion model, restrict velocity, acceleration, as well as jerk, and ensure collision avoidance.
      • Solver: standard quadratic program solver in MATLAB.
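A minimal sketch of the pedestrian model as a probabilistic hybrid automaton with the four discrete states listed above. The transition structure and the logistic gap-acceptance curve (with hypothetical `midpoint` and `slope` parameters) are assumptions for illustration. The notes do not give the actual gap-acceptance model.

```python
import math
import random
from enum import Enum

class PedState(Enum):
    APPROACH = "approach crosswalk"
    WAIT = "wait"
    CROSS = "cross"
    WALK_AWAY = "walk away"

def gap_acceptance_probability(time_gap, midpoint=4.0, slope=1.5):
    """Probability of accepting a traffic gap of time_gap seconds.

    Logistic curve chosen for illustration only: larger gaps are accepted
    more often, and a gap of midpoint seconds is accepted half of the time.
    """
    return 1.0 / (1.0 + math.exp(-slope * (time_gap - midpoint)))

def next_state(state, at_crosswalk, done_crossing, time_gap, rng):
    """One discrete transition of the automaton."""
    if state is PedState.APPROACH and not at_crosswalk:
        return PedState.APPROACH
    if state in (PedState.APPROACH, PedState.WAIT):
        accept = rng.random() < gap_acceptance_probability(time_gap)
        return PedState.CROSS if accept else PedState.WAIT
    if state is PedState.CROSS:
        return PedState.WALK_AWAY if done_crossing else PedState.CROSS
    return PedState.WALK_AWAY

rng = random.Random(0)
state = PedState.APPROACH
# The vehicle gets closer, so the time gap shrinks while the pedestrian decides.
for time_gap in (6.0, 5.0, 4.0, 3.0):
    state = next_state(state, at_crosswalk=True, done_crossing=False,
                       time_gap=time_gap, rng=rng)
```

The interaction with the vehicle enters only through `time_gap`, which is what allows a controller to reason about how its own speed profile changes the crossing probability.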
  • Performances:

    • Baseline controller:
      • Finite state machine (FSM) with four states: Maintain Speed, Accelerate, Yield, and Hard Stop.
      • "Whenever a pedestrian starts walking to cross the road, the controller always tries to stop, either by yielding or through hard stop."

      • "The Boolean variable InCW, denotes the pedestrian’s crossing activity: InCW=1 from the time the pedestrian started moving laterally to cross until they completely crossed the AV lane, and InCW=0 otherwise."

      • That means the baseline controller does not react at all to "non-crossing" cases since it never sees the pedestrian crossing laterally.
    • "It can be seen that the B-MPC is more aggressive, efficient, and comfortable than the baseline as observed through the higher average velocity, lower average acceleration effort, and lower average jerk respectively."


"Human Motion Trajectory Prediction: A Survey"

  • [ 2019 ] [📝] [ 🎓 Örebro University, CMU, TU Delft ] [ 🚗 Bosch ]

  • [ survey, physics-based, pattern-based, planning-based, manoeuvre recognition, goal-informed, context-aware, contextual cues, social force ]

The focus is on human motion prediction: not on cars. Approaches differ in the way they represent, parametrize, (learn) and solve the task. Modelling approaches can be divided into three families: physics-based, pattern-based, planning-based. Source.
What information to consider? All modelling approaches can be extended with so-called contextual cues of the target agent, static and dynamic environment. Source.
Some motion trajectories datasets. Mainly for pedestrians. Source.

Authors: Rudenko, A., Palmieri, L., Herman, M., Kitani, K. M., Gavrila, D. M., & Arras, K. O.

  • About pedestrian (not vehicle) motion prediction in AD:

    • "Most works consider the scenario of the laterally crossing pedestrian, dealing with the question what the latter will do at the curbside: start walking, continue walking, or stop walking."

    • "It is difficult to compare the experimental results, as the datasets are varying (different timings of same scenario, different sensors, different metrics)."

  • Prediction methods can be classified in three large categories:

    • 1- physics-based: reactive sense-predict.

    • 2- pattern-based: sense-learn-predict.

    • 3- planning-based: sense-reason-predict.

      • Agents reason about intentions and possible ways to the goal.
    • These categories do not form a partition. Example of combinations:

      • planning-based using physics-based transition functions.
      • physics-based methods tuned with learned parameters.
      • planning-based approaches using IRL (learning methods) to recover the hidden reward function.
      • In general, combinations hold a great potential, e.g. use disagreements to detect uncertainty.
  • 1- Physics-based methods.

    • Suitable in those situations where:

      • The effect of other agents or the static environment can be modelled as forces acting upon an agent.
      • The agent's motion dynamics can be modelled by an explicit transition function.
    • Limitations:

      • They might not capture well the complexity of the real world.
      • Mainly valid for short prediction horizons.
      • Potential burden of parameter tuning.
    • Advantages:

      • Applicability across multiple domains under mild conditions.

        "physics-based approaches can be readily applied across multiple environments, without the need for training datasets (some data for parameter estimation is useful, though)."

      • Interpretability.
      • Usually simple and fast approximate inference.
    • Idea:

      • "Motion is predicted by forward simulating a set of explicitly defined dynamics equations that follow a physics inspired model."

      • In 1-step ahead predictions with first-order Markov assumption:
        • A transition function (or dynamics model) f is defined. With some dt.
        • next_state = f(state, action) + noise.
        • The task is to estimate (sense) state and action. Inference is made from various estimated or observed cues.
        • Extension to consider not just the current state (first-order Markov), but a history of states:
          • "The early work by Zhu (1991) uses an autoregressive moving average model as transition function of a Hidden Markov Model (HMM) to predict occupancy probabilities of moving obstacles over multiple time steps with applications to predictive planning."

      • What to do with this f?
        • Use it alone: CA, CV.
        • Recursive Bayesian filters, e.g. Kalman filters.
        • Multiple-model algorithms.
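
The one-step relation `next_state = f(state, action) + noise` can be illustrated with a constant-velocity (CV) model rolled out over a short horizon. This is a minimal sketch, not taken from the survey: the state layout `(x, y, vx, vy)`, the names `cv_step` and `cv_rollout`, and the choice of modelling the noise as a random acceleration are assumptions made for illustration.

```python
import random


def cv_step(state, dt, accel_noise_std=0.0, rng=random):
    """One-step constant-velocity transition f, with optional acceleration noise."""
    x, y, vx, vy = state
    # CV assumes zero acceleration; the noise term accounts for deviations from it.
    ax = rng.gauss(0.0, accel_noise_std)
    ay = rng.gauss(0.0, accel_noise_std)
    return (
        x + vx * dt + 0.5 * ax * dt * dt,
        y + vy * dt + 0.5 * ay * dt * dt,
        vx + ax * dt,
        vy + ay * dt,
    )


def cv_rollout(state, horizon_s, dt, accel_noise_std=0.0, rng=random):
    """Repeat the one-step prediction to obtain a trajectory (list of states)."""
    trajectory = [state]
    for _ in range(round(horizon_s / dt)):
        state = cv_step(state, dt, accel_noise_std, rng)
        trajectory.append(state)
    return trajectory


# A pedestrian at the origin walking at 1.0 m/s in x and 0.5 m/s in y, predicted for 2 s.
mean_trajectory = cv_rollout((0.0, 0.0, 1.0, 0.5), horizon_s=2.0, dt=0.1)
```

Calling `cv_rollout` several times with `accel_noise_std > 0` gives a set of sampled trajectories whose spread grows with the horizon, which is one way to see why such models are mainly valid for short prediction horizons.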
    • Sub-classes: are multiple dynamics motion models considered, or only one?

    • 1-1. Single model.

      • kinematic vs dynamic models.

        • Are forces that govern the motion of other agents considered?
        • Rarely. They are not directly observable from sensory data. And dynamics models can be very complex.
      • Examples:

        • CV, CA, Curvilinear motion and Coordinated turn model (CT) (assumes constant turn rate and speed with white noise linear and white noise turn acceleration).
      • Miscellaneous:

        • "The bicycle model is an often used as an approximation to model the vehicle dynamics [rather 'kinematics' I would have said]."

        • Parameters (e.g. for IDM) can be learnt / estimated on-line.
      • How to improve them and achieve longer horizons? By considering the static and dynamic environment.

        • 1- Consider map information.

          • Simply said: Roads are structured. Cars try to stay on the road. Considering the free / drivable space as constraints is therefore beneficial, e.g. to focus on high-probability regions. Examples:
            • Formalize relevant traffic rules, e.g. pedestrian crossing permission on the green light, as additional motion constraints.
            • "When there are several possible turns at a node, i.e. at bifurcations, predictions are propagated along all outgoing edges."

            • Reachability-based models.
        • 2- Consider local interactions between multiple agents.

          • Social force models. E.g. potential field model.
          • Intelligent Driver Model (IDM) and MOBIL.
          • Group motion models: detecting and considering groups of people who walk together.
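
As an example of a local interaction model, the Intelligent Driver Model (IDM) computes the acceleration of a follower from its own speed, the gap to its leader and the approach rate. The sketch below uses the standard IDM equation; the function name, the argument names and the default parameter values are illustrative assumptions, and the clamp of the desired gap at `min_gap` is a commonly used variant rather than something stated in the survey.

```python
import math


def idm_acceleration(v, gap, approach_rate,
                     v_desired=30.0, time_headway=1.5, min_gap=2.0,
                     a_max=1.0, b_comfort=1.5, delta=4.0):
    """IDM acceleration [m/s^2] of a follower.

    v: follower speed [m/s]; gap: bumper-to-bumper distance to the leader [m];
    approach_rate: follower speed minus leader speed [m/s].
    Default parameter values are illustrative, not taken from the survey.
    """
    # Desired dynamic gap: standstill gap + time-headway term + braking term.
    desired_gap = min_gap + max(
        0.0,
        v * time_headway + v * approach_rate / (2.0 * math.sqrt(a_max * b_comfort)),
    )
    free_road_term = (v / v_desired) ** delta      # slows down near the desired speed
    interaction_term = (desired_gap / gap) ** 2    # slows down when the gap is too small
    return a_max * (1.0 - free_road_term - interaction_term)
```

Used as the action in a one-step transition function, this acceleration lets a physics-based predictor react to the vehicle in front. Its few parameters are the kind that can be estimated on-line.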
    • 1-2. Multi-model: the question is "how to combine individual models?"

      • "Complex agent motion is poorly described by a single dynamical model f. Although the incorporation of map information and influences from multiple agents render such approaches more flexible, they remain inherently limited."

      • Idea:

        • Define (1) and fuse (2) different prototypical motion modes, each described by a different dynamic regime f.
        • Example of modes: CA, CV, turning, crossing or stopping (for pedestrians).
      • Hybrid estimation and Multi-Model (MM) methods.

        • Why "hybrid"?

          • "MM methods maintain a hybrid system state ξ = (x, s) that augments the continuous valued x by a discrete-valued modal state s."

        • Motion modes are not observable. Uncertainty must be considered. Beliefs must be maintained.

        • Ingredients:

          • 1- A model set. Fixed or on-line adaptive.
          • 2- A strategy to deal with the discrete-valued uncertainties, for example, model sequences under a Markov or semi-Markov assumption.
          • 3- A recursive estimation scheme to deal with the continuous valued components conditioned on the model.
          • 4- A mechanism to generate the overall best estimate from a fusion or selection of the individual filter.
        • Interacting multiple model (IMM) filters.

          • "Interacting" because the model-conditioned filters interact: their estimates are mixed at the start of each cycle (see the interaction step below).
          • "The IMM-filter is a computationally efficient and in many cases well performing suboptimal estimation algorithm for Markovian switching systems. Basically it consists of 3 major steps:"

            • interaction (mixing): Compute an estimate of the state with respect to all N transition models and, from these estimates, compute N mixed inputs.
            • filtering: E.g. standard KF for each model.
            • combination: A weighted combination of updated state estimates produced by all the filters.
          • E.g. switching between CV, constant position (CP) and coordinated turn (CT).
          • Model transitions can be controlled by an intention recognition system.
          • IMM can also be used as a belief updater in POMDPs.
            • It takes as input a belief state and an observation and returns the updated belief state.
            • E.g. to estimate whether the car is following a CV or a CA model.
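
The three IMM steps (interaction, filtering, combination) can be sketched on a deliberately small problem. Everything specific here is an assumption made for illustration: a 1-D pedestrian position, two modes ("standing" and "walking" at a nominal 1.4 m/s), scalar Kalman filters, and the noise and transition values. It is not the implementation of any of the cited works.

```python
import math

# Hypothetical two-mode pedestrian example: 1-D position along the walking direction.
DT = 0.1
DRIFT = {"standing": 0.0, "walking": 1.4}  # assumed nominal speeds [m/s]
Q = 1e-4                                   # process noise variance (assumed)
R = 0.05 ** 2                              # measurement noise variance (assumed)
MODES = ("standing", "walking")
# TRANSITION[i][j] = P(mode j at step k | mode i at step k-1)
TRANSITION = {
    "standing": {"standing": 0.95, "walking": 0.05},
    "walking": {"standing": 0.05, "walking": 0.95},
}


def imm_step(x, p, mu, z):
    """One IMM cycle. x, p, mu: per-mode mean, variance, probability. z: measurement."""
    # 1- Interaction (mixing): build one mixed prior per model.
    c = {j: sum(TRANSITION[i][j] * mu[i] for i in MODES) for j in MODES}
    x0, p0 = {}, {}
    for j in MODES:
        w = {i: TRANSITION[i][j] * mu[i] / c[j] for i in MODES}
        x0[j] = sum(w[i] * x[i] for i in MODES)
        p0[j] = sum(w[i] * (p[i] + (x[i] - x0[j]) ** 2) for i in MODES)
    # 2- Filtering: one scalar Kalman filter per model.
    x_new, p_new, lik = {}, {}, {}
    for j in MODES:
        x_pred = x0[j] + DRIFT[j] * DT
        p_pred = p0[j] + Q
        s = p_pred + R                     # innovation variance
        k = p_pred / s                     # Kalman gain
        innovation = z - x_pred
        x_new[j] = x_pred + k * innovation
        p_new[j] = (1.0 - k) * p_pred
        lik[j] = math.exp(-0.5 * innovation ** 2 / s) / math.sqrt(2 * math.pi * s)
    # Mode probability update from the measurement likelihoods.
    norm = sum(lik[j] * c[j] for j in MODES)
    mu_new = {j: lik[j] * c[j] / norm for j in MODES}
    # 3- Combination: probability-weighted fusion of the per-model estimates.
    x_comb = sum(mu_new[j] * x_new[j] for j in MODES)
    p_comb = sum(mu_new[j] * (p_new[j] + (x_new[j] - x_comb) ** 2) for j in MODES)
    return x_new, p_new, mu_new, (x_comb, p_comb)


x = {m: 0.0 for m in MODES}
p = {m: 0.01 for m in MODES}
mu = {m: 0.5 for m in MODES}
for step in range(1, 21):
    z = 1.4 * DT * step                    # noise-free measurements of a walker
    x, p, mu, (x_comb, p_comb) = imm_step(x, p, mu, z)
```

After a few measurements of a walking pedestrian, `mu["walking"]` dominates. The mode probabilities `mu` are the discrete part of the hybrid state, and the per-mode estimates are the continuous part.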
      • Other hybrid approaches: stochastic Reachability analysis.

        • "Model agents as hybrid systems (with multiple modes) and infer agents' future motions by computing stochastic reachable sets."

      • Non-hybrid: Dynamic Bayesian networks (DBN).

        • Why is it called "non-hybrid"? Not clear to me.
          • Because it does not focus on the joint (manoeuvre, physical state) estimation?
          • As I understand, it is used to infer unobserved variables such as intentions, manoeuvre, or awareness of an oncoming vehicle (head orientation).
        • [Example of DBN-based multi-model] "Coupling a set of dynamic systems (i.e. a bank of Kalman filters (KF)) with an HMM, which is a special case of the DBNs."

        • "Infer human future behaviors, a set of macro-actions described by a set of KFs, based on measured dynamic quantities (i.e. acceleration, torque)."

  • 2- Pattern-based methods.

    • Idea:

      • Learn motion patterns from data of observed agent trajectories.
      • "Many pattern-based methods treat agents as particles, placed in the field of learned transitions, dictating the direction of future motion."

    • Supervised or unsupervised? Not clear to me. Maybe both are possible.

      • It could be supervised: instead of manually defining the (physics-based) update x = x + v*dt, one could regress it.
        • Function approximators are fit to data: NN, HMM, GPs.
      • It could be unsupervised: manoeuvre recognition using clustering methods on canonical scenarios.
    • Advantages:

      • Ability to discover statistical behavioural patterns: dynamics functions are not hand-crafted, but rather approximated from training data.
    • Limitations:

      • Require training data.
      • Generalization capability of the learned model: whether it can be transferred to a different site, especially if the map topology changes.
      • "Pattern-based approaches tend to be used in non-safety critical applications, where explainability is less of an issue and where the environment is spatially constrained."

    • Two categories, depending on whether the function approximator makes a temporal factorization of the dynamics:

    • 2-1. Sequential methods.

      • Idea: Learn conditional models over time and recursively apply learned transition functions for inference.

      • Comparison with many physics-based approaches:

        • Similarity: they learn a one-step predictor s_{t+1} = f(s_{t-n:t}, a_t).
          • transition functions of sequential models have Markovian property.
          • "In order to predict a sequence of state transitions (i.e. a trajectory), consecutive one-step predictions are made to compose a single long-term trajectory."

        • Difference: the function, often non-parametric, is learned from statistical observations, and its parameters cannot be directly interpreted.
      • Sub-categories, depending on the time-input and space range:

        • 2-1.1 Local spatial transition patterns.
          • Learning local motion patterns, such as probabilities of transitions between cells on a grid-map.
        • 2-1.2 Location-independent behavioural patterns.
          • "Unlike the local transition patterns, which are learned and applied for prediction only in a particular environment, location-independent patterns are used for predicting transitions of an agent in the general free space."

        • 2-1.3 Higher-order Markov models.
          • Neural networks for time series prediction.
            • Assuming higher-order Markov property.
            • "Such time series-based models are making a natural transition between the first-order Markovian methods (e.g. local transition patterns) and non-sequential techniques (e.g. clustering-based)."

          • LSTMs are very popular.
            • Social-LSTM, with its social pooling layer, learns to predict joint location-independent transitions in continuous spaces.
            • "Since humans are influenced by nearby people, LSTMs are connected in the social pooling system, sharing information from the hidden state of the LSTMs with the neighbouring pedestrians."
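
For the simplest sequential case, the local spatial transition patterns of 2-1.1, a first-order model can be learnt by counting cell-to-cell transitions on a grid and then applied recursively. This is a minimal sketch under stated assumptions: trajectories are already discretised into grid cells, and an agent in a cell never seen during training is assumed to stay in place. The names `learn_transitions` and `predict_cells` are hypothetical.

```python
from collections import Counter, defaultdict


def learn_transitions(trajectories):
    """Estimate P(next_cell | cell) by counting transitions in observed trajectories."""
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for cell, next_cell in zip(trajectory, trajectory[1:]):
            counts[cell][next_cell] += 1
    return {
        cell: {nxt: n / sum(next_counts.values()) for nxt, n in next_counts.items()}
        for cell, next_counts in counts.items()
    }


def predict_cells(transitions, start_cell, steps):
    """Recursively apply the learnt one-step model to a belief over cells."""
    belief = {start_cell: 1.0}
    for _ in range(steps):
        next_belief = defaultdict(float)
        for cell, prob in belief.items():
            # Assumption: in a cell unseen during training, the agent stays in place.
            for nxt, q in transitions.get(cell, {cell: 1.0}).items():
                next_belief[nxt] += prob * q
        belief = dict(next_belief)
    return belief


observed = [
    [(0, 0), (0, 1), (0, 2)],
    [(0, 0), (0, 1), (1, 1)],
    [(0, 0), (1, 0)],
]
transitions = learn_transitions(observed)
belief_after_two_steps = predict_cells(transitions, (0, 0), steps=2)
```

The resulting belief is tied to this particular grid, which illustrates why such local patterns do not transfer to a different site.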

    • 2-2. Non-sequential methods.

      • Idea:
        • Directly learn a set of full motion patterns from data or a distribution over trajectories.
        • No temporal factorization of the dynamics, as opposed to sequential models (i.e. Markov assumption).
        • "Specifying causal constraints, e.g. Markovian assumptions for the sequential models or particular functional form for the physics-based methods, might be too restrictive or require a very large learning dataset."

      • Mainly unsupervised with clustering-based methods.
        • E.g. with GPs or mixture models as cluster centroids representation.
        • Global structure can be imposed on top of a sequential model:
          • [Example] "Cluster recorded trajectories of humans into global motion patterns using the expectation maximization (EM) algorithm and build an HMM model for each cluster. For prediction, the method compares the observed track with the learned motion patterns, and reasons about which patterns best explain it. Uncertainty is handled by probabilistic mixing of the most likely patterns."

  • 3- Planning-based methods

    • Idea:

      • Goal-informed predictions: reason on motion intent of rational agents.

      • 1- Explicitly reason about the agent’s long-term motion goals.

      • 2- Compute policies or path hypotheses that enable the agent to reach these goals.

      • They are intrinsically map- and obstacle-aware.

      • Requirements:

        • They work well when goals are explicitly defined.
          • "Most planning-based methods rely on a given set of goals, which makes them unusable or imprecise in a situation where no goals are known beforehand, or the number of possible goals is too high."

        • They often need a map of the environment.
          • Potential goals (e.g. exits on a roundabout) in the environment are often used as prior knowledge.
      • "They tend to generate better long-term predictions than the physics-based techniques and generalize to new environments better than the pattern-based approaches."

    • They solve a sequential decision-making problem:

      • "Much of the work use objective functions that minimizes some notion of the total cost of a sequence of actions (motions), and not just the cost of one action in isolation."

    • Two categories: Is the cost function pre-defined?

    • 3-1. "yes": Forward planning.

      • Explicit assumption regarding the optimality criteria.

      • 3-1.1 Motion and path planning methods for one single observed agent.

        • Use optimal motion and path planning techniques with a hand-crafted cost-function.

        • Examples of classical planning algorithms: Dijkstra, Fast Marching Method (FMM), optimal sampling-based motion planners, value iteration.

        • Improvement: multi-modality.

          • We do not know where the observed pedestrian wants to go.
          • Idea: account for several goals in the environment.
          • E.g. Bayesian framework that exploits the set of path hypotheses to estimate the intended destination.
            • Hypotheses can be generated from a Probabilistic Roadmap (PRM).
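
One common way to turn path hypotheses into a destination estimate is to score each goal by how much of a detour the observed motion represents with respect to the optimal path to that goal. The sketch below is an assumption-laden illustration, not the method of a specific cited paper: straight-line distance stands in for the planner's path cost (obstacle-free space), the rationality parameter `beta` is hypothetical, and the travelled cost is approximated by the distance from start to current position.

```python
import math


def goal_posterior(start, current, goals, beta=1.0, prior=None):
    """Posterior over goals given the start and current positions of the agent.

    goals: dict mapping a goal name to its (x, y) position.
    A goal is likely if passing through `current` is a small detour on the way to it.
    """
    names = list(goals)
    if prior is None:
        prior = {name: 1.0 / len(names) for name in names}
    scores = {}
    for name in names:
        # Straight-line costs; a real system would query a planner (e.g. Dijkstra).
        detour = (
            math.dist(start, current)
            + math.dist(current, goals[name])
            - math.dist(start, goals[name])
        )
        scores[name] = prior[name] * math.exp(-beta * detour)
    normaliser = sum(scores.values())
    return {name: score / normaliser for name, score in scores.items()}


goals = {"left_exit": (-10.0, 10.0), "right_exit": (10.0, 10.0)}
posterior = goal_posterior(start=(0.0, 0.0), current=(3.0, 3.0), goals=goals)
```

Here the pedestrian has moved towards the right exit, so most of the probability mass goes to `"right_exit"`. Multi-modal predictions then follow by planning one path per goal and weighting it by this posterior.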
      • 3-1.2 Multi-agent forward planning.

        • Idea: consider interactions between agents in the scene.
        • E.g. Combine global planning (to goals) and local collision avoidance.
        • E.g. cooperative planning. Consider a joint state-space that includes all agents.
    • 3-2. "no": Inverse planning methods.

      • Idea:

        • Estimate the reward function or action model (policy) from observed trajectories.
      • 3-2.1 Single-agent inverse planning.

        • Inverse optimal control = IRL.
        • [MaxEnt IRL] "Humans are assumed to be near-optimal decision makers with stochastic policies, learned from observations, which are used to predict motion as a probability distribution over trajectories."

      • 3-2.2 Directly extract a policy from the data.

        • No reward function is explicitly learnt.

        • GAIL: A discriminator tries to distinguish between observations from experts and generated ones.

        • Deep generative techniques, e.g. PRECOG and R2P2.

        • "Differently from GAIL, the deep generative technique by Rhinehart et al. (2018) adopts a fully differentiable model, which is easy to train without the need of an expensive policy gradient search."

      • 3-2.3 Multi-agent inverse planning.

        • In addition to being rational, here, agents are assumed to be (somehow) cooperative.
        • DESIRE "The method reasons on multi-modal future trajectories accounting for agent interactions, scene semantics and expected reward function, learned using a sampling-based IRL scheme. [...] The RNN architecture allows incorporation of past trajectory into the inference process, which improves prediction accuracy compared to the standard IRL-based techniques."

  • About input contextual cues = factors that influence pedestrian motion.

    • 1- Describing the target agent.

      • [History of recent] position and velocity.
      • Class in {pedestrian, cyclist}.
      • attention and awareness of the robot's presence.
      • "Considering the head orientation or full articulated pose of the person may bring valuable insights on the target agent’s immediate intentions or their awareness of the environment."

      • How to represent uncertainty? Positions may come with ellipses denoting uncertainty. How to integrate them?
    • 2- Describing the surrounding dynamic agents.

      • Interaction-unaware: simple physics-based method, e.g. CV, CA models.

      • Individual-aware:

        • physics-based methods with social forces or similar local interaction models.
        • Many deep learning methods explicitly model interacting entities. E.g. social pooling layers.
        • Learning joint motion patterns is a considerably harder task.
          • "Most existing socially-aware methods still assume that all observed people are behaving similarly and that their motion can be predicted by the same model and with the same features."

          • "Most available approaches assume cooperative behavior, while real humans might rather optimize personal goals instead of joint strategies. In such cases, game-theoretic approaches are possibly better suited for modeling human behavior."

      • Group-aware:

        • Capable of implicitly learning intra- and inter-group coherence behaviours.
    • 3- Describing the static environment.

      • "Humans adapt their behaviors according not only to the movements of the other agents but also to the environment’s shape and structure, making extensive use of its topology to reason on the possible paths to reach the long-term goal."

      • "Due to the high level of structure in the environment, methods in autonomous driving scenarios extensively use available semantic information, such as street layout and traffic rules or current state of the traffic lights."

      • (static) obstacle-aware and map-aware methods.
        • Mainly with occupancy grid maps. Also semantic maps.

"Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions"

  • [ 2019 ] [📝] [ 🎓 Caltech ] [🚗 zoox]

  • [ multimodal, probabilistic, 1-shot ]

The idea is to encode a history of world states (both static and dynamic) and semantic map information in a unified, top-down spatial grid. This allows the use of a deep convolutional architecture to model entity dynamics, entity interactions, and scene context jointly. The authors found that temporal convolutions achieved better performance and significantly faster training than an RNN structure. Source.
Three 1-shot models are proposed. Top: parametric and continuous. Bottom: non-parametric and discrete (trajectories are here sampled for display from the state probabilities). Source.

Authors: Hong, J., Sapp, B., & Philbin, J.

  • Motivations: A prediction method of future distributions of entity state that is:

    • 1- Probabilistic.
      • "A single most-likely point estimate isn't sufficient for a safety-critical system."

      • "Our perception module also gives us state estimation uncertainty in the form of covariance matrices, and we include this information in our representation via covariance norms."

    • 2- Multimodal.
      • It is important to cover a diversity of possible implicit actions an entity might take (e.g., which way through a junction).
    • 3- One-shot.
      • It should directly predict distributions of future states, rather than a single point estimate at each future timestep.
      • "For efficiency reasons, it is desirable to predict full trajectories (time sequences of state distributions) without iteratively applying a recurrence step."

      • "The problem can be naturally formulated as a sequence-to-sequence generation problem. [...] We chose ℓ=2.5s of past history, and predict up to m=5s in the future."

      • "DESIRE and R2P2 address multimodality, but both do so via 1-step stochastic policies, in contrast to ours which directly predicts a time sequence of multimodal distributions. Such policy-based methods require both future roll-out and sampling to obtain a set of possible trajectories, which has computational trade-offs to our one-shot feed-forward approach."

  • How to model entity interactions?

    • 1- Implicitly: By encoding them as surrounding dynamic context.
    • 2- Explicitly: For instance, SocialLSTM pools hidden temporal state between entity models.
    • Here, all surrounding entities are encoded within a specific tensor.
  • Input:

    • A stack of 2d-top-view-grids. Each frame has 128×128 pixels, corresponding to 50m×50m.
      • For instance, the dynamic context is encoded in a RGB image with unique colours corresponding to each element type.
      • The state history of the considered entity is encoded in a stack of binary maps.
        • One could have used only one channel and varied the colour to represent the history.
    • "Note that through rendering, we lose the true graph structure of the road network, leaving it as a modeling challenge to learn valid road rules like legal traffic direction, and valid paths through a junction."

      • Could it not simply be encoded in another tensor?
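
To make the input encoding concrete, the sketch below maps world coordinates to pixel indices of a 128×128 grid covering 50m×50m, and renders a position history as a stack of binary maps. The grid size and extent come from the notes above. Centring the grid on a reference point, placing row 0 at the top, and the function names are assumptions, since the rendering details are not given here.

```python
GRID_SIZE_PX = 128
GRID_EXTENT_M = 50.0
RESOLUTION_M = GRID_EXTENT_M / GRID_SIZE_PX  # 0.390625 m per pixel


def world_to_pixel(x, y, center_x, center_y):
    """Return (row, col) of a world point, or None if it falls outside the grid."""
    col = int((x - center_x + GRID_EXTENT_M / 2) // RESOLUTION_M)
    row = int((center_y + GRID_EXTENT_M / 2 - y) // RESOLUTION_M)  # row 0 is the top
    if 0 <= row < GRID_SIZE_PX and 0 <= col < GRID_SIZE_PX:
        return row, col
    return None


def render_history(positions, center):
    """Encode a list of past (x, y) positions as one binary map per time step."""
    frames = []
    for x, y in positions:
        frame = [[0] * GRID_SIZE_PX for _ in range(GRID_SIZE_PX)]
        pixel = world_to_pixel(x, y, *center)
        if pixel is not None:
            frame[pixel[0]][pixel[1]] = 1
        frames.append(frame)
    return frames


history_maps = render_history([(0.0, 0.0), (1.0, 0.0)], center=(0.0, 0.0))
```

At roughly 0.39 m per pixel, lane-level structure is preserved but the graph structure of the road network is not, which is the modelling challenge mentioned in the quote above.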
  • Output. Three approaches are proposed:

    • 1- Continuous + parametric representations.
      • 1.1. A Gaussian distribution is regressed per future timestep.
      • 1.2. Multi-modal Gaussian Regression (GMM-CVAE).
        • A set of Gaussians is predicted by sampling from a categorical latent variable.
        • Without enhancements, this method is naive and suffers from exchangeability and mode collapse.
        • "In general, our mixture of sampled Gaussian trajectories underperformed our other proposed methods; we observed that some samples were implausible."

    • 2- Discrete + non-parametric representations.
      • Predict occupancy grid maps.
        • A grid is produced for each future modelled timestep.
        • Each grid location holds the probability of the corresponding output state.
        • For comparison, trajectories are extracted via some trajectory sampling procedure.
  • Against non-learnt baselines:

    • "Interestingly, both Linear and Industry baselines performed worse relative to our methods at larger time offsets, but better at smaller offsets. This can be attributed to the fact that predicting near futures can be accurately achieved with classical physics (which both baselines leverage) — more distant future predictions, however, require more challenging semantic understanding."


"Learning Interaction-Aware Probabilistic Driver Behavior Models from Urban Scenarios"

  • [ 2019 ] [📝] [ 🎓 TUM ] [🚗 BMW]

  • [ probabilistic predictions, multi-modality ]

The network produces an action distribution for the next time-step. The features are a function of one selected driver's route intention (such as turning left or right) and the map. Redundant features can be pruned to reduce the complexity of the model: even with as few as 5 features (framed in blue), it is possible for the network to learn basic behaviour models that achieve lower losses than both baseline recurrent networks. Source.
Right: At each time step, the confidence in the action changes. Left: How to compute some loss from the predicted variance if the ground-truth is a point-estimate? The predicted distribution can be evaluated at ground-truth, forming a likelihood. The negative-log-likelihood becomes the objective to minimize. The network can output high variance if it is not sure. But a regularization term deters it from being too uncertain. Source.

Authors: Schulz, J., Hubmann, C., Morin, N., Löchner, J., & Burschka, D.

  • Motivations:

    • 1- Learn a driver model that is:
      • Probabilistic. I.e. capture multi-modality and uncertainty in the predicted low-level actions.
      • Interaction-aware. Well, here the actions of surrounding vehicles are ignored, but their states are considered.
      • "Markovian", i.e. that makes 1-step prediction from the current state, assuming independence of previous states / actions.
    • 2- Simplicity + lightweight.
      • This model is intended to be integrated as a probabilistic transition model into sampling-based algorithms, e.g. particle filtering.
      • Applications include:
        • 1- Forward simulation-based interaction-aware planning algorithms, e.g. Monte Carlo tree search.
        • 2- Driver intention estimation and trajectory prediction, here a DBN example.
      • Since many samples are drawn, runtime should be kept low. Therefore, nested net structures such as DESIRE are excluded.
    • Ingredients:
      • Feedforward net predicting steering and acceleration distributions.
      • Enable multi-modality by building one input vector, and making one prediction, per possible route.
  • About the model:

    • Input: a set of features built from:
      • One route intention. For instance, the distances of both agents to entry and exit of the related conflict areas are computed.
      • The map.
      • The kinematic state (pos, heading, vel) of the 2 closest agents.
    • Output: steering and acceleration distributions, modelled as Gaussian: mean and std are estimated (cov = 0).
      • Not the next state!!
        • "Instead of directly learning a state transition model, we restrict the neural network to learn a 2-dimensional action distribution comprising acceleration and steering angle."

      • Practical implication when building the dataset from real data: the actions of observed vehicles are unknown, but inferred using an inverse bicycle model.
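
The loss behind such a Gaussian output can be written down directly: the predicted distribution is evaluated at the ground-truth action and the negative log-likelihood is minimised. The sketch below assumes independent dimensions (zero covariance, as noted above). The function name is hypothetical and the exact loss used by the authors, including any extra regularisation, is not reproduced here.

```python
import math


def gaussian_nll(mean, std, target):
    """Negative log-likelihood of `target` under independent Gaussians.

    mean, std, target: sequences of equal length, e.g. (acceleration, steering).
    """
    return sum(
        math.log(s) + 0.5 * math.log(2.0 * math.pi) + 0.5 * ((t - m) / s) ** 2
        for m, s, t in zip(mean, std, target)
    )


# A confident but wrong prediction is penalised more than an uncertain one ...
confident_wrong = gaussian_nll(mean=[0.0], std=[1.0], target=[2.0])
uncertain_wrong = gaussian_nll(mean=[0.0], std=[2.0], target=[2.0])
# ... while for a correct prediction, extra uncertainty costs something.
confident_right = gaussian_nll(mean=[0.0], std=[1.0], target=[0.0])
uncertain_right = gaussian_nll(mean=[0.0], std=[2.0], target=[0.0])
```

In this formulation the `log(s)` term is what discourages the network from always reporting a large standard deviation.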
    • Using the model at run time:
      • 1- Sample the possible routes.
      • 2- For each route:
        • Start with one state.
        • Get one action distribution. Note that the uncertainty can change at each step.
        • Sample (acc, steer) from this distribution.
        • Move to next state.
        • Repeat.
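
The run-time loop above can be sketched as follows. The learned network is replaced by a hypothetical stand-in (`dummy_action_model`), the routes are reduced to two toy dictionaries, and a kinematic bicycle model with an assumed wheelbase is used to move to the next state. None of these choices come from the paper. They only make the sample-and-step structure executable.

```python
import math
import random


def bicycle_step(state, acc, steer, dt=0.1, wheelbase=2.7):
    """Kinematic bicycle update of (x, y, heading, speed)."""
    x, y, heading, v = state
    x += v * math.cos(heading) * dt
    y += v * math.sin(heading) * dt
    heading += v / wheelbase * math.tan(steer) * dt
    v = max(0.0, v + acc * dt)
    return (x, y, heading, v)


def dummy_action_model(state, route):
    """Hypothetical stand-in for the network: (means, stds) of (acc, steer)."""
    acc_mean = 0.5 * (route["target_speed"] - state[3])
    return (acc_mean, route["steer"]), (0.2, 0.01)


def sample_rollout(state, route, action_model, n_steps, rng):
    """Sample one trajectory for one route by repeated one-step prediction."""
    trajectory = [state]
    for _ in range(n_steps):
        (acc_mean, steer_mean), (acc_std, steer_std) = action_model(state, route)
        acc = rng.gauss(acc_mean, acc_std)        # sample from the action distribution
        steer = rng.gauss(steer_mean, steer_std)
        state = bicycle_step(state, acc, steer)   # move to the next state
        trajectory.append(state)
    return trajectory


# One conditioned roll-out per candidate route (toy route descriptions).
routes = {
    "straight": {"target_speed": 8.0, "steer": 0.0},
    "left": {"target_speed": 5.0, "steer": 0.1},
}
rng = random.Random(0)
samples = {
    name: sample_rollout((0.0, 0.0, 0.0, 5.0), route, dummy_action_model, 30, rng)
    for name, route in routes.items()
}
```

Because the model is conditioned on the route, multi-modality is handled by the outer loop over `routes`, not by the model itself.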
  • Issue with the accumulation of 1-step to form long-term predictions:

    • As in vanilla imitation learning, it suffers from distribution shift resulting from the accumulating errors.
    • "If this error is too high, the features determined during forward simulation are not represented within the training data anymore."

    • A DAgger-like solution could be considered.
  • About conditioning on a driver's route intention:

    • Without, one could pack all the road info in the input. How many routes to describe? And expect multiple trajectories to be produced. How many output heads? Tricky.
    • Conditioning offers two advantages:
      • "The learning algorithm does not have to cope with the multi-modality induced by different route options. The varying number of possible routes (depending on the road topology) is handled outside of the neural network."

      • It also allows defining (and detailing) relevant features along the considered path: upcoming road curvature or longitudinal distances to stop lines.
    • Limitation of this approach (again related to the "off-distribution" issue):
      • "When enumerating all possible routes and running a forward simulation for each of the conditioned models, there might exist route candidates that are so unlikely that they have never been followed in the training data. Thus their features may result in unreasonable actions during inference, as the network only learns what actions are reasonable given a route, but not which routes are reasonable given a situation."


"Learning Predictive Models From Observation and Interaction"

  • [ 2019 ] [📝] [🎞️] [ 🎓 University of Pennsylvania, Stanford University, UC Berkeley ] [ 🚗 Honda ]

  • [ visual prediction, domain transfer, nuScenes, BDD100K ]

The idea is to learn a latent representation z that corresponds to the true action. The model can then perform joint training on the two kinds of data: it optimizes the likelihood of the interaction data, for which the actions are available, and observation data, for which the actions are missing. Hence the visual predictive model can predict the next frame x_{t+1} conditioned on the current frame x_t and the learnt action representation z_t. Source.
The visual prediction model is trained using two driving sets: action-conditioned videos from Boston and action-free videos from Singapore. Frames from both subsets come from BDD100K or nuScenes datasets. Source.

Authors: Schmeckpeper, K., Xie, A., Rybkin, O., Tian, S., Daniilidis, K., Levine, S., & Finn, C.

  • One concrete industrial use-case:
    • "Imagine that a self-driving car company has data from a fleet of cars with sensors that record both video and the driver’s actions in one city, and a second fleet of cars that only record dashboard video, without actions, in a second city."

    • "If the goal is to train an action-conditioned model that can be utilized to predict the outcomes of steering actions, our method allows us to train such a model using data from both cities, even though only one of them has actions."

  • Motivations (mainly for robotics, but also AD):
    • Generate predictions for complex tasks and new environments, without costly expert demonstrations.
    • More precisely, learn an action-conditioned video predictive model from two kinds of data:
      • 1- passive observations: [x0, x1 ... xN].
        • Videos of another agent, e.g. a human, might show the robot how to use a tool.
        • Observations represent a powerful source of information about the world and how actions lead to outcomes.
        • A learnt model could also be used for planning and control, i.e. to plan coordinated sequences of actions to bring about desired outcomes.
        • But may suffer from large domain shifts.
      • 2- active interactions: [x0, a1, x1 ... aN, xN].
        • Usually more expensive.
  • Two challenges:
    • 1- Observations are not annotated with suitable actions: e.g. only access to the dashcam, not the throttle for instance.
      • In other words, actions are only observed in a subset of the data.
      • The goal is to learn from videos without actions, allowing it to leverage videos of agents for which the actions are unknown (unsupervised manner).
    • 2- Shift in the "embodiment" of the agent: e.g. robots' arms and humans' ones have physical differences.
      • The goal is to bridge the gap between the two domains (e.g., human arms vs. robot arms).
  • What is learnt?
    • p(x_{c+1:T} | x_{1:c}, a_{1:T})
    • I.e. prediction of future frames conditioned on a set of c context frames and sequence of actions.
  • What tests?
    • 1- Different environment within the same underlying dataset: driving in Boston and Singapore.
    • 2- Same environment but different embodiment: humans and robots manipulate objects with different arms.
  • What is assessed?
    • 1- Prediction quality (AD test).
    • 2- Control performance (robotics test).

"Deep Learning-based Vehicle Behaviour Prediction For Autonomous Driving Applications: A Review"

  • [ 2019 ] [📝] [ 🎓 University of Warwick ] [ 🚗 Jaguar Land Rover ]

  • [ multi-modality prediction ]

The authors propose a new classification of behavioural prediction methods. Only deep learning approaches are considered and physics-based approaches are excluded. The criteria are about the input, output and deep learning method. Source.
The first criterion is about the input: What is the prediction based on? It is important to capture road structure and interactions while staying flexible in the representation (e.g. describe different types of intersections and work with varying numbers of target vehicles and surrounding vehicles). Partial observability should be considered by design. Source.
The second criterion is about the output: What is predicted? It is important to propagate the uncertainty from the input and to consider multiple options (multi-modality), and therefore to reason with probabilities. Bottom: why multi-modality is important. Source.
Second criterion is about the output: What is predicted? Important is to propagate the uncertainty from the input and consider multiple options (multi-modality). Therefore to reason with probabilities. Bottom - why multi-modality is important. Source.

Authors: Mozaffari, S., Al-Jarrah, O. Y., Dianati, M., Jennings, P., & Mouzakitis, A.

  • One mentioned review: (Lefèvre et al.) classifies vehicle (behaviour) prediction models into three groups:
    • 1- physics-based
      • Use dynamic or kinematic models of vehicles, e.g. a constant velocity (CV) Kalman Filter model.
    • 2- manoeuvre-based
      • Predict vehicles' manoeuvres, i.e. a classification problem from a defined set.
    • 3- interaction-aware
      • Consider interaction of vehicles in the input.
  • About the terminology:
    • "Target Vehicles" (TV) are vehicles whose behaviour we are interested in predicting.
    • The others are "Surrounding Vehicles" (SV).
    • The "Ego Vehicle" (EV) can be also considered as an SV, if it is close enough to TVs.
  • Here, the authors ignore the physics-based methods and propose three criteria for comparison:
    • 1- Input.
      • Track history of TV only.
      • Track history of TV and SVs.
      • Simplified bird’s eye view.
      • Raw sensor data.
    • 2- Output.
      • Intention class: From a set of pre-defined discrete classes, e.g. go straight, turn left, and turn right.
      • Unimodal trajectory: Usually the one with the highest likelihood (or the average).
      • Intention-based trajectory: Predict the trajectory that corresponds to the most probable intention (first case).
      • Multimodal trajectory: Combine the previous ones. Two options, depending on whether the intention set is fixed or dynamically learnt:
        • static intention set: predict for each member of the set (an extension to intention-based trajectory prediction approaches).
        • dynamic intention set: because manoeuvres are defined dynamically, these approaches are prone to converging to a single manoeuvre or failing to explore all the existing manoeuvres.
    • 3- In-between (deep learning method).
      • RNN are used because of their temporal feature extracting power.
      • CNN are used for their spatial feature extracting ability (especially with bird’s eye views).
  • Important considerations for behavioural prediction:
    • Traffic rules.
    • Road geometry.
    • Multimodality: there may exist more than one possible future behaviour.
    • Interaction.
    • Uncertainty: both aleatoric (measurement noise) and epistemic (partial observability). Hence the prediction should be probabilistic.
    • Prediction horizon: approaches can serve different purposes based on how far in the future they predict (short-term or long-term future motion).
  • Two methods I would like to learn more about:
    • social pooling layers, e.g. used by (Deo & Trivedi, 2019):
      • "A social tensor is a spatial grid around the target vehicle that the occupied cells are filled with the processed temporal data (e.g., LSTM hidden state value) of the corresponding vehicle. It contains both the temporal dynamic of vehicles represented and spatial inter-dependencies among them."

    • graph neural networks, e.g. (Diehl et al., 2019) or (Li et al., 2019):
      • Graph Convolutional Network (GCN).
      • Graph Attention Network (GAT).
  • Comments:
    • Contrary to the object detection task, there is no benchmark for systematically evaluating previous studies on vehicle behaviour prediction.
      • Urban scenarios are excluded in the comparison since NGSIM I-80 and US-101 highway driving datasets are used.
      • Maybe the INTERACTION Dataset could be used.
    • The authors suggest embedding domain knowledge in the prediction, and call for practical considerations (industry-supported research).
      • "Factors such as environment conditions and set of traffic rules are not directly inputted to the prediction model."

      • "Practical limitations such as sensor impairments and limited computational resources have not been fully taken into account."
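
To make the quoted definition of the social tensor more concrete, here is a minimal NumPy sketch. It is only an illustration under stated assumptions: the function name `build_social_tensor`, the 13 x 3 grid, the cell size and the two-dimensional stand-in "hidden states" are all hypothetical choices, not values taken from the review.

```python
import numpy as np

def build_social_tensor(target_xy, neighbours_xy, neighbour_states,
                        grid_shape=(13, 3), cell_size=(4.5, 3.5)):
    """Place each neighbour's encoded state in the grid cell it occupies.

    target_xy: (2,) longitudinal/lateral position of the target vehicle.
    neighbours_xy: (N, 2) positions of the surrounding vehicles.
    neighbour_states: (N, D) encoded track histories (e.g. LSTM hidden states).
    Returns an array of shape (rows, cols, D); empty cells stay at zero.
    """
    rows, cols = grid_shape
    dim = neighbour_states.shape[1]
    tensor = np.zeros((rows, cols, dim))
    # Offsets relative to the target, in cell units, shifted so that the
    # target vehicle sits in the central cell of the grid.
    rel = (np.asarray(neighbours_xy) - np.asarray(target_xy)) / np.asarray(cell_size)
    idx = np.floor(rel + np.array([rows / 2, cols / 2])).astype(int)
    for (r, c), h in zip(idx, neighbour_states):
        if 0 <= r < rows and 0 <= c < cols:  # vehicles outside the grid are dropped
            tensor[r, c] = h
    return tensor

states = np.array([[1.0, 2.0], [3.0, 4.0]])  # stand-ins for LSTM hidden states
tensor = build_social_tensor([0.0, 0.0], [[9.0, 3.5], [-9.0, -3.5]], states)
print(tensor.shape)  # (13, 3, 2)
```

The fixed grid is what gives convolutional layers a regular input, and it is also why vehicles outside the grid are simply ignored.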


"Multi-Modal Simultaneous Forecasting of Vehicle Position Sequences using Social Attention"

  • [ 2019 ] [📝] [ 🎓 Ecole CentraleSupelec ] [ 🚗 Renault ]

  • [ multi-modality prediction, attention mechanism ]

Two multi-head attention layers are used to account for social interactions between all vehicles. They are combined with LSTM layers to offer joint, long-range and multi-modal forecasts. Source.
Source.

Authors: Mercat, J., Gilles, T., Zoghby, N. El, Sandou, G., Beauvois, D., & Gil, G. P.

  • Previous work: "Social Attention for Autonomous Decision-Making in Dense Traffic" by (Leurent, & Mercat, 2019), detailed on this page as well.
  • Motivations:
    • 1- joint - Considering interactions between all vehicles.
    • 2- flexible - Independent of the number/order of vehicles.
    • 3- multi-modal - Considering uncertainty.
    • 4- long-horizon - Predicting over a long range. Here 5s on simple highway scenarios.
    • 5- interpretable - E.g. using the social attention coefficients.
    • 6- long distance interdependencies - The authors decide to exclude the spatial grid representations that "limit the zone of interest to a predefined fixed size and the spatial relation precision to the grid cell size".
  • Main idea: Stack LSTM layers with social multi-head attention layers.
    • More precisely, the model is broken into four parts:
      • 1- An Encoder processes the sequences of all vehicle positions (no information about speed, orientation, size or blinker).
      • 2- A Self-attention layer captures interactions between all vehicles using "dot product attention". It has "multiple heads", each specializing in different interaction patterns, e.g. "closest front vehicle in any lane".
      • 3- A Predictor, using LSTM cells, forecasts the positions.
      • A second multi-head self-attention layer is placed here.
      • 4- A final Decoder produces sequences of Gaussian mixtures for each vehicle.
        • "What is forecast is not a mixture of trajectory density functions but a sequence of position mixture density functions. There is a dependency between forecasts at time t_k and at time t_{k+1} but no explicit link between the modes at those times."

  • Two quotes about multi-modality prediction:
    • "When considering multiple modes, there is a challenging trade-off to find between anticipating a wide diversity of modes and focusing on realistic ones".

    • "VAE and GANs are only able to generate an output distribution with sampling and do not express a PDF".

  • Baselines used to compare the presented "Social Attention Multi-Modal Prediction" approach:
    • Constant velocity (CV), that uses Kalman filters (hence single modality).
    • Convolutional Social Pooling (CSP), that uses convolutional social pooling on a coarse spatial grid. Six mixture components are used.
    • Graph-based Interaction-aware Trajectory Prediction (GRIP), that uses a spatial and temporal graph representation of the scene.
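
The "dot product attention" over vehicles mentioned above can be sketched with a single head in NumPy. This is a generic scaled dot-product self-attention, not the authors' implementation: the projection matrices `w_q`, `w_k`, `w_v` and all dimensions are hypothetical. It illustrates the "flexible" motivation, since the same weights work for any number of vehicles and reordering the vehicles only reorders the outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def social_self_attention(h, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention across vehicles.

    h: (N, D) one encoded state per vehicle.
    Returns the attended features (N, D_v) and the attention weights (N, N).
    """
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    weights = softmax(scores, axis=-1)  # row i: how much vehicle i attends to each vehicle
    return weights @ v, weights

rng = np.random.default_rng(0)
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
h = rng.normal(size=(5, 8))  # 5 vehicles, 8 encoded features each (made-up sizes)
features, attention = social_self_attention(h, w_q, w_k, w_v)
print(features.shape, attention.shape)  # (5, 4) (5, 5)
```

The rows of `attention` are the kind of coefficients that could be inspected for interpretability, as listed in the motivations.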

"MultiPath : Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction"

  • [ 2019 ] [📝] [ 🚗 Waymo ]

  • [ anchor, multi-modality prediction, weighted prediction, mode collapse ]

Source.
A discrete set of intents is modelled as a set of K=3 anchor trajectories. Uncertainty is assumed to be unimodal given intent (here 3 intents are considered) while control uncertainty is modelled with a Gaussian distribution dependent on each waypoint state of an anchor trajectory. Such an example shows that modelling multiple intents is important. Source.

Authors: Chai, Y., Sapp, B., Bansal, M., & Anguelov, D.

  • One idea: "Anchor Trajectories".
    • One could therefore draw a parallel between the sizes of bounding boxes in Yolo and the shape of trajectories: they could be approximated with some static predetermined patterns and refined to the current context (the actual task of the NN here).
      • "After doing some clustering studies on ground truth labels, it turns out that most bounding boxes have certain height-width ratios." [explanation about Yolo from this page]

      • "Our trajectory anchors are modes found in our training data in state-sequence space via unsupervised learning. These anchors provide templates for coarse-granularity futures for an agent and might correspond to semantic concepts like change lanes, or slow down." [from the presented paper]

    • This idea also reminds me of the concept of pre-defined templates used for path planning.
  • One motivation: model multiple intents.
    • This contrasts with the numerous approaches which predict one single most-likely trajectory per agent, usually via supervised regression.
    • The multi-modality is important since prediction is inherently stochastic.
      • The authors distinguish between intent uncertainty and control uncertainty (conditioned on intent).
    • A Gaussian Mixture Model (GMM) distribution is used to model both types of uncertainty.
      • "At inference, our model predicts a discrete distribution over the anchors and, for each anchor, regresses offsets from anchor waypoints along with uncertainties, yielding a Gaussian mixture at each time step."

  • One risk when working with multi-modality: directly learning a mixture suffers from issues of "mode collapse".
    • This issue is common in GAN where the generator starts producing limited varieties of samples.
    • The solution implemented here is to estimate the anchors a priori before fixing them to learn the rest of our parameters (as for Faster-RCNN and Yolo for instance).
  • Second motivation: weight the several trajectory predictions.
    • This contrasts with methods that randomly sample from a generative model (e.g. CVAE and GAN), leading to an unweighted set of trajectory samples (not to mention the problem of reproducibility and analysis).
    • Here, a parametric probability distribution is directly predicted: p(trajectory|observation), together with a compact weighted set of explicit trajectories which summarizes this distribution well.
      • This contrasts with methods that output a probabilistic occupancy grid.
  • About the "top-down" representation, structured in a 3d array:
    • The first 2 dimensions represent spatial locations in the top-down image
    • "The channels in the depth dimension hold static and time-varying (dynamic) content of a fixed number of previous time steps."

      • Static context includes lane connectivity, lane type, stop lines, speed limit.
      • Dynamic context includes traffic light states over the past 5 time-steps.
      • The previous positions of the different dynamic objects are also encoded in some depth channels.
  • One word about the training dataset.
    • The model is trained via imitation learning by fitting the parameters to maximize the log-likelihood of recorded driving trajectories.
    • "The balanced dataset totals 3.85 million examples, contains 5.75 million agent trajectories and constitutes approximately 200 hours of (real-world) driving."
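
The anchor idea can be sketched as a per-example loss: pick the anchor closest to the recorded trajectory, then score the anchor probability and the Gaussian around "anchor + offset". This is my own simplified reading, not the paper's code: the name `multipath_style_nll`, the diagonal covariance, the hard assignment by L2 distance and the toy anchors are all assumptions made for illustration.

```python
import numpy as np

def multipath_style_nll(gt, anchors, offsets, log_sigma, logits):
    """Negative log-likelihood of one recorded trajectory under an anchor-based mixture.

    gt: (T, 2) recorded future positions.
    anchors, offsets, log_sigma: (K, T, 2) fixed anchors, predicted offsets,
        predicted log standard deviations (diagonal covariance assumed).
    logits: (K,) unnormalised scores over the K anchors.
    """
    # Hard-assign the ground truth to the closest anchor (L2 distance over the sequence).
    k_hat = int(np.argmin(((anchors - gt) ** 2).sum(axis=(1, 2))))
    # Log-softmax over the anchors.
    m = logits.max()
    log_pi = logits - (m + np.log(np.exp(logits - m).sum()))
    mean = anchors[k_hat] + offsets[k_hat]
    var = np.exp(2.0 * log_sigma[k_hat])
    # Independent Gaussians per time step and per coordinate.
    log_gauss = -0.5 * ((gt - mean) ** 2 / var + 2.0 * log_sigma[k_hat] + np.log(2.0 * np.pi))
    return -(log_pi[k_hat] + log_gauss.sum()), k_hat

t = np.arange(1.0, 5.0)
anchors = np.stack([np.stack([t, 0.0 * t], axis=1),    # go straight
                    np.stack([t, 0.5 * t], axis=1),    # drift left
                    np.stack([t, -0.5 * t], axis=1)])  # drift right
gt = anchors[1] + 0.1
zeros = np.zeros_like(anchors)
nll, k_hat = multipath_style_nll(gt, anchors, zeros, zeros, np.zeros(3))
print(k_hat)  # 1
```

Because the anchors are fixed before training, the network only has to learn which template applies and how to refine it, which is how I understand the protection against mode collapse.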


"SafeCritic: Collision-Aware Trajectory Prediction"

  • [ 2019 ] [📝] [ 🎓 University of Amsterdam ] [ 🚗 BMW ]

  • [ Conditional GAN ]

The Generator predicts trajectories that are scored against two criteria: The Discriminator (as in GAN) for accuracy (i.e. consistent with the observed inputs) and the Critic (the generator acts as an Actor) for safety. The random noise vector variable z in the Generator can be sampled from N(0, 1) to sample novel trajectories. Source.
Several features offered by the predictions of SafeCritic: accuracy, diversity, attention and safety. Source.

Authors: van der Heiden, T., Nagaraja, N. S., Weiss, C., & Gavves, E.

  • Main motivation:
    • "We argue that one should take into account safety, when designing a model to predict future trajectories. Our focus is to generate trajectories that are not just accurate but also lead to minimum collisions and thus are safe. Safe trajectories are different from trajectories that try to imitate the ground truth, as the latter may lead to implausible paths, e.g, pedestrians going through walls."

    • Hence the trajectory predictions of the Generator are evaluated against multiple criteria:
      • Accuracy: The Discriminator checks if the prediction is coherent / plausible with the observation.
      • Safety: Some Critic predicts the likelihood of a future dynamic and static collision.
    • A third loss term is introduced:
      • "Training the generator is harder than training the discriminator, leading to slow convergence or even failure."

      • An additional auto-encoding loss to the ground truth is introduced.
      • It should encourage the model to avoid trivial solutions and mode collapse, and should increase the diversity of future generated trajectories.
      • The term mode collapse means that instead of suggesting multiple trajectory candidates (multi-modal), the model restricts its prediction to only one instance.
  • About RL:
    • The authors mention several terms related to RL, in particular they try to draw a parallel with Inverse RL:
      • "GANs resemble IRL in that the discriminator learns the cost function and the generator represents the policy."

    • I get the gist of that idea, but I honestly did not understand where it is implemented here. In particular no MDP formulation is given.
  • About attention mechanism:
    • "We rely on attention mechanism for spatial relations in the scene to propose a compact representation for modelling interaction among all agents [...] We employ an attention mechanism to prioritize certain elements in the latent state representations."

    • The grid-like scene representation is shared by both the Generator and the Critic.
  • About the baselines:
    • I like the "related work" section which briefly introduces the state-of-the-art trajectory prediction models based on deep learning. SafeCritic takes inspiration from some of their ideas, such as:
      • Aggregation of past information about multiple agents in a recurrent model.
      • Use of Conditional GAN to offer the possibility to also generate novel trajectories given an observation via sampling (standard GANs have no encoder).
      • Generation of multi-modal future trajectories.
      • Incorporation of semantic visual features (extracted by deep networks) combined with an attention mechanism.
    • SocialGAN, SocialLSTM, Car-Net, SoPhie and DESIRE are used as baselines.
    • R2P2 and SocialAttention are also mentioned.
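
To illustrate what "dynamic and static collision" could mean as a training signal for the Critic, here is a small geometric check in NumPy. It is a hypothetical sketch, not SafeCritic's collision checker: the function name `collision_flags`, the distance threshold and the occupancy-grid convention are assumptions.

```python
import numpy as np

def collision_flags(traj, others, occupancy, cell_size=1.0, radius=0.5):
    """Return (dynamic_collision, static_collision) for one predicted trajectory.

    traj: (T, 2) predicted x/y positions of one agent.
    others: (M, T, 2) predicted positions of the other agents at the same time steps.
    occupancy: 2-D boolean grid of static obstacles, indexed [row=y, col=x].
    """
    # Dynamic collision: closer than `radius` to any other agent at the same time step.
    dists = np.linalg.norm(others - traj[None], axis=-1)  # (M, T)
    dynamic = bool((dists < radius).any())
    # Static collision: any waypoint falls into an occupied cell
    # (waypoints outside the map are treated as free).
    cols = np.floor(traj[:, 0] / cell_size).astype(int)
    rows = np.floor(traj[:, 1] / cell_size).astype(int)
    inside = (rows >= 0) & (rows < occupancy.shape[0]) & (cols >= 0) & (cols < occupancy.shape[1])
    static = bool(occupancy[rows[inside], cols[inside]].any())
    return dynamic, static

occupancy = np.zeros((5, 5), dtype=bool)
occupancy[2, 3] = True  # a "wall" cell
through_wall = np.array([[0.5, 2.5], [1.5, 2.5], [3.5, 2.5]])
far_agent = np.array([[[4.5, 0.5], [4.5, 1.5], [4.5, 4.5]]])
print(collision_flags(through_wall, far_agent, occupancy))  # (False, True)
```

Flags like these could serve as the reward or label that a learned critic is trained to anticipate, which is how I picture the "pedestrians going through walls" case being penalized.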

"A Review of Tracking, Prediction and Decision Making Methods for Autonomous Driving"

  • [ 2019 ] [📝] [ 🎓 University of Iasi ]

One figure:

Classification of motion models based on three increasingly abstract levels - adapted from (Lefèvre, S., Vasquez, D. & Laugier, C. - 2014). Source.

Authors: Leon, F., & Gavrilescu, M.

  • A reference to one white paper: "Safety first for automated driving" 2019 - from Aptiv, Audi, Baidu, BMW, Continental, Daimler, Fiat Chrysler Automobiles, HERE, Infineon, Intel and Volkswagen (alphabetical order). The authors quote some of the good practices about Interpretation and Prediction:
    • Predict only a short time into the future (the further the predicted state is in the future, the less likely it is that the prediction is correct).
    • Rely on physics where possible (a vehicle driving in front of the automated vehicle will not stop in zero time on its own).
    • Consider the compliance of other road users with traffic rules.
  • Miscellaneous notes about prediction:
    • The authors point out the need for high-level reasoning (the more abstract the feature, the more reliable it is long term), mentioning both "affinity" and "attention" mechanisms.
    • They also call for jointly addressing vehicle motion modelling and risk estimation (criticality assessment).
    • Gaussian Processes are found to be a flexible tool for modelling motion patterns and are compared to Markov Models for prediction.
      • In particular, GP regressions have the ability to quantify uncertainty (e.g. occlusion).
    • "CNNs can be superior to LSTMs for temporal modelling since trajectories are continuous in nature, do not have complicated "state", and have high spatial and temporal correlations".
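
The remark that GP regression can quantify uncertainty, e.g. under occlusion, can be illustrated with a few lines of NumPy. This is a textbook GP with an RBF kernel, written as a sketch: the kernel, its hyper-parameters and the toy lateral-position data are my assumptions, not taken from the review.

```python
import numpy as np

def rbf(a, b, length=1.0, amp=1.0):
    """Squared-exponential kernel between two 1-D arrays of time stamps."""
    d = a[:, None] - b[None, :]
    return amp ** 2 * np.exp(-0.5 * (d / length) ** 2)

def gp_predict(t_obs, y_obs, t_query, noise=0.05, length=1.0):
    """Posterior mean and standard deviation of a zero-mean GP at the query times."""
    k = rbf(t_obs, t_obs, length) + noise ** 2 * np.eye(len(t_obs))
    k_s = rbf(t_query, t_obs, length)
    mean = k_s @ np.linalg.solve(k, y_obs)
    cov = rbf(t_query, t_query, length) - k_s @ np.linalg.solve(k, k_s.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Lateral position observed before and after an occlusion between t = 1.5 s and t = 5 s.
t_obs = np.array([0.0, 0.5, 1.0, 1.5, 5.0, 5.5, 6.0])
y_obs = 0.2 * t_obs
mean, std = gp_predict(t_obs, y_obs, np.array([1.0, 3.25]))
print(std[1] > std[0])  # True: the estimate inside the occluded gap is less certain
```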


"Deep Predictive Autonomous Driving Using Multi-Agent Joint Trajectory Prediction and Traffic Rules"

  • [ 2019 ] [📝] [🎞️] [ 🎓 Seoul National University ]

One figure:

The framework consists of four modules: encoder module, interaction module, prediction module and control module. Source.

Authors: Cho, K., Ha, T., Lee, G., & Oh, S.

  • One previous work: "Learning-Based Model Predictive Control under Signal Temporal Logic Specifications" by (Cho & Oh, 2018).
  • One term: "robustness slackness" for STL-formula.
    • The motivation is to solve dilemma situations (inherent to strict compliance when all rules cannot be satisfied) by disobeying certain rules based on their predicted degree of satisfaction.
    • The idea is to filter out non-plausible trajectories in the prediction step to only consider valid prediction candidates during planning.
    • The filter considers some "rules" such as Lane keeping and Collision avoidance of front vehicle or Speed limit (I did not understand why they are equally considered).
    • These rules are represented by Signal Temporal Logic (STL) formulas.
      • Note: STL is an extension of Linear Temporal Logic (with boolean predicates and discrete-time) with real-time and real-valued constraints.
    • A metric can be introduced to measure how well a given signal (here, a trajectory candidate) satisfies an STL formula.
      • This is called "robustness slackness" and acts as a margin to satisfaction of STL-formula.
    • This enables a "control under temporal logic specification" as mentioned by the authors.
  • Architecture
    • Encoder module: The observed trajectories are fed to some LSTM whose internal state is used by the two subsequent modules.
    • Interaction module: To consider interaction, all LSTM states are concatenated (joint state) together with a feature vector of relative distances. In addition, a CVAE is used for multi-modality (several possible trajectories are generated) and to capture interactions (I did not fully understand that point), as stated by the authors:
      • "The latent variable z models inherent structure in the interaction of multiple vehicles, and it also helps to describe underlying ambiguity of future behaviours of other vehicles."

    • Prediction module: Based on the LSTM states, the concatenated vector and the latent variable, both future trajectories and margins to the satisfaction of each rule are predicted.
    • Control module: An MPC optimizes the control of the ego car, deciding which rules should be prioritized based on the two predicted objects (trajectories and robustness slackness).
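
The idea of a margin to the satisfaction of an STL formula can be sketched with the standard quantitative semantics of the "always" operator: the worst-case margin over the horizon. The paper's exact definition of "robustness slackness" may differ; the function names, the rules and the numbers below are hypothetical.

```python
import numpy as np

def robustness_always_leq(signal, bound):
    """Robustness of G(signal <= bound): smallest margin over the horizon."""
    return float(np.min(bound - np.asarray(signal, dtype=float)))

def robustness_always_geq(signal, bound):
    """Robustness of G(signal >= bound): smallest margin over the horizon."""
    return float(np.min(np.asarray(signal, dtype=float) - bound))

def robustness_and(*values):
    """Robustness of a conjunction: the least satisfied sub-formula dominates."""
    return min(values)

# One trajectory candidate, described by three signals over three time steps.
speed = [24.0, 26.0, 27.0]          # m/s, rule: always <= 30
front_gap = [12.0, 9.0, 7.0]        # m,   rule: always >= 10
lane_offset = [0.1, 0.3, 0.2]       # m,   rule: always <= 1 (absolute offset)

rho_speed = robustness_always_leq(speed, 30.0)
rho_gap = robustness_always_geq(front_gap, 10.0)
rho_lane = robustness_always_leq(lane_offset, 1.0)
print(rho_speed, rho_gap, rho_lane)  # 3.0 -3.0 0.7
print(robustness_and(rho_speed, rho_gap, rho_lane))  # -3.0
```

A positive value means the rule is satisfied with some margin and a negative value means it is violated, so a controller can trade rules off against each other by comparing these margins.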

"An Online Evolving Framework for Modeling the Safe Autonomous Vehicle Control System via Online Recognition of Latent Risks"

  • [ 2019 ] [📝] [ 🎓 Ohio State University ] [ 🚗 Ford ]

  • [ MDP, action-state transitions matrix, SUMO, risk assessment ]


One figure:

Both the state space and the transition model are adapted online, offering two features: prediction about the next state and detection of unknown (i.e. risky) situations. Source.

Authors: Han, T., Filev, D., & Ozguner, U.

  • Motivation
  • Main ideas: Both the state space and the transition model (here discrete state space so transition matrices) of an MDP are adapted online.
    • I understand it as trying to learn the transition model (experience is generated using SUMO), hence to some extent going toward model-based RL.
    • The motivation is to assist any AV control framework with a so-called "evolving Finite State Machine" (e-FSM).
      • By identifying state-transitions precisely, the future states can be predicted.
      • By determining states uniquely (using online-clustering methods) and recognizing the state consistently (expressed by a probability distribution), initially unexpected dangerous situations can be detected.
      • It reminds me of some ideas about risk assessment discussed during IV19: the discrepancy between expected outcome and observed outcome is used to quantify risk (i.e. the surprise or misinterpretation of the current situation).
  • Some concerns:
    • "The dimension of transition matrices should be expanded to represent state-transitions between all existing states"
      • What happens when the scenario gets more complex than the presented "simple car-following" and the state space (treated as discrete) becomes huge?
    • In addition, "the total number of transition matrices is identical to the total number of actions".
      • For the simple example alone, the acceleration command was sampled into 17 bins. Continuous action spaces are not an option.
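
The scaling concern can be made concrete with a count-based transition model that keeps one matrix per discrete action and grows all of them when a new state is discovered. This is a simplified stand-in for the e-FSM, not the authors' method: the class name, the count-based estimate and the uniform fallback are my assumptions, and the online clustering that would create states is left out.

```python
import numpy as np

class CountingTransitionModel:
    """One (S x S) count matrix per discrete action, grown online."""

    def __init__(self, n_actions):
        self.n_states = 0
        self.counts = np.zeros((n_actions, 0, 0))

    def add_state(self):
        """Register a new state: every action's matrix gains one row and one column."""
        n = self.n_states
        grown = np.zeros((self.counts.shape[0], n + 1, n + 1))
        grown[:, :n, :n] = self.counts
        self.counts = grown
        self.n_states += 1
        return n  # index of the new state

    def update(self, s, a, s_next):
        self.counts[a, s, s_next] += 1

    def predict(self, s, a):
        """Estimated distribution over next states for (state s, action a)."""
        row = self.counts[a, s]
        total = row.sum()
        # Unvisited (state, action) pairs fall back to a uniform guess.
        return row / total if total > 0 else np.full(self.n_states, 1.0 / self.n_states)

model = CountingTransitionModel(n_actions=17)  # 17 acceleration bins, as in the example above
s0, s1 = model.add_state(), model.add_state()
model.update(s0, 3, s1)
print(model.counts.shape)  # (17, 2, 2)
```

The storage grows as (number of actions) x (number of states) squared, which is the reason for the concern about larger scenarios.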

"A Driving Intention Prediction Method Based on Hidden Markov Model for Autonomous Driving"

  • [ 2019 ] [📝] [ 🎓 IEEE ]

  • [ HMM, Baum-Welch algorithm, forward algorithm ]


One figure:

Source.

Authors: Liu, S., Zheng, K., Zhao, L., & Fan, P.

  • One term: "mobility feature matrix"
    • The recorded data (e.g. absolute positions, timestamps ...) are processed to form the mobility feature matrix (e.g. speed, relative position, lateral gap in lane ...).
    • Its size is T × L × N: T time steps, L vehicles, N types of mobility features.
    • In the discrete characterization, this matrix is then turned into a set of observations using K-means clustering.
    • In the continuous case, mobility features are modelled as Gaussian mixture models (GMMs).
  • This work implements HMM concepts presented in my project Educational application of Hidden Markov Model to Autonomous Driving.
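
For the discrete case, the forward algorithm listed in the tags can be sketched as follows: one HMM per driving intention, and the intention whose model best explains the observed cluster indices wins. The model parameters and the two intention names are made up for illustration (in the paper they would come from Baum-Welch training); only the recursion itself is standard.

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under one HMM.

    obs: sequence of observation indices (e.g. K-means cluster ids).
    pi: (S,) initial state distribution; A: (S, S) transitions; B: (S, O) emissions.
    """
    alpha = pi * B[:, obs[0]]
    log_lik = 0.0
    for o in obs[1:]:
        scale = alpha.sum()  # rescale at each step to avoid numerical underflow
        log_lik += np.log(scale)
        alpha = (alpha / scale) @ A * B[:, o]
    return log_lik + np.log(alpha.sum())

def classify_intention(obs, models):
    """models: dict name -> (pi, A, B). Returns the name with the highest likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
models = {
    # Observation 0 = "small lateral offset", 1 = "large lateral offset" (hypothetical clusters).
    "lane_keeping": (pi, A, np.array([[0.9, 0.1], [0.8, 0.2]])),
    "lane_change": (pi, A, np.array([[0.9, 0.1], [0.1, 0.9]])),
}
print(classify_intention([1, 1, 1], models))  # lane_change
```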

"Online Risk-Bounded Motion Planning for Autonomous Vehicles in Dynamic Environments"

  • [ 2019 ] [📝] [ 🎓 MIT ] [ 🚗 Toyota ]

  • [ intention-aware planning, manoeuvre-based motion prediction, POMDP, probabilistic risk assessment, CARLA ]


One figure:

Source.

Authors: Huang, X., Hong, S., Hofmann, A., & Williams, B.

  • One term: "Probabilistic Flow Tubes" (PFT)
    • A motion representation used in the "Motion Model Generator".
    • Instead of using hand-crafted rules for the transition model, the idea is to learn human behaviours from demonstration.
    • The inferred models are encoded with PFTs and are used to generate probabilistic predictions for both manoeuvre (long-term reasoning) and motion of the other vehicles.
    • The advantage of belief-based probabilistic planning is that it can avoid over-conservative behaviours while offering probabilistic safety guarantees.
  • Another term: "Risk-bounded POMDP Planner"
    • The uncertainty in the intention estimation is then propagated to the decision module.
    • Some notion of risk, defined as the probability of collision, is evaluated and considered when taking actions, leading to the introduction of a "chance-constrained POMDP" (CC-POMDP).
    • The online solver uses a heuristic-search algorithm, Risk-Bounded AO* (RAO*), which takes advantage of the risk estimation to prune the over-risky branches that violate the risk constraints and eventually outputs a plan with a guarantee over the probability of success.
  • One quote (this could apply to many other works):

"One possible future work is to test our work in real systems".
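
The pruning of over-risky branches can be illustrated with a toy chance constraint: accumulate the collision probability along each candidate plan and discard the plans that exceed the risk bound. This is not RAO* itself (there is no belief tree or heuristic search here): the plan names, rewards, per-step risks and the independence assumption are all hypothetical.

```python
def execution_risk(step_risks):
    """Probability of at least one collision, assuming independent per-step risks."""
    survive = 1.0
    for r in step_risks:
        survive *= 1.0 - r
    return 1.0 - survive

def best_plan_within_risk(plans, risk_bound):
    """plans: list of (name, expected_reward, step_risks).

    Returns the name of the highest-reward plan whose execution risk
    respects the bound, or None if every plan is too risky.
    """
    admissible = [(reward, name) for name, reward, risks in plans
                  if execution_risk(risks) <= risk_bound]
    return max(admissible)[1] if admissible else None

plans = [
    ("overtake", 10.0, [0.02, 0.03, 0.02]),
    ("follow", 6.0, [0.001, 0.001, 0.001]),
    ("brake", 2.0, [0.0, 0.0, 0.0]),
]
print(best_plan_within_risk(plans, risk_bound=0.05))  # follow
print(best_plan_within_risk(plans, risk_bound=0.10))  # overtake
```

Loosening the bound lets the planner accept the more rewarding but riskier manoeuvre, which is the trade-off against over-conservative behaviour mentioned above.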


"Towards Human-Like Prediction and Decision-Making for Automated Vehicles in Highway Scenarios"

  • [ 2019 ] [📝] [:octocat:] [🎞️] [ 🎓 INRIA ] [ 🚗 Toyota ]

  • [ planning-based motion prediction, manoeuvre-based motion prediction ]


Author: Sierra Gonzalez, D.

  • Prediction techniques are often classified into three types:

    • physics-based
    • manoeuvre-based (and goal-based).
    • interaction-aware
  • As I understood, the main idea here is to combine prediction techniques (and their advantages).

    • The driver-models (i.e. the reward functions previously learnt with IRL) can be used to identify the most likely, risk-averse, anticipatory manoeuvres. This is called the model-based prediction by the author since it relies on one model.
      • But relying only on driver models to predict the behaviour of surrounding traffic might fail to predict dangerous manoeuvres.
      • As stated, "the model-based method is not a reliable alternative for the short-term estimation of behaviour, since it cannot predict dangerous actions that deviate from what is encoded in the model".
      • One solution is to add a term that represents how the observed movement of the target matches a given manoeuvre.
      • In other words, to consider the noisy observation of the dynamics of the targets and include this so-called dynamic evidence into the prediction.
  • Usage:

    • The resulting approach is used in the probabilistic filtering framework to update the belief in the POMDP and in its rollout (to bias the construction of the history tree towards likely situations given the state and intention estimations of the surrounding vehicles).
    • It improves the inference of manoeuvres, reducing the rate of false positives in the detection of lane change manoeuvres, and enables the exploration of situations in which the surrounding vehicles behave dangerously (not possible if relying on safe generative models such as IDM).
  • One quote about this combination:

"This model mimics the reasoning process of human drivers: they can guess what a given vehicle is likely to do given the situation (the model-based prediction), but they closely monitor its dynamics to detect deviations from the expected behaviour".

  • One idea: use this combination for risk assessment.

    • As stated, "if the intended and expected maneuver of a vehicle do not match, the situation is classified as dangerous and an alert is triggered".
    • This is an important concept of risk assessment I could identify at IV19: a situation is dangerous if there is a discrepancy between what is expected (given the context) and what is observed.
  • One term: "Interacting Multiple Model" (IMM), used as baseline in the comparison.

    • The idea is to consider a group of motion models (e.g. lane keeping with CV, lane change with CV) and continuously estimate which of them captures more accurately the dynamics exhibited by the target.
    • The final predictions are produced as a weighted combination of the individual predictions of each filter.
    • IMM belongs to the physics-based prediction approaches and could be extended for manoeuvre inference (called dynamics matching). It is often used to maintain the beliefs and guide the observation sampling in POMDP.
    • But the issue is that IMM completely disregards the interactions between vehicles.
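
The combination of a model-based prediction with dynamic evidence can be sketched as a Bayesian update: the driver model provides a prior over manoeuvres, and the observed motion re-weights it. This is a deliberately reduced illustration, not the thesis' filter and not a full IMM (no Kalman filters, no mixing step): the function name, the Gaussian likelihood, `sigma` and the numbers are assumptions.

```python
import numpy as np

def manoeuvre_posterior(model_prior, predicted_lateral, observed_lateral, sigma=0.3):
    """Posterior over manoeuvres = model-based prior x dynamic evidence.

    model_prior: (M,) probabilities from the driver model (what is expected).
    predicted_lateral: (M,) lateral displacement each manoeuvre would produce.
    observed_lateral: measured lateral displacement (what is observed).
    """
    prior = np.asarray(model_prior, dtype=float)
    err = observed_lateral - np.asarray(predicted_lateral, dtype=float)
    likelihood = np.exp(-0.5 * (err / sigma) ** 2)  # unnormalised Gaussian evidence
    post = prior * likelihood
    return post / post.sum()

# Manoeuvres: [lane keeping, lane change]. The driver model considers a lane change
# unlikely (it would be risky), but the target has already moved 0.8 m sideways.
posterior = manoeuvre_posterior([0.9, 0.1], [0.0, 1.0], observed_lateral=0.8)
print(posterior.argmax())  # 1: the dynamic evidence overrides the model-based prior
```

A large gap between the prior and the posterior is the kind of "expected versus observed" discrepancy that the notes above associate with a dangerous situation.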

"Decision making in dynamic and interactive environments based on cognitive hierarchy theory: Formulation, solution, and application to autonomous driving"

  • [ 2019 ] [📝] [ 🎓 University of Michigan ]

  • [ level-k game theory, cognitive hierarchy theory, interaction modelling, interaction-aware decision making ]


Authors: Li, S., Li, N., Girard, A., & Kolmanovsky, I.

  • One concept: cognitive hierarchy.

    • Other drivers are assumed to follow some "cognitive behavioural models", parametrized with a so-called "cognitive level" σ.
    • The goal is to obtain and maintain belief about σ based on observation in order to optimally respond (using an MPC).
    • Three levels are considered:
      • level-0: driver that treats other vehicles on road as stationary obstacles.
      • level-1: cautious/conservative driver.
      • level-2: aggressive driver.
  • One quote about the "cognitive level" of human drivers:

"Humans are most commonly level-1 and level-2 reasoners".
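
Maintaining a belief about the cognitive level σ from observations can be sketched as a discrete Bayesian update. The action set and the per-level action probabilities below are invented for illustration; in the paper they would come from the level-k policies, and the exact update rule used by the authors may differ.

```python
import numpy as np

def update_level_belief(belief, action_probs_per_level, observed_action):
    """One Bayesian update of the belief over cognitive levels.

    belief: (L,) current probabilities of level-0, level-1, ...
    action_probs_per_level: (L, A) probability that a level-k driver takes each action.
    observed_action: index of the action the other driver was seen taking.
    """
    likelihood = np.asarray(action_probs_per_level, dtype=float)[:, observed_action]
    post = np.asarray(belief, dtype=float) * likelihood
    return post / post.sum()

# Actions: 0 = maintain, 1 = decelerate, 2 = accelerate (hypothetical).
action_probs = [
    [0.8, 0.1, 0.1],  # level-0: ignores the others and mostly maintains
    [0.2, 0.7, 0.1],  # level-1: cautious, mostly decelerates
    [0.2, 0.1, 0.7],  # level-2: aggressive, mostly accelerates
]
belief = np.full(3, 1.0 / 3.0)
for action in [2, 2]:  # the other car is seen accelerating twice
    belief = update_level_belief(belief, action_probs, action)
print(belief.argmax())  # 2
```

The ego vehicle can then plan its response against the most likely level, or against the whole belief.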

Related works:

  • Li, S., Li, N., Girard, A. & Kolmanovsky, I. [2019]. "Decision making in dynamic and interactive environments based on cognitive hierarchy theory, Bayesian inference, and predictive control" [pdf]

  • Li, N., Oyler, D., Zhang, M., Yildiz, Y., Kolmanovsky, I., & Girard, A. [2016]. "Game-theoretic modeling of driver and vehicle interactions for verification and validation of autonomous vehicle control systems" [pdf]

    • "If a driver assumes that the other drivers are level-1 and takes an action accordingly, this driver is a level-2 driver".

    • Use RL with hierarchical assignment to learn the policy:
      • First, the π-0 (for level-0) is learnt for the ego-agent.
      • Then π-1 with all the other participants following π-0.
      • Then π-2 ...
    • Action masking: "If a car in the left lane is in a parallel position, the controlled car cannot change lane to the left".
      • "The use of these hard constrains eliminates the clearly undesirable behaviours better than through penalizing them in the reward function, and also increases the learning speed during training"
  • Ren, Y., Elliott, S., Wang, Y., Yang, Y., & Zhang, W. [2019]. "How Shall I Drive ? Interaction Modeling and Motion Planning towards Empathetic and Socially-Graceful Driving" [pdf] [code]

Source.
Source.