Avoiding silent failures - stuck agents disappear between inputs and outputs #3008

Open
gac55 opened this issue Dec 21, 2023 · 4 comments

Comments

@gac55
Contributor

gac55 commented Dec 21, 2023

Sometimes agents have issues in their life (like all of us) and they don't make it to the finish line.

Often this may be a network issue where an agent gets stuck because of a congestion explosion or because a facility is not accessible to them.

In this case, these agents may not appear in the outputs. Specifically, we have agents who are in the plans.xml but not in the output_plans.xml, or output_experienced_plans.xml.

When we run analysis on the output_plans it can become evident that some agents are missing, but this is often a "silent" failure: unless you perform basic checks, you may have no reason to suspect there is an issue at all. If you are lucky, stuck agent events may trigger some warning in your pipeline and cause you to investigate further.

We are implementing some basic smoke tests that check agents exist in inputs and outputs to flag when this may be happening.
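A smoke test of this kind could be sketched in Python roughly as follows. This is only a minimal sketch, assuming the population files contain `<person id="...">` elements under a `<population>` root; the function names `person_ids` and `missing_agents` are hypothetical and not part of MATSim.

```python
import io
import xml.etree.ElementTree as ET


def person_ids(fileobj):
    """Return the set of person ids found in a population-style XML file object."""
    ids = set()
    # iterparse streams the file, so the plans of large populations
    # do not all have to be held in memory at once.
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == "person":
            ids.add(elem.get("id"))
            elem.clear()  # drop the plans of this person, only the id is needed
    return ids


def missing_agents(input_file, output_file):
    """Return the sorted ids present in the input but absent from the output."""
    return sorted(person_ids(input_file) - person_ids(output_file))


if __name__ == "__main__":
    # Tiny in-memory stand-ins for plans.xml and output_experienced_plans.xml.
    plans = io.BytesIO(
        b'<population><person id="1"><plan/></person>'
        b'<person id="2"><plan/></person></population>'
    )
    experienced = io.BytesIO(
        b'<population><person id="1"><plan/></person></population>'
    )
    print(missing_agents(plans, experienced))  # ['2']
```

For real runs, the two arguments would be files opened in binary mode, or `gzip.open(path, "rb")` if the outputs are gzipped.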

Is there a better way to guard against this? Should MATSim not output a warning or even an error in output_plans (or other output) to warn users this is occurring?

It appears that we are dropping agents from results when something has gone wrong, and this may therefore go unnoticed by many users in the current setup.

@mrieser
Contributor

mrieser commented Dec 21, 2023

output_plans.xml is the input-plans.xml used for the last iteration, so all agents should be included.

output_experienced_plans.xml contains plans reconstructed from events. If an agent gets stuck and does not produce any more events (or if an agent stays at home the whole time and does not generate any events at all), they might be missing from the experienced_plans. I would have to check what happens with agents getting stuck in the middle of their plan; I would assume that their first legs and activities are still contained in the experienced_plans, but I am not sure. They might be excluded for specific reasons, see the last point below.

> If you are lucky, stuck agent events may trigger some warning in your pipeline and cause you to investigate further.

Events are the real output of MATSim simulations. We try to make sure that all parts of MATSim create a PersonStuckEvent or VehicleAbortsEvent, at the latest at the end of an iteration, when an agent has not yet arrived at their last activity. If you observe agents that have not reached their final activity and for which no stuck event was generated, please let us know, so we can figure out which part of MATSim fails to create stuck events and fix it.
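Since events are the primary output, a pipeline check can look for stuck events directly. The following is a hedged Python sketch, not MATSim code: it assumes that a PersonStuckEvent appears in the events XML as an `<event>` element with `type="stuckAndAbort"` plus `person` and `time` attributes, which should be verified against an actual events file. The function name `stuck_persons` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: PersonStuckEvent is written with this type string.
# Check your own events file before relying on it.
STUCK_EVENT_TYPE = "stuckAndAbort"


def stuck_persons(events_file):
    """Map each stuck person id to the time (in seconds) of their first stuck event."""
    stuck = {}
    for _event, elem in ET.iterparse(events_file, events=("end",)):
        if elem.tag == "event" and elem.get("type") == STUCK_EVENT_TYPE:
            person = elem.get("person")
            if person is not None:
                # setdefault keeps the earliest entry if a person appears twice
                stuck.setdefault(person, float(elem.get("time")))
        elem.clear()  # events are independent, so each can be released at once
    return stuck


if __name__ == "__main__":
    sample = io.BytesIO(
        b'<events>'
        b'<event time="100.0" type="departure" person="a" link="1" legMode="car"/>'
        b'<event time="200.0" type="stuckAndAbort" person="a" link="1" legMode="car"/>'
        b'</events>'
    )
    print(stuck_persons(sample))  # {'a': 200.0}
```

A non-empty result does not by itself mean the run is invalid; it is a prompt to look at who got stuck, when, and why.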

> Is there a better way to guard against this? Should MATSim not output a warning or even an error in output_plans (or other output) to warn users this is occurring?

MATSim would have to distinguish between different agents. For example "bus drivers" or "train drivers" created by MATSim when simulating a public transit schedule are also agents. If they get stuck during the morning rush hour, it would be of interest. If they get aborted at time 27:00 because the simulation ends and they were one of the last departures according to the schedule and not yet delayed, it can be assumed to be okay. A passenger in such a late (but not delayed) vehicle on the other hand should receive a PersonStuckEvent, as that agent was not able to finish its plan within the simulated day.

So, often the answer is "it depends" whether a stuck agent is noteworthy or not, making a decision on what MATSim should do in that case non-obvious.

> It appears that we are dropping agents from results when something has gone wrong

I know of some institutes/groups that prefer it that way, although it can be debated. If you have agents getting stuck in the middle of the day, you might not want to include them in some analyses, e.g. daily travel time stats, as they would falsely influence the average. If you only look at trips, one could argue that the complete trips of such agents could be analyzed. So again, "it depends" on the specific analysis if stuck-agents should be fully counted, partially counted or not counted at all in some analysis.

@mrieser
Contributor

mrieser commented Dec 21, 2023

An idea I just had: could we add an attribute to the persons in experienced_plans.xml when they were not able to finish their plan, something like stuck=true? Then it would at least be "documented" and users could decide for themselves depending on their needs whether they want to include such agents or not in their analysis.
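Until something like this exists in MATSim itself, the same idea can be approximated as a post-processing step. The sketch below is an assumption-laden illustration, not a proposal for the actual file format: it writes a plain `stuck` XML attribute on each `<person>` element, which is unlikely to validate against the MATSim population DTD, and the function name `flag_stuck_persons` is hypothetical. It expects the set of stuck person ids to have been collected beforehand, for example from stuck events.

```python
import xml.etree.ElementTree as ET


def flag_stuck_persons(population_root, stuck_ids, attribute="stuck"):
    """Set attribute="true"/"false" on every <person> element, in place.

    Returns the sorted ids from `stuck_ids` that have no <person> element,
    i.e. stuck agents that are missing from the file altogether.
    """
    seen = set()
    for person in population_root.iter("person"):
        pid = person.get("id")
        seen.add(pid)
        person.set(attribute, "true" if pid in stuck_ids else "false")
    return sorted(set(stuck_ids) - seen)


if __name__ == "__main__":
    root = ET.fromstring('<population><person id="1"/><person id="2"/></population>')
    absent = flag_stuck_persons(root, {"2", "3"})
    print(absent)  # ['3']
    print(ET.tostring(root, encoding="unicode"))
```

Returning the absent ids separately keeps the two failure modes apart: agents that are flagged as stuck, and agents that never made it into the experienced plans at all.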

@kainagel
Contributor

A note from the sidelines: It is absolutely correct that stuck agents need to be excluded from the analysis. In fact, all agents that are ever stuck in (the final iteration of) any base case or policy case need to be excluded. This means in particular that, if in a policy case agents get stuck that were not stuck in other runs, the base case run to compare with needs to be re-analyzed with those agents removed.

Reason (as already said above): Assume, for example, that in the base case it is typically agents on short trips that get stuck, while in the policy case it is those on long trips. Without removing the stuck agents from the analysis on both sides, the policy case will show shorter trips than the base case, even though the agents that finish their plans have not changed their behaviour at all.
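The effect can be reproduced with a toy example. The numbers and agent ids below are invented purely for illustration, and `mean_trip_length` is a hypothetical helper: agent "s" makes short trips and gets stuck in the base case, agent "l" makes long trips and gets stuck in the policy case, and agent "m" behaves identically in both.

```python
from statistics import mean


def mean_trip_length(trips, excluded=frozenset()):
    """Mean length (km) of completed trips of all agents not in `excluded`.

    `trips` maps agent id -> list of completed trip lengths.
    """
    lengths = [d for agent, ds in trips.items() if agent not in excluded for d in ds]
    return mean(lengths)


# Completed trips per agent; stuck agents contribute no trips in that run.
base = {"m": [10.0], "l": [20.0]}   # "s" (2 km trips) got stuck
policy = {"m": [10.0], "s": [2.0]}  # "l" (20 km trips) got stuck

# Agents stuck in ANY of the compared runs must be removed from ALL of them.
stuck_any = {"s", "l"}

print(mean_trip_length(base), mean_trip_length(policy))  # 15.0 6.0
print(mean_trip_length(base, stuck_any), mean_trip_length(policy, stuck_any))  # 10.0 10.0
```

The naive comparison suggests the policy shortens trips from 15 km to 6 km, although no agent changed behaviour. After removing the union of stuck agents from both runs, the means agree.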

I try to tell VSP people that they should debug the simulations so that no agents get stuck. In particular, the "stuck because of the simulation end time" should not happen in runs-to-be-analyzed ... the mobsim should just run long enough. Unfortunately, having no stuck agents at all is difficult to achieve ... they might, e.g., miss the last bus.

Having it in the experienced plans would be a good idea. @gac55 , do you want to have a shot at it and submit a pull request? Looks (to me) like a change in EventsToLegs. One would presumably have to add a handleEvent method that listens to abort events. And then? Presumably agents get stuck on legs, so one would have to either add a dummy leg and a dummy activity, or mark the last fully executed activity as "aborted", or (my preference right now) mark the plan as "aborted", preferably with a "reason". (Which implies that the PersonStuckEvent should actually have a "reason".)

@gac55
Contributor Author

gac55 commented Dec 22, 2023

Thanks @mrieser for the considered and insightful response.

Sounds like we have a decent proposed approach. I'll do a bit more digging and testing. I like the idea of flagging it in experienced plans @kainagel @mrieser

I suspect post Christmas ;) 🎅
