Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

r5r benchmarks for new gtfs_traveltimes fn #73

Open
mpadge opened this issue Jan 27, 2021 · 11 comments
Open

r5r benchmarks for new gtfs_traveltimes fn #73

mpadge opened this issue Jan 27, 2021 · 11 comments
Labels
enhancement New feature or request must do Highest priority

Comments

@mpadge
Copy link
Member

mpadge commented Jan 27, 2021

@rafapereirabr gtfsrouter now has a new turbo-charged gtfs_traveltimes function, largely motivated by the support of the Mobility Institute Berlin, and the outstanding input of @AlexandraKapp. This issue is intended to document first comparisons against equivalent times generated from your amazing r5r package.

Disclaimer: r5r is a truly general package able to do what it says, "Realistic Routing on Real-world and Reimagined networks." gtfsrouter in comparison is just a routing package for the mode = "TRANSIT" component only, and so has comparably much more restricted functionality than r5r.

Additional notes:

  1. The r5r results in the following reprex were pre-generated using the same GTFS feed for the "Verkehrsverbund Berlin Brandenburg" (VBB), Germany. Both queries were run for a Tuesday (8th Dec 2020).
  2. gtfsrouter has a function to expand transfer tables provided with feeds, to enable more realistic, and more extended, pedestrian transfers between stops, analogous to the additional pedestrian routing offered by r5. This is implemented below in the lines starting with gtfs_transfer_table(), which has a default upper time limit of 120 seconds.
library (gtfsrouter)
packageVersion("gtfsrouter")
#>[1] '0.0.4.160'
gtfs <- extract_gtfs("vbb.zip")
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds 
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds

transfers <- gtfs_transfer_table (gtfs, network_times = FALSE)
library (dplyr)
transfers <- gtfs$transfers %>%
    select(.data$from_stop_id, .data$to_stop_id, .data$transfer_type, .data$min_transfer_time) %>%
    rbind(transfers) %>%
    group_by(.data$from_stop_id, .data$to_stop_id) %>%
    summarise(across(everything(), first)) %>% # take min_transfer_time from gtfs$transfers if present
    data.table::data.table()
gtfs$transfers <- transfers
gtfs <- gtfs_timetable(gtfs, day = "tuesday")
from <- "Berlin Hauptbahnhof"
start_time <- 8 * 3600
system.time (
    iso <- gtfs_traveltimes (gtfs, from, start_time)
    )
#>    user  system elapsed 
#>   1.001   0.003   0.942
iso$duration <- as.integer (iso$duration)

library(r5r)
path <- "<path-to-OSM-and-GTFS-data-for-Berlin>"
r5r_core <- setup_r5(data_path = path)
p <- data.frame (name = gtfs$stops$stop_name,
    id = gtfs$stops$stop_id,
    lon = gtfs$stops$stop_lon,
    lat = gtfs$stops$stop_lat)
o <- p [grep ("Berlin Hauptbannhof", p$name), ]
mode <- c ("WALK", "TRANSIT")
departure_datetime <- as.POSIXct ("2020-12-08 08:00:00",
    format = "%Y-%m-%d %H:%M:%S")
system.time (
    r5 <- travel_time_matrix (r5r_core = r5r_core,
                              origins = o,
                              destinations = p,
                              mode = mode,
                              departure_datetime = departure_datetime,
                              max_walk_dist = 500,
                              max_trip_duration = 180)
)
#>    user  system elapsed 
#>   23.847   0.161   8.005

library (dplyr)
r5 <- group_by (r5, toId) %>%
    summarise (travel_time = min (travel_time))
r5 <- r5 [which (r5$toId %in% iso$stop_id), ]
index <- match (r5$toId, iso$stop_id)
iso$r5 [index] <- r5$travel_time * 60 # r5 returns travel time in minutes
iso <- iso [which (!is.na (iso$r5)), ]
# iso has one outlier:
iso <- iso [iso$duration < max (iso$duration), ]

duration <- iso$duration
r5 <- iso$r5
par (mfrow = c (1, 2))
cols <- c ("gray70", "gray20")
plot (duration, r5, type = "p", pch = 1, col = cols [1],
      xlab = "gtfsrouter times", ylab = "r5r times")
lines (range (duration), range (duration), col = cols [2], lwd = 2)
mod <- lm (r5 ~ duration + 0)
new <- data.frame (duration = range (duration))
fit <- predict (mod, newdata = new)
lines (new$duration, fit, col = cols [2], lwd = 2, lty = 2)
legend ("topleft", col = cols [2], lwd = 2, lty = 1:2,
        legend = c ("1:1", "linear fit"), bty = "n")

tdiff <- r5 - duration
hh <- hist (tdiff / 60, breaks = 50, plot = FALSE)
plot (hh$mids, hh$density, "l", col = cols [1], lwd = 2,
      xlab = "r5 times - gtfsrouter times (minutes)",
      ylab = "density")
lines (c (0, 0), c (0, 1), col = cols [2], lwd = 2)

message ("r5 estimates take ",
         round (100 * (mod$coefficients [1] - 1), 1), "% longer")
#> r5 estimates take 13.4% longer
message ("(mean, median) difference in times = (",
         hms::hms (round (mean (tdiff))), ", ",
         hms::hms (median (tdiff)), ")")
#> (mean, median) difference in times = (00:04:46, 00:03:12)

Created on 2021-01-27 by the reprex package (v0.3.0)


Conclusions

  • gfsrouter calculates full travel times to > 13,000 stations in < 1 second, whereas r5r takes around 8 seconds to reach around 11,000 of those stations.
  • gtfsrouter generates travel times that are around 13% quicker than r5r.

Why might the travel times be so much faster? My suspicion at present would be because r5, and so r5r by direct extension, uses an implementation of the raptor algorithm . As discussed in #61, the raptor algorithm only finds routes between successive "optimal" end-points, and so excludes the possibility of the optimal route to some end-point actually requiring intermediate travel along sub-optimal routes. In contrast, the new algorithm implemented here traces all possible routes between all pairs of stops, whether optimal or not, and only determines optimal routes once the algorithm has finished tracing the entire timetable. My hypothesis would be that the differences observed above primarily reflect this difference, and particularly the inability of raptor to use sub-optimal intermediate routes.

@rafapereirabr
Copy link

@mpadge, thanks for the update on congrats on the progress with gtfs_router ! The performance is really impressive and I can’t wait to see and integration between dodgr and gtfs_router.

I think you're right, most of the difference is probably due different algorithm heuristics under the hood of r5r and gtfs_router. It is also likely that r5r searches for walking route alternatives between stops when this information is not present in gtfs$transfers, which is probably takes more time to compute than using the transfers table.

I've tried running your example on a sample GTFS feed that comes within r5r and I got this error below. I believe this error occurs because this feed does no have a transfers.txt file. It it possible to run gtfs_router on a GTFS feed without a transfers.txt file ?

library(r5r)
library(dplyr)
library(gtfsrouter)
packageVersion("gtfsrouter")
#> [1] ‘0.0.4.160’


### gtfs_router code -----------------------------

# read gtfs
gtfs <- extract_gtfs(filename = system.file("extdata/poa/poa.zip", package = "r5r") )

# build transfers table
transfers <- gtfs_transfer_table (gtfs, network_times = FALSE)

# transfers <- gtfs$transfers %>%
#   select(.data$from_stop_id, .data$to_stop_id, .data$transfer_type, .data$min_transfer_time) %>%
#   rbind(transfers) %>%
#   group_by(.data$from_stop_id, .data$to_stop_id) %>%
#   summarise(across(everything(), first)) %>% # take min_transfer_time from gtfs$transfers if present
#   data.table::data.table()
# 
# gtfs$transfers <- transfers

gtfs <- gtfs_timetable(gtfs, day = "tuesday")
from <- "AUXILIADORA SILVA JARDIM"
start_time <- 8 * 3600
system.time (
  iso <- gtfs_traveltimes (gtfs, from, start_time)
)

#> Error in rcpp_traveltimes(gtfs_cp$timetable, gtfs_cp$transfers, nrow(gtfs_cp$stop_ids),  :
#> Index out of bounds: [index='min_transfer_time'].
#> Timing stopped at: 0.02 0 0.02



### r5r code ----------------------------- 

path <- data_path <- system.file("extdata/poa", package = "r5r")
r5r_core <- setup_r5(data_path = path)
p <- data.frame (name = gtfs$stops$stop_name,
                 id = gtfs$stops$stop_id,
                 lon = gtfs$stops$stop_lon,
                 lat = gtfs$stops$stop_lat)
o <- subset(p, name =="AUXILIADORA SILVA JARDIM")
mode <- c ("WALK", "TRANSIT")
departure_datetime <- as.POSIXct ("2019-07-14 08:00:00",
                                  format = "%Y-%m-%d %H:%M:%S")
system.time (
  r5 <- travel_time_matrix (r5r_core = r5r_core,
                            origins = o,
                            destinations = p,
                            mode = mode,
                            departure_datetime = departure_datetime,
                            max_walk_dist = 500,
                            max_trip_duration = 180)
)
#>    user  system elapsed 
#>   0.71    0.00    0.16 

@mpadge
Copy link
Member Author

mpadge commented Mar 4, 2021

@rafapereirabr sorry for delayed response: No, it is not possible to run gtfs_traveltimes on a feed without a transfers.txt table. What i will do is to make sure that is caught in pre-processing, and an informative message issued to ensure a table is added via the gtfs_transfer_table() function. Next commit will do that, and close this issue. Thanks!

And by the way, the gtfs_traveltimes() function has in the meantime had considerably more updates and improvements, and should now generate even better results than those indicated above, and generally even faster too!

@mpadge mpadge closed this as completed in deef8b1 Mar 4, 2021
@rafapereirabr
Copy link

Hi @mpadge . Thanks for the update! I guess the lack of a transfers.txt files will not be so much of a problem when the power of gtfs-router and dodgr are combined. Really looking forward to this!

@luukvdmeer
Copy link

I think another factor that influences the difference in results and in computation times is that R5 is really focused on multi-modal travel where you first travel to a public transport stop before you start your public transport trip. I might be off at some points, and don't know exactly to what extent some parts of the computations are pre-computed when building the network, but the routing process of R5 should go something like this:

When you provide the coordinates of a trip origin, it will first find the nearest point on the street network and use a straight-line distance between the actual origin and the snapped point in combination with a walking speed to estimate the time it takes you to reach the street. It will always do this, even when you set mode = "TRANSIT" and don't specify any other mode. Then, it will use the given access mode (in the example also walking) to travel from the snapped location of the origin to the snapped location of nearest public transport stop. From there, it uses again a straight-line distance and walking speed to get to the actual location of the nearest public transport stop. Only then, it will start the transit routing similar to gtfsrouter.

This procedure repeats at the end of the trip. It will use a straight-line distance and walking speed to estimate the travel time between the actual public transport stop location and its snapped location on the street network, then use the given egress mode to travel to the snapped location of the trip destination, and finally use a straight-line distance and walking speed to estimate the travel time between the snapped location of the trip destination and the actual location of the trip destination.

In the benchmark, we have a case where the origin location of the trip is equal to the actual location of a public transport stop. But R5 does not recognize that and uses its normal procedure. Hence, it estimates the travel time from the trip origin to the street network, then attemps to route to the snapped location of the public transport stop (which is the same in this case, i.e. travel time of this leg equals 0), and finally estimates the travel time from the street network to the public transport stop (which in this case is the travel time from the street network back again to the origin location). This repeats at the end of the trip.

This behaviour is explained briefly in the R5 docs, see here

You place the origin directly on a transit stop with fast, frequent service, but travel times using the service show an unexpected delay. Because total travel time includes walking from the origin to the nearest street or pedestrian path in the baseline street layer, then from the street layer (back) to the transit stop, there may be an unexpectedly long access time if the stop is far from a nearby street.

@mpadge
Copy link
Member Author

mpadge commented May 17, 2021

Thanks for the input @luukvdmeer - that is in fact also what gtfsrouter does via the above code which generates extended transfer tables via the gtfs_transfer_table function. The start and end points are always assumed to be actual stop platforms, with no associated walking times, but the internal transfers may involve walks up to default of 200m, well beyond those specified in most internal GTFS transfer tables. Most importantly, those results are for the entire system, with a median of two transfers, and so these extended pedestrian transfers are an important component of final estimated travel times. It is actually the ability to walk between stops not specified as official transfers in the GTFS feed that makes an additionally important contribution to the result observed here, along with the advantage of the algorithm implemented here in comparison to the shortcomings described above of the raptor algorithm used in r5.

The one caveat with the above results is that they were generated using the default gtfs_transfer_table (..., network_times = FALSE), and so reflect physically unrealistic straight-line transfers. Using network_times = TRUE uses dodgr to generate realistic walking time estimates (including incline, waits at traffic lights, and a host of highly realistic effects on pedestrian times), and does increase the resultant travel times a small amount, but hardly effects the above results. (Median times are almost identical, while mean times are increased by around 40 seconds.)

I presume that once r5 has entered public transport mode it primarily relies on the transfer table as specified in the feed itself, and so is unable to effectively jump "out of" public transport mode to connect stations beyond the transfer table with intermediate walk stages. That's what gtfsrouter now does, and so in that sense is actually more multi-modal than r5.

@luukvdmeer
Copy link

luukvdmeer commented May 17, 2021

I tried to understand their methodology a bit better and found that R5 does actually mention they use "Walking via the street network between transit stops for transfers". I did not look at their source code to check their exact implementation, but from their documentation I get that at each stop they consider transfers to all other stops within a given walking time (default 20 minutes), constrained by a maximum walking distance of 2km. See here.

But apart from that, I was mainly trying to emphasize that even for the exact same journey between two stops, R5 might already return higher travel time estimates and have longer computation time simply because in contradiction to gtfsrouter it doesn't recognize that the start and end points of the journey are actual transport stop platforms already. Of course I am not sure for how much of the difference in times that accounts, but it might play a role in there.

@mpadge
Copy link
Member Author

mpadge commented Jul 20, 2021

Re-opened in response to the r5r issue linked to above, and in order to clearly document why r5 and gtfsrouter ought not be expected to give identical results. @luukvdmeer thanks again for your insights here - see the r5r issue for what we currently think are the underlying reasons for these observed differences.

@mpadge mpadge reopened this Jul 20, 2021
@abyrd
Copy link

abyrd commented Jul 28, 2021

Hi everyone. I just commented on the linked r5r issue at ipeaGIT/r5r#183. I understand that a lot of what you are discussing here about R5 is not well documented, so I will try to provide some clarification. I'll cover some high-level points first, and then we can go into detail if you have more specific questions.

There are two main issues here: 1) speed of execution and 2) inflation of travel times relative to gtfsrouter.

1) Speed of execution

R5 is completely multimodal (chaining walking and biking or car with transit, routing from and to arbitrary geographic locations). From an origin point, it will go to the closest street and follow that to every reachable transit station, then ride transit to every other reachable transit station, then follow the street network to all specified destination points. In everyday use, we perform searches with up to 2-3 million destination points arranged on a regular grid, which could be very distant from any transit station.

To each one of these destinations, it tries to find a large number of different travel times (generally between 200 and 1500), including waiting time, over a large number of departure times (and realizations of frequency-based schedules), in order to construct a statistical distribution of the travel times that a rider may experience under different conditions.

Constructing this large set of optimal arrival times at every destination, for every different departure time, is rather time consuming. This process (which we call "propagation", i.e. propagation of travel times out from transit stops, through the street network and "empty space" to final destinations) is usually the majority of run time.

It appears that you're benchmarking against only the public transit part of the search. Even in very large networks like that of the whole Netherlands, I would expect R5 to perform such a transit-only search in a few hundred milliseconds. That involves finding hundreds of separate paths to each of over 40k transit stops. The rest of the time is spent on propagation, managing and creating statistical summaries of the very large data tables that result. And also on the initial search through the streets, from the origin to all reachable transit stops, which can take longer than the transit search itself.

The "general case" routing methods in R5 do not have any special considerations for routing from and to transit stops (instead of arbitrary geographic points). It would be possible for a system that uses R5 (such as R5R) to call only the transit portion and bypass the rest, but I imagine that's not how it works.

Although we have spent quite a bit of time optimizing this (as it has a direct impact on our operating costs and ability to provide results with rapid turnaround) speed is not our highest priority. We are more focused on the stability and maintainability of the code, and the ability of future maintainers to comprehend it.

2) Inflation of travel times

As we discussed in our recent call with @mpadge, a good comparison here is going to require a clear and rigorous definition of what is meant by "optimal", ensuring both systems are seeking the same results. After reading your issue on the r5r repository describing the kind of solutions gtfsrouter seeks, I believe that r5's output for a given origin-destination pair is a superset of that produced by gtfsrouter: the result gtfsrouter is looking for can be selected from among the N results produced by r5.

At the lowest level R5 is optimizing arrival time given a single departure time; at a higher level it is optimizing end-to-end travel time for departure at any time within a window. Or more accurately, it it it is finding a distribution of travel times, including all waiting time, for uniformly distributed departure times within a given window, and you can choose the shortest one or any other other one you like.

However, it looks like you're benchmarking R5 via R5R, and I do not know how R5R is using R5, or what results it is selecting from R5's output. The output will depend heavily on exactly which routing method is called in R5 and how its output is interpreted. We'll need to examine that more closely. My sense is that R5R is using R5 in a specific, constrained way, and we just need to better document the results of those choices. Note that I did not work on R5R and have only briefly looked at its source code so this will require a bit of time and coordination with its developers.

@abyrd
Copy link

abyrd commented Jul 28, 2021

@luukvdmeer wrote:

I tried to understand their methodology a bit better and found that R5 does actually mention they use "Walking via the street network between transit stops for transfers".

Correct, in normal usage we pre-compute transfers from each stop to all other stops within a certain radius. (Some simplification is applied there, eliminating transfers that are not expected to ever contribute to an optimal path.) These transfers are stored in a table though, not recreated at each call, so this process does not have any speed impact at runtime.

Unlike some other data formats exported from scheduling software, in GTFS the transfers table usually only states which transfers are preferred, constrained, or forbidden. All "normal" transfers are expected to be inferred from the street network (or failing that, straight line distance). We typically use networks built from OpenStreetMap, which in many places include detailed pedestrian walkways and micro-mapped train stations.

But apart from that, I was mainly trying to emphasize that even for the exact same journey between two stops, R5 might already return higher travel time estimates and have longer computation time simply because in contradiction to gtfsrouter it doesn't recognize that the start and end points of the journey are actual transport stop platforms already.

This is correct. It would be possible to directly call the transit router which thinks in terms of stop platforms, but it would be a different code path. In the case where you are focused only on transit stops, the "general case" code path does incur a lot of overhead. But in our standard use case (urban planning analysis), the origin and destination points are homes, workplaces, public facilities, or sample points representing small areas. The special case where you are only interested in the transit stops themselves is not common. Indeed we adopt a pretty strong position that transportation infrastructure should be evaluated in terms of its impact on people's access to opportunities, which are mostly situated outside the infrastructure.

@abyrd
Copy link

abyrd commented Jul 28, 2021

@luukvdmeer wrote:

...it estimates the travel time from the trip origin to the street network, then attemps to route to the snapped location of the public transport stop (which is the same in this case, i.e. travel time of this leg equals 0), and finally estimates the travel time from the street network to the public transport stop (which in this case is the travel time from the street network back again to the origin location). This repeats at the end of the trip.

This is correct, passing via the street network will increase the travel time a bit. The documentation you cited was just to confirm the existence of this discrepancy and explain it, not to justify it - clearly it is not ideal, and we eventually want to introduce a special case for connecting the origin point directly to a transit station. But as mentioned in another comment, most of our use cases do not involve routing from stations, so we want to carefully consider impacts on code complexity or interference with other functionality.

Besides the increase in reported travel times, also note the significant impact on execution time. This initial search does not only go to the nearest station, but through the street network to all other reachable transit stations. Unlike the transfers, this street search is not precomputed and not cached - this takes a significant amount of time on every call. It wouldn't be surprising if this initial street search takes significantly longer than the transit search itself.

@mpadge
Copy link
Member Author

mpadge commented Jul 30, 2021

Thank you very much @abyrd for these detailed and helpful considerations. I'll respond in more detail in a couple of weeks (currently on leave), and feed a lot of that back in to documentation, including detailed comparisons of these kinds of differences between r5 and gtfsrouter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request must do Highest priority
Projects
None yet
Development

No branches or pull requests

4 participants