Public Transport Statistics

Collecting traffic of realtime public transport APIs and GTFS-RT feeds to calculate statistics about delays, cancellations, etc. of public transport, with a special focus on delay distributions (delay histograms) and conditional probabilities, in order to answer questions like "Given this train is currently delayed by 20 minutes and is thus projected to depart in 50 minutes from now, what departure delay (i.e. what delay distribution) will it finally have?"

Access the results

(Currently Germany (DB) only; Belgium, France, Switzerland, Netherlands TBD)

Data sources

  • GTFS and GTFS-RT feeds of Belgium, France, Germany, Switzerland and the Netherlands (in most cases partial coverage, to be investigated); the archived data since 03/2023 is available at http://mirror.traines.eu (about 70 GB per month).
  • If you know of further GTFS-RT feeds with at least country-level coverage in Europe (I know of Norway's), let me know.
  • Traffic of realtime public transport APIs (from TSTP among others). For access to the raw data since 10/2022 (about 20 GB per month), please get in touch.
  • Official DB Haltestellendaten and other data sources for station mapping.

Core principle

Example for one train leaving one station

Sample #                  scheduled_time    projected_time    delay_minutes  sample_time       ttl_minutes
Sample 1                  2023-01-01 17:10  2023-01-01 17:10  NULL           2023-01-01 16:00  70
Sample 2                  2023-01-01 17:10  2023-01-01 17:11  1              2023-01-01 17:00  11
Sample 3 (latest_sample)  2023-01-01 17:10  2023-01-01 17:27  17             2023-01-01 17:25  2
Sample 4                  2023-01-01 17:10  2023-01-01 17:10  NULL           2023-01-01 17:40  -30

We get multiple "samples", i.e. records of the status at different points in time, for one arrival/departure of a train/bus/etc. at a station, e.g. ICE 100 scheduled to leave Berlin Hbf at 2023-01-01 17:10 (scheduled_time). At first (Sample 1), we may not get any live data, because the train has not yet departed from its origin station (delay_minutes is NULL). As we get closer to departure, i.e. as ttl_minutes (the difference between sample_time and projected_time) diminishes, at some point we should get live data (Samples 2 and 3, delay_minutes not NULL). The delay will change over time. After the train has departed, many providers delete the live data for that departure (Sample 4, delay_minutes is NULL again).

We qualify as "latest sample" the last sample (latest sample_time) that still has live data (delay_minutes set or trip cancelled, Sample 3 here). This is taken to be the actual final delay of this train at this station. The accuracy of that assumption depends on how close to the actual departure the sample was taken (ttl_minutes). More on that below (latest_sample_ttl_bucket).
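
For illustration, a minimal sketch (not the actual refresh logic, which lives in the procedure described further below) of how such latest samples could be selected from the sample table:

-- Sketch only: pick, per trip-station-scheduled_time-is_departure combination,
-- the most recent sample that still carries live data or a cancellation flag.
SELECT DISTINCT ON (trip_id, station_id, scheduled_time, is_departure) *
FROM ${schema}.sample
WHERE delay_minutes IS NOT NULL OR cancelled
ORDER BY trip_id, station_id, scheduled_time, is_departure, sample_time DESC;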

Core tables/views

sample

The table where all samples are recorded. Many fields are based on FPTF (Friendly Public Transport Format), which is also the intermediate format for ingestion.

  • id: autoincrement id
  • scheduled_time: FPTF plannedWhen/plannedArrival/plannedDeparture, time when arrival/departure was originally scheduled
  • scheduled_duration_minutes: How long this connection takes to the next stop according to schedule (not always available).
  • projected_duration_minutes: How long this connection takes to the next stop according to the current prediction (not always available).
  • delay_minutes: NULL when no realtime data is available or when cancelled. Realtime data usually gets nulled by the source system a few minutes after the actual arrival/departure. Negative when too early.
  • cancelled: Either this stop or the entire trip was cancelled.
  • sample_time: When this sample was taken, i.e. when the data contained in this row was current.
  • ttl_minutes: Difference between sample_time and projected_time. Positive when the arrival/departure was in the future at sample time, negative when it was in the past.
  • trip_id: FPTF tripId
  • line_name: FPTF line.name
  • line_fahrtnr: FPTF line.fahrtNr
  • product_type_id: FK product_type, FPTF line.product
  • product_name: FPTF line.productName
  • station_id: EVA number
  • operator_id: FK operator
  • is_departure: TRUE for a departure, FALSE for an arrival.
  • remarks_hash: FK remarks, FPTF remarks.
  • stop_number: Can be used to indicate how many stops came before this stop on this trip.
  • (destination_provenance_id: Destination if is_departure, provenance if NOT is_departure.)
  • (scheduled_platform: FPTF plannedPlatform)
  • (projected_platform: FPTF platform)
  • load_factor_id: FK load_factor, FPTF loadFactor
  • response_id: FK response_log
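
As long as raw samples are still retained (by default they are cleaned up, see sample_histogram below), they can be queried directly. A hypothetical exploratory query, not part of the pipeline, computing the average departure delay per product type from samples taken shortly after the projected departure:

-- Hypothetical example: average departure delay per product type, using only
-- samples taken within 10 minutes after the projected departure.
SELECT product_type_id,
       ROUND(AVG(delay_minutes), 1) AS avg_delay_minutes,
       COUNT(*) AS sample_count
FROM ${schema}.sample
WHERE is_departure
AND delay_minutes IS NOT NULL
AND ttl_minutes BETWEEN -10 AND 0
GROUP BY product_type_id
ORDER BY avg_delay_minutes DESC;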

sample_histogram

First aggregation step, this is basically an n-dimensional histogram (n being the number of dimension columns) that you can sum over to get coarser stats (but be careful how, see queries below). In order to fight the curse of dimensionality, different incremental aggregations are created (sample_histogram_by_day, sample_histogram_by_hour, sample_histogram_by_station). By default, the full sample table and sample_histogram are not kept anymore to save storage!

  • scheduled_time by hour
  • year, month, day, day_of_week, hour: extracted from scheduled_time in GMT
  • line_name
  • product_type_id
  • station_id
  • operator_id
  • is_departure
  • prior_ttl_bucket: time from sample_time until the projected_time implied by the prior delay; NULL when there is no prior delay (i.e. the actual final delay distribution).
  • prior_delay_bucket: prior delay, NULL without prior delay or for stops not yet having live data.
  • latest_sample_ttl_bucket: Indicator for accuracy of latest_sample_delay. Hopefully never NULL (INNER JOIN)
  • latest_sample_delay_bucket: when prior_delay_bucket set: relative delay, else absolute delay. NULL only when stop cancelled. '(,)' when stop cancelled but substitution trip running.
  • sample_count: number of samples falling in that bucket.
  • total_sample_count: sum of sample_count grouped by all fields except operator_id and latest_sample_delay_bucket. Do not sum over this!

Buckets use the Postgres range type, with the familiar mathematical interval notation like [-10,10).
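
A quick illustration of this notation and of the containment operators that appear in the queries below (plain Postgres, nothing project-specific):

-- @> tests element-in-range, <@ tests range-contained-in-range.
SELECT 5 <@ '[-10,10)'::int4range AS element_in_bucket,                       -- true
       '[0,5)'::int4range <@ '(,15]'::int4range AS bucket_in_coarser_range;  -- true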

Revisiting the question from above: given that a train is currently delayed by prior_delay_bucket minutes and is thus projected to depart in prior_ttl_bucket minutes from now, what departure delay (i.e. what delay distribution latest_sample_delay_bucket) will it finally have? "Finally" is defined by latest_sample_ttl_bucket, since we do not actually have final delay times. As such, we take the "latest sample" for each trip-station-scheduled_time-is_departure combination we have and record the duration from its sample_time to the final projected_time. It is advisable to restrict latest_sample_ttl_bucket to something like (-10,10] (see the sketch after this list) to avoid skewing the delay distribution, for multiple reasons:

  • if the delay is sampled too early before actual departure/arrival, delays might be underestimated (I think, because delays tend to increase more than decrease during the course of a trip)
  • if the delay is sampled long before the actual departure/arrival, when live data is not yet available for most trips, cancelled trips are overrepresented (since cancellations are usually known earlier, and only trips with live data or the cancelled flag set are kept as "latest sample")
  • if the delay is sampled too far after the actual departure/arrival, cancelled trips are overrepresented, since for some providers (e.g. Deutsche Bahn), delays get nulled after a couple of minutes and as such these samples don't qualify as "latest sample", but the cancelled ones do indefinitely.
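
As a minimal sketch of such a restriction (using the (-10,10] suggestion above), the unconditional arrival delay distribution could be queried like this:

-- Sketch: unconditional arrival delay distribution, keeping only rows whose
-- latest sample was taken within roughly 10 minutes of the projected arrival.
SELECT latest_sample_delay_bucket, SUM(sample_count) AS sample_count
FROM ${schema}.sample_histogram
WHERE prior_ttl_bucket IS NULL AND NOT is_departure
AND latest_sample_ttl_bucket <@ '(-10,10]'::int4range
GROUP BY latest_sample_delay_bucket
ORDER BY latest_sample_delay_bucket;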

For more insight on latest_sample_ttl_bucket, check out the "Realtime data deletion" panel in the "Ops" dashboard.

Core queries

Filtering latest_samples and first aggregation to sample_histogram

See procedure refresh_histograms_and_cleanup_samples(). This procedure will be triggered automatically when ingesting. The procedure refresh_histograms_aggregations() can be triggered manually to update the aggregations sample_histogram_without_time and sample_histogram_by_month that are used by the dashboard for faster querying.
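
Assuming these are regular Postgres procedures under the same schema placeholder as in the queries below, a manual refresh would look roughly like this:

-- Manual invocation sketch; the first procedure normally runs automatically on ingest.
CALL ${schema}.refresh_histograms_and_cleanup_samples();
CALL ${schema}.refresh_histograms_aggregations();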

Since we do not actually know the final delay of a trip at a station, we record the latest samples for each trip-station-scheduled_time-is_departure combination and take that as ground truth. We can later restrict our statistics to latest samples that were taken not too far from final departure/arrival (cf. latest_sample_ttl_bucket).

For cancelled trips, we detect, based on the recorded remarks, whether a substitute trip is running, i.e. whether, from a traveler's perspective, the trip is not actually cancelled.
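
Based on the latest_sample_delay_bucket semantics described above (NULL = cancelled, '(,)' = cancelled but substitute running), a rough breakdown could be sketched like this (not one of the dashboard queries):

-- Sketch: departures split into running / cancelled with substitute / cancelled.
SELECT CASE
         WHEN latest_sample_delay_bucket IS NULL THEN 'cancelled'
         WHEN latest_sample_delay_bucket::text = '(,)' THEN 'cancelled, substitute running'
         ELSE 'running'
       END AS status,
       SUM(sample_count) AS sample_count
FROM ${schema}.sample_histogram
WHERE prior_ttl_bucket IS NULL AND is_departure
GROUP BY 1
ORDER BY 2 DESC;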

Dashboard: Relative histogram with prior_delay_bucket and prior_ttl_bucket by product_type

Also see the dashboard and inspect its panels.

-- Relative (conditional) delay distribution of departures for the
-- dashboard-selected prior delay/ttl ($... are dashboard variables).
SELECT CASE WHEN l.latest_sample_delay_bucket IS NULL THEN 'cancelled'
            ELSE l.latest_sample_delay_bucket::text END AS label,
       (sample_count/SUM(sample_count) OVER ()) AS percent_of_departures
FROM (
    SELECT latest_sample_delay_bucket, SUM(sample_count) AS sample_count
    FROM ${schema}.sample_histogram
    NATURAL JOIN ${schema}.product_type p
    WHERE prior_ttl_bucket = '$prior_ttl_bucket'::int4range
    AND (CASE WHEN '$prior_delay_bucket' = 'Unknown' THEN prior_delay_bucket IS NULL ELSE prior_delay_bucket::text = '$prior_delay_bucket' END)
    AND is_departure AND latest_sample_ttl_bucket <@ '$latest_sample_ttl_bucket'::int4range
    GROUP BY latest_sample_delay_bucket
) AS s
-- FULL OUTER JOIN so that empty buckets and the 'cancelled' (NULL) row still appear
FULL OUTER JOIN (SELECT DISTINCT latest_sample_delay_bucket FROM ${schema}.sample_histogram WHERE latest_sample_delay_bucket IS NOT NULL) AS l
    ON l.latest_sample_delay_bucket = s.latest_sample_delay_bucket
ORDER BY l.latest_sample_delay_bucket

Dashboard: Absolute histogram by operator

-- Absolute arrival-delay distribution for one operator (no prior-delay condition),
-- excluding buckets lying entirely below -5 minutes.
SELECT CASE WHEN l.latest_sample_delay_bucket IS NULL THEN 'cancelled'
            ELSE l.latest_sample_delay_bucket::text END AS label,
       (sample_count/SUM(sample_count) OVER ()) AS percent_of_arrivals
FROM (
    SELECT latest_sample_delay_bucket, SUM(sample_count) AS sample_count
    FROM ${schema}.sample_histogram
    NATURAL JOIN ${schema}.operator o
    WHERE prior_ttl_bucket IS NULL AND NOT is_departure
    AND latest_sample_ttl_bucket <@ '$latest_sample_ttl_bucket'::int4range AND o.id = '$operator'
    GROUP BY latest_sample_delay_bucket
) AS s
FULL OUTER JOIN (SELECT DISTINCT latest_sample_delay_bucket FROM ${schema}.sample_histogram WHERE latest_sample_delay_bucket IS NOT NULL) AS l
    ON l.latest_sample_delay_bucket = s.latest_sample_delay_bucket
WHERE NOT l.latest_sample_delay_bucket <@ '(,-5)'::int4range OR l.latest_sample_delay_bucket IS NULL
ORDER BY l.latest_sample_delay_bucket

Dashboard: Sanity check with official statistics

SELECT CONCAT(s.year::text, '-', s.month::text, ' ', od.category),
od.delay_percentage_5min,
ROUND(SUM(CASE WHEN latest_sample_delay_bucket <@ '(,5]'::int4range THEN sample_count ELSE 0 END)/SUM(sample_count)*100, 1) AS estimated_percentage_5min,
od.delay_percentage_15min,
ROUND(SUM(CASE WHEN latest_sample_delay_bucket <@ '(,15]'::int4range THEN sample_count ELSE 0 END)/SUM(sample_count)*100, 1) AS estimated_percentage_15min,
SUM(sample_count) AS sample_count
FROM (
	SELECT year, month, operator_id, latest_sample_delay_bucket, SUM(sample_count) as sample_count
	FROM db.sample_histogram
	NATURAL JOIN db.product_type
	WHERE prior_ttl_bucket IS NULL AND NOT is_departure AND latest_sample_ttl_bucket <@ '$latest_sample_ttl_bucket'::int4range
	AND latest_sample_delay_bucket IS NOT NULL AND latest_sample_delay_bucket::text != '(,)'
	GROUP BY year, month, operator_id, latest_sample_delay_bucket
) AS s
NATURAL JOIN db.operator o
JOIN db.official_delay_stats_operators oo ON oo.operator = o.id
JOIN db.official_delay_stats od ON od.category = oo.category AND od.year = s.year AND od.month = s.month
GROUP BY s.year, s.month, od.category, od.delay_percentage_5min, od.delay_percentage_15min

If you find more efficient or simpler variants of these queries (that are still (or more?) correct), let me know!

Related work
