Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exploratory data analysis for reports site data model: Vendor information & organizations/services/datasets #194

Open
lauriemerrell opened this issue Oct 4, 2022 · 3 comments
Assignees

Comments

@lauriemerrell
Copy link
Contributor

Before tackling #181 and #193, we need to investigate the state of the Airtable data model with respect to the fields that we want to incorporate into the reports site to get a sense for data completeness, up-to-dateness, and any risks of fanout (many-to-many relationships).

Specifically, we want to investigate:

  • What % of services have GTFS vendor information specified? (involves traversing the dim_services, dim_service_components, and dim_components tables)
  • What distinct products are listed for GTFS generation? Do those products all seem reasonable? (Perhaps can run them by folks within Cal-ITP)
  • Spot checking – for services with no vendor, is that because they do it in house? Or because data is missing?
  • How many many-to-many relationships are there between services & GTFS datasets, i.e., cases where one service is split across 2 datasets or one dataset contains multiple services? How many of these latter cases exist where the services in question are associated with different organizations?

Submitting this ticket in the reports repo because it is directly associated with planned feature & data development on the reports site, but these questions are probably of broader interest as well.

@edasmalchi edasmalchi self-assigned this Feb 21, 2023
@edasmalchi
Copy link
Member

self-assigned per convo today w/ @tiffanychu90, going to prioritize this since it's been blocking us for a while

@edasmalchi
Copy link
Member

edasmalchi commented Mar 10, 2023

What distinct products are listed for GTFS generation? Do those products all seem reasonable? (Perhaps can run them by folks within Cal-ITP)

It appears that there are multiple component values relevant in some way to either the publishing of GTFS or GTFS-RT. Right now the table is set to manually code them in order to support a simplified presentation of schedule and RT vendors on the reports site. For these components, the listed products appear reasonable. For the site, I propose reporting and organizing on the vendor level rather than the product level, with the exception of in-house and pending product values.

        CASE
            -- manually coded to relevance to GTFS Schedule and RT
            WHEN component_name IN ('GTFS generation', 'Scheduling (Fixed-route)')
                THEN 'schedule_vendors'
            WHEN component_name IN ('Real-time info', 'Arrival predictions',
             'GTFS-rt vehicles/trips', 'GTFS Alerts Publication')
                THEN 'rt_vendors'
        END AS reports_vendor_type,

185 total organizations in March reports (date_start = '2023-02-01')

8 are missing at least one schedule vendor (or pending/in-house)
89 are missing at least one RT vendor (or pending/in-house)

  • of these, 26 are listed as has_rt, i.e. we have an RT feed from them but no realtime vendor info

In other words, we're missing schedule vendor info for about 4% of total orgs, and RT vendor info for 26% of orgs that have RT.

SQL for organizations with neither RT nor Schedule vendors:

SELECT * FROM
`cal-itp-data-infra-staging.eric_mart_gtfs_quality.idx_monthly_reports_site`
LEFT JOIN
`cal-itp-data-infra-staging.eric_mart_gtfs_quality.fct_monthly_reports_site_organization_gtfs_vendors`
USING (organization_name, organization_source_record_id, organization_itp_id, date_start)
WHERE date_start = '2023-02-01'
--GROUP BY rt_vendors IS NULL
AND (ARRAY_LENGTH(rt_vendors) = 0 OR rt_vendors IS NULL)
AND (ARRAY_LENGTH(schedule_vendors) = 0 OR schedule_vendors IS NULL)
LIMIT 1000

Screenshot 2023-03-10 at 13 09 34

There appears to be no product/component data in Airtable for these. Since in-house is a value entered for some other service components, we shouldn't assume that its done in-house. The long term path would be for the Transit Data Quality team to connect with these agencies and determine how to correctly track.

How many many-to-many relationships are there between services & GTFS datasets, i.e., cases where one service is split across 2 datasets or one dataset contains multiple services? How many of these latter cases exist where the services in question are associated with different organizations?

Proposing we put this on hold for now since service definitions may be changing soon. This situation does appear rare, and by keeping vendor information fairly general and aggregated at the organization level I think we can move forward.

Also see warehouse work at cal-itp/data-infra#2374

More thoughts to come on how to operationalize.

@edasmalchi
Copy link
Member

#181 contemplates using vendor information to filter the monthly index on the reports site -- I think if an organization has multiple vendors then filtering the index to any combination of those vendors should show a link to their report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants