Incorporate Santa Clara County social distancing protocol business database #23

Open
7 of 12 tasks
1ec5 opened this issue Sep 13, 2020 · 36 comments

Comments

@1ec5
Member

1ec5 commented Sep 13, 2020

We should incorporate Santa Clara County Social Distancing Protocol data into a community asset map and ultimately into the larger OpenStreetMap database.

Background

Since the COVID-19 pandemic began, most point of interest data in OSM in the South Bay has been at risk of going stale due to temporary or permanent closures or changes in opening hours or services. In #21, we attempted to put together a spreadsheet of open businesses based on business association listings, but this listing is skewed toward certain kinds of businesses, and the copyright situation is unclear (or at least not clear enough to rely on in OSM).

The Santa Clara County Public Health Department has created a listing of businesses and institutions that have submitted social distancing protocols for approval. At the time of writing, the listing includes 29,324 establishments. These are the businesses and institutions most likely to be open during the COVID-19 pandemic.

Unfortunately, the county hasn’t published a structured dataset corresponding to this listing. Moreover, the listing is geared towards checking for compliance and isn’t particularly usable by consumers as a business directory: it allows searching by business name and city or filtering by category, but there’s no way to limit search results by proximity or get directions.

Rationale

The 2020 National Day of Civic Hacking included a call for community asset mapping. We brainstormed several ideas before settling on the social distancing protocol listing as something that would make a government dataset significantly more accessible to the general public while avoiding overlap with projects such as Bay Area Community Resources.

The short-term goal is to process the listings into a mappable format and display the data directly on an asset map. People need to know which nearby businesses they can safely patronize and which brick-and-mortar community services are currently available.

The long-term goal is to add these businesses and institutions to OSM along with some COVID-19-specific tagging. This would help to jumpstart OSM’s local efforts to update POIs post-lockdown. It would also enable projects such as Bay Area Community Resources to use OSM as one source for POI data or at least have more confidence in OSM as its basemap. Both projects would make this data more accessible and usable to the general public than the current listing.

Implementation details

We expect this listing to grow significantly over time, so it’s important to take an automated, repeatable approach.

The social distancing protocol site provides only unstructured, inconsistently formatted addresses, so we’ll need to use a geocoder to convert the addresses to coordinates to make them mappable. An open-source geocoder would be preferable to a proprietary one, because we expect this data to eventually go into OSM. The import in #4 adds addresses but only in San José, whereas the county data is countywide. So we’ll need to use the county master address file. We only need to set up the geocoder on a local machine for one-off batch geocoding tasks, but eventually we may want to set up something on a server for future projects.
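
A rough sketch of what a batch geocoding helper against a local Pelias instance could look like. It assumes the instance exposes the standard `/v1/search` endpoint at `http://localhost:4000` (an assumption about the local setup, not something we have settled on); the function names and the sample address are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a local Pelias API is listening here; adjust to the actual setup.
PELIAS_URL = "http://localhost:4000/v1/search"


def build_search_url(address, size=5, base_url=PELIAS_URL):
    """Build a Pelias forward-geocoding URL for a free-form address string."""
    query = urllib.parse.urlencode({"text": address, "size": size})
    return f"{base_url}?{query}"


def best_coordinate(response):
    """Return (lon, lat) of the first feature in a Pelias GeoJSON response, or None."""
    features = response.get("features", [])
    if not features:
        return None
    lon, lat = features[0]["geometry"]["coordinates"][:2]
    return lon, lat


def geocode(address):
    """Query Pelias over HTTP and return (lon, lat) or None. Needs a running instance."""
    with urllib.request.urlopen(build_search_url(address)) as resp:
        return best_coordinate(json.load(resp))
```

Keeping the URL building and response parsing separate from the HTTP call makes it possible to test the parsing against saved responses without a running geocoder.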

The site also links each business to an electronically completed PDF for details about its social distancing protocol. It’s feasible but inconvenient to scrape these PDFs, so we’re going to ignore them for now. Unfortunately, it means we won’t be able to automatically clarify the businesses in the “Other” category.

When it comes time to add the businesses to OSM, we could set up a MapRoulette challenge that asks the mapper to identify the shop inside the building using aerial and street-level imagery. We won’t want to blindly add every result en masse, because we’re concerned that some of the listings may be home-based businesses – identifying signage will be key.

Tasks

To make the asset map:

  • Scrape the Santa Clara County Social Distancing Protocol site @stgibson
    • The first page of each category for now
    • Account for paginated results
  • Set up a geocoder such as Pelias on a local machine @impiaaa
    • Load the county master address file into Pelias
  • Convert the scraped addresses to coordinates and stick them in a spreadsheet
  • Set up urlwatch to get notified when the listings are updated
  • Display the businesses and institutions on a map using a tool like Mapbox Sheet Mapper or more manually via GeoJSON
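
For the "more manually via GeoJSON" route, a minimal sketch of turning geocoded rows into a GeoJSON FeatureCollection. The row keys (`name`, `category`, `lon`, `lat`) and the sample rows are hypothetical; the real spreadsheet columns may differ.

```python
import json


def rows_to_geojson(rows):
    """Convert geocoded rows into a GeoJSON FeatureCollection.

    Rows without coordinates are skipped because they cannot be mapped.
    """
    features = []
    for row in rows:
        if row.get("lon") is None or row.get("lat") is None:
            continue
        # Everything except the coordinates becomes a feature property.
        properties = {k: v for k, v in row.items() if k not in ("lon", "lat")}
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [row["lon"], row["lat"]]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical rows for illustration only.
rows = [
    {"name": "Example Cafe", "category": "Restaurant", "lon": -121.89, "lat": 37.33},
    {"name": "Example Consulting", "category": "Other", "lon": None, "lat": None},
]
print(json.dumps(rows_to_geojson(rows), indent=2))
```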

To get the data into OSM:

  • Write and submit an import proposal @1ec5
  • Set up a MapRoulette challenge with one task per business
  • Complete the MapRoulette challenge
  • Follow up on any tasks that are too difficult due to missing imagery

Additional notes

This brainstorming document turned up several other datasets worth scraping and getting into Bay Area Community Resources or OSM.

@1ec5
Member Author

1ec5 commented Sep 13, 2020

Pelias customized to import Santa Clara County addresses courtesy of @impiaaa: https://github.com/codeforsanjose/pelias-project-scc/

@1ec5
Member Author

1ec5 commented Sep 18, 2020

Scraper and scraped data courtesy of @stgibson: https://github.com/stgibson/social_distance_web_scraping/

@1ec5
Member Author

1ec5 commented Sep 18, 2020

Scraped data geocoded by Pelias: socialdistance.geojson.zip

county_dots

downtown_dots

@1ec5
Member Author

1ec5 commented Sep 18, 2020

At tonight’s hack night, @impiaaa, Kevin, and I discussed next steps for this project. Having scraped and geocoded the data once, we need to massage it and figure out the logistics of entering it into OSM. Some discussion points from tonight:

Contact information and geocoding

  • Pelias had trouble geocoding addresses containing units and unit designators, because the county address database lacks unit designators, though it does have a dedicated column for units. After a geocoding request fails, try replacing any unit designator in the address with a #. We could also implement heuristics around the structured address that comes out of Pelias’ parser.
  • What if Pelias returns multiple results? We could try skipping listings based on the number of results or confidence, or if the parse tree doesn’t match the input address, but we should gather statistics on how many listings would get dropped to understand the impact of those heuristics on overall coverage.
  • Pelias geocoded “No physical address” to a therapist in San Francisco named “No” or something like that. We should be able to exclude these businesses entirely. Fortunately, it looks like every form that checked “No Business Facility” on the PDF form ended up listing “No physical address” on the site, even if a mailing address was provided.
  • Phone numbers are likely just phone numbers of owners or social distancing managers, not the customer number, so we need to omit the phone numbers for privacy reasons. These forms are public and it would’ve been nice to give this county one of the highest concentrations of phone numbers in OSM, but we don’t want to have customers calling an employee on their personal cell phone just because they filled out this form.
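
The geocoding heuristics above could be sketched roughly as follows. The list of unit designators, the confidence threshold, and the result limit are placeholders to tune once we have statistics; `geocode` stands for any hypothetical callable that returns the list of features from a Pelias response.

```python
import re

# Assumption: these are the unit designators worth normalizing; extend as needed.
# A street literally named e.g. "Building Way" would be mangled, so spot-check results.
UNIT_DESIGNATOR = re.compile(
    r"\b(?:suite|ste|unit|apt|apartment|room|rm|bldg|building)\b\.?\s*#?\s*",
    re.IGNORECASE,
)


def has_physical_address(address):
    """Listings marked 'No physical address' should be excluded entirely."""
    return address.strip().lower() != "no physical address"


def replace_unit_designator(address):
    """Replace 'Suite 200', 'Unit #5', etc. with '#200', '#5'."""
    return UNIT_DESIGNATOR.sub("#", address)


def pick_result(features, min_confidence=0.8, max_results=1):
    """Return the single trusted feature, or None if the response is empty, ambiguous, or weak."""
    if not features or len(features) > max_results:
        return None
    top = features[0]
    if top["properties"].get("confidence", 0) < min_confidence:
        return None
    return top


def geocode_with_fallback(address, geocode):
    """Try the raw address first, then retry with unit designators replaced by '#'."""
    if not has_physical_address(address):
        return None
    features = geocode(address)
    if not features:
        features = geocode(replace_unit_designator(address))
    return pick_result(features)


def coverage(responses, **thresholds):
    """Count how many responses would be kept vs. dropped under the given thresholds."""
    kept = sum(1 for features in responses if pick_result(features, **thresholds) is not None)
    return kept, len(responses) - kept
```

Running `coverage` over saved responses with a few different thresholds would show how many listings each heuristic drops before we commit to one.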

Scraping

  • We should rescrape the site to get the URLs to the PDF forms. Even if we don’t bother to scrape the PDFs themselves, we can include a link to the PDF from within the MapRoulette per-task instructions to make the tasks more efficient. The URLs also seem to include a numeric identifier; not sure if that identifier is unique or stable (in the event that a business resubmits a replacement form).
  • For the “Other, please specify” category, we’ll need mappers to refer to the PDF for the more detailed description of the business in order to know which preset to use.
  • We could either have mappers skip the “Other” category entirely and ask them to categorize businesses in a separate pass, or we could scrape the PDFs ahead of time and crowdsource tagging suggestions in a spreadsheet.
  • The PDFs lend themselves to scraping, because they’re filled AcroForms whose data appears in a distinct section of the file. For example, I ran this form through textutil -convert txt and found text like this:
    347 0 obj
    <</Type/StructElem/S/Span/K 56/Alt(Checked)/P 117 0 R/Pg 4 0 R>>
    endobj
    348 0 obj
    <</Type/StructElem/S/Span/K 57/Alt(Not Applicable)/P 117 0 R/Pg 4 0 R>>
    endobj
    349 0 obj
    <</Type/StructElem/S/Span/K 58/Alt(Park Holiday Apartments, Western Management, LLC)/P 117 0 R/Pg 4 0 R>>
    endobj
    
  • When we rescrape the site, we should also include the “Date of Protocol Submission”, because it could help us detect changes to the PDF that aren’t apparent from diffing the HTML site.
  • MapRoulette prefers a stable, unique identifier for each feature, to deduplicate the features across updates and keep tasks from getting uncompleted. If the ID in the PDF URL isn’t stable and unique, we could try hashing the name and address.
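
A minimal sketch of the two identifier options discussed above: pulling the numeric ID out of the PDF URL, and falling back to a hash of the normalized name and address. The function names, the whitespace/case normalization, and the 16-character truncation are assumptions for illustration; the URL pattern is inferred from the PDF links on the site.

```python
import hashlib
import re

# Matches the numeric prefix in URLs like .../sdpdocs/<digits>-SocialDistancingProtocolForm.pdf
PDF_ID = re.compile(r"/sdpdocs/(\d+)-")


def url_id(pdf_url):
    """Return the numeric identifier embedded in the PDF URL, or None if absent."""
    match = PDF_ID.search(pdf_url)
    return match.group(1) if match else None


def _normalize(text):
    """Lowercase and collapse whitespace so trivial spelling differences hash the same."""
    return " ".join(text.lower().split())


def hashed_id(name, address):
    """Derive an identifier from the name and address, for use if the URL ID proves unstable."""
    key = f"{_normalize(name)}|{_normalize(address)}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
```

The hash only survives cosmetic differences in case and spacing; a business that respells its name or address on a replacement form would still get a new ID.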

Tagging

  • Not every submission includes the d/b/a (“Fictitious Business Name”), so the mapper will have to clean up legal names by hand, removing things like “LLC” and “Inc.”
  • For those that do include a d/b/a, the legal name goes in official_name, operator, or owner.
  • At a minimum, we should use this opportunity to tag the fact that these businesses are open during the COVID-19 pandemic – the very reason they filed these protocols with the county. opening_hours:covid19=open appears to be the most appropriate tag for that. access:covid19=yes is for businesses without physical reception, which maps to the “No physical address” designation that we aren’t mapping.
  • None of the checkboxes on the form neatly matches the delivery:covid19, drive_through:covid19, or takeaway:covid19 key.
  • The form has multiple checkboxes about hand sanitizer, but only the “Hand sanitizer and/or soap and water are available at or near the site entrance…” checkbox applies to the safety:hand_sanitizer:covid19 key. If the form doesn’t check “Handwashing and other sanitary facilities are operational and stocked at all times,” we’re unsure how to tag that. The other checkboxes apply to staff, which doesn’t have an established tagging scheme.
  • The form has multiple checkboxes about face masks, but we probably won’t bother requiring mappers to tag safety:mask:covid19, because that rule applies throughout the county.
  • The form distinguishes between “Maximum number of personnel” and “Maximum number of customers/members of public”, whereas the established capacity:covid19 key doesn’t distinguish. Based on popular subkeys of capacity:*, we could have mappers tag capacity:staff:covid19, capacity:customers:covid19, and capacity:covid19 as the sum of the other two.
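
To illustrate the capacity tagging idea in the last bullet, a small sketch that builds the three tags from the two numbers on the form. The function and parameter names are hypothetical, and the `capacity:staff:covid19` and `capacity:customers:covid19` subkeys are only the proposal from this discussion, not established tagging.

```python
def capacity_tags(max_personnel, max_customers):
    """Build capacity tags from the form's two maximums; either may be None if left blank."""
    tags = {}
    if max_personnel is not None:
        tags["capacity:staff:covid19"] = str(max_personnel)
    if max_customers is not None:
        tags["capacity:customers:covid19"] = str(max_customers)
    # Only emit the combined tag when both parts are known, so it is always a true sum.
    if max_personnel is not None and max_customers is not None:
        tags["capacity:covid19"] = str(max_personnel + max_customers)
    return tags


print(capacity_tags(4, 10))
```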

Mapping

  • We don’t want mappers to map home offices and the like. That can be difficult to tell from the protocol form, but we’re hoping that the “No physical address” designation is a good determiner.
  • Ideally, mappers would use street-level imagery to avoid mapping businesses at homes that lack business signage. However, Bing Streetside imagery is too old, while Mapillary and OpenStreetCam don’t have comprehensive coverage throughout the county. So this is an unresolved problem for now.
  • 19,000 tiny changesets would be very annoying. MapRoulette has an option to batch multiple tasks together in one changeset, but the per-task instructions box only shows the instructions for the first selected task. A workaround would be for the per-task instructions to not embed any feature properties but instead refer the mapper to the interactive map or GeoJSON table.

@1ec5 1ec5 changed the title from “Santa Clara County social distancing protocol” to “Incorporate Santa Clara County social distancing protocol business database” Sep 19, 2020
@1ec5
Member Author

1ec5 commented Oct 3, 2020

Ideally, mappers would use street-level imagery to avoid mapping businesses at homes that lack business signage. However, Bing Streetside imagery is too old, while Mapillary and OpenStreetCam don’t have comprehensive coverage throughout the county. So this is an unresolved problem for now.

At last night’s hack night, @impiaaa and I focused on possible solutions to this problem, as well as the related problem of pinpointing a business within a strip mall or professional center:

  • If a POI geocodes to within a landuse=commercial or landuse=retail, it’s more likely to be mappable than one that geocodes to within a landuse=residential. Besides OSM landuse areas, we could make use of VTA’s landuse planning dataset, which is available as a MapServer in state plane.
  • The POI’s distance to another POI could prove useful. We took a Geofabrik extract of Northern California and used osmium to filter it down to POIs in Santa Clara County, making it small enough to load in QGIS to calculate POI distances.
  • As a last resort, if a mapper can verify that the business is in a complex (based on monument signage) but can’t pinpoint the unit within the complex, they could temporarily follow the practice of commercial maps by scattering the POIs within the parking lot. It isn’t ideal to have POIs in a parking lot; however, the alternative of picking a random location in the strip mall building could mislead users as to the precision of the POI’s location.
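
We did the POI distance calculation in QGIS, but the same signal could be computed in a script. This is a brute-force sketch using the haversine formula, not what we actually ran; the function names are made up, and for tens of thousands of points a spatial index would be a better fit.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_poi_distance(point, pois):
    """Distance in meters from a geocoded (lon, lat) point to the closest existing POI."""
    return min(haversine_m(point[0], point[1], lon, lat) for lon, lat in pois)
```

A geocoded business far from any existing POI and inside a residential landuse area would be a lower-priority candidate for mapping.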

At a glance, Mapillary coverage in the South Bay doesn’t look as bad as we had presumed, considering that most of the businesses would be in business districts or along arterial streets rather than in residential areas. But we do need more thorough coverage of office parks. Some areas like Milpitas, Berryessa, and South San José also have very little coverage.

Some next steps:

@1ec5 1ec5 pinned this issue Oct 12, 2020
@1ec5 1ec5 mentioned this issue Oct 14, 2020
7 tasks
@1ec5
Member Author

1ec5 commented Oct 16, 2020

This fork of the scraper has continuing work including some tweaks to work better with the geocoder.

@impiaaa
Collaborator

impiaaa commented Oct 16, 2020

  • Suggest OSM tags for business categories
    • amenity, shop, office, craft, leisure, tourism, healthcare, industrial, public_transit
  • Make MapRoulette instructions
    • skip phone number, unless 800 or extension
    • use name2 over name1
    • verification strategies:
      • geocoded address matches what's in the form
      • street-level
      • business's website
      • if it's in a business district (OSM landuse, zoning, aerial)
      • @1ec5 is looking into Mapillary involvement
  • Get the relevant code in the codeforsanjose GitHub organization
  • Make a list of businesses at the same location, to prioritize imagery collection
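
The phone and name rules in the list above could be applied when generating the task properties. A sketch, assuming `name1` and `name2` are the scraper's two name fields and that phone numbers arrive as free-form strings; treating the other toll-free prefixes like 800 is my own extension of the rule, and the extension pattern is a guess at how extensions are written.

```python
import re

# Assumption: other North American toll-free prefixes are as safe to publish as 800.
TOLL_FREE_PREFIXES = ("800", "833", "844", "855", "866", "877", "888")
EXTENSION = re.compile(r"(?:ext\.?|x)\s*\d+", re.IGNORECASE)


def publishable_phone(phone):
    """Return the phone number only if it is toll-free or has an extension, else None."""
    if not phone:
        return None
    if EXTENSION.search(phone):
        return phone
    digits = re.sub(r"\D", "", phone)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if digits[:3] in TOLL_FREE_PREFIXES:
        return phone
    return None


def display_name(name1, name2):
    """Prefer name2 when it is filled in, otherwise fall back to name1."""
    return (name2 or "").strip() or (name1 or "").strip()
```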

@1ec5
Member Author

1ec5 commented Oct 16, 2020

  • Some addresses lie outside of the county (San Francisco, Sacramento). We’ll keep them, but they have inaccurate distances to the nearest businesses, which will affect prioritization in MapRoulette.
  • Some Construction entries are construction companies. The capacity listed on the form may be for the construction site, not the office.
  • The “Home Cooking Services” category seems to consist entirely of caterers, restaurants, and bakeries that have had to change their business model temporarily during the pandemic. Not sure of a tag specifically for services that cook for the homebound, but for now craft=caterer seems apt.
  • Likely tags and presets by category
  • The Other category will require a lot of tagging discussion. We can proactively scrape the PDFs of these listings and crowdsource the tagging choices in a spreadsheet.
  • The “Alternative Non-hotel Guest Accommodations” category includes Airbnbs, which shouldn’t be mapped unless they happen to also be established bed-and-breakfasts and the like.

@1ec5
Member Author

1ec5 commented Oct 30, 2020

Tier 3 revision

CDPH moved Santa Clara County to Tier 3 (Orange, Moderate) on October 13. The county public health department issued a revised order that required every business to complete a revised social distancing protocol form within 14 days. The revised form looks very similar to the previous revision from September.

The SDP business database was last updated on October 12. The COVID19Prepared.org site took down its link to the business database around that time, leaving this note:

Customers and the general public are encouraged to view the list of businesses that have submitted their Revised Social Distancing Protocol to help ensure our community is prepared to operate safely. A listing of businesses that have completed their Revised Social Distancing Protocols is coming soon.

Assuming the SDP site does start updating again soon, it doesn’t make sense to go forward with the current business listing. If for some reason the site doesn’t start updating, we may need to get in touch with TSS to ask for access to the raw dataset.

In the meantime, this delay gives us time to take care of other remaining tasks:

  • Write an import proposal that incorporates a finalized tag mapping.
  • Post the proposal to the wiki and announce it on the imports mailing list.
  • Come up with a solution for efficiently tagging the thousands of “Other” POIs based on the freeform business type description (assuming that freeform option remains on the new form).

Wrangling the “Other” category

We expect the current database to overlap considerably with the revised database, but there will probably be businesses that spell their names, addresses, or “Other” description slightly differently from one form to another.

Unless the SDP site starts listing the business type description in plain text, we’ll need to scrape the linked PDFs for that information. Over 5,700 entries may be too many to tag by hand, especially if we need to keep tagging more by hand as the database updates. One possible solution might involve training a Bayesian classifier on the text, labeling them with presets.
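
To make the classifier idea concrete, here is a toy multinomial naive Bayes classifier with Laplace smoothing, written with the standard library only. The training descriptions and the tag labels are invented for illustration and are not tagging decisions; a real run would train on descriptions we have already tagged by hand.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.label_counts = Counter()            # label -> number of training descriptions
        self.vocabulary = set()

    def train(self, text, label):
        self.label_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        """Return the label with the highest log posterior for the given description."""
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Laplace (add-one) smoothing so unseen words do not zero out the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-labeled examples.
training = [
    ("nail supply", "shop=hairdresser_supply"),
    ("beauty supply store", "shop=hairdresser_supply"),
    ("law office", "office=lawyer"),
    ("attorney office", "office=lawyer"),
    ("tax preparation", "office=accountant"),
    ("accounting office", "office=accountant"),
]
classifier = NaiveBayes()
for description, label in training:
    classifier.train(description, label)

print(classifier.classify("Nail supply store"))
```

In practice we would probably only accept suggestions above some score margin and leave the rest for manual review.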

@1ec5
Member Author

1ec5 commented Nov 9, 2020

Assuming the SDP site does start updating again soon, it doesn’t make sense to go forward with the current business listing. If for some reason the site doesn’t start updating, we may need to get in touch with TSS to ask for access to the raw dataset.

The SDP site is updating again, with the latest entries from November 5. The site currently lists 21,536 entries across the same categories as before. It no longer includes submissions of the previous revision of the form.

@1ec5
Member Author

1ec5 commented Nov 9, 2020

@1ec5
Member Author

1ec5 commented Nov 13, 2020

@impiaaa, Lindsay, and I met tonight to discuss the state of the project:

Tier 2?

There are rumors that Santa Clara County may soon move back to Tier 2 (red), as other nearby counties have, just a few weeks after moving to Tier 3 (orange) and right after the SDP website got back up and running. Tier 2 allows only essential businesses to stay open, so we’re unsure what that means for the SDP database: will they keep collecting SDPs from nonessential businesses in anticipation of an eventual transition back to Tier 3, remove submissions from nonessential businesses, or stop updating or advertising the site? I’m hopeful we won’t completely repeat the database reset from last month, because the county hasn’t issued a new public health order ahead of any tier change like last time, and the revised SDP form seems to be tier-agnostic. (It no longer asks the submitter for any hard numbers around capacity.)

Given this uncertainty, we could replace opening_hours:covid19=open with a less direct tag like opening_hours:covid19:conditional=open @ (cdph:tier=3) that would be more resilient to tier changes over the next several months. But that would be complicated if mappers have to copy the suggested tags from MapRoulette instructions. We should avoid making mappers add raw tags in iD if possible, because that can be error-prone, time-consuming, and unfriendly to new mappers.

More likely, we’d avoid making any representations about a business’s opening status. After all, OSM probably already has POIs that haven’t opened since the initial stay-at-home order began. It means we wouldn’t be able to facilitate a COVID-19-specific application, but we’d still accomplish the larger goal of jumpstarting OSM’s POI coverage in the area.

Tooling

The main downside to MapRoulette is that it doesn’t prepopulate the point feature and its tags in iD, since this import requires much more manual intervention than a collaborative mapping challenge. We can make sure iD opens up to maximum zoom level 19, which is good enough to easily distinguish standalone businesses, but it would be pretty ambiguous in a strip mall or downtown area.

RapiD could be a good alternative to MapRoulette for our use case, as long as there’s a way for the user to not only accept a feature but also change its feature type and change its tags before saving. Ideally we could get permission to add our own challenge to this tasking manager instance, then periodically upload GeoJSON data from the SDP database. Otherwise, we’d have to reach out to the RapiD team about loading our data. There is a new Esri ArcGIS integration, but it would be rather indirect for us since our dataset isn’t in ArcGIS to begin with.

Proposal process

We aren’t sure about the county’s status come Tuesday and are still deciding between MapRoulette and RapiD, so we need to wait until at least early next week before posting a request for comments about this import proposal on the imports mailing list.

The proposal needs a few tweaks:

  • @impiaaa will come up with a more accurate count of importable POIs that excludes non-physical businesses. Duplicates due to replacement protocols are OK, because they’re a signal for us to revisit a POI that might have already been imported.
  • Mappers are primarily responsible for verifying the geocoded coordinate for a given business and cleaning up name and address tags. The geocoded coordinates can be quite far off in some cases, even in the wrong city – the mapper needs to skip these tasks so we can follow up later. Refining the location, for example to find the unit within a strip mall, is somewhat less important and can be time-consuming.

The request for comments needs to emphasize that this is a labor-intensive organized editing project originating from an external dataset, not a conventional automated import, but we’re going to adhere to some of the import guidelines anyways as a courtesy. We won’t ask participants to use dedicated import accounts, because that overhead would discourage participation while not really making the mapped features easier to identify and roll back.

Time and people

The other day, I did some back-of-the-napkin math to estimate how long this import would take:

There are currently 21,536 items on the SDP site, apparently rising daily. Of these entries, 18,008 are in taggable categories (that is, not “Other”), and an unknown number lack a physical address. I’m going to assume all 18k have physical addresses, which can account for growth over the next few weeks. If we manage to attract 10 participants mapping for 10 hours a week (2 hours every weekday) and get the average time to map a business down to 1 minute, we can finish importing what's on the site so far in 3 weeks.

That’s the optimistic scenario.

That “Other” category is 16% of the database and we’ll need to come up w/ a wide variety of tags for the businesses in there. I mean, I’m tempted to write a bot that spams the tagging list every morning w/ a business-of-the-day post. https://saesdp.sccgov.org/sdpdocs/2848313-SocialDistancingProtocolForm.pdf is “ADMINISTRATIVE OFFICE FOR WHOLESALE DISTRIBUTOR OF MOTORCYCLE PARTS”. 😵

If we get 10 people to dedicate themselves to nothing but tagging decisions, we could take care of those 3,528 “Other” businesses in 6 hours.
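
The napkin math above, written out so the assumptions are easy to vary. The entry counts are from the SDP site as quoted; the participation and per-task figures are the optimistic assumptions, and the function names are just for illustration.

```python
total_entries = 21_536
taggable = 18_008
other = total_entries - taggable  # entries in the "Other" category


def weeks_to_finish(tasks, participants, hours_per_week, minutes_per_task):
    """Weeks needed if every participant maps steadily at the given pace."""
    tasks_per_week = participants * hours_per_week * 60 / minutes_per_task
    return tasks / tasks_per_week


def hours_to_finish(tasks, participants, minutes_per_task):
    """Hours per person needed to work through the tasks in parallel."""
    return tasks * minutes_per_task / 60 / participants


print(f"Other entries: {other} ({other / total_entries:.0%} of the database)")
print(f"Taggable categories: {weeks_to_finish(taggable, 10, 10, 1):.1f} weeks")
print(f"Other category: {hours_to_finish(other, 10, 1):.1f} hours each")
```

With these inputs the script reports 3,528 “Other” entries (16%), about 3.0 weeks for the taggable categories, and about 5.9 hours per person for the “Other” category. The estimate scales linearly with the time per task, so doubling the per-task time doubles both figures.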

To get the average time per task down to a minute, we can encourage mappers to only map the businesses as point features and not areas. As much as possible, we’re trying to avoid making mappers trawl through street-level imagery, but it might occasionally be necessary to choose the right unit in a strip mall or avoid mapping a home office. Focusing each challenge on a single category and providing crisp instructions will go a long way too.

To get the necessary level of participation, we’ll recruit mappers among Code for San José volunteers who haven’t been attending the OSM map nights. We’ll also recruit among the broader OSM community. As far as I can tell, this import will be just the third POI import in the U.S., after the nationwide GNIS import and a POI import in Puerto Rico. I’m hopeful that the import’s novelty will attract non-local mappers who wouldn’t be interested in a run-of-the-mill building import.

I had originally calculated the required time thinking that we’d try to complete the import before the county leaves Tier 3 and the SDP database gets reset again. But the possibility of going back to Tier 2 so soon changes the calculus: if we don’t map anything time-sensitive like opening_hours:covid19 and capacity, then it doesn’t matter what tier we’re in.

@1ec5
Member Author

1ec5 commented Nov 17, 2020

The county moved to Tier 1 (purple) today. This poster explains the impact on social distancing protocols:

Social Distancing Protocol requirements: All businesses must complete and submit a Revised Social Distancing Protocol for each of their facilities on the County’s website at COVID19Prepared.org. Social Distancing Protocols submitted prior to October 11, 2020 are no longer valid. The Revised Social Distancing Protocols must be filled out using an updated template for the Social Distancing Protocol at COVID19Prepared.org.

SDPs prior to October 11 have already been removed from the SDP site. This wording makes it sound unlikely that the SDP site would be taken offline, but it means today is probably the high water mark for the site in terms of new submissions.

@1ec5
Member Author

1ec5 commented Nov 17, 2020

@impiaaa and I found a reliable way to grab the “Other, please specify” business type description from each PDF’s headers:

$ curl -sI https://saesdp.sccgov.org/sdpdocs/2841699-SocialDistancingProtocolForm.pdf | grep 'x-ms-meta-typeofbusinessother' | sed 's/^.*: //' | atob
Nail supply

This could save us the trouble and time of downloading the whole PDF for the “Other, please specify” category. However, we were also looking to have mappers consult the “Facility/Worksite visited by public” checkbox in the PDF to avoid mapping businesses that aren’t open to the public. It is possible to extract this information from the PDF automatically, but to avoid excessive requests and processing time, perhaps we could limit it to certain categories we’re particularly concerned about (like professional services, but not restaurants).
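
For batch use, the same header lookup could be scripted. This sketch assumes the header value is Base64-encoded UTF-8 text, which is what the `atob` step in the command above implies; the function names are hypothetical, and `fetch_business_type` makes one HEAD request per PDF, so it should be rate-limited.

```python
import base64
import urllib.request

HEADER = "x-ms-meta-typeofbusinessother"


def decode_business_type(header_value):
    """Decode the Base64 header value into the free-form business type description."""
    return base64.b64decode(header_value.strip()).decode("utf-8")


def fetch_business_type(pdf_url):
    """Send a HEAD request for the PDF and return the decoded description, or None if absent."""
    request = urllib.request.Request(pdf_url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        value = response.headers.get(HEADER)
    return decode_business_type(value) if value else None
```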

@1ec5
Member Author

1ec5 commented Nov 18, 2020

challenge_geojson.zip as of November 16
Business type descriptions as of November 16

@1ec5
Member Author

1ec5 commented Nov 18, 2020

Some outstanding tasks, in no particular order:

The more I spot-check the SDPs we’ve downloaded, the less confidence I have in the “Facility/Worksite visited by public” checkbox. Even if it’s accurate, there are plenty of cases where “No” is an appropriate response for a non-retail site that nonetheless should be mapped. At most, it would be just one signal alongside the reference zoning polygons, but that makes parsing the downloaded PDFs a lower priority.

@1ec5
Member Author

1ec5 commented Nov 20, 2020

I sent a request for comments to the talk-us-sfbay, imports-us, and imports mailing lists. (It’s probably stuck in the imports list’s moderation queue.) I also mentioned the request for comments in the #imports channel of OSMUS Slack. We can continue to refine the proposal on the wiki in the coming days based on feedback that we receive. I’m hoping we can move forward in about a week’s time, in time to do some armchair mapping over the Thanksgiving weekend. Thanks to @impiaaa and Lindsay for workshopping the request for comments this evening.

@1ec5
Member Author

1ec5 commented Nov 27, 2020

The MapRoulette project is now live with an initial batch of 49 challenges. Challenges with 500 or more tasks are hidden for now until we get a chance to see how smoothly we can get through the smaller challenges.

mr_task

@1ec5
Member Author

1ec5 commented Dec 4, 2020

Wednesday night, @frhino invited me to present the import at Code for San Francisco’s general hack night meeting. CfSF has been spearheading the Bay Area Brigades’ COVID-19 pandemic dashboard project. This import can complement the dashboard as another area for cross-bay collaboration.

Josh graciously offered to pair on the MapRoulette workflow before sharing it with the rest of the brigade. Unluckily, we ran into the Recreation challenge, which turns out to be mostly composed of nondescript offices of recreation organizations. I’ve changed that challenge’s difficulty level to Expert to steer new mappers away from it.

@1ec5
Member Author

1ec5 commented Dec 4, 2020

Last night, @impiaaa, Kevin, Lindsay, and I met to take stock of the import a week into it:

Promotion

With help from some friends and acquaintances, we’ve been spreading the word about the import in various places, including but not limited to:

As time goes on, we’ll have to keep being creative and possibly revisit some of these communication channels to keep up the momentum.

Progress

For the first week of the import, we had enabled only the 49 smaller challenges in case any adverse feedback came through the mailing lists. Measuring progress is a bit tricky because MapRoulette normally excludes both completed and undiscoverable challenges, so it was showing the project 5% completed. Including both completed and undiscoverable challenges, we were at a little over 200 of 17,441 tasks, or 1%.

Even several days after weeklyOSM mentioned the import proposal, no feedback came in, so we’re more or less in the clear as far as the import guidelines are concerned. After the meeting, we enabled the remaining challenges except for the Construction challenge. That brings our progress back down to 1%, but it’s more accurate that way, and hopefully people will find the new categories like Restaurant and Retail to be more interesting to map.

Hiccups

The Construction challenge remains undiscoverable, because most of the submissions in that challenge appear to be minor work sites (like reconfiguring interior walls at an office building), not the sort of thing we’d map as construction in OSM.

Lindsay got unlucky working on the Grocery Stores and Pharmacy challenges due to poor geocoding or inadequate street-level imagery resolution. We ended up changing the difficulty level of the Grocery Stores challenge to Expert due to the prevalence of these issues. The Pharmacy challenge was already well on its way to completion, so Lindsay finished the job, other than a couple extra-tough cases.

@impiaaa and I differ on what to do about businesses in strip malls or office buildings, where it isn’t immediately feasible to determine which corner of the building the business occupies. We could either mark such businesses as Too Hard for now and wait to survey them in person, or we could place a point randomly within the building, perhaps with a fixme tag to indicate an approximate location. We’ll have more concrete cases to consider as people dive into the newly discoverable challenges, but it’s possible that our approach could depend on the situation: Too Hard for a strip mall with per-store entrances but a random point in the building for an office building with a central entrance.

Time management

MapRoulette currently reports an average time per task of 6 minutes, 14 seconds. That’s far, far above the back-of-the-napkin assumptions in #23 (comment). However, this metric includes situations where a mapper has gotten carried away doing legitimate mapping around the POI, as well as when a mapper forgets to unlock a task after getting distracted by something else. The average has been trending down, so it also probably reflects some initial feeling-around as we got used to the workflow. We’ll keep an eye on the metric, but the most important thing at this point is to bring more contributors into the project.
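
For a rough sense of scale, this sketch multiplies the current average by the task count quoted earlier in this comment. Both inputs will change as the average trends down and as rescrapes add tasks, so the result is only a ballpark.

```python
# Figures quoted in this comment; both are expected to drift over time.
AVG_SECONDS_PER_TASK = 6 * 60 + 14  # 6 minutes, 14 seconds
TOTAL_TASKS = 17_441


def total_hours(tasks, seconds_per_task):
    """Total effort in hours if every task took the average time."""
    return tasks * seconds_per_task / 3600


estimate = total_hours(TOTAL_TASKS, AVG_SECONDS_PER_TASK)
print(f"{estimate:,.0f} mapper-hours at the current average")
```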

1ec5 commented Dec 8, 2020

On Friday, we figured out why many of the addresses got geocoded way out in Sacramento County or San Benito County (example): the Pelias instance was getting confused by Santa Clara (city) and Santa Clara County sharing the same name. It’s similarly very difficult to search for addresses along El Camino Real in Santa Clara (city) in Nominatim. @impiaaa fixed the issue in Pelias by renaming the county from Santa Clara to Santa Clara County in the Who’s On First file and loading OpenAddresses.
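
One cheap way to catch this kind of misgeocoding before tasks are uploaded is to flag points that land outside the county. The sketch below uses a bounding box whose coordinates are rough, assumed values (not taken from the scraper), and the task records are invented examples; a real check would use the actual county boundary polygon.

```python
# Approximate bounding box for Santa Clara County: (south, west, north, east).
# These numbers are an assumption for illustration, not an official extent.
SCC_BBOX = (36.89, -122.21, 37.49, -121.20)


def outside_county(lat, lon, bbox=SCC_BBOX):
    """Return True if the point falls outside the rough county bounding box."""
    south, west, north, east = bbox
    return not (south <= lat <= north and west <= lon <= east)


# Hypothetical geocoder results.
tasks = [
    {"address": "example A", "lat": 37.35, "lon": -121.95},  # near Santa Clara (city)
    {"address": "example B", "lat": 38.58, "lon": -121.49},  # near Sacramento
]

suspect = [t for t in tasks if outside_county(t["lat"], t["lon"])]
print([t["address"] for t in suspect])
```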

On Saturday, @impiaaa rescraped the site and reuploaded all the tasks. We’re up to 23,004 tasks total.

1ec5 commented Dec 15, 2020

We don’t want mappers to map home offices and the like. That can be difficult to tell from the protocol form, but we’re hoping that the “No physical address” designation is a good determiner.

The more I spot-check the SDPs we’ve downloaded, the less confidence I have in the “Facility/Worksite visited by public” checkbox. Even if it’s accurate, there are plenty of cases where “No” is an appropriate response for a non-retail site that nonetheless should be mapped. At most, it would be just one signal alongside the reference zoning polygons, but that makes parsing the downloaded PDFs a lower priority.

The “visited by public” checkbox sometimes helps, but it’s pretty unreliable because business owners are also unclear on its meaning. We’ve only mapped about 2% of the SDPs so far, but we’ve already encountered plenty of cases that have forced us to consider the privacy of private residences:

  • The “non-hotel accommodations” category almost entirely consisted of Airbnbs, which technically are businesses located at the houses being rented out. So far, I’ve avoided mapping them, whereas a traditional bed-and-breakfast would clearly be mappable. Aside from any privacy considerations, an Airbnb listing can come and go too fluidly to be reliably mapped in OSM.
  • The “private transportation” category included a large number of taxi drivers’ and ride-sharing drivers’ houses, since most are classified as independent contractors in California. I’ve avoided mapping them, because all we know is that the house has a home office, which isn’t mappable per se.
  • The “other non-restaurant food facilities” category includes some pastry chefs apparently operating out of their home kitchens. I’ve mapped at least one whose website displays their address and even embeds Google Maps so that customers can pick up cakes from their house. Between the submitted form and the Google Maps embed, it’s clear that the owner wants the general public to know where they’re located. (Don’t worry, we're relying on public domain sources for the address and geocoding.)
  • The “religious institutions” category includes house churches and Chabad houses. I’ve omitted the ones whose websites say “Contact us for our address”, but I’ve mapped the ones whose websites list their addresses for all to see.
  • About half of the “childcare” category consists of home-based childcare services, which are an alternative to commercial daycare centers. This is a tough one, because the home-based childcare services usually rely on word of mouth or offline advertising. In my personal experience, if a home-based childcare service has no signage, it may only be because one doesn’t become a customer simply by walking up to the house with a child in tow, but that’s not to say the business doesn’t want to be listed or doesn’t welcome new customers who call ahead. If OSM systematically excludes home-based childcare, then our local coverage of childcare would be heavily biased towards the affluent neighborhoods that are served by commercial daycare centers.

I think our decisions so far are roughly in line with the OSM community consensus as expressed by this summary. Protecting privacy is important to us, as is on-the-ground verifiability to some extent. When in doubt, we’ve deferred the task for later review. Depending on the circumstances, we may want to contact some of these businesses to determine their expectations around being listed.

1ec5 commented Feb 5, 2021

As of December 21, we reached 4% across all challenges, including 43% of high-priority tasks, 7% of medium-priority tasks, and 2% of low-priority tasks:

Screenshot-2020-12-21 MapRoulette

There have been cases where both the SDP and sign outside the building had the wrong address.

On December 24, I added a section to the detailed instructions document explaining how to configure iD to show the Santa Clara County parcel layer as a background layer to more easily associate addresses in SDPs with buildings in OSM:

https://webgis.sccgov.org/gis/rest/services/property/SCCProperty2/MapServer/export?bbox={bbox}&bboxSR={proj}&size={width},{height}&format=png&transparent=false&f=image
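
For anyone curious what the placeholders in that URL stand for, here is a small sketch that fills them in the way an editor would for a single image request. The bounding box, projection string, and image size are invented values, and the exact strings iD substitutes have not been checked here.

```python
TEMPLATE = (
    "https://webgis.sccgov.org/gis/rest/services/property/SCCProperty2/"
    "MapServer/export?bbox={bbox}&bboxSR={proj}&size={width},{height}"
    "&format=png&transparent=false&f=image"
)


def fill_template(template, bbox, proj, width, height):
    """Substitute the {bbox}, {proj}, {width} and {height} placeholders."""
    return template.format(
        bbox=",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        proj=proj,
        width=width,
        height=height,
    )


# Hypothetical request: the coordinates and projection are illustrative only.
url = fill_template(
    TEMPLATE,
    bbox=(-13570000, 4480000, -13569000, 4481000),
    proj="EPSG:3857",
    width=256,
    height=256,
)
print(url)
```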

We finished half the Religious Institutions challenge by December 27 and finished the nursing home challenge on January 2 (thanks Will!). Camille and @sutter-dave joined us on January 7 to help with the POI import and introduce us to Apogee as a possible tool for future imports.

As of January 12, we finished two-thirds of the high-priority tasks, enough for the time series chart to show some movement:

Screenshot-2021-1-12 MapRoulette

We finished half of the laundromat/dry cleaning challenge by January 15. Unfortunately, around this time we discovered that a MapRoulette user unfamiliar with OSM editing had been completing tasks incorrectly; their edits had to be reverted and 12 tasks reset in the banks challenge.

On January 22, @impiaaa reran the scraper, pulling in lots of new tasks that set our completion rate back to less than 4%. On the bright side, the update brought in improvements to geocoding, due in part to the new addresses we’ve been adding as part of the POI import. Additionally, the priorities have changed so that outlying, typically poorly geocoded tasks no longer stubbornly show up any time you try to get a random task.

We refinished the pharmacies challenge on January 25 and got gas stations back up to halfway on January 31. As of February 3, we’re about 6% complete, having fully recovered from the latest update from the SDP website.

1ec5 commented Feb 5, 2021

This import is one of the more extensive projects on the MapRoulette platform. The site has been serving us well, but certain things like gathering statistics do take a bit longer, understandable considering the large number of challenges and the sheer size of some of those challenges.

Unfortunately, MapRoulette has been experiencing performance problems and the team is considering making some changes that will adversely impact the import. maproulette/maproulette3#1536 would limit the number of challenges per project, and maproulette/maproulette3#1535 would limit the number of tasks per challenge. Taken together, these changes would force us to split the import project into several projects, possibly arbitrarily, making it more difficult for us to gauge our progress, attract and onboard new mappers, ensure equitable coverage throughout the county, and manage synchronization with the SDP database.

If these changes go into effect as planned, we may need to consider an alternative platform for the import. We don’t have great options. to-fix is unmaintained, Sophox Editor is offline, the OSMUS Tasking Manager is ill-suited to microtasking, and RapiD only integrates with ArcGIS services (which would make rescrapes impractical).

If we stick with MapRoulette, adhering to the new caps would mean splitting apart the larger challenges like Retail and Other into dozens of challenges. How would we split the challenges? If we split them by ZIP code or city, certain areas will inevitably enjoy more attention than others. But anything more arbitrary would prevent us from consolidating tasks into bulk changesets.

mvexel commented Feb 5, 2021

The caps are still being discussed (see linked tickets above) and having community input like yours is very valuable to us. We do need to strike a balance between performance and flexibility, and are trying to determine what that right balance is.

I would hate for y'all to move away from MapRoulette because of this, if you find the platform otherwise useful. I'll have a chat with @1ec5 to learn more about the way you use MapRoulette.

1ec5 commented Feb 8, 2021

Thanks so much for reaching out, @mvexel! MapRoulette has been key to this import project – #23 (comment) shows that there’s really no alternative that matches MapRoulette in ease of use when the data source can update dynamically. From the looks of it, any new limit to the number of tasks per challenge or the number of challenges per project would comfortably accommodate this import’s project, so we should be in good shape.

1ec5 commented Mar 25, 2021

Lots more happened since the last time I updated this issue:

Timeline

The laundry/dry cleaner challenge returned to 50% on February 7.

By February 15, we reached 7% overall:

2021-02-15

On February 24, we retagged all the Fry’s locations in the county, including all the locations that had filed SDPs, as shop=vacant disused:shop=electronics after the chain closed.

On February 25, @impiaaa overtook me on the leaderboard to claim first place:

leaderboard

On March 1, we completed the maintenance services challenge.

On March 3, the county moved back to Tier 2 (red).

On March 4, we reran the scraper incrementally. We remained at 7%:

2021-03-04

The childcare challenge reached 50% complete on March 17. As seen here on March 15, when we had reached 40% complete and 25% fixed, childcare and kindergarten facilities are much more evenly distributed throughout the county compared to before:

childcare before childcare 2021-03-15

As of today, we’ve completed 10% of the entire import:

2021-03-24

Also, @impiaaa and I submitted a joint talk proposal for State of the Map 2021 about this import. 🤞

Address bloopers

Some examples we’ve seen of SDP addresses that threw off the geocoder:

| Address in SDP | Actual address | Distance |
| --- | --- | --- |
| 2650A Walsh Ave, Santa Clara, CA 95051 | 2350 Walsh Avenue Unit A, Santa Clara, CA 95051 | 584.9 ft |
| 6477 Almaden Rd., San Jose CA 95120 | 6477 Almaden Expressway, San Jose, CA, 95120 | 0.789 mi |
| 470 Jackson Ave., San Jose, CA 95112 | 470 Jackson Street, San Jose, CA 95112 | 2.265 mi |
| 2302 MONTEREY RD, San Jose, CA 95111 | 5302 Monterey Road, San Jose, CA 95111 | 4.047 mi |
| 6477 Almaden Rd, San Jose, CA 95120 | 6477 Almaden Expressway, San Jose, CA 95120 | 7.006 mi |
| 2133 Morrill, San Jose, CA 95132 | 2133 Morrill Avenue, San Jose, CA 95132 | 9.194 mi |
| 4849 San Felipe Rd. Unit 140, San Jose, CA 95135 | 4898 San Felipe Road Unit 140, San Jose, CA 95135 | 19.51 mi |
| 4134 Fairway Dr., Sequel, CA 95073 | 4134 Fairway Drive, Soquel, CA 95073 | 148.2 mi |
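
A distance like the ones above can be estimated from two coordinate pairs with the haversine formula. This is a generic sketch, not necessarily how the figures in the table were measured, and the sample coordinates are arbitrary.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Arbitrary example: a geocoded point versus a surveyed location.
offset = haversine_miles(37.30, -121.90, 37.31, -121.88)
print(f"{offset:.2f} mi")
```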

Tips for mappers

  • If you see ??? in the name of a MapRoulette task, it might be in a non-Latin script, such as Chinese.
  • The homeless shelter category includes some shelter houses and drug rehab centers whose websites don’t list addresses, so we should treat them similarly to unlisted childcare centers.
  • This list of professional postnominals is handy when working on the healthcare and professional services challenges to know which preset to use.
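
On the ??? tip: we haven’t pinned down where in the pipeline the question marks get introduced, but one common cause is text being forced through an encoding that can’t represent the characters. The sketch below reproduces the effect with an invented business name and shows a simple, hypothetical check for flagging such task names.

```python
def to_ascii_lossy(text):
    """Encode to ASCII, replacing every unencodable character with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def looks_garbled(task_name):
    """Heuristic: a run of question marks suggests lost non-Latin text."""
    return "??" in task_name


name = "中文店 Market"  # made-up example name
garbled = to_ascii_lossy(name)
print(garbled, looks_garbled(garbled))
```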

Further afield

  • So far the most distant SDP we’ve mapped has been this furniture manufacturer in Wisconsin. There was also this landscaper in Santa Rosa. Consider it a way for us to promote our import nationally.
  • It might be interesting to try joining the SDP data to other open datasets such as AllThePlaces to see how many of the phone numbers are actually salvageable business lines without compromising the privacy of an individual’s phone number.
  • Based on all the buildings and POIs we’ve been importing, I’d imagine StreetComplete is filling up with many more challenges in building- and POI-related quests these days. It must be a lot more fun to use now that not all of the quests ask you for the surface of an obviously paved street. It’ll be interesting to check back a couple months from now and see if there’s been an uptick in edits made using StreetComplete. This OSMCha filter kinda-sorta tracks such edits, though the county boundary would need to be refined quite a bit.
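
The phone number idea above could be sketched as a simple join on normalized numbers: an SDP phone number that also appears in a public business listing is more likely to be a business line than a personal one. The records and field names here are invented for illustration and don’t reflect the actual SDP scrape or AllThePlaces schema.

```python
import re


def normalize_phone(raw):
    """Reduce a US phone number to 10 digits, or None if it doesn't fit."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    return digits if len(digits) == 10 else None


# Hypothetical records from each dataset.
sdp_rows = [
    {"name": "Example Pharmacy", "phone": "(408) 555-0100"},
    {"name": "Example Driver", "phone": "408.555.0199"},
]
listing_rows = [
    {"brand": "Example Pharmacy", "phone": "+1 408-555-0100"},
]

known_business_phones = {normalize_phone(r["phone"]) for r in listing_rows} - {None}
business_lines = [
    r for r in sdp_rows if normalize_phone(r["phone"]) in known_business_phones
]
print([r["name"] for r in business_lines])
```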

1ec5 commented Mar 26, 2021

We have a new 3rd-place mapper:

Screenshot-2021-3-26 MapRoulette

1ec5 commented Mar 31, 2021

> Based on all the buildings and POIs we’ve been importing, I’d imagine StreetComplete is filling up with many more challenges in building- and POI-related quests these days. It must be a lot more fun to use now that not all of the quests ask you for the surface of an obviously paved street. It’ll be interesting to check back a couple months from now and see if there’s been an uptick in edits made using StreetComplete. This OSMCha filter kinda-sorta tracks such edits, though the county boundary would need to be refined quite a bit.

So far, it looks like we’ve gotten improvements to names, addresses, and opening hours from StreetComplete users. StreetComplete doesn’t ask about some things that are often missing from the SDPs we’re importing, such as cuisine (streetcomplete/StreetComplete#103), medical specialty (streetcomplete/StreetComplete#1020), and religious denomination (streetcomplete/StreetComplete#1737).

1ec5 commented Apr 29, 2021

  • The schools challenge reached 25% complete on April 10 and 33% complete on April 15.
  • The Brazilian OSM community is proposing an import of hundreds of thousands of POIs, so our local POI import will be the third largest in OSM history, not the second largest. 🤷‍♂️
  • Sometimes an SDP really shouldn't be mapped.
  • I proposed a workshop at next month’s Mapping USA conference to invite people to help us map POIs. There are a lot of workshop proposals for the conference, so I’m not sure it’ll make the cut.

1ec5 commented May 28, 2021

Task completion milestones:

  • May 13: 14% complete overall, including 28% of high-priority tasks. Laundromats challenge over 75% complete.
  • May 27: 15% complete overall, including 29% of high-priority tasks. Schools challenge 35% complete.
  • Today:

Progress on May 28

Other notable events:

  • @impiaaa reran the scraper on April 30, bringing in about 3,000 new entries to review. The zoning WMS is no longer available, so he had to find and merge each city’s GPLU map together.
  • Our State of the Map submission was declined, but posters are being accepted until June 27 and lightning talks until July 2.
  • I surveyed and photomapped Gilroy Medical Park on May 7, since it was one of the higher concentrations of fixmes in South County arising from the import.
  • On May 16, @impiaaa photomapped the San Jose Flea Market, where a lot of SDPs had previously been unmappable.
  • On May 19, the county entered the Yellow Tier and the public health department dropped the requirement for businesses to submit SDPs. We reran the scraper the following day.
  • I presented the import at an expo booth at the Inclusive Product Week conference on May 20.
  • We held a workshop about the import at Mapping USA on May 22.
  • By May 24, the landing page for SDP submissions had been taken offline, so we reran the scraper another time for a few stragglers. As of today, the DocuSign form is still online, so a few new SDPs are still coming in every day.

1ec5 commented Jun 1, 2021

1ec5 commented Jun 14, 2021

We have a new 3rd-place mapper!

leaderboard

As of June 10: 15% complete overall, including 30% of high-priority tasks

  • 90% of laundromats
  • 30% of landscaping services
  • 20% of banks
  • 20% of healthcare services
  • 20% of bars

Current progress:

progress

Some tidbits:

  • In a sign of how quickly businesses are reopening these days, most of the open businesses in this strip mall never filed an SDP. Presumably they reopened after the SDP requirement was lifted.
  • This business consistently indicated its square footage within the strip mall each time it resubmitted an SDP. It’s clearly very important to them, so I made sure to map the business as an area within the building.
  • The Mapillary layer in iD broke on June 9, but openstreetmap/iD#8535 (“Activating missing mapillary layer”) has a workaround in case anyone needs to cross-reference a business in Mapillary.

1ec5 commented Jun 28, 2021

I submitted a poster to the State of the Map 2021 poster competition:

Mapping POIs in Santa Clara County.pdf

1ec5 commented Aug 7, 2023

OSM POI coverage compared to SCCPHD and Census Bureau data by tract and ZIP code

I’ve uploaded some of the files I used to create this report to the osm-southbay-poi-coverage repository.

1ec5 commented Aug 12, 2023

An updated report as of August 5, 2023:

OpenStreetMap POI coverage in Santa Clara County August 2023.pdf

POIs by census tract versus population density

POIs by census tract versus median household income

POIs by census tract versus share of nonwhite or Hispanic/Latino residents

POIs by ZIP code
