Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

autoSuggestions option seems to pull in low-use values as well #9227

Closed
jdhoek opened this issue Aug 1, 2022 · 5 comments
Closed

autoSuggestions option seems to pull in low-use values as well #9227

jdhoek opened this issue Aug 1, 2022 · 5 comments
Labels
enhancement An enhancement to make an existing feature even better

Comments

@jdhoek
Copy link

jdhoek commented Aug 1, 2022

In openstreetmap/id-tagging-schema#553 I am seeing that autoSuggestions set to true (the default if missing) is pulling in tag values from TagInfo with as little as 16 uses. Is this by design?

Because anyone can create and use new tag values (and tags), the values used are not necessarily useful or correct. Misspellings and synonyms for more common tag-values are a common occurrence. To prevent propagating such tag values, editors that implement an auto suggestion feature should probably set a lower limit to the number of uses for a tag. A very safe limit is 1000, which matches the level at which TagInfo starts keeping usage graphs etc. for that tag-value as well.

Possibly, such a lower limit could be lowered for specific fields eventually, if needed.

@jdhoek jdhoek added the bug A bug - let's fix this! label Aug 1, 2022
@1ec5
Copy link
Collaborator

1ec5 commented Aug 1, 2022

The specific behavior of the autoSuggestions flag is up to the client applications that implement schema-builder, such as iD. iD currently requires at least 10 occurrences of a tag coming from taginfo before it’ll show up in the list. However, there are a bunch of other heuristics in the same method, such as omitting deprecated values.

I suspect taginfo’s 1,000-occurrence threshold is for performance reasons; it would be expensive to generate usage graphs and compute most frequent combinations for the long tail of tags, regardless of validity. Taginfo has a different threshold of 100 occurrences for considering an alphanumeric key “good”.

#7174 was a similar case where some very rare values were showing up unexpectedly. There are two existing solutions to that problem: either mark the problematic values as deprecated or turn off autoSuggestions, as you’ve done in openstreetmap/id-tagging-schema#554. The advantage of this approach is that we know which values of parking:orientation are valid and which are invalid (all the rest); on the other hand, we don’t have as much visibility into which keys will be adversely affected by a much higher threshold.

@jdhoek
Copy link
Author

jdhoek commented Aug 1, 2022

on the other hand, we don’t have as much visibility into which keys will be adversely affected by a much higher threshold.

Conversely, it may be hard to see if many bad tags get propagated with the limit set to 10 as well. 10 is awfully low for most tags; in my experience any tag value below 100 (excepting tags that document things like opening hours or monetary amounts which shouldn't suggest values at all anyway) should probably be ignored by tools. These are not bad values per se, but they tend to include lots of synonyms, misspellings, and unconventional notations. Could that limit perhaps be raised from 10 to 100?

@1ec5
Copy link
Collaborator

1ec5 commented Aug 1, 2022

A minimum of 100 occurrences doesn’t seem like a painfully high bar to clear for keyword-like values, so I think that would be a decent thing to try. Maybe we could also exempt documented values from the minimum, like taginfo does. That would accommodate tags like marker=paddle that aren’t mere typos sneaking in under the radar. I think the filter has access to taginfo’s wiki-based description strings (though not the description strings that come from data items).

You’re right that some keys with non-keyword values, like opening_hours and maxweight, may be impacted by this change. Of course the best solution would be to have dedicated support for special syntaxes (e.g., #974 ideditor/schema-builder#15), but in the meantime, maybe a per-field property could indicate the minimum number of occurrences required for inclusion in the taginfo-powered list for that key.

@jdhoek
Copy link
Author

jdhoek commented Aug 1, 2022

A minimum of 100 occurrences doesn’t seem like a painfully high bar to clear for keyword-like values, so I think that would be a decent thing to try. Maybe we could also exempt documented values from the minimum, like taginfo does.

Yeah, that seems like it would cover most desirable values, although usually documented values exceed 100 uses anyway. Exceptions are perhaps the long-tail tags like shop=*.

@tyrasd tyrasd added question Not Actionable - just a question about something moveto-iD and removed bug A bug - let's fix this! labels Aug 1, 2022
@tyrasd tyrasd transferred this issue from openstreetmap/id-tagging-schema Aug 1, 2022
@tyrasd tyrasd added enhancement An enhancement to make an existing feature even better and removed moveto-iD question Not Actionable - just a question about something labels Aug 1, 2022
@tyrasd tyrasd closed this as completed in a2cacaa Aug 1, 2022
@tyrasd
Copy link
Member

tyrasd commented Aug 1, 2022

100 occurrences

Yeah, 10 seems indeed to be too low for the cut-off point. I tried to look for some valid tags which might get hidden when increasing it to 100, and didn't find many. There are some examples:

I think it is also important to keep in mind that the autocomplete/suggestion functionality is meant to be a way for users to discover the more niche, less often used tags. In a2cacaa I went with a similar metric which is already used for filtering out uncommon tag keys: by only returning values which are either used more than 100 times or have a wiki page. This still lets a few "bad" tags slip through, i.e. ones where the wiki page describes them as spelling mistakes or similar error cases of another tag value, but those cases should probably be best modeled in the deprecated tags list of the id-tagging-schema.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An enhancement to make an existing feature even better
Projects
None yet
Development

No branches or pull requests

3 participants