The search strategy needs documenting #17

Open
rw251 opened this issue Jul 9, 2020 · 1 comment

Comments

@rw251
Owner

rw251 commented Jul 9, 2020

The search strategy is currently documented in only one academic paper. It should also appear in a README within this repository and somewhere on the getset website to help users understand how best to search. It should cover:

  • wildcards
  • the meaning of quoted expressions
  • how the inclusion matching works
  • how the exclusion matching works
  • more besides.
@rw251
Owner Author

rw251 commented Jul 9, 2020

From another issue:

What constitutes a word boundary/whitespace (e.g. is '-' included?)

Not sure - need to double check - but I think it's any non-alphanumeric character.
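
To make the word boundary question concrete, here is a minimal sketch of a splitter that treats every non-alphanumeric character as a boundary. This is an assumption based on the tentative answer above, not the confirmed getset behaviour. The function name `tokenize` and the lowercasing step are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into words on any run of non-alphanumeric characters.

    Assumption: only ASCII letters and digits count as word characters,
    so '-', '#', '_' and '+' all act as boundaries.
    Lowercasing is also an assumption (case handling is not discussed).
    """
    return [word for word in re.split(r"[^A-Za-z0-9]+", text.lower()) if word]


print(tokenize("Type-2 diabetes"))
```

Under these assumptions '-' is a boundary, so `Type-2` becomes the two words `type` and `2`.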

Do the words have to occur in order (e.g. would the search string foo bar match the rubric bar foo)?

Order is irrelevant for inclusion terms - foo bar matches bar foo

Do the words have to occur concurrently (e.g. would the search string foo baz match the rubric foo bar baz)?

No - matches anywhere.

Do words match in a "LIKE %test%" style (e.g. would the search string bar match the rubric foobarbaz)?

No - it's done with whole word matching for performance reasons
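
The three answers above (order is irrelevant, words can match anywhere, only whole words match) can be sketched as a set comparison. This is only an illustration of the described behaviour, not the actual implementation. The names `words` and `matches_inclusion`, the ASCII-only word definition and the case-insensitivity are assumptions.

```python
import re


def words(text: str) -> set[str]:
    """Return the set of whole words in text (assumed: ASCII alphanumeric runs)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_inclusion(search: str, rubric: str) -> bool:
    """True if every word of the search string occurs as a whole word in the rubric."""
    return words(search) <= words(rubric)


print(matches_inclusion("foo bar", "bar foo"))      # order is irrelevant
print(matches_inclusion("foo baz", "foo bar baz"))  # words need not be adjacent
print(matches_inclusion("bar", "foobarbaz"))        # no substring matching
```

The first two calls return `True` and the third returns `False`, matching the answers given above.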

How are characters treated when they are neither a letter/number nor a word boundary (examples are '#', '_', '+' etc.)?
I think I would prefer it if they were treated exactly the same as letters.

As before, I need to check, but I'm pretty sure they're stripped out. If we get any examples of code definitions where searching for these characters is important, we can revisit.

How are non-ASCII characters treated? I have seen these in the Salford data (e.g. squared ²). I would guess that applying unicode normalization would be enough.

Probably as word boundaries, which I guess isn't good enough if they occur in code definitions. They might not, though - this might just be something that appears in units. Having said that, I reckon units are probably something in SNOMED.
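
For reference, this is what the Unicode normalization suggested in the question would do to the squared example. It uses Python's standard `unicodedata` module with the NFKC form, which folds compatibility characters such as superscripts into their plain equivalents. Whether getset applies any normalization is unconfirmed. The function name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Apply NFKC normalization, mapping e.g. superscript two to the digit 2."""
    return unicodedata.normalize("NFKC", text)


print(normalize_text("kg/m²"))  # prints "kg/m2"
```

After normalization the superscript becomes an ordinary digit, so it would be treated as part of a word rather than as a boundary.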

The ability to have a literal quote which must match exactly (including whitespace). For example, the search string "foo bar" would match bazfoo bar but not foo bazbar.

Yes, this happens, with the exception that because it's whole word matching, "foo bar" would match "baz foo bar" but not "bazfoo bar".
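
A quoted expression as described above could be sketched with a regular expression that requires the exact phrase, bounded by non-word characters or the ends of the string. This is an illustrative sketch, not the real implementation. The name `matches_quoted`, the ASCII-only boundary definition and the case-insensitivity are assumptions.

```python
import re


def matches_quoted(phrase: str, rubric: str) -> bool:
    """True if the exact phrase occurs in the rubric on whole-word boundaries."""
    pattern = (
        r"(?<![A-Za-z0-9])"    # no word character immediately before the phrase
        + re.escape(phrase)    # the phrase itself, whitespace included, taken literally
        + r"(?![A-Za-z0-9])"   # no word character immediately after the phrase
    )
    return re.search(pattern, rubric, flags=re.IGNORECASE) is not None


print(matches_quoted("foo bar", "baz foo bar"))  # True
print(matches_quoted("foo bar", "bazfoo bar"))   # False: "foo" is not a whole word here
print(matches_quoted("foo bar", "bar foo"))      # False: order matters inside quotes
```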

The ability to match zero or more characters using the * wildcard (except in literal quotes "").

Yes - it does this.
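
One way the * wildcard could combine with whole word matching is sketched below. It assumes * matches zero or more alphanumeric characters within a single word. The thread only says "zero or more characters", so that restriction, the names `term_to_regex` and `matches_term`, and the example terms are all assumptions.

```python
import re

WORD_CHAR = "[A-Za-z0-9]"  # assumed definition of a word character


def term_to_regex(term: str) -> re.Pattern:
    """Translate a search term containing * wildcards into a whole-word regex."""
    literal_parts = [re.escape(part) for part in term.split("*")]
    body = (WORD_CHAR + "*").join(literal_parts)  # each * becomes "zero or more word chars"
    return re.compile(f"(?<!{WORD_CHAR}){body}(?!{WORD_CHAR})", re.IGNORECASE)


def matches_term(term: str, rubric: str) -> bool:
    """True if the (possibly wildcarded) term matches a whole word in the rubric."""
    return term_to_regex(term).search(rubric) is not None


print(matches_term("diabet*", "Type 2 diabetes"))  # True
print(matches_term("diabet*", "prediabetes"))      # False: the word does not start with "diabet"
print(matches_term("*diabet*", "prediabetes"))     # True
```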

Possibly the option to match a single character (usually with ?). To be honest I don't see myself using this much.

It doesn't do this, and given the use case is pretty small I don't think it should. Enumerating the search terms in full would be easier for someone to check, e.g. having 3 inclusion terms (foo, fop and fod) is better than fo? - especially if the reviewer is unfamiliar with the wildcard syntax.

@rw251 rw251 mentioned this issue Jul 9, 2020