The search strategy needs documenting #17

rw251 · 2020-07-09T07:40:00Z

The search strategy is currently only documented in one academic paper. It should appear in a README within this repository and somewhere on the getset website to help users understand how best to search. It would include:

wildcards
the meaning of quoted expressions
how the inclusion matching works
how the exclusion matching works
more besides.

rw251 · 2020-07-09T07:52:19Z

From another issue:

What constitutes a word boundary/whitespace (e.g. is '-' included?)

Not sure - need to double check - but I think it's any non-alphanumeric character

Do the words have to occur in order (e.g. would the search string foo bar match the rubric bar foo)?

Order is irrelevant for inclusion terms - foo bar matches bar foo

Do the words have to occur concurrently (e.g. would the search string foo baz match the rubric foo bar baz)?

No - matches anywhere.

Do words match in a "LIKE %test%" style (e.g. would the search string bar match the rubric foobarbaz)?

No - it's done with whole word matching for performance reasons
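
A minimal sketch of the three answers above (order is irrelevant, words can match anywhere, whole words only). It assumes that any run of non-alphanumeric characters is a word boundary, which is still to be confirmed, and it ignores case handling and non-ASCII input entirely. The names `tokenize` and `matches_inclusion` are hypothetical and are not taken from the getset code.

```python
import re


def tokenize(text):
    # Assumption: every run of characters outside A-Z, a-z, 0-9 separates words.
    return [word for word in re.split(r"[^A-Za-z0-9]+", text) if word]


def matches_inclusion(search, rubric):
    # Every word in the search string must appear as a whole word in the
    # rubric; order and adjacency do not matter.
    rubric_words = set(tokenize(rubric))
    return all(word in rubric_words for word in tokenize(search))


print(matches_inclusion("foo bar", "bar foo"))      # order is irrelevant
print(matches_inclusion("foo baz", "foo bar baz"))  # words need not be adjacent
print(matches_inclusion("bar", "foobarbaz"))        # no substring matching
```

Comparing against a set of whole words is one way to get the performance benefit mentioned above, since it avoids scanning every rubric for substrings.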

How are characters treated when they are neither a letter/number or a word boundary (examples are '#', '_', '+' etc.)?
I think I would prefer it if they were treated exactly the same as letters.

As before I need to check but pretty sure they're stripped out. If we get any examples of code definitions where searching for these characters is important we can revisit.

How are non-ascii characters treated? I have seen these in the Salford data (e.g. squared ²). I would guess that applying unicode normalization would be enough.

Probably as word boundaries which I guess isn't good enough if they occur in code definitions - though they might not - this might just be something that appears in units. Having said that, I reckon units is probably something in SNOMED.
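
For reference, a small sketch of what the unicode normalization suggested in the question would do to the squared example. This shows the proposal only and is not something getset is known to do; the choice of the NFKC form and the helper name `normalize_rubric` are assumptions.

```python
import unicodedata


def normalize_rubric(text):
    # NFKC replaces compatibility characters such as the superscript two
    # (U+00B2) with their plain equivalents before any word splitting happens.
    return unicodedata.normalize("NFKC", text)


print(normalize_rubric("kg/m²"))
```

With this step in front of the word splitting, `²` would become an ordinary `2` rather than a word boundary.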

The ability to have a literal quote which must match exactly (including whitespace). For example, the search string "foo bar" would match bazfoo bar but not foo bazbar.

Yes this happens - with the exception that because it's whole word matching "foo bar" would match "baz foo bar" but not "bazfoo bar"
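
A sketch of quoted matching with the whole-word exception described in this answer. It assumes a word character is an ASCII letter or digit, and the function name `matches_phrase` is hypothetical; the real implementation may work differently.

```python
import re


def matches_phrase(phrase, rubric):
    # The phrase must appear exactly, whitespace included, and must not be
    # glued to a letter or digit on either side.
    pattern = r"(?<![A-Za-z0-9])" + re.escape(phrase) + r"(?![A-Za-z0-9])"
    return re.search(pattern, rubric) is not None


print(matches_phrase("foo bar", "baz foo bar"))  # phrase stands on its own
print(matches_phrase("foo bar", "bazfoo bar"))   # "foo" is part of "bazfoo"
print(matches_phrase("foo bar", "foo bazbar"))   # the exact phrase is absent
```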

The ability to match zero or more characters using the * wildcard (except in literal quotes "").

Yes - it does this.
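
A sketch of how a `*` wildcard could sit alongside whole-word matching. It assumes the wildcard only expands within a single word and that words are runs of ASCII letters and digits; neither point is confirmed in this thread, and `matches_term` is a hypothetical name.

```python
import re


def matches_term(term, rubric):
    # Each "*" becomes "zero or more word characters"; the rest is literal.
    literal_parts = [re.escape(part) for part in term.split("*")]
    pattern = re.compile("[A-Za-z0-9]*".join(literal_parts))
    words = [word for word in re.split(r"[^A-Za-z0-9]+", rubric) if word]
    # The expanded term must match one whole word in the rubric.
    return any(pattern.fullmatch(word) for word in words)


print(matches_term("diab*", "type 2 diabetes"))  # wildcard consumes "etes"
print(matches_term("foo*", "foo"))               # wildcard may match nothing
print(matches_term("diab*", "prediabetes"))      # no leading wildcard given
```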

Possibly the option to match a single character (usually with ?). To be honest I don't see myself using this much.

It doesn't do this, and given the use case is pretty small I don't think it should. Enumerating in full the search terms would be easier for someone to check e.g. having 3 inclusion terms: foo, fop and fod is better than fo? - especially if the reviewer is unfamiliar with the wildcard syntax.

rw251 mentioned this issue Jul 9, 2020

Regex searching #11

Open
