Regex searching #11
Comments
I want to avoid the need for regex. I agree that for some users it would be helpful. However, the whole point of term sets is that the end result is more easily understood than a code set by itself. A short list of terms is demonstrably quicker and easier to digest, and makes the author's intent easier to determine. If, however, we allowed unlimited regex, then you could in theory create term sets that were unintelligible to people without regex experience, which is what I want to avoid. The compromise is that the wildcard * is allowed. If you can provide examples of any regex you would like to see, then we can discuss further.
I understand your reasoning. As an alternative, I've made a personal shopping list of features I would like. It's up to you whether you think they would be useful to others.
I don't think I would need escaping (e.g. a literal "?") because I can always use a quoted literal. I think my ideal algorithm would be
That would match the behaviour I would intuitively expect. But of course, other people might have different intuitions. Also, it might be that you already specified all this when writing the papers you have published, in which case it would be difficult to change. Definitely think of my suggestions as opinion rather than as factually correct. EDIT: I've edited this comment rather a lot, all on 2020-07-07.
Implementing the algorithm could be done either by converting to a regex, taking care to escape literal characters that have a meaning in regex, or with something like the following (using TypeScript to show the typed data structures):
// from https://github.com/orling/grapheme-splitter
// (this import form assumes the npm package with esModuleInterop enabled)
import GraphemeSplitter from "grapheme-splitter";
let splitter = new GraphemeSplitter();
interface Token {
  chars: Char[];
}
interface Char {
  ty: CharType;
  ch: string | null; // enforced to be a single grapheme cluster; null for a wildcard.
}
// A string-literal union, so "Wildcard" and "Literal" need no separate declarations.
type CharType = "Wildcard" | "Literal";
function lit(ch: string): Char {
  return {
    ty: "Literal",
    ch
  };
}
function wild(): Char {
  return {
    ty: "Wildcard",
    ch: null
  };
}
function parse(input: string): Token[] {
  let tokenBuf: Char[] = [];
  let inLiteral: boolean = false;
  let tokens: Token[] = [];
  for (let ch of splitter.splitGraphemes(input)) {
    if (ch === "\"") {
      if (inLiteral) {
        // end of literal
        if (tokenBuf.length > 0) {
          // wrap the buffer so that it has the shape of a Token
          tokens.push({ chars: tokenBuf });
          tokenBuf = [];
        }
        inLiteral = false;
      } else {
        // start of literal
        if (tokenBuf.length > 0) {
          tokens.push({ chars: tokenBuf });
          tokenBuf = [];
        }
        inLiteral = true;
      }
    } else if (inLiteral) {
      // we're in a literal so we push `ch` whatever it is.
      tokenBuf.push(lit(ch));
    } else if (ch === "*") {
      tokenBuf.push(wild());
    } else if (isWhitespace(ch)) {
      // start a new token if we haven't already.
      if (tokenBuf.length > 0) {
        tokens.push({ chars: tokenBuf });
        tokenBuf = [];
      }
    } else {
      // It's a normal character, add it to the current word
      tokenBuf.push(lit(ch));
    }
  }
  // flush whatever is left, including an unterminated literal
  if (tokenBuf.length > 0) {
    tokens.push({ chars: tokenBuf });
  }
  return tokens;
}
function isWhitespace(input: string): boolean {
  // TODO decide what constitutes whitespace
  return /^\s$/.test(input);
}
function test(matcher: Token, input: string): boolean {
  // TODO write a deterministic finite automaton, either directly or using regex.
  throw new Error("unimplemented");
}
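For what it's worth, the "convert to a regex" route mentioned above could look roughly like the sketch below. It is only an illustration, not the project's actual matcher: the names `escapeRegex`, `tokenToRegex` and `tokenMatches` are made up here, and it assumes that a wildcard matches any run of zero or more characters and that a token has to match the whole input. The types are repeated so the snippet stands on its own.

```typescript
type CharType = "Wildcard" | "Literal";

interface Char {
  ty: CharType;
  ch: string | null; // null for a wildcard
}

interface Token {
  chars: Char[];
}

function lit(ch: string): Char {
  return { ty: "Literal", ch };
}

function wild(): Char {
  return { ty: "Wildcard", ch: null };
}

// Escape every character that has a special meaning in a JavaScript regex,
// so that literals such as "?" or "." are matched verbatim.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumption: a wildcard becomes ".*" (zero or more characters) and the
// token is anchored at both ends, so it must match the whole input.
function tokenToRegex(token: Token): RegExp {
  const body = token.chars
    .map((c) => (c.ty === "Wildcard" || c.ch === null ? ".*" : escapeRegex(c.ch)))
    .join("");
  return new RegExp("^" + body + "$");
}

// Hypothetical stand-in for the unimplemented test() function.
function tokenMatches(matcher: Token, input: string): boolean {
  return tokenToRegex(matcher).test(input);
}

// Usage: the token for `diabet*`
const diabet: Token = { chars: [..."diabet"].map(lit).concat([wild()]) };
console.log(tokenMatches(diabet, "diabetes"));    // true
console.log(tokenMatches(diabet, "prediabetes")); // false
```

Whether matching should be case-insensitive, or whether a wildcard should be allowed to span whitespace, is left open here; both would only change the pattern built in `tokenToRegex`.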
I've raised issue #17 to make sure that, whatever the search strategy is, it is at least documented for users. I've responded to most of your wish list there, as it is a great starting point for the documentation. The algorithm is almost exactly as you describe, so that is reassuring. The differences are:
I'm assuming there is never a need to search for a
I'd probably just treat them as invalid and flag to the user.
Yes - apart from the |
It would be handy if I could perform regular expression searching in addition to literal text search. This would also solve #1 and potentially #7.