Case sensitivity handling and note about in the docs #117

autioch · 2019-01-10T12:26:12Z

Hi!
First of all, I'd like to thank you for creating such a fast lexer. I've been using it along with nearley.js in various projects. It really changed my way of approaching anything related to text parsing.

I'm currently working on a language that requires all tokens to be case-insensitive, without exceptions. For now, following tips that I've found around the internet (and in issues in this repo), I've been using some custom helpers that transform token text into a case-insensitive regex without the /i flag. This works, but it's not pretty. Also, even if they're unfounded, I have doubts about the overall performance of my parser.

Why am I creating this issue? I would like a concise description of how to approach situations where all (or some) tokens are case-insensitive. An example would be nice as well.

Let's say my lexer usage looks like this:

import moo from 'moo';

const lexer = moo.compile({
  /* This doesn't care about case sensitivity. */
  STRING: /"(?:[^\\]|\\.)*?"/,

  /* Case sensitivity doesn't apply here. */
  NUMBER: /(?:\.\d+|\d+\.?\d*)/,

  /* Case sensitivity doesn't apply here. */
  ADD: '+',

  /* Manually force case insensitivity */
  IN: ['in', 'iN', 'In', 'IN'],

  /* Use a helper */
  ABS: textToCaseInsensitiveRegex('ABS')
});

tjvr · 2019-01-12T14:02:05Z

Hi! I'm glad you've found Moo and Nearley useful 😊

What is the exact use-case you're thinking about? Case-insensitive keywords?


autioch · 2019-01-16T08:20:17Z

The language that I'm working on these days is completely case-insensitive. If this can be solved with keywords, then cool.

nathan · 2019-01-16T14:59:01Z

Perhaps we could use a similar solution to that of #116, combined with some indicator that strings should be case-insensitive as well?

moo.compile({
  STRING: /"(?:[^\\]|\\.)*?"/i,
  NUMBER: /(?:\.\d+|\d+\.?\d*)/i,
  ADD: {match: '+', case: false},
  IN: {match: 'in', case: false},
  ABS: {match: 'ABS', case: false},
})

(I don't particularly care for case: false; feel free to bikeshed the syntax.)

It's a bit verbose but inescapably clear that every string is case-insensitive. Another option might be an options dictionary, e.g.:

moo.compile({
  STRING: /"(?:[^\\]|\\.)*?"/,
  NUMBER: /(?:\.\d+|\d+\.?\d*)/,
  ADD: '+',
  IN: 'in',
  ABS: 'ABS',
}, {case: false})
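
Until an option like this exists, a similar effect could be approximated outside Moo by preprocessing the rules object. The following is only a sketch under that assumption: `caseInsensitiveRules`, `literalToRegex` and `escapeRegex` are hypothetical helper names, not part of Moo, and only plain string matchers (or arrays of strings) are rewritten. RegExp rules and option objects pass through untouched.

```javascript
// Hypothetical helpers, not part of Moo's API.

// Escape characters that have a special meaning in a regular expression.
const escapeRegex = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Turn a literal such as 'in' into /[iI][nN]/, so no /i flag is needed.
const literalToRegex = (text) =>
  new RegExp(
    Array.from(text, (char) => {
      const lower = char.toLowerCase();
      const upper = char.toUpperCase();
      // Characters with two cases become a class; everything else is escaped.
      return lower !== upper ? `[${lower}${upper}]` : escapeRegex(char);
    }).join('')
  );

// Rewrite every string matcher in a rules object; leave everything else alone.
const caseInsensitiveRules = (rules) =>
  Object.fromEntries(
    Object.entries(rules).map(([type, rule]) => {
      if (typeof rule === 'string') return [type, literalToRegex(rule)];
      if (Array.isArray(rule) && rule.every((r) => typeof r === 'string')) {
        return [type, rule.map(literalToRegex)];
      }
      return [type, rule];
    })
  );

const rules = caseInsensitiveRules({
  NUMBER: /(?:\.\d+|\d+\.?\d*)/,
  ADD: '+',
  IN: 'in',
  ABS: 'ABS',
});
// `rules` is meant to be passed on to moo.compile(rules).
```

One thing to watch with this approach: once literals become RegExps, any ordering that a lexer applies specifically to string literals no longer helps, so longer alternatives should be listed before shorter ones that share a prefix.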

autioch · 2019-01-17T08:37:06Z

If adding such an option isn't a problem, then I'm all for it.

However, as I originally wrote, I've found existing solutions to and discussions of this problem:
#46
#78
#85
#53
Plus some other sites.

A clear description of the best way to do it is what I currently need. As this issue keeps recurring, maybe also a note in the docs/readme for other people.

tjvr · 2019-01-17T19:56:48Z

When just keywords are case-insensitive, using a custom type transform is my favourite solution.

const caseInsensitiveKeywords = defs => {
  const keywords = moo.keywords(defs)
  return value => keywords(value.toLowerCase())
}

let lexer = moo.compile({
  identifier: {
    match: /[a-zA-Z]+/,
    type: caseInsensitiveKeywords({
      'kw-class': 'class',
      'kw-def': 'def',
      'kw-if': 'if',
    }),
  },
  space: {match: /\s+/, lineBreaks: true},
})
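
To make the behaviour concrete without depending on Moo, here is a small stand-in for the transform shown above. `keywordLookup` is a hypothetical substitute for `moo.keywords`, assumed to return the keyword's type on an exact match and `undefined` otherwise; the real function may differ in detail. The point it illustrates is that only the lookup is lowercased, so the matched text keeps whatever case the user typed.

```javascript
// Hypothetical stand-in for moo.keywords: maps keyword text to its token type.
const keywordLookup = (defs) => {
  const types = new Map();
  for (const [type, words] of Object.entries(defs)) {
    // A type may list one keyword or an array of keywords.
    for (const word of [].concat(words)) types.set(word, type);
  }
  return (value) => types.get(value);
};

// Same shape as the transform above: lowercase the text before the lookup.
const caseInsensitiveKeywords = (defs) => {
  const keywords = keywordLookup(defs);
  return (value) => keywords(value.toLowerCase());
};

const typeOf = caseInsensitiveKeywords({
  'kw-class': 'class',
  'kw-def': 'def',
  'kw-if': 'if',
});

console.log(typeOf('CLASS')); // kw-class
console.log(typeOf('If'));    // kw-if
console.log(typeOf('Foo'));   // undefined (not a keyword)
```

The keywords in the definitions need to be written in lower case, since that is the only form the lookup ever sees.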

For case-insensitive literals, where the keywords modifier doesn't make sense, your textToCaseInsensitiveRegex sounds reasonable, and I can't imagine it would perform badly.

If everything is case-insensitive then perhaps we need something like what Nathan suggests?

It'd be great to see an example of your language :)

autioch · 2019-01-18T11:37:54Z

Unfortunately I can't show you the exact language. It's similar to Excel functions, where the user can type anything and case just doesn't matter.
I'll test caseInsensitiveKeywords. If it works, I can make a PR with a docs change.

Nathan's suggestion is still very welcome, as it would greatly simplify writing a lexer for such situations.

autioch · 2019-01-21T15:20:53Z

Hi again. After some testing and pondering the readability of my code, I've come to the conclusion that I'll stick with the textToCaseInsensitiveRegex helper. I'm doing some extra transformations on the lexer rules, so I've added a "precompiler" that outputs complete, finished definitions, which I later use in the app. Separating these two things is safer, makes it easier to write tests and debug, and finally reduces the time spent parsing and preparing the JS on the client side.

Helper:

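const LETTER_REGEXP = /[a-zA-Z]/;
const isCharLetter = (char) => LETTER_REGEXP.test(char);

// Escape regex metacharacters so that literals such as '+' or '.' stay literal.
const escapeRegexChar = (char) => char.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// 'ABS' -> /[aA][bB][sS]/: each letter becomes a two-character class.
function textToCaseInsensitiveRegex(text) {
  const regexSource = text.split('').map((char) => {
    if (isCharLetter(char)) {
      return `[${char.toLowerCase()}${char.toUpperCase()}]`;
    }

    return escapeRegexChar(char);
  });

  return new RegExp(regexSource.join(''));
}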

As a side note, it's cool that moo accepts an array as an alternative to an object. It's easier to manipulate, and the order of the rules is completely certain.
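
The "precompiler" split described above might look roughly like the sketch below. Everything here is an assumption for illustration: `precompile`, `hydrate` and `toCaseInsensitiveSource` are hypothetical names, the stored format is invented, and the resulting array assumes that the array form of the rules takes one object per rule with a `type` and a `match` field.

```javascript
// Compact variant of the helper above: letters become [xX] classes,
// other characters are escaped so they stay literal.
const toCaseInsensitiveSource = (text) =>
  Array.from(text, (char) =>
    /[a-zA-Z]/.test(char)
      ? `[${char.toLowerCase()}${char.toUpperCase()}]`
      : char.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  ).join('');

// Build step: turn [type, literal] pairs into plain, serializable data.
const precompile = (pairs) =>
  pairs.map(([type, text]) => ({ type, source: toCaseInsensitiveSource(text) }));

// App side: rebuild RegExp objects from the stored pattern sources.
const hydrate = (definitions) =>
  definitions.map(({ type, source }) => ({ type, match: new RegExp(source) }));

// The build step would write `stored` to disk; the app would read it back.
const stored = JSON.stringify(precompile([['ABS', 'ABS'], ['ADD', '+']]));
const rules = hydrate(JSON.parse(stored));
// `rules` is an array in a fixed order, intended for moo.compile(rules).
```

Keeping the generated definitions as plain data means they can be inspected and tested on their own, which matches the motivation given above.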

jdoklovic · 2020-01-31T18:30:30Z

I REALLY need this too. I can't find any reasonable way to implement the following matcher that I need to use:

Currently I have to keep adding these terrible statements:

const ciOps = /[wW][aA][sS]\s+[nN][oO][tT]\s+[iI][nN]|[iI][sS]\s+[nN][oO][tT]|[nN][oO][tT]\s+[iI][nN]|[wW][aA][sS]\s+[nN][oO][tT]|[wW][aA][sS]\s+[iI][nN]|[iI][sS]|[iI][nN]|[wW][aA][sS]|[cC][hH][aA][nN][gG][eE][dD]/;

const ciOrderBy = /[oO][rR][dD][eE][rR]\s+[bB][yY]/;

const ciJoins = /[aA][nN][dD]|[oO][rR]/;

Any word on this?
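
Patterns like the ones above do not have to be typed by hand; they can be generated from plain phrase lists. The sketch below is one possible way to do that, with hypothetical helper names (`phrasesToRegex`, `wordToSource`, `escapeRegex`). It assumes that whitespace inside a phrase should match `\s+` and that longer phrases should be tried first.

```javascript
// Escape characters that have a special meaning in a regular expression.
const escapeRegex = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// 'not' -> '[nN][oO][tT]'
const wordToSource = (word) =>
  Array.from(word, (char) => {
    const lower = char.toLowerCase();
    const upper = char.toUpperCase();
    return lower !== upper ? `[${lower}${upper}]` : escapeRegex(char);
  }).join('');

// Hypothetical helper: build one case-insensitive pattern from plain phrases.
const phrasesToRegex = (phrases) =>
  new RegExp(
    [...phrases]
      // Longest phrase first, so "was not in" wins over "was not" and "was".
      .sort((a, b) => b.length - a.length)
      // Any run of whitespace inside a phrase becomes \s+ in the pattern.
      .map((phrase) => phrase.trim().split(/\s+/).map(wordToSource).join('\\s+'))
      .join('|')
  );

const ciOps = phrasesToRegex([
  'was not in', 'is not', 'not in', 'was not', 'was in',
  'is', 'in', 'was', 'changed',
]);
const ciOrderBy = phrasesToRegex(['order by']);
const ciJoins = phrasesToRegex(['and', 'or']);
```

The generated alternatives may appear in a different order than in the handwritten versions, but a phrase is always placed before any shorter phrase that is a prefix of it.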

tjvr · 2021-06-04T13:34:29Z

Like the unicode flag handling in #123, I think it would be reasonable to allow the ignoreCase /i flag if all the RegExps use it. That would handle the case where everything in the language is case-insensitive.

If only some of the RegExps need to be case-insensitive, then you'll have to generate the cases manually, using something like textToCaseInsensitiveRegex above.
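
The "all or nothing" condition described here can be checked up front. The sketch below is a hypothetical check, not Moo's implementation: `ignoreCaseUsage` and `collectRegExps` are invented names, and it assumes rules are given as an object whose values are strings, RegExps, arrays of those, or objects with a `match` field.

```javascript
// Gather every RegExp reachable from a single rule definition.
const collectRegExps = (rule) => {
  if (rule instanceof RegExp) return [rule];
  if (Array.isArray(rule)) return rule.flatMap(collectRegExps);
  if (rule && typeof rule === 'object' && 'match' in rule) {
    return collectRegExps(rule.match);
  }
  return []; // string literals carry no flags
};

// Report whether the /i flag is used by all, none, or only some RegExps.
const ignoreCaseUsage = (rules) => {
  const regexps = Object.values(rules).flatMap(collectRegExps);
  const flagged = regexps.filter((re) => re.ignoreCase).length;
  if (flagged === 0) return 'none';
  return flagged === regexps.length ? 'all' : 'mixed';
};

console.log(ignoreCaseUsage({ A: /a/i, B: { match: /b/i }, C: '+' })); // all
console.log(ignoreCaseUsage({ A: /a/i, B: /b/ }));                     // mixed
console.log(ignoreCaseUsage({ A: /a/, B: 'x' }));                      // none
```

A flag applies to a whole regular expression, so when every rule ends up in one combined RegExp only the "all" and "none" cases are straightforward; "mixed" is the case where the alternatives have to be generated manually. How string literals should behave in the "all" case is left open in this sketch.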

tjvr added the question label Jan 12, 2019

tjvr mentioned this issue Feb 23, 2019

Add ignoreCase flag #122

Open
