Extend Regex capabilities #26160

@ghost

1- Let Regex measure word distances

Regex is a great tool for searching text, but it lacks patterns for measuring the similarity between words, such as the Levenshtein distance algorithm. This may not matter much in English, because different forms of a word usually share a letter sequence, as in (accura)te, in(accura)te and (accura)cy. This is not always true in other languages. In Arabic, for example, the words يلعب, لاعب and ألعاب all come from the root لعب but follow different patterns. Such words can't be matched by a single regex pattern unless a morphological analyzer is first used to extract the root and then generate every regex pattern needed to match all the word forms of that root, which is far too complicated.
The easy solution is to measure the distance between the words: words with a smaller distance are more likely to be related. I implemented and used Levenshtein distance for this, and there are some NuGet packages that provide such algorithms. But this approach is only approximate, and it could be improved by combining it with regex patterns. Things would be easier and regex would be more powerful if it supported such algorithms as part of its patterns, so we could find related words based on complicated criteria. Besides, analyzing the text once with regex is more efficient than doing it twice with separate tools.
A suitable regex syntax for supplying the accepted distance range could be something like this:
\b?<xit, 1, 4>\b
This would search for any whole word whose distance from the sequence xit is between 1 and 4.
There are many such algorithms, for example:

  • Levenshtein distance
  • Normalized Levenshtein
  • Weighted Levenshtein
  • Damerau-Levenshtein
  • Tanimoto coefficient
  • Hamming distance
  • Optimal String Alignment
  • Jaro-Winkler
  • Longest Common Subsequence
  • Metric Longest Common Subsequence
  • N-Gram
  • Q-Gram
  • Cosine similarity
  • Jaccard index similarity
  • Sorensen-Dice coefficient
Regex could implement some of them and let us choose which one to use through an option passed to the constructor.
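For comparison, the two-pass approach described above can be sketched with the current API. This is only a rough illustration: FuzzyWordSearch, Levenshtein and FindWords are hypothetical names, not part of System.Text.RegularExpressions, and the sketch assumes plain Levenshtein distance over UTF-16 code units. Regex finds the whole words, and a separate pass filters them by distance.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not an existing .NET API.
public static class FuzzyWordSearch
{
    // Levenshtein distance (insert, delete, substitute each cost 1),
    // computed with two rolling rows of the dynamic-programming table.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            (prev, curr) = (curr, prev); // the finished row becomes the previous row
        }
        return prev[b.Length];
    }

    // Approximates the proposed \b?<target, min, max>\b:
    // pass 1 uses regex to find whole words, pass 2 filters by distance.
    public static List<string> FindWords(string text, string target, int min, int max)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\b\w+\b"))
        {
            int d = Levenshtein(m.Value, target);
            if (d >= min && d <= max) result.Add(m.Value);
        }
        return result;
    }
}
```

With this sketch, FuzzyWordSearch.FindWords("exit xit quit", "xit", 1, 4) keeps exit and quit and drops xit itself, whose distance is 0. The distance check cannot take part in the rest of a larger pattern, which is the limitation this proposal is about.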

2- Allow callback functions in the regex pattern:

I have an idea: can regex allow us to call a custom callback function as part of the pattern? For example:
\b?<funcName(\w{2,6})>\b
where funcName can be the name of any callback function of the form:
bool funcName(string);
Each time the regex evaluates the supplied expression (such as \w{2,6}), it sends the matched text to the callback function. If the callback returns true, the regex considers this string a match and continues evaluating the remaining pattern at this position.
This would allow us to use whatever algorithms we need as part of the pattern.
This is a "simple" regex to match an IP address:
\b(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))(?:\.(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))){3}\b
It became that complex because of the need to check that each part is at most 255. This could be very simple if we match up to three digits and pass them to a callback function to validate, like this:
\b(?Validate(\d{1,3}))(?:\.(?Validate(\d{1,3}))){3}\b
Where Validate is defined as:

// Returns true when No parses as a byte, i.e. an integer from 0 to 255.
bool Validate(string No)
{
    return byte.TryParse(No, out _);
}

And that is all!
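Until something like this exists, a similar effect can be approximated by matching loosely and validating the captures afterwards. The following is a minimal sketch, assuming the hypothetical names IpMatcher, Candidate and FindIps. It uses only existing Regex, Match, Group and Capture members.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not an existing .NET API.
public static class IpMatcher
{
    // Loose pattern: four dot-separated runs of 1-3 digits.
    // The range check is deliberately left to Validate.
    private static readonly Regex Candidate =
        new Regex(@"\b(\d{1,3})(?:\.(\d{1,3})){3}\b");

    // Same role as the Validate callback in the proposal: true for 0..255.
    public static bool Validate(string no) => byte.TryParse(no, out _);

    // Keeps only the candidates whose four parts all pass Validate.
    public static string[] FindIps(string text) =>
        Candidate.Matches(text)
                 .Cast<Match>()
                 .Where(m => Validate(m.Groups[1].Value)
                          && m.Groups[2].Captures.Cast<Capture>()
                                        .All(c => Validate(c.Value)))
                 .Select(m => m.Value)
                 .ToArray();
}
```

For the input "host 192.168.1.10 and 300.1.1.1" this sketch keeps 192.168.1.10 and rejects 300.1.1.1. It is not equivalent to a callback inside the pattern, though. The filter runs after the engine has already committed to a match, so the engine cannot backtrack when validation fails. In "999.1.2.3.4", for example, the loose pattern consumes 999.1.2.3, the filter rejects it, and the valid 1.2.3.4 is never tried.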

3- Add validation expressions:

Another solution for this issue, to keep on the shelf for future improvements:
Instead of calling an external function, regex could define some validating expressions using len(), val() and so on, similar to SQL.
'''\b(?n=\d{1,3}, val(n)<256)(?:\.(?d=\d{1,3}, val(d)<256)){3}\b'''

4- Add named subexpressions:

It would also be helpful if we could define named subexpressions:
'''(?DEF validNo:(n=\d{1,3}, val(n)<256))\b'validNo'(?:\.'validNo'){3}\b'''

Compare the above to:
'''\b(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))(?:\.(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))){3}\b'''
Which one is more readable, easier to write, and more efficient?
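Part of the readability benefit can be had today by building the pattern from a named C# string. The sketch below assumes the hypothetical names NamedParts, Octet and Ip. It only removes the repetition in the source text. The engine still sees the full expanded pattern, and there is no val(n)<256 style check.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not an existing .NET API.
public static class NamedParts
{
    // Plays the role of the proposed (?DEF validNo: ...) definition:
    // one decimal number from 0 to 255.
    private const string Octet = @"(?:1?\d{1,2}|2(?:[0-4]\d|5[0-5]))";

    // In an interpolated string, {{3}} emits the literal quantifier {3}.
    public static readonly Regex Ip =
        new Regex($@"\b{Octet}(?:\.{Octet}){{3}}\b");
}
```

For example, NamedParts.Ip.IsMatch("10.0.0.1") is true and NamedParts.Ip.IsMatch("256.1.1.1") is false.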
