amobiz edited this page Aug 16, 2014 · 5 revisions

Welcome to the regexgen.js wiki!

RegexGen.js is a JavaScript regular expression generator that helps to construct complex regular expressions.

The Generator

The generator is exported as the regexGen() function; everything else must be referenced from it.

Generator

regexGen()

To generate a regular expression, pass sub-expressions as arguments to the regexGen() function.

Sub-expressions are then concatenated together to form the whole regular expression.

A sub-expression can be a string, a number, a RegExp object, or any combination of calls to the methods (i.e., the sub-generators) of the regexGen() function object.

Strings passed to the regexGen(), text(), maybe(), anyCharOf() and anyCharBut() functions are always escaped as necessary, so you don't have to worry about which characters to escape.
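To illustrate what "escaped as necessary" means, here is a minimal sketch of escaping a literal string for use in a native RegExp. The escapeForRegex() helper is hypothetical and is not the RegexGen.js implementation; the set of characters it escapes is an assumption.

```javascript
// Hypothetical helper, not part of RegexGen.js: prefix every character that
// has a special meaning in a JavaScript regular expression with a backslash.
function escapeForRegex(text) {
    return String(text).replace(/[.*+?^${}()|[\]\\\/]/g, '\\$&');
}

// Build a pattern that matches the input literally.
var literal = new RegExp('^' + escapeForRegex('1+1=2 (really?)') + '$');

console.log(escapeForRegex('a.b'));            // a\.b
console.log(literal.test('1+1=2 (really?)'));  // true
console.log(literal.test('11=2 really'));      // false
```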

The result of calling the regexGen() function is a RegExp object. See The RegExp Object section for details.

Since everything must be referenced from the regexGen() function, it is preferable to assign it to a short variable to simplify the code.

Usage:

var _ = regexGen;

var regex = regexGen(
    _.startOfLine(),
    _.capture( 'http', _.maybe( 's' ) ), '://',
    _.capture( _.anyCharBut( ':/' ).repeat() ),
    _.group( ':', _.capture( _.digital().multiple(2,4) ) ).maybe(), '/',
    _.capture( _.anything() ),
    _.endOfLine()
);
var matches = regex.exec( url );
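For comparison, a hand-written native pattern with the same intent as the call above might look like the following. This is a sketch only: the exact source that RegexGen.js emits may differ, and the sample URL is an assumption.

```javascript
// Hand-written pattern with the same intent as the regexGen() call above.
// The exact source generated by RegexGen.js may differ.
var urlPattern = /^(https?):\/\/([^:\/]+)(?::(\d{2,4}))?\/(.*)$/;

var sampleUrl = 'https://example.com:8080/docs/index.html';  // sample input (assumption)
var urlMatches = urlPattern.exec(sampleUrl);

console.log(urlMatches[1]);  // https
console.log(urlMatches[2]);  // example.com
console.log(urlMatches[3]);  // 8080
console.log(urlMatches[4]);  // docs/index.html
```

When the optional port group does not participate in the match, its capture is undefined.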

Utility

mixin( window | global context )

The mixin() function is a method of the regexGen() function object. For convenience, you can use the regexGen.mixin() function to export all methods of the regexGen() function object to the global object. Note that this pollutes the global object.

Usage:

regexGen.mixin( window );

var regex = regexGen(
    startOfLine(),
    capture( 'http', maybe( 's' ) ), '://',
    capture( anyCharBut( ':/' ).repeat() ),
    group( ':', capture( digital().multiple(2,4) ) ).maybe(), '/',
    capture( anything() ),
    endOfLine()
);
var matches = regex.exec( url );
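Conceptually, a mixin copies the methods of one object onto another. The sketch below shows the idea with a hypothetical mixinSketch() helper and a stand-in function object; it is not the RegexGen.js source.

```javascript
// Hypothetical sketch, not the RegexGen.js source: copy every function-valued
// own property of `source` onto `target`.
function mixinSketch(source, target) {
    Object.keys(source).forEach(function (name) {
        if (typeof source[name] === 'function') {
            target[name] = source[name];
        }
    });
    return target;
}

// Stand-in for the regexGen function object (assumption for illustration).
function fakeGen() {}
fakeGen.text = function (s) { return s; };
fakeGen.version = '0.0';

var scope = mixinSketch(fakeGen, {});
console.log(typeof scope.text);   // function
console.log('version' in scope);  // false
```

Copying onto the global object in the same way is what makes every method name a global identifier.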

Modifiers

Modifiers alter the behavior of the regular expression. If specified, modifiers can be any combination of the following:

ignoreCase()

Case-insensitive search. Equivalent to /.../i.

searchAll()

Global search. Equivalent to /.../g.

searchMultiLine()

Multiline. If the input string has multiple lines, startOfLine() (^) and endOfLine() ($) match the beginning and end of each line within the string, instead of matching the beginning and end of the whole string only. Equivalent to /.../m.
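The effect of the three flags can be checked with native regular expression literals. This sketch uses only the built-in i, g and m flags that the modifiers are documented as equivalent to; the sample text is an assumption.

```javascript
var flagSample = 'First line\nsecond LINE';

// i: case-insensitive search.
console.log(/line/.test('LINE'));          // false
console.log(/line/i.test('LINE'));         // true

// g: find all matches instead of only the first.
console.log(flagSample.match(/line/gi));   // [ 'line', 'LINE' ]

// m: ^ and $ also match at line breaks inside the string.
console.log(/^second/.test(flagSample));   // false
console.log(/^second/m.test(flagSample));  // true
```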

Sub-Generators

Sub-generators are methods of the regexGen() function object that generate parts of the whole regular expression.

Boundaries

startOfLine()

Matches beginning of input. If the multiline modifier searchMultiLine() is specified, also matches immediately after a line break character. Equivalent to /^.../.

endOfLine()

Matches end of input. If the multiline modifier searchMultiLine() is specified, also matches immediately before a line break character. Equivalent to /...$/.

wordBoundary()

Matches boundary of a word. Equivalent to /\b/.

nonWordBoundary()

Matches a non-word boundary. Equivalent to /\B/.
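A short native example of the difference between \b and \B, using an assumed sample string:

```javascript
var boundarySample = 'cat concat';

// \b requires a word boundary, so only the standalone word is replaced.
console.log(boundarySample.replace(/\bcat\b/g, 'dog'));  // dog concat

// \B requires the absence of a boundary, so only the "cat" inside "concat" is replaced.
console.log(boundarySample.replace(/\Bcat/g, 'dog'));    // cat condog
```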

Literal Characters

text( string text )

Matches the text specified. The characters in text are properly escaped when necessary. Note that this is equivalent to passing a string literal to the regexGen() generator, except that you can't apply quantifiers to a string literal.

Usage:

text( "subject" )  // ==>  /subject/

maybe( string text )

Matches the text specified 0 or 1 time. The characters in text are properly escaped when necessary.

Usage:

maybe( "subject" )  // ==>  /(?:subject)?/

Character Classes

anyCharOf( string text | array character pair, ... )

Matches any one of the given characters. The arguments are concatenated, and each can be any of:

  • a string literal, e.g., "abcde", equivalent to /[abcde]/.
  • an array of two elements indicating a range of characters, e.g., ["a", "z"], equivalent to /[a-z]/.
  • a character shorthand generator, including: anyChar(), ascii(), unicode(), nullChar(), controlChar(), formFeed(), lineFeed(), carriageReturn(), space(), nonSpace(), tab(), vertTab(), digital(), nonDigital(), word() and nonWord().

Usage:

anyCharOf( [ 'a', 'c' ], ['2', '6'], 'fgh', 'z', space() )  // ==>  /[a-c2-6fghz\s]/

anyCharBut( string text | array character pair, ... )

Matches any character except the given ones. See anyCharOf() for the accepted arguments.

Usage:

anyCharBut( [ 'a', 'c' ], ['2', '6'], 'fgh', 'z', space() )  // ==>  /[^a-c2-6fghz\s]/
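The two character classes shown in the usage examples are complements of each other, which can be checked with the native literals:

```javascript
var inClass = /[a-c2-6fghz\s]/;      // documented output of the anyCharOf() example
var notInClass = /[^a-c2-6fghz\s]/;  // documented output of the anyCharBut() example

// Characters covered by a range, a literal, or the \s shorthand.
var members = ['b', '4', 'g', 'z', ' '];
console.log(members.every(function (c) { return inClass.test(c); }));     // true
console.log(members.some(function (c) { return notInClass.test(c); }));   // false

// A character outside the class.
console.log(inClass.test('x'));     // false
console.log(notInClass.test('x'));  // true
```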

Character Shorthands

anyChar( string character )

Matches any single character except the newline character. Equivalent to /./.

ascii( string asciiCode )

Matches the character with the code hh (two hexadecimal digits).

Usage:

ascii( '20' )  // ==>  /\x20/

unicode( string unicode )

Matches the character with the code hhhh (four hexadecimal digits).

Usage:

unicode( '2000' )  // ==>  /\u2000/

nullChar()

Matches a NULL (U+0000) character. Equivalent to /\0/.

Do not follow this with another digit, because \0 is an octal escape sequence.
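The pitfall can be reproduced with native regular expressions. This sketch assumes a JavaScript engine that supports legacy octal escapes in patterns without the u flag, as browsers and Node.js do; it does not show how RegexGen.js itself handles the case.

```javascript
var nulThenOne = '\x00' + '1';  // a NUL character followed by the character "1"

// \0 followed by a digit is read as an octal escape: \01 means U+0001.
console.log(/\01/.test(nulThenOne));    // false
console.log(/\01/.test('\x01'));        // true

// Unambiguous ways to match NUL followed by "1".
console.log(/\x001/.test(nulThenOne));  // true
console.log(/[\0]1/.test(nulThenOne));  // true
```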

controlChar( string controlCharacter )

Matches a control character in a string, where controlCharacter is a letter from A to Z.

Usage:

controlChar( 'Z' )  // ==>  /\cZ/

backspace()

Matches a backspace (U+0008). Equivalent to /[\b]/.

Note: in a regular expression, you need square brackets to match a literal backspace character. (Not to be confused with \b, the word boundary.)
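The difference is easy to verify with native literals:

```javascript
var backspaceChar = '\b';  // in a string literal, \b is the backspace character U+0008

// Inside square brackets, \b matches a literal backspace.
console.log(/[\b]/.test(backspaceChar));  // true

// Outside brackets, \b is a word boundary; a lone backspace has none.
console.log(/\b/.test(backspaceChar));    // false
console.log(/\b/.test('a'));              // true
```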

formFeed()

Matches a form feed. Equivalent to /\f/.

lineFeed()

Matches a line feed. Equivalent to /\n/.

carriageReturn()

Matches a carriage return. Equivalent to /\r/.

space()

Matches a single white space character, including space, tab, form feed and line feed. Equivalent to /\s/.

nonSpace()

Matches a single character other than white space. Equivalent to /\S/.

tab()

Matches a tab (U+0009). Equivalent to /\t/.

vertTab()

Matches a vertical tab (U+000B). Equivalent to /\v/.

digital()

Matches a digit character. Equivalent to /\d/.

nonDigital()

Matches any non-digit character. Equivalent to /\D/.

word()

Matches any alphanumeric character including the underscore. Equivalent to /\w/.

nonWord()

Matches any non-word character. Equivalent to /\W/.

Extended Character Shorthands

anything()

Matches any sequence of characters except the newline character. Equivalent to /.*/.

hexDigital()

Matches a hexadecimal digit. Equivalent to /[0-9A-Fa-f]/.

lineBreak()

Matches any line break, including Unix (LF) and Windows (CRLF) line endings. Equivalent to /\r\n|\r|\n/.
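The order of the alternatives matters: \r\n must come first so that a Windows line ending is consumed as one break. A native sketch with an assumed sample string:

```javascript
var mixedEndings = 'a\r\nb\rc\nd';

// CRLF is tried first, so each line ending counts as a single break.
console.log(mixedEndings.split(/\r\n|\r|\n/));  // [ 'a', 'b', 'c', 'd' ]

// With \r tried first, CRLF is split into two breaks with an empty line between.
console.log('a\r\nb'.split(/\r|\n|\r\n/));      // [ 'a', '', 'b' ]
```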

words()

Matches any alphanumeric character sequence including the underscore. Equivalent to /\w+/.

Grouping and Back References

either( any expression, ... )

Matches any one of the given alternative expressions.

Usage:

either( 'first', '1st' )  // ==>  /first|1st/

group( any expression, ... )

Matches specified terms but does not remember the match. The generated parentheses are called non-capturing parentheses.

Usage:

group( 'http', maybe( 's' ) ).maybe()  // ==>  /(?:https?)?/

capture( any expression, ... )

Matches specified terms and remembers the match. The generated parentheses are called capturing parentheses.

Usage:

var _ = regexGen;
var regex = regexGen(
  _.capture( _.label('prefix'), _.words() ),
  'o',
  _.sameAs( 'prefix' ),
  _.searchAll()
);                                             // ==> /(\w+)o\1/g
"lol, wow, aboab, foo, bar".match( regex );    // ["lol", "wow", "aboab" ]

See also label(), sameAs(). See extract() for extended usage.

label( string label )

A label is a named index to a capture group and is allowed only as the very first argument of the capture() method. A label can be referred to by the sameAs() generator, i.e., as a back-reference.

See also capture(), sameAs(). See extract() for extended usage.

sameAs( string label )

Back reference to a labeled capture group, matching the same text as that capture group.

See also capture(), label(). See extract() for extended usage.

Regex Overwrite

regex( RegExp | string expression )

Uses the given regular expression verbatim, i.e., trust me, the value is inserted as is.

Usage:

regex( /\w\d/ )       // ==> /\w\d/
regex( "\\w\\d" )     // ==> /\w\d/
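The doubled backslashes in the string form come from JavaScript string literal rules, not from the regular expression itself. The native RegExp constructor behaves the same way:

```javascript
var fromLiteral = /\w\d/;
var fromString = new RegExp('\\w\\d');  // each backslash must be doubled in a string literal

console.log(fromLiteral.source === fromString.source);  // true

// With single backslashes, the string literal drops them before the pattern is built.
var unescaped = new RegExp('\w\d');
console.log(unescaped.source);  // wd
```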

Quantifiers

Quantifiers can apply to all of the above sub-generators.

any

Matches the expression generated by the preceding sub-generator 0 or more times. Equivalent to /.*/ and /.{0,}/.

Usage:

anyChar().any()      // ==> /.*/

many

Matches the expression generated by the preceding sub-generator 1 or more times. Equivalent to /.+/ and /.{1,}/.

Usage:

anyChar().many()      // ==> /.+/

maybe

Matches the expression generated by the preceding sub-generator 0 or 1 time. Equivalent to /.?/ and /.{0,1}/.

Usage:

anyChar().maybe()      // ==> /.?/

repeat( number (optional) times )

Matches the expression generated by the preceding sub-generator at least once if times is omitted, or exactly the specified number of times otherwise. Equivalent to /.+/ and /.{n}/ respectively.

Usage:

anyChar().repeat()      // ==> /.+/
anyChar().repeat(5)     // ==> /.{5}/

multiple( number (optional) minTimes, number (optional) maxTimes )

Matches the expression generated by the preceding sub-generator at least minTimes and at most maxTimes times. Equivalent to /.{min,max}/. Note that the generator tries to optimize the expression when possible.

Usage:

anyChar().multiple()       // ==> /.*/
anyChar().multiple(1)      // ==> /.+/
anyChar().multiple(0,1)    // ==> /.?/
anyChar().multiple(5)      // ==> /.{5,}/
anyChar().multiple(5,9)    // ==> /.{5,9}/
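The optimization can be pictured as choosing the shortest quantifier for a given pair of bounds. The quantifierFor() helper below is hypothetical and only reproduces the outputs listed above; it is not the RegexGen.js source, and the {n} result for equal bounds is an assumption.

```javascript
// Hypothetical helper, not RegexGen.js source: pick the shortest quantifier
// for a (min, max) pair, following the outputs listed above.
function quantifierFor(min, max) {
    if (min === undefined) { min = 0; }
    if (max === undefined) {
        if (min === 0) { return '*'; }
        if (min === 1) { return '+'; }
        return '{' + min + ',}';
    }
    if (min === 0 && max === 1) { return '?'; }
    if (min === max) { return '{' + min + '}'; }  // assumption: equal bounds collapse to {n}
    return '{' + min + ',' + max + '}';
}

console.log('.' + quantifierFor());      // .*
console.log('.' + quantifierFor(1));     // .+
console.log('.' + quantifierFor(0, 1));  // .?
console.log('.' + quantifierFor(5));     // .{5,}
console.log('.' + quantifierFor(5, 9));  // .{5,9}
```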

greedy

Makes a quantifier greedy. Note that quantifiers are greedy by default.

Usage:

anyChar().any().greedy()       // ==> /.*/
anyChar().many().greedy()      // ==> /.+/
anyChar().maybe().greedy()     // ==> /.?/

lazy

Makes a quantifier lazy.

Usage:

anyChar().any().lazy()          // ==> /.*?/
anyChar().many().lazy()         // ==> /.+?/
anyChar().maybe().lazy()        // ==> /.??/
anyChar().multiple(5,9).lazy()  // ==> /.{5,9}?/
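The practical difference between greedy and lazy quantifiers shows up when a pattern could stop at more than one place. A native sketch with an assumed sample string:

```javascript
var markup = '<b>bold</b>';

// Greedy: .+ consumes as much as possible, up to the last ">".
console.log(markup.match(/<.+>/)[0]);   // <b>bold</b>

// Lazy: .+? consumes as little as possible, stopping at the first ">".
console.log(markup.match(/<.+?>/)[0]);  // <b>
```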

reluctant

This is an alias of lazy().

Lookaheads

contains( any expression )

Matches the expression generated by the preceding sub-generator only if it matches the given expression.

Usage:

// Simple Password Validation

var _ = regexGen;
var regex = regexGen(
    // Anchor: the beginning of the string
    _.startOfLine(),
    // Match: six to ten word characters
    _.word().multiple(6,10).
        // Look ahead: anything, then a lower-case letter
        contains( _.anything().reluctant(), _.anyCharOf(['a','z']) ).
        // Look ahead: anything, then an upper-case letter
        contains( _.anything().reluctant(), _.anyCharOf(['A','Z']) ).
        // Look ahead: anything, then one digit
        contains( _.anything().reluctant(), _.digital() ),
    // Anchor: the end of the string
    _.endOfLine()
);
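A hand-written native pattern with the same intent uses three lookaheads anchored at the start of the string. This is a sketch for comparison only; the source that RegexGen.js generates for the call above may be arranged differently, and the sample passwords are assumptions.

```javascript
// Hand-written equivalent in intent: 6 to 10 word characters that include
// a lower-case letter, an upper-case letter and a digit.
var passwordPattern = /^(?=.*?[a-z])(?=.*?[A-Z])(?=.*?\d)\w{6,10}$/;

console.log(passwordPattern.test('Passw0rd'));   // true
console.log(passwordPattern.test('password1'));  // false (no upper-case letter)
console.log(passwordPattern.test('Pw0'));        // false (too short)
console.log(passwordPattern.test('Passw0rd!'));  // false ("!" is not a word character)
```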

notContains( any expression )

Matches the expression generated by the preceding sub-generator only if it does not match the given expression.

followedBy( any expression )

Matches the expression generated by the preceding sub-generator only if it is followed by content that matches the given expression.

notFollowedBy( any expression )

Matches the expression generated by the preceding sub-generator only if it is not followed by content that matches the given expression.
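In native syntax these correspond to positive and negative lookahead, (?=...) and (?!...). The lookahead checks the following text without consuming it, as this sketch with assumed sample strings shows:

```javascript
// Positive lookahead: "foo" only when "bar" follows; "bar" is not part of the match.
console.log('foobar'.match(/foo(?=bar)/)[0]);  // foo
console.log(/foo(?=bar)/.test('foobaz'));      // false

// Negative lookahead: "foo" only when "bar" does not follow.
console.log(/foo(?!bar)/.test('foobar'));      // false
console.log(/foo(?!bar)/.test('foobaz'));      // true
```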

The RegExp Object

The RegExp object returned by the regexGen() function can be used directly as usual. In addition, four properties are injected into the RegExp object:

  • warnings array

The warnings property is an array of strings containing the errors detected while processing and generating the final regular expression. One of the best practices of programming is to always treat warnings as errors and fix them.

  • captures array

The captures property is an array of strings containing the indexes of captures and/or the labels of named captures, in the order they appear in the regular expression. The first item is always '0', the index of the whole match; the second item is either '1' or the label that was passed to the label() generator for that capture, and so forth.

  • extract( string text ) method

Instead of accessing the array returned by the RegExp.exec() method or the String.match() method, you can obtain a JSON object from the injected extract() method if you use the label() generator to capture patterns:

var sample = 'Conan: 8: Hi, there, my name is Conan.';
var _ = regexGen;
var regex = regexGen(
    _.capture(_.label('name'), _.words()),
    ':', _.space().any(),
    _.capture(_.label('age'), _.digital().many()),
    ':', _.space().any(),
    _.capture(_.label('intro'), _.anything())
    );
var result = regex.extract(sample);
expect(regex.source).to.equal(/(\w+):\s*(\d+):\s*(.*)/.source);
expect(result).to.eql({
    '0': sample,
    name: 'Conan',
    age: '8',
    intro: 'Hi, there, my name is Conan.'
});

  • extractAll( string text ) method

Same as the extract() method, but returns all matches in an array. Note that this method must be used with the searchAll() modifier specified.

Usage:

var sample = 'Conan: 8, Kudo: 17';
var regex = regexGen(
    capture(label('name'), words()),
    ':', space().any(),
    capture(label('age'), digital().many()),
    searchAll()
);
expect(regex.extractAll(sample)).to.eql([{
    '0': 'Conan: 8',
    name: 'Conan',
    age: '8'
}, {
    '0': 'Kudo: 17',
    name: 'Kudo',
    age: '17'
}]);
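The relationship between the captures array and the extracted objects can be sketched with a native RegExp. The helpers below are hypothetical and are not the RegexGen.js implementation; returning null when nothing matches is an assumption, and the captures list is written by hand to mirror the description of the captures property.

```javascript
// Hypothetical helper: map the items of an exec() result onto the names
// listed in a captures array such as ['0', 'name', 'age'].
function toLabeledObject(match, captures) {
    var result = {};
    for (var i = 0; i < captures.length; i++) {
        result[captures[i]] = match[i];
    }
    return result;
}

// Sketch of extract(): first match only, or null when nothing matches (assumption).
function extractSketch(regex, captures, text) {
    var match = regex.exec(text);
    return match ? toLabeledObject(match, captures) : null;
}

// Sketch of extractAll(): needs the global flag so that exec() advances.
function extractAllSketch(regex, captures, text) {
    if (!regex.global) {
        throw new Error('extractAllSketch needs a regex with the g flag');
    }
    var results = [];
    var match;
    regex.lastIndex = 0;
    while ((match = regex.exec(text)) !== null) {
        results.push(toLabeledObject(match, captures));
        if (match[0] === '') { regex.lastIndex++; }  // avoid looping on empty matches
    }
    return results;
}

var people = extractAllSketch(/(\w+):\s*(\d+)/g, ['0', 'name', 'age'], 'Conan: 8, Kudo: 17');
console.log(people.length);   // 2
console.log(people[1].name);  // Kudo
```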