Major flaw with $ logic #2

Open
edcottrell opened this issue Jan 10, 2014 · 4 comments

@edcottrell

FYI, there is a major flaw in this regex simplifier's logic. $ does not represent the empty string; it represents the end of a string (or, with the /m modifier, the end of a line). So, $+ is meaningless, and $a can never match anything.

For example, foo$ matches foo but not foobar.

foo$
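As a quick check of the Perl-style behavior described above, here is a minimal sketch using JavaScript's built-in regexes. The helper names `endsWithFoo` and `someLineEndsWithFoo` are made up for illustration and are not part of Noam.

```javascript
// In JavaScript (a Perl-style flavor), `$` is an end-of-input anchor,
// not a symbol for the empty string.
function endsWithFoo(text) {
  return /foo$/.test(text);
}

// With the m flag, `$` also matches just before a line terminator.
function someLineEndsWithFoo(text) {
  return /foo$/m.test(text);
}

console.log(endsWithFoo("foo"));              // true
console.log(endsWithFoo("foobar"));           // false
console.log(endsWithFoo("foo\nbar"));         // false
console.log(someLineEndsWithFoo("foo\nbar")); // true

// `$a` asks for an "a" after the end of the input, so it never matches.
console.log(/$a/.test("a"));                  // false
```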

@ibudiselic
Collaborator

Hello Ed, thanks for the comment.

This is not actually a bug; it's a design choice, which I'll try to explain now. Noam doesn't support Perl regexes (or any other particular regex flavor). The language for defining regular expressions in Noam is intentionally extremely simple and minimal, closely akin to something you'd find in any automata/languages textbook. The goal of a Noam regular expression is to define a regular language - nothing more and nothing less. Specifically, the goal of regular expressions in Noam is not to enable users to match and slice up parts of text - that would really be a silly thing to reimplement, as JavaScript regexes already do that job.

In this context, defining the start or the end of the string is meaningless - they are both implicitly there at the start and end of the regular expression. You can't search for a match somewhere in your string. You can only test if the whole string is in a language defined by a regular expression or if it is not.
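To make the difference concrete, here is a rough sketch in plain JavaScript, not Noam's own API. Whole-string membership behaves like a text-matching regex with both anchors added implicitly, while ordinary searching looks for a match anywhere. `inLanguage` and `foundSomewhere` are hypothetical helper names, and the patterns are ordinary JavaScript regex source strings.

```javascript
// Whole-string membership: wrap the pattern in a group and anchor both ends,
// so the entire input has to be consumed.
function inLanguage(pattern, text) {
  return new RegExp("^(?:" + pattern + ")$").test(text);
}

// Ordinary text matching: a match anywhere in the input is enough.
function foundSomewhere(pattern, text) {
  return new RegExp(pattern).test(text);
}

console.log(inLanguage("ab*", "abbb"));      // true
console.log(inLanguage("ab*", "xabbb"));     // false, the leading "x" is not in the language
console.log(foundSomewhere("ab*", "xabbb")); // true, "abbb" is found inside the input
```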

With that in mind, very early on we decided to use the dollar symbol to represent the empty string (usually denoted by epsilon in textbooks) so that regular expressions containing it are more readable and less error-prone. For example, you can't define an optional "a" with the regular expression "a?" as you normally might, because the question mark is not an operator at all; you'd use something like "a|$", which looks nicer than "a|". When you're defining a language, epsilons will be much more frequent than empty strings would be in a regex you were using to match text.
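For readers coming from JavaScript regexes, the sketch below shows one way to read "a|$". It is only an illustration: `epsilonToJsRegex` is a hypothetical helper, not something Noam provides, and it assumes an expression that uses only the features mentioned in this thread (literal symbols, `|`, and `$` for the empty string), with no escaped dollar signs.

```javascript
// Hypothetical translation: treat every `$` as the empty string (epsilon)
// by rewriting it to an empty group, then test whole-string membership.
function epsilonToJsRegex(noamStyle) {
  const body = noamStyle.replace(/\$/g, "(?:)");
  return new RegExp("^(?:" + body + ")$");
}

// "a|$" denotes the language containing "a" and the empty string,
// which a JavaScript regex would usually spell as ^a?$.
const optionalA = epsilonToJsRegex("a|$");

console.log(optionalA.test(""));   // true
console.log(optionalA.test("a"));  // true
console.log(optionalA.test("aa")); // false
```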

You can see a full explanation of the language for defining regular expressions here http://ivanzuzak.info/noam/webapps/regex_simplifier/ or in a comment around the 1450th line here https://github.com/izuzak/noam/blob/master/src/noam.re.js where the string representation of regular expressions is defined. I agree it might be helpful if we made this more explicit in the readme, but Noam started out with finite automata, their manipulation and visualization, and regular expressions were added afterwards primarily to make it easy to define languages.

Hope this clears it up. Cheers!

@edcottrell
Author

Hi Ivan,

Thanks for the kind reply. Your explanation makes perfect sense, given that you are using a special regex grammar. I appreciate you taking the time to reply and clarify the Noam expression language.

That said, may I encourage putting a disclaimer at the top of the page? The disclaimer would explain that the simplifier works with Noam regexes and that Noam regexes != Perl compatible regexes. I came across the page directly via a Google search for a regex simplifier. Because I am already familiar with regexes, I didn't really read section 1, so I had no idea it was not processing Perl-style regexes until I tried it out and got unexpected results. The page doesn't currently mention Noam at all until section 5, well after the "meat" of the page, and doesn't clarify there that Noam regexes have their own syntax and grammar. I would anticipate that others will have similar surprises.

Best regards,
Ed

@ibudiselic
Collaborator

Thanks for the suggestion, I'm inclined to agree that we should make this clearer.

Ivan

@PlNG

PlNG commented Nov 14, 2014

Agreed, I came looking for the same thing.
