Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty intersection not detected #48

Open
mikewardle opened this issue Oct 29, 2018 · 5 comments
Open

Empty intersection not detected #48

mikewardle opened this issue Oct 29, 2018 · 5 comments

Comments

@mikewardle
Copy link

I am trying to write a method that will detect if there is NO intersection between 2 regular expressions (as in this answer on stack overflow for java: https://stackoverflow.com/a/17957180/2987400 )

but it never suggests that the intersection is empty.

Here is my code:

public bool IsNotOrthogonal(string productName1, string productName2)
    {
        _log.DebugFormat("Checking for overlap between [{0}] and [{1}]", productName1, productName2);

        var r1 = new RegExp(productName1);
        var a1 = r1.ToAutomaton();

        var r2 = new RegExp(productName2);
        var a2 = r2.ToAutomaton();

        var i2 = a1.Intersection(a2);

        if (i2.IsEmpty)
        {
            _log.Debug("no overlap");
            return false;
        }
        else
        {
            _log.Debug("intersection is not empty");
            return true;
        }
    }

and some a unit test that fails when I expect it to pass:

public void SimpleNoOverlap()
    {
        var logger = new Mock<ILog>();
        var sut = new RegexOverlapChecker(x => logger.Object);
        var res = sut.IsNotOrthogonal(@"^A", @"^B");
        Assert.IsFalse(res);
    }

Any help on this would be greatly appreciated

@mikewardle
Copy link
Author

Could this be that the IsEmpty property is never set?

I have tried changing to use BasicOperations.IsEmpty(i2) and now it finds empty intersections for cases I would not expect it to for example

var logger = new Mock<ILog>();
var sut = new RegexOverlapChecker(x => logger.Object);
var res = sut.IsNotOrthogonal(@"Easy Access Saver", @"Easy Access Saver Reward");
Assert.IsTrue(res);

@mikewardle
Copy link
Author

This now appears to be down to the matching of the regular expressions themselves, running:

var a = new RegExp("Easy Access Saver").ToAutomaton();
var isMatch = a.Run("Easy Access Saver Reward");

returns false! There is clearly a problem with my understanding of the way the automaton are functioning in Fare.

incidentally using .net framework native regex:

var isMatch = new Regex("Easy Access Saver").IsMatch("Easy Access Saver Reward");

gives true.

@moodmosaic
Copy link
Owner

@mikewardle, thank you for reporting this. Feel free to send over a pull request if you think you know a reasonable way to solve this.

Project Fare turns Regular Expressions into Automatons by applying the algorithms of dk.brics.automaton and xeger.

Unfortunately, I don't have an answer to your question, as Project Fare is really a port of the above Java projects.

You may use a different pattern or use a different engine to reverse the Regular Expression into an Automaton. As an example, you can use the Rex engine.

You are more than welcome to troubleshoot/debug this and open a pull request, however.

@trylock
Copy link

trylock commented Nov 10, 2018

@mikewardle I've been interested in a similar problem today. First off,

var a = new RegExp("Easy Access Saver").ToAutomaton();
var isMatch = a.Run("Easy Access Saver Reward");

returns false because the automaton matches the string exactly. AFAIK regular expression Easy Access Saver in this library should be equivalent to ^Easy Access Server$ in C#. You can easily fix this by using a different expression: .*Easy Access Server.*. This will try to match Easy Access Server as a substring.

Now, to the original question. I've only found this library today so correct me if I'm wrong. The IsEmpty property does not appear to be set anywhere in the code. However, language accepted by a DFA is empty if and only if there are no accepting states reachable from the initial state. The Automaton.GetAcceptStates() method will give you the set of reachable accepting states. You can easily check whether the intersection of 2 regular expressions is empty using this:

bool IsIntersectionEmpty(RegExp first, RegExp second)
{
    var firstAutomaton = first.ToAutomaton();
    var secondAutomaton = second.ToAutomaton();
    return firstAutomaton.Intersection(secondAutomaton).GetAcceptStates().Count == 0;
}

As a note, you don't have to explicitly build the intersection DFA. You can just traverse states of the product automaton.

@moodmosaic
Copy link
Owner

@trylock, very nice! Yes, besides Xeger there's also all the other great stuff from dk.brics.automaton that's already exposed in this library, e.g. ToAutomaton.


In fact, Xeger does not support all valid Java regular expressions. The full set of what is defined here and is summarized below:

regexp ::= unionexp 
| 
unionexp ::= interexp | unionexp (union) | interexp 
interexp ::= concatexp & interexp (intersection) [OPTIONAL] | concatexp 
concatexp ::= repeatexp concatexp (concatenation) | repeatexp 
repeatexp ::= repeatexp ? (zero or one occurrence) 
| repeatexp * (zero or more occurrences) 
| repeatexp + (one or more occurrences) 
| repeatexp {n} (n occurrences) | repeatexp {n,} (n or more occurrences) | repeatexp {n,m} (n to m occurrences, including both) 
| complexp 
complexp ::= ~ complexp (complement) [OPTIONAL] | charclassexp 
charclassexp ::= [ charclasses ] (character class) 
| [^ charclasses ] (negated character class) 
| simpleexp 
charclasses ::= charclass charclasses 
| charclass 
charclass ::= charexp - charexp (character range, including end-points) | charexp 
simpleexp ::= charexp 
| . (any single character) 
| # (the empty language) [OPTIONAL] | @ (any string) [OPTIONAL] | " <Unicode string without double-quotes> " (a string) 
| ( ) (the empty string) 
| ( unionexp ) (precedence override) 
| < <identifier> > (named automaton) [OPTIONAL] | <n-m> (numerical interval) [OPTIONAL] charexp ::= <Unicode character> (a single non-reserved character) 
| \ <Unicode character> (a single character)

(source)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants