Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for easily matching rules using path prefixes #1089

Open
4 of 6 tasks
davidspek opened this issue Apr 11, 2023 · 7 comments
Open
4 of 6 tasks

Allow for easily matching rules using path prefixes #1089

davidspek opened this issue Apr 11, 2023 · 7 comments
Labels
rfc A request for comments to discuss and share ideas.

Comments

@davidspek
Copy link
Contributor

davidspek commented Apr 11, 2023

Preflight checklist

Context and scope

Currently rules are matched to an incoming request based on regex or glob pattern matching. Since only a single rule is allowed to match a request, the regex or glob patterns must be very precise and become increasingly complex as different paths need to be matched. While the used regex library supports negative lookahead, creating rules where requests are matched based on path prefixes is still difficult since you'd need to provide negative matches for all other rule regexes with the same base path. As glob doesn't support negative lookahead doing path prefix based routing is not possible. Even when using regex patterns with negative lookahead, the issue that arises is that the end user ends up needing to manage a system with the state of all rules to then be able to create rules with the appropriate regexes including negative lookahead patterns for all other rules. Since the most common type of routing people likely would want to implement is prefixed based routing, the fact that doing so with oathkeeper is difficult and error prone is in my views its biggest drawback.

Also see #441.

Goals and non-goals

Goals:

  • Allow for easily matching routing rules using path prefixes.

Non-goals:

  • Allow for using regex in the scheme section of the URL, since that would make it much more difficult to enter rules into the Trie.
  • Allow for regex or glob patterns in places other in the host section of the URL, since making a Trie compatible with pattern matching is likely very difficult if not impossible.
  • Allow for regex or glob patterns in places other than the last section of the path, since making a Trie compatible with pattern matching is likely very difficult if not impossible.

The design

In order to allow for easily matching route prefixes we can make use of a Trie Data Structure. These are commonly used for things like search engines where you'd want to quickly find entries that contain a word without looping through all entries within the database. We will use a Trie to find the oathkeeper rule with the longest matching prefix. Since the time complexity is O(n) the matching of rules should be very fast, and likely much faster than what is currently being done since currently oathkeeper loops through each rule to compile a list of matches.

Here is an example of Trie that will be used:

trie

Entering rules into the Trie.

To simplify and speed up matching the first node in our Trie will be the protocol used. This can be either HTTP or GRPC. The next level node in the Trie are the methods (GET, POST, PUT, PATCH, DELETE). This is followed by a node containing the URL scheme (for example: https, as parsed by url.Parse) and then a node containing the host (as parsed by `url.Parse).

Now for the paths. First, we will cleanup the path and keep anything up to the first regex or glob pattern. Since we are only interested in full paths that match, we will not be entering each character of the path into the Trie. Rather, we will split the path of a rule by the / character and enter each complete path into the Trie as a node. On the last path we will add the rule object to the Trie node so we can retrieve it. Technically a Trie node can contain multiple rules if patterns are used somewhere in the paths.

Matching incoming requests

When a request comes in and prefix matching is enabled we will first search the Trie for the rule(s) that match the protocol, method, sheme, hostname and path. Next, we will evaluate the rule(s) for regex or glob pattern matching the same way that is done currently. If after the pattern matching there are still multiple rules that are matched an error will occur the same way this happens currently.

Examples

Using the above example Trie diagram the following requests would en up at the following nodes in the Trie.

Protocol Method Scheme Hostname Path Node Number
ProtocolHTTP GET https example.com /some/path 9
ProtocolHTTP GET https example.com /some/other/path 12
ProtocolHTTP GET https example.com /something/else 8
ProtocolHTTP GET https example.com /some/thing 5
ProtocolHTTP GET https example.com /unknown/path 1
ProtocolHTTP GET https example.com /something/else/and/very/long 8
ProtocolHTTP POST https example.com / 2
ProtocolHTTP POST https example.com /some/long/path 7
ProtocolHTTP POST https example.com /some/path 11

APIs

No response

Data storage

No response

Code and pseudo-code

An implementation can be found in #1073.

Degree of constraint

No response

Alternatives considered

No response

@davidspek davidspek added the rfc A request for comments to discuss and share ideas. label Apr 11, 2023
@davidspek
Copy link
Contributor Author

/cc @aeneasr since you asked for a more thorough explanation and/or design document for my PR.

@davidspek
Copy link
Contributor Author

Friendly ping @aeneasr .

@davidspek
Copy link
Contributor Author

Friendly ping @zepatrik maybe you can also chime in here.

@aeneasr
Copy link
Member

aeneasr commented Jul 17, 2023

This looks pretty nice! Would it make sense to re-use the matching logic of httprouter? I think they solved this already, basically, and it's really fast!

@davidspek
Copy link
Contributor Author

I haven’t looked into httprouter so I’d need to have a look at it.

@davidspek
Copy link
Contributor Author

It looks like httprouter is doing the same thing, except it seems to be a bit more advanced in terms of matching. What is the same though is that wildcard matching (or in our case regex and glob) can only be done at the end of a URL. Another similarity is that they also seem to be adding the method into the Trie.

One big difference though is that they don’t insert the host name into the Trie. Rather, you’d need to create a new router (and therefore a new Trie) for each domain.

I’d need to look into the code a bit closer to see if there’s pieces we could use as a module or if they can otherwise be adapted into this codebase. Since it doesn’t look like that project is super active and that the Trie structure we’d use is like a bit different I’m leaning towards the latter.

@davidspek
Copy link
Contributor Author

After looking at the code it seems very similar to what I’ve already implemented, and I don’t think we’d be able to reuse it as is. However, it probably does have some useful bit that can be used to improve the code I’ve already written.

So @aeneasr, my question now is how can we best try to move this forward?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfc A request for comments to discuss and share ideas.
Projects
None yet
Development

No branches or pull requests

2 participants