print/vmc.c: Reduce generated code verbosity for `.{1000,}`. #446

silentbicycle · 2023-10-17T20:40:11Z

There was already an existing special case for repeated FETCH, STOP, FETCH, STOP, ... sequences, so add a similar case for repeated FETCH, BRANCH, ... cases as would be produced by .{1000,}.

There was already an existing special case for repeated `FETCH, STOP, FETCH, STOP, ...` sequences, so add a similar case for repeated `FETCH, BRANCH, ...` cases as would be produced by `.{1000,}`.

katef · 2023-10-17T20:50:06Z

src/libfsm/print/vmc.c

+				/* Instead of emitting e.g.:
+				 *     if (c = (unsigned char) *p++, c == '\0') return -1;
+				 *     if (c == '\n') goto l0;
+				 * N times, emit a for loop around that once.


okay, this makes a lot of sense!

dgryski · 2023-11-15T19:25:35Z

I wonder if this would be useful for the vmops output format too. Obviously it would mean making the interpreter vm a bit smarter.

katef · 2023-11-22T13:54:28Z

@dgryski I think we should move this logic up a level, to the vm, and introduce an opcode for this, or replace the existing "match a character" with "match a substring". then everybody benefits

sfstewman · 2023-11-23T15:01:01Z

@dgryski I think we should move this logic up a level, to the vm, and introduce an opcode for this, or replace the existing "match a character" with "match a substring". then everybody benefits

If you look, there are/were plans for opcodes for this. It should speed up certain kinds of matches, but I've never been able to get a test set that would tell me when it helped and when it didn't, so I haven't moved forward with it. It's pretty easy to come up with unanchored REs where this wouldn't help much without some serious graph preprocessing, and where a naive approach may be detrimental. There's also some locality trade-offs between match-a-substring and match-a-character that weren't obvious to me.

I've been noodling around with more general graph optimizations over the last few years, but work has been demanding, so I haven't made much progress lately. Sadly, I don't see that changing in the near future.

Which is to say: I think this is an interesting direction to explore, and I'm glad you guys have the data to test whether it has an impact for your problems. I would make this an optional optimization in case there are other use cases where this is detrimental to performance.

As a last thought, I don't think you want to replace "match a character" with "match a substring." I think you want both to let the failure handling be different. When a substring match fails, you'd want to do something intelligent with where it failed rather than abort the whole substring. @silentbicycle had suggested this at one point, and it's a good idea.

silentbicycle · 2024-01-04T20:57:48Z

I'm revisiting this now, and agree that having separate character and substring match instructions makes sense (rather than special-casing it in code generation like this). Conditionally emitting substring match would limit the impact of the change to specific cases like .{1000,} that aren't currently handled very well, but the existing behavior is fine otherwise.

silentbicycle · 2024-01-08T22:37:12Z

In the specific cases I have in mind (.{1000,} and x{1000,}) it's closer to your potential future FINDB command, in that what we really want is to scan forward until the first offset that does/doesn't match some condition until some max bound is reached, failing otherwise -- like FETCH-AND-CHECK cond:COND count:USIZE, and those respective cases would be:

// .{1000,}
FETCH-AND-CHECK cond:NE-'\n' count:1000
-- to represent 1000x
FETCHF
STOPFEQ 0x0a

// x{1000,}
FETCH-AND-CHECK cond:EQ<'x'> count:1000
-- and 1000x
FETCHF
STOPNE 'x'

but that wouldn't extend well to even slightly more complicated
inputs, such as [aeiou]{1000,}.

What instead do you think about either:

LOOPED-FETCH-CHECK-TABLE count table

The count could be the bottom 16 or so bits, the upper bits would be an address/table-id for a bool[256] table indicating fail or continue for each of the next bytes read. Or the next 8x 32-bit words could be the bool[256] table, I supposed, but with an address the same tables could be reused. It may be worth still having an instruction for LOOPED-FETCH-CHECK with a single COND, because checking for a single character (often \n) is a very common case.

OR

a pair of instructions:

COUNTER-INIT-MAX id count
COUNTER-DECR-JNZ id

The first instruction would initialize a new monotonic counter to a max count value. The second would decrement the counter value, then fall through to the next instruction if zero, otherwise jump back to immediately after where the counter was initialized for another repetition. Each id would correspond to a local variable in the code generation output -- at an IR level we could treat this like an unlimited register machine. The output language would either figure out the scoping or could use a stack. In practice both the number of IDs and their scope should be quite small. Nothing else should be able to modify the counter in any way, to avoid unbounded looping.

This would only be used to handle highly repetitive sequences, and any other instructions jumping into the middle of that instruction sequence from outside should be rejected at compile-time, to keep the code using the counter from being abused somehow that the unrolled version couldn't be. A looped regex whose body can match nothing (like (x?){1000,}) would just spin the loop doing nothing, but for a bounded number of cycles, and is probably still no worse than (x?)(x?)(x?)... fully unrolled (ignoring captures for now).

It would also nest, in a way that LOOPED-FETCH-CHECK-TABLE would not -- we could handle e.g. (x{10,}){10,}` smoothly that way for >10, whereas right now that unrolls the entire thing. But, is that a priority? Maybe not.

What do you think?

(edit: formatting)

dgryski · 2024-01-08T23:40:55Z

Note that Ragel https://www.colm.net/open-source/ragel/ has some extensions that allow it to create not-strictly-regular languages involving counters (such as nested parentheses). It might be interesting to see how it solves a similar problem.

katef · 2024-05-07T11:26:39Z

We talked about this, we prefer a single instruction. because it makes the IR rewriting simpler

print/vmc.c: Reduce generated code verbosity for .{1000,}.

630dfac

There was already an existing special case for repeated `FETCH, STOP, FETCH, STOP, ...` sequences, so add a similar case for repeated `FETCH, BRANCH, ...` cases as would be produced by `.{1000,}`.

silentbicycle requested a review from katef October 17, 2023 20:40

katef reviewed Oct 17, 2023

View reviewed changes

silentbicycle added 2 commits October 17, 2023 16:55

Address MSAN warning for variable set but "unused" w/out asserts.

d407ad5

Remove unused variable, only present for assert.

f6553db

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

print/vmc.c: Reduce generated code verbosity for `.{1000,}`. #446

print/vmc.c: Reduce generated code verbosity for `.{1000,}`. #446

silentbicycle commented Oct 17, 2023

katef Oct 17, 2023

dgryski commented Nov 15, 2023

katef commented Nov 22, 2023

sfstewman commented Nov 23, 2023 •

edited

silentbicycle commented Jan 4, 2024

silentbicycle commented Jan 8, 2024 •

edited

dgryski commented Jan 8, 2024

katef commented May 7, 2024

print/vmc.c: Reduce generated code verbosity for .{1000,}. #446

Are you sure you want to change the base?

print/vmc.c: Reduce generated code verbosity for .{1000,}. #446

Conversation

silentbicycle commented Oct 17, 2023

katef Oct 17, 2023

Choose a reason for hiding this comment

dgryski commented Nov 15, 2023

katef commented Nov 22, 2023

sfstewman commented Nov 23, 2023 • edited

silentbicycle commented Jan 4, 2024

silentbicycle commented Jan 8, 2024 • edited

dgryski commented Jan 8, 2024

katef commented May 7, 2024

print/vmc.c: Reduce generated code verbosity for `.{1000,}`. #446

print/vmc.c: Reduce generated code verbosity for `.{1000,}`. #446

sfstewman commented Nov 23, 2023 •

edited

silentbicycle commented Jan 8, 2024 •

edited