Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: support for Fibers #171

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

fix: support for Fibers #171

wants to merge 2 commits into from

Conversation

dunglas
Copy link
Owner

@dunglas dunglas commented Aug 4, 2023

Continuation of #99. Closes #46, closes #99.

@dunglas
Copy link
Owner Author

dunglas commented Aug 4, 2023

I don't understand why fiber tests are failing in Docker containers but not when running directly in GHA runners. Adjusting the stack size with ulimit and setting -fstack-split seems to have no effect. Also, I don't manage to reproduce the issue locally (macOS + M1) or in Docker (tests are always green).
Alpine/musl doesn't look to be the problem here as fails also tests on Debian in Docker.

Do you have any idea of what can be the difference on @withinboredom?

@withinboredom
Copy link
Collaborator

I'll have to take a gander in the morning. At least for me, it looks like Docker has gone insane...

 => [internal] load .dockerignore                                                                                                                                                                           0.1s
 => => transferring context: 133B                                                                                                                                                                           0.0s
 => [internal] load build definition from Dockerfile                                                                                                                                                        0.0s
 => => transferring dockerfile: 2.29kB                                                                                                                                                                      0.0s
 => CANCELED [internal] load metadata for docker.io/library/php-base:latest                                                                                                                                 0.5s
 => CANCELED [internal] load metadata for docker.io/mlocati/php-extension-installer:latest                                                                                                                  0.5s
 => ERROR [internal] load metadata for docker.io/library/golang-base:latest                                                                                                                                 0.5s
 => [auth] library/php-base:pull token for registry-1.docker.io                                                                                                                                             0.0s
------
 > [internal] load metadata for docker.io/library/golang-base:latest:
------
Dockerfile:6
--------------------
   4 |
   5 |     COPY --from=golang-base /usr/local/go/bin/go /usr/local/bin/go
   6 | >>> COPY --from=golang-base /usr/local/go /usr/local/go
   7 |
   8 |     # This is required to link the FrankenPHP binary to the PHP binary
--------------------
ERROR: failed to solve: golang-base: pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed

@withinboredom
Copy link
Collaborator

I'm an idiot and forgot docker bake is a thing.

@@ -136,9 +136,10 @@ jobs:
if: '!matrix.qemu'
continue-on-error: ${{fromJson(needs.prepare.outputs.push)}}
run: |
docker run --platform=${{matrix.platform}} --rm \
ulimit -s
docker run --ulimit stack=$(ulimit -s)000 --platform=${{matrix.platform}} --rm \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed a positively ridiculous stack size to get the tests to pass in my local Docker.

Suggested change
docker run --ulimit stack=$(ulimit -s)000 --platform=${{matrix.platform}} --rm \
docker run --ulimit stack=512000000 --platform=${{matrix.platform}} --rm \

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very curious how PHP fibers work, under the hood, beyond my rudimentary knowledge. It'd be a really interesting blog post. 🤔

Maybe there are some changes to either Frankenphp or PHP itself that can make this better. A 5mb stack is pretty huge... Especially the zts version, where instead of trying to do everything in a single stack, could just spawn a thread.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this with fresh morning eyes. I'm going to write that blog post. I really want to understand what is happening in the stack here.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with the value you suggested (last commit), it doesn't work :/

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing changed except the time of day and its failing locally with the same numbers I gave you (ran on the same commit). This makes no sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wtaf. I ran it a few more times and it passes
image

you can't make this up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would it fail the first few times and then pass afterward? There's some kind of race going on and it fails more than it passes, but ran it 10 times, it only passed 3 times.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's flaky. Even on GHA sometimes test pass, sometimes no.

Could you try to compile PHP with --disable-fiber-asm to see if this improves the situation?

@withinboredom
Copy link
Collaborator

withinboredom commented Aug 5, 2023

So @dunglas, I was digging around php-src to start that blog post :) and I saw this in the 8.3 notes:

Custom Fiber implementations have to initialize EG(stack_limit) and
EG(stack_base).

Maybe frankenphp needs to implement a go-specific fiber implementation?

There's also the fiber.stack_size ini entry.

@withinboredom
Copy link
Collaborator

Internals: SAPIs that may execute PHP in alternative stacks have to set EG(stack_limit) and EG(stack_base)

From: php/php-src#9104

That PR is amazing information about how the stack works. Something that Go might be screwing up:

Platforms with stacks that grow downwards are supported (probably all platforms supported by PHP). This assumes contiguous stacks.

Emphasis mine. I don't think the stack will be contiguous with Go involved.

@dunglas
Copy link
Owner Author

dunglas commented Aug 5, 2023

Thanks for the pointer, this looks interesting! Passing --disable-fiber-asm doesn't help.
Regarding contiguous stacks, I'm not sure that Go will be a problem because FrankenPHP creates the threads itself, they aren't the same threads as the one started by Go.

@withinboredom
Copy link
Collaborator

they aren't the same threads as the one started by Go.

Yep, saw that as well. I wonder if we're running into the same situation as golang/go#39411 (comment) and the PHP fiber implementation is somehow messing the main stack?

@withinboredom
Copy link
Collaborator

withinboredom commented Aug 5, 2023

I added zend_call_stack_init(); in frankenphp_execute_script (from Zend/zend_call_stack.h) as an experiment. It went from passing 30-40% of the time to passing 70-80% of the time. There's probably a hint somewhere in that code path as to what is going on. Maybe. (this is PHP 8.3 specific, that code doesn't exist in 8.2)

@withinboredom
Copy link
Collaborator

One thing sticks out to me:

goroutine 120 [runnable, locked to thread]:
runtime.gcTrigger.test({0x0?, 0x0?, 0x0?})
        /usr/local/go/src/runtime/mgc.go:547 +0xfb fp=0xc000b27990 sp=0xc000b27988 pc=0x41ed7b
runtime.mallocgc(0x4000, 0x788020, 0x1)
        /usr/local/go/src/runtime/malloc.go:1171 +0x768 fp=0xc000b279f8 sp=0xc000b27990 pc=0x412f48
runtime.makeslice(0xac9120?, 0x788060?, 0xc000b27a68?)
        /usr/local/go/src/runtime/slice.go:103 +0x52 fp=0xc000b27a20 sp=0xc000b279f8 pc=0x454212
github.com/dunglas/frankenphp.go_read_post(0x70ad45?, 0x409b46?, 0x4000)
        /go/src/app/frankenphp.go:590 +0x87 fp=0xc000b27b48 sp=0xc000b27a20 pc=0x7079e7
_cgoexp_411032355cef_go_read_post(0x7f348abe9ee0)
        _cgo_gotypes.go:985 +0x2c fp=0xc000b27b70 sp=0xc000b27b48 pc=0x70ae0c
runtime.cgocallbackg1(0x70ade0, 0xc000b27d20?, 0x0)
        /usr/local/go/src/runtime/cgocall.go:315 +0x2b1 fp=0xc000b27c38 sp=0xc000b27b70 pc=0x409a51
runtime.cgocallbackg(0xc000482340?, 0x300000002?, 0xc000482340?)
        /usr/local/go/src/runtime/cgocall.go:234 +0x109 fp=0xc000b27cc8 sp=0xc000b27c38 pc=0x409709
runtime.cgocallbackg(0x70ade0, 0x7f348abe9ee0, 0x0)
        <autogenerated>:1 +0x2f fp=0xc000b27cf0 sp=0xc000b27cc8 pc=0x47382f
runtime.cgocallback(0x409585, 0x759b20, 0xc000b27d80)
        /usr/local/go/src/runtime/asm_amd64.s:998 +0xb4 fp=0xc000b27d18 sp=0xc000b27cf0 pc=0x4710d4
runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:463 fp=0xc000b27d20 sp=0xc000b27d18 pc=0x46f0e0
runtime.cgocall(0x759b20, 0xc000b27d80)
        /usr/local/go/src/runtime/cgocall.go:167 +0x85 fp=0xc000b27d58 sp=0xc000b27d20 pc=0x409585
github.com/dunglas/frankenphp._Cfunc_frankenphp_execute_script(0x7ef0bc0a35d0)
        _cgo_gotypes.go:770 +0x48 fp=0xc000b27d80 sp=0xc000b27d58 pc=0x7047e8
github.com/dunglas/frankenphp.go_execute_script(0xc000482301?)
        /go/src/app/frankenphp.go:461 +0x207 fp=0xc000b27e20 sp=0xc000b27d80 pc=0x7067a7
_cgoexp_411032355cef_go_execute_script(0x7f4987ffea50?)
        _cgo_gotypes.go:925 +0x1c fp=0xc000b27e38 sp=0xc000b27e20 pc=0x70abdc
runtime.cgocallbackg1(0x70abc0, 0xc000b27fe0?, 0x0)
        /usr/local/go/src/runtime/cgocall.go:315 +0x2b1 fp=0xc000b27f00 sp=0xc000b27e38 pc=0x409a51
runtime.cgocallbackg(0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/cgocall.go:234 +0x109 fp=0xc000b27f90 sp=0xc000b27f00 pc=0x409709
runtime.cgocallbackg(0x70abc0, 0x7f348abf0980, 0x0)
        <autogenerated>:1 +0x2f fp=0xc000b27fb8 sp=0xc000b27f90 pc=0x47382f
runtime.cgocallback(0x0, 0x0, 0x0)
        /usr/local/go/src/runtime/asm_amd64.s:998 +0xb4 fp=0xc000b27fe0 sp=0xc000b27fb8 pc=0x4710d4
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000b27fe8 sp=0xc000b27fe0 pc=0x471301

Is Go doing GC while executing a script? Am I reading that right?

@withinboredom
Copy link
Collaborator

GC is a red herring. I disabled it, and analysed a bunch of stack traces. There doesn't seem to really be much in common which makes me think this is on PHP's side, not Go's.

Oh, and zend_call_stack_init is also a red herring. After more rigorous testing, it is clear that I was just "lucky".

@withinboredom
Copy link
Collaborator

There's some asm in Go that switches to g0 upon returning from a cgo callback. I suspect that is where the culprit is. However, I cannot reproduce the crash in the debugger. Still trying.

@withinboredom
Copy link
Collaborator

From cgocall.go:

// The call from C is on gp.m's g0 stack, so we must ensure
// that we stay on that M. We have to do this before calling
// exitsyscall, since it would otherwise be free to move us to
// a different M. The call to unlockOSThread is in unwindm.

hmmmm....

@withinboredom
Copy link
Collaborator

withinboredom commented Aug 8, 2023

I've managed to figure out how to reproduce it reliably while a debugger is running. Yet stepping through the code is enough to work around the issue. This is certainly a race condition:

  1. Reboot the computer.
  2. Startup debugging of frankenphp
  3. Make a request to localhost
  4. See the crash within the first 1-5 requests
  5. Goto (1)

I've narrowed down the issue to somewhere in _cgo_gotypes.go, which appears to be a generated file and not checked into source control. The issue appears to possibly be a bit of funky memory from PHP. Although that isn't certain yet. This is a very slippery bug.

In fact, the fact that it requires a reboot (unless you're running in Docker!) almost makes it smell like a kernel bug, but I doubt it. Probably some little chunk of memory gets cached in a CPU cache line that makes it work after the first requests, but switching namespaces in the kernel gets it evicted.

@withinboredom
Copy link
Collaborator

withinboredom commented Aug 11, 2023

Requires a custom build of PHP, but I can no longer reproduce it locally...

./configure $conf_vars
echo #define ZEND_FIBER_UCONTEXT 1 >> main/php_config.h
make -j

ZEND_FIBER_UCONTEXT uses a different method for managing the stack that seems to work just fine with Go.

@dunglas
Copy link
Owner Author

dunglas commented Aug 11, 2023

@withinboredom thanks for your in-depth research on this topic, I very much appreciate it!! Unfortunately, I don't think that defining ZEND_FIBER_UCONTEXT is enough, because it's already what this PR does and tests are still failing (at least with Docker): https://github.com/dunglas/frankenphp/pull/171/files#diff-dd2c0eb6ea5cfc6c4bd4eac30934e2d5746747af48fef6da689e85b752f39557R95-R96

Passing the --disable-fiber-asm flag defines ZEND_FIBER_UCONTEXT: https://github.com/php/php-src/blob/php-8.2.9/configure.ac#L1317

@withinboredom
Copy link
Collaborator

:sigh: ok, I have another lead, but that was the most promising.

@withinboredom
Copy link
Collaborator

Funny enough, today I did run into it with the asm disabled locally (outside of tests).

Anyway, I think I know what is going on. At least I'm fairly certain. This bug should affect ALL cgo programs, if I'm actually understanding things correctly.

Essentially the issue is that we're passing a pointer from C to Go that contains pointers to Go structures... in morestack, it calls newstack which makes a copy of the stack and rewrites all pointers. When we're using Fibers, this causes the stack to grow upon reentry to Go, which causes it to rewrite all pointers.

Here's the thing. These pointers are owned by C structs, not Go structs. Thus they crash quite spectacularly when they don't get rewritten by newstack.

I think this can be fixed by having a hash in Go that the C structs then reference (main context / current context). Essentially:

// Go Land
$contexts = ['main' => $ctx, 0 => $otherctx, 2 => ... ];

// C Land
{ 
  go_call_thingy('main', 0);
}

// Go Land
go_call_thingy($main, $current) {
  $main = $contexts[$main];
  $current = $contexts[$current];
}

or something like that. Here's the thing, once a few requests are served, the stack stays pretty stable, thus that is why it is hard to reproduce on 'real servers' vs. these tests which keep the stack really small and thus easy to grow, especially the g0 stack.

The moral of the story is that I think we cannot reference Go structs from C structs or vice-versa.

@dunglas
Copy link
Owner Author

dunglas commented Aug 15, 2023

Thanks for sharing your findings, this looks promising!

Maybe are we violating the pointers passation rules in some way?

Do you have already identified the offending pointers? Maybe could be avoid passing them from Go to C, or ping them?

Also, we should set GODEBUG=cgocheck=2 in tests to see if we get better errors.

@withinboredom
Copy link
Collaborator

Do you have already identifier the offending pointers? Maybe could be avoid passing them from Go to C, or ping them?

Ah, yep. I guess I accidentally hinted to them. It's the HTTP contexts from Caddy which are stored in the current/main contexts in Frankenphp's C code.

@dunglas
Copy link
Owner Author

dunglas commented Aug 16, 2023

I just reviewed the code again, and I don't think that we store any Go pointer in C memory.

We use runtime/cgo.Handle (which is designed to not have to pass Go pointers to C).

Do you have any GDB backtrace? (I still cannot reproduce locally, most likely because my CPU an Apple M1 which seems unaffected by this bug).

@withinboredom
Copy link
Collaborator

I went down quite a rabbit hole trying to find a bad pointer.

Handy bit of C:

static void debug_log(char *data) {
	pid_t id = gettid();
	fprintf(stderr, "tid %d: %s", id, data);
}
  • after commenting out sections, hardcoding values from Go, and adding debugging statements everywhere, I realized that I was getting a crash before a thread could be reused. After adding debugging code in thpool, this turned out to not be true.
  • as far as I can tell, the crash almost always happens between the call (cgo asm) to go_add_header from frankenphp_send_headers. It doesn't matter what you pass to it, whether it is an invalid pointer or not, whether it is junk length or junk data. Just simply making the call breaks everything.
  • then I realized that was not the issue, after a git stash and commenting out the go_header* funcs. No, I also had to remove the call to go_ub_write. Once those were commented out, there was no more crashing.
  • Obviously, this is not ideal. However, it appears that simply outputting something from a PHP fiber causes Go to crash in the assembly code. Sometimes. In between the C and the actual entry of the Go program. I have no idea why, I cannot catch it while debugging.

This is beyond my expertise to resolve now, as I'm not intimately familiar with Go and assembly. Here's a minimum diff to get tests to run on this branch (though they still fail due to not outputting anything) without crashing, uncommenting either one of those functions will cause the crash when running tests.

Apparently, this may be amd64 specific. I can reproduce this in an ubuntu 22.04 VM without fail, or a docker container. On my baremetal Linux laptop (pop-os 22.04), I cannot reproduce it reliably (if at all).

Now I'm wondering if this is ubuntu specific... but I'll have to test that later, it's 1am.

cc: @krakjoe

@dunglas
Copy link
Owner Author

dunglas commented Aug 19, 2023

@withinboredom here is why: golang/go#62130 (comment)

We'll likely have to patch Go to add support for Fibers.

@withinboredom
Copy link
Collaborator

I wonder if we can do this in C.

  1. detect we are in a fiber context
  2. reset the stack back to what Go expects
  3. call go
  4. on return, reset the stack to what PHP expects
  5. resume program

@withinboredom
Copy link
Collaborator

Huh, dunno how I missed this, but zend_timeout_handler gets called from a syscall to sleep, even from Go. So, as soon as Go sleeps for any reason, it tries to call zend_timeout_handler... this causes a crash, instantly.

@dunglas
Copy link
Owner Author

dunglas commented Aug 19, 2023

Regarding zend_timeout_handler, are you sure that you compiled PHP with --enable-zend-max-execution-timers? This feature prevents this issue and we have tests for this.

@withinboredom
Copy link
Collaborator

are you sure that you compiled PHP with --enable-zend-max-execution-timers

Double-checked and verified. Running 8.3-dev fwiw.

@withinboredom
Copy link
Collaborator

withinboredom commented Aug 19, 2023

In fact, building with --disable-zend-max-execution-timers causes tests to pass with fibers. It seems there is a bug with that feature?

It looks like when it tries to run zend_timeout_handler while in a Go context, with the Go stack. The whole thing crashes immediately. Unless you happen to be stepping through the thread it happens in, it just looks like the Go thread is parked in the stack trace.

@withinboredom
Copy link
Collaborator

Doing this:

static void disable_timers() {
	#ifdef ZEND_MAX_EXECUTION_TIMERS
		// unset default timeout
		// TODO: add support for max_input_time
		zend_unset_timeout();
	#endif
}

static void restart_timers() {
	#ifdef ZEND_MAX_EXECUTION_TIMERS
		// Reset default timeout
		// TODO: add support for max_input_time
		zend_set_timeout(INI_INT("max_execution_time"), 0);
	#endif
}

and wrapping calls to go that happen during scripts also seems to work. It'd be handy if zend_unset_timeout returned the seconds remaining so we could zend_set_timeout with the correct time.

@withinboredom
Copy link
Collaborator

Turns out to be that PHP isn't installing the signal handlers properly. I'm opening a PR over there to get it fixed in 8.2/8.3 -- well, hopefully. I've never opened a PR over there.

@dunglas
Copy link
Owner Author

dunglas commented Aug 20, 2023

I can assist you and review if you need any help (it's probably related to changes I've done to PHP). Feel free to contact me in DM!

@Danack
Copy link

Danack commented Aug 22, 2023

Turns out to be that PHP isn't installing the signal handlers properly.

I was going to ask if you'd tried toggling whether ZEND_SIGNALS was (en|dis)abled.

  This is the bare minimum required to make fibers work within the go
  runtime.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using Fibers causes epic crash
4 participants