Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with finalizing external objects #3771

Closed
chisandrei opened this issue May 13, 2024 · 4 comments
Closed

Issue with finalizing external objects #3771

chisandrei opened this issue May 13, 2024 · 4 comments

Comments

@chisandrei
Copy link
Member

We get some failures in FinalizationRegistry>finalizeEphemeron:. Two stacks are below:

C stack backtrace & registers:


Registers:
ContextFlags: 0x000000000010000f
        rax 0x000000000000001f rbx 0x000001cb005e2c10 rcx 0x000000854633df00 rdx 0x0000000000000002
        rdi 0x00007ffda8c31070 rsi 0x0000000000000000 rbp 0x000001cb77d85d90 rsp 0x000000854633ded0
        r8  0x00007ffda8ccd990 r9  0x000001cb005e1c1f r10 0x0000000000000000 r11 0x000000854633dd40
        r12 0x0000000000000000 r13 0x000000854056d000 r14 0x000001cb77d85d80 r15 0x000001cb77d85d80
        rip 0x00007ffda8bd8edc


C Callstack:

[0x00007FFDBE847E71] RtlFreeHeap
[0x00007FFD8D94C291] skia_null_ptr
[0x00007FFD8D967D23] skia_typeface_drop
[0x00007FFDA8BE0A11] primitiveReleaseWorker
[0x00007FFDA8BDDD63] primitiveReleaseWorker
[0x00007FFDA8BDDC22] primitiveReleaseWorker
[0x00007FFDA8BAB3CE] shortPrintFramesOnStackPageListInUse
[0x0000000320001658] trampoline cePrimReturnEnterCogCode
[0x00007FFDA8C32FF8] primitivePerformWorkerCallAccessorDepth
[0x8000000010400021] not a method
[0x00007FFDA8C33630] primitivePerformWorkerCallAccessorDepth


All Smalltalk process stacks (active first):
Process        0x362b87558 priority 51
       0x300009c38 M TFSameThreadRunner>invokeFunction:withArguments: 0x1dd3ff8: a(n) TFSameThreadRunner
       0x300009c98 I >primRelease: 0x3dbc058: a(n)
       0x300009cd8 M [] in >release: 0x3dbc058: a(n)
       0x300009d08 M FullBlockClosure(BlockClosure)>on:do: 0x62c001c0: a(n) FullBlockClosure
       0x300009d48 M (GtBoxerExternalObject class)>release: 0x3dbc058: a(n)
       0x300009d80 M (GtBoxerExternalObject class)>finalizeResourceData: 0x3dbc058: a(n)
       0x300009db8 M FFIExternalResourceExecutor>finalize 0x6012c4c8: a(n) FFIExternalResourceExecutor
       0x300009df0 M [] in FinalizationRegistry>finalizeEphemeron: 0x1de80d8: a(n) FinalizationRegistry
       0x300009e20 M FullBlockClosure(BlockClosure)>on:do: 0x62bffee0: a(n) FullBlockClosure
       0x300009e60 M FullBlockClosure(BlockClosure)>on:fork: 0x62bffee0: a(n) FullBlockClosure
       0x300009ea0 M FinalizationRegistry>handleErrorsDuring: 0x1de80d8: a(n) FinalizationRegistry
       0x300009ed8 M FinalizationRegistry>finalizeEphemeron: 0x1de80d8: a(n) FinalizationRegistry
       0x300009f10 M FinalizationRegistryEntry>mourn 0x60120938: a(n) FinalizationRegistryEntry
       0x300009f48 M WeakArray class>mournLoopWith: 0x16300: a(n) WeakArray class
       0x300009fa0 I [] in WeakArray class>finalizationProcess 0x16300: a(n) WeakArray class
       0x300009fe0 I [] in FullBlockClosure>newProcess 0x62b87388: a(n) FullBlockClosure
--- The full stack ---
[ :anError |
			(GtVirtualMachineSymbolNotFoundError new
				symbolName: moduleSymbol;
				moduleName: module) signal ] in GtVirtualMachineFFIBackend>>loadSymbol:module:
FullBlockClosure(BlockClosure)>>cull:
Context>>evaluateSignal:
Context>>handleSignal:
PrimitiveFailed(Exception)>>signal
PrimitiveFailed class(SelectorException class)>>signalFor:
GtVirtualMachineFFIBackend(ProtoObject)>>primitiveFailed:
GtVirtualMachineFFIBackend(ProtoObject)>>primitiveFailed
GtVirtualMachineFFIBackend>>primLoadSymbol:module:
[ self primLoadSymbol: moduleSymbol module: encodedString ] in GtVirtualMachineFFIBackend>>loadSymbol:module:
FullBlockClosure(BlockClosure)>>on:do:
GtVirtualMachineFFIBackend>>loadSymbol:module:
ExternalAddress class>>loadSymbol:module:
GtVirtualMachineExternalFunction(TFExternalFunction)>>validate
TFSameThreadRunner>>invokeFunction:withArguments:
LGitCommit class>>commit_free:
LGitCommit class(LGitExternalObject class)>>finalizeResourceData:
FFIExternalResourceExecutor>>finalize
[ 
		anEphemeron value finalize ] in FinalizationRegistry>>finalizeEphemeron:

Seems that the system tries to deallocate already deallocated memory on startup. On startup ExternalAdress allSubInstancesDo: #beNull gets called. But it seems that the finalizer runs on startup before that and it tries to deallocate some of the ExternalAdress that should be null. It seems that finalizationProcess spawns a new process for each item in the registry, and we did not see anything in there that stops the image from being saved with the finalization process active.

chisandrei added a commit to feenkcom/gtoolkit-world that referenced this issue May 13, 2024
syrel pushed a commit that referenced this issue May 13, 2024
Metacello new
    baseline: 'GToolkitForPharo9';
    repository: 'github://feenkcom/gtoolkit:v1.0.763/src';
    load

All commits (including upstream repositories) since last build:
feenkcom/gt4pharo@e045bd by Tudor Girba
Merge 9a0bad38606b91ccf38e7cc6103cd0ba9f0e7ff3

feenkcom/gt4pharo@7b7dc7 by Tudor Girba
highlight the arguments in the RB pattern highlighter #3767

feenkcom/gtoolkit-debugger@10362c by Andrei Chi�
Enable GtExceptionsWithExtensionsExamples>>#fullDebuggerOn_exampleExceptionWithDebuggingView_scripterRendering2

feenkcom/gtoolkit-world@38f2bd by Andrei Chi�
Reset ffi methods before saving the image [#3771]
@girba
Copy link
Member

girba commented May 14, 2024

This did not address the problem.
The stdout is still being written to continuously:

[Glutin] Could not init env logger
[Warning] Failed to release SkiaCompositionTransformationLayer(@ 16r200D46EA1A0) due to Could not find 'compositor_layer_drop' in '/Users/tudor/jenkins/workspace/feenkcom_gtoolkit_main/glamoroustoolkit/GlamorousToolkit.app/Contents/MacOS/Plugins/libSkia.dylib'
...

@chisandrei
Copy link
Member Author

Here is the startup order. ExternalObject >> #startUp: happens before FFIMethodRegistry >> #startUp:.

Screenshot 2024-05-14 at 10 19 59

ExternalObject >> #startUp: sets all ExternalAddress in the image to null.
However, there is validate method in an FFI function which loads the symbol from a moduleName if a handle is null.

Screenshot 2024-05-14 at 10 20 22

So if an FFI function is called after ExternalObject >> #startUp: but before or during FFIMethodRegistry >> #startUp: then we get this problem.

Ideally we should to make sure that finalizer does nothing during startUp: of the system:

  • If finalizer wakes up before ExternalObject >> #startUp: then we get a hard crash due to double free memory
  • If finalizer wakes up after ExternalObject >> #startUp: but before FFIMethodRegistry >> #startUp: then we get symbol not found error
  • If finalizer wakes up after FFIMethodRegistry >> #startUp: then we are fine

Finalizer process is supposed to restart at priority 100, so after FFIMethodRegistry >> #startUp:. However, what about its process from before the save? WeakArray implements startUp: but not shutdown:. WeakArray could terminate the finalizer process on shutdown:, then it will not run during the init of the core of the system.

One possible solution could be:

Screenshot 2024-05-14 at 10 30 59

This terminates the finalizer process regardless if we actually shutdown or not. The reason is that we must make sure that a saved image never has a running finalizer process. It is possible to save an image and just force quit it later.

@chisandrei
Copy link
Member Author

To add to the previous comment seems like the issues that we are seeing might come from FinalizationRegistry>>#finalizeEphemeron:. To finalize an object the VM sends mourn to that object,
So it is not the finalization process that kicks in, it is the VM's GC during startUp:. From what we see this is the only difference in the weak behaviour between Pharo 10 and Pharo 11.

Seems that in Pharo 11, FinalizationRegistryEntry>mourn can be sent either by GC or by WeakArray class>finalizationProcess. In Pharo 10 it was just by WeakArray class>finalizationProcess and we did not see these issues.

akgrant43 added a commit to feenkcom/gtoolkit-utility that referenced this issue May 15, 2024
akgrant43 added a commit that referenced this issue May 15, 2024
akgrant43 added a commit to feenkcom/gtoolkit-utility that referenced this issue May 15, 2024
akgrant43 added a commit to feenkcom/gtoolkit-utility that referenced this issue May 15, 2024
akgrant43 added a commit that referenced this issue May 15, 2024
akgrant43 added a commit to feenkcom/gtoolkit-utility that referenced this issue May 15, 2024
akgrant43 added a commit to feenkcom/gtoolkit-utility that referenced this issue May 15, 2024
akgrant43 added a commit to feenkcom/gtoolkit-utility that referenced this issue May 15, 2024
syrel pushed a commit that referenced this issue May 15, 2024
Metacello new
    baseline: 'GToolkitForPharo9';
    repository: 'github://feenkcom/gtoolkit:v1.0.770/src';
    load

All commits (including upstream repositories) since last build:
feenkcom/gtoolkit-glutin@8ce640 by Andrei Chi�
gtoolkit-gleam has the source directly in the root folder [#3736]

feenkcom/gtoolkit-utility@5c023d by Alistair Grant
[#3771] lazy initialisation

feenkcom/gtoolkit-utility@71629a by Alistair Grant
Merge remote-tracking branch 'origin/main'


feenkcom/gtoolkit-utility@9d8f47 by Alistair Grant
[#3771] Pharo 10 compatibility.

feenkcom/gtoolkit-utility@1df5fe by Alistair Grant
[#3771]WeakArray class>>stopFinalizationProcess.

feenkcom/gtoolkit-utility@881d11 by Alistair Grant
[#3771] Remove logging.

feenkcom/gtoolkit-utility@076cf9 by Alistair Grant
[#3771] Assign MournLoopProcess correctly.

feenkcom/gtoolkit-utility@c0e6ef by Alistair Grant
[#3771] Shutdown WeakArray finalization while saving an image

feenkcom/gtoolkit-coder@13ed29 by Juraj Kubelka
use `SmaCCMethodNodeSourceIntervalFinder` [#3735]

delete `GtPharoMethodNodeSourceIntervalFinder`

feenkcom/gtoolkit-coder@b73fb1 by Juraj Kubelka
delegate interval computation to `TGtSourceCoderEvaluatedCode>>#findSourceIntervalForContext:sourceString:` [#3735]

j-brant/SmaCC@2721a3 by John Brant
Merge pull request #24 from JurajKubelka/master

Refactor code to SmaCCMethodNodeSourceIntervalFinder

j-brant/SmaCC@1c4772 by Juraj Kubelka
clean `SmaCCMethodNodeSourceIntervalFinder`

j-brant/SmaCC@cae427 by Juraj Kubelka
clean `SmaCCMethodNodeSourceIntervalFinder`

j-brant/SmaCC@a9e6c0 by Juraj Kubelka
add `SmaCCMethodNodeSourceIntervalFinder`

feenkcom/gt4pharo@a02ad7 by Juraj Kubelka
use `GtPharoMethodNodeSourceIntervalFinder` instead of `SmaCCMethodNodeSourceIntervalFinder` [#3735]


`SmaCCMethodNodeSourceIntervalFinder` needs to be moved to a non-UI package/baseline first. Then we can remove the code duplication.

feenkcom/gt4pharo@886e12 by Juraj Kubelka
use `SmaCCMethodNodeSourceIntervalFinder` [#3735]

feenkcom/gt4pharo@e3b605 by Juraj Kubelka
Merge 657e3bd1007061dfcd095f445999013606e628ea

feenkcom/gt4pharo@f18e9b by Juraj Kubelka
implement `GtPharoEvaluatedSelector` methods [#3735]

feenkcom/gt4pharo@657e3b by Andrei Chi�
Merge 19e0d6368601fe18b530bb65466cd7621c9dcb07

feenkcom/gt4pharo@6f2eac by Andrei Chi�
Use ShiftClassInstaller in Pharo 12 [#3736]

feenkcom/gtoolkit-debugger@31ec0e by Andrei Chi�
Load GToolkit-PharoOverrides-Debugger after all code is loaded [#3736] 

feenkcom/gtoolkit-debugger@4ef5f7 by Andrei Chi�
Merge 5ab6c2dabaf29c1cf15a877f9bf329b100266fb1

feenkcom/gtoolkit-debugger@55f8c4 by Andrei Chi�
Rename and move GToolkit-Pharo9-Debugger to the debugger repository [#3736] 

feenkcom/JSLink@26a8d7 by Andrei Chi�
JavaScriptGenerator has the source in the src folder [#3736]

feenkcom/PythonBridge@6b14bd by Andrei Chi�
Python3Generator has the source in the src folder [#3736]

feenkcom/gt4smacc@a277f3 by Juraj Kubelka
reuse `SmaCCMethodNodeSourceIntervalFinder`

feenkcom/gtoolkit-visualizer@9cebe7 by akevalion
[#3775] added examples for vertical and horizontal bar lines

feenkcom/gtoolkit-releaser@cb6dd2 by Andrei Chi�
Move GToolkit-PharoMigrations to releaser [#3736]

feenkcom/gtoolkit-releaser@d2cb0e by Andrei Chi�
Update the loading using releaser for Pharo 12 [#3736]

feenkcom/gtoolkit-releaser@d49ebd by Andrei Chi�
Update the repository location when releasing for Python3Generator [#3736] 

feenkcom/gt4git@7cbb25 by Sven Van Caekenberghe
added tooltips to some GtGitRepository gtActions

d8a6de by Andrei Chi�
Move GToolkit-PharoMigrations to releaser [#3736]

900be5 by Alistair Grant
Merge 57fbb0a

501780 by Alistair Grant
[#3771] Initialize class vars.

57fbb0 by Andrei Chi�
Load GToolkit-PharoOverrides-Debugger after all code is loaded [#3736] 

1ffe6c by Andrei Chi�
Fix after merge

942442 by Andrei Chi�
Merge 78c548f

4ae043 by Andrei Chi�
Rename and move GToolkit-Pharo9-Debugger to the debugger repository [#3736] 

78c548 by Alistair Grant
[#3771] Add WeakArray class vars.

4946e1 by Andrei Chi�
Move the postload back to the original location [#3736] 

e3ade6 by Andrei Chi�
Fix package dependencies in GToolkitPrerequisites

603a4e by Andrei Chi�
Add GToolkitPrerequisites

3def3f by Andrei Chi�
Merge 07539f2

c86fa0 by Andrei Chi�
Add 'GToolkit-PharoMigrations' a a dependecy
chisandrei added a commit to feenkcom/gtoolkit-vm-bindings that referenced this issue May 16, 2024
syrel pushed a commit that referenced this issue May 16, 2024
Metacello new
    baseline: 'GToolkitForPharo9';
    repository: 'github://feenkcom/gtoolkit:v1.0.775/src';
    load

All commits (including upstream repositories) since last build:
feenkcom/gtoolkit-vm-bindings@8ae318 by Andrei Chi�
Do not allow multiple compiled methods for the same method in the registry [#3771]

feenkcom/gtoolkit-phlow@212bc3 by Juraj Kubelka
Merge 3666338e11d9ac2ec661ad65e72e4ecd18ce1abb

feenkcom/gtoolkit-phlow@3f973c by Juraj Kubelka
add convenience methods to `GtPhlowOverviewItem`

feenkcom/gtoolkit-phlow@366633 by Andrei Chi�
Fit content in GtPhlowListTextualLinkBuilder so the link does not take the entire row.

feenkcom/gt4pharo@24c6a9 by Andrei Chi�
Remove the assertion for a deprecated package in #deprecatedPackagesFilter

feenkcom/gt4pharo@3c600f by Andrei Chi�
Use a different package in for testing GtSearchDeprecatedPackagesFilter

feenkcom/gtoolkit-inspector@486f3c by Juraj Kubelka
display Context variables using `GtPhlowOverviewItem`

feenkcom/gtoolkit-inspector@f50a84 by Juraj Kubelka
do not display `self` nodes in Raw views

feenkcom/PythonBridge@20d62b by Sven Van Caekenberghe
change PBPharoPipenvPathFinder class>>#resolvePipenvPath to use external process directly

feenkcom/PythonBridge@0f1aab by Alistair Grant
[#3514] Remove OSSubprocess reference.

feenkcom/PythonBridge@9a3451 by Alistair Grant
[#3514] Don't load OSSubprocess.

feenkcom/pharo-debugadapterprotocol@51eb70 by Alistair Grant
[#3514] Remove reference to OSSubprocess.

The rewrite to use GtExternalProcessBuilder is TBD.

feenkcom/gt4git@e9574d by Don Roberts
Add caching for commits

feenkcom/gt4git@c0fc64 by Don Roberts
Update class comment


feenkcom/gt4git@feba62 by Sven Van Caekenberghe
renamed GToolkit4Git-Libgit-replace to GToolkit4Git-Libgit-CLI and classified some methods

feenkcom/gt4git@1d6a47 by Don Roberts
Merge branch 'main' of https://github.com/feenkcom/gt4git


feenkcom/gt4git@74b3d2 by Don Roberts
Make git method history work with git cli


feenkcom/gt4git@2ec039 by Sven Van Caekenberghe
fix GtGitCliSignal>>#stdoutPrintString

feenkcom/gt4git@40d8eb by Sven Van Caekenberghe
added operation to GitCli signal/event

feenkcom/gtoolkit-world@ef1564 by Andrei Chi�
Use "GtImageSetup printNotRegisteredFFIMethods" in a few more places

feenkcom/gtoolkit-world@1dae88 by Andrei Chi�
Print FFI methods not in the registry

feenkcom/gtoolkit-world@af609e by Andrei Chi�
Add #notRegisteredSystemFFIMethods

feenkcom/gtoolkit-world@2d63ec by Andrei Chi�
Remove OSSUnixSubprocess extensions

d180c3 by Andrei Chi�
Add applyPatchForFT2Handle
akgrant43 added a commit to feenkcom/gtoolkit-utility that referenced this issue May 17, 2024
@chisandrei
Copy link
Member Author

The issue that we see seems related to FFIMethodRegistry getting into a bad state.

We saw we can get in FFIMethodRegistry multiple entries for the same compiled methods. We saw that for example, especially methods for performing the release of external objects, were executed from multiple processes. We also use quite a lot of background processes. To solve other bugs and be able to call the same FFI method from multiple threads in parallel we added to FFICalloutAPI>>#function:library: a mutex to make sure that it only gets compiled once. For example below we get two compiled methods for SkiaCompositionLayer class>>#primRelease:.

Screenshot 2024-05-16 at 17 23 14

When looking at those multiple methods we saw that we can get into a situation where the ffiNonCompiledMethod property of an FFI method is actually another FFI method. This seems two happen as when same same FFI method is called in parallel from two processes, the first method is set correctly, and then the second time the same FFI method is compiled, the code sender methodClass methodDict at: sender selector. returns actually the FFI method, not the original one.

Screenshot 2024-05-16 at 17 23 08 Screenshot 2024-05-16 at 17 23 28

The side-effect of this is that when doing FFIMethodRegistry resetAll depending on the order in which the same versions of a method were restored, the reset could actually install in the class an FFI method instead of the original method. For example the code below shows that after a reset we were getting FFI methods installed in the image that were not in the registry.

Screenshot 2024-05-16 at 17 23 41

So when we copy the image to a new platform or location those compiled methods were not refreshed and were pointing to the wrong libraries. This was leading to errors in various places. As a temporary fix, we changed the logic from FFICalloutAPI>>#function:library: to not install an FFI method if an FFI method is actually installed (feenkcom/gtoolkit-vm-bindings@8ae318e). So it should be safe to run the same FFI method in parallel and it will be installed in the registry only once.

In GT we extracted part of the logic from FFICalloutAPI>>#function:library: into a method so we can guard the compilation using a mutex: FFICalloutAPI>>#compileFunction:library:sender:. We noticed that if we add a signal to this method we get events that the same FFI method is compiled multiple times, for example SkiaPictureRecorder class>>#primRelease:.

Screenshot 2024-05-17 at 10 39 28

If we look at the stack we notice that those calls are coming from different finalisation processes.

Screenshot 2024-05-17 at 10 39 16 Screenshot 2024-05-17 at 10 39 20

The only possibility for that is if we get multiple finalisation processes waiting on the mutex in FFICalloutAPI>>#compileFunction:library:sender:. In WeakArray class>>#finalizationProcess the first action is to signal the semaphore, so in theory if the current finalisation process gets stuck, another one could start.

We also saw cases when we got more compiled methods in a wrong state (this screenshot is from Pharo 10; all the ones above from Pharo 11):

Screenshot 2024-05-17 at 10 25 04

syrel pushed a commit that referenced this issue May 17, 2024
Metacello new
    baseline: 'GToolkitForPharo9';
    repository: 'github://feenkcom/gtoolkit:v1.0.778/src';
    load

All commits (including upstream repositories) since last build:
feenkcom/Bloc@ea1d38 by Aliaksei Syrel
Merge pull request #16 from StephanEggermont/patch-15

Limit window title length on creation in BlRustWindowingHost.class.st

feenkcom/Bloc@2703a0 by Stephan Eggermont
using exact limit for max window title length

feenkcom/Bloc@be1c53 by Stephan Eggermont
Limit window title length on creation in BlRustWindowingHost.class.st

67638 characters of json string was too much :)

feenkcom/gtoolkit-utility@57de7c by Alistair Grant
Add ProcessorScheduler>>gtQuiescentProcesses

feenkcom/gtoolkit-utility@e1a7d6 by Alistair Grant
[#3771] Ensure only a single mourn loop at a time.

feenkcom/Brick@eee845 by Andrei Chi�
Fix wrong call in Pharo 11

feenkcom/gtoolkit-coder@367931 by Andrei Chi�
Add GtSharedVariablesBindings>>#interactive to support Pharo 11

feenkcom/gt4pharo@a4eef5 by Juraj Kubelka
password styler: use just one pragma collector for all instances

feenkcom/gtoolkit-remoterunner@dcb832 by Alistair Grant
Add GtRrTaskExecutionData>>jobId

feenkcom/gt4git@ba1f74 by Sven Van Caekenberghe
renamed GtIceGitRepositoryExamples into IceGitCliRepositoryExamples

feenkcom/gt4git@8399ff by Sven Van Caekenberghe
added check for directory existence to GtIceRepositoryCreator>>#addLocalRepository

feenkcom/gt4git@3e022b by Don Roberts
Rename Git CLI support classes


feenkcom/gt4git@346bb2 by Don Roberts
Rename GtIceGitRepository to IceGitCliRepository


feenkcom/gtoolkit-world@d8536f by Sven Van Caekenberghe
disabled GtWorldByScripterExamples>>#clickOnWorldHomeCoderIcon by commenting out gtExample pragma

feenkcom/gtoolkit-world@be193e by Tudor Girba
Merge ef1564b651a8b81d4cab5ed8f1beed6d1940b07a

feenkcom/gtoolkit-world@80a353 by Tudor Girba
remove the tools from the main tab #3766

a1da6c by Tudor Girba
a bit more text (should be replaced by pictures soon) #3766

8ae1d3 by Tudor Girba
Merge cf4ffc3

d2b061 by Tudor Girba
work on the initial text #3766
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants