Replies: 31 comments
-
Compiling only gcmain3.c with -O0 instead of -O2 "fixes" the problem. It's also the case that adding a printf of "link" or referencing it in other ways causes the code to behave correctly. The compiler used was:
-
The code appears to work when compiled with the Solaris Developer Studio 12.6 compiler with optimization on. A few adjustments to the include files are necessary to make that compiler happy: arith.h, which defines the macro N_ARITH_SWITCH, needs to be included anywhere that includes "my.h", since functions defined in "my.h" use the macro even when they themselves aren't used (the implicit function definition leads to an undefined reference).
-
The same issue happens with gcc compiled code on armv7l architecture.
-
Adding -fno-strict-aliasing to the compilation of gcmain3.c appears to prevent the error, at least initially.
-
I had a very quick look at this issue and did some tiny experiments. It does not look to me like the garbage collector is the culprit, but somewhere else memory gets corrupted. As I am completely new to the code base: does anybody here know how memory is supposed to be managed, or is there any conceptual description anywhere? Or is the actual source code all we have? Just asking before I go down the rabbit hole...
-
Hi --
Get the latest update to source (or, more specifically, the makefiles) -- see commit cd4c59dbb3114e4d26260705cb3f10e7b4b67d31
the problem with the segfault in the GC code is a result of (mostly with gcc) not having -fno-strict-aliasing -- the code is very sloppy about pointer aliasing because it's running a virtual machine where memory is just an array of 16-bit words... except when it isn't. The authors defined various C structures, for ease of field access over multiple 16-bit words, and the compiler mustn't make *any* assumptions about there being no aliasing between pointers of arbitrary type. This most notably shows up in the GC because it's trying to touch everything...
If you, for example, toss in a printf to track things that are GC'd... the segfault goes away.
…-- Nick Briggs
On Aug 23, 2020, at 6:24 PM, pbaur ***@***.***> wrote:
I had a very quick look at this issue and did some tiny experiment. It does not look to me like the garbage collector is the culprit, but somewhere else memory gets corrupted. As I am completely new to the code base: does anybody here know how memory is supposed to be managed or is there any conceptual description anywhere? Or is the actual source code all we have? Just asking before I go down the rabbit hole...
-
There's the original Interlisp-D implementation of the GC, and this C code is an implementation of both some VM opcodes that were microcoded on the D-machine and, at a slightly higher level, some Lisp code that used those opcodes. I don't think the Lisp code is up on github yet.
-
I took the latest sources. Thank you very much for the quick response, explanation and hints. Being completely new to the code base, and apart from some historic reading on the architecture, new to the D-architecture and Interlisp-D, I have a little catching up to do. And while I am at it: lots of thanks for making the code available. I have always been intrigued by the machine as well as the software architecture of the D-Line; now I finally get a chance to dig into both. Looking forward to seeing this restoration project prosper, and happy to help (once I am at a point of being of any help, that is). Last but not least, my apologies for not introducing myself properly, I wasn't aware my profile is nameless: I'm Patrick Baur.
-
Hi Patrick -- what OS and architecture are you compiling for, and with which compiler/version? In case we overlap: I've been building on MacOS 10.11 with clang 8.00 (standard Apple release) 32- and 64-bit, FreeBSD with gcc-9.3.0 and clang versions 8, 9, and 12, 32-bit only, Ubuntu 20.04.1 with clang 10.0 (64-bit), Solaris 11.4 on SPARC T1 Niagara with the Solaris 12.6 Studio compiler (mostly 32-bit), Solaris 11.4 on x86_64 but compiling 32-bit mode with clang 6.0.0, and most recently a Beaglebone Black (armv7l) running Debian 9.9, gcc (which I don't have running at the moment, so I don't know which version).
-
Hi Nick. I started out on my desktop with the precompiled version from the medley repo: AMD Ryzen 7, Fedora 32 (virtualized). Works like a charm, so I didn't yet go into recompiling it (which I will do soon to get a baseline to compare the RPi to). That one is running gcc 10.2.1, the current Fedora repo version. On the Raspberry, I got a 4B/4GB (Linux raspberrypi 5.4.51-v7l+ #1333 SMP Mon Aug 10 16:51:40 BST 2020 armv7l GNU/Linux) with gcc (Raspbian 8.3.0-6+rpi1) 8.3.0 and clang version 7.0.1-8+rpi3 (tags/RELEASE_701/final). I might recover a Mac as well; that would be their current OS and compiler suite, if they still support the model.
-
OK. I'm not sure that Larry recompiled the versions in the repo after I had added -fno-strict-aliasing; it may have been compiled with reduced optimization, though. I'm afraid I don't have an RPi, so no joy there. Generally, I've been happier with the results from clang than from gcc. With a currently supported Mac you'd be able to install Catalina and the latest Xcode. If it's an older Mac you'd be able to install MacOS El Capitan (10.11.x) -- my everyday Mac is from 2008 and can't get past 10.11.x, but others are running Catalina (10.15.x).
-
I just recompiled your newest maiko on my desktop, and it appears to be working fine, at least for some five minutes. That is with gcc (against your makefile, which would favour clang; I just didn't have it installed, so I took gcc for a spin first). Once it looks like it is working properly, I'll try clang.
-
The D-machines that this is emulating are big-endian, and the C code was originally running on Motorola 68020 Sun3 systems and then on SPARC (both big-endian). The port to little-endian came later. Initially it was done on the Sun i386 boxes, and there is a special dispensation beyond the generic little-endianness because specifically there the compiler behaved differently. P.S., not my SPARC, but someone is kindly providing access to it.
-
Actually, I spoke too soon. While it does not crash as such (drop you into the command monitor due to a segfault), with optimization it hangs (does not accept any input anymore, but is still using its 100% CPU---not sure whether it was still kinda happy or went berserk). As you mentioned, it appears that the repo version was indeed compiled with reduced optimization. Taking optimization out, it looks fine, fingers crossed.
-
It always uses 100% of a cpu -- that's not a surprise since the idle loop and timers are within the emulated code. Normally you just click within a window and it will take the focus. I have seen very few cases of it locking up in a loop -- though it will drop into its own debugger, uraid, when things go south. If you're looking at memory access, there are access macros in lsptypes.h for things like
and so on. There is also a lot of conditional code for things like BIGVM, BIGATOMS, which affects the sizes of objects that are referenced out of the Lisp virtual memory. FYI, these days it's all BIGVM and BIGATOMS. As I mentioned, the code is very sloppy with regard to the sizes of objects it's accessing. I had quite a time getting it to run when compiled LP64, and you'll probably find things in there that I missed, though this problem in particular happens in ILP32 mode too.
-
This may not be the right place for this wide-ranging discussion, but...
-
Agreed, I already felt bad myself for usurping this issue's thread. Thank you for replying nonetheless, valuable insights, much appreciated!
Unlike the Lisp stack, the C stack appears to get completely obliterated. I read the VM specification, which is like the ham in the sandwich; I will try to get somewhat acquainted with the maiko code, but otherwise wait for the Lisp sources to be released. It is kinda hard to figure out what's going wrong if you don't know what right would look like to start with.
-
I'd look for a byteswap problem in the UNWIND opcode (which, I confess, I have no idea what it is supposed to do).
-
I can reproduce the FONTCREATE crash with the current code compiled with "gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516" on "Linux beaglebone 4.14.108-ti-r113 #1 SMP PREEMPT Wed Jul 31 00:01:10 UTC 2019 armv7l GNU/Linux" (armv7l-unknown-linux-gnueabihf) -- I think this is a different problem from the pointer aliasing problem in the garbage collector.
-
I didn't spot anything on
-
From
I don't know where this all has its effects, but it looks like the Lisp side has its own idea about words, word sizes, etc.
-
We maintained Interlisp-10 (and Interlisp-MAXC) through years where those numbers were variables (or would appear in (CONSTANT BITSPERHALFWORD) macros). A general warning probably is in order for the LL- and A- files: they might look like they're in ordinary Interlisp, but there are (mainly not yet written) rules about reading and compiling them. The emulated address space changed twice by removing bits from CDR-coding.
-
As Larry says, the Lisp code running on the virtual machine is working with 16-bit big-endian words. Some things are coded (in Lisp) in multiple 16-bit words. The VM accesses some Lisp structures directly, and has to be aware of the possibility of an endianness mismatch between the Lisp view and the native machine view -- hence the access macros.

Layer on top of this the fact that the original code was implemented on an ILP32 (int/long/pointer all 32-bit) system, and then ported to a more modern ILP32 system (MacOS X) and then to an LP64 system (again MacOS X). Along the way the default changed from unsigned characters to signed characters (which we now force back again with -funsigned-char), which caused quite a few bugs. There was also a lot of sloppiness in the original code with regard to whether native pointers were declared as such -- e.g., the conversion from a "lisp pointer" (an offset from 0 in the Lisp memory) to a native pointer was represented sometimes as an integer (variously int or unsigned) and sometimes as a declared pointer (e.g., unsigned short * or sometimes int *). With the ILP32 to LP64 transition I had to track down a bunch of issues where the code depended on wraparound in 32-bit arithmetic.

Now add the optimizations that the new compilers assume they can do because the C standard has changed with regard to pointer aliasing... and you start to see why it's difficult to get/keep this old code running. Oh, and there is no guarantee that the original code was correctly ported to deal with the endianness issues (witness the FBITMAPBIT fix I made just the other day).
-
The notes about word size constants and other things belong in the documentation. The README is fine unless it gets too long. Are there any new makefile "parts" (cf. #25)?
-
I tried to have another look at the issue but ran into some problem there: loading FONT from sources drops you into an endless loop complaining that
-
If you're poking around in lower-level code, and using sources rather than compiled code, you probably need to load EXPORTS.ALL first.
-
@pbaur -- Could you explain the reasoning that takes you to the FONT code (or any other Lisp code) rather than the VM emulator C code in search of this problem?
-
Two reasons: (a) I would like to know what it wants to do before I try to figure out why it fails to do it; (b) I'd like to isolate the problem as much as possible.
-
For stack issues, vs. the GC code issues, I've been compiling the C code with -DSTACKCHECK and it occasionally finds problems with the stack structure -- the code is the only "documentation" for the details of the Interlisp spaghetti stacks, and I haven't had a chance to work out what things should really look like.
-
This discussion has ranged widely, but it isn't clear what the problem is. For now, this is a Discussion.
-
The "link" pointer is pointing at unallocated memory. The same code is executed on little-endian machines, so I'm suspecting an endian-related issue in one of the helpful macros used in this code. The problem is reproducible.