New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support library source file archives #16348
base: master
Are you sure you want to change the base?
Conversation
Thanks for your pull request, @WalterBright! Bugzilla referencesYour PR doesn't reference any Bugzilla issue. If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog. Testing this PR locallyIf you don't have a local development environment setup, you can use Digger to test this PR: dub run digger -- build "master + dmd#16348" |
8e6ec94
to
1fab703
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to bog this down with bikeshedding but I would suggest -sar=
(matches the file extension), -src-ar=
(separates out two reasonably well used abbreviations) or even -source-archive=
(LDC will probably name this --output-sar
)
This looks similar to .tar
, why is that format not sufficient and requires wheel reinvention? (what about zipping?)
compiler/src/dmd/file_manager.d
Outdated
{ | ||
if (fileManager.useSourceArchive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be just else if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See line 144. else if
doesn't quite work here.
changelog/dmd.source-archive.dd
Outdated
|
||
To create a .sar file, such as one for Phobos' std: | ||
|
||
dmd -srcar=/home/duser/dmd/src/phobos/std |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it not needed to specify all the source files that will end up in the archive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It recursively searches the directory for .d, .di, .c and .i files only. Manually enumerating the list would be a maintenance chore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those should be specified as part of a release build any way, also dmd -sar **/*.d
is a thing
To use the std.sar file, nothing needs to be changed in the user's build system. | ||
dmd will automatically prefer using any .sar file it finds. To disable using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not simply pass the .sar
on the command line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's certainly possible as an additional behavior. One of the reasons for the existing behavior is so that one would not have to change one's makefile to make use of the .sar files.
I tried to make it as easy to use as possible. It's likely that after some experience with it, we will want to change the user interface to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makefiles are supposed to take this kind of strain, though. Especially given that I want to know when to rebuild it (so it either needs to be explicit or mentioned in a depfile)
I'm fine with any of those.
The same reason .tar has been reinvented many times. It started out as a Tape ARchival format. That means data can only be appended to the tar file, i.e. no table of contents at the beginning. Maybe tar has been enhanced with a ToC, but I don't care enough about it.
Zip is much more complex, with lots of unnecessary features I don't want to bother with. There's also the issue that the source files have to end with four 0 bytes. If an existing format doesn't allow for that, that means dmd would have to copy the data into another buffer then append the 0 bytes, which would be an unacceptable slowdown. The .sar format is stupid simple, and only requires a trivial amount of code to implement. Compare it to the compiler's libmach.d, libomf.d, etc., for overly complex formats (and unportable ones, at that). With our own format, we're not going to be bothered by changing versions, zip bombs, OPB (Other Peoples' Bugs), etc. |
What problem does this solve? And can it really not be solved without inventing yet another file format? |
I was wondering the same thing. I see documentation (though it is a little vague...) on the format and on how to use it, but the closest thing I see to motivation is that it doesn't provide any performance improvements. I also notice that there is no endianness defined, nor any conversion to/from native endianness being performed. That is important if we should expect these archives to be portable between different architectures. |
Big endian machines are obsolete. But, yeah, the spec should say the format is little endian. |
I added a "Rationale" section to the changelog.
Yes, it can be done using an existing file format. Implementing a clone of the zippers will take a lot more time and code, though, because zip files have a lot of other junk in them we don't need. Zip archivers are also vulnerable to "zip bombs". Using open source archivers can also have problems with their licenses. tar files don't have a table of contents at the beginning. The object code library formats are not portable between machines. And, a unique extension means it won't be confused with other archivers. P.S. I wanted for decades to implement this, but never got around to it. |
changelog/dmd.source-archive.dd
Outdated
|
||
A standalone archiver program can be easily created from the implementation in DMD. | ||
|
||
## Rationale |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rationale screams of 'a solution looking for problems', instead of actual problems looking for a solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not try it out first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favor of trying it out before pushing it into a release.
changelog/dmd.source-archive.dd
Outdated
## Rationale | ||
|
||
1. All the source files in a project or library can be represented as a single file, | ||
making it easy to deal with. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could simply be done by allowing multiple module x;
declarations in a single source file, so you can simply find -name '*.d' | xargs cat
to create a single file. A binary file is more annoying to edit or use version control on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could simply be done by allowing multiple module x; declarations in a single source file
It's not simple at all. The entire setup is designed with a 1:1 correspondence between module name and file name.
you can simply find -name '*.d' | xargs cat to create a single file.
Then the whole file has to be read to find the modules in it.
A binary file is more annoying to edit or use version control on.
True. I'm not suggesting doing that. If you're actively developing code, you'd be editing the individual files.
changelog/dmd.source-archive.dd
Outdated
making it easy to deal with. | ||
|
||
2. To compile all the source files at once with DMD, the command line can get | ||
extremely long, and certainly unwieldy. With .sar files, you may not even need |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solved by dmd -i
or dub
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The -i system is a bit wonky, with its exclusive and inclusive patterns, and various rules. The .sar is much simpler. If it's on the command line, its to be compiled in. If it's on the -I command, it's not compiled in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How dare you slander the holy -i
option! dmd -i
works like a charm, exclusion and inclusion patterns are optional. Please file a bugzilla issue if you find a problem with it. We can't fix every 'wonky' feature by adding a new feature. The new feature will enviably accumulate its own 'wonkyness'.
changelog/dmd.source-archive.dd
Outdated
In std.algorithm, for example, the individual algorithms could be placed into multiple | ||
files, since they don't refer to each other. This could also help the people who | ||
don't want the automatic "friend" status of declarations within a single module. | ||
.sar files can make much more granular modules more attractive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.sar is only a distribution file. When developing code, you're not hex editing the .sar file, you're dealing with the separate small files all the same, so it's not more convenient or attractive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never edit the files in /usr/include or any utilities I download and compile from source. I only edit the project I am actively working on. I doubt many users edit dub projects other than the maintainers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I constantly jump in the editor to Phobos functions, to check unittests or the implementation. The same for source code of dependant libraries. I don't see how having plain source code hidden in a custom archive is more convenient while developing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I constantly jump in the editor to Phobos functions, to check unittests or the implementation. The same for source code of dependant libraries. I don't see how having plain source code hidden in a custom archive is more convenient while developing.
IDE's are quite capable of allowing you to view source files in an archive. They do this for zip all the time. Especially in Java land.
If yours doesn't, then yes you will need to extract it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a zip, it's a custom format, not supported by any editor out there. So of course we need to extract it every time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course this needs to be implemented in the plugin level. And this is a very simple format that even my own engine has it implemented too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I wouldn't overthink it. IDE's have the facilities to support it.
We would need to make sure the big D plugins are updated to use it before we ship anything with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dustmite is another one that comes to mind. @CyberShadow
changelog/dmd.source-archive.dd
Outdated
|
||
4. A directory in a project can be anything. But multiple .sar files in a project means | ||
that's where the code is. Multiple versions of the project can exist in the same directory | ||
by using .sar files with different names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no problem being described here. Multiple versions can also exist in the same directory with import paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just think how much handier libphobos.a is than a directory full of .o files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latter is kinder to incremental compilation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no problem being described here. Multiple versions can also exist in the same directory with import paths.
Yes. This is inventing a solution to a problem that was solved 40 years ago by make, or any other build system.
Also, was solved by dmd -i
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latter is kinder to incremental compilation
I don't see how.
changelog/dmd.source-archive.dd
Outdated
5. Experiments with small programs (hello world) show a negligible change in compile | ||
speed. Much larger programs may show significant compile speed increases with .sar | ||
files, simply because there are fewer file operations. For slow file systems, such as | ||
SD cards, or network file systems, the speedup could be substantial. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this seems to be what it all actually comes done to: performance. But can we please let profiling guide us on what to optimize, instead of making speculative optimizations? This really begs for a test of what time this actually saves in real world scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back when I developed Warp, I did a great deal of profiling. Caching file lookups (doing stat on files) was a huge win. Unfortunately, I don't have a large D project to try this on. But one reason the:
srcar=on|off
is there is to make it easy to discern the difference. It's not possible to test it without building it first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warp was a different tool for a different language and from a different decade. This PR looks like it's in a state that it can be tested now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file lookup for source files is the same in D as it is for C .h files. But yeah, a different decade.
bbc7890
to
23284e6
Compare
@thewilsonator I took your advice and changed the switch to |
So I tried this out by copying Phobos' source tree, renaming The first thing I noticed was that a simple
So the problem of long command lines listing many source files isn't solved by sar files.
I copied the files over to a MicroSD card and repeated the test:
So I can't find performance benefits. |
compiler/src/dmd/file_manager.d
Outdated
if (FileName.exists(si) == 1) | ||
if (isFileExisting(si)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could changes like this be introduced in a separate refactoring PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a local nested function that encapsulates some repeated code.
File.read(fileNames[i][0 .. strlen(fileNames[i])], fb); // read file into fb | ||
content.importOffset = cast(uint)ar.length; | ||
content.importLength = cast(uint)fb.length; // don't include terminating 0 | ||
ar.write(fb.peekSlice()); // append file contents to ar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writing only the file contents ad-verbatim seems to be missing a trick. Why not serialize the Lexer pass? Or the generated .di
of the module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message handling that prints out source lines would need the source code. That need is the original reason why the file_manager caching system was created.
content.importOffset = cast(uint)ar.length; | ||
content.importLength = cast(uint)fb.length; // don't include terminating 0 | ||
ar.write(fb.peekSlice()); // append file contents to ar | ||
ar.write32(0); // append 4 bytes of 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is accounting for 2 trailing nulls for sentinel and 2 for lexer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's accounting for a null termination if it is a UTF32 source file.
compiler/src/dmd/file_manager.d
Outdated
@@ -23,6 +28,7 @@ import dmd.location; | |||
|
|||
enum package_d = "package." ~ mars_ext; | |||
enum package_di = "package." ~ hdr_ext; | |||
enum ext_sar = "sar"; // file extension for source archive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't this begin with .d
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's being proposed as a bespoke D package format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.dar files are already Disk ARchive files. .sar files are a variety of formats.
Possible conflation with other formats is why the header has a magic number and numerous checks for goodness.
I know. I had that idea after I submitted this PR. That part isn't implemented yet. This PR only covers imports covered by -I. |
I can't either with a small application. On a microSD card, I expect the operating system caches files, so repeated runs will be reading the files out of memory, not the card. If there was a difference, I'd expect it only on a cold system. |
Big Endian is alive and well. |
Indeed. It's been pronounced dead quite a few times, but it always manages to rise from its grave... |
changelog/dmd.source-archive.dd
Outdated
None of these make it a slam-dunk, after all, no other compiler does this that I'm | ||
aware of. Even so, some surprising uses can be expected of .sar files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Go, Rust, and C++ (cxx98 templates at a stretch, but also cxx20 modules) could be named as compilers with some sort of packaging format for speed/caching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- C++-98's -frepo produced an .rpo for all exported template in a given build, imported for linking.
- C++-20's modules generate .gcm for compiled module interfaces (can contain multiple modules within IIUC) for importing into compilation on
import
. - Go exports types and definitions to a .gox file for importing packages on
import
. - Rust crates export metadata and optionally encoded MIR to a .rmeta (or .rox for gccrs) imported on
use
. - I've just remembered C has precompiled headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To the point, none of those are full source archives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yet, all of them (except frepo) can/are used to replace source files for speeding up compilation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that vein, you could include the precompiled headers generated and used by my Symantec C++ compiler. They did substantially improve compile times. But it's not a system I wished to propagate to D. C++ made a mistake going with their binary import system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of them (except frepo) can/are used to replace source files for speeding up compilation.
This doesn't replace source files, which is quite an important difference e.g. lets say file system accesses are just so catastrophically expensive that this is worthwhile, it'd still be parsing them and running semantic analysis every single time (long story short this is where the actual pain is - and more importantly the memory consumption. I'd take a 10% longer build if it didn't use 70GB of ram to do a naive build of our code)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't a JAR file exactly this (and more) ?
Personally, I see any performance increases as a far-fringe benefit. Modern NVMe drives have largely solved the seek-time problem, so the biggest improvements are decreased sector wastage and improved sequential reads (which NVMe still benefits from), however sequential read only improves if this is not implemented as a memory-mapped file and instead loads everything into RAM once. What everybody is missing here is how this solves a major problem in D library distributions. Right now DUB "packages" are git repos. While it is a simple solution, it comes with some severe penalties and limitations. Consider the following examples:
With this change we can finally make all of that go away. To my mind this change is aimed entirely at improving our tooling story. Code-D/Serve-D is obviously the most immediate beneficiary of these archive's. But the tooling possibilities extend far beyond that and are worth serious discussions that far exceed the scope of a PR comment. This PR has my unequivocal endorsement, and I'd be happy to push the merge button on it. |
Can confirm that Windows caches file data. Currently about 1/4th of my total RAM is being used by Windows as a Data+Code cache. You can see it in the Task Manager starting with Windows 10. |
Last I checked, dub used .zip files to distribute packages. But even if it didn't, what's stopping dub from solving these problems that inventing a new format is going to eliminate? Why doesn't dub just eliminate the "git-flotsam" on its own? |
I assume you meant
? (you didn't adjust the second path) |
Yes both should be modified. |
cfff386
to
b0540c4
Compare
This is not a language change, but I still think it should go through a DIP. It is a big change pushed onto the community by DFL. The same should go for any high-impact change in any component owned by the DFL, whether it's in the language or not. In case it's not clear - the feature, and especially if DFL to start shipping such
|
A few more: Windows shell, fuse (posix), 7zip. Using something more standard like zip doesn't seem like such a bad idea against such a list ;) Note: I already sent Adam Wilson a tutorial on this prior to CyberShadow's comment for Windows shell. Oh and it gets worse, using the original spec, it will need to be entirely reengineered before something is released using it. Once it becomes obvious that moving to a block-based design with cli arg support would be the bare minimum for it to scale long-term. I am not arguing for or against this, but I am arguing against not doing it right and causing massive issues down the line (like needing a reengineering in multiple different codebases that we may not control). |
I went through it again and fixed it so it doesn't do unnecessary file stats. It gives a small but consistent edge in compile performance on helloworld. But not enough to be a selling point in and of itself. I've talked with @LightBender about changing Phobos from a small number of files with collected routines in them, to a much more granular approach, which means several times as many files. It may show a more significant compile speed advantage with large numbers of files. (Phobos doesn't actually have many files in it.) |
implemented suggested changelog changes
All that is actually needed for those is to be able to extract a .sar file into its constituents. I haven't bothered yet to write a utility to do that, but the code to implement such a utility can just be extracted from the compiler source. It would be convenient for those tools to build it in, but it isn't necessary. |
7b88bd1
to
883de17
Compare
Adds the compiler switch -srcar so the compiler can read a source archive file rather than look for individual files.
The code to write and read the source archives is in file_manager.d
cc @LightBender
It works on my system. But I don't have a test case here because I haven't spent the time learning how to do scripts in the test suite. The script would look like:
where
path
is wherever Phobos is on the test machine.P.S. This is not a D language feature.