maintainability: properly document the build process #723

Open
cfriedt opened this issue Dec 5, 2023 · 13 comments
Labels
area: Documentation Issues related to documentations


cfriedt (Member) commented Dec 5, 2023

Was "maintainability: do not check in generated code"

Currently, the version of gcc that is contained in the Zephyr SDK (https://github.com/zephyrproject-rtos/gcc) contains some generated code that is checked-in (e.g. ./configure scripts).

This requires an additional manual step of regenerating the ./configure script from configure.ac (and many other support files) via autoreconf that may or may not be easily reproducible (e.g. the default autoreconf in Ubuntu might not work, and it might be necessary to get the latest from GNU).

It's generally bad to check generated code into version control and generally worse to require either manually patching the generated code or some specialized knowledge about how to do it.

The main issue is sustainability; rather than the build process being predictable and linear, it becomes unpredictable, non-linear, and not really sustainable. Without specialized tools, domain-specific knowledge, or a particular build machine or version, it is difficult for developers to make successful PRs to the SDK.

So I would like to just request that we do not check in generated code (in the form of ./configure scripts and so on), and instead insert (or populate) a dedicated step in the build process to simply regenerate those scripts.
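
For illustration only: in most autotools projects that do not check in generated files, the dedicated regeneration step boils down to something like the following (GCC is pickier -- it expects specific autoconf/automake versions, so the tool versions and paths here are assumptions, not a documented requirement):

```sh
# Sketch of a generic regeneration step; not the current sdk-ng/GCC process.
# Assumes compatible autoconf/automake versions are already on PATH.
cd path/to/source-with-configure.ac
autoreconf --force --install    # regenerate ./configure, aclocal.m4, Makefile.in, ...
```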

stephanosio (Member) commented Dec 7, 2023

maintainability: do not check in / edit generated code

We never edit generated code (configure scripts). Everything in the configure script is generated using autoconf from the relevant source files.

Currently, the version of gcc that is contained in the Zephyr SDK (https://github.com/zephyrproject-rtos/gcc) contains some generated code that is checked-in (e.g. ./configure scripts).

This is nothing particular to the Zephyr SDK. Upstream GCC does this, and so do all the other projects that use the GNU build system.

This requires an additional manual step of regenerating the ./configure script from configure.ac (and many other support files) via autoreconf that may or may not be easily reproducible (e.g. the default autoreconf in Ubuntu might not work, and it might be necessary to get the latest from GNU).

What I (and many others working with the GNU build system) do is to have the various common autoconf versions installed under their own prefix (e.g. /opt/autoconf-x.y) and add their bin directory to the PATH in the RC file for each build environment. This should be simple enough to do.
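
A minimal sketch of that setup, assuming autoconf 2.69 (the version recent GCC releases expect -- verify against the tree you are building) and an /opt prefix:

```sh
# Build and install a specific autoconf under its own prefix.
# The version and prefix here are examples, not project requirements.
VER=2.69
wget https://ftp.gnu.org/gnu/autoconf/autoconf-$VER.tar.gz
tar xf autoconf-$VER.tar.gz
cd autoconf-$VER
./configure --prefix=/opt/autoconf-$VER
make && sudo make install

# In the RC file for the relevant build environment:
export PATH="/opt/autoconf-$VER/bin:$PATH"
```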

The main issue is sustainability; rather than the build process being predictable and linear, it becomes unpredictable, non-linear, and not really sustainable.

What part of it becomes unpredictable? I understand that it can be hard to follow at first for the people who are not familiar with the GNU tooling; but, it is a fairly standard and very predictable process.

Without specialized tools, domain-specific knowledge, or a particular build machine or version, it is difficult for developers to make successful PRs to the SDK.

There is nothing particular to Zephyr SDK about how this works. This is just how the GNU build system works. It is not pretty, it is extremely outdated and far from ideal; but, I am afraid nobody has time to overhaul the entire GCC codebase to a different build system ...

It's generally bad to check generated code into version control

It certainly is not good to check generated code into a VCS in general; but, that is the standard process for the upstream GCC, and we are not going to deviate from that.

generally worse to require either manually patching the generated code or some specialized knowledge about how to do it.

Once again, we do not apply any manual patches to the generated code (configure scripts).

So I would like to just request that we do not check in generated code (in the form of ./configure scripts and so on), and instead insert (or populate) a dedicated step in the build process to simply regenerate those scripts.

Sorry, but we are not going to deviate from the upstream GCC process for this.

cfriedt (Member, Author) commented Dec 7, 2023

maintainability: do not check in / edit generated code

We never edit generated code (configure scripts). Everything in the configure script is generated using autoconf from the relevant source files.

Patching via git is ~equivalent to manually editing generated code.

In particular, given that the output of autoconf can and does vary from one machine to another, not only based on the release of autoconf but also based on the presence of other tools that it uses, it's a bit of a slippery slope.

Currently, the version of gcc that is contained in the Zephyr SDK (https://github.com/zephyrproject-rtos/gcc) contains some generated code that is checked-in (e.g. ./configure scripts).

This is nothing particular to the Zephyr SDK. Upstream GCC does this, and so do all the other projects that use the GNU build system.

That's a fallacy.

Most projects that use the GNU build system only generate the ./configure script when release tarballs are generated.

https://stackoverflow.com/a/3291181

My guess as to why GNU started doing this for GCC / Binutils was that enough previous tarball users complained that it wasn't there after they had switched to RCS.

Generally, it's bad, but it has clearly snowballed well out of control.

What I (and many others working with the GNU build system) do is to have the various common autoconf versions installed under their own prefix (e.g. /opt/autoconf-x.y) and add their bin directory to the PATH in the RC file for each build environment. This should be simple enough to do.

^^ This should be documented somewhere. Actually, so should the entire process.

The main issue is sustainability; rather than the build process being predictable and linear, it becomes unpredictable, non-linear, and not really sustainable.

What part of it becomes unpredictable? I understand that it can be hard to follow at first for the people who are not familiar with the GNU tooling; but, it is a fairly standard and very predictable process.

The last time I had to fix the build, it was because I had no way of predicting what was in the (private) AWS caches that are used by Zephyr's SDK builder. This was after painstakingly trying to reproduce what was done in CI for some time. Like weeks of effort due to what should have been easy to reproduce following some simple steps.

If it's as predictable as you suggest, then please document the steps to manually reproduce builds.

There is nothing particular to Zephyr SDK about how this works. This is just how the GNU build system works. It is not pretty, it is extremely outdated and far from ideal; but, I am afraid nobody has time to overhaul the entire GCC codebase to a different build system ...

There is domain specific knowledge (see your paragraph above).

There is no need to overhaul anything. Autotools may not be pretty but they do work.

However, currently, the documented process to build it is to make a PR to the Zephyr project.

That effectively creates a black box (due to insufficient diagnostics / privileged access / private AWS caches).

Once again, we do not apply any manual patches to the generated code (configure scripts).

Submitting patches to the configure script via git is equivalent to manually editing the generated code. It's bad practice in any case, whether upstream is doing it or not.

It (at least) doubles the amount of work that needs to be done for changes to the SDK.

Likely far more than 2x though, e.g. it took me maybe a couple of hours to edit the necessary .ac / .m4 files, and now it's going on several days of debugging the build (in CI as a black box).

The last time I had to fix something that was broken in the SDK, it took me weeks. Eventually, I realized it was due to a deprecated release of zlib or something like that. The tarballs still existed in Zephyr's AWS cache though, so the build actually succeeded in unpredictable ways.

I've been building autotools packages for close to 20 years. If it isn't obvious to me how to build the SDK, then how do you expect it to be obvious to a newcomer?

Please document the manual build process, even if that is only for a single host / target.

cfriedt changed the title from "maintainability: do not check in / edit generated code" to "maintainability: properly document the build process" on Dec 7, 2023
cfriedt added the "area: Documentation" label and removed the "invalid" label on Dec 7, 2023
stephanosio (Member) commented Dec 7, 2023

Patching via git is ~equivalent to manually editing generated code.

It is not. The script is checked into git as-is, without any modifications/patches.

In particular, given that the output of autoconf can and does vary from one machine to another, not only based on the release of autoconf but also based on the presence of other tools that it uses, it's a bit of a slippery slope.

That is not true. The configure script states which version of autoconf was used to generate it -- as long as you use the same version, the output should be exactly the same.
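
(A quick way to check, assuming a standard autoconf-generated script: the version is recorded in the header comment.)

```sh
# Prints something like "Generated by GNU Autoconf 2.69."
grep -m1 'Generated by GNU Autoconf' configure
```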

Most projects that use the GNU build system only generate the ./configure script when release tarballs are generated.

That is arguable. Many projects still include the pre-generated configure script in tree for convenience as well as for "predictability" (because you do not want a bunch of bugs saying "build fails because every developer is using a different version of autoconf").

Submitting patches to the configure script via git is equivalent to manually editing the generated code. It's bad practice in any case, whether upstream is doing it or not.

I am having a very hard time understanding how that works; but, either way, this is not a decision made by me or anyone else working on the Zephyr SDK -- it is the decision made by upstream GCC and, as a downstream project using GCC, sdk-ng is not going to deviate from that.

If you have a problem with this, please email the GCC mailing list.

The last time I had to fix the build, it was because I had no way of predicting what was in the (private) AWS caches that are used by Zephyr's SDK builder. This was after painstakingly trying to reproduce what was done in CI for some time. Like weeks of effort due to what should have been easy to reproduce following some simple steps.

I do not understand why the AWS cache matters here. The source tarball cache is literally a directory with the tarballs downloaded by crosstool-ng (that is uploaded after a crosstool-ng run).

If it does not exist locally, crosstool-ng will download everything from scratch (i.e. it will be 100% locally reproducible as long as you have a working internet connection and none of the mirrors are broken -- see below).

Likely far more than 2x though, e.g. it took me maybe a couple of hours to edit the necessary .ac / .m4 files , and now it's going on several days of debugging the build (in CI as a black box).

I think you are mixing up CI and crosstool-ng. The CI itself is pretty much just a wrapper around crosstool-ng (and Yocto for building host tools).

All the toolchain builds are done through crosstool-ng with the configs located inside the sdk-ng tree. Anyone familiar with crosstool-ng should be able to build the sdk-ng toolchains using the crosstool-ng toolchain config files (configs/*.config) in tree without much effort, as long as you have installed "the right set of packages," which is the hard part because everyone has a different working environment.
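
As a rough sketch (the config filename is an example, and this assumes a suitable crosstool-ng -- Zephyr maintains its own fork -- is installed and on PATH):

```sh
# Sketch of a local toolchain build from an in-tree crosstool-ng config.
git clone https://github.com/zephyrproject-rtos/sdk-ng.git
cd sdk-ng
cp configs/arm-zephyr-eabi.config .config   # example target; see configs/ for the rest
ct-ng oldconfig                             # re-sync the config with your ct-ng version
ct-ng build
```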

The last time I had to fix something that was broken in the SDK, it took me weeks. Eventually, I realized it was due to a deprecated release of zlib or something like that. The tarballs still existed in Zephyr's AWS cache though, so the build actually succeeded in unpredictable ways.

If you are talking about the local crosstool-ng run failing to download the source tarballs from broken mirrors, that happens. In fact, that was one of the reasons why the cache was introduced in the first place, aside from the download speed. I am afraid no amount of documentation is going to fix a broken third party mirror ...

Please document the manual build process, even if that is only for a single host / target.

I think the missing link here is crosstool-ng. You may be familiar with how autotools work; but, you do not seem to be very familiar with crosstool-ng, which sdk-ng uses to build toolchains -- if you were, you would probably have looked at the crosstool-ng output logs and manually invoked the gcc configure script with the exact command line that was used by crosstool-ng (yes, it is there in the logs); in which case, you do not have to go through the whole ordeal of waiting for CI (or a local crosstool-ng run, for that matter) to re-build everything from scratch -- instead, you can just check out https://github.com/zephyrproject-rtos/gcc/ and directly build and debug GCC locally.
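
Roughly (the log name and grep pattern are illustrative; the point is that the full configure command line is in the crosstool-ng build log):

```sh
# Find the exact gcc configure invocation crosstool-ng used, then reproduce it
# against a local checkout.
grep -n 'configure --build=' build.log

git clone https://github.com/zephyrproject-rtos/gcc.git
mkdir gcc-build && cd gcc-build
# Paste the configure command line copied from the log here, then:
#   ../gcc/configure <flags from the log>
#   make -j"$(nproc)"
```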

I can try to document hints like these in the FAQ for those who are not familiar with crosstool-ng. I suppose this should lessen the amount of frustration for newcomers who do not have much experience working with embedded toolchains -- though, crosstool-ng is a fairly standard tool for generating embedded cross compiler toolchains; so, many people contributing to sdk-ng tend to already have working knowledge of it, which I suppose is why we have not had many problems in the past with third-party PRs to sdk-ng from many people ...

As for documenting the whole process, I am afraid "take a look at what ci.yml does" is going to be the best answer unless someone is willing to dedicate their time translating the YAML in ci.yml into English ...

As for things seemingly randomly breaking, I am afraid no amount of documentation is going to ease the pain with that. Even I, as a maintainer of sdk-ng, sometimes spend days troubleshooting weird CI, crosstool-ng, gcc build system, binutils build system, third party mirrors, or whatever-other-crap-in-the-whole-chain breakages.

cfriedt (Member, Author) commented Dec 7, 2023

I'm not sure if one or two lines in a FAQ is sufficient.

It would be nice to know exact steps to build a toolchain.

What is maybe obvious to you likely is not obvious to others.

cfriedt (Member, Author) commented Dec 7, 2023

For reference, the following patches were required when building the 0.15.2 SDK manually. The CI build only worked because of deprecated packages (some with security vulnerabilities) being in the AWS cache.

Not that I'm saying the documentation should include transient patches, but it would be nice if someone didn't need to extrapolate everything out to a bash script to make SDK builds easily reproducible.

https://github.com/cfriedt/zephyr-sdk-builder

0000-crosstool-ng-update-to-zlib-1.2.13.patch
0000-poky-fix-io-fwide-issue-in-cross-localedef-native-2.27.patch
0001-crosstool-ng-update-to-expat-2.5.0.patch

stephanosio (Member) commented

I'm not sure if one or two lines in a FAQ is sufficient.

The FAQ could be more comprehensive. Here we already have a few candidates from the above.

It would be nice to know exact steps to build a toolchain.

What is maybe obvious to you likely is not obvious to others.

The problem is that it is not obvious to me what is not obvious to others, and it is very difficult to decide where the documentation should begin and end (e.g. should the documentation cover crosstool-ng 101, working with GCC, or even fixing the problem with a mirror that replaced an existing source tarball with the same exact filename/version number?).

The only fundamental solution to that is to provide very detailed documentation of the whole process; which, as I said above, will require a significant amount of effort from a willing party -- I just do not have the bandwidth to write such detailed documentation (or a book).

At least, the (somewhat implicit) expectation for sdk-ng contributors up until now has been that they have some experience working with embedded toolchains (and hence likely with crosstool-ng) in one way or another; and, if they had any questions specific to sdk-ng, I have answered them in the issues/PRs or privately in chat.

keith-packard (Collaborator) commented

As someone who maintains gcc-based toolchains for other projects (debian), and has been hacking autotools-based projects for well over 20 years, you're experiencing how people commonly used autotools 'back in the day'. You'd ship a generated configure script because that's what was expected. And that often meant that the generated script was checked into the VCS so that a bare check-out would exactly match the distributed tarballs. GCC is about as legacy a project as you will ever see, and they've stuck to this practice for a very long time.

Most other autotools-based projects changed to delivering an 'autogen.sh' script and expected users to run that to get the required configure script. Heck, there's even 'autoreconf' these days for this job.

However, GCC has very strict requirements about which autotools version you can use to generate the scripts; older or newer versions often simply fail because autotools doesn't guarantee backwards compatibility. Because of this, GCC is usually many versions behind the default autotools versions provided on most systems. For someone simply building the compiler, it's far more reliable to use the provided scripts than attempt to generate them locally.

Yes, this places a huge burden on anyone hacking on the compiler; as @stephanosio says, you end up installing the precise autotools versions required for GCC so that the generated scripts match what's in the VCS. But, once you've got it set up, things are fairly straightforward, if a bit icky -- you hack the source code, re-build the generated scripts and commit both together. With luck, the diffs to the generated scripts are easy to manually verify. And, yes, there is a strong temptation for those doing a drive-by change to simply manually edit both the source scripts and the generated scripts. Which means that when you review patches to the autotools scripts, the best practice is to apply the source script patch and then verify that the generated script patch matches.
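
A sketch of that review step, assuming the pinned autoconf from earlier in the thread is installed under /opt (patch name, version, and paths are assumptions):

```sh
# Apply the full patch (source + generated files), regenerate with the pinned
# autoconf, and check that the committed configure matches the regenerated one.
git am 0001-some-change.patch             # example patch name
cd gcc                                    # directory whose configure.ac changed
PATH=/opt/autoconf-2.69/bin:$PATH autoconf
git diff --exit-code configure            # non-empty diff => generated script doesn't match
```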

If you've ever looked at the autotools scripts that gcc uses, you'll probably understand why there hasn't been any serious attempt to replace them with cmake or meson. For every horribly ugly little kludge, there's someone who depends upon the existing behavior to get their work done.

cfriedt (Member, Author) commented Dec 10, 2023

As someone who maintains gcc-based toolchains for other projects (debian), and has been hacking autotools-based projects for well over 20 years, you're experiencing how people commonly used autotools 'back in the day'.

@keith-packard - as someone who has maintained gcc-based toolchains for other projects for the last 20 years (Gentoo based, Yocto based), I'm fairly confident in labeling my experiences.

Again, the point of this issue isn't trying to categorize the user. It's simply asking for better documentation and / or to improve the build process.

You'd ship a generated configure script because that's what was expected. And that often meant that the generated script was checked into the VCS so that a bare check-out would exactly match the distributed tarballs.

GCC is about as legacy a project as you will ever see, and they've stuck to this practice for a very long time.

Most other autotools-based projects changed to delivering an 'autogen.sh' script and expected users to run that to get the required configure script. Heck, there's even 'autoreconf' these days for this job.

The source-based distros that I use typically regenerate generated code as part of the build process (almost always). As a result, it is significantly easier to maintain the toolchain as the process is (again) linear - it does not really hide any skeletons, etc.

So whether or not a particular project checks in configure to revision control is mostly irrelevant to the people building it on a regular basis.

However, GCC has very strict requirements about which autotools version you can use to generate the scripts; older or newer versions

Yes, I've been told by both Stephanos and by our version of GCC that very specific autoconf versions need to be used.

There are 2 problems there:

  1. The suggested versions differ, and
  2. Neither version seems to work

If only there were a sequence of documented instructions .. 🤔

often simply fail because autotools doesn't guarantee backwards compatibility. Because of this, GCC is usually many versions behind the default autotools versions provided on most systems.

For someone simply building the compiler, it's far more reliable to use the provided scripts than attempt to generate them locally.

Yes, which is why GCC ships generated code / checks it into version control.

Most autotools projects only do this when creating a release tarball.

Yes, this places a huge burden on anyone hacking on the compiler;

Exactly - so why not lessen that burden?

  1. with some proper documentation, and
  2. by regenerating generated sources as part of the build process (using the approach source-based distros have used for decades)

The latter suggestion was where this issue started. While it would make everyone's lives significantly easier, that was deemed too much work by @stephanosio, so now we are left with door number 1.

-- you hack the source code, re-build the generated scripts and commit both together. With luck, the diffs to the generated scripts are easy to manually verify. And, yes, there is a strong temptation for those doing a drive-by change to simply manually edit both the source scripts and the generated scripts.

Again, there is this misconception that I haven't also been working with gcc fairly intimately for the last 20 years..

The only reason I've done the latter is because the suggested ways have not worked.

If you've ever looked at the autotools scripts that gcc uses, you'll probably understand why there hasn't been any serious attempt to replace them with cmake or meson.

I'm perfectly comfortable with autotools and the autotools scripts in gcc and (again) have been working with autotools projects and gcc for 20 years. I am far more familiar with autotools than CMake or meson.

For every horribly ugly little kludge, there's someone who depends upon the existing behavior to get their work done.

Sure...

I guess my argument here is that life can be made significantly easier with proper documentation.

Personally, when I contribute to a project, if the instructions are:

  1. Make a PR
  2. See if it works

I'm going to be skeptical about it.

Since it became significantly more complicated than that, and since I needed to manually set up a build environment to match what was in CI so that I could manually diagnose what the problem was, I thought it would be wise to ask for some documentation about how to manually set up a build environment to match what was in CI.

It was essentially the same gripe I had when I needed to build the SDK manually last time.

The correct resolution of this issue isn't about "maybe you've never contributed to an autotools project / gcc before", or "what reasons are there to not write proper documentation?"

It's more along the lines of, "yes, there is a conventional build flow, and here is a page that describes that".

With that, there is at least some starting point at an intuitive location for people, and a place to put knowledge that is otherwise maybe only just in @stephanosio's head at the moment.

cfriedt (Member, Author) commented Dec 10, 2023

The problem is that it is not obvious to me what is not obvious to others, and it is very [...]

Fair enough.

I would suggest starting from first principles with some assumptions. Try to solve a much smaller version of the bigger problem.

E.g. the user has an Ubuntu Linux environment, the build/host is x86_64, and the target is Arm. Must install these .debs, must manually build this version of that tool...

Even documenting an existing container image to run that has some of these things built already?
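
Something as small as this would already help (package names here are a guess at typical crosstool-ng build dependencies, not the documented sdk-ng list):

```sh
# Illustrative Ubuntu starting point; the real list belongs in the documentation.
sudo apt-get update
sudo apt-get install -y build-essential git wget bzip2 xz-utils unzip \
    autoconf automake libtool libtool-bin gperf bison flex texinfo \
    help2man gawk libncurses-dev python3-dev
```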

documentation cover crosstool-ng 101,

crosstool-ng has decent documentation already, so a link could be sufficient.

working with GCC, or even fixing the problem

There are already links to GCC and they have docs already.

The only fundamental solution to that is to provide very detailed documentation of the whole process; which, as I said above, will require a significant amount of effort from a willing party -- I just do not have the bandwidth to write such detailed documentation (or a book).

Well, that's one option. Maybe a detailed doc like that would be good overall, conceptually, but it's probably more work than necessary.

But why not simply write down a sequence of exact steps (i.e. commands) necessary to build one toolchain?

Ideally, snippets could even be factored-out to external scripts that can be used by both CI and by users.

People can extrapolate from there. If someone wants to build the macos tools, some optional steps could be added later.

At least, the (somewhat implicit) expectation for sdk-ng contributors up until now has been that they have some experience working with embedded toolchains (and hence likely with crosstool-ng) in one way or another;

Please, feel free to continue making that assumption or not.

It should be mostly irrelevant though.

stephanosio (Member) commented

Yes, this places a huge burden on anyone hacking on the compiler;

Exactly - so why not lessen that burden?

  1. with some proper documentation, and
  2. by regenerating generated sources as part of the build process (using the approach source-based distros have used for decades)

The latter suggestion was where this issue started. While it would make everyone's lives significantly easier, that was deemed too much work by @stephanosio, so now we are left with door number 1.

I am not really sure where you got the idea that it was "deemed too much work" to regenerate generated sources as part of the build process.

All I said was "this is not a decision made by me or anyone else working on the Zephyr SDK -- it is the decision made by upstream GCC and, as a downstream project using GCC, sdk-ng is not going to deviate from that."

It is just as much of a good practice to, as a downstream project, not make arbitrary decisions deviating from the way the upstream project does things. It really has nothing to do with how much work it would be to regenerate these generated sources as part of the build process.

Since it became significantly more complicated than that, and since I needed to manually set up a build environment to match what was in CI so that I could manually diagnose what the problem was, I thought it would be wise to ask for some documentation about how to manually set up a build environment to match what was in CI.

First of all, this issue was initially opened for "regenerating generated sources [in GCC] as part of the build process," and later changed to "properly documenting the build process" -- these two are completely different and independent topics; so, let us try not to mix these up.

Regarding "generating generated sources [in GCC] as part of the build process," this is a deviation from the upstream GCC development process and I have voiced negative opinions about it for the aforementioned reasons.

Regarding "properly documenting the build process," I have already clarified in #723 (comment) that there is room for improvement (e.g. providing an FAQ); but, for a detailed "full" documentation, a willing party will need to dedicate a significant amount of their time for it to happen.

With that, there is at least some starting point at an intuitive location for people, and a place to put knowledge that is otherwise maybe only just in @stephanosio's head at the moment.

Which part of ci.yml looks like "knowledge that is otherwise maybe only just in @stephanosio's head" to you?

stephanosio (Member) commented

But why not simply write down a sequence of exact steps (i.e. commands) necessary to build one toolchain?

Sure, that could be a good starting point; though, keeping it up to date and making it actually work locally would be easier said than done. It should be quite doable targeting a very specific environment though, as you have mentioned.

Ideally, snippets could even be factored-out to external scripts that can be used by both CI and by users.

Actually, this used to be the case (there used to be a script that was used by CI and could also be used locally to invoke crosstool-ng and Yocto build processes).

That script was removed with the addition of macOS and Windows host support because the CI infrastructure and the build process were too closely coupled for this to be practical (and, at the time of writing the CI workflow for all three major host operating systems, I did not have a very good idea of what it would look like at the end).

Now that ci.yml is fairly well established and stable, we could consider refactoring its build steps out to external script(s) that can also be used locally. I have nothing against such an approach.

At this time, I do not have any spare bandwidth to take on such an endeavour; but, if someone is willing to put their effort looking into it, I would be more than glad to review and provide feedback.

cfriedt (Member, Author) commented Dec 13, 2023

This would be a good doc to link to
https://crosstool-ng.github.io/docs/build/

Might be good to include the part about ct-ng build RESTART=<step> STOP=<other_step> or ct-ng libc_headers etc.

cfriedt (Member, Author) commented Dec 13, 2023

Probably would be good to mention that CT_DEBUG_CT_SAVE_STEPS=y is necessary to restart builds at a saved spot when they fail.
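
A sketch of that workflow (libc_headers is only an example step; step names vary between crosstool-ng versions):

```sh
# Enable saving intermediate steps, then restart a failed build partway through.
echo 'CT_DEBUG_CT_SAVE_STEPS=y' >> .config
ct-ng oldconfig
ct-ng build                          # fails at some step; completed steps are saved
ct-ng list-steps                     # show the available step names
ct-ng build RESTART=libc_headers     # restart from the saved state at that step
```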
