
Continuous Integration Infrastructure #3721

Closed
RX14 opened this issue Dec 17, 2016 · 64 comments

Comments

@RX14
Contributor

RX14 commented Dec 17, 2016

Currently, Crystal uses Travis CI for continuous integration. This works well, but it has some limitations. Travis currently allows us to test on our major architectures: 64- and 32-bit Linux, and macOS. However, in the past year we have gained ARM support in 32 and 64 bit, as well as support for FreeBSD/OpenBSD. These architectures would be difficult to test using Travis. Without continuous integration on a target triple, that triple is essentially unsupported and could break at any time. In addition, Travis lacks the ability to do automated releases. This makes the release process more error-prone and precludes nightly releases.

I have been working on setting up Jenkins as a replacement for Travis. Jenkins is a much more flexible system, as it allows connecting your own nodes with their own customised build environment. For example, we could test Crystal on an actual Raspberry Pi for every commit. We could also schedule jobs to create nightly builds, and authorised users on the web interface could kick off an automated release process.

Currently I have a test Jenkins instance running at https://crystal-ci.rx14.co.uk/; here is a (nearly) passing build. Jenkins builds can be configured by a Jenkinsfile in the repository, as with Travis. Here's the one I made for Crystal. I've documented the setup for the master and slave instances here. Currently I'm thinking of running every slave in QEMU/KVM on an x86_64 host for consistency between slaves. Automating slave installs using Packer seems trivial; see the sketch below.
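
As a rough illustration of the bake-then-boot idea (the Packer template name, output path, and VM sizing here are hypothetical, not the actual setup):

```sh
# Bake a slave disk image from a (hypothetical) Packer template.
packer build slave.json

# Boot the resulting image under KVM on the x86_64 host; adjust the output
# path to wherever the Packer builder writes its qcow2 image.
qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 \
  -drive file=output-qemu/slave.qcow2,format=qcow2 \
  -net nic -net user
```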

There are quite a few different options for Jenkins slaves, however. It's possible to create Jenkins slaves on the fly by integrating with different cloud providers. This has the added benefit of the environment being completely fresh on every build. It may also be cheaper, depending on build length, commit frequency, and hardware constraints. It might also be wise to mix this and the previous approach, for example using some Raspberry Pis for ARM, a long-running VM for OpenBSD, and Google Compute Engine for the x86 Linux targets (musl in Docker?).

Rust seems to use Buildbot instead of Jenkins, but Jenkins has really surpassed Buildbot in the last year in terms of being a modern tool suitable for non-Java builds (it released 2.0, added the Jenkinsfile, and gained seamless GitHub integration like Travis). I also have 3-4 years of experience working with Jenkins, but have never worked with Buildbot before.

The problem I have in proceeding is that I don't know the options and preferences @asterite and Manas have in terms of infrastructure, and how they would like this set up, before I sink too much time into creating QEMU VM images to run on a fat VM host.

TL;DR: CI on every target triple? Nightly builds? Yay! Now how do I proceed?

@asterite
Member

@RX14 This is awesome! Every point you make is a +1 compared to running things in Travis, at least for a project this big and with this many platforms to support. I'll try to discuss this at Manas next week and see how we can proceed.

@RX14
Contributor Author

RX14 commented Dec 17, 2016

In addition, if we have macOS install media, it seems we can use https://github.com/boxcutter/macos to create a macOS VM, so it looks like we really can test on every triple using Jenkins.

@RX14
Contributor Author

RX14 commented Dec 22, 2016

Any progress on this issue?

@asterite
Member

@RX14 We haven't had time to review this yet - some of the team is on vacation at this time of year (it's summer here ^_^) - but this is something that we'll definitely take a look at, and probably switch to, once we discuss it.

@mjago
Contributor

mjago commented Dec 26, 2016

The analogy in Ruby is rubyci.org running builds, tests and specs on multiple platforms in the background, whilst Travis and AppVeyor work in the foreground on marker versions, giving 'immediate' feedback to pull requests and branches. This works well since edge cases turn up constantly across the different platforms, whilst Travis usually catches the low-hanging fruit. Ruby uses chkbuild, a CI server written in Ruby. Apologies if you know all of this - I'm a contributor to Ruby/Spec and thought I would share. TLDR 👍

@RX14
Contributor Author

RX14 commented Dec 26, 2016

I think I would strongly prefer running builds for every platform for every commit. Whether the PR is "okayed" back to GitHub after only the faster builds have completed is a question which I think will have to be answered after it's all set up.

@spalladino
Contributor

Hey @RX14, so sorry for the delay in replying. We discussed this internally at Manas, and we agree that having a Jenkins (or equivalent) environment for managing the builds on multiple platforms would be awesome to have. If this is something you'd like to work on, and you feel comfortable with Jenkins, then Jenkins it is; we do have some experience with Jenkins and none with Buildbot, so it seems the best choice.

The first thing to work out would be hosting. We'd prefer to handle as much as possible of the infrastructure in-cloud, having a master Jenkins node running in Amazon (where we host most of our assets), and slaves covering all the required platforms.

So step one is to build a list of all targets and figure out the best place to run them. I guess a node in EC2 running QEMU (I understand there are no limitations on running QEMU on an EC2 machine, right?) would be a good choice for most architectures, though I'm not sure we can cover all of them this way.

What do you think?

@ysbaddaden
Contributor

Architectures should be tested on real hardware if possible. I discovered bugs when running Crystal on an AArch64 server (provided by Packet) that didn't happen in QEMU, for example. It's also very, very slow.

Scaleway has cheap ARMv7 servers; Packet has expensive but incredibly powerful ARMv8 servers (Cavium ThunderX × 2).

The same goes for Alpine: running in a container will be different from running in a VM, because the Alpine kernel is patched (grsecurity, ASLR, ...).

@RX14
Contributor Author

RX14 commented Jan 10, 2017

@spalladino absolutely no problem with the delay in reply.

Obviously I'd like to automate as much of the deployment of the master as possible. Docker would be my first choice as I have experience with it (and have already made a container). The master node doesn't need to be that powerful: probably only 2 GB of RAM and a somewhat decent processor. We can assess whether it needs more resources as the project grows.
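
For illustration, a minimal sketch of running a master in Docker using the official Jenkins LTS image and its usual web/agent ports (8080/50000); the container linked above may well differ in the details:

```sh
docker run -d --name jenkins-master \
  -p 8080:8080 -p 50000:50000 \
  -v jenkins_home:/var/jenkins_home \
  jenkins/jenkins:lts
```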

The list of targets is here: https://github.com/crystal-lang/crystal/tree/master/src/lib_c; in an ideal world we'll want something for every one of those. I don't want to run any targets in QEMU if possible, but acquiring hardware for every single target in the future is impossible. In hindsight, every non-x86 slave is going to end up being a special snowflake, so there's not much point in attempting QEMU for every slave.

Build slaves don't really need a particularly good connection, so they are fine to run at home if it comes to it. For example, I'd like to run the ARM targets on a Raspberry Pi as that's realistically going to be the most common device they're run on. The Raspberry Pi 1 is ARMv6, the Pi 2 is ARMv7 and the Pi 3 is ARMv8, and I know that AArch64 distros exist for the Pi 3. Raspberry Pis are quite a cheap non-recurring cost and racking solutions for them exist, so the outlook for ARM slave hardware looks bright.

x86_64 slaves should be easy to run in the cloud. Even if they need to be virtualised, KVM should ensure that they're running on a real CPU most of the time using VT-x. Architectures which you can get an AMI for can be run directly on EC2. AFAIK EC2 doesn't use containerised virtualisation, so we should have full control over kernel versions etc. for Alpine.

Another interesting question is that of LLVM versions: which LLVM version do we test with? Testing with every LLVM version on every architecture seems like a waste of time. We could pick a random LLVM version for each run (maybe deterministically from the commit hash).

@spalladino This is a topic which would probably benefit from real-time discussion, so don't hesitate to ping me on IRC/gitter if you want to chat. I'll be around all day (after midday gmt) tomorrow.

@ysbaddaden
Contributor

For LLVM, we should test a few versions. It's usually overkill, but we may break compatibility when supporting a new LLVM version or when the compiler uses more features. So we may:

  • use the stable branch by default (3.8, soon 3.9);
  • have 1 VM with the qualification branch (3.9, soon 4.0);
  • have 1 VM with the oldest supported release (3.5, soon 3.8?);

Note that ARM / AArch64 have the best results with 3.9 (maybe 3.8 is enough); older versions lead to crashes in release mode, for example.
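
As a sketch of how each VM could pin its LLVM version, assuming the matching llvm-config-X.Y binaries are installed and the build is pointed at them via the LLVM_CONFIG override that Crystal's Makefile honours:

```sh
# Stable-branch node
LLVM_CONFIG=llvm-config-3.8 make clean crystal std_spec

# Qualification-branch node
LLVM_CONFIG=llvm-config-3.9 make clean crystal std_spec
```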

BTW I couldn't find any AArch64 distributions for the Raspberry Pi 3 (only ARMv6 or v7). It can boot in 64-bit mode, but when I searched a few months ago, there was no kernel (only preliminary attempts). I'd love to see a distribution, though. Note that Packet was willing to sponsor us with an ARMv8 server; I can ping them back (or send a DM to @packethost on Twitter).

@RX14
Contributor Author

RX14 commented Jan 11, 2017

How about only running tests on all LLVM versions nightly? I'm a little wary of build times for PRs slowing down development.

@drhuffman12

What about nightly tests for all configs, and 'live' tests [per PR or per merge-to-xyz branch] only on one or a few select configs?

@ysbaddaden
Contributor

@RX14 What about primary builds that report a green state as quickly as possible, then additional builds that report more feedback? Or maybe have a trigger in commit messages to enable LLVM builds (e.g. matching /LLVM/i)? Usually we shouldn't care about LLVM except for particular branches / pull requests (supporting a new LLVM version, using new LLVM features, ...).

@RX14
Contributor Author

RX14 commented Jan 11, 2017

@ysbaddaden That could work. Trying things out should be quite easy once we have the infrastructure set up.

@spalladino
Contributor

I agree on having primary builds that can report a green state as quickly as possible. I'm not sure though about how to handle the additional builds: I think I'd start with nightly builds of master (as @RX14 suggested). Later, we could set something up to auto-merge a PR if all builds have passed and a committer has given a thumbs-up, having a quick primary build report an initial ok state.

Anyway, first we need to have the builds running somewhere, so let's go back to the hosting.

I wasn't aware of potential issues when running in QEMU as @ysbaddaden mentioned, so I guess that rules QEMU out (speed would not be much of a problem if we are using them for nightlies or "extra" builds).

  • For any Linux on x86_64, an EC2 node sounds like the best option. Should we use Docker for generating different isolated environments within the same node?
  • For Windows on x86_64 (when it's ready), a Windows EC2 node should do the trick, though we could check if the cost in Azure is lower.
  • For Mac OS X, there seem to be cloud providers for Mac, such as this one; we should look into them and choose the best in terms of pricing.
  • For ARMv8, if we can have support from Packet, that would be great. @ysbaddaden would you mind contacting them again? Feel free to CC me or @bcardiff to discuss sponsorship details. Also, if they are open to offering support for other architectures and we can reduce the cost from Amazon, all the better.
  • For other ARMs, I'd like to avoid having a rack here at the office; but if there is no cloud provider available, we'll go down that route.

Regarding LLVM versions, I'd pick a primary LLVM version for each platform for the primary builds, and then add other combinations (as mentioned by @ysbaddaden) as additional builds.

Am I missing something? What do you think? I'll be in the IRC in a few minutes if you want to follow up there, though I'd rather keep the conversation here (just for the sake of keeping the history more organised).

@drhuffman12

Maybe branches like [but possibly renamed if desired]:

  • master [for releases]
  • nightly [for nightly scans across all configs], which would be merged into master if all pass
  • (all others) [for 'live' scans on primary configs and (as @RX14 suggested) other configs when triggered by commit messages or in the PR], which would be merged into nightly if all pass

@RX14
Contributor Author

RX14 commented Jan 11, 2017

@drhuffman12 I don't think that changing the git repo setup would be a good idea.

@asterite
Member

Note that the only reason we keep compatibility with older LLVM versions in the source code is because for our release process we are stuck with LLVM 3.5 for the moment, but that should be upgraded to the latest LLVM version (and upgraded each time LLVM lands a new version)... but that's kind of tricky to do, as far as I know.

So in my opinion, I wouldn't have a matrix of LLVM versions to test against. In fact, once we upgrade the omnibus build to the latest LLVM version I would directly remove Crystal code that deals with older LLVM versions.

@asterite
Member

Also, old LLVM versions have bugs, so shipping Crystal with support for an older LLVM version means shipping a buggy version of Crystal... so that's another good reason to drop support for older LLVM versions.

@ysbaddaden
Contributor

Supporting the LLVM stable (3.8) and qualification (3.9) branches simplifies building on many distributions. Alpine and OpenBSD ship LLVM 3.8 for example.

It's not that hard to have compatibility for 2 or 3 LLVM versions; despite the breaking changes, the C API goes through a deprecation release before removal in the following release. The current complexity is that we now support 5 versions (3.5, 3.6, 3.8, 3.9, 4.0pre) which totals many breaking changes...

@ysbaddaden
Contributor

@spalladino we can still use QEMU, it's very good. I wouldn't have ported Crystal to ARM without it, but it's still an emulation, not real hardware, and there may be some quirks that it doesn't exhibit. Maybe not that much, though.

@spalladino
Contributor

Got it. Should we take longer than expected to set up real hardware, we can rely on it for the time being then.

@RX14
Contributor Author

RX14 commented Jan 12, 2017

@ysbaddaden I think for now we should be able to get real hardware for each build slave (raspi+aws(+packet)), but if we port to more unconventional architectures in the future, I think qemu will be required for those.

@spalladino
Contributor

For now the first step would be to run a few tests with a master node and a slave node on a rather standard architecture, see whether we want to keep slaves running or use jcloud to start and terminate them on the fly, and check whether it works to use multiple AMIs for isolating configs (or whether we need to rely on Docker or similar). Chris will kindly be making a few experiments during the weekend and we can pick up from there.

@ysbaddaden
Contributor

I couldn't find any AArch64 distributions for Raspberry 3

I stand corrected, it appears openSUSE has one: https://en.opensuse.org/HCL:Raspberry_Pi3

@RX14
Contributor Author

RX14 commented Jan 13, 2017

@ysbaddaden Thanks for the link! I think that archlinux-arm has an AArch64 distro for raspi 3 too (scroll to bottom): https://archlinuxarm.org/platforms/armv8/broadcom/raspberry-pi-3

@matiasgarciaisaia
Member

Just a note here - CircleCI offers macOS builds for Open Source projects. We should probably ping them when times come.

@RX14
Contributor Author

RX14 commented Mar 3, 2017

@spalladino I think it's better done using github comments by team members. That used to be how Jenkins approved building PRs before 2.0 and pipelines. It doesn't seem to be possible now in the pipeline plugin. Also, encoding metadata in commit data (especially the title) is quite ugly.

@spalladino
Contributor

spalladino commented Mar 3, 2017 via email

@RX14
Contributor Author

RX14 commented Mar 3, 2017

I think the best thing to do is to use CircleCI or Travis for an "initial smoketest" which runs on every single PR. This initial smoketest would be as simple as make std_spec crystal, followed by building the samples and running crystal tool format --check.
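
Concretely, the smoketest stage would boil down to something like this (make targets as in the repository's Makefile; whether samples is a separate target is an assumption):

```sh
make std_spec crystal            # run the stdlib specs and build the compiler
make samples                     # build the sample programs (assuming such a target exists)
bin/crystal tool format --check  # fail if any source file is unformatted
```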

A small bot can then be written (in Crystal!) which looks at GitHub issue comments and schedules Jenkins builds using the Jenkins API. Jenkins builds would run a full matrix validation suite. I'd argue that we should run full validation on every commit to master as well. Using a custom bot has the advantage of somewhat decoupling the CI interface from the implementation, so we can make the CI more fine-grained (run just Windows/macOS/Linux, say) in the future. This kind of setup seems very similar to what Swift has.
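
In practice such a bot would do little more than hit Jenkins' remote build API once it sees an authorised comment; the job name, parameters and credentials below are placeholders:

```sh
# Trigger a parameterised Jenkins build for a given commit (names are hypothetical).
curl -X POST -u "ci-bot:API_TOKEN" \
  "https://jenkins.crystal-lang.org/job/crystal-pr/buildWithParameters" \
  --data-urlencode "sha=$COMMIT_SHA" \
  --data-urlencode "targets=linux-x86_64,linux-i386"
```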

But first, we should get something simple and useful working. I suggest this: build every push to master, on several LLVM versions, using EC2. From there we can evaluate the cost of running on EC2 and which optimisations are worth the effort. I do somewhat feel we're getting lost in the details and optimisations instead of getting something working...

@RX14
Contributor Author

RX14 commented Mar 4, 2017

OK, I've set up the Jenkins master on the new server, and written some reproducible instructions on how to set it up here: https://github.com/RX14/crystal-jenkins/tree/master/master. Jenkins is now live at https://jenkins.crystal-lang.org/. Next steps: get EC2 credentials and configure Jenkins with them (and document it), and set up the actual job (and document it).

I'll try to focus on good documentation for this infrastructure, because I want the infrastructure to be well understood even if I'm busy. If you have any questions or suggestions, don't hesitate to ask me to improve the documentation. Also, do you think moving RX14/crystal-jenkins to the crystal-lang organization would be a good idea?

@vielmetti

@ysbaddaden - there is also a new ARMv8 Debian build for Raspberry Pi 3 at https://blog.hypriot.com/post/building-a-64bit-docker-os-for-rpi3/ which you might find of interest.

I'm working with Packet on getting their ARM infrastructure together for CI builds, and will bring this issue to the attention of the team here that's doing this work.

@RX14
Contributor Author

RX14 commented Apr 7, 2017

The build is set up on the new infrastructure and is (somewhat) working: https://jenkins.crystal-lang.org/blue/organizations/jenkins/crystal/detail/feature%2Fjenkinsfile/1/pipeline

Next steps include fixing #4089 and merging the completed Jenkinsfile (I'll PR it soon). Once this is done we can start running builds and setting status checks on master and PRs.

@drhuffman12

:)

@RX14
Contributor Author

RX14 commented Apr 26, 2017

I've very nearly got 32-bit support working on the CI, but I've hit this problem: https://jenkins.crystal-lang.org/job/crystal/job/feature%252Fjenkinsfile/11/execution/node/13/log/. It appears to me that my -m32 link flags aren't being passed to cc on macro runs. That is typically exactly what you want when cross-compiling, but as my libcrystal.a has been compiled using -m32, this simply doesn't work. It looks like we need another kind of link flags which gets passed to every cc invocation. @asterite, what are your thoughts?

@matiasgarciaisaia
Member

Hey @RX14!

I think I agree with you, at least in this case - the link flags should get forwarded to the macro run. I'm not sure, however, if there's any scenario in which you want those flags to not be passed to the macro run.

I'm still trying to reproduce this issue - I have to set the environment up and whatnot - because I can't see it from the code. The only mentions of CC or cc are in compiler.cr, and they all include the @link_flags.

But please do file the issue, so we can track it down 👍

@bcardiff
Member

bcardiff commented May 9, 2017

@RX14 Something I don't get: in the current CI there is a 32-bit environment and the specs are passing there. Why would a change in the infrastructure trigger the need to pass the link flags? I am not discussing whether it should or shouldn't (I am not fully convinced), but I wouldn't expect it to be an issue since there is a 32-bit environment running in Travis. What has changed?

@matiasgarciaisaia
Member

I think Jenkins' build is 64-bit, and so the issue would be "cross-compiling from x64 to i386 is broken".

@bcardiff
Member

bcardiff commented May 9, 2017

I would mimic the release process, i.e. use the latest 32-bit release to compile the next 32-bit release. That also matches what a user would be doing if contributing to the compiler from a 32-bit platform.

I don't think the flags (at least -m32) should be forwarded, because the macro run will execute the compiled program in the original environment, so in this case the compiler used for the macro run should be set up for 64 bits.

From the Jenkins log I guess you are trying to cross-compile the specs; there the macro run is used, which ends up calling the just-cross-compiled compiler with a libcrystal.a for 32 bits. Again, I would just compile the compiler and the specs in 32 bits without cross-compilation. But if not, at least don't cross-compile the specs. I think that should work.

@RX14
Contributor Author

RX14 commented May 9, 2017

@bcardiff In the current CI there are 2 Docker containers, one containing a 64-bit filesystem and one containing a 32-bit filesystem. This means that the linker itself is 32-bit by default, so you don't need to pass -m32. It still executes on a 64-bit kernel on Travis.

In the new environment, I didn't want to spin up 2 VMs for 32 and 64 bit, or use containers (added complexity), so I chose to use Debian's multiarch features. This seems to work very well, apart from when macro runs are required. It turns out, though, that this is impossible anyway because libevent isn't correctly packaged for multiarch, so 2 VM images are required.

Using a 32-bit release of Crystal in the 32-bit builds would be a good idea. I think that in the future I would like to make downloading a Crystal release part of the build process, instead of baking the Crystal version into the VM image. But I don't think that should require 2 VMs, or creating a chroot. Crystal should be able to utilise multiarch features to cross-compile for 32 bit. Someone will want to do it in the future, so I think we should support it.

I ended up passing -m32 by setting CC=cc -m32, which is probably actually the recommended solution to this problem; however, it seems the macro runs don't pass the target to the new compiler instance. I added a quick commit which probably fixes the problem (RX14@ad941fb), but it's a hack I can't test, and it's looking more and more like this is a dead end.
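
For reference, the multiarch route described above amounts to roughly the following on a 64-bit Debian host (the package list is illustrative, and as noted libevent's i386 packaging was the sticking point):

```sh
# Enable the i386 architecture and install 32-bit development libraries.
dpkg --add-architecture i386
apt-get update
apt-get install -y gcc-multilib libgc-dev:i386 libpcre3-dev:i386 libevent-dev:i386

# Force 32-bit code generation for every cc invocation during the build.
CC="cc -m32" make crystal std_spec
```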

Unfortunately there are no official Debian 32-bit AMIs, so I'm going to have to think about how else to do this :(

Sorry for the brain dump, I wrote this comment while working on these workarounds.

@RX14
Contributor Author

RX14 commented May 16, 2017

Debian doesn't provide 32-bit AMIs; there's a message telling you to use multilib instead, which is what I tried to use in the previous comment and failed. Getting it to work would require compiling the compiler for x86_64 before compiling for 32 bit. This is difficult as it would require both 64- and 32-bit development versions of the libraries used, which is tricky in certain cases as some packages still aren't fully multilib compliant. In general, Debian's 32-bit support seems a huge mess right now.

I could try creating a 32-bit AMI using bootstrap-vz, but there isn't an official manifest for it. If that doesn't work out, I'll have to think of something else.

@RX14
Contributor Author

RX14 commented May 16, 2017

OK, building a 32-bit AMI was easier than expected; I even nearly completed a whole 32-bit build: https://jenkins.crystal-lang.org/job/crystal/job/feature%252Fjenkinsfile/18/console. It fails to build the compiler due to some linking errors, however. Any suggestions?

I think the next steps are really to merge the Jenkinsfile and get nightly builds going, gradually expanding the available architectures. After that I'll try to get PR builds working, for which I will probably end up building a bot very similar to Rust's bors, but using the Jenkins API, passing the targets to be built and the git SHA as build parameters. Thoughts?

@RX14
Contributor Author

RX14 commented May 23, 2017

I've added swap space and used --threads 1 to control memory usage on 32 bit. It seems to be working. We now have a full matrix of 32- and 64-bit Debian using LLVM 3.5-4.0. I've created a crystal-nightly job which currently builds my fork nightly.
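
For the record, the memory workaround is roughly the following (sizes and paths are illustrative); --threads limits the number of codegen processes the compiler forks, which caps peak memory:

```sh
# Add a swap file so the 32-bit build doesn't exhaust RAM.
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Build the compiler with a single codegen process.
bin/crystal build src/compiler/crystal.cr --threads 1 -o .build/crystal
```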

Next steps are to merge the Jenkinsfile to get nightly builds kicked off, and to get notifications sent on failed builds so that the nightly results don't get ignored. That includes deciding which matrix we should include in the nightly builds.

A weird bug seems to have been exposed, though: both the 32- and 64-bit LLVM 4.0 builds failed with linker errors. A build log is here.

@Val
Contributor

Val commented Jun 2, 2017

A weird bug like crystal-lang/crystal_lib#25 or #1269 ?

@RX14
Contributor Author

RX14 commented Jun 2, 2017

Arrgh, this is why I need to finish crane and use it on the CI, so that we have a native crystal install and ditch the omnibus.

@straight-shoota
Member

Is there any progress on this? It would be great to have nightly builds. I think it would help avoid releases requiring an immediate follow-up bugfix release, since changes in master could be more easily tested in the wild ;)

@mverzilli

Totally agree. One thing that's certainly missing is a wiki page to specify what kind of support we're providing for each target. I'd like to basically "translate" this page to one in the Crystal repo wiki: https://forge.rust-lang.org/platform-support.html

This page should state our initial goals and not where we currently are, so we can use it as guidance to inform this issue.

I won't have time to start that until the end of next week, so if anyone wants to take a stab at it before, we can iterate from there.

@Val
Contributor

Val commented Aug 12, 2017

For the linking problems #4825 ...

@bararchy bararchy mentioned this issue Jan 9, 2018
@rishavs

rishavs commented Nov 14, 2018

It is probably too late for this, but in case we want to consider Azure DevOps as a cloud-hosted CI option, do let me know. I work in the DevOps team.

@j8r
Contributor

j8r commented Dec 14, 2020

We have GitHub Actions and Circle CI. I propose it is time to remove Travis (1 check remaining) and the duplicated Circle CI checks (note: same story for shards).
There are currently 36 checks running!

@RX14
Contributor Author

RX14 commented Dec 14, 2020

Yeah, Crystal CI's good now :)

@RX14 RX14 closed this as completed Dec 14, 2020