API: failures and race conditions with concurrent publish update operations #1125

Open
DrLex0 opened this issue Oct 26, 2022 · 8 comments · May be fixed by #1156, #1261 or #1271
DrLex0 commented Oct 26, 2022

Detailed Description

Using Aptly 1.5.0.
Running two PUT {api_url}/api/publish/{path}/{distribution} API calls on the same path at the same time leads to various failures and a potentially unpredictable end result. It seems as if there is no locking at all on this operation?

This looks like a regression introduced in the latest Aptly release. I cannot remember having encountered this while we were using version 1.4.0.
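For illustration, a stripped-down sketch of the trigger (this is not the attached reproduction script below; the server URL is a placeholder, and it assumes an empty JSON request body is enough to trigger a plain publish update):

    import threading
    import requests

    APTLY = "http://your-aptly-server:8080"  # placeholder
    URL = APTLY + "/api/publish/test-aptly-race-bug/jammy"

    def publish_update(tag):
        # Ask aptly to re-publish the local repo; both calls hit the same path.
        resp = requests.put(URL, json={}, timeout=300)
        print(tag, resp.status_code, resp.text[:200])

    # Fire two publish updates at (almost) the same time.
    threads = [threading.Thread(target=publish_update, args=("call-%d" % i,)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()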

Context

We run builds on a Jenkins server that publish Debian packages to repositories. Often there are multiple builds that want to publish their different products to the same repository. Since there are multiple workers, it is not unusual that 2 builds try to upload their products and do a publish update at the same time. Often the build will fail in this case, which is already annoying enough on its own. But it gets worse: sometimes the API calls don't even seem to return an error and everything seems to be fine, but one of the packages is not in the repository. This will then lead to weird errors later on and a lot of wasted time figuring out what exactly happened.

Your Environment

Attached is a Python 3 script, together with some Debian packages randomly taken from Ubuntu jammy, that reproduced this problem on the first try for me.
aptly-publish-bug.tar.gz

  1. Create and publish a repository test-aptly-race-bug.
    I have also added a random package to it, just to ensure the following steps will start from a similar state.
    aptly repo create test-aptly-race-bug
    aptly repo add test-aptly-race-bug vim_8.2.3995-1ubuntu2_amd64.deb
    aptly publish repo -distribution jammy test-aptly-race-bug test-aptly-race-bug
  2. Open a terminal, unpack the tarball, and cd to the gnome-shell directory.
  3. In this terminal, have this command line ready but don't execute it yet:
    ../publish_pkgs.py -a 'http://your-aptly-server:8080' test-aptly-race-bug jammy gnome-shell_42.4-0ubuntu0.22.04.1.dsc gnome-shell_42.4-0ubuntu0.22.04.1_source.changes
    (Adjust the -a argument to point to your aptly test server, or edit DEFAULT_APTLY_URI in the script itself and omit this parameter.)
  4. Open another terminal and cd to the tzdata directory.
  5. In this terminal, have this command line ready:
    ../publish_pkgs.py -a 'http://your-aptly-server:8080' test-aptly-race-bug jammy tzdata_2022e-0ubuntu0.22.04.0.dsc tzdata_2022e-0ubuntu0.22.04.0_source.changes
    (Again, update or omit -a argument.)
  6. Hit enter in the gnome-shell terminal, and then as quickly as possible in the tzdata terminal as well, such that the 2 instances of the script start almost at the same time.
  7. Kaboom 💥.

Journalctl on the aptly server during the failure:

Oct 26 13:37:30 mocha aptly[31969]: Signing file 'Release' with gpg, please enter your passphrase when prompted:
Oct 26 13:37:30 mocha aptly[31969]: Signing file 'Release' with gpg, please enter your passphrase when prompted:
Oct 26 13:37:30 mocha aptly[31969]: Clearsigning file 'Release' with gpg, please enter your passphrase when prompted:
Oct 26 13:37:30 mocha aptly[31969]: Clearsigning file 'Release' with gpg, please enter your passphrase when prompted:
Oct 26 13:37:30 mocha aptly[31969]: [GIN] 2022/10/26 - 13:37:30 | 500 |  310.528273ms |   192.168.12.33 | PUT      "/api/publish/test-aptly-race-bug/jammy"
Oct 26 13:37:30 mocha aptly[31969]: Error #01: unable to update: unable to rename: rename /mnt/dev-artifact/backup/mocha/public/test-aptly-race-bug/dists/jammy/main/Contents-amd64.tmp.gz /mnt/dev-artifact/backup/mocha/public/test-aptly-race-bug/dists/jammy/main/Contents-amd64.gz: no such file or directory
Oct 26 13:37:30 mocha aptly[31969]: [GIN] 2022/10/26 - 13:37:30 | 500 |  325.660056ms |   192.168.12.33 | PUT      "/api/publish/test-aptly-race-bug/jammy"
Oct 26 13:37:30 mocha aptly[31969]: Error #01: unable to update: unable to rename: rename /mnt/dev-artifact/backup/mocha/public/test-aptly-race-bug/dists/jammy/main/binary-amd64/Packages.tmp.bz2 /mnt/dev-artifact/backup/mocha/public/test-aptly-race-bug/dists/jammy/main/binary-amd64/Packages.bz2: no such file or directory
Oct 26 13:37:30 mocha aptly[31969]: [GIN] 2022/10/26 - 13:37:30 | 200 |   69.542796ms |   192.168.12.33 | DELETE   "/api/files/upload_bxvkuihc"
Oct 26 13:37:30 mocha aptly[31969]: [GIN] 2022/10/26 - 13:37:30 | 200 |   73.519901ms |   192.168.12.33 | DELETE   "/api/files/upload_cmttlfyd"

It is obvious that the two instances of the API call simply run concurrently and get in each other's way: one instance has already moved or deleted files that the other is still using. The specific failure may vary depending on the timing of starting both instances of the script. Sometimes the API call does not break entirely, but the final result lacks one of the packages because one instance overwrote the Packages file modified by the other.

DrLex0 changed the title API: failures and race conditions if 2 concurrent publish update operations API: failures and race conditions with concurrent publish update operations Oct 27, 2022
randombenj added the bug label Nov 21, 2022

randombenj commented Nov 30, 2022

This was probably introduced in the background API changes: #971
I created a mutex that locks publish updates in the API: #1130

Despite all your awesome preparations 🎉 I was not able to reproduce your issue. However, I encountered different problems while publishing, which are now also fixed by this change.

Would you mind testing this version to see if it fixes the problem?

aptly_1.5.0+20+gba222732_linux_amd64.tar.gz

randombenj self-assigned this Nov 30, 2022

DrLex0 commented Jan 5, 2023

I could try it but I suspect the ease of reproducing this depends on factors like I/O speed. Our setup has storage on an NFS mount and the database is huge (several mirrors of various Ubuntu releases). Deploying a development build on our production setup isn't really an option though, and if I try it in a fake setup, it will probably not be reproducible either.

So, we're sort of trusting your judgment that the fix will work, but to avoid this bug in the meantime, we have wrapped some duct tape around it by implementing a mutex server and having each build script acquire this mutex before performing the publish PUT call.
We have just noticed that this isn't sufficient, though: it looks like the repos/…/file POST call that comes right before it must also be mutexed. Otherwise, if two builds do their POST at the same time and then perform two consecutive publish PUT calls, even when protected by our duct tape, only one of the two added packages may end up in the updated repo, with no errors logged. I guess this log line hints at this:

Dec 23 16:56:47 mocha aptly[31969]: 2022/12/23 16:56:47 Executing task synchronously

This message is logged both when doing a publish call and when doing a repos/…/file call, and I guess it means as much as “this needs to be mutexed”. So, it appears you may also need to implement a similar fix for the repos/…/file API call.
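For illustration, the duct tape roughly has this shape (our actual mutex server is not shown; here a local file lock stands in for it, and all names and paths are placeholders):

    import fcntl
    import requests

    APTLY = "http://your-aptly-server:8080"  # placeholder
    LOCK_PATH = "/tmp/aptly-publish.lock"    # stand-in for the real mutex server

    def add_and_publish(repo, distribution, upload_dir, deb_name):
        # Hold one exclusive lock around both the repos/.../file POST
        # and the publish PUT, so no two builds interleave these calls.
        with open(LOCK_PATH, "w") as lock:
            fcntl.flock(lock, fcntl.LOCK_EX)
            try:
                r = requests.post(
                    "%s/api/repos/%s/file/%s/%s" % (APTLY, repo, upload_dir, deb_name),
                    timeout=300)
                r.raise_for_status()
                r = requests.put(
                    "%s/api/publish/%s/%s" % (APTLY, repo, distribution),
                    json={}, timeout=300)
                r.raise_for_status()
            finally:
                fcntl.flock(lock, fcntl.LOCK_UN)

    # add_and_publish("test-aptly-race-bug", "jammy", "upload_dir", "some-package.deb")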


randombenj commented Mar 8, 2023

@DrLex0 After digging into the async implementation a bit more, I do have a suspicion.
Would you mind trying to recreate the issue while adding ?_async=true to the publish API calls?

As far as I understand, there is some resource management implemented in the async API, but the non-async path just executes without it. This means that when two API calls are made without ?_async=true, they are still executed in parallel.
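For reference, a minimal sketch of what that looks like on the client side (placeholder server URL; with _async the call is expected to return immediately with an HTTP 202 and a task description instead of blocking until the publish finishes):

    import requests

    APTLY = "http://your-aptly-server:8080"  # placeholder
    resp = requests.put(
        APTLY + "/api/publish/test-aptly-race-bug/jammy",
        params={"_async": "true"},   # the only change: queue the update as a task
        json={},
        timeout=60,
    )
    print(resp.status_code, resp.text)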

@gmelillo

I hit this issue today, and with the binary provided by @randombenj the problem is solved.
Is there any option to have this bugfix released in 1.5.x?


jmunson commented Jun 7, 2023

Using _async=true on publish updates seems to have fixed this, and we no longer see the apt hash sum mismatch errors.
However, we are seeing reports of packages that appear to have been uploaded and registered successfully but never get added to the repo. Let me know if you'd rather track this in another ticket, but it does appear to be the same race condition, just on the "/repos/:name/file/:dir/:file" endpoint.

When this is triggered, the package can be found with aptly package search but does not show up in aptly repo search. Slightly redacted logs below:

Jun 07 16:27:41 aptlyhost aptly[46662]: [GIN] 2023/06/07 - 16:27:41 | 200 |  982.623296ms |     1.0.0.25 | POST     "/api/files/1686155260484-b'5D6EZVBSJ3QMV6BB'"
Jun 07 16:27:41 aptlyhost aptly[46662]: 2023/06/07 16:27:41 Executing task synchronously
Jun 07 16:27:43 aptlyhost aptly[46662]: [GIN] 2023/06/07 - 16:27:43 | 200 |   56.138473ms |    1.0.0.198 | POST     "/api/files/1686155263670-b'2HTLF3RJORCMLT2N'"
Jun 07 16:27:43 aptlyhost aptly[46662]: 2023/06/07 16:27:43 Executing task synchronously
Jun 07 16:27:45 aptlyhost aptly[46662]: Added: packageone_2023.6.200_amd64 added
Jun 07 16:27:45 aptlyhost aptly[46662]: [GIN] 2023/06/07 - 16:27:45 | 200 |  3.285205584s |     1.0.0.25 | POST     "/api/repos/v2_any_stable_common/file/1686155260484-b'5D6EZVBSJ3QMV6BB'/packageone_2023.6.200_amd64.deb"
Jun 07 16:27:45 aptlyhost aptly[46662]: Added: packagetwo-2023.6.3_2023.6.3_amd64 added
Jun 07 16:27:45 aptlyhost aptly[46662]: [GIN] 2023/06/07 - 16:27:45 | 200 |  1.399051772s |    1.0.0.198 | POST     "/api/repos/v2_any_stable_common/file/1686155263670-b'2HTLF3RJORCMLT2N'/packagetwo-2023.6.3_2023.6.3_amd64.deb"
Jun 07 16:27:45 aptlyhost aptly[46662]: [GIN] 2023/06/07 - 16:27:45 | 202 |   50.074073ms |     1.0.0.25 | PUT      "/api/publish/v2_any/stable?_async=true"
Jun 07 16:27:49 aptlyhost aptly[46662]: Signing file 'Release' with gpg, please enter your passphrase when prompted:
Jun 07 16:27:49 aptlyhost aptly[46662]: Clearsigning file 'Release' with gpg, please enter your passphrase when prompted:


$ aptly.main package search --format "{{.}}" "packageone (=2023.6.200)"
map[Architecture:amd64 Description: packageone Runtime
 Filename:packageone_2023.6.200_amd64.deb FilesHash:c7d49b0f29e108c2 Key:Pamd64 packageone 2023.6.200 c7d49b0f29e108c2 MD5sum:902b3ac62cc3ea8316e8120757a2a107 Maintainer:packageone Runtime Team Package:packageone Priority:optional SHA1:5ac08024becd2817f2ae006cb56c73e64bf195d0 SHA256:810e7ce9c11855204259f6598d474e5ddfe2c3b6544ea787ce9497c40180a339 SHA512:0b5bd51ab781e7617babcc86b6abf6986f3c5fd3bef60e96f765af455fbf438c2c86554fc1b62b1637e4e0da2a04f3a6aa7ed6c804f930a61dfbe9971754eb9c Section:contrib/devel ShortKey:Pamd64 packageone 2023.6.200 Size:270759594 Version:2023.6.200]
map[Architecture:arm64 Description: packageone Runtime
 Filename:packageone_2023.6.200_arm64.deb FilesHash:b3d0f79d971b378a Key:Parm64 packageone 2023.6.200 b3d0f79d971b378a MD5sum:5c1804b35249a17d9467869fd811817a Maintainer:packageone Runtime Team Package:packageone Priority:optional SHA1:9402ea4a32d2fecdb72de2b8c015d34a4ac95f86 SHA256:6c741a8fedb0ad2ae1a6af8ec42c840086ed4fc39894e62d9fb2c342d5dc0cf0 SHA512:38efcbc4f6516f01a8805fb54c31d35c86c3d77c70efec0319b2140b26afa1d7498ce82355dc387df3618c604b4b8df2373afdb80d7d91c70a1f6fe5fa3c0684 Section:contrib/devel ShortKey:Parm64 packageone 2023.6.200 Size:268873502 Version:2023.6.200]
$ aptly.main repo search v2_any_stable_common "packageone (=2023.6.200)"
ERROR: no results


DrLex0 commented Jun 20, 2023

@jmunson If it's the same race condition, then another ticket wouldn't be necessary.


jmunson commented Jun 30, 2023

Just an update: we're now hitting both PUT /api/publish/:path/:distribution and POST /repos/:name/file/:dir/:file with _async=true, and have not experienced the original issue of corrupted by-hash Packages files that result in "hash sum mismatch".

Concurrent operations seem less of a problem now, but we do still see complaints of missing packages. My current suspicion is that the script that registers the file and then hits the publish endpoint is publishing before the file has finished being added. I'm still not entirely confident this is what is happening and haven't had a clean reproduction yet, but I do wonder whether what we really need now is to actually utilize the TaskIDs, so that after we add a file we call /api/tasks/:ID/wait to ensure our call to publish does not happen before the add has finished. If this is the case, I still wonder what happens if something else attempts to publish while the first job is waiting for the /wait API to return.

Does this idea seem worthwhile to you, or is there another approach I might be missing here?
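A rough sketch of that add-then-wait-then-publish flow, in case it helps the discussion (placeholder host, upload directory, and package names; it assumes the async response carries the task ID in an "ID" field and that /api/tasks/:ID/wait blocks until the task completes, which may need adjusting against the actual task API):

    import requests

    APTLY = "http://aptlyhost:8080"   # placeholder
    REPO = "v2_any_stable_common"

    def wait_for(task_id):
        # Block until the given background task has finished.
        requests.get("%s/api/tasks/%s/wait" % (APTLY, task_id), timeout=600).raise_for_status()

    # 1. Add the already-uploaded file to the repo as a background task.
    r = requests.post(
        "%s/api/repos/%s/file/some_upload_dir/packageone_2023.6.200_amd64.deb" % (APTLY, REPO),
        params={"_async": "true"}, timeout=60)
    r.raise_for_status()
    wait_for(r.json()["ID"])

    # 2. Only then trigger the publish update, also as a background task.
    r = requests.put(
        APTLY + "/api/publish/v2_any/stable",
        params={"_async": "true"}, json={}, timeout=60)
    r.raise_for_status()
    wait_for(r.json()["ID"])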

@runitonmetal

I stumbled on this issue in a use case that I guess is outside typical usage. The setup I'm working with is a number of CIs that publish debs at any time, a CI that consumes debs for building a distro, and a number of devboxes that could at any time install either a given version of a package or the latest one.

There are probably more elegant solutions than publishing on every package added, but for what I'm doing that is certainly the simplest.

Since I do a lot of concurrent publishes (over the API), I get a lot of HTTP error responses. This can again be worked around by just looping the HTTP PUT until it succeeds, but that is not so elegant.
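For example, roughly (placeholder URL and repo; naive fixed backoff):

    import time
    import requests

    APTLY = "http://your-aptly-server:8080"  # placeholder

    def publish_with_retry(prefix, distribution, attempts=10, delay=5):
        # Keep retrying the publish update until the server stops returning errors.
        for attempt in range(1, attempts + 1):
            resp = requests.put("%s/api/publish/%s/%s" % (APTLY, prefix, distribution),
                                json={}, timeout=300)
            if resp.ok:
                return resp
            print("publish attempt %d failed with HTTP %d, retrying" % (attempt, resp.status_code))
            time.sleep(delay)
        resp.raise_for_status()

    # publish_with_retry("my-repo", "jammy")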

In an effort to be a good citizen (and a lazy developer who does not want to maintain workarounds), I've opened #1261, which adds a failing test and a really ugly workaround. The value of the MR is pretty much the test case as-is, but I'm committed to pushing a better implementation/workaround (assuming this is actually a use case that is meant to be supported).

As far as I can tell, there is no reason not to queue up operations in the code (even though my submission blocks rather than queues).

Is the issue or the MR the correct place for further discussion?
