Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plagued by segmentation faults #2286

Open
bluthg opened this issue Feb 16, 2024 · 17 comments
Open

Plagued by segmentation faults #2286

bluthg opened this issue Feb 16, 2024 · 17 comments
Assignees

Comments

@bluthg
Copy link

bluthg commented Feb 16, 2024

Please provide the following information when submitting an issue (feature requests or general comments can skip this):

  1. pgBackRest version:
    2.49 and also former releases

  2. PostgreSQL version:
    13.x

  3. Operating system/version - if you have more than one server (for example, a database server, a repository host server, one or more standbys), please specify each:
    Ubuntu 20.04

  4. Did you install pgBackRest from source or from a package?
    Package (PGDG repo)

  5. Please attach the following as applicable:

    • pgbackrest.conf file(s)
    • postgresql.conf settings applicable to pgBackRest (archive_command, archive_mode, listen_addresses, max_wal_senders, wal_level, port)
    • errors in the postgresql log file before or during the time you experienced the issue
    • log file in /var/log/pgbackrest for the commands run (e.g. /var/log/pgbackrest/mystanza_backup.log)

Not allowed to fully disclose, but the essence is:

[global]
log-level-file=detail
repo1-host=XXXXXXXXXXXXXXX
archive-async=y
archive-get-queue-max=4GB
compress-type=lz4
spool-path=/var/spool/pgbackrest
repo1-cipher-pass=XXXXXXXXXX
process-max=25
io-timeout=240
archive-timeout=120

[global:archive-get]
process-max=25

[global:archive-push]
process-max=15

Repo host uses an internal S3 service.

  1. Describe the issue:
    My client sees a lot of segfaults (between 0 and 30+ per day per cluster) on the DB servers:
2024-02-15 07:02:15.380 P01   WARN: could not push WAL file 'XXX' to the archive (will be retried): [102] local-1 process terminated unexpectedly on signal 11
2024-02-15 07:02:15.382 P06   WARN: could not push WAL file 'XXX' to the archive (will be retried): [102] local-6 process terminated unexpectedly on signal 11
2024-02-15 07:02:16.503 P03   WARN: could not push WAL file 'XXX' to the archive (will be retried): [102] local-3 process terminated unexpectedly on signal 11
2024-02-15 07:02:37.604 P01   WARN: could not push WAL file 'XXX' to the archive (will be retried): [102] local-1 process terminated unexpectedly on signal 11
2024-02-15 07:02:37.726 P02   WARN: could not push WAL file 'XXX' to the archive (will be retried): [102] local-2 process terminated unexpectedly on signal 11
2024-02-15 20:24:11.466 P01   WARN: could not push WAL file 'XXX' to the archive (will be retried): [102] local-1 process terminated unexpectedly on signal 6
2024-02-15 21:01:44.029 P02   WARN: could not push WAL file 'XXX' to the archive (will be retried): [102] local-2 process terminated unexpectedly on signal 11

with corresponding syslog entries:

[6792684.581533] pgbackrest[127957]: segfault at 200000 ip 0000000000200000 sp 00007fff6f0a3fe8 error 14 in pgbackrest[557640128000+d000]
[6792684.618881] pgbackrest[127978]: segfault at 0 ip 0000000000000000 sp 00007ffdde906ca8 error 14 in pgbackrest[56489d2d9000+d000]
[6792686.007148] pgbackrest[127965]: segfault at 0 ip 0000000000000000 sp 00007ffe9e639828 error 14 in pgbackrest[5639eaeed000+d000]
[6792707.229129] pgbackrest[129870]: segfault at 0 ip 0000000000000000 sp 00007ffe25328bb8 error 14 in pgbackrest[55856214a000+d000]
[6843054.611934] pgbackrest[2227296]: segfault at 0 ip 0000000000000000 sp 00007fff188849c8 error 14 in pgbackrest[55bbdde9c000+d000]

(occasional signal 6 ones too)

Apart from it being annoying and spamming the monitoring, this is not really critical, but alas! it feels wrong... ;-)

I've changed the archive_command on one cluster to include catchsegv, which may hopefully shed some light on the root cause?

@dwsteele dwsteele self-assigned this Feb 18, 2024
@dwsteele
Copy link
Member

Apart from it being annoying and spamming the monitoring, this is not really critical, but alas! it feels wrong... ;-)

Agreed.

I've changed the archive_command on one cluster to include catchsegv, which may hopefully shed some light on the root cause?

I don't think this will work since the process having the issue is two fork/execs away from archive_command. If you could get a stack trace from one of the core dumps that would be very helpful. Be sure to install debug symbols.

@bluthg
Copy link
Author

bluthg commented Feb 19, 2024

Confirmed that catchsegv doesn't yield anything... Will try to get a coredump.

@bluthg
Copy link
Author

bluthg commented Feb 20, 2024

Do these help?
I got this error from GDB for 5 of the files:
1261 common/memContext.c: No such file or directory.

@bluthg
Copy link
Author

bluthg commented Feb 20, 2024

Hmm, I can't attach the tarball apparently. So here we go.
Judging by the size of the coredumps, we've had 4 different crash situations, in increasing order (1x,1x,5x,10x):

[New LWP 3691411]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `pgbackrest --exec-id=3691389-7ca11e09 --log-level-console=off --log-level-file='.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007ff08ea2f00b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x00007ff08ea2f00b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ff08ea0e859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ff08ea7926e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007ff08ea812fc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x00007ff08ea82c65 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x000056170a9684e2 in memContextFree (this=0x56170ca93210) at common/memContext.c:1414
#6  0x000056170a9685c8 in memContextClean (tryDepth=3, fatal=false) at common/memContext.c:1224
#7  0x000056170a960e68 in errorInternalCatch (errorTypeCatch=0x56170aa510a0 <RuntimeError>, fatalCatch=fatalCatch@entry=false) at common/error/error.c:313
#8  0x000056170a9f57b4 in protocolServerProcess (this=this@entry=0x56170ca70e90, retryInterval=0x56170ca96bc0, handlerList=handlerList@entry=0x56170aa56040 <commandLocalHandlerList>, handlerListSize=handlerListSize@entry=5) at protocol/server.c:203
#9  0x000056170a99addf in cmdLocal (server=0x56170ca70e90) at command/local/local.c:38
#10 0x000056170a95f8a8 in main (argListSize=9, argList=0x7fff4a15d5a8) at common/io/fdRead.h:23
[New LWP 3625650]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `pgbackrest --exec-id=3625633-f444565f --log-level-console=off --log-level-file='.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f62f92616f0 in free () from /lib/x86_64-linux-gnu/libc.so.6
#0  0x00007f62f92616f0 in free () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x0000561eed899b30 in memFreeInternal (buffer=<optimized out>) at common/memContext.c:1353
#2  memContextFreeRecurse (this=0x561eef237c40) at common/memContext.c:1353
#3  0x0000561eed89a4e2 in memContextFree (this=0x561eef237c40) at common/memContext.c:1414
#4  0x0000561eed89a5c8 in memContextClean (tryDepth=3, fatal=false) at common/memContext.c:1224
#5  0x0000561eed892e68 in errorInternalCatch (errorTypeCatch=0x561eed9830a0 <RuntimeError>, fatalCatch=fatalCatch@entry=false) at common/error/error.c:313
#6  0x0000561eed9277b4 in protocolServerProcess (this=this@entry=0x561eef201240, retryInterval=0x561eef226bc0, handlerList=handlerList@entry=0x561eed988040 <commandLocalHandlerList>, handlerListSize=handlerListSize@entry=5) at protocol/server.c:203
#7  0x0000561eed8ccddf in cmdLocal (server=0x561eef201240) at command/local/local.c:38
#8  0x0000561eed8918a8 in main (argListSize=9, argList=0x7ffd339cdb48) at common/io/fdRead.h:23
[New LWP 3613803]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `pgbackrest --exec-id=3613779-c68c79a3 --log-level-console=off --log-level-file='.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000564533ca29c9 in memContextCallbackRecurse (this=this@entry=0x5645351047c0) at common/memContext.c:1261
#0  0x0000564533ca29c9 in memContextCallbackRecurse (this=this@entry=0x5645351047c0) at common/memContext.c:1261
#1  0x0000564533ca34ca in memContextFree (this=0x5645351047c0) at common/memContext.c:1409
#2  0x0000564533ca35c8 in memContextClean (tryDepth=3, fatal=false) at common/memContext.c:1224
#3  0x0000564533c9be68 in errorInternalCatch (errorTypeCatch=0x564533d8c0a0 <RuntimeError>, fatalCatch=fatalCatch@entry=false) at common/error/error.c:313
#4  0x0000564533d307b4 in protocolServerProcess (this=this@entry=0x5645350db240, retryInterval=0x564535100bc0, handlerList=handlerList@entry=0x564533d91040 <commandLocalHandlerList>, handlerListSize=handlerListSize@entry=5) at protocol/server.c:203
#5  0x0000564533cd5ddf in cmdLocal (server=0x5645350db240) at command/local/local.c:38
#6  0x0000564533c9a8a8 in main (argListSize=9, argList=0x7fff031dbf88) at common/io/fdRead.h:23
[New LWP 3613821]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `pgbackrest --exec-id=3613779-c68c79a3 --log-level-console=off --log-level-file='.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000000000 in ?? ()
#0  0x0000000000000000 in ?? ()
#1  0x000055c39efe29cb in memContextCallbackRecurse (this=this@entry=0x55c3a0cc07c0) at common/memContext.c:1261
#2  0x000055c39efe34ca in memContextFree (this=0x55c3a0cc07c0) at common/memContext.c:1409
#3  0x000055c39efe35c8 in memContextClean (tryDepth=3, fatal=false) at common/memContext.c:1224
#4  0x000055c39efdbe68 in errorInternalCatch (errorTypeCatch=0x55c39f0cc0a0 <RuntimeError>, fatalCatch=fatalCatch@entry=false) at common/error/error.c:313
#5  0x000055c39f0707b4 in protocolServerProcess (this=this@entry=0x55c3a0c97240, retryInterval=0x55c3a0cbcbc0, handlerList=handlerList@entry=0x55c39f0d1040 <commandLocalHandlerList>, handlerListSize=handlerListSize@entry=5) at protocol/server.c:203
#6  0x000055c39f015ddf in cmdLocal (server=0x55c3a0c97240) at command/local/local.c:38
#7  0x000055c39efda8a8 in main (argListSize=9, argList=0x7ffc1a22bad8) at common/io/fdRead.h:23

@dwsteele
Copy link
Member

Unfortunately these are not as useful as I had hoped. It looks like an error has been thrown and these traces show the attempted cleanup. My guess from past experience is we are getting some sort of unexpected output/error from the S3 clone and it is causing problems. Not sure why this would cause a segfault, though.

We'll need to get some logs to get an idea of what is happening when the error occurs. Add this configuration to one of your systems that has frequent errors:

[global:archive-push]
log-level-file=debug
log-subprocess=y

When you get an error, identify the subprocess that failed (e.g. local-3 process terminated unexpectedly) and find the log file associated with that subprocess, in this case [stanza]-archive-push-local-003.log.

There probably won't be any errors in this log so you'll need to look for log blocks (separated by ----) that do not end in archive-push:local command end: completed successfully. That should mean the process terminated abnormally. If we can get one those blocks we might have a chance to figure out what was happening at the time of the error.

Please note that the logs might get rather large, depending on WAL volume.

@bluthg
Copy link
Author

bluthg commented Feb 22, 2024

I've set this up on our most affected server. And now we wait ;-)

@bluthg
Copy link
Author

bluthg commented Feb 23, 2024

Ok, so we caught some signal 11 and a signal 6.

Note that I had to redact a few more informations ("redacted" w/o "<>").

  • Signal 6
    ** DB Server
2024-02-23 07:09:34.627 P00   INFO: archive-push:async command begin 2.49: [/var/lib/postgresql/13/main/pg_wal] --archive-async --compress-type=lz4 --exec-id=2758216-07e08201 --io-timeout=240 --log-level-console=off --log-level-file=debug --log-level-stderr=off --log-subprocess --pg1-path=/var/lib/postgresql/13/main --process-max=15 --repo1-host=redacted --spool-path=/var/spool/pgbackrest --stanza=redacted
...
...
2024-02-23 07:09:59.945 P00   WARN: unable to write to local-6 process write: [32] Broken pipe
2024-02-23 07:09:59.946 P06   WARN: could not push WAL file '000000110001DF9D00000022' to the archive (will be retried): [102] local-6 process terminated unexpectedly on signal 6
2024-02-23 07:09:59.946 P00  DEBUG:     command/archive/common::archiveAsyncStatusErrorWrite: (archiveMode: push, walSegment: {"000000110001DF9D00000022"}, code: 102, message: {"local-6 process terminated unexpectedly on signal 6"})
2024-02-23 07:09:59.946 P00  DEBUG:     storage/storage::storageNewWrite: (this: {type: posix, path: /var/spool/pgbackrest, write: true}, fileExp: {"<SPOOL:ARCHIVE:OUT>/000000110001DF9D00000022.error"}, param.modeFile: 0000, param.modePath: 0000, param.user: null, param.group: null, param.timeModified: 0, param.n
oCreatePath: false, param.noSyncFile: false, param.noSyncPath: false, param.noAtomic: false, param.noTruncate: false, param.compressible: false)
2024-02-23 07:09:59.946 P00  DEBUG:     storage/storage::storageNewWrite: => {type: posix, name: /var/spool/pgbackrest/archive/redacted/out/000000110001DF9D00000022.error, modeFile: 0640, modePath: 0750, createPath: true, syncFile: true, syncPath: true, atomic: true}
2024-02-23 07:09:59.946 P00  DEBUG:     storage/storage::storagePut: (file: {type: posix, name: /var/spool/pgbackrest/archive/redacted/out/000000110001DF9D00000022.error, modeFile: 0640, modePath: 0750, createPath: true, syncFile: true, syncPath: true, atomic: true}, buffer: {used: 55, size: 55})
2024-02-23 07:09:59.947 P00  DEBUG:     storage/storage::storagePut: => void
2024-02-23 07:09:59.947 P00  DEBUG:     command/archive/common::archiveAsyncStatusErrorWrite: => void
2024-02-23 07:09:59.948 P00  DEBUG:     command/archive/push/push::cmdArchivePushAsync: => void
2024-02-23 07:10:00.048 P00  DEBUG:     command/exit::exitSafe: (result: 0, error: false, signalType: 0)
2024-02-23 07:10:00.048 P00   INFO: archive-push:async command end: completed successfully (25422ms)

** pgBackRest repo host (local process log on DB server didn't sport anything from that time):

2024-02-23 07:09:35.807 P00   INFO: archive-push:remote command begin 2.49: --archive-async --exec-id=2758216-07e08201 --io-timeout=240 --log-level-console=off --log-level-file=debug --log-level-stderr=error --log-subprocess --pg1-path=/var/lib/postgresql/13/main --pg2-path=/var/lib/postgresql/13/main --pg3-path=/var/lib/postgresql/13/main --pg4-path=/var/lib/postgresql/13/main --process=6 --remote-type=repo --repo=1 --repo1-path=/repo --repo1-s3-bucket=redacted --repo1-s3-endpoint=redacted --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=europe --repo1-s3-uri-style=path --repo1-type=s3 --stanza=redacted
...
...
2024-02-23 07:09:59.698 P00  DEBUG:     protocol/server::protocolServerError: (this: {name: remote-6}, code: 39, message: {"HTTP request failed with 200 (OK):
                                        *** Path/Query ***:
                                        POST /redacted/repo/archive/redacted/13-1/000000110001DF9D/000000110001DF9D00000022-9a7c909098ca5443f8f0c36a3221f09bc28f8fd3.lz4?uploadId=z1soZl5xufdvcSxuUF5szgAC2abaFmPlWmTkanMIBE3u5_C0l1rvodYlGA
                                        *** Request Headers ***:
                                        authorization: <redacted>
                                        content-length: 435
                                        content-md5: hRf/u0gF7yLHRxWcD2TFfA==
                                        host: redacted
                                        x-amz-content-sha256: bf7df4dfeee0e83d39f7471790822f8a8096763a4caeb689d4fd5ea0d1749f36
                                        x-amz-date: <redacted>
                                        *** Response Headers ***:
                                        connection: close
                                        content-type: application/xml
                                        date: Fri, 23 Feb 2024 06:09:49 GMT
                                        server: StorageGRID/11.6.0.9
                                        vary: Accept-Encoding
                                        x-amz-id-2: 12422083
                                        x-amz-request-id: 1708668575841931
                                        x-amz-version-id: MUU5OUI5QUEtRDIxMi0xMUVFLTkzQjgtQUNDNzAwQkQ4QkMz
                                        x-ntap-sg-trace-id: 950999080a7ef361
                                        *** Response Content ***:
                                        <?xml version="1.0" encoding="UTF-8"?>
                                         <Error><Code>ServiceUnavailable</Code><Message>Please reduce your request rate.</Message></Error>"}, stack: {"common/io/http/request.c:httpRequestError:302:(trace log level required for parameters)
                                        storage/s3/write.c:storageWriteS3Close:(trace log level required for parameters)
                                        common/io/write.c:ioWriteClose:(trace log level required for parameters)
                                        storage/remote/protocol.c:storageRemoteOpenWriteProtocol:(param: {depth: 0, idLast: 0, tagNextId: 0, tagNextType: 0, tagNextValue 0}, server: {name: remote-6})
                                        protocol/server.c:protocolServerProcess:(this: {name: remote-6}, retryInterval: null, handlerList: *void, handlerListSize: 14)
                                        command/remote/remote.c:cmdRemote:(void)
                                        main.c:main:(debug log level required for parameters)"})
2024-02-23 07:09:59.699 P00  DEBUG:     protocol/server::protocolServerError: => void
2024-02-23 07:09:59.703 P00  DEBUG:     protocol/server::protocolServerError: (this: {name: remote-6}, code: 29, message: {"unexpected EOF"}, stack: {"common/type/pack.c:pckReadBuffer:441:(test build required for parameters)
                                            ... function(s) omitted ...
                                        protocol/server.c:protocolServerCommandGet:(trace log level required for parameters)
                                        protocol/server.c:protocolServerProcess:(this: {name: remote-6}, retryInterval: null, handlerList: *void, handlerListSize: 14)
                                        command/remote/remote.c:cmdRemote:(void)
                                        main.c:main:(debug log level required for parameters)"})
2024-02-23 07:09:59.704 P00  DEBUG:     protocol/server::protocolServerError: => void
2024-02-23 07:09:59.704 P00  DEBUG:     command/exit::exitSafe: (result: 0, error: true, signalType: 0)
2024-02-23 07:09:59.704 P00  ERROR: [029]: unexpected EOF
...
2024-02-23 07:09:59.704 P00 DETAIL: statistics: {"http.client":{"total":1},"http.close":{"total":1},"http.request":{"total":7},"http.session":{"total":1},"socket.client":{"total":1},"socket.session":{"total":1},"tls.client":{"total":1},"tls.session":{"total":1}}
2024-02-23 07:09:59.704 P00   INFO: archive-push:remote command end: aborted with exception [029]
2024-02-23 07:09:59.704 P00  DEBUG:     command/exit::exitSafe: => 29
2024-02-23 07:09:59.704 P00  DEBUG:     main::main: => 29
  • Signal 11
    ** DB server
2024-02-22 20:49:31.620 P00   WARN: unable to write to local-3 process write: [32] Broken pipe
2024-02-22 20:49:31.621 P03   WARN: could not push WAL file '000000110001DF1E000000AD' to the archive (will be retried): [102] local-3 process terminated unexpectedly on signal 11
2024-02-22 20:49:31.621 P00  DEBUG:     command/archive/common::archiveAsyncStatusErrorWrite: (archiveMode: push, walSegment: {"000000110001DF1E000000AD"}, code: 102, message: {"local-3 process terminated unexpectedly on signal 11"})
2024-02-22 20:49:31.621 P00  DEBUG:     storage/storage::storageNewWrite: (this: {type: posix, path: /var/spool/pgbackrest, write: true}, fileExp: {"<SPOOL:ARCHIVE:OUT>/000000110001DF1E000000AD.error"}, param.modeFile: 0000, param.modePath: 0000, param.user: null, param.group: null, param.timeModified: 0, param.noCreatePath: false, param.noSyncFile: false, param.noSyncPath: false, param.noAtomic: false, param.noTruncate: false, param.compressible: false)
2024-02-22 20:49:31.621 P00  DEBUG:     storage/storage::storageNewWrite: => {type: posix, name: /var/spool/pgbackrest/archive/redacted/out/000000110001DF1E000000AD.error, modeFile: 0640, modePath: 0750, createPath: true, syncFile: true, syncPath: true, atomic: true}
2024-02-22 20:49:31.622 P00  DEBUG:     storage/storage::storagePut: (file: {type: posix, name: /var/spool/pgbackrest/archive/redacted/out/000000110001DF1E000000AD.error, modeFile: 0640, modePath: 0750, createPath: true, syncFile: true, syncPath: true, atomic: true}, buffer: {used: 56, size: 56})
2024-02-22 20:49:31.623 P00  DEBUG:     storage/storage::storagePut: => void
2024-02-22 20:49:31.623 P00  DEBUG:     command/archive/common::archiveAsyncStatusErrorWrite: => void

** repo host

2024-02-22 20:49:31.615 P00  DEBUG:     protocol/server::protocolServerError: (this: {name: remote-3}, code: 39, message: {"HTTP request failed with 200 (OK):
                                        *** Path/Query ***:
                                        POST /redacted/repo/archive/redacted/13-1/000000110001DF1E/000000110001DF1E000000AD-56b04711203231e816af474604c1a7e943a3691f.lz4?uploadId=q3nk70fa8GnSW2PxQrdjCNwGoJO8W4B4exWtFe9v_nFwQpwyQywVeZ2LDg
                                        *** Request Headers ***:
                                        authorization: <redacted>
                                        content-length: 349
                                        content-md5: GqgHnjdA3PWRsMg+OqrMzA==
                                        host: redacted
                                        x-amz-content-sha256: 67a2298a042f3d39f35eba35fb36a5d38eaae0133ee0555b9ad4813c41236f08
                                        x-amz-date: <redacted>
                                        *** Response Headers ***:
                                        connection: close
                                        content-type: application/xml
                                        date: Thu, 22 Feb 2024 19:49:21 GMT
                                        server: StorageGRID/11.6.0.9
                                        vary: Accept-Encoding
                                        x-amz-id-2: 12674478
                                        x-amz-request-id: 1708631360413366
                                        x-amz-version-id: Nzg5QTA4OEEtRDFCQi0xMUVFLTg5NTgtNTQxNTAwQzE2NUFF
                                        x-ntap-sg-trace-id: 4fee5ab625579827
                                        *** Response Content ***:
                                        <?xml version="1.0" encoding="UTF-8"?>
                                         <Error><Code>ServiceUnavailable</Code><Message>Please reduce your request rate.</Message></Error>"}, stack: {"common/io/http/request.c:httpRequestError:302:(trace log level required for parameters)
                                        storage/s3/write.c:storageWriteS3Close:(trace log level required for parameters)
                                        common/io/write.c:ioWriteClose:(trace log level required for parameters)
                                        storage/remote/protocol.c:storageRemoteOpenWriteProtocol:(param: {depth: 0, idLast: 0, tagNextId: 0, tagNextType: 0, tagNextValue 0}, server: {name: remote-3})
                                        protocol/server.c:protocolServerProcess:(this: {name: remote-3}, retryInterval: null, handlerList: *void, handlerListSize: 14)
                                        command/remote/remote.c:cmdRemote:(void)
                                        main.c:main:(debug log level required for parameters)"})
2024-02-22 20:49:31.616 P00  DEBUG:     protocol/server::protocolServerError: => void
2024-02-22 20:49:31.621 P00  DEBUG:     protocol/server::protocolServerError: (this: {name: remote-3}, code: 29, message: {"unexpected EOF"}, stack: {"common/type/pack.c:pckReadBuffer:441:(test build required for parameters)
                                            ... function(s) omitted ...
                                        protocol/server.c:protocolServerCommandGet:(trace log level required for parameters)
                                        protocol/server.c:protocolServerProcess:(this: {name: remote-3}, retryInterval: null, handlerList: *void, handlerListSize: 14)
                                        command/remote/remote.c:cmdRemote:(void)
                                        main.c:main:(debug log level required for parameters)"})
2024-02-22 20:49:31.621 P00  DEBUG:     protocol/server::protocolServerError: => void
2024-02-22 20:49:31.621 P00  DEBUG:     command/exit::exitSafe: (result: 0, error: true, signalType: 0)
2024-02-22 20:49:31.621 P00  ERROR: [029]: unexpected EOF
                                    --------------------------------------------------------------------
                                    If SUBMITTING AN ISSUE please provide the following information:
                                    
                                    version: 2.49
                                    command: archive-push:remote
                                    options: --archive-async --exec-id=1701722-03e79a03 --io-timeout=240 --log-level-console=off --log-level-file=debug --log-level-stderr=error --log-subprocess --pg1-path=/var/lib/postgresql/13/main --pg2-path=/var/lib/postgresql/13/main --pg3-path=/var/lib/postgresql/13/main --pg4-path=/var/lib/postgresql/13/main --process=3 --remote-type=repo --repo=1 --repo1-path=/repo --repo1-s3-bucket=redacted --repo1-s3-endpoint=redacted --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=europe --repo1-s3-uri-style=path --repo1-type=s3 --stanza=redacted
                                    
                                    stack trace:
                                    common/type/pack.c:pckReadBuffer:441:(test build required for parameters)
                                        ... function(s) omitted ...
                                    protocol/server.c:protocolServerCommandGet:(trace log level required for parameters)
                                    protocol/server.c:protocolServerProcess:(this: {name: remote-3}, retryInterval: null, handlerList: *void, handlerListSize: 14)
                                    command/remote/remote.c:cmdRemote:(void)
                                    main.c:main:(debug log level required for parameters)
                                    --------------------------------------------------------------------
2024-02-22 20:49:31.622 P00 DETAIL: statistics: {"http.client":{"total":1},"http.close":{"total":1},"http.request":{"total":6},"http.session":{"total":1},"socket.client":{"total":1},"socket.session":{"total":1},"tls.client":{"total":1},"tls.session":{"total":1}}
2024-02-22 20:49:31.622 P00   INFO: archive-push:remote command end: aborted with exception [029]
2024-02-22 20:49:31.622 P00  DEBUG:     command/exit::exitSafe: => 29
2024-02-22 20:49:31.622 P00  DEBUG:     main::main: => 29

Obviously, we'll be working on solving the root cause of
<Error><Code>ServiceUnavailable</Code><Message>Please reduce your request rate.</Message></Error>
in parallel ;-)

@dwsteele
Copy link
Member

It looks like the remotes are working as expected, i.e. getting an error and throwing it to the local process. But then the local crashes and the remote errors on EOF. Can we get the relevant local logs?

Obviously, we'll be working on solving the root cause of
ServiceUnavailablePlease reduce your request rate.
in parallel

This is a pretty awful place throw a rate limiting error, when the upload is complete and we are just trying to stitch it together. Worse, it is not returned with a proper HTTP code or sensible error.

None of this should cause a crash, of course...

@bluthg
Copy link
Author

bluthg commented Feb 26, 2024

As mentioned, the local logs have nothing in them, e.g.:

2024-02-22 20:49:20.667 P00  DEBUG:     storage/storage::storageNewWrite: => {type: remote, name: /repo/archive/redacted/13-1/000000110001DF1E/000000110001DF1E000000AD-56b04711203231e816af474604c1a7e943a3691f.lz4, modeFile: 0640, modePath: 0750, createPath: true, syncFile: true, syncPath: true, atomic: true}

-------------------PROCESS START-------------------
2024-02-22 20:49:36.896 P00   INFO: archive-push:local command begin [...]

(while the crash of the local-003 process was at 2024-02-22 20:49:31)

@dwsteele
Copy link
Member

As mentioned, the local logs have nothing in them

I'm confused -- doesn't look like nothing to me -- in fact looks a lot like what I was expecting. Can we get more above the entry at 2024-02-22 20:49:20.667?

@bluthg
Copy link
Author

bluthg commented Feb 28, 2024

D'uh! What I didn't realize was that that was actually the subprocess we're looking for...! Someone else had extracted the logs already, so I just copied them from the ticket system... mea culpa!

So here's the beginning of the log on the repo host up to the point pasted earlier:

-------------------PROCESS START-------------------
2024-02-22 20:49:20.401 P00   INFO: archive-push:remote command begin 2.49: --archive-async --exec-id=1701722-03e79a03 --io-timeout=240 --log-level-console=off --log-level-file=debug --log-level-stderr=error --log-subprocess --pg1-path=/var/lib/postgresql/13/main --pg2-path=/var/lib/postgresql/13/main --pg3-path=/var/lib/postgresql/13/main --pg4-path=/var/lib/postgresql/13/main --process=3 --remote-type=repo --repo=1 --repo1-path=/repo --repo1-s3-bucket=redacted --repo1-s3-endpoint=redacted --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=europe --repo1-s3-uri-style=path --
repo1-type=s3 --stanza=redacted
2024-02-22 20:49:20.401 P00  DEBUG:     common/lock::lockInit: (path: {"/tmp/pgbackrest"}, execId: {"1701722-03e79a03"}, type: 0)
2024-02-22 20:49:20.401 P00  DEBUG:     common/lock::lockInit: => void
2024-02-22 20:49:20.401 P00  DEBUG:     config/load::cfgLoad: => void
2024-02-22 20:49:20.401 P00  DEBUG:     command/remote/remote::cmdRemote: (void)
2024-02-22 20:49:20.402 P00  DEBUG:     protocol/server::protocolServerProcess: (this: {name: remote-3}, retryInterval: null, handlerList: *void, handlerListSize: 14)
2024-02-22 20:49:20.402 P00  DEBUG:     config/protocol::configOptionProtocol: (param: {depth: 0, idLast: 0, tagNextId: 0, tagNextType: 0, tagNextValue 0}, server: {name: remote-3})
2024-02-22 20:49:20.403 P00  DEBUG:     config/protocol::configOptionProtocol: => void
2024-02-22 20:49:20.403 P00  DEBUG:     storage/remote/protocol::storageRemoteFeatureProtocol: (param: null, server: {name: remote-3})
2024-02-22 20:49:20.403 P00  DEBUG:     protocol/helper::repoIsLocal: (repoIdx: 0)
2024-02-22 20:49:20.403 P00  DEBUG:     protocol/helper::repoIsLocal: => true
2024-02-22 20:49:20.403 P00  DEBUG:     storage/s3/helper::storageS3Helper: (repoIdx: 0, write: true, pathExpressionCallback: *void)
2024-02-22 20:49:20.403 P00  DEBUG:     storage/s3/storage::storageS3New: (path: {"/repo"}, write: true, pathExpressionFunction: (function *), bucket: {"redacted"}, endPoint: {"redcated"}, uriStyle: path, region: {"europe"}, keyType: shared, partSize: 5242880, tag: null, host: null, port: 443, timeout: 240000, verifyPeer: true, caFile: null, caPath: null)
2024-02-22 20:49:20.403 P00  DEBUG:     common/io/socket/client::sckClientNew: (host: {"redacted"}, port: 443, timeoutConnect: 240000, timeoutSession: 240000)
2024-02-22 20:49:20.403 P00  DEBUG:     common/io/socket/client::sckClientNew: => {type: socket, driver: {host: redacted, port: 443, timeoutConnect: 240000, timeoutSession: 240000}}
2024-02-22 20:49:20.403 P00  DEBUG:     common/io/tls/client::tlsClientNew: (ioClient: {type: socket, driver: {host: redacted, port: 443, timeoutConnect: 240000, timeoutSession: 240000}}, host: {"redacted"}, timeoutConnect: 240000, timeoutSession: 240000, verifyPeer: true, param.caFile: null, param.caPath: null, param.certFile: null, param.keyFile: null)
2024-02-22 20:49:20.405 P00  DEBUG:     common/io/tls/client::tlsClientNew: => {type: tls, driver: {ioClient: {type: socket, driver: {host: redacted, port: 443, timeoutConnect: 240000, timeoutSession: 240000}}, timeoutConnect: 240000, timeoutSession: 240000, verifyPeer: true}}
2024-02-22 20:49:20.405 P00  DEBUG:     common/io/http/client::httpClientNew: (ioClient: {type: tls, driver: {ioClient: {type: socket, driver: {host: redacted, port: 443, timeoutConnect: 240000, timeoutSession: 240000}}, timeoutConnect: 240000, timeoutSession: 240000, verifyPeer: true}}, timeout: 240000)
2024-02-22 20:49:20.405 P00  DEBUG:     common/io/http/client::httpClientNew: => {ioClient: {type: tls, driver: {ioClient: {type: socket, driver: {host: redacted, port: 443, timeoutConnect: 240000, timeoutSession: 240000}}, timeoutConnect: 240000, timeoutSession: 240000, verifyPeer: true}}, reusable: 0, timeout: 240000}
2024-02-22 20:49:20.405 P00  DEBUG:     storage/s3/storage::storageS3New: => {type: s3, path: /repo, write: true}
2024-02-22 20:49:20.405 P00  DEBUG:     storage/s3/helper::storageS3Helper: => {type: s3, path: /repo, write: true}
2024-02-22 20:49:20.405 P00  DEBUG:     storage/remote/protocol::storageRemoteFeatureProtocol: => void
2024-02-22 20:49:20.406 P00  DEBUG:     storage/remote/protocol::storageRemoteListProtocol: (param: {depth: 0, idLast: 0, tagNextId: 0, tagNextType: 0, tagNextValue 0}, server: {name: remote-3})
2024-02-22 20:49:20.663 P00  DEBUG:     storage/remote/protocol::storageRemoteListProtocol: => void
2024-02-22 20:49:20.666 P00  DEBUG:     storage/remote/protocol::storageRemoteFeatureProtocol: (param: null, server: {name: remote-3})
2024-02-22 20:49:20.666 P00  DEBUG:     storage/remote/protocol::storageRemoteFeatureProtocol: => void
2024-02-22 20:49:20.667 P00  DEBUG:     storage/remote/protocol::storageRemoteOpenWriteProtocol: (param: {depth: 0, idLast: 0, tagNextId: 0, tagNextType: 0, tagNextValue 0}, server: {name: remote-3})
2024-02-22 20:49:20.667 P00  DEBUG:     storage/s3/storage::storageS3NewWrite: (this: {StorageS3}, file: {"/repo/archive/redacted/13-1/000000110001DF1E/000000110001DF1E000000AD-56b04711203231e816af474604c1a7e943a3691f.lz4"})
2024-02-22 20:49:20.667 P00  DEBUG:     storage/s3/storage::storageS3NewWrite: => {type: s3, name: /repo/archive/redacted/13-1/000000110001DF1E/000000110001DF1E000000AD-56b04711203231e816af474604c1a7e943a3691f.lz4, modeFile: 0000, modePath: 0000, createPath: true, syncFile: true, syncPath: true, atomic: true}
2024-02-22 20:49:31.615 P00  DEBUG:     protocol/server::protocolServerError: (this: {name: remote-3}, code: 39, message: {"HTTP request failed with 200 (OK):

and I try to attach the snippet of the -archive-push-async-local-003.log

@bluthg
Copy link
Author

bluthg commented Feb 28, 2024

No idea why I can't attach files... sorry!

-------------------PROCESS START-------------------
2024-02-22 20:49:19.725 P00   INFO: archive-push:local command begin 2.49: --archive-async --exec-id=1701722-03e79a03 --io-timeout=240 --log-level-console=off --log-level-file=debug --log-level-stderr=error --log-subprocess --pg1-path=/var/lib/postgresql/13/main --process=3 --remote-type=repo --repo1-host=redacted --spool-path=/var/spool/pgbackrest --stanza=redacted
2024-02-22 20:49:19.725 P00  DEBUG:     common/lock::lockInit: (path: {"/tmp/pgbackrest"}, execId: {"1701722-03e79a03"}, type: 0)
2024-02-22 20:49:19.725 P00  DEBUG:     common/lock::lockInit: => void
2024-02-22 20:49:19.725 P00  DEBUG:     config/load::cfgLoad: => void
2024-02-22 20:49:19.725 P00  DEBUG:     command/local/local::cmdLocal: (void)
2024-02-22 20:49:19.725 P00  DEBUG:     protocol/server::protocolServerProcess: (this: {name: local-3}, retryInterval: {VariantList}, handlerList: *void, handlerListSize: 5)
2024-02-22 20:49:19.904 P00  DEBUG:     command/archive/push/protocol::archivePushFileProtocol: (param: {depth: 0, idLast: 0, tagNextId: 0, tagNextType: 0, tagNextValue 0}, server: {name: local-3})
2024-02-22 20:49:19.904 P00  DEBUG:     command/archive/push/file::archivePushFile: (walSource: {"/var/lib/postgresql/13/main/pg_wal/000000110001DF1E000000AD"}, headerCheck: true, modeCheck: true, pgVersion: 130000, pgSystemId: 7021831570073544007, archiveFile: {"000000110001DF1E000000AD"}, compressType: 3, compressLevel: 1, repoList: *void, priorErrorList: {[]})
2024-02-22 20:49:19.905 P00  DEBUG:     postgres/interface::pgWalFromFile: (walFile: {"/var/lib/postgresql/13/main/pg_wal/000000110001DF1E000000AD"}, pgVersionForce: null)
2024-02-22 20:49:19.905 P00  DEBUG:     storage/storage::storageNewRead: (this: {type: posix, path: /, write: false}, fileExp: {"/var/lib/postgresql/13/main/pg_wal/000000110001DF1E000000AD"}, param.ignoreMissing: false, param.compressible: false, param.offset: 0, param.limit: null)
2024-02-22 20:49:19.905 P00  DEBUG:     storage/storage::storageNewRead: => {type: posix, name: /var/lib/postgresql/13/main/pg_wal/000000110001DF1E000000AD, ignoreMissing: false}
2024-02-22 20:49:19.905 P00  DEBUG:     storage/storage::storageGet: (file: {type: posix, name: /var/lib/postgresql/13/main/pg_wal/000000110001DF1E000000AD, ignoreMissing: false}, param.exactSize: 512)
2024-02-22 20:49:19.907 P00  DEBUG:     storage/storage::storageGet: => {used: 512, size: 512}
2024-02-22 20:49:19.907 P00  DEBUG:     postgres/interface::pgWalFromFile: => {version: 130000, systemId: 7021831570073544007}
2024-02-22 20:49:19.907 P00  DEBUG:     storage/storage::storageNewRead: (this: {type: posix, path: /, write: false}, fileExp: {"/var/lib/postgresql/13/main/pg_wal/000000110001DF1E000000AD"}, param.ignoreMissing: false, param.compressible: false, param.offset: 0, param.limit: null)
2024-02-22 20:49:19.907 P00  DEBUG:     storage/storage::storageNewRead: => {type: posix, name: /var/lib/postgresql/13/main/pg_wal/000000110001DF1E000000AD, ignoreMissing: false}
2024-02-22 20:49:19.910 P00  DEBUG:     common/io/filter/group::ioFilterGroupAdd: (this: {inputSame: false, done: false}, filter: {type: hash})
2024-02-22 20:49:19.910 P00  DEBUG:     common/io/filter/group::ioFilterGroupAdd: => {inputSame: false, done: false}
2024-02-22 20:49:19.910 P00  DEBUG:     common/io/filter/group::ioFilterGroupAdd: (this: {inputSame: false, done: false}, filter: {type: sink})
2024-02-22 20:49:19.910 P00  DEBUG:     common/io/filter/group::ioFilterGroupAdd: => {inputSame: false, done: false}
2024-02-22 20:49:19.938 P00  DEBUG:     common/io/filter/group::ioFilterGroupResult: (this: {inputSame: false, done: true}, filterType: hash, param.idx: 0)
2024-02-22 20:49:19.939 P00  DEBUG:     common/io/filter/group::ioFilterGroupResultPack: (this: {inputSame: false, done: true}, filterType: hash, param.idx: 0)
2024-02-22 20:49:19.939 P00  DEBUG:     common/io/filter/group::ioFilterGroupResultPack: => {Pack}
2024-02-22 20:49:19.939 P00  DEBUG:     common/io/filter/group::ioFilterGroupResult: => {depth: 0, idLast: 0, tagNextId: 0, tagNextType: 0, tagNextValue 0}
2024-02-22 20:49:19.939 P00  DEBUG:     protocol/helper::repoIsLocal: (repoIdx: 0)
2024-02-22 20:49:19.939 P00  DEBUG:     protocol/helper::repoIsLocal: => false
2024-02-22 20:49:19.939 P00  DEBUG:     protocol/helper::protocolRemoteGet: (protocolStorageType: repo, hostIdx: 0)
2024-02-22 20:49:19.939 P00  DEBUG:     protocol/helper::protocolRemoteParamSsh: (protocolStorageType: repo, hostIdx: 0)
2024-02-22 20:49:19.939 P00  DEBUG:     protocol/helper::protocolRemoteParam: (protocolStorageType: repo, hostIdx: 0)
2024-02-22 20:49:19.939 P00  DEBUG:     protocol/helper::protocolRemoteParam: => {["--archive-async", "--exec-id=1701722-03e79a03", "--io-timeout=240", "--log-level-console=off", "--log-level-file=debug", "--log-level-stderr=error", "--log-subprocess", "--process=3", "--remote-type=repo", "--repo=1", "--stanza=redacted", "archive-push:remote"]}
2024-02-22 20:49:19.939 P00  DEBUG:     protocol/helper::protocolRemoteParamSsh: => {["-o", "LogLevel=error", "-o", "Compression=no", "-o", "PasswordAuthentication=no", "pgbackrest@redacted", "pgbackrest --archive-async --exec-id=1701722-03e79a03 --io-timeout=240 --log-level-console=off --log-level-file=debug --log-level-stderr=error --log-subprocess --process=3 --remote-type=repo --repo=1 --stanza=redacted archive-push:remote"]}
2024-02-22 20:49:19.939 P00  DEBUG:     common/exec::execNew: (command: {"ssh"}, param: {["-o", "LogLevel=error", "-o", "Compression=no", "-o", "PasswordAuthentication=no", "pgbackrest@redacted", "pgbackrest --archive-async --exec-id=1701722-03e79a03 --io-timeout=240 --log-level-console=off --log-level-file=debug --log-level-stderr=error --log-subprocess --process=3 --remote-type=repo --repo=1 --stanza=redacted archive-push:remote"]}, name: {"remote-3 process on 'redacted'"}, timeout: 1830000)
2024-02-22 20:49:19.939 P00  DEBUG:     common/exec::execNew: => {Exec}
2024-02-22 20:49:19.939 P00  DEBUG:     common/exec::execOpen: (this: {Exec})
2024-02-22 20:49:19.939 P00  DEBUG:     common/exec::execOpen: => void
2024-02-22 20:49:20.403 P00  DEBUG:     protocol/helper::protocolRemoteGet: => {name: remote-3 ssh protocol on 'redacted', state: idle}
2024-02-22 20:49:20.403 P00  DEBUG:     storage/remote/storage::storageRemoteNew: (modeFile: 0640, modePath: 0750, write: false, pathExpressionFunction: (function *), client: {name: remote-3 ssh protocol on 'redacted', state: idle}, compressLevel: 3)
2024-02-22 20:49:20.406 P00  DEBUG:     storage/remote/storage::storageRemoteNew: => {type: remote, path: /repo, write: false}
2024-02-22 20:49:20.406 P00  DEBUG:     command/archive/find::walSegmentFindOne: (storage: {type: remote, path: /repo, write: false}, archiveId: {"13-1"}, walSegment: {"000000110001DF1E000000AD"}, timeout: 0)
2024-02-22 20:49:20.406 P00  DEBUG:     command/archive/find::walSegmentFindNew: (storage: {type: remote, path: /repo, write: false}, archiveId: {"13-1"}, single: true, timeout: 0)
2024-02-22 20:49:20.406 P00  DEBUG:     command/archive/find::walSegmentFindNew: => {WalSegmentFind}
2024-02-22 20:49:20.406 P00  DEBUG:     command/archive/find::walSegmentFind: (this: {WalSegmentFind}, walSegment: {"000000110001DF1E000000AD"})
2024-02-22 20:49:20.406 P00  DEBUG:     storage/storage::storageList: (this: {type: remote, path: /repo, write: false}, pathExp: {"<REPO:ARCHIVE>/13-1/000000110001DF1E"}, param.errorOnMissing: false, param.nullOnMissing: false, param.expression: {"^000000110001DF1E000000AD-[0-f]{40}(\.gz|\.lz4|\.zst|\.xz|\.bz2){0,1}$"})
2024-02-22 20:49:20.406 P00  DEBUG:     storage/storage::storageNewItr: (this: {type: remote, path: /repo, write: false}, pathExp: {"<REPO:ARCHIVE>/13-1/000000110001DF1E"}, param.level: 0, param.errorOnMissing: false, param.recurse: false, param.nullOnMissing: false, param.sortOrder: 0, param.expression: {"^000000110001DF1E000000AD-[0-f]{40}(\.gz|\.lz4|\.zst|\.xz|\.bz2){0,1}$"}, param.recurse: false)
2024-02-22 20:49:20.406 P00  DEBUG:     storage/iterator::storageItrNew: (driver: *void, path: {"/repo/archive/redacted/13-1/000000110001DF1E"}, level: 3, errorOnMissing: false, nullOnMissing: false, recurse: false, sortOrder: 0, expression: {"^000000110001DF1E000000AD-[0-f]{40}(\.gz|\.lz4|\.zst|\.xz|\.bz2){0,1}$"})
2024-02-22 20:49:20.666 P00  DEBUG:     storage/iterator::storageItrNew: => {stack: {size: 1}}
2024-02-22 20:49:20.666 P00  DEBUG:     storage/storage::storageNewItr: => {stack: {size: 1}}
2024-02-22 20:49:20.666 P00  DEBUG:     storage/storage::storageList: => {[]}
2024-02-22 20:49:20.666 P00  DEBUG:     command/archive/find::walSegmentFind: => null
2024-02-22 20:49:20.666 P00  DEBUG:     command/archive/find::walSegmentFindOne: => null
2024-02-22 20:49:20.666 P00  DEBUG:     storage/storage::storageNewRead: (this: {type: posix, path: /, write: false}, fileExp: {"/var/lib/postgresql/13/main/pg_wal/000000110001DF1E000000AD"}, param.ignoreMissing: false, param.compressible: false, param.offset: 0, param.limit: null)
2024-02-22 20:49:20.666 P00  DEBUG:     storage/storage::storageNewRead: => {type: posix, name: /var/lib/postgresql/13/main/pg_wal/000000110001DF1E000000AD, ignoreMissing: false}
2024-02-22 20:49:20.666 P00  DEBUG:     common/io/filter/group::ioFilterGroupAdd: (this: {inputSame: false, done: false}, filter: {type: lz4-cmp})
2024-02-22 20:49:20.666 P00  DEBUG:     common/io/filter/group::ioFilterGroupAdd: => {inputSame: false, done: false}
2024-02-22 20:49:20.666 P00  DEBUG:     protocol/helper::repoIsLocal: (repoIdx: 0)
2024-02-22 20:49:20.666 P00  DEBUG:     protocol/helper::repoIsLocal: => false
2024-02-22 20:49:20.666 P00  DEBUG:     protocol/helper::protocolRemoteGet: (protocolStorageType: repo, hostIdx: 0)
2024-02-22 20:49:20.666 P00  DEBUG:     protocol/helper::protocolRemoteGet: => {name: remote-3 ssh protocol on 'redacted', state: idle}
2024-02-22 20:49:20.666 P00  DEBUG:     storage/remote/storage::storageRemoteNew: (modeFile: 0640, modePath: 0750, write: true, pathExpressionFunction: (function *), client: {name: remote-3 ssh protocol on 'redacted', state: idle}, compressLevel: 3)
2024-02-22 20:49:20.667 P00  DEBUG:     storage/remote/storage::storageRemoteNew: => {type: remote, path: /repo, write: true}
2024-02-22 20:49:20.667 P00  DEBUG:     storage/storage::storageNewWrite: (this: {type: remote, path: /repo, write: true}, fileExp: {"<REPO:ARCHIVE>/13-1/000000110001DF1E000000AD-56b04711203231e816af474604c1a7e943a3691f.lz4"}, param.modeFile: 0000, param.modePath: 0000, param.user: null, param.group: null, param.timeModified: 0, param.noCreatePath: false, param.noSyncFile: false, param.noSyncPath: false, param.noAtomic: false, param.noTruncate: false, param.compressible: false)
2024-02-22 20:49:20.667 P00  DEBUG:     storage/remote/storage::storageRemoteNewWrite: (this: {StorageRemote}, file: {"/repo/archive/redacted/13-1/000000110001DF1E/000000110001DF1E000000AD-56b04711203231e816af474604c1a7e943a3691f.lz4"}, param.modeFile: 0640, param.modePath: 0750, param.user: null, param.group: null, param.timeModified: 0, param.createPath: true, param.syncFile: true, param.syncPath: true, param.atomic: true, param.compressible: false)
2024-02-22 20:49:20.667 P00  DEBUG:     storage/remote/storage::storageRemoteNewWrite: => {type: remote, name: /repo/archive/redacted/13-1/000000110001DF1E/000000110001DF1E000000AD-56b04711203231e816af474604c1a7e943a3691f.lz4, modeFile: 0640, modePath: 0750, createPath: true, syncFile: true, syncPath: true, atomic: true}
2024-02-22 20:49:20.667 P00  DEBUG:     storage/storage::storageNewWrite: => {type: remote, name: /repo/archive/redacted/13-1/000000110001DF1E/000000110001DF1E000000AD-56b04711203231e816af474604c1a7e943a3691f.lz4, modeFile: 0640, modePath: 0750, createPath: true, syncFile: true, syncPath: true, atomic: true}


@bluthg
Copy link
Author

bluthg commented Mar 14, 2024

@dwsteele , do these shed some light on the issue?

@dwsteele
Copy link
Member

do these shed some light on the issue?

Somewhat, but I have not had a chance to reproduce. Even if I can reproduce it I doubt there is an easy fix. The real answer is to commit the protocol changes in #2108 but there are some performance problems there to address. I'm hoping to get to that in this release cycle.

@dwsteele
Copy link
Member

I spent some time trying to reproduce this but with no luck. I always see the error properly reported and a normal abort. Given that, I think it makes sense to see if #2108 fixes this issue.

@bluthg
Copy link
Author

bluthg commented Mar 27, 2024

#2108 certainly looks like something that will have some influence on this, yeah ;-)
I guess we'll wait until we roll out a release that includes it until further investigation then!

@dwsteele
Copy link
Member

I have made the updates to #2108. We can't be sure that this will fix your issue, but I'm hoping so. It should now be a lot harder for the protocol to get into a bad state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants