CHK resource fails to come up on AWS EKS due to wrong FS permissions on /var/lib/clickhouse-keeper #1370

Open
hodgesrm opened this issue Mar 9, 2024 · 9 comments

@hodgesrm
Member

hodgesrm commented Mar 9, 2024

What's wrong:

The CHK example in 02-extended-1-node.yaml fails if you use the Altinity 23.8.8.21 docker image on EKS. I thought this was an Altinity Stable bug, but it also happens with clickhouse/clickhouse-keeper:23.8.10.43-alpine.

I'm using the 0.23.3 operator installed using helm. I am using Kubernetes 1.26 on AWS EKS installed with our EKS blueprint.

How to Reproduce:
Run kubectl apply -f on this resource.

apiVersion: "clickhouse-keeper.altinity.com/v1"
kind: "ClickHouseKeeperInstallation"
metadata:
  name: chk-1-node-reduced
spec:
  configuration:
    clusters:
      - name: "reduced-1"
        layout:
          replicasCount: 1
  templates:
    podTemplates:
      - name: default
        spec:
          containers:
            - name: clickhouse-keeper
              imagePullPolicy: IfNotPresent
              image: "altinity/clickhouse-keeper:23.8.8.21.altinitystable"
    volumeClaimTemplates:
      - name: default
        metadata:
          name: both-paths
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 25Gi

What happens:

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000000e1a23b in /usr/bin/clickhouse-keeper
1. DB::ErrnoException::ErrnoException(String const&, int, int, std::optional<String> const&) @ 0x0000000000a2a0b4 in /usr/bin/clickhouse-keeper
2. DB::throwFromErrnoWithPath(String const&, String const&, int, int) @ 0x0000000000e1b7e7 in /usr/bin/clickhouse-keeper
3. DB::WriteBufferFromFile::WriteBufferFromFile(String const&, unsigned long, int, std::shared_ptr<DB::Throttler>, unsigned int, char*, unsigned long) @ 0x0000000000eb220b in /usr/bin/clickhouse-keeper
4. DB::DiskLocal::writeFile(String const&, unsigned long, DB::WriteMode, DB::WriteSettings const&) @ 0x00000000009a82f3 in /usr/bin/clickhouse-keeper
5. DB::ChangelogWriter::setFile(std::shared_ptr<DB::ChangelogFileDescription>, DB::WriteMode) @ 0x000000000079ddca in /usr/bin/clickhouse-keeper
6. DB::ChangelogWriter::rotate(unsigned long) @ 0x000000000079c287 in /usr/bin/clickhouse-keeper
7. DB::Changelog::readChangelogAndInitWriter(unsigned long, unsigned long) @ 0x0000000000799625 in /usr/bin/clickhouse-keeper
8. DB::KeeperLogStore::init(unsigned long, unsigned long) @ 0x00000000007f0817 in /usr/bin/clickhouse-keeper
9. DB::KeeperServer::startup(Poco::Util::AbstractConfiguration const&, bool) @ 0x00000000007f7718 in /usr/bin/clickhouse-keeper
10. DB::KeeperDispatcher::initialize(Poco::Util::AbstractConfiguration const&, bool, bool, std::shared_ptr<DB::Macros const> const&) @ 0x00000000007d9438 in /usr/bin/clickhouse-keeper
11. DB::Context::initializeKeeperDispatcher(bool) const @ 0x0000000000a48f00 in /usr/bin/clickhouse-keeper
12. DB::Keeper::main(std::vector<String, std::allocator<String>> const&) @ 0x0000000000b58fdb in /usr/bin/clickhouse-keeper
13. Poco::Util::Application::run() @ 0x0000000000fbf406 in /usr/bin/clickhouse-keeper
14. DB::Keeper::run() @ 0x0000000000b55e5e in /usr/bin/clickhouse-keeper
15. Poco::Util::ServerApplication::run(int, char**) @ 0x0000000000fd6519 in /usr/bin/clickhouse-keeper
16. mainEntryClickHouseKeeper(int, char**) @ 0x0000000000b54dd8 in /usr/bin/clickhouse-keeper
17. main @ 0x0000000000b63bb9 in /usr/bin/clickhouse-keeper
 (version 23.8.10.43 (official build))
2024.03.09 21:01:39.116891 [ 1 ] {} <Debug> KeeperDispatcher: Shutting down storage dispatcher
2024.03.09 21:01:39.116956 [ 1 ] {} <Information> KeeperServer: RAFT doesn't start, shutdown not required
2024.03.09 21:01:39.117825 [ 1 ] {} <Error> void DB::KeeperDispatcher::shutdown(): Code: 49. DB::Exception: Changelog must be initialized before flushing records. (LOGICAL_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x0000000000e1a23b in /usr/bin/clickhouse-keeper
1. DB::Exception::Exception<char const (&) [54]>(int, char const (&) [54]) @ 0x00000000007a6c80 in /usr/bin/clickhouse-keeper
2. DB::Changelog::flushAsync() @ 0x00000000007a6a48 in /usr/bin/clickhouse-keeper
3. DB::Changelog::flush() @ 0x00000000007a64b6 in /usr/bin/clickhouse-keeper
4. DB::KeeperLogStore::flushChangelogAndShutdown() @ 0x00000000007f1587 in /usr/bin/clickhouse-keeper
5. DB::KeeperDispatcher::shutdown() @ 0x00000000007dcabd in /usr/bin/clickhouse-keeper
6. DB::KeeperDispatcher::~KeeperDispatcher() @ 0x00000000007dda70 in /usr/bin/clickhouse-keeper
7. DB::ContextSharedPart::~ContextSharedPart() @ 0x0000000000a4f18f in /usr/bin/clickhouse-keeper
8. DB::SharedContextHolder::~SharedContextHolder() @ 0x0000000000a46a58 in /usr/bin/clickhouse-keeper
9. DB::Keeper::main(std::vector<String, std::allocator<String>> const&) @ 0x0000000000b5bff2 in /usr/bin/clickhouse-keeper
10. Poco::Util::Application::run() @ 0x0000000000fbf406 in /usr/bin/clickhouse-keeper
11. DB::Keeper::run() @ 0x0000000000b55e5e in /usr/bin/clickhouse-keeper
12. Poco::Util::ServerApplication::run(int, char**) @ 0x0000000000fd6519 in /usr/bin/clickhouse-keeper
13. mainEntryClickHouseKeeper(int, char**) @ 0x0000000000b54dd8 in /usr/bin/clickhouse-keeper
14. main @ 0x0000000000b63bb9 in /usr/bin/clickhouse-keeper
 (version 23.8.10.43 (official build))
2024.03.09 21:01:39.117849 [ 1 ] {} <Debug> KeeperDispatcher: Dispatcher shut down
2024.03.09 21:01:39.118624 [ 26 ] {} <Trace> BaseDaemon: Received signal 6
2024.03.09 21:01:39.118719 [ 37 ] {} <Fatal> BaseDaemon: ########## Short fault info ############
2024.03.09 21:01:39.118762 [ 37 ] {} <Fatal> BaseDaemon: (version 23.8.10.43 (official build), build id: 4642563B164611A8691A973CA4983D140F1A7C08, git hash: a278225bba98c092a9b8101e6c02836bbc4d030b) (from thread 1) Received signal 6
2024.03.09 21:01:39.118787 [ 37 ] {} <Fatal> BaseDaemon: Signal description: Aborted
2024.03.09 21:01:39.118799 [ 37 ] {} <Fatal> BaseDaemon:
2024.03.09 21:01:39.118814 [ 37 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000000a374b0
2024.03.09 21:01:39.118823 [ 37 ] {} <Fatal> BaseDaemon: ########################################
2024.03.09 21:01:39.118860 [ 37 ] {} <Fatal> BaseDaemon: (version 23.8.10.43 (official build), build id: 4642563B164611A8691A973CA4983D140F1A7C08, git hash: a278225bba98c092a9b8101e6c02836bbc4d030b) (from thread 1) (no query) Received signal Aborted (6)
2024.03.09 21:01:39.118869 [ 37 ] {} <Fatal> BaseDaemon:
2024.03.09 21:01:39.118883 [ 37 ] {} <Fatal> BaseDaemon: Stack trace: 0x0000000000a374b0
2024.03.09 21:01:39.118916 [ 37 ] {} <Fatal> BaseDaemon: 0. signalHandler(int, siginfo_t*, void*) @ 0x0000000000a374b0 in /usr/bin/clickhouse-keeper
2024.03.09 21:01:39.118929 [ 37 ] {} <Fatal> BaseDaemon: Integrity check of the executable skipped because the reference checksum could not be read.
2024.03.09 21:01:39.118937 [ 37 ] {} <Fatal> BaseDaemon: Report this error to https://github.com/ClickHouse/ClickHouse/issues

Possible Root Cause:
It appears that the paths under /var/lib/clickhouse-keeper come up with root ownership. I confirmed this by hacking the liveness probe so that I could bring up the pod and check permissions.

Mitigations:

  1. If you can run chown -R clickhouse:clickhouse /var/lib/clickhouse-keeper and delete the pod to make it restart, Keeper comes up fine.
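The manual chown above could in principle be automated with an init container, so that ownership is fixed before Keeper starts. The manifest below is only a sketch of that idea, not a tested fix. It assumes that the operator passes `initContainers` from the pod template through to the pod, that the data volume is visible to the pod under the name `both-paths`, and that the `clickhouse` user exists in the Keeper image. The init container name is hypothetical.

```yaml
# Sketch only: runs the chown mitigation in an init container.
# Assumptions (not confirmed in this issue):
#   - the operator passes initContainers through to the generated pod
#   - the volume is exposed to the pod as "both-paths"
#   - the clickhouse user and group exist in the Keeper image
apiVersion: "clickhouse-keeper.altinity.com/v1"
kind: "ClickHouseKeeperInstallation"
metadata:
  name: chk-1-node-reduced
spec:
  configuration:
    clusters:
      - name: "reduced-1"
        layout:
          replicasCount: 1
  templates:
    podTemplates:
      - name: default
        spec:
          initContainers:
            - name: fix-keeper-permissions   # hypothetical name
              # Same image as Keeper, so clickhouse:clickhouse resolves.
              image: "altinity/clickhouse-keeper:23.8.8.21.altinitystable"
              command: ["chown", "-R", "clickhouse:clickhouse", "/var/lib/clickhouse-keeper"]
              securityContext:
                runAsUser: 0                 # chown on root-owned paths needs root
              volumeMounts:
                - name: both-paths           # assumed volume name
                  mountPath: /var/lib/clickhouse-keeper
          containers:
            - name: clickhouse-keeper
              imagePullPolicy: IfNotPresent
              image: "altinity/clickhouse-keeper:23.8.8.21.altinitystable"
    volumeClaimTemplates:
      - name: default
        metadata:
          name: both-paths
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 25Gi
```

Running the init container as root may be blocked by cluster security policies, in which case this approach would not apply.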

**Notes:**
This also fails with clickhouse/clickhouse-keeper:23.8.10.43-alpine.

@Slach
Collaborator

Slach commented Mar 10, 2024

@sunsingerus @alex-zaitsev it looks like the latest version of ClickHouse backported changes for the default user.
We need to change the default security context.

Look at entrypoint.sh in https://github.com/ClickHouse/ClickHouse/tree/master/docker/keeper/

@hodgesrm try adding a securityContext to the pod template:

  templates:
    podTemplates:
      - name: default
        spec:
          securityContext:
            fsGroup: 101 
            runAsUser: 101 
          containers:
            - name: clickhouse-keeper
              imagePullPolicy: IfNotPresent
              image: "altinity/clickhouse-keeper:23.8.8.21.altinitystable" 
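For reference, merging this fragment into the resource from the original report might look like the manifest below. It is a sketch assembled only from the two snippets in this thread. The value 101 is the one suggested above and is assumed to match the clickhouse user and group in the Keeper image.

```yaml
# Original reproduction resource with the suggested securityContext merged in.
# Assumption: UID/GID 101 corresponds to the clickhouse user in the image.
apiVersion: "clickhouse-keeper.altinity.com/v1"
kind: "ClickHouseKeeperInstallation"
metadata:
  name: chk-1-node-reduced
spec:
  configuration:
    clusters:
      - name: "reduced-1"
        layout:
          replicasCount: 1
  templates:
    podTemplates:
      - name: default
        spec:
          securityContext:
            fsGroup: 101      # mounted volumes are made group-accessible to GID 101
            runAsUser: 101    # container processes run as UID 101
          containers:
            - name: clickhouse-keeper
              imagePullPolicy: IfNotPresent
              image: "altinity/clickhouse-keeper:23.8.8.21.altinitystable"
    volumeClaimTemplates:
      - name: default
        metadata:
          name: both-paths
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 25Gi
```

With `fsGroup` set, Kubernetes normally adjusts group ownership of the mounted volume when the pod starts, which would explain why no manual chown is needed in this variant.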

@alex-zaitsev
Member

Planned for 0.23.4

@hodgesrm
Member Author

@Slach your fix works and will hold me until 0.23.4 is available. Thank you both for the quick turnaround.

@alex-zaitsev
Member

@Slach, why don't we need a securityContext for ClickHouse, but do need one for ClickHouseKeeper?

@Slach
Collaborator

Slach commented Apr 17, 2024

@alex-zaitsev
I don't know.
Maybe the entrypoint.sh in the Keeper image is different.
Maybe the clickhouse-keeper binary itself behaves differently during startup.

@orloffv

orloffv commented Apr 23, 2024

any plans here?

@Slach
Collaborator

Slach commented Apr 23, 2024

@orloffv
workaround is here
#1370 (comment)

@orloffv

orloffv commented Apr 23, 2024

Interesting, but that workaround doesn't help in my case.
Only the mitigation from the original report works for me:
"If you can run chown -R clickhouse:clickhouse /var/lib/clickhouse-keeper and delete the pod to make it restart, Keeper comes up fine."

@alex-zaitsev
Member

A security context is needed for both CHI and CHK, so the behavior is consistent now. Maybe we need a separate task to add a default security context to the images, and also to make sure it is merged correctly with a security context provided explicitly by the user.

4 participants