Feature Request/Question: Running this in Docker #23

Open
kaysond opened this issue Jun 14, 2022 · 19 comments
@kaysond

kaysond commented Jun 14, 2022

I'm curious if you've thought about/tried installing or deploying this from within docker itself. It would be really convenient to use docker's own orchestration to get this running on every node. I took a quick look through the code, and I think you could do this using a docker-dind image with host networking (and some CAP_ADD's).

I've done something similar in https://github.com/kaysond/trafficjam

I think the one tricky bit is grabbing the ingress network IPs without having to run the daemon manually on each node. I had a similar issue where I needed to find the internal loadbalancer IPs. I ended up with a very hack-y but totally functional/reliable method: trafficjam runs as a service on a manager node which itself deploys "child" services on all nodes. The child services are able to detect and report their loadbalancer IPs via the docker logs. The parent service monitors those logs, and when it finds new loadbalancer IPs, it restarts all the child services.

@struanb
Contributor

struanb commented Jun 14, 2022

Hi @kaysond. Thanks for the suggestion. We have considered this, as it would be neat to be able to use docker services to deploy DIND.

As you suggest, it's possible to gain most of the required privileges by launching the container using --network=host --cap-add=all, and the resulting service container can perform the iptables reconfiguration to the host ingress network.

However, without also launching the DIND container with --pid=host, the container does not have the ability to access the namespaces of those service containers that DIND needs to administer (as nsenter needs to access /proc/<pid>/ns to do its job). Unfortunately docker service create does not support --pid=host.

Further, it seems the DIND container is not permitted to use sysctl to access the relevant settings, either on the host or in those service containers that DIND needs to administer, without --privileged. This functionality isn't needed in all setups, but when it is, --privileged is currently required. Unfortunately docker service create does not support --privileged either.

And so while one can't launch a DIND service directly, one can launch a DIND container. Here's a Debian POC (which assumes /usr/bin/docker on host is statically compiled and will run in the container, and that the DIND repo is at /opt/dind on the host):

docker run --rm -it --privileged --pid=host -v /usr/bin/docker:/usr/bin/docker -v /var/run/docker:/var/run/docker -v /var/run/docker.sock:/var/run/docker.sock -v /opt/dind:/opt/dind debian bash -c 'apt update && apt -y install iproute2 iptables procps; /opt/dind/docker-ingress-routing-daemon --install --ingress-gateway-ips 10.0.0.2'

As you suggest, it is also possible to bootstrap this DIND container using a wrapper service. Here's a POC of that:

docker service create --name=dind --mode=global --mount=type=bind,src=/usr/bin/docker,dst=/usr/bin/docker --mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock debian docker run --rm --privileged --pid=host -v /usr/bin/docker:/usr/bin/docker -v /var/run/docker:/var/run/docker -v /var/run/docker.sock:/var/run/docker.sock -v /opt/dind:/opt/dind debian bash -c 'apt update && apt -y install iproute2 iptables procps; /opt/dind/docker-ingress-routing-daemon --install --ingress-gateway-ips 10.0.0.2'

As to your remark about "grabbing the ingress network IPs", that step only needs to be done once, before DIND is launched, and in general it is not necessary or desirable to grab all ingress IPs, only those of the nodes one wishes to use as load-balancer(s), as it is this subset of IPs that one passes to --ingress-gateway-ips. As such, I'm not sure that step really benefits from automation. (Though your clever idea for monitoring logs could work). Certainly, it can easily be arranged that the wrapper service logs all nodes' ingress IPs.

And so finally, subject to these limitations, it seems it is possible to arrange for DIND to be launched in this way via a wrapper service.

Two potentially interesting ways to extend DIND, following this exercise, are:

  1. A DIND Dockerfile (and official Docker Hub image), that allows a DIND container to be launched on any host using a simple docker run --privileged [...]
  2. An option to use this image for a wrapper service that launches DIND on every host, using a simple docker service create command.

Please let us know whether (1) and/or (2) sound worth progressing.

@kaysond
Author

kaysond commented Jun 15, 2022

However, without also launching the DIND container with --pid=host, the container does not have the ability to access the namespaces of those service containers that DIND needs to administer (as nsenter needs to access /proc/<pid>/ns to do its job). Unfortunately docker service create does not support --pid=host.

You can bind mount /var/run/docker/netns on the host to /var/run/netns in the container to access the host network namespaces from within the parent service container:

https://github.com/kaysond/trafficjam/blob/da60781708e08d6f0681a713d041e809f85c7e00/trafficjam-functions.sh#L84

Further, it seems the DIND container is not permitted to use sysctl to access the relevant settings, either on the host or in those service containers that DIND needs to administer, without --privileged. This functionality isn't needed in all setups, but when it is, --privileged is currently required. Unfortunately docker service create does not support --privileged either.

I'm less familiar with this one, but --privileged just grants all capabilities. Did you try --cap-add SYS_ADMIN? Actually... --privileged does also lift cgroup limitations; maybe that's what allows you to run sysctl.

In any case, it seems that you're only using sysctl for the performance settings, except for rp_filter, but I think that defaults to 0 anyway. In cases where the distro sets it to 1, I think it's reasonable for users to have to update the setting manually.

in general it is not necessary or desirable to grab all ingress IPs, only those of the nodes one wishes to use as load-balancer(s), as it is this subset of IPs that one passes to --ingress-gateway-ips

I'm assuming by ingress gateways you mean the host IP of the nodes you want to apply the iptables rules on. In that case, it would still be nice to have this happen automatically, because you could then leverage node labels to dynamically assign/unassign load balancer nodes to the daemon.


With the above, I think you can mirror my approach, which is a service that launches services. To me, that's preferable to a service that launches containers. This is closer to #2, but I think you still want an official image a la #1. In my case, I used the same Docker Hub image for both the parent service (one per swarm) and the child service (global; one per node), using an env var to distinguish between the two.

@struanb
Contributor

struanb commented Jun 16, 2022

Hi @kaysond. I've been considering your approach. The problem I see with it is that, while bind-mounting /var/run/docker/netns (or, more broadly, /var/run/docker) will allow the container to access the host network namespaces, DIND also needs to be able to access the network namespaces of each service container it is responsible for configuring. Currently it does this, inside the route_ingress() function, by calling docker inspect -f '{{.State.Pid}}' on the container ID to obtain the host PID of the container's root process, and then calling nsenter -n -t <PID> to enter that container's network namespace. I am not sure this can be performed using the bind-mounted /var/run/docker/netns alone, which is why we found we needed --pid=host. Do you have a solution to this?
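
For illustration only, the mechanism described above amounts to something like the following sketch, run on the host (or with --pid=host); the container ID and the final command are placeholders:

CID=<container-id>                                  # placeholder: a service container ID
PID=$(docker inspect -f '{{.State.Pid}}' "$CID")    # host PID of the container's root process
nsenter -n -t "$PID" iptables -t mangle -L          # run a command inside that container's network namespace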

Otherwise, I believe it may be necessary - and indeed still satisfactory - to have the wrapper global service launching privileged containers. On this, you said you preferred a wrapper service that launches services: I still don't understand the benefits of this, over a wrapper service that launches containers. Could you elaborate on this?

@kaysond
Author

kaysond commented Jun 16, 2022

DIND also needs to be able to access the network namespaces for each service container it is responsible for configuring

Let me dig into this a little bit this weekend and get back to you. If I remember correctly from my previous investigations, all of the network namespaces are available via /var/run/docker/netns, including ingress, and they were trivial to identify by network namespace "ID" without needing any container PIDs. I know that I was able to configure iptables rules on any docker swarm network without needing to know a PID.
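
For illustration, a rough sketch of that idea (the ingress_sbox namespace name is the one the daemon already uses; treat the rest as an assumption to verify):

ls /var/run/docker/netns                                    # namespace files, e.g. ingress_sbox plus per-network entries
nsenter --net=/var/run/docker/netns/ingress_sbox ip addr    # enter the ingress sandbox namespace by name, no container PID needed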

On this, you said you preferred a wrapper service that launches services: I still don't understand the benefits of this, over a wrapper service that launches containers. Could you elaborate on this?

It's minor, but doing it the way I suggested means you have a parent swarm service with one replica running on a manager node, and then another child service running globally or on selected nodes (or based on label, etc.). Your suggestion would have a global parent service container on every node and also the child container on every selected node.

@struanb
Contributor

struanb commented Jun 16, 2022

Given your show of confidence - thanks for that - I've done some digging myself, and I believe I've worked it out.

For a container <c>, run:

nsenter --net=$(docker container inspect <c> -f "{{.NetworkSettings.SandboxKey}}") <command>

This should allow much of DIND to operate without --pid=host. This is good to know; however, I'm still unsure of the benefits of running DIND inside a service container, as opposed to inside a privileged sub-container, since:

  1. The overhead of one extra sub-container per node seems small in the big scheme of things.
  2. The privileged sub-container approach allows setting of sysctl values within the service container.

We now actually have a POC implementation of a DIND service that works this way, which I hope to commit to a branch soon.

P.S. Regarding automatic detection of node ingress network IPs by node labels, I can see some value in this, and the idea of scraping them from service logs is a good one. But we've experienced issues with docker service logs hanging, so I'm concerned about the stability of this implementation. A workaround may be to arrange for each DIND service container to log its ingress IP every few seconds, and to use docker service logs --since=5s, which is probably stable. This procedure could be performed by each DIND service container independently. Or, as you suggested, one could use a further wrapper service with one replica (or just a standalone container).
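
As a rough sketch of that workaround (the service name, variable and log format below are hypothetical):

# In each DIND service container: log this node's ingress IP every few seconds.
while true; do echo "INGRESS_IP=$MY_INGRESS_IP"; sleep 5; done
# In the manager process: poll only recent log lines, avoiding a long-lived log stream that might hang.
docker service logs --since=5s dind-global 2>/dev/null | grep -o 'INGRESS_IP=[0-9.]*' | sort -u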

@struanb
Contributor

struanb commented Jun 16, 2022

The POC implementation of a DIND service can now be found at https://github.com/newsnowlabs/docker-ingress-routing-daemon/tree/service-v1.

P.S. On further consideration of the automatic detection of node ingress network IPs, any process calling docker service logs must run on a manager node. I think this must be a (and perhaps your) rationale for having an outer-wrapper service with one replica (or a standalone container). Will give this more thought.

@kaysond
Author

kaysond commented Jun 17, 2022

For a container <c>, run:

nsenter --net=$(docker container inspect <c> -f "{{.NetworkSettings.SandboxKey}}") <command>

Nice! That's basically what I was doing, except I was entering the network's namespace via the network ID. Am I correct in understanding that you need the container's namespace because you have to add rules there for the outbound packets?

we've experienced issues with moby/moby#38640 so I'm concerned about the stability of this implementation. A workaround may be to arrange that each DIND service container logs its ingress IP every few seconds, and use docker service logs --since=5s, which is probably stable

Interesting. I never ran into that issue in all my (admittedly limited) testing. Your workaround seems like a good idea.

  1. The overhead of one extra sub-container per node seems small in the big scheme of things.

Agreed

  2. The privileged sub-container approach allows setting of sysctl values within the service container

any process calling docker service logs must run on a manager node

IMO the usefulness of being able to have node IPs get picked up automatically and dynamically outweighs the loss of automatic sysctl tweaks. Also, I would say sysctl tweaks are outside the scope of a container orchestration system, but determining the selected gateway nodes and their IPs automatically based on node labels seems to be more in the realm of a container routing daemon!

The POC implementation of a DIND service can now be found at https://github.com/newsnowlabs/docker-ingress-routing-daemon/tree/service-v1.

I'll try to give this a shot over the weekend!

@struanb
Contributor

struanb commented Jun 18, 2022

Hi @kaysond. I've mostly implemented the outer-wrapper service with autodetection of node IPs, and cross-checking of these IPs against node labels, with no loss of sysctl functionality. I expect to push what I have to the branch later today.

To your questions:

Am I correct in understanding that you need the container's namespace because you have to add rules there for the outbound packets?

Yes: for mapping the TOS byte (identifying the load-balancer node) on incoming packets to a connection mark, then mapping the connection mark on outgoing packets to a firewall mark, and then matching the firewall mark to a routing table that routes back via the correct load-balancer node.
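
Schematically, and purely as an illustration (not the daemon's exact rules), assuming the load balancer with ingress IP 10.0.0.2 marks packets with TOS value 0x04, the rules applied inside a service container's network namespace take roughly this shape:

iptables -t mangle -A PREROUTING -m tos --tos 0x04 -j CONNMARK --set-xmark 0x04   # incoming TOS -> connection mark
iptables -t mangle -A OUTPUT -m connmark --mark 0x04 -j MARK --set-xmark 0x04     # connection mark -> firewall mark on replies
ip rule add fwmark 0x04 table 4                                                   # firewall mark selects a per-load-balancer routing table
ip route add default via 10.0.0.2 table 4                                         # route replies back via the originating load balancer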

IMO the usefulness of being able to have node IPs get picked up automatically and dynamically outweighs the loss of automatic sysctl tweaks. Also, I would say sysctl tweaks are outside the scope of a container orchestration system, but determining the selected gateway nodes and their IPs automatically based on node labels seems to be more in the realm of a container routing daemon!

Regarding the sysctl tweaks, you might be correct that these would not be within scope of DIND, except that they proved necessary to make the performance of the system viable (and prevent issues on some kernels), and DIND is a convenient tool for applying them (particularly the service containers). Should we remove that functionality, the user would have to use docker's --sysctl option on all their production services instead, which in my view would make DIND harder to deploy.

@struanb
Contributor

struanb commented Jun 19, 2022

Hi @kaysond.

I'm pleased to let you know that I've now pushed a more complete implementation to the branch, and a ready-built image to Docker Hub. I'd appreciate you trying it out, and commenting on the code. There are several ways to launch it.

To launch as a manager container, with connected terminal (easiest way to demo and see what's going on):

docker run --rm -it --mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock newsnowlabs/dind:latest --install

To launch as a single-replica manager service:

docker service create --name=dind --replicas=1 --constraint=node.role==manager --mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock newsnowlabs/dind:latest --install

(To simplify launching using these commands, please also see the https://github.com/newsnowlabs/docker-ingress-routing-daemon/blob/service-v1/docker/create-service.sh launch script.)

N.B. In the default mode (which is currently hard-coded) the manager service will assume all nodes to be load balancers, unless the DIND-LB=1 label is found on at least one node, in which case the load balancers will be restricted to those nodes with this label.
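
For example, a node can be designated as a load balancer with this label as follows (the node name is a placeholder):

docker node update --label-add DIND-LB=1 <node-name>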

I've tested this out in a vanilla Docker Swarm launched on Google Cloud:

  1. With a variety of different cluster sizes
  2. Growing the cluster size on-the-fly
  3. Adding the DIND-LB=1 label to some nodes, then to all nodes, then removing the label (or setting DIND-LB=0)

Looking forward to hearing from you, once you've had a chance to look at this.

@kaysond
Author

kaysond commented Jun 23, 2022

I haven't forgotten about this! Just haven't had time. Should be able to take a look this weekend.

@struanb
Contributor

struanb commented Jun 23, 2022

Thanks so much @kaysond. Given the changes in this branch do not remotely touch the core code, and that the documentation for the new service added to README is marked 'experimental', I think the service-v1 branch is probably safe to merge into main soon, but I would nevertheless appreciate your feedback should you get time to look at it.

P.S. See https://github.com/newsnowlabs/docker-ingress-routing-daemon/tree/service-v1 for the updated documentation.

@struanb
Contributor

struanb commented Jun 27, 2022

Following further testing, I have applied several fixes and improvements to the branch:

  • fix bug in passing parameters to 'docker service update' that prevented updates to the dind-global service from occurring
  • add --update-parallelism=0 to the dind-global global service - makes updates to the service take effect in constant time (i.e. much faster)
  • reinstate container ingress policy routing tables from scratch: when DIND is restarted with a --ingress-gateway-ips list that is incompatible with the prior list (e.g. in --indexed-ids mode, if the IP list ordering has changed, causing the indexes and thus the routing table numbers for the listed IPs to change), this allows --preexisting to correctly recreate container policy routing tables for containers that already had tables configured

@kaysond
Author

kaysond commented Jun 27, 2022

Still on my radar, sorry! I am very excited to play around with this when I get a chance.

@kaysond
Author

kaysond commented Jul 4, 2022

@struanb - Alright I finally had a chance to test this out! It worked just fine with a dummy traefik and whoami service. Here are my thoughts:

  • There seems to be a hole/bug in the iptables rules (probably unrelated to this new method of deployment) where if I make a request from one of the hosts to itself, it comes from some docker network IP rather than 127.0.0.1 or the host's IP (see below). I do have some services that make requests to other services via my domain name, which maps to the host IP, so I'd like that IP to be correct for logs, etc. I wonder if you're getting caught by something like what I ran into in trafficjam, which is that requests from local processes (i.e. the docker port-mapping daemon) hit different chains (INPUT/OUTPUT) than something coming over an interface (PREROUTING/POSTROUTING).

  • I would suggest changing the image/service names. dind is pretty generic for docker-in-docker, so it doesn't really tell me what the image or services are. docker-ingress-routing-daemon, docker-ingress-routing-daemon-global, etc, while long, would be totally fine IMO.

  • I noticed that you're using --pid=host, but do you really need the host's process namespace? Is the network namespace not enough? You could do --network host instead, and it would be more secure...

  • Is there a reason you're using debian as the image base and installing docker manually? You could just use the official docker image, which also runs on alpine so it's a bit leaner (see my Dockerfile)

  • The documentation was a little confusing. Even though I already knew how it worked, there were so many deployment options that it threw me for a loop. I have a few suggestions here. The first is to remove the command for running a manager container. It goes against the paradigm of swarm, which is services. The second is to add docker-compose examples! Finally, though it's dependent on my final thought, you might also consider removing the command to run the global service directly, without a manager service.

  • To me, the entire deployment is a little clunky. I create a manager-constrained 1-replica service that creates a global service, that in turn creates a child container on every node.

    • I don't like the idea of creating vanilla containers on swarm because I can't observe them all through normal swarm means on a manager node.
    • I think the manager functionality could just be built-in to the global service, rather than being a separate optional service. You could, from within the global service, check if you're on a manager node (by doing docker node ls, for example).
    • Ideally, I'd deploy a manager-constrained, 1-replica service that creates a global service that does everything else that's needed

This gets back to the sysctl discussion, and I still don't think the complexity is worth it (especially since, as far as I can tell, what you're setting is the default in modern Debian releases). It's worth noting that at first I didn't realize the sysctls are set in the ingress namespace; I see no issue with setting them automatically, if only it could be done without requiring --privileged...

I did a little bit of research to see if there's a workaround, and mostly came up empty-handed. It looks like /proc/sys gets mounted read-only in containers, which prevents sysctl -w from working. Unmounting it to expose the parent /proc mount, which is rw, doesn't help, because then you get a permission denied. From this it seems like it's related to AppArmor/SELinux, though in my case disabling AppArmor altogether didn't do anything. There are a handful of PRs/issues related to getting --privileged or other security-related CLI options into docker service create, but none so far have actually been completed. There is a method for manipulating SELinux via the API, though.

It may be worth a little bit more research (or maybe another GitHub issue on moby) to see if there is a workaround for setting host sysctls in a network namespace.

Example of the bug (the same thing happens if I use the host's local IP instead of 127.0.0.1):

administrator@mercy:~$ curl --header "Host:whoami.fake.com" http://127.0.0.1
Hostname: 617d06a272c4
IP: 127.0.0.1
IP: 10.0.3.3
IP: 172.18.0.3
RemoteAddr: 10.0.3.6:52788
GET / HTTP/1.1
Host: whoami.fake.com
User-Agent: curl/7.74.0
Accept: */*
Accept-Encoding: gzip
X-Forwarded-For: 172.18.0.1
X-Forwarded-Host: whoami.fake.com
X-Forwarded-Port: 80
X-Forwarded-Proto: http
X-Forwarded-Server: 098f864fa5e7
X-Real-Ip: 172.18.0.1

@struanb
Contributor

struanb commented Jul 5, 2022

Hi @kaysond. Thanks ever so much for looking at the branch, and for your thoughtful feedback.

  • There seems to be a hole/bug in the iptables rules (probably unrelated to this new method of deployment) where if I make a request from one of the hosts to itself, it comes from some docker network IP rather than 127.0.0.1 or the host's IP (see below). I do have some services that make requests to other services via my domain name, which maps to the host IP, so I'd like that IP to be correct for logs, etc. I wonder if you're getting caught by something like what I ran into in trafficjam, which is that requests from local processes (i.e. the docker port-mapping daemon) hit different chains (INPUT/OUTPUT) than something coming over an interface (PREROUTING/POSTROUTING).

This hole is very real; we experience it too. At the time we created DIND, we decided it was safer to work around it by allowing access to the relevant services via the Docker bridge IP range. I would agree this is not the best approach. According to our records, the alternative was to install one of these example iptables rules: iptables -t nat -A INPUT -i docker+ -j SNAT --to-source <node-ip> or iptables -t nat -A INPUT -s 172.0.0.0/24 -j SNAT --to-source <node-primary-ip> (neither is perfect, as they assume the name and the IP range of docker bridges, respectively, both of which can be changed).

I think it may be appropriate to extend DIND (at least provide an option) to add a firewall rule of this sort to each host. What do you think, and what form of rule should it take?

  • I would suggest changing the image/service names. dind is pretty generic for docker-in-docker, so it doesn't really tell me what the image or services are. docker-ingress-routing-daemon, docker-ingress-routing-daemon-global, etc, while long, would be totally fine IMO.

I agree dind may be confused with docker-in-docker, and now that you draw our attention to it, it is so unnecessary... at the very least, DIND looks like a typo that became habitual, as it isn't an acronym for Docker Ingress Routing Daemon anyway. I suspect the confusion arose from an alternative name we may have used, Docker Ingress Network Daemon.

We had better at least rename dind/DIND to dird/DIRD everywhere, and we'll consider where appropriate to use the longer form name docker-ingress-routing-daemon too. Thanks for spotting this!

  • I noticed that you're using --pid=host, but do you really need the host's process namespace? Is the network namespace not enough? You could do --network host instead, and it would be more secure...

I believe you're right, judging by my earlier comment (#23 (comment)). Thanks for highlighting. We will look at this again.

  • Is there a reason you're using debian as the image base and installing docker manually? You could just use the official docker image, which also runs on alpine so it's a bit leaner (see my Dockerfile)

It was convenient. I agree an alpine-based image would be leaner. Thanks for suggesting. We can certainly try it.

  • The documentation was a little confusing. Even though I already knew how it worked, there were so many deployment options that it threw me for a loop. I have two suggestions here. The first is to remove the command for running a manager container. It goes against the paradigm of swarm, which is services.

I agree this is unnecessary, though I found it a very useful way to run the manager container when debugging.

The second is to add docker-compose examples!

Good idea. Would you like to contribute some?

Finally, though it's dependent on my final thought, you might also consider removing the command to run the global service directly, without a manager service.

This would be a simplification, but it might be an over-simplification. After all, if you don't need node load-balancer autodiscovery, you don't need the manager service at all. I suspect it could be useful to retain this bit of explanation for people whose load-balancer node(s) are fixed and who prefer to avoid the risk of any mis-configuration leading to an inadvertent restart of the global service. (NewsNow's systems fall into this category.)

  • To me, the entire deployment is a little clunky. I create a manager-constrained 1-replica service that creates a global service, that in turn creates a child container on every node.

    • I don't like the idea of creating vanilla containers on swarm because I can't observe them all through normal swarm means on a manager node.

I agree it would be neater not to need the subcontainer, but I'm still not sure I see an objective concern here. A subcontainer is a bit like a subprocess, and you can't observe all your swarm service container's subprocesses from manager nodes either. So this preference seems subjective.

  • I think the manager functionality could just be built-in to the global service, rather than being a separate optional service. You could, from within the global service, check if you're on a manager node (by doing docker node ls, for example).

It's correct that a global service container can check whether it is running on a manager node. However, we need the manager process to run on one and only one manager node. I can't see how that can be done from within the global service. This is why we need the manager-constrained 1-replica service.

  • Ideally, I'd deploy a manager-constrained, 1-replica service that creates a global service that does everything else that's needed

So I agree with this.

This gets back to the sysctl discussion, and I still don't think the complexity is worth it (especially since, as far as I can tell, what you're setting is the default in modern Debian releases). It's worth noting that at first I didn't realize the sysctls are set in the ingress namespace; I see no issue with setting them automatically, if only it could be done without requiring --privileged

I'm not convinced that we can rely on correct values being set by the operating system. These defaults may be different between distributions, and between distribution versions, and also subject to change over time. We can't possibly test them all.

And because of the early-day performance issues we experienced, I remain convinced that setting these sysctl values is really important.

Although I'm sensitive to security concerns, right now I'm still not sure I see an objective concern here. The original mode of operation of the daemon is to run it as root on the host anyway, and so running it in a privileged container does not seem to grant any additional privileges to the daemon, and avoiding --privileged - while use of this option is frowned upon - doesn't seem to present any practical security benefit here.

It may be worth a little bit more research (or maybe another GitHub issue on moby) to see if there is a workaround for setting host sysctls in a network namespace

If we can discover a more restricted capability that can be granted to allow the subcontainer to still run sysctl within the service container namespace - or better still, to allow the global service to do so - I would agree with dropping the need for the subcontainer then.

The only way I see to avoid needing the sysctl now would be to require the administrator to (re)launch all their services with --sysctl/--sysctl-add instead. That's not a ridiculous idea, but it does have a major downside: it would mean installing DIRD was no longer plug-and-play, as you'd have to update all your services using --sysctl-add first. Is this really preferable?

@kaysond
Author

kaysond commented Jul 5, 2022

I think it may be appropriate to extend DIND (at least provide an option) to add a firewall rule of this sort to each host. What do you think, and what form of rule should it take?

Agreed. If you want to keep things simple, I would suggest using the interface-based rule. I think docker internally rotates the subnets it uses, and once it runs through some set of the 172.16.0.0/12 subnets, it starts using 192.168.0.0/16. Interfaces are much less likely to be different, and you could have an environment variable option to change the interface name.
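
For instance, a sketch of such a rule with a configurable interface name (the DIRD_BRIDGE_IFACE variable is hypothetical, and <node-ip> is the node's own IP):

iptables -t nat -A INPUT -i "${DIRD_BRIDGE_IFACE:-docker+}" -j SNAT --to-source <node-ip>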

If you wanted to get fancy, since you already have access to the docker socket, you could actually scan for all the subnets being used, or the interface names even. Not sure it's worth the effort, though.

The second is to add docker-compose examples!

Good idea. Would you like to contribute some?

Happy to. Once things are a little more finalized I can whip them up.

After all, if you don't need node load-balancer autodiscovery, you don't need the manager service at all. I suspect it could be useful to retain this bit of explanation for people whose load-balancer node(s) are fixed and prefer to avoid the risk of any mis-configuration leading to inadvertent restart of the global service? (NewsNow's systems fall into this category).

If you combined the manager service functionality into the global service, presumably it would only perform autodiscovery when the load balancer IPs are not specified.

I agree it would be neater not to need the subcontainer, but I'm still not sure I see an objective concern here. A subcontainer is a bit like a subprocess, and you can't observe all your swarm service container's subprocesses from manager nodes either. So this preference seems subjective.

I may not be able to see all of the subprocesses, but I can see all of the containers across the swarm, and the swarm has visibility into every container as well because the nodes report back. The problem with manually deploying a container is that now you're responsible for handling orchestration yourself, which to me is bad practice. What happens if the container crashes or gets killed? Or updates to a new version that doesn't work? What about healthchecks to make sure it's still running? Sure, you can probably try to address these things from within the global service, but the whole point of docker swarm is to do this all for you.

It's correct a global service container can check to see if it is running on a manager node. However, we need the manager process to run on one and only one manager node. I can't see how that can be done from within the global service. This is why we need the manager-constrained 1-replica service.

I think I mixed myself up a bit here because I was trying to extend my methodology from trafficjam. I also had a single 1-replica, manager-constrained service that performed the autodiscovery, but it could just launch a global service with the necessary capabilities to create firewall rules on each node (since I didn't need --privileged).

I'm not convinced that we can rely on correct values being set by the operating system. These defaults may be different between distributions, and between distribution versions, and also subject to change over time. We can't possibly test them all.

That's fair.

And because of the early-day performance issues we experienced, I remain convinced that setting these sysctl values is really important.

I don't doubt it; I'm just not sure it's appropriate for a firewall/routing daemon to configure kernel settings, and I still don't think it's worth the complexity/confusion and lack of orchestration I mentioned above.

Most (all?) of the enterprise software I've worked with recommends the necessary kernel settings, and even warns when they're not set, but doesn't actually set them. Though, to be fair, since this is only being set in the ingress network namespace... I wouldn't really be opposed to that happening automatically if it had less of a cost.

Although I'm sensitive to security concerns

I actually have no problems with running --privileged on a container/service like this that's isolated from external networks! My concern is the complexity/orchestration/philosophy of how/where kernel settings are changed.

If we can discover a more restricted capability that can be granted to allow the subcontainer to still run sysctl within the service container namespace - or better still, to allow the global service to do so - I would agree with dropping the need for the subcontainer then.

I created moby/moby#43769. We can see if there are any useful responses.

The only way I see to avoid needing the sysctl now would be to require the administrator to (re)launch all their services with --sysctl/--sysctl-add instead. That's not a ridiculous idea, but it does have a major downside: it would mean installing DIRD was no longer plug-and-play, as you'd have to update all your services using --sysctl-add first. Is this really preferable?

Does that actually work? Isn't that setting the sysctl inside each service's namespace, whereas you are setting the sysctl in the ingress_sbox namespace? It would definitely be clunkier than the extra container. An alternative could be to just have the user set the sysctl in the right namespace via the host.
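
For example, a sketch of setting a value in the ingress_sbox namespace from the host (the specific key and value here are only illustrative, rp_filter being the setting named earlier in this thread):

nsenter --net=/var/run/docker/netns/ingress_sbox sysctl -w net.ipv4.conf.all.rp_filter=0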

@kaysond
Author

kaysond commented Jul 5, 2022

Ok so I think I found a good "workaround" that addresses some of my concerns with the extra child container. (Maybe you're doing this or something similar already, but from what I remember of the code this is not the case.) Take a look at moby/swarmkit#1030 (comment)

Rather than running a global service of your own image which then manages the child container, you use the global service as a dummy that simply launches the child container as its command. So the manager service would execute something like:

docker service create --mode=global --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock docker:20.10.11 docker run --rm --privileged newsnowlabs/dird --some-flags --some-more-flags

I'm not sure if the official docker image's entrypoint supports commands like this, but it would be easy to implement the principle. The point is that the main process of the service is actually the docker run command, and since it's in the foreground, it is the child container process!

@kaysond
Author

kaysond commented Jul 7, 2022

Ok so the conclusion of moby/moby#43769 is basically that we're not going to see privileged services anytime soon, but the workaround in the above comment is the generally accepted solution.

So I was thinking a good way to do this all is:

  1. User deploys manager-constrained 1-replica service
  2. Manager creates a global service whose command runs a privileged docker container that only modifies the sysctls as needed
  3. The above docker container will exit (and also remove the corresponding global service)
  4. Manager then creates another global service with only cap_add's that handles all of the iptables stuff
  5. The second global service stays active

This is more secure because it only uses the privileged docker container for the sysctl settings, while the persistent service has limited capabilities. It's clean, because the privileged container runs once on startup and then exits. And it's robust, because you get the docker swarm orchestration monitoring the whole thing!
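
As a rough sketch of steps 2-4 (the service names, the --sysctls-only flag and the capability list are hypothetical, not existing options):

# Step 2: one-shot global service; a privileged sub-container applies the sysctls, then exits.
docker service create --name=dird-sysctl --mode=global --restart-condition=none --mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock docker:20.10 docker run --rm --privileged newsnowlabs/dird --sysctls-only
# Step 4: long-running global service with limited capabilities, handling the iptables/routing work.
docker service create --name=dird-routing --mode=global --cap-add=NET_ADMIN --mount=type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock newsnowlabs/dird --install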

Let me know what you think

@struanb
Contributor

struanb commented Mar 19, 2023

Hi @kaysond. Apologies for the long delay in resuming work on this issue. We have been affected by illness in our team.

For now, we are happy enough with the service model implemented in the service-v1 branch, and do not believe reimplementing it to eliminate child containers is worth it (at least, for us), considering the gains.

We do still want to bugfix the service-v1 branch and merge it into main, and this may be something we come back to later this year.

On this basis, we will keep this issue open for now.
