Closing the terminal doesn't kill the node in ROS2 Humble running on Docker #721

Closed
audrow opened this issue Feb 22, 2024 · 13 comments

@audrow

audrow commented Feb 22, 2024

Copied from issue on ros2/ros2 by @maxkonrad

Also see @fujitatomoya's comment.

Bug report

Required Info:

  • Operating System:
    • Docker running Ubuntu 22.04
  • Installation type:
    • via docker
  • Version or commit hash:
    • ros:humble repo (using FROM ros:humble command in Dockerfile)
  • DDS implementation:
    • Not sure; I think it is the default.
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

1- I connected to the Jetson Nano host via SSH from the Terminator terminal on my Ubuntu 22.04 PC.
2- I ran a Docker container built from the following Dockerfile:


RUN apt-get update && apt-get install -y nano && rm -rf /var/lib/apt/lists/*

COPY config/ /site_config/

ARG USERNAME=ros
ARG USER_UID=1000
ARG USER_GID=$USER_UID

RUN groupadd --gid $USER_GID $USERNAME \
 && useradd -s /bin/bash --uid $USER_UID --gid $USER_GID -m $USERNAME \
 && mkdir /home/$USERNAME/.config && chown $USER_UID:$USER_GID /home/$USERNAME/.config

RUN apt-get update \
 && apt-get install -y sudo \
 && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME\
 && chmod 0440 /etc/sudoers.d/$USERNAME \
 && rm -rf /var/lib/apt/lists/*

COPY entrypoint.sh /entrypoint.sh
COPY bashrc /home/$USERNAME/.bashrc

COPY /my_py_pkg /src/my_py_pkg

ENTRYPOINT [ "/bin/bash", "/entrypoint.sh" ]

CMD ["bash"] 

3- The my_py_pkg Python package has two std_msgs.msg Int64 publishers: one of them publishes to the /number_count topic, and both of them publish to the /number topic. (I don't know whether two nodes publishing to one topic is a problem.)

4- Close the terminal or change the network.

Expected behavior

I expected the running nodes to be killed.

Actual behavior

The nodes still show up when I run ros2 node list, but when I run ros2 lifecycle set <topic name> shutdown it returns Node not found in the terminal. I don't know whether the node is alive or not.

Additional information

[Image: Screenshot from 2024-02-13 14-24-03]

@audrow
Author

audrow commented Feb 22, 2024

Also, from some discussion at our weekly triage meeting, it seems like the issue may be in how signals are handled by the entry point.

@mikaelarguedas
Contributor

Thanks for reporting. @maxkonrad @audrow

> ros:humble repo (using FROM ros:humble command in Dockerfile)
> it seems like the issue may be in how signals are handled by the entry point.

Is it possible to provide a reproducible example by sharing the full Dockerfile and the other files copied into the container (e.g. the entrypoint.sh)?

Can you reproduce the issue with a vanilla ros:humble image, without extra custom configs and files?

@maxkonrad

I will try to reproduce with both again today and share the process I followed. Sorry for the late answer, I was busy these days.

@fujitatomoya

First of all, is the container still in the running state after you close the terminal? In other words, how did you start the container (e.g. docker run xxx)? Can you provide all the options? If the container is daemonized, it is expected to keep running the application after the SSH session is killed.

A couple more questions:

  • Do you use the host network? That is, are the containers running in the same host network, like localhost communication? (This can be answered by the docker run options question above.)
  • Can you still observe the node and topics after a certain time, like 1 minute later? I think it takes some time to un-discover the participant and endpoint.
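
The "check again after a while" question above can be turned into a small polling loop. This is only a minimal sketch and not something from the original report: the helper name wait_until_absent is hypothetical, and it assumes the listing command (for example ros2 node list) prints one name per line.

```shell
#!/bin/bash
# Hypothetical helper: poll a listing command until NAME no longer appears.
# Usage: wait_until_absent NAME TIMEOUT_SECONDS COMMAND [ARGS...]
# Returns 0 once NAME is absent, 1 if it is still listed after the timeout.
# Note: if COMMAND itself fails and prints nothing, NAME counts as absent.
wait_until_absent() {
  local name="$1" timeout="$2"
  shift 2
  local waited=0
  # grep -x matches whole lines only, -F treats NAME as a literal string.
  while "$@" 2>/dev/null | grep -qxF -- "$name"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

For example, wait_until_absent /my_node 60 ros2 node list would wait roughly a minute for a node called /my_node (a placeholder name) to drop out of the list. If it returns 1, the node is probably still alive rather than just a stale discovery entry.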

thanks,

@maxkonrad

maxkonrad commented Mar 4, 2024

> Thanks for reporting. @maxkonrad @audrow
>
> > ros:humble repo (using FROM ros:humble command in Dockerfile)
> > it seems like the issue may be in how signals are handled by the entry point.
>
> Is it possible to provide a reproducible example by providing a full dockerfile and other files copied in the container (e.g. the entrypoint.sh)?
>
> Can you reproduce the issue with a vanilla ros:humble image without extra custom configs and files ?

entrypoint.sh:

#!/bin/bash

set -e

source /opt/ros/humble/setup.bash

echo "Provided arguments: $@"

exec $@
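
A side note on the last line of this entrypoint.sh: the unquoted $@ re-splits arguments on whitespace, whereas "$@" passes them through unchanged. This is most likely unrelated to the orphaned-process behavior discussed in this thread; it is shown only as a hedged sketch of what a more defensive exec "$@" would change. The function names below are made up for the demonstration.

```shell
#!/bin/bash
# Hypothetical demo functions, not part of the reported entrypoint.sh.

# Print how many arguments were received.
count_args() { echo "$#"; }

# Forward arguments the way `exec $@` would: unquoted, so each
# argument is split again on whitespace.
count_unquoted() { count_args $@; }

# Forward arguments the way `exec "$@"` would: each argument stays intact.
count_quoted() { count_args "$@"; }

count_unquoted "hello world"   # prints 2
count_quoted "hello world"     # prints 1
```

With the quoted form, a command such as bash -c "ros2 run pkg node" would reach exec as three arguments instead of being split into six.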

bashrc:

source /opt/ros/humble/setup.bash
source /usr/share/colcon_argcomplete/hook/colcon-argcomplete.bash

dockerfile:

FROM osrf/ros:humble-desktop-full

RUN apt-get update && apt-get install -y nano && rm -rf /var/lib/apt/lists/*

COPY config/ /site_config/

ARG USERNAME=ros
ARG USER_UID=1000
ARG USER_GID=$USER_UID

# Creating a non-root user
RUN groupadd --gid $USER_GID $USERNAME \
  && useradd -s /bin/bash --uid $USER_UID --gid $USER_GID -m $USERNAME \
  && mkdir /home/$USERNAME/.config && chown $USER_UID:$USER_GID /home/$USERNAME/.config

# Set-up sudo
RUN apt-get update \
  && apt-get install -y sudo \
  && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME\
  && chmod 0440 /etc/sudoers.d/$USERNAME \
  && rm -rf /var/lib/apt/lists/*

COPY entrypoint.sh /entrypoint.sh
COPY bashrc /home/$USERNAME/.bashrc

COPY /my_py_pkg /src/my_py_pkg
ENTRYPOINT [ "/bin/bash", "/entrypoint.sh" ]

CMD ["bash"]

my_py_pkg simply contains basic number publisher and subscriber scripts to test connection.
build command:
sudo docker image build -t jetson_docker .

run command:
sudo docker run -it --user ros --network=host --ipc=host -v $PWD/source:/my_py_pkg jetson_docker

!!! Important -> I am connected to the Jetson via SSH, and I closed the terminal on my host.

I will try to reproduce the issue again with the ros:humble image in a few minutes.

@maxkonrad

> 1st of all, container is still running state after close the terminal? in other word, how did you start the docker e.g docker run xxx? can you provide the all options. if container is daemonized, it should be running the application after killing the ssh session.
>
> a couple of more questions.
>
>   • do you use the host network? means containers are running in the same host network like localhost communication? (this can be answered by docker run option question above.)
>   • can you observe the node and topics after certain time like 1 min later? i think it takes some time to un-discover the participant and endpoint.
>
> thanks,

docker run command: sudo docker run -it --user ros --network=host --ipc=host -v $PWD/source:/my_py_pkg <img_name>

Yes, they are on the same network.

I will try again today to reproduce the issue. As you said, maybe it takes time to un-discover because of the SSH connection or Docker?

@maxkonrad

I quickly prepared a video for this: link to YouTube video

@fujitatomoya

> maybe it takes time to un-discover because of ssh connection or docker??

Besides this, can you check the container status with docker ps -a? I think the container is supposed to be in the exited state after closing the terminal.

@maxkonrad

maxkonrad commented Mar 4, 2024

> > maybe it takes time to un-discover because of ssh connection or docker??
>
> besides this, can you check that container status with docker ps -a? i think the container is supposed to be exited status after closing the terminal.

No, actually I only closed one Docker terminal session, which I had created with the exec command. The Docker container itself is still running. @fujitatomoya

@maxkonrad

maxkonrad commented Mar 4, 2024

Also, I just realized I wasn't using osrf's desktop image on the Jetson (besides, there is no ARM image for the osrf ROS 2 desktop, afaik). By mistake I copied the wrong code from a private repo; only the FROM line needs to change, to FROM ros:humble. I know that makes this unrelated to osrf and more about ros, so maybe I should move this issue again. Sorry again for the mistake. @audrow

After the correction, the Dockerfile should be as follows:

FROM ros:humble

RUN apt-get update && apt-get install -y nano && rm -rf /var/lib/apt/lists/*

COPY config/ /site_config/

ARG USERNAME=ros
ARG USER_UID=1000
ARG USER_GID=$USER_UID

# Creating a non-root user
RUN groupadd --gid $USER_GID $USERNAME \
  && useradd -s /bin/bash --uid $USER_UID --gid $USER_GID -m $USERNAME \
  && mkdir /home/$USERNAME/.config && chown $USER_UID:$USER_GID /home/$USERNAME/.config

# Set-up sudo
RUN apt-get update \
  && apt-get install -y sudo \
  && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME\
  && chmod 0440 /etc/sudoers.d/$USERNAME \
  && rm -rf /var/lib/apt/lists/*

COPY entrypoint.sh /entrypoint.sh
COPY bashrc /home/$USERNAME/.bashrc

ENTRYPOINT [ "/bin/bash", "/entrypoint.sh" ]

CMD ["bash"]

Sorry again for the mistake, I am new to the software and open source world :(

@fujitatomoya

fujitatomoya commented Mar 4, 2024

I can reproduce this issue in my environment without ROS. The current workaround is to make sure you exit the process spawned by docker exec before closing the terminal. In other words, this is a Docker issue, not a ROS issue.

### start container
tomoyafujita@~/DVT/work >docker run -it --network=host --ipc=host test
Provided arguments: bash
root@tomoyafujita:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 22:30 pts/0    00:00:00 bash
root          46       1  0 22:30 pts/0    00:00:00 ps -ef
root@tomoyafujita:/#

### start another session
tomoyafujita@~/DVT >docker exec -it c1922deefec2 /bin/bash
root@tomoyafujita:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 22:30 pts/0    00:00:00 bash
root          47       0  0 22:31 pts/1    00:00:00 /bin/bash
root          54      47  0 22:31 pts/1    00:00:00 ps -ef

### closing terminal without exit
root@tomoyafujita:/# sleep 60

root@tomoyafujita:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 22:30 pts/0    00:00:00 bash
root          47       0  0 22:31 pts/1    00:00:00 /bin/bash
root          56      47  0 22:32 pts/1    00:00:00 sleep 60
root          57       1  0 22:32 pts/0    00:00:00 ps -ef

### give it 60 seconds
root@tomoyafujita:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 22:30 pts/0    00:00:00 bash
root          47       0  0 22:31 pts/1    00:00:00 /bin/bash
root          59       1  0 22:32 pts/0    00:00:00 ps -ef

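The problem is that PID 47 (the /bin/bash started by docker exec) is still alive after the terminal is closed. That is why its child process sleep (which could just as well be a ros2 command) stayed alive for the full 60 seconds, until it finished on its own.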

moby/moby#9098 seems related.
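
To spot leftover docker exec sessions like PID 47 from inside the container, one option is to look for processes whose parent PID is 0, other than PID 1. The following is a minimal sketch under that assumption, which is taken from the ps output above and may not hold for other setups (for example with a shared PID namespace). The helper name list_exec_roots is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "PID PPID" pairs on stdin and print the PIDs
# whose parent is 0, skipping PID 1 (the container's main process).
# Assumption: processes started via docker exec show up with PPID 0,
# as PID 47 does in the ps output above.
list_exec_roots() {
  awk '$2 == 0 && $1 != 1 { print $1 }'
}

# Typical use inside the container (procps ps, headers suppressed):
#   ps -eo pid=,ppid= | list_exec_roots
```

Each PID it prints is the root of an exec session. Whether it is safe to kill one depends on what is still attached to it, so check with ps -ef first.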

@maxkonrad

maxkonrad commented Mar 5, 2024

@fujitatomoya thanks so much, I think mods can close this issue then.

@tfoote
Contributor

tfoote commented Mar 5, 2024

Yeah, looks like an upstream issue with exec. Closing here.

@tfoote tfoote closed this as completed Mar 5, 2024