Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Socket Connect hangs until timeout runs out in mirrored mode. #10855

Open
1 of 2 tasks
MichiBab opened this issue Nov 30, 2023 · 20 comments
Open
1 of 2 tasks

Socket Connect hangs until timeout runs out in mirrored mode. #10855

MichiBab opened this issue Nov 30, 2023 · 20 comments
Labels

Comments

@MichiBab
Copy link

MichiBab commented Nov 30, 2023

Windows Version

Microsoft Windows [Version 10.0.22631.2715]

WSL Version

2.0.9.0 and 2.0.14 tested

Are you using WSL 1 or WSL 2?

  • WSL 2
  • WSL 1

Kernel Version

5.15.133.1-1

Distro Version

Ubuntu 22.04

Other Software

No response

Repro Steps

Call this script in mirrored mode as normal user or sudo (try both):

import socket
import time
import os

def is_port_in_use(port):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # s.settimeout(2)  # set timeout directly

        print(f"Socket type: {s.type}")
        print(f"Socket family: {s.family}")
        print(f"Socket timeout: {s.gettimeout()} seconds")
        # Get and print socket options
        print(f"SO_REUSEADDR option: {s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)}")
        print(f"SO_KEEPALIVE option: {s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)}")
        print(f"TCP_NODELAY option: {s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)}")
        print("checking: ", port)
        print("Timeout: ",s.gettimeout())
        return s.connect_ex(("localhost", port)) == 0
    
    
print("Uid: ", os.getuid())
start_time = time.time()
print("Is Port in use:", is_port_in_use(6006))
print("--- Runtime: %s seconds ---" % (time.time() - start_time))

Expected Behavior

It directly checks if the socket is in use or not.

Actual Behavior

Sometimes it takes over 132 seconds to return True or False. In not mirrored mode it always directly returns. I came across this issue using ros, where the roscore takes about 2 minutes to start, or starting tensorboard takes around 2 minutes aswell, so i build this script to reproduce the error.

If it directly returns true or false in mirrored mode, i start the script as sudo and then the problem occurs again.
What is interesting, is that if i set the socket timeout to a specific second, then it will return after that. Its like its blocked for the default timeout.

Diagnostic Logs

Without networking mode mirrored:

❯ python3 test_connect.py
Uid: 1000
Socket type: 1
Socket family: 2
Socket timeout: None seconds
SO_REUSEADDR option: 0
SO_KEEPALIVE option: 0
TCP_NODELAY option: 0
checking: 6006
Timeout: None
Is Port in use: False
--- Runtime: 0.0001404285430908203 seconds ---

With networking mode mirrored:

❯ python3 test_connect.py
Uid: 1000
Socket type: 1
Socket family: 2
Socket timeout: None seconds
SO_REUSEADDR option: 0
SO_KEEPALIVE option: 0
TCP_NODELAY option: 0
checking: 6006
Timeout: None
Is Port in use: False
--- Runtime: 132.54164338111877 seconds ---

Sometimes it works as fast as without mirrored. But sometimes not, even tho it results in the port is not in use.
If it works directly, starting the script with sudo will result again in this timeout.

With networking mode mirrored and setting timeout to a second with:
s.settimeout(1) before the connect_ex call

❯ python3 test_connect.py
Uid: 1000
Socket type: 1
Socket family: 2
Socket timeout: 1.0 seconds
SO_REUSEADDR option: 0
SO_KEEPALIVE option: 0
TCP_NODELAY option: 0
checking: 6006
Timeout: None
Is Port in use: False
--- Runtime: 1.001239538192749 seconds ---

Or with settimeout to 2 seconds:

❯ python3 test_connect.py
Uid: 1000
Socket type: 1
Socket family: 2
Socket timeout: 2.0 seconds
SO_REUSEADDR option: 0
SO_KEEPALIVE option: 0
TCP_NODELAY option: 0
checking: 6006
Timeout: 2.0
Is Port in use: False
--- Runtime: 2.0022153854370117 seconds ---

Copy link

Hi I'm an AI powered bot that finds similar issues based off the issue title.

Please view the issues below to see if they solve your problem, and if the issue describes your problem please consider closing this one and thumbs upping the other issue to help us prioritize it. Thank you!

Closed similar issues:

Note: You can give me feedback by thumbs upping or thumbs downing this comment.

@OneBlue OneBlue added the network label Dec 1, 2023
@chanpreetdhanjal
Copy link

Could you please follow the steps below and attach the diagnostic logs? https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues

@MichiBab
Copy link
Author

MichiBab commented Dec 5, 2023

Hi, i collected them with a 10 sec forced timeout, as described in my ticket.
❯ python3 test_connect.py
Uid: 1000
Socket type: 1
Socket family: 2
Socket timeout: 10.0 seconds
SO_REUSEADDR option: 0
SO_KEEPALIVE option: 0
TCP_NODELAY option: 0
checking: 6006
Is Port in use: False
--- Runtime: 10.010206699371338 seconds ---
https://github.com/microsoft/WSL/files/13559172/WslLogs-2023-12-05_12-02-48.zip

@chanpreetdhanjal
Copy link

Please follow the networking diagnostic script. https://github.com/microsoft/WSL/blob/master/CONTRIBUTING.md#collect-wsl-logs-for-networking-issues

It should create a folder like [WslNetworkingLogs-date_ver.zip]

@MichiBab
Copy link
Author

MichiBab commented Dec 7, 2023

WslNetworkingLogs-2023-12-07_21-14-32.zip
Updated, here are the new logs @chanpreetdhanjal .

@markovaljaots
Copy link

I have the same issue with WSL 2.0.14.0

@MichiBab
Copy link
Author

MichiBab commented Jan 9, 2024

Still the same problem with 2.1.0

@derMatze82
Copy link

same problem here with 2.1.0.
It is frustrating because some applications expect a conneciton refused when performing port checks with telnet

@MichiBab
Copy link
Author

Still the same problem with 2.1.1

@linG5821
Copy link

linG5821 commented Jan 24, 2024

When I go through windows telnet localhost [port] the access is very slow, but when I go through telnet 127.0.0.1 [port] or telnet native ip [port] it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6

netsh interface ipv6 show prefixpolicies

Default policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::1/128
        40      1  ::/0
        35      4  ::ffff:0:0/96
        30      2  2002::/16
         5      5  2001::/32
         3     13  fc00::/7
         1     11  fec0::/10
         1     12  3ffe::/16
         1      3  ::/96

I've changed the order of their priorities
PS: required Administrator

netsh int ipv6 set prefix ::/96 50 0
netsh int ipv6 set prefix ::ffff:0:0/96 40 1
netsh int ipv6 set prefix 2002::/16 35 2
netsh int ipv6 set prefix 2001::/32 30 3
netsh int ipv6 set prefix ::1/128 10 4
netsh int ipv6 set prefix ::/0 5 5
netsh int ipv6 set prefix fc00::/7 3 13
netsh int ipv6 set prefix fec0::/10 1 11
netsh int ipv6 set prefix 3ffe::/16 1 12

Modified policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::/96
        40      1  ::ffff:0:0/96
        35      2  2002::/16
        30      3  2001::/32
        10      4  ::1/128
         5      5  ::/0
         3     13  fc00::/7
         1     12  3ffe::/16
         1     11  fec0::/10 

now telnet 127.0.0.1 [port] is getting fast

But I don't know what other problems such a change would cause

It can be reset by the following command
netsh interface ipv6 reset
PS: reboot your system

@ZedG2
Copy link

ZedG2 commented Jan 25, 2024

When I go through windows telnet localhost [port] the access is very slow, but when I go through telnet 127.0.0.1 [port] or telnet native ip [port] it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6

netsh interface ipv6 show prefixpolicies

Default policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::1/128
        40      1  ::/0
        35      4  ::ffff:0:0/96
        30      2  2002::/16
         5      5  2001::/32
         3     13  fc00::/7
         1     11  fec0::/10
         1     12  3ffe::/16
         1      3  ::/96

I've changed the order of their priorities PS: required Administrator

netsh int ipv6 set prefix ::/96 50 0
netsh int ipv6 set prefix ::ffff:0:0/96 40 1
netsh int ipv6 set prefix 2002::/16 35 2
netsh int ipv6 set prefix 2001::/32 30 3
netsh int ipv6 set prefix ::1/128 10 4
netsh int ipv6 set prefix ::/0 5 5
netsh int ipv6 set prefix fc00::/7 3 13
netsh int ipv6 set prefix fec0::/10 1 11
netsh int ipv6 set prefix 3ffe::/16 1 12

Modified policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::/96
        40      1  ::ffff:0:0/96
        35      2  2002::/16
        30      3  2001::/32
        10      4  ::1/128
         5      5  ::/0
         3     13  fc00::/7
         1     12  3ffe::/16
         1     11  fec0::/10 

now telnet 127.0.0.1 [port] is getting fast

But I don't know what other problems such a change would cause

It can be reset by the following command netsh interface ipv6 reset PS: reboot your system

not work for me.

image

@linG5821
Copy link

linG5821 commented Jan 26, 2024

When I go through windows telnet localhost [port] the access is very slow, but when I go through telnet 127.0.0.1 [port] or telnet native ip [port] it is very fast, I think it might be because there is a problem with localhost resolution, so I found a blog I looked at the prioritization policy for IPv6
netsh interface ipv6 show prefixpolicies
Default policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::1/128
        40      1  ::/0
        35      4  ::ffff:0:0/96
        30      2  2002::/16
         5      5  2001::/32
         3     13  fc00::/7
         1     11  fec0::/10
         1     12  3ffe::/16
         1      3  ::/96

I've changed the order of their priorities PS: required Administrator

netsh int ipv6 set prefix ::/96 50 0
netsh int ipv6 set prefix ::ffff:0:0/96 40 1
netsh int ipv6 set prefix 2002::/16 35 2
netsh int ipv6 set prefix 2001::/32 30 3
netsh int ipv6 set prefix ::1/128 10 4
netsh int ipv6 set prefix ::/0 5 5
netsh int ipv6 set prefix fc00::/7 3 13
netsh int ipv6 set prefix fec0::/10 1 11
netsh int ipv6 set prefix 3ffe::/16 1 12

Modified policy

Priority    label  prefix
----------  -----  --------------------------------
        50      0  ::/96
        40      1  ::ffff:0:0/96
        35      2  2002::/16
        30      3  2001::/32
        10      4  ::1/128
         5      5  ::/0
         3     13  fc00::/7
         1     12  3ffe::/16
         1     11  fec0::/10 

now telnet 127.0.0.1 [port] is getting fast
But I don't know what other problems such a change would cause
It can be reset by the following command netsh interface ipv6 reset PS: reboot your system

not work for me.

image

Are you using powershell or cmd on windows ?

On WSL use python open a http server
image

  • CMD
    image

image

  • Powershell
    image

@ZedG2
Copy link

ZedG2 commented Jan 29, 2024

@linG5821
We have a different issue. The problem is that the port is not listening, and it not refusing connections immediately.

@linG5821
Copy link

I did find problems with changing the default NIC policy in practice

I mapped ports when using Docker, I tried '127.0.0.1:port' and 'localhost:port' both were accessible, but I used the Intranet ip 'ipv4ip:port' and couldn't access it, when I reset the network, Intranet ip 'ipv4ip:port' and '127.0.0.1:port' are accessible, and access to 'localhost:port' comes back to the issue itself.

@MichiBab
Copy link
Author

Still the same issue with 2.1.3.0

@MichiBab
Copy link
Author

Still the same issue on 2.1.5. @chanpreetdhanjal is there any way to help? I have a workaround for ROS but it's a bit annoying on tensorboard to wait that long after fetching plots.

@chanpreetdhanjal
Copy link

Someone from our team is looking at it. We will get back to you as soon as we make further progress. Thank you for your patient.

@ZedG2
Copy link

ZedG2 commented May 14, 2024

Can't believe MS team still not resolve this bug. It's pretty annoying problem, especially when I am going to using the vscode JS debug terminal in WSL2 mirrored network mode.

@chanpreetdhanjal
Copy link

We are actively looking at it as per other items in our priority list. Will provide an update as soon as we have something tangible to share.

@keith-horton
Copy link
Member

To update this thread: there are 2 possible issues.

  1. when attempting to connect by name "localhost", the IPv6 address ::1 is likely resolved first (v4 vs. v6 is configurable in Linux). Linux does not support the option for IPv6 that we use to allow sending localhost traffic to the host - it's only supported with IPv4. So unfortunately we cannot use the :;1 address to connect from the Linux container to the host.

  2. when a connection is attempted by IPv4 127.0.0.1 to the host, but nothing is listening on the host, a RST is returned. We see that something has attempted to port-forward that response because the Linux stack eventually sees this with a different source port (which of course is dropped). We're currently working with Linux folks as we have tracked the response through a kernel debugger to the point where we send it over the vmBus to the Linux container.

Sorry, the RST issue is taking much longer to root cause than expected :(

Hope this helps

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

8 participants