Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dial queue timeout for multiple multiaddresses incorrectly used for each connection separately #2368

Open
akim-bow opened this issue Jan 19, 2024 · 6 comments
Labels
need/triage Needs initial labeling and prioritization

Comments

@akim-bow
Copy link

akim-bow commented Jan 19, 2024

Version:
libp2p@1.2.0

Severity:

Medium

Description:

If remote peer proposes multiple multiaddresses, e.g. [/dns/nox-1, /ip4/10.50.10.10, /ip4/127.0.0.1] (only the last one is reachable by default), libp2p will try to connect to them one by one. As you can see in the source code, the shared signal is created and used in every attempt to dial multiaddresses. Hence, the first address could take up all the time until the signal aborts and other addresses would be dropped without an attempt to connect to them.

I'm proposing to use 2 separate signals - first one for batch of multiaddresses, the second one for each multiaddress separately. It will solve the case when a peer provides multiple adresses and only some of them are actually valid. If the description isn't clear enough i can try to build simple reproduction example.

@akim-bow akim-bow added the need/triage Needs initial labeling and prioritization label Jan 19, 2024
@akim-bow
Copy link
Author

akim-bow commented Feb 8, 2024

@achingbrain can you please look into it?

@zeroxbt
Copy link
Contributor

zeroxbt commented Mar 5, 2024

@achingbrain could you take a look at this ? Most auto dials are failing because of it

@achingbrain
Copy link
Member

achingbrain commented Mar 6, 2024

@akim-bow what kind of addresses are you having problems with?

The example you gave was [/dns/nox-1, /ip4/10.50.10.10, /ip4/127.0.0.1] - DNS addresses take time to resolve, so that can be bad, fair enough, but the other two are private local addresses so assuming TCP, they should be fast to fail?

If you know certain addresses are going to be unreliable or slow, you can also pass an addressSorter as part of the connectionManager config to ensure they are dialled last, or a connectionGater to ensure they are not dialled at all.

@zeroxbt
Copy link
Contributor

zeroxbt commented Mar 6, 2024

@achingbrain I'm having the same issue when running nodes in local environment.

When dialing a peer, these are the addresses found in the peer store:

[/ip4/10.2.0.2/tcp/9106, /ip4/127.0.0.1/tcp/9106, /ip4/172.20.5.94/tcp/9106]

The first address is not reachable, while the other two are. The dialer fails after dialTimeout milliseconds because of the first address, without ever attempting to dial the other two. All auto dials are also failing for the same reason.

@achingbrain
Copy link
Member

My concern with the proposed solution is that having separate timeouts for individual addresses could lead to very long dial times.

I wonder if we could be smarter with the dialling, for example if the IP address of the target multiaddr is a class A private network address, deprioritise or entirely skip dialling it unless the current node also has a class A private network address in the same logical network?

@zeroxbt
Copy link
Contributor

zeroxbt commented Mar 7, 2024

Although sorting and filtering addresses is a good option, as a user I would still expect the dial function to attempt dialing all provided addresses.
Why not give users the option to choose a maxParallelDialsPerPeer parameter with a default value of 1 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/triage Needs initial labeling and prioritization
Projects
None yet
Development

No branches or pull requests

3 participants