Production go-ipfs going bananas #39

Open
victorb opened this issue May 16, 2019 · 23 comments
Labels
breaking-production: Issues that need to be fixed ASAP as they affect production negatively
bug: Something isn't working

Comments

@victorb
Member

victorb commented May 16, 2019

Seemingly, over the last 24 hours, our deployed go-ipfs instance went from being connected to about 1k peers to hovering around 15k. This is putting a lot of load on the server, and performance is suffering.

go-ipfs is currently maxing out all 8 CPUs while only doing around 1 MB/s receive/transmit, probably because of the number of open connections.

ConnMgr (the connection manager) is set to the following config, although it doesn't seem to be working (we're connected to 15k peers!):

"ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 2000,
      "LowWater": 1500,
      "Type": "basic"
    },
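
For reference, the same values can be set from the CLI (a rough sketch, assuming the standard ipfs config command; the daemon needs a restart to pick up new limits):

# numeric values need --json, plain strings do not
ipfs config --json Swarm.ConnMgr.LowWater 1500
ipfs config --json Swarm.ConnMgr.HighWater 2000
ipfs config Swarm.ConnMgr.GracePeriod 20s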

Dashboard Graph:
[graph screenshot]
https://dashboard.open-registry.dev/d/7SmAxSzZz/general?orgId=1&from=now-24h&to=now

@victorb
Member Author

victorb commented May 16, 2019

We're one version behind the latest stable go-ipfs (we're on 0.4.19; latest is 0.4.20). Worth upgrading to see if it helps, since the new release is supposed to include performance improvements.
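
(Upgrade sketch, assuming ipfs-update is available on the host; downloading the release binary from dist.ipfs.io works just as well:)

# stop the daemon, swap in the new release, start it again
systemctl stop ipfs        # or however the daemon is supervised
ipfs-update install v0.4.20
systemctl start ipfs
ipfs version               # should now report 0.4.20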

@victorb
Member Author

victorb commented May 16, 2019

@vyzo if you need any information to help track down the issue (pprof dump or whatever), let me know!
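
(In case it helps, a sketch of how such dumps can be grabbed, assuming the API is listening on the default 127.0.0.1:5001:)

# goroutine and heap dumps from the daemon's pprof endpoint
curl -o goroutines.txt 'http://127.0.0.1:5001/debug/pprof/goroutine?debug=2'
curl -o heap.pprof 'http://127.0.0.1:5001/debug/pprof/heap'
# 30-second CPU profile
curl -o cpu.pprof 'http://127.0.0.1:5001/debug/pprof/profile?seconds=30'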

@vyzo

vyzo commented May 16, 2019

Can you try running with libp2p/go-libp2p-connmgr#43?
It has fixed our connection manager woes on the test relay, except for the duplicate-connections issue, which is still being worked on.
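
(A rough sketch of one way to wire that in, assuming a go-modules checkout of go-ipfs; the fork path and commit below are placeholders for wherever the PR branch actually lives:)

# inside the go-ipfs source tree: point the connmgr dependency at the PR branch
go mod edit -replace github.com/libp2p/go-libp2p-connmgr=github.com/SOMEFORK/go-libp2p-connmgr@COMMIT
go mod tidy
make build    # rebuilds cmd/ipfs/ipfs with the patched connection manager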

@victorb
Member Author

victorb commented May 16, 2019

@vyzo I'll give that a try and report back. Thanks

victorb added the breaking-production and bug labels May 16, 2019
@victorb
Member Author

victorb commented May 16, 2019

Log:

First tried disabling the circuit relay-related experimental options:

ipfs config --json Swarm.EnableAutoNATService false
ipfs config --json Swarm.EnableAutoRelay false
ipfs config --json Swarm.EnableRelayHop false

That didn't change anything.

Experimented a bit with the LowWater/HighWater values to see if that would ease off some of the load. It seems the connection manager isn't aggressive enough and can't get down to LowWater because of all the new incoming connections.

We were one version behind go-ipfs (on 0.4.19 while latest is 0.4.20), and the new version is supposed to fix some performance issues. Deployed that version; no change.

Tried manually running ipfs swarm peers | xargs ipfs swarm disconnect to disconnect all peers. The peer count went down to 4k (it seems it couldn't disconnect all of them, although ipfs swarm disconnect reported no errors), but after 1-2 minutes it jumped back up to 15k.

As a last, counterintuitive effort to reduce CPU usage, I disabled the connection manager completely. Peer count jumped to around 90k (!!!) but it did indeed reduce the CPU usage. Memory usage is way higher now, but at least the node can respond to requests again.
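
For reference, one way to disable the connection manager entirely (a minimal sketch, assuming the documented "none" ConnMgr type):

ipfs config Swarm.ConnMgr.Type none
# restart the daemon; with type "none" no connections are ever trimmed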

Now that things are at least working (although with reduced performance), I will try to patch in the PR linked above and deploy that build.

Once that's confirmed working (or not), we can re-enable the relay options.

@raulk

raulk commented May 16, 2019

@victorb if your instance was acting as a relay node, it can take a while until the provider record is cleared from the network and peers stop attempting to connect to you for circuit relaying.

As a way to short-circuit the process, could you try to change your public key to provoke a mismatch between the multiaddr and your actual identity, so that the connection is aborted? Alternatively you could change your port.
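
(A minimal sketch of the port-change option, assuming the default Addresses.Swarm entries; 4002 is just an example port:)

ipfs config --json Addresses.Swarm '["/ip4/0.0.0.0/tcp/4002", "/ip6/::/tcp/4002"]'
# restart the daemon so it listens on and announces the new port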

I'd like to figure out whether the sudden inflow of connections is triggered by your node acting as a relay.

@victorb
Member Author

victorb commented May 16, 2019

As a way to short-circuit the process, could you try to change your public key to provoke a mismatch between the multiaddr and your actual identity, so that the connection is aborted? Alternatively you could change your port.

Yeah, just did this and reset the PeerID to a new one, as there were a bunch of incoming connections I seemingly had no hope of stopping otherwise.

Peer count and CPU load are now normal again. It seems related to relay, as those were also the options I turned off before resetting the PeerID.

[graph screenshot]

From https://dashboard.open-registry.dev/d/7SmAxSzZz/general?orgId=1&refresh=1m&from=now-30m&to=now

@victorb
Member Author

victorb commented May 16, 2019

@raulk

I'd like to figure out whether the sudden inflow of connections is triggered by your node acting as a relay.

Most certainly. But I'm still a bit lost on why the sudden influx started ~20 hours ago; I activated the relay options about a week ago.

victorb added a commit to open-services/bolivar that referenced this issue May 16, 2019
Thanks to https://github.com/open-services/open-registry/issues/39

License: MIT
Signed-off-by: Victor Bjelkholm <git@victor.earth>
victorb removed the breaking-production label May 16, 2019
@victorb
Member Author

victorb commented May 16, 2019

It seems to be falling into the same pattern again: currently connected to 20k peers and counting. It was working fine for a couple of hours, but then started getting more and more connections, seemingly without the connection manager being able to keep up.

[graph screenshot]

https://dashboard.open-registry.dev/dashboard/snapshot/S41NOTTKSbudcGLRDmJd5S2nl7PM20DM

Swarm config looks like this (everything relay-related disabled):

"Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.0.0/ipcidr/29",
      "/ip4/192.0.0.8/ipcidr/32",
      "/ip4/192.0.0.170/ipcidr/32",
      "/ip4/192.0.0.171/ipcidr/32",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4"
    ],
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 2000,
      "LowWater": 1500,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "DisableRelay": false,
    "EnableAutoNATService": false,
    "EnableAutoRelay": false,
    "EnableRelayHop": false
  }

Wasn't able to get a quick build of the PR linked above; I was less rushed since I thought I'd found another solution to the problem. But it seems not.

victorb added the breaking-production label May 16, 2019
@victorb
Member Author

victorb commented May 16, 2019

On the upside, CPU is not nearly as badly affected as before, and peer count seems to fluctuate less as well. But time will tell what happens until tomorrow; I need to rest.

@victorb
Member Author

victorb commented May 17, 2019

I just deployed a new version of go-ipfs built with libp2p/go-libp2p-connmgr#43. Let's see how it goes.

@vyzo I see you just force-pushed the branch; should I rebuild with the new changes?

@vyzo

vyzo commented May 17, 2019

I updated a small thing on the base branch (the PR is on top of another PR), which necessitated the rebase. It's a small change, but it potentially saves allocations, so yeah, update.

@victorb
Member Author

victorb commented May 17, 2019

Alright. Around 22:45 the initial PR code was deployed; now, around 00:07, the newly pushed changes are deployed.

Will let it run overnight and report back.

@victorb
Member Author

victorb commented May 18, 2019

@vyzo it seems to be running better:

[graph screenshot]

Peer count stays between 1500 and 2000, CPU usage is much lower, memory is stable, and data transfer is now performing alright.

@vyzo

vyzo commented May 18, 2019

Excellent!

victorb removed the breaking-production label May 19, 2019
@victorb
Member Author

victorb commented May 19, 2019

@vyzo things are much better now, but it seems the connection manager still struggles to keep up sometimes. This happened about an hour ago:

[graph screenshot]
https://dashboard.open-registry.dev/d/7SmAxSzZz/general?orgId=1&from=1558240837670&to=1558262437670

The ConnMgr values in the config are currently:

{
    "GracePeriod": "20s",
    "HighWater": 5000,
    "LowWater": 2500,
    "Type": "basic"
}

It seems that memory allocated during the spike was not given back afterwards.

@victorb
Member Author

victorb commented May 22, 2019

Update after 3 days (graph is last 7 days):

[graph screenshot]
https://dashboard.open-registry.dev/d/7SmAxSzZz/general?orgId=1&from=1557920925634&to=1558523884539

@vyzo @raulk it seems that go-ipfs is still not really working properly... There are now 40k peers connected, it's using 0.2 of the available CPU, and memory is growing past 10 GB.

It seems the connection manager still isn't disconnecting as many peers as it needs to, even after applying the patch linked by @vyzo above.

Edit: CPU usage is much better than before applying the patch, but go-ipfs is still basically taking over the server's resources, as the connection manager doesn't respect the configured values / cannot close enough connections.

victorb added the breaking-production label May 22, 2019
@vyzo

vyzo commented May 22, 2019

This is weird. The only possible explanation is that the connection manager gets stuck, which is an issue @Stebalien has identified.

@vyzo

vyzo commented May 22, 2019

We are working on a fix with libp2p/go-libp2p-circuit#76

@vyzo

vyzo commented May 22, 2019

See ipfs/kubo#6237 -- can you try building with go-ipfs master? It has the relevant patches applied.
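
(Build sketch, assuming a recent Go toolchain on the build host:)

git clone https://github.com/ipfs/go-ipfs && cd go-ipfs
make build                        # produces cmd/ipfs/ipfs
./cmd/ipfs/ipfs version --commit  # confirm which commit the binary was built from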

@victorb
Member Author

victorb commented May 22, 2019

@vyzo Thanks a lot. Will do a deploy of go-ipfs master tomorrow morning and see if it improves the situation.

@victorb
Member Author

victorb commented May 24, 2019

Now ipfs/go-ipfs:v0.4.21-rc3 has been deployed. Let's see how it holds up.
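
(That tag is the image on Docker Hub; for reference, a minimal pull-and-run sketch with example names and paths, assuming a plain Docker setup:)

docker pull ipfs/go-ipfs:v0.4.21-rc3
docker stop ipfs && docker rm ipfs        # "ipfs" container name is just an example
docker run -d --name ipfs -p 4001:4001 -v /data/ipfs:/data/ipfs ipfs/go-ipfs:v0.4.21-rc3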
