
Running on Windows Server 2019 #491

Open · SolSoCoG opened this issue May 7, 2020 · 12 comments

SolSoCoG commented May 7, 2020

I was forced to use Windows Server 2019.
Chihaya runs perfectly well for several seconds (it's an established tracker with 80k+ hashes) before it fails.

The last message I'm getting after normal traffic is:

time="2020-05-07T12:52:18+02:00" level=fatal msg="failed while serving udp" error="read udp 217.160.246.211:6969: wsarecvfrom: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress."

It might be this error:

WSAENETRESET (10052): Network dropped connection on reset. The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress. It can also be returned by setsockopt if an attempt is made to set SO_KEEPALIVE on a connection that has already failed.

I'm clueless; is there any way to patch Chihaya so it doesn't shut down on that error?

@jzelinskie (Member)

Thanks for the bug report!
It looks like this is an error that is safe to ignore, but right now Chihaya is treating it as critical. The fix will be very small; the only thing that has to be determined is the best error type to handle to avoid other non-critical errors on Windows.
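For illustration, a minimal sketch of the kind of check being described might look like this (the package, function names, and read loop are hypothetical, not Chihaya's actual code); the idea is to recognize the Winsock WSAENETRESET error (code 10052) in the UDP read loop and keep serving instead of exiting:

```go
// Hypothetical sketch, not Chihaya's actual code: keep serving UDP when the
// read fails with WSAENETRESET (Winsock error 10052) instead of treating it
// as fatal. Assumes the wrapped error chain ends in a syscall.Errno.
package udpsketch

import (
	"errors"
	"log"
	"net"
	"syscall"
)

// isNetReset reports whether err is Winsock's "network dropped connection
// on reset" error (WSAENETRESET, 10052).
func isNetReset(err error) bool {
	var errno syscall.Errno
	return errors.As(err, &errno) && errno == 10052
}

func readLoop(conn *net.UDPConn, handle func([]byte, *net.UDPAddr)) error {
	buf := make([]byte, 4096)
	for {
		n, addr, err := conn.ReadFromUDP(buf)
		if err != nil {
			if isNetReset(err) {
				log.Printf("ignoring non-critical UDP error: %v", err)
				continue
			}
			return err // anything unrecognized is still fatal
		}
		handle(buf[:n], addr)
	}
}
```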


SolSoCoG commented May 7, 2020

You're welcome! If you have a way to fix the above failure, I'll keep it running and notify you of other non-critical fatal errors once they appear. Not the best practice, but well. I'm relatively new to the Windows Server business, so I couldn't really help you figure out the error with tracing or the like.

@jzelinskie (Member)

I opened #492, which could be easily applied locally if you don't want to wait for us to review and merge the code. This won't prevent any errors, but will log the type of the errors when they occur so that we can handle them properly.
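For reference, recording an error's concrete Go type alongside its message comes down to the %T verb; a tiny, hypothetical example (not necessarily how #492 does it):

```go
// Hypothetical helper: log an error together with its dynamic type,
// e.g. type="*net.OpError", to help decide how to handle it later.
package errlog

import (
	"fmt"
	"log"
)

func logWithType(msg string, err error) {
	// %q on an error quotes the result of err.Error().
	log.Printf("%s error=%q type=%q", msg, err, fmt.Sprintf("%T", err))
}
```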


SolSoCoG commented May 7, 2020

Alright, added the patch.

The result:
time="2020-05-07T19:28:15+02:00" level=fatal msg="failed while serving udp" error="read udp 217.160.246.211:6969: wsarecvfrom: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress." type="*net.OpError"

@mrd0ll4r (Member)

Sorry for the late reply, I'm quite underwater at the moment.
From a quick glance, we already check whether an error is temporary here, and net.OpError implements net.Error, so that check probably works as intended. Which means this error is not considered temporary.

What we could do is string-match the error message, maybe in a file only built for Windows, but I don't really want to go down that road...

Ideas?
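The check being referred to is essentially the standard net.Error pattern; a simplified illustration (not the actual Chihaya code):

```go
// Simplified illustration of an "is this error temporary?" check.
// *net.OpError satisfies net.Error, so the assertion succeeds, but for the
// WSAENETRESET case Temporary() evidently returns false, which is why the
// error still falls through to the fatal path.
package tempcheck

import "net"

func isTemporary(err error) bool {
	netErr, ok := err.(net.Error)
	return ok && netErr.Temporary()
}
```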

@SolSoCoG (Author)

I'd be happy if you could make a patch that ignores the wsarecvfrom error somehow for now. It happens every 5 to at most 30 seconds and resets the temporary hash storage every time, so I'm effectively running an evil, disruptive Chihaya tracker at the moment, restarting the process on every crash.

It looks to me like a malformed request of some kind that is safe to ignore.

@mrd0ll4r (Member)

Hey, I wrote up a small patch that string-matches the error message and ignores it, but logs it at INFO.
You can try whether this works and let us know what happens :)
There is a possibility that ignoring the error doesn't work, e.g. because Go doesn't think it's temporary, the UDP endpoint could be marked as defunct or something, but I don't know. We will see...

issue491.patch.txt
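The attached patch isn't reproduced here, but the kind of Windows-only, string-matching workaround described above might look roughly like this (package name, file layout, and matched text are illustrative only):

```go
//go:build windows
// +build windows

// Hypothetical Windows-only helper: recognize the keep-alive read failure by
// its message text and treat it as ignorable (logged at INFO by the caller).
// Matching on error strings is fragile and only intended as a stopgap.
package keepalivefix

import "strings"

func isIgnorableKeepAliveError(err error) bool {
	return err != nil && strings.Contains(err.Error(),
		"The connection has been broken due to keep-alive activity")
}
```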


SolSoCoG commented May 20, 2020

Alright, thanks. It seems to be working well from what I can tell:
my own seed/peer experiment worked and is stable, and the hash count rises steadily; it reached 35k within several minutes.

Prometheus: http://217.160.246.211:6880/

This shows how often the error appears:

time="2020-05-20T11:15:46+02:00" level=info msg="ignoring keep-alive related UDP error" error="read udp 217.160.246.211:6969: wsarecvfrom: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress." type="*net.OpError"
time="2020-05-20T11:16:03+02:00" level=info msg="ignoring keep-alive related UDP error" error="read udp 217.160.246.211:6969: wsarecvfrom: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress." type="*net.OpError"
time="2020-05-20T11:16:08+02:00" level=info msg="ignoring keep-alive related UDP error" error="read udp 217.160.246.211:6969: wsarecvfrom: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress." type="*net.OpError"
time="2020-05-20T11:16:08+02:00" level=info msg="ignoring keep-alive related UDP error" error="read udp 217.160.246.211:6969: wsarecvfrom: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress." type="*net.OpError"
time="2020-05-20T11:16:39+02:00" level=info msg="ignoring keep-alive related UDP error" error="read udp 217.160.246.211:6969: wsarecvfrom: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress." type="*net.OpError"
time="2020-05-20T11:16:47+02:00" level=info msg="ignoring keep-alive related UDP error" error="read udp 217.160.246.211:6969: wsarecvfrom: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress." type="*net.OpError"
time="2020-05-20T11:16:49+02:00" level=info msg="ignoring keep-alive related UDP error" error="read udp 217.160.246.211:6969: wsarecvfrom: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress." type="*net.OpError"
time="2020-05-20T11:16:59+02:00" level=info msg="ignoring keep-alive related UDP error" error="read udp 217.160.246.211:6969: wsarecvfrom: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress." type="*net.OpError"
time="2020-05-20T11:17:27+02:00" level=info msg="ignoring keep-alive related UDP error" error="read udp 217.160.246.211:6969: wsarecvfrom: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress." type="*net.OpError"

@mrd0ll4r (Member)

Oh, that's going to spam your logs then 😅 feel free to change the logging level...

In general, however, I don't think we will merge this into master, because it seems quite strange, it's a Windows Server-only problem, and the whole string-matching-an-error business is also not best practice...

I'm glad it helped you though, and if anyone in the future has the same problem, we will have a solution!

@jzelinskie (Member)

@mrd0ll4r the net.Error interface also has a Timeout() method. Do you think we should also check for that in the conditional where we check Temporary()?

I'm a bit confused by this error message, because I thought UDP has no concept of "keep-alive" unless it is implemented by the application.
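If that route were taken, the conditional would only grow by one call; a sketch of the broadened check (again not Chihaya's actual code):

```go
// Hypothetical broadened check: treat both temporary errors and timeouts as
// non-fatal. Whether the WSAENETRESET error reports either is still unclear.
package timeoutcheck

import "net"

func isNonFatal(err error) bool {
	netErr, ok := err.(net.Error)
	return ok && (netErr.Temporary() || netErr.Timeout())
}
```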

@mrd0ll4r (Member)

Hmm, maybe we could check for that, not sure... I'm also very confused about what is actually going on here, because I didn't know anything about UDP keep-alives either :D


SolSoCoG commented Jun 9, 2020

I actually managed to get high rates of retransmissions, about 7% congestion, which caused other running programs to lag. I've switched the tracker back to Linux, which seems far less messy. UDP keep-alives don't really make sense to me either. Still open for future testing when needed!
