Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

packets.go:66: unexpected EOF followed by packets.go:412: busy buffer followed by driver: bad connection followed by connection.go:319: invalid connection #583

Closed
tssajo opened this issue May 2, 2017 · 37 comments

Comments

@tssajo
Copy link

tssajo commented May 2, 2017

Issue description

MySQL SELECT queries fail unexpectedly at random times in a long running CLI script.

Example code

I can't really provide a sample code, but I can tell you that I only do simple SELECTqueries in my CLI program. I don't do any transactions, nor any UPDATEs. All the MySQL operations in my program are read-only (just a lot of SELECTs only). My program is doing just ONE query at a time, it runs on a single core virtual machine ($5 DigitalOcean Droplet) with only one go routine running at a time. I do not run out of RAM, I checked that, the swap file is only barely used on this virtual machine.

Error log

[mysql] 2017/05/02 16:48:51 packets.go:66: unexpected EOF
[mysql] 2017/05/02 16:48:52 packets.go:412: busy buffer
driver: bad connection
[mysql] 2017/05/02 16:48:52 connection.go:319: invalid connection

Configuration

Driver version (or git SHA):
git log shows:
commit 147bd02
Author: astaxie xiemengjun@gmail.com
Date: Sat Apr 29 00:02:31 2017 +0800

But the exact same issue is happening with a version from December, 2016. With that version the log is:

[mysql] 2017/05/01 09:55:47 packets.go:66: unexpected EOF
[mysql] 2017/05/01 09:55:47 packets.go:412: busy buffer
driver: bad connection
[mysql] 2017/05/01 09:55:48 connection.go:312: invalid connection

So this problem is not new. I first noticed this issue this year. Sometime last fall I was using the exact same program on the exact same database, compiled by the then latest stable version of Go (probably Go 1.7), and at that time I never ran into this issue. So this must be something new, it must be related to some change(s) that occurred right around or before Go 1.8 got released.

Go version:
go version go1.8.1 linux/amd64

Server version: mysql-5.5.55

Server OS: CentOS release 6.9 (but I compiled the program under Debian 8.7 (jessie), transferred the binary file over to the CentOS 6.9 droplet and ran it under CentOS 6.9 -- thanks to static builds this should not be a problem. I do the same with a bunch of other Go programs and this is the only one which fails)

I noticed today that someone else also ran into this error pattern ( "packets.go:66: unexpected EOF followwed by packets.go:412: busy buffer followed by driver: bad connection" ) recently, please see his Edit#2 on this StackOverflow page: http://stackoverflow.com/questions/43696607/mysql-to-json-inconsistent-extraction

This is a showstopper bug for me. I might need to switch back to Go 1.7.x and recompile my program and see if that fixes this issue or not. Or do you have a better idea on how to debug this?

BTW, my program is doing millions of SELECT queries and only a few of them fail. On every run different queries fail, it seems to me totally random.

@tssajo tssajo changed the title packets.go:66: unexpected EOF followwed by packets.go:412: busy buffer followed by driver: bad connection followed by connection.go:319: invalid connection packets.go:66: unexpected EOF followed by packets.go:412: busy buffer followed by driver: bad connection followed by connection.go:319: invalid connection May 2, 2017
@julienschmidt
Copy link
Member

julienschmidt commented May 2, 2017

Are these long-running queries or many short-lived queries?
The most likely reason for EOF errors are timeouts and abruptly closed connections.

I was never able to explain those busy buffer errors though.

@julienschmidt
Copy link
Member

It would be awesome if anyone would be able to provide an example program.
I would like to narrow down what exactly happens here.

A start might be to use an explicit panic() in packets.go:412 so we can get a stack trace.

@tssajo
Copy link
Author

tssajo commented May 2, 2017

@julienschmidt Thanks for the very quick response!
I believe these are all very simple short-lived queries. I never measured the time of the queries though... Maybe some of them are not that short-lived because some tables might have a lot of rows and missing the proper indexes. Unfortunately, I cannot modify those tables, I cannot add indexes. I try to increase timeouts in my.cnf , I report back if I managed to figure it out.

@tssajo
Copy link
Author

tssajo commented May 2, 2017

@julienschmidt That other person (see the StackOverflow link) provided his full program on github, but I'm afraid without a huge database you won't be able to reproduce the issue...

@tssajo
Copy link
Author

tssajo commented May 2, 2017

BTW, it always start with a "packets.go:66: unexpected EOF" , that is the first indication that there is a problem. Then immediately after that follows the "packets.go:412: busy buffer"

@methane
Copy link
Member

methane commented May 2, 2017

Have you tried DB.SetConnMaxLifetime()?

db.SetConnMaxLifetime(time.Second * 10) is good for most configuration.

@julienschmidt
Copy link
Member

Replacing packets.go:66 with the following should produce some useful information for debugging then:

fmt.Println(mc.buf)
panic(err.Error())

@tssajo
Copy link
Author

tssajo commented May 2, 2017

@methane Thanks for the suggestion! I try this in a moment.

@julienschmidt I modify packets.go as you suggested at line 66, recompile the program and rerun it.

If it still fails then we'll have a panic in about an hour or so. I get back to you guys soon!

@tssajo
Copy link
Author

tssajo commented May 2, 2017

So far so good. The program is running now without errors on two servers for over an hour.

There is a good chance that adding db.SetConnMaxLifetime(time.Second * 10) helped.

But it wasn't required before...

@tssajo
Copy link
Author

tssajo commented May 3, 2017

@julienschmidt I got a Panic in packets.go:66

Please see this Gist: https://gist.github.com/tssajo/2939a5ef8fe3909d41ab6ac094600453

Note: The first long line is the result of fmt.Println(mc.buf)

If you want, I can move these lines:

fmt.Println(mc.buf)
panic(err.Error())

to packets.go:412 and then start the program again. Then you'll have a backtrace of what happened next. Do you want me to do that?

@tssajo
Copy link
Author

tssajo commented May 3, 2017

I can provide some more information now.
This is the query I am running on a table with lots of rows (most likely hundreds of thousands) :

rows, err := db.Query("SELECT `ID`,`post_name`,`post_parent` FROM `wp_posts` WHERE `post_status`='publish' AND (`post_type`='page' OR `post_type`='post') ORDER BY `ID`")

I believe that here I do not receive an error, but I will add debug code to make sure of this.
However, what happens next is really interesting: While I am reading the rows with rows.Next() in a loop, I put data from each row into a channel. That channel is buffered but the buffer size is only 100. So I might not be able to go through all rows quickly enough. Because another go routine is taking out and processing data from that channel, but that processing can be very time consuming! Also, that processing involves issuing lots of other SELECT statements against the same table -- I'm not sure if this is relevant or not.

I believe that a timeout occurs at some point while inside the for rows.Next() { loop and from that point on I can no longer read any more data from that MySQL result set (the rows object). Eventually the loop completes anyhow, then I call rows.Err() and then rows.Close() is also being called because of a previous defer statement.

So I believe that rows.Next() then rows.Err() and then finally rows.Close() gets called AFTER a timeout occurred.

Can this be the cause of the problems I am having?
I might need to increase the wait_timeout and the interactive_timeout in my.cnf
I will report here if that helped or not.

@julienschmidt
Copy link
Member

@tssajo another stack trace from packets:412 would be great

@julienschmidt
Copy link
Member

A timeout is most likely, but it should be handled without weird errors by the driver

@tssajo
Copy link
Author

tssajo commented May 3, 2017

@julienschmidt Unfortunately increasing both the wait_timeout and the interactive_timeout in my.cnf did not help (of course I restarted MySQL server after changing the values). I still got a panic. Here is the stack trace, but this is still from packets.go:66 :

https://gist.github.com/tssajo/8acfedc789dc7282ff991bf5662cd4d6

Next I create one from the packets:412. I just made this change in packets.go :

func (mc *mysqlConn) writeCommandPacket(command byte) error {
        // Reset Packet Sequence
        mc.sequence = 0

        data := mc.buf.takeSmallBuffer(4 + 1)
        if data == nil {
                // can not take the buffer. Something must be wrong with the connection
                fmt.Println(mc.buf)
                panic("data is nil after data := mc.buf.takeSmallBuffer(4 + 1)")
                errLog.Print(ErrBusyBuffer)
                return driver.ErrBadConn
        }

        // Add command byte
        data[4] = command

        // Send CMD packet
        return mc.writePacket(data)
}

I am recompiling now.

BTW, my wait_timeout and my interactive_timeout values are 300 ( = 5 minutes ) in my.cnf now. Can this still be too low? I can set it to 1 hour and see if that helps.

@tssajo
Copy link
Author

tssajo commented May 4, 2017

@julienschmidt Here is the panic at packets:412

https://gist.github.com/tssajo/4da2568ea8421ff80c2e4aec300e203a

Unfortunately this happened with the following settings in my.cnf :

wait_timeout             = 3600
interactive_timeout      = 3600

Meaning that even with a 1-hour time out setting this is still happening! I can assure you that this did NOT happen last fall when the timeout settings were:

wait_timeout             = 30
interactive_timeout      = 30

So I am not sure it is caused by a timeout, anymore. Maybe it is is not a timeout related problem after all.

Here is what I'm going to do now: I'll recompile the program with the Aug 2, 2016 version of the go-sql-driver ( commit : 0b58b37 ) and try the program with that one. If that one is still "crashing" then I'll also downgrade to Go 1.7.x and compile the program with that. I just happen to know that this was all working last fall.

Please understand that it is a real showstopper bug for me and I must find a solution to this problem soon. If I need to use an older version of the driver and/or Go then I'll hesitantly (but with no other option) must do that. Anyway, thanks for trying to help!

@tssajo
Copy link
Author

tssajo commented May 4, 2017

Unfortunately, with the Aug 2, 2016 version of the go-sql-driver ( commit 0b58b37 ) I still get the same kind of errors logged, see:

[mysql] 2017/05/04 13:20:31 packets.go:59: unexpected EOF
[mysql] 2017/05/04 13:20:31 packets.go:386: busy buffer
driver: bad connection
[mysql] 2017/05/04 16:25:10 packets.go:59: unexpected EOF
[mysql] 2017/05/04 16:25:10 packets.go:386: busy buffer
driver: bad connection
[mysql] 2017/05/04 17:59:54 packets.go:59: unexpected EOF
[mysql] 2017/05/04 17:59:54 packets.go:386: busy buffer
driver: bad connection

The program doesn't stop because right now I do not have panic() added to packets.go
But many of the records ( rows ) are being skipped in the database because of this bug. I can confirm this when the program finished running -- probably tomorrow. Because I count how many records were processed successfully. I suspect that at the end I will have thousands of records not processed, because of this error.
Next I will compile the program with Go 1.7.x and then try again. I will report you the results.

@julienschmidt
Copy link
Member

Use an even earlier version of Go if possible, we support anything Go 1.2+

@tssajo
Copy link
Author

tssajo commented May 4, 2017

@julienschmidt Yes, if Go 1.7 fails then I go back one more version. But I believe that it worked with 1.7 last year in about September or October.

BTW, I noticed one difference already! When I use the latest version of the mysql driver with Go 1.8.1
I get 4 lines logged as a result of the issue, see:

[mysql] 2017/05/02 16:48:51 packets.go:66: unexpected EOF
[mysql] 2017/05/02 16:48:52 packets.go:412: busy buffer
driver: bad connection
[mysql] 2017/05/02 16:48:52 connection.go:319: invalid connection

However, when I use the Aug 2, 2016 version of the go-sql-driver ( commit 0b58b37 ) with also Go 1.8.1
I only get 3 lines, see:

[mysql] 2017/05/04 13:20:31 packets.go:59: unexpected EOF
[mysql] 2017/05/04 13:20:31 packets.go:386: busy buffer
driver: bad connection

As you can see above, I do not get an "invalid connection" message from connection.go when I use the older version of the driver.

@endophage
Copy link

I'm seeing this same error at line 33 in packets.go. We were trying to upgrade some of our dependencies, particularly the migration library we use. The PR can be found here notaryproject/notary#1151

If you have docker installed, you should just be able to run docker-compose build && docker-compose up to see the error.

@endophage
Copy link

I'm only actually able to repro this on Windows interestingly.

@tssajo
Copy link
Author

tssajo commented May 8, 2017

I managed to solve this problem! It had nothing to do with Go version or the version of the Go MySQL driver.

It turns out that it was a combination of things. In my case, the problem was very much resembling this issue: #281 So I added these lines to my.cnf :

net_read_timeout         = 14400
net_write_timeout        = 14400

I also changed back the following timeout settings to their MySQL default values (previously I had much lower values for these) :

wait_timeout             = 28800
interactive_timeout      = 28800

Besides, what was suggested by @methane was also required:

db.SetConnMaxLifetime(time.Second * 20)

I set it to 20 seconds instead of the suggested 10, it works for me perfectly.

Since I made the above changes I stopped seeing the error anymore.

When I said, I used the same program on the same database last year and at that time I did not see this error, that was also true. However, my program is generating static pages from a dynamic web site (WordPress based) and since last year the WP template became more complex and we also changed the web site to SSL. All these changes made the processing of every single page more time consuming. Due to that, recently I started to encounter MySQL timeouts in that long running loop, just like it is described here: #281
In my case processing became slower hence I started to see timeouts at places where I've never seen timeouts before.

@tssajo tssajo closed this as completed May 8, 2017
@tssajo
Copy link
Author

tssajo commented May 11, 2017

I'm sorry but I need to reopen this ticket. The error occurred once again:

[mysql] 2017/05/09 18:56:37 packets.go:66: unexpected EOF
[mysql] 2017/05/09 18:56:37 packets.go:412: busy buffer
driver: bad connection

Increasing MySQL timeouts really helped a lot. But that still does not fully resolve this problem.

@tssajo tssajo reopened this May 11, 2017
@methane
Copy link
Member

methane commented May 11, 2017

Do you mean it's reproduced with db.SetConnMaxLifetime(time.Second * 20) ?

@tssajo
Copy link
Author

tssajo commented May 11, 2017

@methane Yes, db.SetConnMaxLifetime(time.Second * 20) did not fully solve the issue. It definitely helped a lot, though.

In my case the timeout occurs in a long running loop, that loop runs for several days! The loop is similar to the one as described here: #281
Recently the timeout happened only on one of my servers and only after several days.

Increasing the timeout values in the my.cnf file definitely helped a lot. Before I increased the timeout values, I got to see this problem to surface on two other servers, and on this same server it surfaced just after a few hours of running the program. Right now the issue showed up on one server only and only after days of running completely trouble free.

I am thinking that maybe a MySQL logrotate happened or something which closed the MySQL connection prematurely. I will look into this possibility.

@methane
Copy link
Member

methane commented May 11, 2017

In my case the timeout occurs in a long running loop, that loop runs for several days! The loop is similar to the one as described here: #281

Oh, do you mean you're doing heavy job inside rows.Next() loop?
You should fetch all rows and call rows.Close() ASAP, and do your job with fetched rows.

@tssajo
Copy link
Author

tssajo commented May 11, 2017

@methane Yes, as a last resort I was already planning to make this change. Get all rows quickly, save the data to a temporary file and then do the processing by going through the lines of that text file. The reason I need to save the data to a text file because the data may not fit in memory. This program is running on $5 a month DigitalOcean droplets, those only have 512MB RAM ...

Thanks for the suggestion!

@methane
Copy link
Member

methane commented May 11, 2017

I feel this is just a behavior of MySQL server and there is nothing client (driver) can do.
Why did you reopen this issue?

@tssajo
Copy link
Author

tssajo commented May 11, 2017

I reopened it when my program completed after several days of running through all records and at the end I saw that many rows were not processed, skipped. I look at the logged output and I saw the same error. So I reopened.

I can close it but it is still happening.

Your suggested workaround will more than likely work. But that is just a workaround.

Thank you for your help!

@tssajo tssajo closed this as completed May 11, 2017
@shuhaowu
Copy link

Has anyone figured out why this is a problem and have a solution?

I've also noticed that if I don't set any timeout, the query will die with after a few of these busy buffer error with a bad connection error.

I don't think this should be closed as this is definitely not a resolved issue.

@tssajo
Copy link
Author

tssajo commented Aug 16, 2017

@shuhaowu In my opinion, it can be closed now, because we determined that it is a MySQL issue. In my case, the Go program ran for 24+ hours. We figured out that a "result set" cannot be held for that long, it is more than likely a MySQL issue, not a Go driver issue. So I changed my program: I go through the result set quickly in a loop and just store the data in a slice. This way I never get the busy buffer error or any other errors. Once I have the data stored in the slice, I walk through the elements of this slice and do the actual job on each element. My program definitely got fixed this way.
However, I am a little bit concerned about the memory usage of my program. Because now, I must hold all the results from the initial query in a slice. That query returned hundreds of thousands of rows! So I could ran out of memory. Fortunately, it wasn't the case for me. But if it would have happened, then I would have been storing all those rows in a temporary file on disk and then walk through the lines of that file.
All in all, workarounds will work. But it is clearly a MySQL issue, not a Go driver issue, in my opinion.
This is why I closed this.

@julienschmidt
Copy link
Member

If anyone has an idea how the driver could provide a more helpful error message, please let us know or submit a pull request.
The current errors are unfortunately not helpful at all.

@bduave
Copy link

bduave commented Nov 7, 2017

I face the sa,e problem but i fixed putting "DELIMITER $$" before mysql procedure or select and "END" on the end of the script. I hope it helps

@donatj
Copy link

donatj commented Jan 10, 2018

I am encountering this as well. I have a data collection program that is doing thousands and thousands of small queries. After a couple hours I always end up receiving

[mysql] 2018/01/10 07:43:08 packets.go:36: unexpected EOF
[mysql] 2018/01/10 07:43:08 packets.go:431: busy buffer
2018/01/10 07:43:08 invalid connection

@methane
Copy link
Member

methane commented Jan 10, 2018

@donatj Have you read this thread?
Did you tried this?

Your comment is very unhelpful. It lacks many information.
Log is just a log. It inform you about connection is killed. Driver can't know why it happens.
We and driver can't check your MySQL config, OS settings or looking tcpdump.
Only you (or your infra member) can do it.

Without complete, reproducible example, it's not issue report. Just a noise.

@donatj
Copy link

donatj commented Jan 10, 2018

@methane I read the entire thread, line by line, because this is a major impediment for me.

tssajo noted that

In my opinion, it can be closed now [...] the Go program ran for 24+ hours. We figured out that a "result set" cannot be held for that long

He says that that the problem only occured with long lived result sets - which is why I was noting that I was receiving it with short lived ones. I thought that was new information.

Sorry.

@methane
Copy link
Member

methane commented Jan 10, 2018

That is only one of many reasons TCP connection is killed.
I recommend everyone to use db.SetConnMaxLifetime(time.Second * 10)
unless they can solve their issue by themselves.

@winjeg
Copy link

winjeg commented Apr 25, 2019

The problem cannot be solved perfectly.
If the user have a table that is much bigger that 1TB, and the network is very fast,. even faster than the disk.
Fetching the rows as soon as possible and store it to a disk will not solve this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants