Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

firestore: QuerySnapshotIterator.Next() can return iterator.Done even if Stop() hasn't been called #2601

Closed
cpick opened this issue Jul 14, 2020 · 8 comments
Assignees
Labels
api: firestore Issues related to the Firestore API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@cpick
Copy link

cpick commented Jul 14, 2020

Client

Firestore

Environment

alpine:3.11 Docker image running on GKE

Go Environment

Built within docker image: golang:1.14-alpine3.11
$ go version
go version go1.14.4 linux/amd64
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build686252092=/tmp/go-build -gno-record-gcc-switches"

Code

$ grep firestore go.mod
cloud.google.com/go/firestore v1.2.0

func (db *DB) watchSnapshots(ctx context.Context, id string, t time.Time, cb ForEachCallback) error {
	qsnaps := db.collection().
		Query.
		Where(idField, "==", id).
		Where(timeField, ">", t).
		OrderBy(timeField, firestore.Asc).
		Snapshots(ctx)
	defer qsnaps.Stop()

	for {
		qsnap, err := qsnaps.Next()
		if err == iterator.Done {
			return errors.New("stop called") // never expected since we only call `Stop()` via defer
		}
		if err != nil {
			return fmt.Errorf("next: %w", err)
		}

		if err := cb(ctx, qsnap); err != nil {
			return err
		}
	}
}

Expected behavior

The provided watchSnapshots() method should never generate a "stop called" error because the documentation says:

Next never returns iterator.Done unless it is called after Stop.

And the watchSnapshots() method only calls Stop() from defer.

Actual behavior

We see occasional instances where iterator.Done is returned from QuerySnapshotIterator.Next().

My hypothesis is something like:
Under the hood, the firestore package turns any io.EOF into an iterator.Done.

It assumes this io.EOF came from the watchStream.stop() method, but I believe they can come from underlying network issues/disconnects and the library misinterprets it.

For example, perhaps watchStream.recv()s call to watchStream.open() failed with io.EOF (either in it’s call to Listen() or Send()). The latter eventually calls grpc.ClienltStream.SendMsg() which does say that it can return io.EOF. There's some effort to retry on "non-permanent watch errors", but that's only done on errors from the underlying gRPC connection's Recv(), but apparently intentionally not during the initial/re open (which makes sense since if the connection can't even be established that may be a permanent networking issue).

As a stopgap the documentation could be changed to indicate iterator.Dones can be produced due to networking issues, but I think a better overall solution would be to propagate the io.EOF (or similar) error if a connection cannot be (re)opened so the caller can decide how to react to that situation.

Screenshots

NA

Additional context

NA

@cpick cpick added the triage me I really want to be triaged. label Jul 14, 2020
@product-auto-label product-auto-label bot added the api: firestore Issues related to the Firestore API. label Jul 14, 2020
@tritone tritone added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed triage me I really want to be triaged. labels Jul 15, 2020
@tritone
Copy link
Contributor

tritone commented Jul 15, 2020

Thanks for your report! @BenWhitehead any thoughts on this? We can definitely make the proposed docs fix in the immediate term.

@BenWhitehead
Copy link

@tritone I think getting the docs updated sounds good.

Then we can look at investigating what code change would need to happen to correct the misinformed Done.

tritone added a commit to tritone/google-cloud-go that referenced this issue Jul 29, 2020
It's been reported that QuerySnapshotIterator may return
iterator.Done before Stop, possibly due to EOF caused by
networking issues. Add a caveat about this to the docs.

Updates googleapis#2601
tritone added a commit that referenced this issue Jul 31, 2020
It's been reported that QuerySnapshotIterator may return
iterator.Done before Stop, possibly due to EOF caused by
networking issues. Add a caveat about this to the docs.

Updates #2601

Co-authored-by: Tyler Bui-Palsulich <26876514+tbpg@users.noreply.github.com>
tritone added a commit to tritone/google-cloud-go that referenced this issue Aug 25, 2020
It's been reported that QuerySnapshotIterator may return
iterator.Done before Stop, possibly due to EOF caused by
networking issues. Add a caveat about this to the docs.

Updates googleapis#2601

Co-authored-by: Tyler Bui-Palsulich <26876514+tbpg@users.noreply.github.com>
@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels Oct 13, 2020
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Jan 10, 2021
@tritone tritone assigned crwilcox and unassigned tritone and crwilcox Jan 22, 2021
@crwilcox
Copy link
Contributor

Hey @cpick your hypothesis is correct. QuerySnapshotIterator.Next, when an io.EOF is received from the underlying stream, will modify the err to be iterator.Done

I think the following portion of this is reasonable.

As a stopgap the documentation could be changed to indicate iterator.Dones can be produced due to networking issues,

The current docstring for Next() does state this:

// Next is not expected to return iterator.Done unless it is called after Stop.
// Rarely, networking issues may also cause iterator.Done to be returned.

Though it seems query.go has this where document.go doesn't. Perhaps we need to ensure that it is mentioned on all watch types.

Regarding a suggested change to behavior:

I think a better overall solution would be to propagate the io.EOF (or similar) error if a connection cannot be (re)opened so the caller can decide how to react to that situation.

@jba I know this is likely out of your mind, but could you possibly share why io.EOF is transformed to iterator.Done for the user, instead of surfacing the inner io.EOF?

@tritone
Copy link
Contributor

tritone commented Jan 27, 2021

The docs are due to the earlier PR I made on this issue. I haven't done any further work to dig into why or at what level this occurs.

gcf-merge-on-green bot pushed a commit that referenced this issue Jan 29, 2021
#2601

Upon inspection, it seems we added a disclaimer to one of the two watch docstrs.
@jba
Copy link
Contributor

jba commented Jan 29, 2021

Best I can guess, I assumed I only saw io.EOF when the underlying connection was closed legitimately.

From a distance, it seems to me that a reasonable solution is for stop to indicate it was called using some other mechanism, like setting a bool. Then the code that transforms io.EOF into Done could first check that bool.

crwilcox added a commit to crwilcox/google-cloud-go that referenced this issue Jan 30, 2021
…r locations of EOF that might be impacting how Next() states iterator.Done before end is reached, googleapis#2601
@crwilcox
Copy link
Contributor

crwilcox commented Feb 2, 2021

FYI: #3643 fixed a small error in how watch.stop was done. I believe this fix is tangential to this bug.

I also have a branch, https://github.com/googleapis/google-cloud-go/compare/master...crwilcox:refine-EOF?expand=1#diff-fbf8020e8989d7c99881c746391dda1da4706a1ce1216d16ca1dfce4f62b9284R453, where I have marked areas we might be returning done incorrectly. Though more debugging is needed.

@crwilcox
Copy link
Contributor

crwilcox commented Feb 24, 2021

I have had a watch running for ~3 weeks trying to find this issue. Unfortunately I haven't observed this error.

@crwilcox
Copy link
Contributor

While I am going to keep digging into this, as I can't manage to reproduce there isn't much we can do at this time. Please reopen (or create a new issue) if you come across this again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: firestore Issues related to the Firestore API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

6 participants