/api/v2/update consistently times out #154

Open
nharper opened this issue Jun 11, 2018 · 8 comments

nharper (Collaborator) commented Jun 11, 2018

This happens both on the production site and when running locally (e.g. via make serve).

The steps to reproduce locally are:

  1. Run make serve
  2. Load http://localhost:8080/api/v2/update. About 40,000 entries will be loaded into the local database before the last datastore update fails: (rpc error: code = DeadlineExceeded desc = retry failed with context deadline exceeded; last error: context deadline exceeded)
  3. Load http://localhost:8080/api/v2/update. This times out: Internal error: could not retrieve domain names previously marked as preloaded. (rpc error: code = DeadlineExceeded desc = retry failed with context deadline exceeded; last error: context deadline exceeded)

Locally, this reproduces not only on master (a7b7953), but also on previous revisions such as 246e763 and f0da6bf.

lgarron (Collaborator) commented Jun 11, 2018

Sounds like the database is finally too large! This was definitely something I wanted to keep simple until we hit scaling issues, so I guess now is that time. Feel free to revise any architecture decisions. :-)

nharper (Collaborator, Author) commented Jun 11, 2018

I did some experimenting with increasing the timeout on my workstation and here are the results:

15 second timeout: First call to /api/v2/update succeeds in loading all 51,718 HSTS entries into the database. Subsequent requests still time out.
30 second timeout, 45 second timeout: Same behavior as 15 second timeout.
60 second timeout: First call succeeds in loading; second call results in server halting execution:

$ make serve
PASS: Current directory is in $GOPATH.
git rev-parse HEAD > ./frontend/version
go run *.go -local
Setting up local database... checking database connection... done.
Serving from: http://localhost:8080
2018/06/11 15:34:01 Metadata fetch failed: Get http://metadata/computeMetadata/v1/instance/attributes/gae_project: dial tcp: lookup metadata on 127.0.0.1:53: no such host
exit status 1
Makefile:71: recipe for target 'serve' failed
make: *** [serve] Error 1

If I reduce the 60 seconds to 59 seconds, that error goes away. That error appears to come from https://github.com/golang/appengine/blob/master/internal/metadata.go#L35; I haven't tracked down why that gets called from appengine.Main() in server.go. I also think that's a red herring for the problem here.

I find it hard to believe that this failure is just because of the size of the database. It takes about 1 second to fetch and parse the JSON file from Chromium source, which contains the entire contents of the database (when running the server locally). Even /api/v2/debug/all-states times out, and that endpoint literally just asks the database for all of its contents. Maybe I'm being naive to think that if I can write 50,000 records in under 15 seconds, I should be able to read them all back out in a similar amount of time. (Based on the behavior of the /api/v2/debug/all-states endpoint, I don't think this is a problem with the app itself; rather, it's a problem with Google Cloud Datastore.)
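
For reference, here is a minimal sketch of how a per-call deadline like the one being tuned above is typically applied with context.WithTimeout around a Cloud Datastore read. The project ID, kind name, and the 15-second constant are assumptions for illustration, not the repo's actual code:

// Minimal sketch (assumed names, not the repo's implementation): apply a
// per-operation deadline to a Cloud Datastore call via context.WithTimeout.
// When the deadline passes, the call fails with the same kind of
// "context deadline exceeded" error quoted in the reproduction steps above.
package main

import (
    "context"
    "log"
    "time"

    "cloud.google.com/go/datastore"
)

func main() {
    client, err := datastore.NewClient(context.Background(), "hstspreload-local") // assumed project ID
    if err != nil {
        log.Fatal(err)
    }

    // 15 seconds mirrors the first experiment above; raising it to 30, 45, or
    // 60 seconds only changes how long the read runs before it is cancelled.
    ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
    defer cancel()

    var entities []datastore.PropertyList
    if _, err := client.GetAll(ctx, datastore.NewQuery("DomainState"), &entities); err != nil { // assumed kind name
        log.Fatalf("read failed: %v", err) // e.g. rpc error: code = DeadlineExceeded
    }
    log.Printf("read %d entities", len(entities))
}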

nharper (Collaborator, Author) commented Jun 12, 2018

I added some logging to PutStates and statesForQuery (the latter included a trial code change to use Client.Run and Next instead of Client.GetAll) to measure how many entries can be written to or read from the database per second. On my workstation, I can write between 526 and 4500 rows per second to the database, but I can only read 184 per second. (When writing, the first batch of 450 entries takes around 800ms, with the time per batch falling off quickly and reaching a steady state of under 100ms per batch after about 25 batches.) I don't know how similar this is to the rates in production. Nevertheless, it seems kind of crazy that writes proceed roughly 24 times faster than reads.
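
A rough sketch of the kind of trial change and measurement described above, assuming the cloud.google.com/go/datastore client: stream results with Client.Run and Iterator.Next (instead of a single Client.GetAll) and log the observed read rate. The kind name and project ID are placeholders, not the actual statesForQuery code:

// Sketch of the read-rate measurement (placeholder names): stream query
// results with Client.Run / Next and report rows per second.
package main

import (
    "context"
    "log"
    "time"

    "cloud.google.com/go/datastore"
    "google.golang.org/api/iterator"
)

func main() {
    ctx := context.Background()
    client, err := datastore.NewClient(ctx, "hstspreload-local") // assumed project ID
    if err != nil {
        log.Fatal(err)
    }

    start := time.Now()
    count := 0
    it := client.Run(ctx, datastore.NewQuery("DomainState")) // assumed kind name
    for {
        var props datastore.PropertyList
        _, err := it.Next(&props)
        if err == iterator.Done {
            break
        }
        if err != nil {
            log.Fatalf("read failed after %d entities: %v", count, err)
        }
        count++
    }
    elapsed := time.Since(start)
    log.Printf("read %d entities in %v (%.0f rows/sec)", count, elapsed,
        float64(count)/elapsed.Seconds())
}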

lgarron (Collaborator) commented Jun 12, 2018

Are you testing against the local emulator or a prod/test environment connected to GCE?
(Let me know if you don't know how to do the latter. I don't have it off the top of my head, but it should be easy to reconstruct.)

The local emulator is written in Java, and I would caution against using it for any real benchmarks.

nharper (Collaborator, Author) commented Jun 12, 2018

That was testing with the local emulator. My next step is to test against a deployed test App Engine instance and Datastore. It should be easy enough to create a new project and deploy to that instead of the production project.

lgarron (Collaborator) commented Jun 12, 2018

Okay, this works for me:

# 1. Create a new project in Google Cloud.
# 2. Select project:
gcloud init
# 3. Get default credentials.
gcloud auth application-default login
# 4. Change the prod project ID to the project you created in step 1 (unless you want to connect to real prod):

prodProjectID = "hstspreload"

# 5. Run
make bulk-preloaded
go run *.go

You can then try:

There's only one local flag, so you'll have to separate the notion of "local app" from "local database" if you want to use the debug endpoints:

if *local {
    server.HandleFunc("/api/v2/debug/all-states", a.DebugAllStates)
    server.HandleFunc("/api/v2/debug/set-preloaded", a.DebugSetPreloaded)
    server.HandleFunc("/api/v2/debug/set-rejected", a.DebugSetRejected)
}

However, it shouldn't be too hard to use this setup to make a copy of the real prod environment that you can test against with local code.
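
For the "local app" vs. "local database" split mentioned above, here is one hypothetical sketch; the -local-app and -local-db flag names and the debugAPI stub are invented for illustration, not the repo's actual code:

// Hypothetical sketch: split the single -local flag so the debug endpoints
// ("local app") can be enabled independently of the datastore emulator
// ("local database"). Flag names and the debugAPI stub are invented.
package main

import (
    "flag"
    "log"
    "net/http"
)

var (
    localApp = flag.Bool("local-app", false, "register the /api/v2/debug/* handlers")
    localDB  = flag.Bool("local-db", false, "use the local datastore emulator instead of a real project")
)

// debugAPI stands in for the real api object (`a` in the snippet above).
type debugAPI struct{}

func (debugAPI) DebugAllStates(w http.ResponseWriter, r *http.Request)    {}
func (debugAPI) DebugSetPreloaded(w http.ResponseWriter, r *http.Request) {}
func (debugAPI) DebugSetRejected(w http.ResponseWriter, r *http.Request)  {}

func main() {
    flag.Parse()
    a := debugAPI{}
    mux := http.NewServeMux()

    // The debug endpoints depend only on the app-side flag, so they can be
    // used against a remote test database when -local-db is off.
    if *localApp {
        mux.HandleFunc("/api/v2/debug/all-states", a.DebugAllStates)
        mux.HandleFunc("/api/v2/debug/set-preloaded", a.DebugSetPreloaded)
        mux.HandleFunc("/api/v2/debug/set-rejected", a.DebugSetRejected)
    }
    _ = localDB // the database setup (emulator vs. real project) would branch on this flag

    log.Fatal(http.ListenAndServe(":8080", mux))
}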

lgarron (Collaborator) commented Jun 12, 2018

(And you should also be able to deploy to your test App Engine using make deploy this way.)

nharper added a commit that referenced this issue Jun 15, 2018
This is a stop-gap measure to help with issue #154 until a more scalable
solution is designed. The logging in /api/v2/update times how long it
takes to fetch the preload list from Chromium and how long it takes to
load preloaded domains from the database. The latter currently takes
around 15 seconds for approximately 50,000 entries.
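
For reference, a minimal sketch of the kind of stop-gap timing described in this commit message; fetchPreloadList and preloadedDomainsFromDB are placeholder names, not the repo's actual functions:

// Sketch of the stop-gap timing logs (placeholder function names): time the
// Chromium preload-list fetch and the datastore read separately.
package main

import (
    "log"
    "time"
)

// fetchPreloadList stands in for fetching and parsing the Chromium preload JSON.
func fetchPreloadList() ([]string, error) { return nil, nil }

// preloadedDomainsFromDB stands in for reading previously preloaded domains
// from the database (~15 seconds for ~50,000 entries, per the commit above).
func preloadedDomainsFromDB() ([]string, error) { return nil, nil }

func main() {
    start := time.Now()
    list, err := fetchPreloadList()
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("fetched %d preload list entries in %v", len(list), time.Since(start))

    start = time.Now()
    preloaded, err := preloadedDomainsFromDB()
    if err != nil {
        log.Fatal(err)
    }
    log.Printf("loaded %d preloaded domains from the database in %v", len(preloaded), time.Since(start))
}
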
nharper (Collaborator, Author) commented Jun 26, 2018

Today, I had 5 unsuccessful attempts (four 500 errors and one 502 nginx error) loading /api/v2/update before it succeeded. The successful run took 38.4s.
