Performance degrades with the size of the API #26

Open
mchlrch opened this issue Aug 9, 2018 · 2 comments

mchlrch commented Aug 9, 2018

Performance seems to degrade as the API gets bigger.

In my test scenario, I compared an integration environment with a production environment:

  • production: Shape-API for 9 DataSets
  • integration: Shape-API for 801 DataSets

I also compared performance when going through 1) the sparql-proxy and 2) hydraBox.
The query against the triplestore is the same in both cases.

With the given test data, the response is 40 KB of ld+json when retrieved via sparql-proxy and 48 KB when retrieved via hydraBox.

I also submitted requests in parallel (see results below). With the larger API in integration, retrieval via hydraBox maxes out only one CPU core on my multicore system. With the smaller API in production, multiple cores are used.

Test results for the given test data:

parallel requests   sparql-proxy (prod)   hydraBox (prod)   sparql-proxy (integ)   hydraBox (integ)
1                   0.5s                  0.2s              0.5s                   11s
2                   0.5s                  0.2s              0.5s                   22s
4                   0.7s                  0.2s              0.5s                   52s
8                   0.9s                  0.3s              0.5s                   2m 9s
16                  0.9s                  0.4s              1.1s                   4m 45s
256                 10s                   3.4s              13s                    -
1024                17s                   13s               51s                    -
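
As a rough reading aid for the hydraBox (integ) column, the totals can be divided by the number of parallel requests. This is only a sketch: the helper name `per_request` is made up for illustration, and the input values are the table's wall-clock totals converted to seconds (2m 9s = 129, 4m 45s = 285).

```shell
#!/bin/bash
# Input lines: "<parallel requests> <total wall-clock seconds>"
# Output: total time divided by the number of parallel requests.
per_request() {
  awk '{ printf "%d parallel: %.1fs per request\n", $1, $2 / $1 }'
}

# hydraBox (integ) totals from the table, in seconds.
printf '%s\n' "1 11" "2 22" "4 52" "8 129" "16 285" | per_request
```

This prints 11.0, 11.0, 13.0, 16.1 and 17.8 seconds per request. The total grows at least linearly with the number of parallel requests, which would be consistent with the requests being processed largely one after another, though the numbers alone do not prove that.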

This shows the context of the test data:

mira@blinky:~$ ln -s ~/git/stat.stadt-zuerich.ch/api/AST-RAUM-ZEIT-BTA-shape.sparql.es6 AST-RAUM-ZEIT-BTA-shape.rq

mira@blinky:~$ time curl -s -i -H "accept:application/ld+json" --data-urlencode query@AST-RAUM-ZEIT-BTA-shape.rq http://localhost:8080/query -oout.json

real	0m0.385s
user	0m0.000s
sys	0m0.017s

mira@blinky:~$ du -sh out.json 
40K	out.json

mira@blinky:~$ time curl -s -i -H "accept:application/ld+json" http://localhost:8080/dataset/AST-RAUM-ZEIT-BTA -oout.json
real	0m10.516s
user	0m0.005s
sys	0m0.010s

mira@blinky:~$ du -sh out.json 
48K	out.json

This shows how I ran the requests in parallel. Switching targets (sparql-proxy or hydraBox) is done
by uncommenting the respective line in multi.sh:

mira@blinky:~$ cat multi.sh 
#!/bin/bash

task(){
    curl -s -i -H "accept:application/ld+json" --data-urlencode query@AST-RAUM-ZEIT-BTA-shape.rq http://localhost:8080/query -o/dev/null
  # curl -s -i -H "accept:application/ld+json" http://localhost:8080/dataset/AST-RAUM-ZEIT-BTA -o/dev/null
 
  echo "done with #$1" 
}

echo $1 parallel ...

for i in $(seq 1 $1); do task "$i" & done; wait

mira@blinky:~$ time ./multi.sh 16
16 parallel ...
done with #9
done with #3
done with #11
done with #5
done with #1
done with #12
done with #15
done with #13
done with #14
done with #8
done with #10
done with #7
done with #16
done with #6
done with #4
done with #2

real	0m1.126s
user	0m0.131s
sys	0m0.094s

mchlrch commented Aug 9, 2018

To clarify: If I reduce the API in integration to 1 DataSet, without changing anything else, I also get the 1024 requests done in 12 seconds.


mchlrch commented Sep 4, 2018

The initial description blended multiple scenarios that I had tested. Here is
a simplified and more explicit description to make the issue easier to reproduce.

Software under test: https://github.com/statistikstadtzuerich/stat.stadt-zuerich.ch

  • branch: testcases4stack
  • commit: 6bb79ca5f8ea9b43ca9bed15518cb8f7cad2c1e8

Start the API Backend:

$ npm run start-apidev

Without any modifications to the source, the Shape-API only contains one Dataset (BEW-RAUM-ZEIT-HEL). The API description is in the folder api_apidev.

Let's hit the API with curl:

$ time curl -s -i -H "accept:application/ld+json" http://localhost:8080/dataset/BEW-RAUM-ZEIT-HEL -oout.json

real	0m1.633s
user	0m0.009s
sys	0m0.009s

To be able to update the API afterwards, the password for the SPARQL endpoint must be set as an environment variable. Endpoint and user are defined in api-config.apidev.js.

$ export SPARQL_ENDPOINT_PASSWORD='foopass!_987'

Now we will include more Datasets in the API. In the configuration file api-config.apidev.js, comment out the WHITELISTING viewFilter and activate the BLACKLISTING viewFilter instead. Then update the API with the following command. The Shape-API will be generated for all Datasets matching the viewFilter.

$ npm run update-api-apidev

Restart the backend and hit the API again with curl. Startup takes a bit longer this time (this is a known issue):

$ npm run start-apidev
$ time curl -s -i -H "accept:application/ld+json" http://localhost:8080/dataset/BEW-RAUM-ZEIT-HEL -oout.json

real	0m10.182s
user	0m0.008s
sys	0m0.008s

In my setup, the response time increased from 1.6 to 10.2 seconds, for the same dataset.
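
For reference, the slowdown factor can be computed directly from the two `real` values reported by `time`. This is a small sketch; the helper names `to_seconds` and `slowdown` are made up for illustration.

```shell
#!/bin/bash
# Convert a `time`-style value such as 0m10.182s into seconds.
to_seconds() {
  awk -v t="$1" 'BEGIN {
    split(t, p, "m")          # p[1] = minutes, p[2] = seconds with trailing "s"
    sub(/s$/, "", p[2])
    printf "%.3f\n", p[1] * 60 + p[2]
  }'
}

# Print how many times slower the second measurement is than the first.
slowdown() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.1f\n", b / a }'
}

slowdown 0m1.633s 0m10.182s
```

With the two measurements above, this prints 6.2, i.e. the same request takes roughly six times as long once the API covers more Datasets.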
