Performance degrades with the size of the API #26

mchlrch · 2018-08-09T21:42:36Z

Performance seems to degrade as the API gets bigger.

In my test scenario, I looked at the difference between an integration and a production environment:

production: Shape-API for 9 DataSets
integration: Shape-API for 801 DataSets

I also compared performance between going through 1) the sparql-proxy and 2) hydraBox.
The query against the triplestore is the same in both cases.

With the given test data, the response is 40 KB of ld+json when retrieved via sparql-proxy and 48 KB when retrieved via hydraBox.

I also submitted requests in parallel (see results below). I noticed that retrieval via hydraBox with the larger API in integration maxes out only one CPU core on my multi-core system. This is different with the smaller API in production, where multiple cores get used.

Test results for the given test data:

parallel-requests	sparql-proxy (prod)	hydraBox (prod)	sparql-proxy (integ)	hydraBox (integ)
1	0.5s	0.2s	0.5s	11s
2	0.5s	0.2s	0.5s	22s
4	0.7s	0.2s	0.5s	52s
8	0.9s	0.3s	0.5s	2m 9s
16	0.9s	0.4s	1.1s	4m 45s
256	10s	3.4s	13s	-
1024	17s	13s	51s	-
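As a rough check on the single-core observation, the hydraBox (integ) column can be divided by the number of parallel requests. This is only a sketch using the figures from the table above, with 2m 9s and 4m 45s converted by hand to 129 s and 285 s; the function name `per_request_ms` is made up for illustration.

```shell
#!/bin/bash
# Wall time per request for "hydraBox (integ)", from the table above.
# Hand-converted totals: 2m 9s = 129 s, 4m 45s = 285 s.

# per_request_ms PARALLEL TOTAL_SECONDS
# Prints total time divided by the number of parallel requests,
# in milliseconds (integer division, so the result is truncated).
per_request_ms() {
  local parallel=$1 total_s=$2
  echo $(( total_s * 1000 / parallel ))
}

for row in "1 11" "2 22" "4 52" "8 129" "16 285"; do
  read -r n total <<< "$row"
  printf '%2d parallel: %5d ms per request\n' "$n" "$(per_request_ms "$n" "$total")"
done
```

The quotient comes out at 11000, 11000, 13000, 16125 and 17812 ms. If the requests were actually processed concurrently, this value should drop as parallelism rises; instead it stays at or above the single-request time, which is at least consistent with the requests being worked off one after another on a single core.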

This shows the context of the test data:

mira@blinky:~$ ln -s ~/git/stat.stadt-zuerich.ch/api/AST-RAUM-ZEIT-BTA-shape.sparql.es6 AST-RAUM-ZEIT-BTA-shape.rq

mira@blinky:~$ time curl -s -i -H "accept:application/ld+json" --data-urlencode query@AST-RAUM-ZEIT-BTA-shape.rq http://localhost:8080/query -oout.json

real	0m0.385s
user	0m0.000s
sys	0m0.017s

mira@blinky:~$ du -sh out.json 
40K	out.json

mira@blinky:~$ time curl -s -i -H "accept:application/ld+json" http://localhost:8080/dataset/AST-RAUM-ZEIT-BTA -oout.json
real	0m10.516s
user	0m0.005s
sys	0m0.010s

mira@blinky:~$ du -sh out.json 
48K	out.json

This shows how I ran the requests in parallel. Switching targets (sparql-proxy or hydraBox) is done by uncommenting the respective line in multi.sh:

mira@blinky:~$ cat multi.sh 
#!/bin/bash

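# Send one request; exactly one of the two curl lines should be active.
task(){
    # Target 1: sparql-proxy
    curl -s -i -H "accept:application/ld+json" --data-urlencode query@AST-RAUM-ZEIT-BTA-shape.rq http://localhost:8080/query -o/dev/null
    # Target 2: hydraBox
  # curl -s -i -H "accept:application/ld+json" http://localhost:8080/dataset/AST-RAUM-ZEIT-BTA -o/dev/null

  echo "done with #$1"
}

echo "$1 parallel ..."

# Start $1 requests in the background, then wait for all of them.
for i in $(seq 1 "$1"); do task "$i" & done; wait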

mira@blinky:~$ time ./multi.sh 16
16 parallel ...
done with #9
done with #3
done with #11
done with #5
done with #1
done with #12
done with #15
done with #13
done with #14
done with #8
done with #10
done with #7
done with #16
done with #6
done with #4
done with #2

real	0m1.126s
user	0m0.131s
sys	0m0.094s
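For anyone reproducing this, here is a possible variant that selects the target by name instead of editing the script. It is a sketch, not the script used for the measurements above: the function `run_parallel`, the target names, and the variables `BASE_URL`, `QUERY_FILE` and `DATASET` are made up here, and their values simply mirror the commands shown earlier.

```shell
#!/bin/bash
# Sketch: choose the target by argument rather than (un)commenting lines.
# Values mirror the commands above; the variable names are illustrative.
BASE_URL="http://localhost:8080"
QUERY_FILE="AST-RAUM-ZEIT-BTA-shape.rq"
DATASET="AST-RAUM-ZEIT-BTA"

# task TARGET: send one request to "sparql-proxy" or "hydrabox".
task() {
  local target=$1
  case "$target" in
    sparql-proxy)
      curl -s -i -H "accept:application/ld+json" \
        --data-urlencode "query@${QUERY_FILE}" "${BASE_URL}/query" -o /dev/null ;;
    hydrabox)
      curl -s -i -H "accept:application/ld+json" \
        "${BASE_URL}/dataset/${DATASET}" -o /dev/null ;;
    *)
      echo "unknown target: $target" >&2
      return 2 ;;
  esac
}

# run_parallel TARGET COUNT: start COUNT requests in the background,
# then wait until all of them have finished.
run_parallel() {
  local target=$1 count=$2 i
  echo "$count parallel ($target) ..."
  for i in $(seq 1 "$count"); do
    task "$target" &
  done
  wait
}
```

After sourcing the file, `time run_parallel hydrabox 16` would correspond to the hydraBox run and `time run_parallel sparql-proxy 16` to the sparql-proxy run.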

mchlrch · 2018-08-09T21:53:50Z

To clarify: If I reduce the API in integration to 1 DataSet, without changing anything else, I also get the 1024 requests done in 12 seconds.

mchlrch · 2018-09-04T09:19:07Z

The initial description blended several scenarios that I had tested. Here is a simplified, more explicit description that should make the issue easier to reproduce.

Software under test: https://github.com/statistikstadtzuerich/stat.stadt-zuerich.ch

branch: testcases4stack
commit: 6bb79ca5f8ea9b43ca9bed15518cb8f7cad2c1e8

Start the API Backend:

$ npm run start-apidev

Without any modifications to the source, the Shape-API only contains one Dataset (BEW-RAUM-ZEIT-HEL). The API description is in the folder api_apidev.

Let's hit the API with curl:

$ time curl -s -i -H "accept:application/ld+json" http://localhost:8080/dataset/BEW-RAUM-ZEIT-HEL -oout.json

real	0m1.633s
user	0m0.009s
sys	0m0.009s

To be able to update the API afterwards, the password for the SPARQL endpoint must be set as an environment variable. Endpoint and user are defined in api-config.apidev.js.

$ export SPARQL_ENDPOINT_PASSWORD=foopass!_987

Now we will include more Datasets in the API. In the configuration file api-config.apidev.js, comment out the WHITELISTING viewFilter and activate the BLACKLISTING viewFilter instead, then update the API with the following command. The Shape-API will be generated for all Datasets matching the viewFilter.

$ npm run update-api-apidev

Restart the backend and hit the API again with curl. Startup takes a bit longer this time (this is a known issue):

$ npm run start-apidev
$ time curl -s -i -H "accept:application/ld+json" http://localhost:8080/dataset/BEW-RAUM-ZEIT-HEL -oout.json

real	0m10.182s
user	0m0.008s
sys	0m0.008s

In my setup, the response time increased from 1.6 to 10.2 seconds, for the same dataset.
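A single measurement right after a restart may include warm-up effects, so it can help to repeat the request a few times. The following is a minimal sketch using curl's `--write-out` variable `time_total`; the helper names `time_request` and `repeat_timing` are made up for illustration and were not part of the measurements above.

```shell
#!/bin/bash
# Sketch: time the same request several times with curl's --write-out.

# time_request URL: print the total time of one request in seconds.
time_request() {
  curl -s -o /dev/null -H "accept:application/ld+json" \
    -w '%{time_total}\n' "$1"
}

# repeat_timing URL RUNS: print RUNS timings, one per line.
repeat_timing() {
  local url=$1 runs=$2 i
  for i in $(seq 1 "$runs"); do
    time_request "$url"
  done
}

# Example call (commented out so that sourcing this file sends nothing):
# repeat_timing http://localhost:8080/dataset/BEW-RAUM-ZEIT-HEL 5
```

If the timings stay around 10 seconds on every run, the slowdown is per request rather than a one-time startup cost.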
