Performance issue relative to other access methods #188
Comments
I will need data to check, but I think that the …
Yes, when I first did the benchmarking with …
How much time does … take?
You should not need to recompile the library to profile it; JetBrains dotTrace should be able to show how long each call takes.
My initial attempts couldn't see much of interest below the …
If you share the parquet file (with wrangled data), I can try to look into it. By the way, calling the async methods has no benefit because they aren't implemented; they call the non-async code behind the scenes.
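For context on why the async calls can't help here: `DbCommand.ExecuteReaderAsync` has a default base-class implementation in ADO.NET that simply runs the synchronous path, so unless a provider overrides it, the awaited call does the same blocking work. A minimal sketch (assuming the DuckDB.NET.Data package; names here are illustrative) that times both paths:

```fsharp
// Sketch: without a provider override, DbCommand.ExecuteReaderAsync just
// runs the synchronous ExecuteReader, so the timings should be near-identical.
open System.Diagnostics
open DuckDB.NET.Data

let drainSync (conn: DuckDBConnection) (sql: string) =
    use cmd = conn.CreateCommand()
    cmd.CommandText <- sql
    use reader = cmd.ExecuteReader()
    while reader.Read() do ()

let drainAsync (conn: DuckDBConnection) (sql: string) =
    task {
        use cmd = conn.CreateCommand()
        cmd.CommandText <- sql
        use! reader = cmd.ExecuteReaderAsync()
        while reader.Read() do ()
    }

let timeBoth (conn: DuckDBConnection) (sql: string) =
    let sw = Stopwatch.StartNew()
    drainSync conn sql
    printfn "sync : %d ms" sw.ElapsedMilliseconds
    sw.Restart()
    (drainAsync conn sql).GetAwaiter().GetResult()
    printfn "async: %d ms" sw.ElapsedMilliseconds
```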
Digging a little deeper into the …
Can you also try line-by-line profiling?
Here is the line-by-line output. Let me see if I can build you a synthetic data set that reproduces the issue. I'm still a bit paranoid that the Python/duckdb CLI equivalents are cheating somehow, because the output looks realistic: the low-level IO calls are taking all the time. But there must be some difference, and even if the Python/CLI tools are cheating, I'd like the benefit of the shortcut.
So, with the preface that there is a bit of a rabbit hole here (more below), I built some code that can reproduce a dataset. Since the generated dataset is big, it's probably kinder and more useful to give you the code rather than the output. There are three scripts in this gist:

Example: generating data and querying via F#, Python, and the command line tool.

Rabbit hole: one thing I discovered while simulating the data is that my parquet file is not sorted on the key I am using for retrieval. Due to the way the original file was generated, you get clumps of values with the same key, but they aren't all in one place. The obvious thing for me to do here is to sort the table; that should be kinder to duckdb overall. The simulated data is more randomly ordered (not even a little bit clumped), so I think that's why the retrieval times go up. Python still has an advantage here for some reason, and the command line tool is about as fast (4.2-ish seconds). I am going to presort the table so it's at least presented in an optimal fashion and will see what that does, but I'm still interested in where the speed gap is.
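For reference, the presort step can be done in one DuckDB `COPY` statement, which also lets you pick the row-group size. A minimal sketch run through DuckDB.NET (hypothetical paths, and assuming the retrieval key column is literally named `key`):

```fsharp
// Sketch: rewrite the parquet file sorted on the retrieval key, with
// 100k-row row groups, using DuckDB's COPY ... TO (FORMAT PARQUET ...).
open DuckDB.NET.Data

let presort () =
    use conn = new DuckDBConnection("Data Source=:memory:")
    conn.Open()
    use cmd = conn.CreateCommand()
    cmd.CommandText <-
        """COPY (SELECT * FROM read_parquet('/path/to/parquet.pq') ORDER BY key)
           TO '/path/to/sorted.pq' (FORMAT PARQUET, ROW_GROUP_SIZE 100000)"""
    cmd.ExecuteNonQuery() |> ignore
```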
Using a sorted version of the same parquet file and slightly smaller row groups (100k vs 1M rows per group), the dotnet version is way faster: 8-10 ms per query, and the Python is comparable. Note that the first few queries are slower for both libs but then speed up, after which they seem very similar. So whatever the difference is, it might only matter when lots of different pages are getting retrieved.
@daz10000 I had to remove several columns from the query because I was getting OOM otherwise. By the way, have you joined the dotnet channel on the DuckDB Discord? We can chat there for faster communication.
Does it look better?
That looks way better, and I'm looking forward to the new release coming up (thanks for the Discord chat too). All appreciated. I think you can close this if you're satisfied, and I'll grab the new release when it comes out.
I'll close this when I publish the updated package. By the way, the fix is already available from the GitHub package feed if you want to try it.
@daz10000 Just pushed 0.10.3 to NuGet. Can you test it when you have time? |
I saw that last night and took it for a test drive; it works beautifully. Much faster. Thanks again! Darren
Nice! I'm interested in how much difference the streaming vs non-streaming approach makes on your data. If you have time, can you run the following 4 benchmarks (2 with your old data) and compare? Thanks.
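The list of four benchmark variants didn't survive in this thread, but a rough harness for this kind of per-query latency comparison might look like the following (hypothetical paths and key values; the keys would come from the real key distribution). Timing each point query separately lets the same key list be replayed against the old/unsorted and new/sorted files, and against each package version:

```fsharp
// Sketch: replay a list of point queries against a given parquet file and
// report per-query latency and row counts.
open System.Diagnostics
open DuckDB.NET.Data

let benchmark (parquetPath: string) (keys: string seq) =
    use conn = new DuckDBConnection("Data Source=:memory:")
    conn.Open()
    for key in keys do
        let sw = Stopwatch.StartNew()
        use cmd = conn.CreateCommand()
        cmd.CommandText <-
            sprintf "SELECT col1, col2 FROM read_parquet('%s') WHERE key = '%s'" parquetPath key
        use reader = cmd.ExecuteReader()
        let mutable rows = 0
        while reader.Read() do rows <- rows + 1
        printfn "%s -> %d rows in %d ms" key rows sw.ElapsedMilliseconds

// e.g. benchmark "/path/to/sorted.pq" [ "value1"; "value2" ]
```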
I have an application pulling blocks of data (typically 100-1000 rows from a 20-ish-million-row parquet file) with ~15 columns, of which I use maybe 10. I am querying the parquet file with a simple
```sql
SELECT col1, col2, ... FROM read_parquet('/path/to/parquet.pq') WHERE key = 'value'
```
The code was originally Python, and with the equivalent Python bindings, wrapping the query in `duckdb.sql("<query>").df()` takes ~150 milliseconds per query; I run thousands of these. I'm seeing more like 2.5 seconds with duckdb.net, with the bulk of the time spent in the `ExecuteReader` call. I also tried a plain command line invocation with `duckdb -c '<query>'`, and it's also around 150 milliseconds, so duckdb.net is the outlier.

The data set isn't shareable, but assuming I can recreate something similar, I can try to share something. The data aren't super interesting: mainly lots of float and int columns with a few small string cols. My access pattern is something like the loop sketched below.

I guess, short of building and sending a synthetic data set, I had a few questions. Is this surprising? Should this be more than 10x slower than the underlying duckdb libraries? Am I doing something really dumb here that is hurting performance? I tried async versions of the API and saw no improvement; it's the `ExecuteReader` call that takes 99% of the time. My profiling couldn't get inside the duckdb libraries, but I'm going to try to build them and profile the whole thing as a next step. Any thoughts / help appreciated.
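(The original access-pattern snippet is missing here; as a stand-in, a minimal sketch of this kind of loop with DuckDB.NET, using hypothetical column and key names, with one connection opened once and reused for every point query:)

```fsharp
// Sketch only: not the reporter's actual code. Assumes two float columns
// and a string key column; column names and key values are hypothetical.
open DuckDB.NET.Data

let fetchBlock (conn: DuckDBConnection) (key: string) =
    use cmd = conn.CreateCommand()
    cmd.CommandText <-
        sprintf "SELECT col1, col2 FROM read_parquet('/path/to/parquet.pq') WHERE key = '%s'" key
    use reader = cmd.ExecuteReader()
    [ while reader.Read() do
        yield reader.GetDouble(0), reader.GetDouble(1) ]

let run () =
    use conn = new DuckDBConnection("Data Source=:memory:")
    conn.Open()
    for key in [ "value1"; "value2" ] do
        fetchBlock conn key |> List.length |> printfn "%s: %d rows" key
```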