Rationale - The Second Deedle #13

Open
pkese opened this issue Jan 9, 2019 · 16 comments
Labels
further discuss need further discuss to find the best solution

Comments

@pkese

pkese commented Jan 9, 2019

Please provide a rationale somewhere in the main README for why one should use, develop, or participate in this project rather than Deedle.

@Oceania2018
Member

Good question. I know there will be some duplication of work, but there are some differences between the two projects, and users can choose according to their own situation. I just came up with some reasons.
We want Pandas.NET to:

  1. be more like Pandas; I mean everything (99%): function names, letter case.
  2. be friendly for C#.
  3. have better performance (many people complain about Deedle's performance).
  4. work more closely with SciSharp projects.
  5. use dtype instead of a generic design.
  6. be written for .NET Standard 2.x from the start.

@dotChris90 Do you have more?

@pkese I hope we can find some common ground and cooperate; we are all advocates of .NET.

@pkese
Author

pkese commented Jan 9, 2019

@Oceania2018 Good. I'm all in if you wish to improve upon Deedle.

I have, however, found that the number of annoyances in Deedle is approximately the same as in Pandas. With Pandas you have a somewhat larger surface area, whereas with Deedle you have to deal with types a bit more. Deedle can have a smaller surface area, because you can simply iterate over the data with a for-loop in .NET without a performance penalty, so there really isn't much need to provide 100% feature compatibility with Pandas (oftentimes Pandas code can be quite hard to read, and a for-loop over the data would be much more legible, albeit slow in Python).

Regarding performance, Deedle is on average good enough, and I'm not sure you can beat it in any substantial way. The main thing is that Deedle is way faster than Pandas (in my experience, what took 4 hours in Pandas took just 5 minutes in Deedle). Adding a few percent on top of that is negligible.

My main worry, however, is that there are 5 or 6 big SciSharp projects (Pandas.Net, Tensorflow, NumSharp, SciSharpLearn, etc.) with something like 3 active developers behind them. That is a rather large surface area for such a small team to cover, and there should be a solid reason for people to switch to or start participating in your project rather than more established projects with larger existing communities and solid documentation. And by 'reasons' I mean reasons besides not-invented-here or not-Pythonic-enough.

If you can't provide a good answer to such questions, you will be unlikely to gain much community support and without community you will eventually give up - wasting your (and even other people's) time. On the other hand, if you provide excellent answers to those questions, people might prefer to contribute their time and code to your project rather than to Deedle.

@Oceania2018
Member

Pandas.NET is based on NumSharp, just as Pandas is based on NumPy. Where is Deedle's NumPy?

NumSharp adopts several providers; the default is implemented in pure C# (worst performance).
We have imported LAPACK and are working on MKL.
We plan to use C++ to optimize the for-loop issue.

Deedle uses object everywhere, so obviously the performance won't be good. Check out NumSharp's Benchmark project. We even gave up the generic design in some places, because generics bring a performance penalty.

@dotChris90
Member

The only thing I know for now is that NumSharp uses its own NDArrays, not .NET arrays.
Most .NET numeric projects use .NET arrays and do not create their own.
We made NDArrays that store their elements in a single 1D array (row-wise or column-wise - both are possible). We did this so that an array can easily be reshaped to a specific form, and because native libraries like LAPACK use 1D arrays instead of matrices, tensors, etc.
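
To make the flat-storage idea above concrete, here is a minimal sketch in Python. It is not NumSharp's actual implementation: the class name `FlatNDArray`, its fields, and its methods are hypothetical and exist only for illustration. It shows how a multi-dimensional index can map to an offset in a single 1D buffer in either row-major ("C") or column-major ("F") order, and why reshaping needs no copy.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Tuple


@dataclass
class FlatNDArray:
    """Hypothetical N-d array backed by one flat list (not NumSharp's real class)."""

    data: List[float]
    shape: Tuple[int, ...]
    order: str = "C"  # "C" = row-major, "F" = column-major

    def __post_init__(self) -> None:
        if self.order not in ("C", "F"):
            raise ValueError("order must be 'C' or 'F'")
        if prod(self.shape) != len(self.data):
            raise ValueError("shape does not match the number of elements")

    def _offset(self, index: Tuple[int, ...]) -> int:
        """Translate an N-d index into a position in the flat buffer."""
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        for i, n in zip(index, self.shape):
            if not 0 <= i < n:
                raise IndexError("index out of bounds")
        dims = list(zip(index, self.shape))
        if self.order == "C":
            dims.reverse()  # in row-major order the last axis varies fastest
        offset, stride = 0, 1
        for i, n in dims:
            offset += i * stride
            stride *= n
        return offset

    def get(self, *index: int) -> float:
        return self.data[self._offset(index)]

    def reshape(self, *shape: int) -> "FlatNDArray":
        # Only the shape metadata changes; the flat buffer is shared, not copied.
        return FlatNDArray(self.data, tuple(shape), self.order)


a = FlatNDArray(list(range(6)), (2, 3))        # row-major 2x3
b = a.reshape(3, 2)                            # same buffer viewed as 3x2
f = FlatNDArray(list(range(6)), (2, 3), "F")   # column-major 2x3
print(a.get(1, 2), b.get(2, 1), f.get(0, 1))   # prints "5 5 2"
```

Because the buffer is already one contiguous 1D sequence, it could in principle be handed to a routine that expects flat input without first flattening a nested structure, which is the motivation given above.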

Deedle uses .NET arrays (correct me if I am wrong).
So people could use Deedle and cast the .NET arrays to NumSharp NDArrays. That would be possible.

We could also talk to the FSLab community in general to see if they are interested in a SciPy-like stack.

And I think when I tried out Deedle it did not work well in PowerShell (another .NET language) - but before somebody complains: this could be related to PowerShell's import mechanism, not sure.

@dotChris90
Member

@Oceania2018 lol, I wanted to say the same as you just now.

@pkese
Author

pkese commented Jan 9, 2019

Wonderful. That's exactly the stuff that you need to expose a bit more and put in front.

@Oceania2018
Member

@pkese Deedle uses object and generics everywhere, while NumSharp uses dtype. That's the biggest difference: it is more elegant and works exactly the same as the Python style. I really like NumSharp's dtype design.
Try the unit tests, think about it, and do some benchmarks.
You are welcome to discuss.
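
As a rough illustration of what a dtype-based design means in contrast to a compile-time generic parameter, the sketch below uses Python's standard `array` module. It is an assumption-laden toy, not NumSharp's API: `DTypeArray`, `_TYPECODES`, and the three supported dtype names are hypothetical. The element type is a runtime tag on a single non-generic container, and that tag selects a typed, unboxed buffer.

```python
from array import array

# Hypothetical mapping from dtype names to typed-buffer codes of the array module.
_TYPECODES = {"int64": "q", "float32": "f", "float64": "d"}


class DTypeArray:
    """One non-generic container; the element type is a runtime dtype tag."""

    def __init__(self, values, dtype="float64"):
        if dtype not in _TYPECODES:
            raise TypeError(f"unsupported dtype: {dtype}")
        self.dtype = dtype
        # Elements live in a typed buffer, not as a list of boxed objects.
        self._buf = array(_TYPECODES[dtype], values)

    def astype(self, dtype):
        """Return a converted copy; float-to-int conversion truncates toward zero."""
        cast = int if dtype.startswith("int") else float
        return DTypeArray((cast(v) for v in self._buf), dtype)

    def sum(self):
        return sum(self._buf)

    def tolist(self):
        return self._buf.tolist()


x = DTypeArray([1.5, 2.7])        # dtype defaults to "float64"
y = x.astype("int64")
print(y.dtype, y.tolist())        # prints "int64 [1, 2]"
```

The trade-off in such a design is that type errors surface at run time rather than at compile time, in exchange for an interface that reads like the Python one.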

@dotChris90
Member

@Oceania2018 yes yes - but I have to admit @pkese is right - we need to extend the README. Otherwise people will think "yes, this is a 2nd Deedle"

@dotChris90
Member

or people will think "why are these guys making a 2nd Deedle?"

@Oceania2018
Member

We pursue a Python-like experience, just as smooth as Python when you do machine learning in .NET. @pkese That's the other point. @dotChris90 Yes, we will explain more in the README.

Oceania2018 changed the title from "Rationale" to "Rationale - The Second Deedle" on Jan 9, 2019

@tpetricek

I don't have enough time to join a detailed discussion, but saying Deedle uses objects everywhere is not right. When you have a column of floating point values, the data is actually stored as float[] - the public interface hides that somewhat, but when you get a column as type Series<DateTime, float>, you get pretty direct access to the underlying array of floats.

@totalgit74

totalgit74 commented Feb 10, 2019

Deedle is OK, but it suffers from poor performance on larger datasets. I found the Extreme Optimization library to be far more performant (an order of magnitude at least) when I last compared them. However, one is free and the other is licensed. Deedle is a small-dataset-only solution and in no way comparable to Pandas or where Pandas is headed.
With regard to Pandas and copying it in .NET, I would make sure that you are copying where Pandas is headed and not where it has been. Wes McKinney has pointed out some major warts/flaws in Python and its implementation under the hood here. I would aim for that same end-point of Apache Arrow usage, or else you'll just be a poor man's Pandas in .NET. Parquet file usage would be a requirement. The last thing you want to do is spend a lot of time and effort creating the Pandas of 2017 in .NET in 2019.

NB: When I'm talking about large datasets I'm only looking at millions of rows, so not even big data. Deedle is palatable for perhaps thousands or tens of thousands of rows.

@lidanger

Deedle's interfaces are so different from Pandas that a .NET port of Pandas is absolutely necessary in order to use the achievements of Python. After all, IronPython cannot be used as a version of Python.

@lidanger

Recently, I found that the next version of the pythonnet project may solve many problems with interoperability between C# and Python.

@Oceania2018 @Esther2013

@Oceania2018
Member

@lidanger have you set it up in pythonnet?

Oceania2018 added the "further discuss" label (need further discuss to find the best solution) on Apr 29, 2019

@lidanger

lidanger commented Apr 30, 2019

I have used pandas and other Python packages through pythonnet for several months. Version 2.3 is not so good for multi-platform use, but 2.4 has made great progress. Although it has not been released, I have been using it successfully these days in my project, targeting .NET Core 2.1 and .NET Framework 4.6.1.

@Oceania2018


6 participants