Roadmap #290

Closed
14 of 35 tasks
st-pasha opened this issue Aug 30, 2017 · 6 comments

Comments

@st-pasha
Contributor

st-pasha commented Aug 30, 2017

  • Rollup stats for each column (Add additional utility stat functions for a numerical datatable column #276)
  • Write a CSV file
  • Generalized computational graph (DT[i, j] evaluation should be done in C++ #1477)
  • Grouping by a single column (integer type)
  • Group-by reduction operators: sum, min, max, mean, etc.
  • Grouping by a single column (other types)
  • Modify data in a datatable in-place
  • Sorting by multiple columns
  • Grouping by multiple columns
  • Joining (inner, left outer)
  • Insert / delete rows
  • Computation on variable-width columns (strings)
  • String operations
  • Rolling join
  • Group-by operators that involve sorting: median, quantile, rank (median also in implement median function #1530)
  • rbind disk-based columns
  • Keys to optimize performance
  • Non-equi joins
  • Compute rollup stats during fread
  • Categorical (enum) type
  • Datetime type
  • Time window functions
  • Full join
  • Save the parse format auto-detected by fread into the FReader object
  • skipna parameter for reduction operators
  • Repro of NYCTaxi analytics using datatable
  • Benchmark Szilard's test (grouping vs joining)
  • Benchmark against fst & paratext
  • Benchmark parity with R data.table 5 grouping tests - https://h2oai.github.io/db-benchmark/
  • Benchmark against TPC-H
  • Finish fread outstanding bug master list (#2247)
  • Comparison of fread capabilities against other read csv frameworks (esp. smart capabilities)
  • Demo of interoperability with other frameworks
  • Log operations with datatable (data provenance)
  • Read/write into feather (Read/write into feather #1461)

Out of scope

  • Multi-user access, including permissions
  • Sharded data files, growing file size
@jangorecki
Contributor

jangorecki commented Jan 10, 2019

Marked Benchmark parity with R data.table 5 grouping tests as resolved.
What about the cor, cov, and var functions: are they going to be in scope? Asking because cor (or cov + var) is required to answer q9 in the groupby benchmark.

edit: filed #1543

@st-pasha
Contributor Author

@jangorecki Of course, cor, cov, var are all quite straightforward, no reason not to add them.
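
For reference, the "cor or cov+var" remark relies on the standard identity cor(x, y) = cov(x, y) / sqrt(var(x) * var(y)). The sketch below illustrates a per-group correlation built from those two pieces in plain Python. It is only an illustration under assumptions: the helper names (`cov`, `var`, `cor`, `grouped_cor`) and the sample data are hypothetical, sample (n - 1) denominators are assumed, and nothing here reflects how datatable implements or exposes these functions.

```python
from collections import defaultdict
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    # Sample covariance (n - 1 denominator); needs at least two observations.
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def var(xs):
    # Variance is the covariance of a column with itself.
    return cov(xs, xs)


def cor(xs, ys):
    # Pearson correlation expressed through cov and var.
    return cov(xs, ys) / sqrt(var(xs) * var(ys))


def grouped_cor(keys, xs, ys):
    # Split both value columns by key, then reduce each group with cor().
    groups = defaultdict(lambda: ([], []))
    for k, x, y in zip(keys, xs, ys):
        groups[k][0].append(x)
        groups[k][1].append(y)
    return {k: cor(gx, gy) for k, (gx, gy) in groups.items()}


# Hypothetical data: group "a" is perfectly correlated, group "b" anti-correlated.
keys = ["a", "a", "a", "b", "b", "b"]
v1 = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0, 3.0, 2.0, 1.0]
result = grouped_cor(keys, v1, v2)
```

Because the (n - 1) factors cancel in the ratio, the same correlation comes out with population denominators, so a group-by `cor` can be derived from whichever `cov` and `var` conventions are available, as long as they match.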

@st-pasha
Contributor Author

Replaced with #2281

@mdancho84

Hi @st-pasha awesome work! I was wondering what the schedule is for some of the "Missing Functionality". I'm particularly interested in the pivoting and wide to long functions, which are the most apparent gaps between Pandas and datatable. Anyways, nice work on the package. It's very fast. Yay!

@st-pasha
Contributor Author

@mdancho84 Our current immediate focus is date-time functionality. After that we can get to pivoting/melting. See also this discussion: #2677

@mdancho84

@st-pasha Thanks a lot for your response. Much appreciated. Date/time is super important.


3 participants