Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dframe proposal #19

Open
wants to merge 26 commits into
base: master
Choose a base branch
from
Open

Dframe proposal #19

wants to merge 26 commits into from

Conversation

sbinet
Copy link
Member

@sbinet sbinet commented Jan 11, 2019

No description provided.

dframe/README.md Outdated Show resolved Hide resolved
Copy link
Member

@kortschak kortschak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just general comments.

dframe/README.md Show resolved Hide resolved
dframe/README.md Outdated Show resolved Hide resolved
dframe/README.md Outdated Show resolved Hide resolved
dframe/dfmat/dfmat.go Outdated Show resolved Hide resolved
dframe/dframe.go Outdated Show resolved Hide resolved
dframe/dframe_example_test.go Outdated Show resolved Hide resolved
dframe/dframe.go Outdated Show resolved Hide resolved
dframe/dframe.go Show resolved Hide resolved
dframe/dfmat/dfmat.go Outdated Show resolved Hide resolved
dframe/dfmat/dfmat.go Show resolved Hide resolved
// located at the provided source.
//
// Possible drivers: hdf5, npyio, csv, json, hdfs, spark, sql, ...
func Open(drv, src string) (*Frame, error) { ... }
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have a version that accepts a Reader instance here? Or, to simply make the src type more liberal (an empty interface under the hood I suppose...)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking about having a dfcsv package where some appropriate function would take the io.Reader:

package dfcsv // import "gonum.org/v1/exp/dframe/dfcsv"

func Read(r io.Reader, opts ...Option) (*Frame, error) { ... } // or name it Open?

and some database/sql-like driver registration code so one could write:

df, err := dframe.Open("csv", "file.csv")

@btracey
Copy link
Member

btracey commented Jan 12, 2019

Dataframes are clearly important for the Go ecosystem. However, I do wonder if Gonum is really the right place for this package. Gonum is nice in that it is self-contained, relying only on the standard library and exp/rand. This builds off several core pieces outside of what Go and Gonum provide, and thus would complicate the story of what Gonum "is". Beyond that, it seems to me the lessons of history point to keeping things contained and independent. The Go team has expressed lament that the standard library is as big as it is. The python ecosystem is just that, an ecosystem. Scipy, Sci-kit learn, pandas are all independent packages. Plotting in python has seen a number of different attempts by different groups and make the story exciting. We even see this story within the scientific Go ecosystem. You cite two previous versions of dataframe packages, and Gorgonia has been very successful, and it while it is outside Gonum it can still work with Gonum.

In a different repository there's a lot more freedom, even if it's many of the same people working to develop the codebase. I think the small set of dependencies is important to Gonum, and there are many API consistency considerations, especially involving memory allocation and the ability to make code run in parallel. A package outside of Gonum can easily break these constraints where reasonable to satisfy that package's goals. One idea for a name would be kugos which is the word for "bear" in Cebuano (and has an open github address), another idea would be "makwa" (though not open) which means bear in Algonquin.

@sbinet
Copy link
Member Author

sbinet commented Jan 14, 2019

@btracey

I think the small set of dependencies is important to Gonum

keeping the amount of dependencies to a minimum is good practice.
but I wouldn't use it to prevent dframe to find Gonum as a home.

see:

$> function show-deps() {
for pkg in $(go list ./...); do 
        go list -f "{{range .Deps}}
{{.}}
{{- end}}" $pkg | grep  "\." | grep -v "gonum.org"; 
   done | sort | uniq
}

$> cd $GOPATH/src/gonum.org/v1/gonum
$> show-deps
golang.org/x/exp/rand
golang.org/x/tools/container/intsets

$> cd ../plot
$> show-deps
github.com/ajstarks/svgo
github.com/golang/freetype
github.com/golang/freetype/raster
github.com/golang/freetype/truetype
github.com/jung-kurt/gofpdf
github.com/llgcode/draw2d
github.com/llgcode/draw2d/draw2dbase
github.com/llgcode/draw2d/draw2dimg
golang.org/x/image/draw
golang.org/x/image/font
golang.org/x/image/math/f64
golang.org/x/image/math/fixed
golang.org/x/image/tiff
golang.org/x/image/tiff/lzw
rsc.io/pdf

$> cd ../exp/dframe
$> show-deps
github.com/apache/arrow/go/arrow
github.com/apache/arrow/go/arrow/array
github.com/apache/arrow/go/arrow/internal/bitutil
github.com/apache/arrow/go/arrow/internal/cpu
github.com/apache/arrow/go/arrow/internal/debug
github.com/apache/arrow/go/arrow/memory
github.com/pkg/errors

my reasoning for trying to have dframe under the Gonum umbrella is:

  • to make it adhere to the high standards of Gonum
  • to make it as an entry point for people to hopefully then diffuse in other Gonum repos
  • to somewhat gather a critical mass of Gonum developers.

right now, under Gonum, we have the following active repos:

  • gonum.org/v1/exp
  • gonum.org/v1/gonum
  • gonum.org/v1/hdf5
  • gonum.org/v1/netlib
  • gonum.org/v1/plot
  • gonum.org/v1/tools

hardly a super crowded place :)

that said, if many Gonum devs feel like dframe has no business to live under gonum.org, well, I'll try to find another home for it.

@sbinet sbinet force-pushed the dframe-proposal branch 2 times, most recently from f593cdb to 0838c52 Compare January 14, 2019 17:19
func (df *Frame) Exec(f func(tx *Tx) error) error { ... }

func example(df *dframe.Frame) {
err := df.Exec(func(tx *dframe.Tx) error {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The need for reference counting in Arrow will probably incur some mental overhead and be a source of bugs since it's different from how memory is handled in Go normally. This would probably be a bigger issue in the case of an immutable dataframe than a mutable but it will always be present (right?).

I like the idea of a transaction, chained calls or not, immutable or not. Your suggestion would probably give the greatest flexibility, at the cost of some noise.

This is an alternative which removes some noise (function declaration and return) in client code at the cost of sligthly less flexibility.

// Client code
df.Exec(dframe.Slice(0, 10), dframe.Select("col1", "col2"), dframe.Apply("col1 + col2"))

// In package dframe
type ExecFunc func(df *Frame)

func(df *Frame) Exec(ff... ExecFunc) {
	/* apply functions, manage reference counts, etc */ 
}

func(df *Frame) doSelect(cols... string) {
	// Select columns
}

func Select(cols... string) ExecFunc {
	return func(df *Frame) {
                // Either manipulate df directly, or, as below, call package private method on df to do the real work.
		df.doSelect(cols...)
	}
}

func Slice(cols... string) ExecFunc {
	// Similar to above
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The need for reference counting in Arrow will probably incur some mental overhead and be a source of bugs

yeah... I tried at some point to get rid of it in Go Arrow, but I didn't find a way to support GPGPU (or any alternate memory backing store) and drop this ref-counting thing.

hiding the ref-counting behind the dframe.Tx transaction was my way to handle this (so one gets only 1 dframe.Frame that needs to be defered with Release(), like the usual os.File.Close. This greatly reduces the amount of possible code paths that can go wrong.)

This is an alternative which removes some noise (function declaration and return) in client code at the cost of sligthly less flexibility.

that's an interesting alternative. thanks for bringing it up!
it indeed reduces the amount of user-code "noise".
it creates and returns one closure/function per operation, compared to one big closure/function.
I must say I don't know whether that really matters performance wise (CPU, Mem) though...

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume it should not matter performance wise for any but the smallest dataframes. But one should not assume things. :-)

@nickpoorman
Copy link

Wow this looks incredibly promising so far!

}

// Column returns the i-th column of this Frame.
func (df *Frame) Column(i int) *array.Column {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to consider here, is the signatures:

func (df *Frame) NumCols() int64
func (df *Frame) Column(i int)
func (df *Frame) Name(i int)

If someone calls NumCols() to get the number of columns and then uses that int64 to iterate over the columns by way of Column(i int), they are going to have a problem because it accepts an int.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right.

I've always been a bit split b/w using "natural" primitive type (such as int) and also being able to use and follow the arrow standard (ie: int64).

I put Column(int) because I'd somehow expected the columns would be stored as []someType...

perhaps one should just have NumCols() int.

@sbinet
Copy link
Member Author

sbinet commented Apr 16, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants