Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal (WIP): PushReader, a simpler reader API #113

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

josh-newman
Copy link
Contributor

This is a concept for making data reading simpler than ReaderFunc by allowing "normal" Go state (in a closure).

slice := bigslice.PushReader(Nshard, func(shard int, push func(string, int)) error {
	fuzzer := fuzz.NewWithSeed(1)
	var row struct {
		string
		int
	}
	for i := 0; i < N; i++ {
		fuzzer.Fuzz(&row)
		push(row.string, row.int)
	}
	return nil
})

The performance cost of this may be significant; I haven't measured yet. I wanted to start by having a concrete example of what user code will look like.

@mariusae
Copy link
Collaborator

Shouldn't this be called WriterReader? "Push" seems to imply streaming to me.

@mariusae
Copy link
Collaborator

Shouldn't this be called WriterReader? "Push" seems to imply streaming to me.

or "WriteReader" maybe

@josh-newman
Copy link
Contributor Author

I'd worry about WriteReader looking confusing to a newcomer, especially if we add a similar variant of WriterFunc (since those users may want to use defers, too), and we end up with ReaderFunc, WriteReader, ReadWriter, WriterFunc 😄.

@josh-newman
Copy link
Contributor Author

But I'll think more about naming.

@mariusae
Copy link
Collaborator

I agree. We should probably be consistent about what the prefix and suffix is. For example, if the suffix indicated the data flow (i.e., we should have called it FuncReader instead of ReaderFunc), then this would be clearer I think..

@jschellenberger
Copy link

I would fully support having this functionality. It vastly simplifies ReaderFunc()

@josh-newman
Copy link
Contributor Author

I haven't thought more about naming yet, but I wrote a basic in-memory benchmark with a trivial reader task. Very roughly, it seems ~30% slower. Perhaps for a non-trivial program, this 30% extra overhead may be insignificant compared to the readability advantage.

@josh-newman
Copy link
Contributor Author

I vectorized channel operations, which has a minor (if any) effect on the existing benchmarks, but intuitively may help when there's more parallelism.

I added a benchmark to demonstrate that the overhead involved is minimal if each row computation does more work, which might make this suitable for several GRAIL-internal usages. As @jcharum pointed out, there's still some reflect overhead, but it's the same situation in other bigslice ops, so it's probably ok here.

In terms of naming: most of the existing Slice-producing API functions have active/verb names. In that spirit, we could call this bigslice.Read, and it can be the default choice for users since it's easier to use. Then we can mark ReaderFunc (or actually rename it) as a second choice for special situations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants