initializing a workbook #8

Open
jennybc opened this issue May 18, 2016 · 1 comment

jennybc commented May 18, 2016

@richfitz

From my onion-peeling adventure, I gather that if I read a worksheet via rexcel_read(), I drop into rexcel_read_workbook() and then into rexcel_read_worksheet(). At least, those are the exported functions called. It feels like there's one more layer or one more function than necessary? Q1: Can you help me understand the role of rexcel_read()? I think it's the one whose purpose isn't clear.

In googlesheets, for better or worse, there's an explicit registration step, that creates an R object with metadata about a Google Sheet. Only with that in hand can you start reading stuff back out of it. With Google Sheets, this is practically a requirement vs. a voluntary design decision. But would a similar workflow make sense for rexcel?

I think I'm proposing that most of what's in rexcel_read_workbook() get moved into a workbook "registration" function. So that it's possible to get set up to read a workbook w/o actually diving down into any worksheets (currently not possible, I believe?).

I also think (correct me) that the current reading functions leave behind little to no info for worksheets that weren't specifically requested. Again, for a Google Sheet, when I register it, I create an overview of all worksheets (name and extent, mostly). When I think about us characterizing the Enron corpus, it would be nice to be able to register each workbook (15K) and get high-level info on the worksheets (80K) w/o necessarily reading their cells.

Q2: what do you think of a registration-based workflow?

Q3: what do you think of marshalling more data about worksheets at registration / workbook creation time? It creates an intermediate between practically no info and full reading of cells, etc.

Finally, it seems like one can return a linen::worksheet (rexcel_read_worksheet() does) and I wonder what that even means. Early on, the student who worked with me on googlesheets also allowed direct access to worksheets and this caused trouble. Technically, it was a problem because she implemented it in a way that ran up against some of XML's worst gotchas re: memory leakage. But conceptually it was also tricky. A worksheet can't exist outside a workbook, so you were always dragging around host workbook info anyway. So we implemented a policy where you either interacted with the object that comes from registering a sheet or with data coming out of the sheet. But there was no user-facing tangible notion of anything in between. I know our situation is different (R6 class, local xlsx, etc.) but still ....

Q4: what's the deal with worksheet objects? This question is kinda vague. Sorry.

Let me know if we should just Skype for some/all of these.

@richfitz

Q1: Can you help me understand the role of rexcel_read()?

rexcel_read exists for the (I imagine) common use case of "read a sheet, probably the first, from a workbook". So it's a UI thing not a thing that is deeply important to the workings of the package.

Q2: what do you think of a registration-based workflow?

I don't mind what we do internally in the packages at all. It seems it's not that different to what the workbook bits are doing now, and if that bridges the gap between rexcel and googlesheets that's a good thing. I'm more cautious about what we present to users though, but we can build workflows on top of whatever primitives we feel are the necessary common denominators.

If you want to rework what is in there to match googlesheets more, go for it. My UI preference is that I can expose functions with the functionality of (but not necessarily the names of) rexcel_read and rexcel_read_workbook to users.

Q3: what do you think of marshalling more data about worksheets at registration / workbook creation time?

We talked about this in Vancouver and I still think it's a good idea. What is not totally clear to me though is what I can get about cell extent etc. from a sheet without parsing the entire XML. But I can live with parsing it twice given how slow the rest of the package is. We should brainstorm (another issue, linen perhaps?) about what set of data to gather there, and the things you can get from googlesheets probably make a good starting point.
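
One possible angle on the cell-extent question, as a rough sketch and not anything rexcel or linen currently does: an xlsx file is a zip archive, `xl/workbook.xml` lists the sheets, and each worksheet part may carry an optional `<dimension ref="A1:C3"/>` element ahead of `<sheetData>`. Stream-parsing each worksheet and stopping at that element gives a name-plus-extent overview without walking the cells. The names `register_workbook` and `sheet_extent` are hypothetical, Python is used only to keep the sketch free of dependencies, and since some writers omit `<dimension>` the extent can come back empty.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespaces used by workbook, worksheet and relationship parts.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def sheet_extent(stream):
    """Return the <dimension ref="..."> value of a worksheet, or None.

    Iterates over start tags only and returns as soon as <dimension> or
    <sheetData> is seen, so the cell elements are never walked.
    """
    for _event, elem in ET.iterparse(stream, events=("start",)):
        if elem.tag == f"{{{MAIN_NS}}}dimension":
            return elem.get("ref")
        if elem.tag == f"{{{MAIN_NS}}}sheetData":
            return None  # reached the cells without finding a dimension
    return None


def register_workbook(path_or_file):
    """Hypothetical 'registration' step: list each sheet's name and extent."""
    with zipfile.ZipFile(path_or_file) as zf:
        workbook = ET.fromstring(zf.read("xl/workbook.xml"))
        rels = ET.fromstring(zf.read("xl/_rels/workbook.xml.rels"))

        # Map relationship ids (rId1, ...) to the worksheet part they point at.
        targets = {
            rel.get("Id"): rel.get("Target")
            for rel in rels.iter(f"{{{PKG_REL_NS}}}Relationship")
        }

        overview = []
        for sheet in workbook.iter(f"{{{MAIN_NS}}}sheet"):
            target = targets[sheet.get(f"{{{REL_NS}}}id")]
            # Targets are normally relative to xl/, but may be absolute.
            if target.startswith("/"):
                member = target.lstrip("/")
            else:
                member = "xl/" + target
            with zf.open(member) as stream:
                extent = sheet_extent(stream)
            overview.append({"name": sheet.get("name"), "extent": extent})
        return overview
```

Calling `register_workbook("some-file.xlsx")` (the path is a placeholder) would return one dict per worksheet, which is roughly the overview googlesheets builds at registration. Sheets with no `<dimension>` would still need a full pass to get their extent.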

Q4: what's the deal with worksheet objects?

Not totally sure at the moment 😀. In some ways it's mostly useful so that there's a reference-based concept like "parent" directory so different sheets can get to each other easily. It's also a sensible place to put shared data without duplicating it. Other than that I don't really know what your vagueness here is vauging at.

I can hang around after the rOpenSci call tomorrow for a bit if that's useful.
