New inverted grammar, starting with header cells #31

Open
nacnudus opened this issue Dec 10, 2019 · 3 comments

Comments

@nacnudus
Owner

nacnudus commented Dec 10, 2019

The current unpivotr grammar starts from the point of view of the data cells and searches for their associated headers. This imitated databaker, because that approach is useful in the most common case (in my experience), where:

  1. The header cells surround the data cells.
  2. There are more different headers than you care to hardcode into a script.

At long last, there is an example of a consistent schema that breaks (1) and doesn't suffer from (2).

Untidy data

image

Tidy version

image

Thoughts

  1. Locate each type of header by filtering, e.g. character == "Species:". Error if not unique (see step 4 for when whole tables repeat, as in the example).
  2. Describe the domain of the header over related data cells by its direction and limit, e.g. direction = "W" and limit = 1 or limit = Inf. Unlike the existing grammar, the direction is from the point of view of the header cell, rather than the data cells.
  3. Given a set of headers so described, unpivotr would resolve the data cells to the matching headers.
  4. If the whole table repeats, as in the example above, the same technique would apply as now -- identify a corner cell of each table, nest, and unpivot one at a time.
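
Purely as an illustration of steps 1 to 3, here is a minimal sketch in Python rather than R. Everything in it is an assumption made for the example: the toy sheet, the function names `locate`, `domain` and `resolve`, and the rule that a header's domain stops at the first empty cell. None of it is unpivotr's API, and the real design might differ.

```python
import math

# Hypothetical sheet: (row, col) -> value. Not the sheet from the images above.
cells = {
    (1, 1): "Species:", (1, 2): "setosa",
    (2, 1): "Lengths:", (2, 2): 5.1, (2, 3): 4.9, (2, 4): 4.7,
}

# Compass directions as (row step, col step), seen from the header cell.
STEPS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}


def locate(cells, label):
    """Step 1: find the one cell whose value equals `label`; error if not unique."""
    hits = [pos for pos, value in cells.items() if value == label]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one {label!r} cell, found {len(hits)}")
    return hits[0]


def domain(cells, origin, direction, limit=math.inf):
    """Step 2: walk away from the header in `direction`, collecting up to
    `limit` cells. Stopping at the first empty cell is an assumption."""
    d_row, d_col = STEPS[direction]
    row, col = origin
    found = []
    while len(found) < limit:
        row, col = row + d_row, col + d_col
        if (row, col) not in cells:
            break
        found.append((row, col))
    return found


def resolve(cells, headers):
    """Step 3: map each data cell to the header labels whose domain covers it."""
    resolved = {}
    for label, (direction, limit) in headers.items():
        for pos in domain(cells, locate(cells, label), direction, limit):
            resolved.setdefault(pos, []).append(label)
    return resolved


# Each header is described by its direction and limit, from its own point of view.
headers = {"Species:": ("E", 1), "Lengths:": ("E", math.inf)}
resolved = resolve(cells, headers)
```

With this toy sheet, `resolved` maps `(1, 2)` to `["Species:"]` and each of the three cells in row 2 to `["Lengths:"]`. Step 4 is not covered: repeated tables would still need to be partitioned first, and each partition passed through `resolve` separately.
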
@jl5000

jl5000 commented Dec 10, 2019

Do we know if there are any other datasets with this structure or if it's an evil one-off? I've never seen a structure like this before.

@nacnudus
Owner Author

That's a reasonable point, although it isn't how nerd-sniping works 😄

@danstrobridge-Weston

I often get this sort of semi-structured format when working with spreadsheets / text files generated by exporting pivoted tables from PDF. I'm eager to test the readr::melt functionality for dealing with it on my next project that can afford to pay me for some development time.
