Add some way to split a field #24

Open
Llammissar opened this issue Feb 1, 2017 · 2 comments

@Llammissar

Another feature request that came to mind as I was working. Consider the following single column of data:

file
5core_05thread
5core_06thread
5core_07thread
5core_08thread

I ended up doing it in post-process, but I think it'd be handy to have some way to split fields so that it comes out like this:

cores  threads
5      5
5      6
5      7
5      8
@jondegenhardt
Contributor

Nice use case. My first thought is to wonder if there is enough commonality in these patterns to develop a tool around. More examples would shed light on this. But if it turns out that the flexibility of awk or sed is needed, then it might be best to leave these tasks to those tools and custom scripts.

@Llammissar
Author

That's a good point, and I'm not unsympathetic to it at all. If I hit more examples, I'll try to remember to outline them here.

I'll note up front that I really don't like sed/awk for this sort of thing because they're general-purpose, line-oriented tools. It's fine if there's something like "cores" to anchor on for extracting numbers and splitting them (and I think you rightly surmise that I wasn't necessarily looking to extract the column name in the same operation), but for the more general case? They're clunky. Awareness of columns is extremely powerful and useful.

Just doodling here, but something like:
tsv-filter --split 1:_:cores,threads
...could be helpful. Or maybe something like regex substitution via capture groups:
tsv-filter --split 1:'([0-9]+)core_([0-9]+)thread':cores,threads
...if we continue looking at my original example. (The column selector is necessary for the more general case that you have multiple columns with the delimiter of interest -- colon, for example -- but you only want to split one of them and the other is something like a timestamp.)
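To make the capture-group idea a bit more concrete, here is a rough Python sketch of the behaviour I have in mind. It is only an illustration: `split_field` and its `convert` parameter are hypothetical names of my own, not anything tsv-utils provides, and the pattern is written to match the sample data (`5core_05thread`). The column selector is 1-based, as in the doodles above.

```python
import re


def split_field(rows, col, pattern, new_names, convert=str):
    """Split one column of a header+rows table using regex capture groups.

    rows      -- list of rows, each a list of strings; rows[0] is the header
    col       -- 1-based index of the column to split (hypothetical selector)
    pattern   -- regex whose capture groups become the new columns
    new_names -- header names for the new columns, one per capture group
    convert   -- applied to each captured string (e.g. to drop zero padding)
    """
    regex = re.compile(pattern)
    if regex.groups != len(new_names):
        raise ValueError("need exactly one column name per capture group")

    header, *body = rows
    idx = col - 1
    # Replace the original column name with the new names, in place.
    out = [header[:idx] + list(new_names) + header[idx + 1:]]
    for row in body:
        match = regex.fullmatch(row[idx])
        if match is None:
            raise ValueError(f"field {row[idx]!r} does not match {pattern!r}")
        parts = [convert(group) for group in match.groups()]
        # Other columns are left untouched, which is the point of the selector.
        out.append(row[:idx] + parts + row[idx + 1:])
    return out


table = [["file"], ["5core_05thread"], ["5core_06thread"]]
result = split_field(
    table,
    1,
    r"([0-9]+)core_([0-9]+)thread",
    ["cores", "threads"],
    convert=lambda s: str(int(s)),  # "05" -> "5"
)
for row in result:
    print("\t".join(row))
```

Failing loudly on a non-matching field is just one choice; passing such rows through unchanged would be equally defensible.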

Broadly, I think I'd characterise this class of problem as "normalisation", which also includes other transformations on columns. (For example, some existing tools produce measures in whole seconds, so I want to multiply that by 1000, or divide the millisecond metrics by the same, so they can be compared properly. ...This might be a separate ER?)
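
The seconds-to-milliseconds case could look something like the sketch below. Again, `seconds_to_millis` is a made-up name for illustration, and it assumes the column holds whole seconds as plain integers, which is what those tools emit in my case.

```python
def seconds_to_millis(rows, col):
    """Multiply a whole-seconds column by 1000 so it lines up with ms metrics.

    rows -- list of rows, each a list of strings; rows[0] is the header
    col  -- 1-based index of the column to convert (hypothetical selector)
    """
    header, *body = rows
    idx = col - 1
    out = [list(header)]  # header is kept as-is in this sketch
    for row in body:
        millis = int(row[idx]) * 1000  # integer arithmetic, no rounding issues
        out.append(row[:idx] + [str(millis)] + row[idx + 1:])
    return out


timings = [["run", "elapsed"], ["a", "3"], ["b", "12"]]
converted = seconds_to_millis(timings, 2)
for row in converted:
    print("\t".join(row))
```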
