Performant way of knowing in which column(s) a value (string or number) appears #1214

Open
athalhammer opened this issue Oct 4, 2023 · 2 comments

athalhammer commented Oct 4, 2023

This issue already has some track record, documented in #1209.

The idea is the following. Let's say I have a long, wide table with user feedback and other things scattered throughout, and I'd like to identify the columns that mention "service". The output of the command should then look as follows:

$ command xy "service"
7 567
102 89
3 256

To be read as: in column 567, the word "service" appears seven times; in column 89 it appears 102 times; and in column 256 it appears three times. Now I can investigate each of these columns further.
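
To make this concrete, here is a single-pass sketch of such custom code in plain Python with the standard csv module (the script is hypothetical and not part of csvkit; for simplicity it also scans the header row):

import csv
import sys
from collections import Counter

def count_value_per_column(path, value):
    """Count, per 1-indexed column, how many cells contain `value`."""
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.reader(f):
            for i, cell in enumerate(row, start=1):
                if value in cell:  # substring match, like csvgrep -m
                    counts[i] += 1
    return counts

if __name__ == "__main__":
    for column, count in count_value_per_column(sys.argv[1], sys.argv[2]).items():
        print(count, column)  # e.g. "7 567": 7 hits in column 567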

In #1209 I suggested adding some sort of column numbering directly to all fields, to be able to use other csvkit and shell commands to get this output. @jpmckinney suggested that this can be done with the following command, added it to the documentation (see c4e5612), and closed the PR.

csvgrep -m 222 -a -c 1- examples/realdata/FY09_EDU_Recipients_by_State.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 222
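
To unpack the pipeline: csvgrep selects every row in which any column matches, csvformat re-emits those rows terminated by the 0x1e record separator, xargs starts one shell per record to pipe it through csvcut -n, which prints each field prefixed with its column number, and the final grep keeps only the numbered fields that contain the search term.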

At first I was happy with this conclusion and the reuse of existing tools.

However, I quickly came to realize that this does not perform well. Here is an example:

Prep:

cp examples/realdata/FY09_EDU_Recipients_by_State.csv test.csv
for i in {1..200}; do tail -n+2 examples/realdata/FY09_EDU_Recipients_by_State.csv; done >> test.csv

This results in a little more than 10k lines.

Now I search for 679 (our "service" in this case) using the suggested method and time it:

time csvgrep -m 679 -a -c 1- test.csv | csvformat -M $'\x1e' | xargs -d $'\x1e' -n1 sh -c 'echo $1 | csvcut -n' sh | grep 679 | sort | uniq -c
    201   4: 679
    201   6: 679

real	1m4.553s
user	0m57.484s
sys	0m7.391s
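
As a back-of-envelope check (my own estimate, not measured separately): xargs launches a fresh sh plus csvcut for every matching record, so roughly 57 s of user time spread over a few hundred invocations comes to 0.15–0.3 s of Python and csvkit startup per record, which is essentially the entire runtime.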

Compare this to the suggested method in #1209:

# explain first what -N does:
csvcut -h | grep "\-N"
              [-C NOT_COLUMNS] [-x] [-N]
  -N, --column-numbers  Add column numbering everywhere, in header and cells.
  
# actual timing:
time csvcut -N ../csvkit/test.csv | sed -n "s/.*\([0-9]\+~679\).*/\1/gp" | sort | uniq -c
    201 4~679
    201 6~679

real	0m0.236s
user	0m0.237s
sys	0m0.020s
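
Judging from the help text, the sed pattern, and the output above, -N joins each column number to its cell value with a ~, so a hit in column 4 prints as 4~679; that single-pass annotation is what lets one grep/sed pipeline recover the column numbers instead of spawning one csvcut process per row.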

The two timings are roughly a minute apart, and this is not an extremely big file yet (it has only a few columns).

Questions to the wider audience:

  1. Is this a relevant problem for someone else?
  2. Does anyone have a good, performant solution using existing tools?
  3. If we were to add something to csvkit, which tool should/could it be?
@jpmckinney (Member) commented

Hmm, yes, the shell version is going to be slow, mainly because csvcut needs to start up for every match.

The need, however, seems a bit narrow, in which case custom code (like the code you wrote) is the right approach.

If there's sufficient demand and a clear place to implement this feature, it can be done.

@athalhammer (Author) commented

I'm not sure what the perfect way is to skim through a long, wide CSV file on the command line and visually make sense of what is displayed. The max-column-width feature of csvlook only works up to 10–20 columns. As soon as multiple line breaks appear due to wrapping, it is basically impossible to relate the displayed value of a cell to the respective column. The initial idea of #1209 was to address this in the first place.
