Add regex replace custom function #270

JamezQ · 2021-08-12T23:51:02Z

This feature allows q to quickly get a new column based on a regex match of another column. And even execute a GROUP BY using this new column.

Example: Find the disk usage by extension for the current directory:

du -sk *|grep "\."|q -b -t 'SELECT resub(c2,".*\.","") as ex,SUM(c1) FROM - GROUP BY ex ORDER BY SUM(c1)'

This is quite nice compared to other answers in: https://unix.stackexchange.com/questions/308846/how-to-find-total-filesize-grouped-by-extension (Slightly modified for only looking in the current dir)

find . -type f -depth 1 | egrep -o "\.[a-zA-Z0-9]+$" | sort -u|xargs -I '%' find . -type f -name "*%" -depth 1 -exec du -ch {} + -exec echo % \;|egrep "^\.[a-zA-Z0-9]+$|total$"|uniq|paste - -

find . -name '?*.*' -type f -depth 1 -print0 |                                                                                                                                                  
  perl -0ne '
    if (@s = stat$_){
      ($ext = $_) =~ s/.*\.//s;
      $s{$ext} += $s[12];
      $n{$ext}++;
    }
    END {
      for (sort{$s{$a} <=> $s{$b}} keys %s) {
        printf "%15d %4d %s\n",  $s{$_}<<9, $n{$_}, $_;
      }
    }'

Of course, it is possible to use q without this feature quite succinctly as well:

du -sk *|grep "\."|sed 's/\t.*\.$.*$$/ \1/g'|q -b 'SELECT c2,SUM(c1) FROM - GROUP BY c2 ORDER BY SUM(c1)'

However, as the capture groups become more complex, getting multiple captures from the same column, etc, I believe resub becomes much more useful.

As another example, I have already used this modified q to analyze a log file containing user agents, and creating a summary of the most used OS/DEVICE combination. (Simply group by both columns and count) This involved two extractions from the same field to acquire the OS and device. (Using capture groups feature of re.sub)

Finally, resub gets to utilize the existing row and column parsing q has, this can be complicated if using sed (commas in quoted fields).

JamezQ · 2021-08-12T23:51:23Z

This passes make test.

harelba · 2021-10-22T12:55:46Z

Hi,

this is a good addition, although I'm aiming to add a full set of regexp functions, including optimizing the pattern compilation. Using re.sub or any other re method directly compiles the pattern on every call to the function, which is extremely slow.

Regarding filename handling, I believe that it's important enough to have its own functions. regexp parsing of filenames can be error prone, and lead to inconsistencies, such as what happens if there is no extension for a filename.

Now that q 3.1.0-beta is merged into master, it's possible to create new PRs. I've created a PR (#282) for adding multiple requested user functions, and added several filename-parsing functions. Achieving the per-ext total sizes would then be something like that:

$ du -ks * | q -b -t "select file_ext(c2),sum(c1) from - group by 1 order by 2"
.bzl     	4
.ini     	4
.markdown	8
.py      	8
.sh      	12
.txt     	12
.egg-info	24
.md      	28
.alal    	29948
.qsql    	44068
         	288636

I'll add the set of regexp functions on that PR as well.

Added resub custom function

8b606b3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add regex replace custom function #270

Add regex replace custom function #270

JamezQ commented Aug 12, 2021 •

edited

JamezQ commented Aug 12, 2021

harelba commented Oct 22, 2021

Add regex replace custom function #270

Are you sure you want to change the base?

Add regex replace custom function #270

Conversation

JamezQ commented Aug 12, 2021 • edited

JamezQ commented Aug 12, 2021

harelba commented Oct 22, 2021

JamezQ commented Aug 12, 2021 •

edited