Count File Size (MB) #504

DarwinJS · 2020-07-19T12:31:02Z

Some commercial security scanning tools now charge by file volume in MB.
I am wondering if that measurement could be taken and reported.
I am thinking just raw file size - not any attempt to estimate actual code versus comments (unless it is super easy).

Maybe a further version could rewrite files without comments and get an estimate of true code MB - but I'd be happy with just a raw number for a "minimum viable feature" release :)

AlDanial · 2020-07-19T18:56:32Z

Sure, that measurement can be taken and recorded, but I don't think cloc needs to do that as a new feature. Instead, use cloc to collect the names of the source files, then run a trivial add-up-the-file-sizes script. For example:
Step 1: cloc --by-file --csv --out counts.csv directory
Step 2: count_bytes counts.csv
where count_bytes is something like

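#!/usr/bin/env perl
# Sum the on-disk sizes of the files listed in a "cloc --by-file --csv" report.
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    # The file name is the second CSV field; names containing commas are not handled.
    my $file = (split /,/)[1];
    next unless $file;               # rows without a file name
    next if $file eq "filename";     # header row
    if (!-e $file) {
        warn "can't find $file, skipping\n";
        next;
    }
    $bytes += -s $file;
}
print "$bytes total bytes\n";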

A drawback to this method is that it won't work on archive files (.tar, .zip, etc.); you'll need to expand these first.
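As one possible way around the archive limitation, the uncompressed size of an archive's members can be read from its index without unpacking it to disk. The following is a minimal Python sketch rather than anything cloc provides; the function name `archive_payload_bytes` is hypothetical, and it counts every regular file in the archive, not only the ones cloc would recognize as source code.

```python
import tarfile
import zipfile


def archive_payload_bytes(path):
    """Return the total uncompressed size, in bytes, of regular files in a zip or tar archive."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # file_size is the uncompressed size recorded for each member.
            return sum(info.file_size for info in zf.infolist() if not info.is_dir())
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:
            # Skip directories, links, and other non-regular members.
            return sum(member.size for member in tf if member.isfile())
    raise ValueError(f"{path} is not a zip or tar archive")
```

The result could be added to the total from the script above for any archives found in the tree, with the caveat that it will overcount if the archive holds non-source files.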

The solution can easily be adapted to count bytes in files after comments are removed:
Step 1: cloc --strip-comments No_Comments --original-dir --by-file --csv --out counts.csv directory
Step 2: count_bytes_no_comments counts.csv
where count_bytes_no_comments is

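#!/usr/bin/env perl
# Sum the sizes of the comment-stripped copies written by cloc --strip-comments.
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    # The file name is the second CSV field; names containing commas are not handled.
    my $file = (split /,/)[1];
    next unless $file;               # rows without a file name
    next if $file eq "filename";     # header row
    $file .= ".No_Comments";         # the stripped copy sits next to the original
    if (!-e $file) {
        warn "can't find $file, skipping\n";
        next;
    }
    $bytes += -s $file;
}
print "$bytes total bytes\n";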
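The two scripts differ only in the suffix appended to each file name, so they can be folded into one. Here is a rough Python equivalent, offered as a sketch: the function name `total_bytes` and its `suffix` parameter are made up for illustration, and the CSV layout (file name in the second column, a header row containing `filename`) is assumed from the scripts above. It uses the `csv` module so that quoted fields are parsed properly if they occur; the thread does not say whether cloc quotes file names that contain commas.

```python
import csv
import os
import sys


def total_bytes(csv_path, suffix=""):
    """Sum the on-disk sizes of the files listed in a by-file cloc CSV report.

    Pass suffix=".No_Comments" to measure the comment-stripped copies instead.
    """
    total = 0
    with open(csv_path, newline="") as fh:
        for row in csv.reader(fh):
            # Skip short rows, rows without a file name, and the header row.
            if len(row) < 2 or row[1] in ("", "filename"):
                continue
            path = row[1] + suffix
            if not os.path.isfile(path):
                print(f"can't find {path}, skipping", file=sys.stderr)
                continue
            total += os.path.getsize(path)
    return total
```

A call such as `total_bytes("counts.csv") / 1_000_000` would give the figure in MB, assuming decimal megabytes; a vendor that bills in units of 2**20 bytes would need a different divisor.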

DarwinJS · 2020-07-19T19:40:49Z

There are several benefits to having it integrated:

It puts the MB figure in the same report or data output format, so the data can be consumed in the same ways. With completely separate reports, a lot of folks would want to merge them so they can estimate both code metrics in the same way: per repository, per language.
It handles all files the same way your base code does (so archive files are handled automatically, as you already do).
It allows your code for aggregating reports to be used for MBs as well.
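Until something like this is built in, the per-language merge described in the first point can be approximated outside cloc. The sketch below is Python and purely illustrative: `per_language_report` is a hypothetical helper, and it assumes the by-file CSV has a header row with `language`, `filename`, and `code` columns.

```python
import csv
import os
from collections import defaultdict


def per_language_report(csv_path):
    """Return {language: {"files": n, "code": lines, "bytes": size}} from a by-file cloc CSV."""
    report = defaultdict(lambda: {"files": 0, "code": 0, "bytes": 0})
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            path = row.get("filename")
            # Rows without a file name (such as a summary row) and missing files are skipped.
            if not path or not os.path.isfile(path):
                continue
            entry = report[row["language"]]
            entry["files"] += 1
            entry["code"] += int(row["code"])
            entry["bytes"] += os.path.getsize(path)
    return dict(report)
```

This keeps line counts and byte counts in one structure, but it does not cover the other two points: archives still have to be expanded first, and cloc's own report aggregation knows nothing about the byte column.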

I was also thinking of an implementation detail that might make this super-efficient. If you are already creating storage (like a variable) that holds the code with comments stripped, maybe its size could be taken at that point and an overhead value added to produce a "file size" estimate. Maybe PerFileOverheadBytes could be a built-in default that users can override with a parameter, so they could tune it to their liking.
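The arithmetic behind that idea is simple enough to sketch. The Python below is only an illustration of the proposal, not cloc code: `estimated_size` and the default value are hypothetical, the thread does not suggest a number for the overhead, and UTF-8 is assumed when converting the stripped text to a byte count.

```python
# Hypothetical default; the proposal leaves the actual value open.
DEFAULT_PER_FILE_OVERHEAD_BYTES = 0


def estimated_size(stripped_sources, per_file_overhead_bytes=DEFAULT_PER_FILE_OVERHEAD_BYTES):
    """Estimate total file size from in-memory, comment-stripped source text.

    Each file contributes its encoded length plus a fixed, tunable overhead.
    """
    total = 0
    for text in stripped_sources:
        total += len(text.encode("utf-8")) + per_file_overhead_bytes
    return total
```

With an overhead of zero this is just the size of the stripped code; a user matching a vendor's billing figures would tune the overhead until the estimate lines up.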

I was also thinking of building a CI plugin around this similar to these: https://gitlab.com/guided-explorations/ci-cd-plugin-extensions.

AlDanial closed this as completed Jul 19, 2020

AlDanial reopened this Jul 19, 2020

AlDanial mentioned this issue Sep 1, 2020: Print also size in bytes #517 (Closed)
