[feature request] Split by file size #1500

Open
BeanBagKing opened this issue Feb 19, 2024 · 0 comments

BeanBagKing commented Feb 19, 2024

Apologies if this already exists; I looked in https://miller.readthedocs.io/en/latest/reference-verbs/#split and searched the issues for similar suggestions.

I'm working with large CSV files that often need to be split into chunks no larger than a given size (e.g. a maximum file size of 1024 MB) for file transfer. The amount of data in each column varies wildly, so taking a 10 GB file and splitting it into 10 chunks doesn't usually work: one file may end up significantly over 1 GB, while others may be significantly under. The same goes for splitting by number of lines. I could split it into more chunks until the largest is below 1 GB, but that requires trial and error at best, and I'm trying to optimize this process.

The following one-liner does what I want, more or less, but it's not the fastest process. Replace both instances of InFile.csv with your file.

tail -n +2 InFile.csv | split -C 1000MB -d - --filter='sh -c "{ head -n1 InFile.csv ; cat; } > $FILE.csv"'

Edit: I should note that, in my case at least, the order of the lines does not have to remain the same.
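
For comparison, here is a minimal single-pass awk sketch of the same idea: repeat the header in every chunk and start a new file whenever the next line would push the current one over the limit. It is not a Miller feature and has not been benchmarked against the one-liner above. The function name split_csv_by_size, its three arguments, and the prefixNNN.csv output naming are all made up for illustration. Like the one-liner, it assumes no quoted field contains an embedded newline, and a single line longer than the limit still ends up in an oversized file of its own.

```sh
# Hypothetical helper, not part of Miller or coreutils.
# Usage: split_csv_by_size INFILE MAX_BYTES OUT_PREFIX
split_csv_by_size() {
  infile=$1
  max_bytes=$2
  prefix=$3
  # LC_ALL=C makes length() count bytes rather than characters.
  LC_ALL=C awk -v max="$max_bytes" -v prefix="$prefix" '
    # First line is the header: remember it and its size (+1 for the newline).
    NR == 1 { header = $0; hlen = length($0) + 1; next }
    {
      len = length($0) + 1
      # Open a new chunk for the first record, or when this record
      # would push the current chunk past the limit.
      if (out == "" || size + len > max) {
        if (out != "") close(out)
        out = sprintf("%s%03d.csv", prefix, n++)
        print header > out
        size = hlen
      }
      print > out
      size += len
    }
  ' "$infile"
}
```

For example, split_csv_by_size InFile.csv 1000000000 chunk_ would write chunk_000.csv, chunk_001.csv, and so on, each starting with the header line. Line order is preserved here, although that is not required in my case.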

johnkerl changed the title from "Feature Request - Split by file size" to "[feature request] Split by file size" on Feb 22, 2024