Efficiency with larger data sets #82

Open
tpmccallum opened this issue Aug 29, 2015 · 1 comment
@tpmccallum

Hi,
I have a question about ingesting text files in stages (as opposed to running the make file in one sitting).
When running the make file with a very large number of records I get the following message, and I can't help but think that there may be a more efficient way of ingesting the items.
```
parallel: Warning: No more processes: Decreasing number of running jobs to 1. Raising ulimit -u or /etc/security/limits.conf may help.
```

Just to clarify: as far as I know there are no issues with the files or the catalog (the encoding is clean, UTF-8 only, etc.). I run smaller sets from time to time for testing and they work fine. This efficiency issue only presents itself when ingesting more than, say, 10 million records.

Please also see the following `ulimit -a` output:

```
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 31559
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 9000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 31559
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
```
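
In case it helps anyone hitting the same wall, here is a minimal sketch of the two remedies the warning itself suggests. The value 65536 is only an illustrative number, and `tim` stands in for whichever user runs the ingest; neither comes from this project.

```
# Temporary: raise the per-user process limit for the current shell session
# (only works up to the hard limit set by the administrator).
ulimit -u 65536

# Persistent: add soft/hard nproc entries to /etc/security/limits.conf
# (takes effect at the user's next login).
#   tim  soft  nproc  65536
#   tim  hard  nproc  65536
```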

Thanks so much,
Tim

@tpmccallum (Author)

Toying with an idea for line 81:

```
parallel -a files/metadata/jsoncatalog.txt --block 100M --pipepart python bookworm/MetaParser.py > $@
```

instead of

```
cat files/metadata/jsoncatalog.txt | parallel --pipe python bookworm/MetaParser.py > $@
```
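
The appeal of `--pipepart` is that GNU parallel reads blocks straight out of the file on disk instead of funnelling everything through a single pipe from `cat`, which tends to be much faster on large inputs. If the process ceiling itself is the problem, the job count can also be capped explicitly; a hedged variant of the same recipe line, where `-j 8` is just an example value and not something from this repository:

```
# Cap GNU parallel at 8 concurrent jobs while splitting the file into 100M blocks.
parallel -j 8 --pipepart -a files/metadata/jsoncatalog.txt --block 100M python bookworm/MetaParser.py > $@
```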

Will report back soon :)
