CNES web scraper

You have to use Node.js version 0.10.x.

Download the dependencies:

npm install

Install MongoDB:

sudo apt-get install mongodb

How to debug:

node-inspector
node --debug-brk crawler.js

Then, go to http://localhost:8080/debug?port=5858

How to run:

node --max-old-space-size=8192 --expose-gc crawler.js
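
The --expose-gc flag makes V8's garbage collector callable from JavaScript as global.gc(), so a long-running crawler can trigger collections between batches. A minimal sketch of how such a call is usually guarded (the batch threshold is illustrative and not taken from this repository):

// global.gc is only defined when node is started with --expose-gc
function forceGc() {
  if (typeof global.gc === 'function') {
    global.gc(); // trigger a full collection between download batches
  }
}

// e.g. call forceGc() after every 100 downloaded registers (100 is illustrative)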

Limitations:

The script's memory consumption grows over its execution time. During the first minute of execution, about 150 registers are downloaded; however, this rate drops as memory consumption increases.

A real fix for this problem would be to study how the V8 garbage collector works and to release closure variables so the script consumes less memory.
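
As an illustration of that idea (not code from this project, using hypothetical extractRegister and saveRecord helpers), clearing closure-captured references as soon as they are no longer needed lets V8 reclaim the large response buffers sooner:

function onPageDownloaded(err, response, body) {
  var record = extractRegister(body); // keep only the parsed register
  body = null;     // drop the reference to the large raw HTML
  response = null; // so this closure no longer keeps it alive
  saveRecord(record);
}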

As a workaround, you can kill and restart the script on a fixed cycle using cron. To do that, follow these instructions:

Create a file at /etc/cron.d/crawler (without the '.sh' extension) with the following content:

#!/bin/sh
pkill node
cd "<PATH_OF_SOURCE>/node_scrap/"
<YOUR_NODE_PATH>/node --max-old-space-size=8192 index.js > /tmp/crawler.log &

Run:

crontab -e

Add this as the last line of the file:

*/2 * * * * /bin/sh /etc/cron.d/crawler

This will automatically run the /etc/cron.d/crawler script every 2 minutes, killing and re-executing the crawler.

Running the crawler

First Step

First of all, you need to change the "initialize" function in index.js, whose content looks like this:

downloadModule.processStates();
//downloadModule.processEntities();

The first step is to execute the function processStates(). This function downloads all the entity URLs, which keeps the process synchronous and keeps track of which registers have already been downloaded.
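
A rough, hypothetical sketch of that bookkeeping (the field names are assumptions; the entitytodownloads collection name comes from the check below):

// Hypothetical: persist each discovered entity URL with a flag
// so the next stage knows which registers are still pending.
function saveEntityUrl(db, uf, url, done) {
  db.collection('entitytodownloads').insert(
    { uf: uf, url: url, downloaded: false },
    done
  );
}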

More than 300,000 URLs will be downloaded. You can check it in the database:

 mongo
 use cnes2015
 show collections
 db.entitytodownloads.count();

Second step

You must back up the collection entityurls to entityurls_bak using the following command:

db.entityurls.copyTo('entityurls_bak')
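
Note that copyTo() was removed in recent MongoDB versions (4.2 and later). On such a server the same backup can be made with a cursor (legacy mongo shell syntax; use insertOne in mongosh):

db.entityurls.find().forEach(function (doc) {
  db.entityurls_bak.insert(doc);
});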

Then, change the initialize function so that it calls the function that downloads the entity details:

//downloadModule.processStates();
downloadModule.processEntities();

You can now check your log with tail: tail -f /tmp/crawler.log

How to export

To a single CSV:

cd output
mongoexport --db cnes --collection entities --csv --fieldFile entities_fields.txt --out entities.csv
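
The --fieldFile argument points to a plain text file with one field name per line. A hypothetical entities_fields.txt would look like the lines below (these names are only illustrative; use the field list that ships with the project):

cnes
name
uf
municipality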

To separate CSV files named by UF (like SP, RJ, MG, etc.):

rm -rf output/*.csv && node crawler/exporter.js
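
For reference, a minimal sketch of what such a per-UF exporter could look like, assuming an entities collection with uf, cnes and name fields (the field names, connection URL, and output columns are assumptions, not the project's actual exporter):

var MongoClient = require('mongodb').MongoClient;
var fs = require('fs');

MongoClient.connect('mongodb://localhost/cnes', function (err, db) {
  if (err) throw err;
  // Write one CSV file per state, e.g. output/SP.csv, output/RJ.csv ...
  db.collection('entities').distinct('uf', function (err, ufs) {
    if (err) throw err;
    var pending = ufs.length;
    ufs.forEach(function (uf) {
      db.collection('entities').find({ uf: uf }).toArray(function (err, docs) {
        if (err) throw err;
        var lines = docs.map(function (doc) {
          return [doc.cnes, doc.name, doc.uf].join(','); // illustrative columns
        });
        fs.writeFileSync('output/' + uf + '.csv', lines.join('\n'));
        if (--pending === 0) db.close();
      });
    });
  });
});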

To a database dump:

To generate the dump:

mongodump -d cnes -o output

To restore:

mongorestore cnes
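
Assuming the dump was generated with the command above, run mongorestore from inside the output directory so it finds the cnes dump folder; depending on your MongoDB tools version you may need to name the target database explicitly, for example: mongorestore -d cnes output/cnes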