CNES web scraper

You have to use Node.js version 0.10.x.

Download the dependencies:

npm install

Install MongoDB:

sudo apt-get install mongodb

How to debug:

node-inspector
node --debug-brk crawler.js

Then, go to http://localhost:8080/debug?port=5858

How to run:

node --max-old-space-size=8192 --expose-gc crawler.js
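
The --expose-gc flag makes V8's garbage collector callable from JavaScript as global.gc(), so a long-running crawler can trigger collections between batches. A minimal sketch of how such a call is usually guarded (the batch threshold is illustrative and not taken from this repository):

// global.gc is only defined when node is started with --expose-gc
function forceGc() {
  if (typeof global.gc === 'function') {
    global.gc(); // trigger a full collection between download batches
  }
}

// e.g. call forceGc() after every 100 downloaded registers (100 is illustrative)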

Limitations:

The script's memory consumption grows over its execution time. During the first minute of execution, about 150 registers are downloaded; however, this rate drops as memory consumption increases.

A real fix for this problem would be to study how the V8 garbage collector works and to release closure variables so the script consumes less memory.
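
As an illustration of that idea (not code from this project, using hypothetical extractRegister and saveRecord helpers), clearing closure-captured references as soon as they are no longer needed lets V8 reclaim the large response buffers sooner:

function onPageDownloaded(err, response, body) {
  var record = extractRegister(body); // keep only the parsed register
  body = null;     // drop the reference to the large raw HTML
  response = null; // so this closure no longer keeps it alive
  saveRecord(record);
}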

As a workaround, you can kill and restart the script on a fixed cycle using cron. To do that, follow these instructions:

Create a file at /etc/cron.d/crawler (without the '.sh' extension) with the following content:

#!/bin/sh
pkill node
cd "<PATH_OF_SOURCE>/node_scrap/"
<YOUR_NODE_PATH>/node --max-old-space-size=8192 index.js > /tmp/crawler.log &

Run:

crontab -e

Add this as the last line of the file:

*/2 * * * * /bin/sh /etc/cron.d/crawler

This will automatically run the /etc/cron.d/crawler script every 2 minutes, killing and re-executing the crawler.

Running the crawler

First Step

First of all, you need to change the "initialize" function in index.js, whose content looks like this:

downloadModule.processStates();
//downloadModule.processEntities();

The first step is to execute the function processStates(). This function downloads all the entity URLs, which keeps the process synchronous and keeps track of which registers have already been downloaded.
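
A rough, hypothetical sketch of that bookkeeping (the field names are assumptions; the entitytodownloads collection name comes from the check below):

// Hypothetical: persist each discovered entity URL with a flag
// so the next stage knows which registers are still pending.
function saveEntityUrl(db, uf, url, done) {
  db.collection('entitytodownloads').insert(
    { uf: uf, url: url, downloaded: false },
    done
  );
}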

More than 300,000 URLs will be downloaded. You can check it in the database:

 mongo
 use cnes2015
 show collections
 db.entitytodownloads.count();

Second step

You must back up the collection entityurls to entityurls_bak using the following command:

db.entityurls.copyTo('entityurls_bak')
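
Note that copyTo() was removed in recent MongoDB versions (4.2 and later). On such a server the same backup can be made with a cursor (legacy mongo shell syntax; use insertOne in mongosh):

db.entityurls.find().forEach(function (doc) {
  db.entityurls_bak.insert(doc);
});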

Then, change the initialize function so that it calls the function that downloads the entity details:

//downloadModule.processStates();
downloadModule.processEntities();

You can now check your log with tail: tail -f /tmp/crawler.log

How to export

To a single CSV:

cd output
mongoexport --db cnes --collection entities --csv --fieldFile entities_fields.txt --out entities.csv
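
The --fieldFile argument points to a plain text file with one field name per line. A hypothetical entities_fields.txt would look like the lines below (these names are only illustrative; use the field list that ships with the project):

cnes
name
uf
municipality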

To separate CSV files named by UF (like SP, RJ, MG, etc.):

rm -rf output/*.csv && node crawler/exporter.js
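
For reference, a minimal sketch of what such a per-UF exporter could look like, assuming an entities collection with uf, cnes and name fields (the field names, connection URL, and output columns are assumptions, not the project's actual exporter):

var MongoClient = require('mongodb').MongoClient;
var fs = require('fs');

MongoClient.connect('mongodb://localhost/cnes', function (err, db) {
  if (err) throw err;
  // Write one CSV file per state, e.g. output/SP.csv, output/RJ.csv ...
  db.collection('entities').distinct('uf', function (err, ufs) {
    if (err) throw err;
    var pending = ufs.length;
    ufs.forEach(function (uf) {
      db.collection('entities').find({ uf: uf }).toArray(function (err, docs) {
        if (err) throw err;
        var lines = docs.map(function (doc) {
          return [doc.cnes, doc.name, doc.uf].join(','); // illustrative columns
        });
        fs.writeFileSync('output/' + uf + '.csv', lines.join('\n'));
        if (--pending === 0) db.close();
      });
    });
  });
});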

To a database dump:

To generate the dump:

mongodump -d cnes -o output

To restore:

mongorestore cnes
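
Assuming the dump was generated with the command above, run mongorestore from inside the output directory so it finds the cnes dump folder; depending on your MongoDB tools version you may need to name the target database explicitly, for example: mongorestore -d cnes output/cnes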