GitHub

#Web Scraping

###What is it?

Computer software technique of automatically collecting and extracting information from websites.
Any content that can be viewed on a webpage can be scraped. Period.
Before web based API’s became prominent we had web scraping.

###Think About It

Site owners typically care more about maintaining their public-facing visitor site than about their structured data feeds.
Most businesses don’t build a lot of defenses against automated access.
With APIs, you often have to register to get a key and then send that key with every request. With simple HTTP requests, you’re basically anonymous besides IP address and cookies, which can be spoofed.
Web scraping is universally available. You don’t have to wait for a site to open up an API or contact anyone at the organization. Just spend time browsing the site until you find the data you need and figure out some basic access patterns.

#Scraping with Node ##(Express,Request,Cheerio)

Cheerio - http://cheeriojs.github.io/cheerio/

Implementation of core jQuery specifically for the server
Helps us traverse the DOM and extract data

##Require Dependencies

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

##Make GET request

app.get('/', function (req, res) {
// request to url and load html
});

##Start Server and Export Module

app.listen('8081');
exports = module.exports = app;

##Request to a URL and load HTML

request(url, function (err, res, html) {
    if (!err && res.statusCode == 200) {
      // pass DOM to cheerio
      var $ = cheerio.load(html);
	console.log(html);
  });

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
client/public		client/public
server		server
.gitignore		.gitignore
package.json		package.json
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

client/public

client/public

server

server

.gitignore

.gitignore

package.json

package.json

readme.md

readme.md

Repository files navigation

About

Releases

Packages

Languages

robhaj/scrapeNode

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages