robhaj/scrapeNode

# Web Scraping

### What is it?

  • A software technique for automatically collecting and extracting information from websites.

  • Any content that can be viewed on a webpage can be scraped. Period.

  • Before web-based APIs became prominent, we had web scraping.

### Think About It

  • Site owners typically care more about maintaining their public-facing visitor site than about their structured data feeds.

  • Most businesses don’t build a lot of defenses against automated access.

  • With APIs, you often have to register for a key and send it with every request. With plain HTTP requests, you are essentially anonymous apart from your IP address and cookies, both of which can be spoofed.

  • Web scraping is universally available. You don’t have to wait for a site to open up an API or contact anyone at the organization. Just spend time browsing the site until you find the data you need and figure out some basic access patterns.

# Scraping with Node (Express, Request, Cheerio)

Cheerio - http://cheeriojs.github.io/cheerio/

  • Implementation of core jQuery specifically for the server
  • Helps us traverse the DOM and extract data

## Require Dependencies

var express = require('express');
var fs = require('fs');
var request = require('request');
var cheerio = require('cheerio');
var app     = express();

## Make GET request

app.get('/', function (req, res) {
// request to url and load html
});

## Start Server and Export Module

app.listen(8081);
exports = module.exports = app;

## Request to a URL and load HTML

// url is the address of the page to scrape
request(url, function (err, response, html) {
    if (!err && response.statusCode === 200) {
        // pass the HTML to cheerio to build a traversable DOM
        var $ = cheerio.load(html);
        console.log(html);
    }
});
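
The `request` package has since been deprecated by its maintainers, so the same step can also be sketched with the `fetch` function built into Node 18 and later. This is an alternative under that assumption, not the approach this repository uses; the `fetchHtml` name is hypothetical.

```javascript
// Requires Node 18+ (global fetch). No third-party packages needed.

// Hypothetical helper: download a page and resolve with its HTML,
// mirroring the statusCode == 200 check used with `request`.
async function fetchHtml(url) {
  const response = await fetch(url);
  if (response.status !== 200) {
    throw new Error('Request failed with status ' + response.status);
  }
  return response.text(); // resolves to the response body as a string
}
```

The resolved string can be passed to `cheerio.load` exactly as in the callback version, for example `fetchHtml(url).then(function (html) { var $ = cheerio.load(html); })`.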
