Scrapinghub ruby example

Scrapinghub Platform is the most advanced platform for deploying and running web crawlers.

Requirements

  1. Docker is a tool designed to make it easier to create, deploy, and run applications by using containers.
  2. shub is the Scrapinghub command line client. It allows you to deploy projects or dependencies, schedule spiders, and retrieve scraped data or logs without leaving the command line.
  3. Scrapinghub account

Step one: build your spider

NOTE: make sure you comply with the website's rules (robots.txt and terms of service) before scraping it.

Let's imagine that we want to get a list of articles from the Codica website. I'll use typhoeus for HTTP requests and nokogiri as an HTML parser.
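
Both gems can be declared in a Gemfile so that the bundle install step in the Dockerfile below can pick them up. A minimal Gemfile might look like this:

Gemfile

source 'https://rubygems.org'

gem 'typhoeus'
gem 'nokogiri'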

app/crawler.rb

# require libraries
require 'typhoeus'
require 'nokogiri'
require 'json'

# determine where to write the result
begin
  outfile = File.open(ENV.fetch('SHUB_FIFO_PATH'), mode: 'w')
rescue IndexError
  outfile = STDOUT
end

# parse response
response = Typhoeus.get('https://www.codica.com/blog/').response_body
doc      = Nokogiri::HTML(response)

# select and save all titles
doc.css('.post-title').each do |title|
  result = JSON.generate(title: title.text.split.join(' '))
  outfile.write result
  outfile.write "\n"
end

Notes:

...

begin
  outfile = File.open(ENV.fetch('SHUB_FIFO_PATH'), mode: 'w')
rescue IndexError
  outfile = STDOUT
end

...

Here we decide where to write the results. On the platform, Scrapinghub provides the SHUB_FIFO_PATH environment variable, which points to the pipe it reads scraped items from. Locally, you can set this variable to a filename to write the results to disk, or leave it unset to print them to STDOUT.

$> ruby app/crawler.rb

#=>
{"title":"How To Start Your Own Online Marketplace"}
{"title":"MVP and Prototype: What’s Best to Validate Your Business Idea?"}
{"title":"5 Key Principles for a User-Friendly Website"}
{"title":"Building a Slack Bot for Internal Time Tracking"}
{"title":"4 Main JavaScript Development Trends in 2019"}

...

Step two: create the files required by Scrapinghub

The Docker image must be runnable via the start-crawl command without arguments, and start-crawl must be executable. In our project, start-crawl is app/crawler.rb. The second required file is shub-image-info.rb. Let's create it.

app/shub-image-info.rb

require 'json'

puts JSON.generate(project_type: 'other', spiders: ['c-spider'])
exit

Just change the c-spider name to your own.
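
Running it locally prints the metadata Scrapinghub expects:

$> ruby app/shub-image-info.rb

#=>
{"project_type":"other","spiders":["c-spider"]}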

Step three: make required files executable

Add the #!/usr/bin/env ruby shebang as the first line of both app/shub-image-info.rb and app/crawler.rb, so the system knows to run them with Ruby (the chmod +x in the Dockerfile below makes them executable).
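
For example, app/crawler.rb now begins with:

#!/usr/bin/env ruby

# require libraries
require 'typhoeus'

and app/shub-image-info.rb gets the same first line before its require 'json'.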

Step four: create Dockerfile

FROM ruby:2.5.1-stretch
ENV LANG=C.UTF-8

RUN apt-get update

COPY . /app

WORKDIR /app

RUN bundle install
RUN ln -sfT /app/shub-image-info.rb /usr/sbin/shub-image-info && \
    ln -sfT /app/crawler.rb /usr/sbin/start-crawl

RUN chmod +x /app/shub-image-info.rb /app/crawler.rb

CMD /bin/bash

It's a basic Dockerfile: we copy the project, install its dependencies with bundle install, and symlink our scripts to the shub-image-info and start-crawl names that Scrapinghub looks for when it starts a crawl.

Step five: deploy and start your spider

After you have installed shub and logged in, create a project on Scrapinghub.
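
shub is distributed as a Python package, so a typical setup looks like this:

$> pip install shub
$> shub login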

Copy the project ID and create a scrapinghub.yml file. You can read more about scrapinghub.yml in the Scrapinghub documentation.

app/scrapinghub.yml

projects:
  c-spider:
    id: YOUR_PROJECT_ID
    image: images.scrapinghub.com/project/YOUR_PROJECT_ID

version: spider-1
apikey: YOUR_API_KEY

And upload your spider.

$> shub image upload c-spider

After the spider is deployed, go to the Scrapinghub dashboard and run it; the job's results will appear there.

Now you can access your scraped data with the Items API.
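
Below is a minimal sketch of reading those items back with Ruby's standard Net::HTTP. The storage endpoint format (https://storage.scrapinghub.com/items/<project_id>/<spider_id>/<job_id>), the placeholder IDs, and the SH_APIKEY variable are assumptions to adapt to your own project; the API key is sent as the basic-auth username.

# a minimal sketch: read items back from the Items API
# replace 123456/1/2 with your own project, spider and job IDs
require 'net/http'
require 'json'

uri = URI('https://storage.scrapinghub.com/items/123456/1/2')

request = Net::HTTP::Get.new(uri)
# the API key goes in as the basic-auth username, the password stays empty
request.basic_auth(ENV.fetch('SH_APIKEY'), '')

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

# assuming the default JSON Lines response: one item object per line
response.body.each_line do |line|
  puts JSON.parse(line)['title']
end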

License

Copyright © 2015-2019 Codica. It is released under the MIT License.

About Codica


We love open source software! See our other projects or hire us to design, develop, and grow your product.