Parsing HTML with Nokogiri

How can we use Ruby to interact with, filter, or traverse HTML? Let's play around with Nokogiri.

gem install nokogiri

Playground #1 in Pry:

require 'nokogiri'
require 'open-uri'

html = URI.open('https://www.turing.io')
doc  = Nokogiri::HTML(html)

What is doc? What does it represent? What information does it include?

Playground #2 in Pry:

require 'nokogiri'
require 'open-uri'

html = URI.open('https://www.turing.io')
doc  = Nokogiri::HTML(html)

images = doc.css('img')

What does the images variable represent? How many images are there? What information are you given by Nokogiri about these images? Can you write a loop that gathers the src of each image? Use the example below for reference:

doc.css('a').map do |a|
  a['href']
end

Playground #3 in Pry:

require 'nokogiri'
require 'open-uri'

html = URI.open('https://www.turing.io')
doc  = Nokogiri::HTML(html)

div   = doc.at_css('div')
divs  = doc.css('div')

What is the difference between .at_css and .css?

Playground #4 in Pry:

require 'nokogiri'
require 'open-uri'

html = URI.open('https://www.turing.io')
doc  = Nokogiri::HTML(html)

var1 = doc.css('.field-type-text-with-summary')
var2 = doc.css('.field-type-text-with-summary p')
var3 = doc.css('.field-type-text-with-summary p').text

What is the difference between var1, var2, and var3? What do '.field-type-text-with-summary' and '.field-type-text-with-summary p' refer to?

Optional

What else can you do with Nokogiri?

Check out the Nokogiri chapter of The Bastards Book of Ruby.