Parsing HTML with Nokogiri

How can we use Ruby to interact with, filter, or traverse HTML? Let's play around with Nokogiri.

gem install nokogiri

Playground #1 in Pry:

require 'nokogiri'
require 'open-uri'

html = URI.open('https://www.turing.io')
doc  = Nokogiri::HTML(html)

What is doc? What does it represent? What information does it include?

Playground #2 in Pry:

require 'nokogiri'
require 'open-uri'

html = URI.open('https://www.turing.io')
doc  = Nokogiri::HTML(html)

images = doc.css('img')

What does the images variable represent? How many images are there? What information are you given by Nokogiri about these images? Can you write a loop that gathers the src of each image? Use the example below for reference:

doc.css('a').map do |a|
  a['href']
end

Playground #3 in Pry:

require 'nokogiri'
require 'open-uri'

html = URI.open('https://www.turing.io')
doc  = Nokogiri::HTML(html)

div   = doc.at_css('div')
divs  = doc.css('div')

What is the difference between .at_css and .css?

Playground #4 in Pry:

require 'nokogiri'
require 'open-uri'

html = URI.open('https://www.turing.io')
doc  = Nokogiri::HTML(html)

var1 = doc.css('.field-type-text-with-summary')
var2 = doc.css('.field-type-text-with-summary p')
var3 = doc.css('.field-type-text-with-summary p').text

What is the difference between var1, var2, and var3? What do '.field-type-text-with-summary' and '.field-type-text-with-summary p' refer to?

Optional

What else can you do with Nokogiri?

Check out the Nokogiri chapter of The Bastards Book of Ruby.
