Web Scraping Manuals

About the List

This is a list of articles and books that teach web scraping.

Base Things

Knowing the base things is more important than knowing particular tools or implementations.

It is important to know what HTTP, TCP, TLS, DNS, HTML, XML, XPath, CSS, the DOM, and network request proxying are.

It is LESS important to know how to build a crawler with SuperScrapingFramework or which function of PowerfulHTMLParsingLibrary extracts text from a selected element of the HTML DOM tree. These things are very specific. You do not have to know how to operate every scraping framework or HTML parsing package in the world. If you know the base things, it takes only a short time to learn how to work with them in a particular programming package.
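
As a small illustration of this point, here is a sketch that extracts link targets from the same markup with two different Python standard-library tools. The sample markup, the function names, and the `LinkCollector` class are made up for this example, and the sample is deliberately well-formed so that the XML parser accepts it; real pages usually are not. The APIs differ, but both rest on the same ideas: elements, attributes, and a tree.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sample page; well-formed on purpose so ElementTree can parse it.
PAGE = (
    "<html><body>"
    "<a href='/a'>First</a><p>text</p><a href='/b'>Second</a>"
    "</body></html>"
)


def links_with_elementtree(markup):
    """Tree-based approach: build the tree, then select nodes with XPath.

    ElementTree supports only a subset of XPath; './/a[@href]' is within it.
    """
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(".//a[@href]")]


class LinkCollector(HTMLParser):
    """Event-based approach: react to each start tag as the parser sees it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def links_with_htmlparser(markup):
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    print(links_with_elementtree(PAGE))
    print(links_with_htmlparser(PAGE))
```

Both functions return the same list of `href` values. Once you know what an element, an attribute, and an XPath step are, moving between the two APIs is mostly a matter of reading their documentation.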

Information Availability

The list must provide information that is instantly accessible. The list does not accept books whose content is not available online.

Information Granularity

If a book covers a number of topics, it makes sense to refer to a particular topic of the book in the corresponding section of the Learning Web Scraping list.

How to Contribute

You may submit a new issue with an article or book you want to add. I will read the article, or take a look at the animals on the cover picture of the book, and decide whether it is worth including in the list.

Web Scraping Articles and Topics

HTML

WHATWG / HTML

HTTP

High Performance Browser Networking / HTTP/1.X
High Performance Browser Networking / HTTP/2
HTTP Working Group HTTP Specs

DNS

Nothing here yet.

TCP

High Performance Browser Networking / Building Blocks of TCP

TLS

High Performance Browser Networking / Transport Layer Security (TLS)

WebSocket

High Performance Browser Networking / WebSocket
WHATWG / WebSockets

Concurrency

The Little Book of Semaphores

Text Encoding

WHATWG / Encoding

URL

WHATWG / URL

XMLHttpRequest

WHATWG / XMLHttpRequest
High Performance Browser Networking / XMLHttpRequest

Security

OWASP Web Security Testing Guide

IP Address

Understanding IP Addressing

Data Structures

Probabilistic Data Structures for Web Analytics and Data Mining
