#||
# Recursively extracting nested links from a webpage using Racket
This is my first attempt at writing Racket or Lisp code. It might be ugly ...
For now, let's build a web crawler; next, I shall write myself a language.
That's where Racket really shines.
The code is [here](https://github.com/lucabol/website-links). Thanks to [Mike](https://github.com/mjrousos) for reviewing it.
## Why
I want to convert a website, including the pages reached recursively from it, to a PDF to read on my e-reader.
This program does the first step: going from a URL to a list of URLs by recursively navigating the links
on the page.
The `PS1Script` directory contains scripts to perform the other steps:
1. Translate all the links to PDF pages (using Chrome headless).
2. Combine all the PDFs to a single document (using `cpdf`).
## Prelude
First you declare which language you are using. Racket is a language to build
languages and the language itself is just one of many (try rereading this phrase).
||#
#lang racket
#||
We then expose the main functions of the library: recursively getting all links from a webpage
down to a certain nesting level, both as a Racket list and as a newline-delimited string.
||#
; Provides functions to extract links from web pages recursively
(provide
 ; uriString nestingLevels -> list of url structs
 ; Doesn't follow links to non-html resources or links pointing to a different domain than uriString.
 uri->nestedLinks
 ; uriString nestingLevels -> string of newline separated Uri Strings
 uri->nestedLinksNl)
#||
## Implementation
This is similar, but not identical, to the C# `using` statement: you declare which packages are needed.
||#
(require (planet neil/html-parsing:2:0)
         net/url
         xml
         html
         sxml/sxpath
         threading)
#||
The program crawls only links to html files in the same domain as the first URI given to it. It is
possible to extend this by having the program take a regular expression (or * expression) to
identify which files to leave out of the crawling.
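A minimal sketch of that extension, assuming the caller supplies a list of regular expressions
(`exclude-patterns` and `allowed-link?` are hypothetical names, not part of this module):

```racket
#lang racket

;; Hypothetical patterns a caller might want to exclude from the crawl.
(define exclude-patterns
  (list #rx"[.](jpe?g|png|gif|svg)$" #rx"^mailto:" #rx"^#"))

;; #t when the link matches none of the given patterns.
(define (allowed-link? patterns link)
  (not (ormap (λ (rx) (regexp-match? rx link)) patterns)))

(define kept
  (filter (curry allowed-link? exclude-patterns)
          '("/about.html" "logo.png" "mailto:me@example.com" "#top" "/posts/")))

kept ; => '("/about.html" "/posts/")
```

The fixed suffix and prefix lists used below are simpler, and probably enough for most sites.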
||#
(define invalid-suffixes
  '("./" ".xml" ".jpg" ".jpeg" ".png" ".gif" ".tiff" ".psd"
    ".eps" ".ai" ".indd" ".raw" ".svg"))
(define invalid-prefixes '("#" "mailto:" "javascript:"))
(define (different-domain? baseUrl l)
  (define url (string->url l))
  (and (url-host url) (not (equal? (url-host baseUrl) (url-host url)))))
(define (good-link? baseUrl l)
  (not (or (different-domain? baseUrl l)
           (ormap (curry string-suffix? l) invalid-suffixes)
           (ormap (curry string-prefix? l) invalid-prefixes))))
#||
Next we have to parse the HTML. We use XPath for that. Racket is particularly good at XML parsing, as
XML maps naturally to s-expressions, which are the bread and butter of the language. I don't enjoy
the nesting of expressions that Lisp-like languages force onto you (as in the expression below). But
see later for a partial solution.
||#
(define (xexp->links xexp) (flatten (map cdr ((sxpath "//a/@href") xexp))))
#||
The strange `~>>` operator in the code below comes from the [`Threading`](https://docs.racket-lang.org/threading/index.html)
package. This set of macros lets you build pipelines of computations, similarly to the F# `|>` operator.
The initial value comes first and then the functions are applied in series, each one to the result of
the previous. By doing that you 'flatten' the nested expressions, making them more readable (at least to my eyes).
This capability of changing the core behavior of the language is something very peculiar to the Lisp family,
and the reason why I am attracted to Racket in the first place.
This function extracts all the 'good' links from a url. BTW: I love that you can use `->` to name symbols.
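To give a feel for how a macro can add such an operator, here is a simplified sketch of a thread-last form.
It is not the Threading library's implementation; `thread-last` is a made-up name and it only covers the basic cases:

```racket
#lang racket

;; Simplified, hypothetical thread-last macro: each step receives the
;; previous result as its LAST argument.
(define-syntax thread-last
  (syntax-rules ()
    [(_ v) v]
    [(_ v (f arg ...) rest ...) (thread-last (f arg ... v) rest ...)]
    [(_ v f rest ...) (thread-last (f v) rest ...)]))

;; Nested version: read from the inside out.
(define nested (apply + (map add1 (filter even? '(1 2 3 4)))))

;; Pipeline version: read from top to bottom.
(define piped
  (thread-last '(1 2 3 4)
               (filter even?)
               (map add1)
               (apply +)))

(list nested piped) ; => '(8 8)
```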
||#
(define (url->links url)
  (~>> (call/input-url url get-pure-port html->xexp)
       xexp->links
       (filter (curry good-link? url))))
#|| `λ~>` builds a function that threads its argument through the given steps, so it works as a left-to-right function composition operator. You've got to love embedding lambdas in the code. ||#
(define uri->links
  (λ~> string->url url->links))
#||
This is the main recursive workhorse of the program. It works something like this (numbers marked in the code):
0. Treat links to subparts of a web page as if they were links to the webpage
1. If it is not a good link, return the links already visited (`visited`)
2. Same thing if the link is already in `visited`
3. If we reached the nesting level specified, add the link to `visited` and return `visited`
4. Otherwise add the link to `visited` and call yourself on all sublinks on the page
The function is not tail recursive, but that is not a huge deal in Racket, where deep recursion is limited
by available memory rather than by a small fixed stack. It doesn't blow up as easily as in most other languages.
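The same traversal can be tried without touching the network. The sketch below assumes a made-up in-memory
`site` table in place of real pages, and a hypothetical `crawl` function that mirrors steps 2 to 4:

```racket
#lang racket

;; Hypothetical "site": each page maps to the links found on it.
(define site
  (hash "a" '("b" "c")
        "b" '("a" "d")
        "c" '()
        "d" '("e")
        "e" '()))

(define (crawl page visited levels)
  (cond [(member page visited) visited]            ; already seen
        [(zero? levels) (cons page visited)]       ; depth limit reached
        [else (for/fold ([acc (cons page visited)]) ; visit, then recurse
                        ([l (in-list (hash-ref site page '()))])
                (crawl l acc (sub1 levels)))]))

(reverse (crawl "a" '() 2)) ; => '("a" "b" "d" "c")
```

Note how `acc` carries the visited list from one sibling link to the next, so a page reached twice is only recorded once.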
||#
(define (uri->nestedLinks-rec baseUrl uri visited levels)
  (define abs-url (if (string-contains? uri "#") ; <0>
                      (~> uri (string-split "#") first (combine-url/relative baseUrl _))
                      (~>> uri (combine-url/relative baseUrl))))
  (log-info "~a, ~a, ~a:~a~n" (url->string baseUrl)
            levels uri (url->string abs-url))
  (cond [(not (good-link? baseUrl uri)) visited] ; <1>
        [(member abs-url visited) visited] ; <2>
        [(zero? levels) (cons abs-url visited)] ; <3>
        [else (for/fold ([acc (cons abs-url visited)]) ; <4>
                        ([l (in-list (url->links abs-url))])
                (uri->nestedLinks-rec abs-url l acc (sub1 levels)))]))
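#||
Step 0 relies on `combine-url/relative` from `net/url` to turn a link found on a page into an absolute URL.
A small sketch of that step in isolation, assuming an example base URL and a hypothetical `resolve` helper:

```racket
#lang racket
(require net/url)

;; Example base page (an assumption, not one of the crawled sites).
(define base (string->url "https://example.com/docs/index.html"))

;; Hypothetical helper: drop anything after '#', then resolve against base.
(define (resolve base link)
  (combine-url/relative base (first (string-split link "#"))))

(url->string (resolve base "guide.html#install")) ; => "https://example.com/docs/guide.html"
(url->string (resolve base "/about.html"))        ; => "https://example.com/about.html"
```
||#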
#|| Finally we can trivially define the two main functions in the module. ||#
(define (uri->nestedLinks uri levels)
  (reverse (uri->nestedLinks-rec (string->url uri) "" '() levels)))
(define (uri->nestedLinksNl uri levels)
  (define links (uri->nestedLinks uri levels))
  (string-join (map url->string links) "\n" #:after-last "\n"))
#||
## Test
To my great pleasure, Racket allows (encourages?) you to have tests in the same file as the code.
They just go into submodules, which can be constructed piecewise with the `module+` form.
You could add the tests beside each function, but I decided to have a separate section in the file
instead. To run them you call `raco test FILENAME`.
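As a small illustration of the piecewise style (with made-up `double` and `square` functions, not part of this module),
each `module+ test` form below adds to the same `test` submodule:

```racket
#lang racket

(define (double x) (* 2 x))
(module+ test
  (require rackunit)
  (check-equal? (double 2) 4))

(define (square x) (* x x))
(module+ test
  ;; rackunit is already required by the first piece of this submodule.
  (check-equal? (square 3) 9))
```

Running `raco test` on such a file runs both checks; requiring the file from another module runs neither.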
||#
(define (uri->path test-uri)
  (build-path "./data" (~> test-uri first uri->file string->path)))
(define uri->file (λ~> string->url url-host))
(define test-uris
  '(("https://www.lucabol.com" 3)
    ("https://beautifulracket.com/" 3)
    ("https://en.wikipedia.org/wiki/Typeface" 1)
    ("https://brieferhistoryoftime.com" 3)
    ("https://mobydick.wales/" 3)
    ("https://resilientwebdesign.com" 3)
    ("https://www.c82.net/euclid/" 3)))
(module+ test
  (require rackunit)
  #|| I got a bit sloppy not naming my lambdas here ... But, doesn't the lambda symbol look cool? ||#
  (for-each (λ (test-uri)
              (with-input-from-file
                (uri->path test-uri)
                (λ ()
                  (define saved-result (port->string))
                  (define calc-result
                    (uri->nestedLinksNl (first test-uri) (second test-uri)))
                  ; rackunit documents the check message as a string, so pass the URI itself.
                  (check-equal? calc-result saved-result (first test-uri)))
                #:mode 'text))
            test-uris))
#|| This is used to regenerate the test data. You can then inspect it manually before running tests. ||#
(define (refresh-test-data)
  (for-each (λ (test-uri)
              (with-output-to-file
                (uri->path test-uri)
                (λ () (display (uri->nestedLinksNl (first test-uri) (second test-uri))))
                #:exists 'replace))
            test-uris))
#||
## Main
Main goes into its own submodule as well. Racket is not as pure as Haskell, so you can naturally
manage side effects like user input and such. You have to appreciate the conciseness of the command
line parsing library.
The code below looks a bit odd to me. It could probably be refactored
so that the parser expression returns the values instead of filling out parameters.
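One possible shape for that refactoring, as a sketch only: `parse-args` is a hypothetical helper that takes the
argument vector explicitly and returns a list of the URI and the number of levels. It still uses a local
mutable variable for the flag, so it is a halfway step rather than a fully functional parser:

```racket
#lang racket

;; Hypothetical helper: returns (list uri levels) instead of setting parameters.
(define (parse-args argv)
  (define levels 3) ; default nesting level
  (command-line
   #:program "website-links"
   #:argv argv
   #:once-each
   [("-l" "--levels") LEVELS "How many nested levels to process (default 3)."
                      (set! levels (string->number LEVELS))]
   #:args (URI)
   (list URI levels)))

(parse-args (vector "-l" "2" "https://example.com")) ; => '("https://example.com" 2)
```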
||#
(module+ main
  (define levels (make-parameter "3"))
  (define uri (make-parameter #f))
  (define parser
    (command-line
     #:program "website-links"
     #:usage-help "Extracts links from a webpage recursively to a specified level."
     #:once-each
     [("-l" "--levels") LEVELS "How many nested levels to process (default 3)." (levels LEVELS)]
     #:args (URI) (uri URI)))
  (display (uri->nestedLinksNl (uri) (string->number (levels)))))
#||
## Conclusion
I liked Racket very much. It takes a little while to get used to the expression syntax, which is very
different from the C-like one most of us are used to. It also takes a while to get used to the style
of the documentation, which is written very precisely for the careful reader. We are more used to the
'here is an example, copy it' kind of documentation. For the distracted programmer ...
||#