---
output: github_document
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "##",
fig.path = "README-"
)
```
```{r, include=FALSE}
options("width"=110)
tmp <- packageDescription( basename(getwd()) )
```
```{r, results='asis', echo=FALSE}
cat("##", tmp$Title)
```
```{r, include=FALSE}
filelist.R <- list.files("R", recursive = TRUE, pattern="\\.R$", ignore.case = TRUE, full.names = TRUE)
filelist.tests <- list.files("tests", recursive = TRUE, pattern="\\.R$", ignore.case = TRUE, full.names = TRUE)
filelist.cpp <- list.files("src", recursive = TRUE, pattern="\\.cpp$", ignore.case = TRUE, full.names = TRUE)
lines.R <- unlist(lapply(filelist.R, readLines, warn = FALSE))
lines.tests <- unlist(lapply(filelist.tests, readLines, warn = FALSE))
lines.cpp <- unlist(lapply(filelist.cpp, readLines, warn = FALSE))
length.R <- length(grep("(^\\s*$)|(^\\s*#)|(^\\s*//)", lines.R, value = TRUE, invert = TRUE))
length.tests <- length(grep("(^\\s*$)|(^\\s*#)|(^\\s*//)", lines.tests, value = TRUE, invert = TRUE))
length.cpp <- length(grep("(^\\s*$)|(^\\s*#)|(^\\s*//)", lines.cpp, value = TRUE, invert = TRUE))
```
[![ropensci\_footer](https://raw.githubusercontent.com/ropensci/robotstxt/master/logo/github_footer.png)](https://ropensci.org)
**Status**
*lines of R code:* `r length.R`, *lines of test code:* `r length.tests`
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/)
[![](https://badges.ropensci.org/25_status.svg)](https://github.com/ropensci/software-review/issues/25)
<a href="https://travis-ci.org/ropensci/robotstxt"><img src="https://api.travis-ci.org/ropensci/robotstxt.svg?branch=master"></a>
<a href="https://cran.r-project.org/package=robotstxt"><img src="http://www.r-pkg.org/badges/version/robotstxt"></a>
[![cran checks](https://cranchecks.info/badges/summary/robotstxt)](https://cran.r-project.org/web/checks/check_results_robotstxt.html)
<a href="https://codecov.io/gh/ropensci/robotstxt"><img src="https://codecov.io/gh/ropensci/robotstxt/branch/master/graph/badge.svg" alt="Codecov" /></a>
<img src="http://cranlogs.r-pkg.org/badges/grand-total/robotstxt">
<img src="http://cranlogs.r-pkg.org/badges/robotstxt">
**Development version**
```{r, include=FALSE}
source_files <-
grep(
"/R/|/src/|/tests/",
list.files(recursive = TRUE, full.names = TRUE),
value = TRUE
)
last_change <-
as.character(
format(max(file.info(source_files)$mtime), tz="UTC")
)
```
```{r, results='asis', echo=FALSE}
cat(tmp$Version)
cat(" - ")
cat(stringr::str_replace(last_change, " ", " / "))
```
**Description**
```{r, results='asis', echo=FALSE}
cat(tmp$Description)
```
**License**
```{r, results='asis', echo=FALSE}
cat(tmp$License, "<br>")
cat(tmp$Author)
```
**Citation**
```{r, results='asis', eval=FALSE}
citation("robotstxt")
```
**BibTex for citing**
```{r, eval=FALSE}
toBibtex(citation("robotstxt"))
```
**Contribution - AKA The-Think-Twice-Be-Nice-Rule**
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms:
> As contributors and maintainers of this project, we pledge to respect all people who
contribute through reporting issues, posting feature requests, updating documentation,
submitting pull requests or patches, and other activities.
>
> We are committed to making participation in this project a harassment-free experience for
everyone, regardless of level of experience, gender, gender identity and expression,
sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.
>
> Examples of unacceptable behavior by participants include the use of sexual language or
imagery, derogatory comments or personal attacks, trolling, public or private harassment,
insults, or other unprofessional conduct.
>
> Project maintainers have the right and responsibility to remove, edit, or reject comments,
commits, code, wiki edits, issues, and other contributions that are not aligned to this
Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed
from the project team.
>
> Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by
opening an issue or contacting one or more of the project maintainers.
>
> This Code of Conduct is adapted from the Contributor Covenant
(https://www.contributor-covenant.org/), version 1.0.0, available at
https://www.contributor-covenant.org/version/1/0/0/code-of-conduct/
## Installation
**Installation and start - stable version**
```{r, eval=FALSE}
install.packages("robotstxt")
library(robotstxt)
```
**Installation and start - development version**
```{r, eval=FALSE}
devtools::install_github("ropensci/robotstxt")
library(robotstxt)
```
## Usage
**Robotstxt class documentation**
```{r, eval=FALSE}
?robotstxt
```
Simple path access right checking (the functional way) ...
```{r}
library(robotstxt)
options(robotstxt_warn = FALSE)
paths_allowed(
paths = c("/api/rest_v1/?doc", "/w/"),
domain = "wikipedia.org",
bot = "*"
)
paths_allowed(
paths = c(
"https://wikipedia.org/api/rest_v1/?doc",
"https://wikipedia.org/w/"
)
)
```
... or (the object oriented way) ...
```{r}
library(robotstxt)
options(robotstxt_warn = FALSE)
rtxt <-
robotstxt(domain = "wikipedia.org")
rtxt$check(
paths = c("/api/rest_v1/?doc", "/w/"),
bot = "*"
)
```
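A robots.txt object can also be built directly from rule text, without any HTTP retrieval - handy for experimenting with rules offline. This is a minimal sketch assuming the `text` parameter of `robotstxt()` as described in `?robotstxt`; the rule text and paths are made up for illustration:

```{r, eval=FALSE}
library(robotstxt)

# hypothetical rule text - one bot-agnostic Disallow rule
rtxt_local <-
  robotstxt(text = "User-agent: *\nDisallow: /private/")

# check made-up paths against the rules parsed from the text
rtxt_local$check(
  paths = c("/public/", "/private/"),
  bot   = "*"
)
```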
### Retrieval
Retrieving the robots.txt file for a domain:
```{r}
# retrieval
rt <-
get_robotstxt("https://petermeissner.de")
# printing
rt
```
### Interpretation
Checking whether or not one is supposedly allowed to access some resource on a
web server is - unfortunately - not just a matter of downloading and parsing a
simple robots.txt file.
First, there is no official specification for robots.txt files, so every robots.txt
file written and every robots.txt file read and used is an interpretation. Most of
the time we all have a common understanding of how things are supposed to work,
but things get more complicated at the edges.
Some interpretation problems:
- finding no robots.txt file on the server (e.g. HTTP status code 404) implies that everything is allowed
- subdomains should have their own robots.txt file; if none is found, it is assumed that everything is allowed
- redirects involving protocol changes - e.g. upgrading from http to https - are followed and considered no domain or subdomain change, so whatever is found at the end of the redirect is considered the robots.txt file for the original domain
- redirects from the www subdomain to the domain are considered no domain change, so whatever is found at the end of the redirect is considered the robots.txt file for the subdomain originally requested
### Event Handling
Because the interpretation of robots.txt rules does not just depend on the rules specified within the file,
the package implements an event handler system that allows events to be interpreted and re-interpreted into rules.
Under the hood the `rt_request_handler()` function is called within `get_robotstxt()`.
This function takes an {httr} request-response object and a set of event handlers.
Processing the request and the handlers, it checks for various events and states
around getting the file and reading in its content. If an event/state happened,
the event handlers are passed on to `request_handler_handler()` for
problem resolution and for collecting robots.txt file transformations:
- rule priorities decide whether rules are applied, given the current state priority
- if rules specify signals, those are emitted (e.g. error, message, warning)
- often rules imply overwriting the raw content with a suitable interpretation, given the circumstances under which the file was (or was not) retrieved
Event handler rules can either consist of four items or be functions - the former being the usual case and the form used throughout the package itself.
Functions like `paths_allowed()` have parameters that allow passing
along handler rules or handler functions.
Handler rules are lists with the following items:
- `over_write_file_with`: if the rule is triggered and has a higher priority than the rules applied beforehand (i.e. the new priority has a higher value than the old one), then the retrieved robots.txt file will be overwritten by this character vector
- `signal`: might be `"message"`, `"warning"`, or `"error"`; the corresponding signal function will be used to signal the event/state just handled. Signaling a warning or a message can be suppressed by setting the function parameter `warn = FALSE`.
- `cache`: whether or not the package is allowed to cache the result of the retrieval
- `priority`: the priority of the rule, specified as a numeric value; rules with higher priority are allowed to overwrite robots.txt file content changed by rules with lower priority
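Such a handler rule might be constructed as follows - a minimal sketch in which all field values are illustrative assumptions, not the package defaults:

```{r}
# hypothetical handler rule: overwrite the retrieved content with an
# "allow all" robots.txt, emit a message, do not cache the result, and
# assign a medium priority (all values chosen for illustration only)
my_handler_rule <- list(
  over_write_file_with = "User-agent: *\nAllow: /",
  signal               = "message",
  cache                = FALSE,
  priority             = 20
)

str(my_handler_rule)
```

A rule like this could then be passed along via the corresponding handler parameter of a retrieval function, e.g. `paths_allowed(..., on_server_error = my_handler_rule)`.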
The package knows the following rules with the following defaults:
- `on_server_error` :
    - given a server error - the server is unable to serve the file - we assume that something is terribly wrong, forbid all paths for the time being, but do not cache the result so that we might get an updated file later on
```{r}
on_server_error_default
```
- `on_client_error` :
    - client errors encompass all HTTP 4xx status codes except 404, which is handled separately
    - despite the fact that many of these codes indicate that the client has to take action (authentication, billing, ...; see: https://de.wikipedia.org/wiki/HTTP-Statuscode), retrieving robots.txt with a simple GET request should just work, so any client error is treated as if no file is available and thus scraping is generally allowed
```{r}
on_client_error_default
```
- `on_not_found` :
    - HTTP status code 404 has its own handler but is treated the same way as other client errors: as if no file is available and thus scraping is generally allowed
```{r}
on_not_found_default
```
- `on_redirect` :
    - redirects are ok - often they redirect from the HTTP scheme to HTTPS - robotstxt will use whatever content it has been redirected to
```{r}
on_redirect_default
```
- `on_domain_change` :
    - domain changes are handled as if the robots.txt file did not exist and thus scraping is generally allowed
```{r}
on_domain_change_default
```
- `on_file_type_mismatch` :
    - if {robotstxt} gets content with a content type other than text, it probably is not a robots.txt file; this situation is handled as if no file was provided and thus scraping is generally allowed
```{r}
on_file_type_mismatch_default
```
- `on_suspect_content` :
    - if {robotstxt} cannot parse the content, it probably is not a robots.txt file; this situation is handled as if no file was provided and thus scraping is generally allowed
```{r}
on_suspect_content_default
```
### Design Map for Event/State Handling
**from version 0.7.x onwards**
While previous releases were concerned with implementing parsing and permission checking and with improving performance, the 0.7.x release is foremost about robots.txt retrieval. Although retrieval was already implemented, there are corner cases in the retrieval stage that may well influence the interpretation of the permissions granted.
**Features and Problems handled:**
- now handles corner cases of retrieving robots.txt files
    - e.g. if no robots.txt file is available, this basically means "you can scrape it all"
    - but there are further corner cases (what if there is a server error, what if redirection takes place, what if redirection takes place to a different domain, what if a file is returned but is not parsable or is of format HTML or JSON, ...)
**Design Decisions**
1. the whole HTTP request-response-chain is checked for certain event/state types
- server error
- client error
- file not found (404)
- redirection
- redirection to another domain
2. the content returned by the HTTP request is checked for
    - mime type / file type specification mismatch
    - suspicious content (file content seems to be JSON, HTML, or XML instead of robots.txt)
3. state/event handlers define how these states and events are handled
4. a handler handler executes the rules defined in the individual handlers
5. handlers can be overwritten
6. handler defaults are defined such that they should always do the right thing
7. handlers can ...
    - overwrite the content of a robots.txt file (e.g. allow/disallow all)
    - modify how problems should be signaled: error, warning, message, none
    - decide whether robots.txt file retrieval should be cached or not
8. problems (no matter how they were handled) are attached to the robots.txt files as attributes, allowing for ...
    - transparency
    - reacting post-mortem to the problems that occurred
9. all handlers (even the actual execution of the HTTP request) can be overwritten at runtime to inject user-defined behaviour
### Warnings
By default, all functions retrieving robots.txt files will warn if
- any HTTP events happened while retrieving the file (e.g. redirects) or
- the content of the file does not seem to be a valid robots.txt file.
The warnings in the following example can be turned off in three ways:
```{r, include = FALSE}
options("robotstxt_warn" = TRUE)
```
(example)
```{r}
library(robotstxt)
paths_allowed("petermeissner.de")
```
(solution 1)
```{r}
library(robotstxt)
suppressWarnings({
paths_allowed("petermeissner.de")
})
```
(solution 2)
```{r}
library(robotstxt)
paths_allowed("petermeissner.de", warn = FALSE)
```
(solution 3)
```{r}
library(robotstxt)
options(robotstxt_warn = FALSE)
paths_allowed("petermeissner.de")
```
### Inspection and Debugging
The robots.txt files retrieved are basically mere character vectors:
```{r}
rt <-
get_robotstxt("petermeissner.de")
as.character(rt)
cat(rt)
```
The last HTTP request is stored in an object:
```{r}
rt_last_http$request
```
But they also have some additional information stored as attributes.
```{r}
names(attributes(rt))
```
Events that might change the interpretation of the rules found in the robots.txt file:
```{r}
attr(rt, "problems")
```
The {httr} request-response object allows us to dig into what exactly was going on in the client-server exchange:
```{r}
attr(rt, "request")
```
... or lets us retrieve the original content given back by the server:
```{r}
httr::content(
x = attr(rt, "request"),
as = "text",
encoding = "UTF-8"
)
```
... or have a look at the actual HTTP request issued and all response headers given back by the server:
```{r}
# extract request-response object
rt_req <-
attr(rt, "request")
# HTTP request
rt_req$request
# response headers
rt_req$all_headers
```
### Transformation
For convenience the package also includes an `as.list()` method for robots.txt files.
```{r}
as.list(rt)
```
### Caching
The retrieval of robots.txt files is cached on a per-R-session basis.
Restarting an R session will invalidate the cache. Also, using the
function parameter `force = TRUE` will force the package to re-retrieve the
robots.txt file.
```{r}
paths_allowed("petermeissner.de/I_want_to_scrape_this_now", force = TRUE, verbose = TRUE)
paths_allowed("petermeissner.de/I_want_to_scrape_this_now", verbose = TRUE)
```
## More information
- https://www.robotstxt.org/norobots-rfc.txt
- [Have a look at the package vignette](https://cran.r-project.org/package=robotstxt/vignettes/using_robotstxt.html)
- [Google on robots.txt](https://developers.google.com/search/reference/robots_txt?hl=en)
- https://wiki.selfhtml.org/wiki/Grundlagen/Robots.txt
- https://support.google.com/webmasters/answer/6062608?hl=en
- https://www.robotstxt.org/robotstxt.html