---
title: "R Client for Dataverse Repositories"
output: github_document
---
```{r knitr_options, echo=FALSE, results="hide"}
options(width = 120)
knitr::opts_chunk$set(results = "hold")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
```
[![CRAN Version](https://www.r-pkg.org/badges/version/dataverse)](https://cran.r-project.org/package=dataverse) ![Downloads](https://cranlogs.r-pkg.org/badges/dataverse) [![Travis-CI Build Status](https://travis-ci.org/IQSS/dataverse-client-r.png?branch=master)](https://travis-ci.org/IQSS/dataverse-client-r) [![codecov.io](https://codecov.io/github/IQSS/dataverse-client-r/coverage.svg?branch=master)](https://codecov.io/github/IQSS/dataverse-client-r?branch=master)
[![Dataverse Project logo](https://dataverse.org/files/dataverseorg/files/dataverse_project_logo-hp.png)](https://dataverse.org)
The **dataverse** package provides access to [Dataverse](https://dataverse.org/) APIs (versions 4-5), enabling data search, retrieval, and deposit, thus allowing R users to integrate public data sharing into the reproducible research workflow. **dataverse** is the next-generation iteration of [the **dvn** package](https://cran.r-project.org/package=dvn), which works with Dataverse 3 ("Dataverse Network") applications. **dataverse** includes numerous improvements for data search, retrieval, and deposit, including use of the (currently in development) **sword** package for data deposit and the **UNF** package for data fingerprinting.
### Getting Started
You can find a stable 2017 release on [CRAN](https://cran.r-project.org/package=dataverse), or install the latest development version from GitHub:
```{r, echo = TRUE, eval = FALSE}
# install the remotes package only if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("iqss/dataverse-client-r")
```
```{r}
library("dataverse")
```
#### Keys
Some features of the Dataverse API are public and require no authentication. This means that in many cases you can search for and retrieve data without an account on a specific Dataverse installation. Other features, however, require an account on the specific server installation of the Dataverse software, and an API key linked to that account. Instructions for obtaining an account and setting up an API key are available in the [Dataverse User Guide](https://guides.dataverse.org/en/latest/user/account.html). (Note: if your key is compromised, it can be regenerated to preserve security.) Once you have an API key, it should be stored as an environment variable called `DATAVERSE_KEY`. It can be set within R using:
``` r
Sys.setenv("DATAVERSE_KEY" = "examplekey12345")
```
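As a minimal sketch using only base R, the following checks that the key is visible to the current session before any authenticated call is made. The key value is the placeholder from the example above, and the warning text is illustrative. To keep the variable across sessions, the same name and value can instead be placed on a line of a `.Renviron` file, which R reads at startup.

``` r
# Placeholder key from the example above; substitute your own.
Sys.setenv("DATAVERSE_KEY" = "examplekey12345")

# Sys.getenv() returns "" when a variable is unset, so nzchar() tells us
# whether a key is available in this session.
key <- Sys.getenv("DATAVERSE_KEY")
key_is_set <- nzchar(key)

if (!key_is_set) {
  warning("DATAVERSE_KEY is not set; requests that need authentication will likely fail.")
}
```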
#### Server
Because [there are many Dataverse installations](https://dataverse.org/), all functions in the R client require specifying which server installation you are interacting with. A default can be set with the environment variable `DATAVERSE_SERVER`. Its value should be the host name of the Dataverse server, without the "https" prefix or the "/api" URL path, etc. For example, the Harvard Dataverse can be used by setting:
``` r
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
```
Note: The package attempts to compensate for malformed values of this variable.
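If the server address has been copied from a browser, it can be reduced to the expected form before it is stored. The helper below is a hypothetical illustration written in base R. It is not part of the **dataverse** package and does not describe how the package itself handles malformed values.

``` r
# Hypothetical helper (not part of the dataverse package): reduce a pasted URL
# to the bare host name that DATAVERSE_SERVER expects.
strip_server_url <- function(x) {
  x <- sub("^[A-Za-z]+://", "", x)  # drop a scheme such as "https://"
  x <- sub("/.*$", "", x)           # drop "/api" or any other trailing path
  x
}

server <- strip_server_url("https://dataverse.harvard.edu/api/")
Sys.setenv("DATAVERSE_SERVER" = server)
```

A value that is already a bare host name, such as `demo.dataverse.org`, passes through unchanged.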
Currently, the package wraps the data management features of the Dataverse API. Functions for other API features - related to user management and permissions - are not currently exported in the package (but are drafted in the [source code](https://github.com/IQSS/dataverse-client-r)).
### Data and Metadata Retrieval
The dataverse package provides multiple interfaces for reading data into R. Users can supply a file DOI, a dataset DOI combined with a filename, or a dataverse object. They can read the file either as a raw binary or as a dataset parsed by the appropriate R function.
#### Reading data as R objects
Use the `get_dataframe_*()` functions, depending on the input you have. For example, we will read a survey dataset on Dataverse, [nlsw88.dta](https://demo.dataverse.org/file.xhtml?persistentId=doi:10.70122/FK2/PPIAXE/MHDB0O) (`doi:10.70122/FK2/PPIAXE/MHDB0O`), originally in Stata `.dta` format.
With a file DOI, we can use the `get_dataframe_by_doi` function:
```{r get_dataframe_by_doi}
nlsw <-
  get_dataframe_by_doi(
    filedoi = "10.70122/FK2/PPIAXE/MHDB0O",
    server = "demo.dataverse.org"
  )
```
which by default reads in the ingested file (not the original dta) using the [`readr::read_tsv`](https://readr.tidyverse.org/reference/read_delim.html) function.
Alternatively, we can download the same file by specifying the filename and the DOI of the "dataset" (in Dataverse, a collection of files is called a dataset).
```{r get_dataframe_by_name_tsv, message=FALSE}
nlsw_tsv <-
  get_dataframe_by_name(
    filename = "nlsw88.tab",
    dataset = "10.70122/FK2/PPIAXE",
    server = "demo.dataverse.org"
  )
```
Now, Dataverse often translates rectangular data into an ingested, or "archival", version, which is application-neutral and easily readable. `get_dataframe_*()` defaults to taking this ingested version rather than the original, through the argument `original = FALSE`.
This default is safe because you may not have the proprietary software that was originally used. On the other hand, the data may have lost information during ingestion.
To read the original version of the same file instead, specify `original = TRUE` and set an `.f` argument. In this case, we know that `nlsw88.tab` was originally a Stata `.dta` dataset, so we will use the `haven::read_dta` function.
```{r get_dataframe_by_name_original}
nlsw_original <-
  get_dataframe_by_name(
    filename = "nlsw88.tab",
    dataset = "10.70122/FK2/PPIAXE",
    .f = haven::read_dta,
    original = TRUE,
    server = "demo.dataverse.org"
  )
```
Note that even though the file extension is ".tab", we use `haven::read_dta`.
Of course, when the file is not ingested (such as an Rds file), users always need to specify an `.f` argument suited to that file.
Note the difference between `nlsw_tsv` and `nlsw_original`. `nlsw_original` preserves data attributes such as value labels, whereas `nlsw_tsv` has dropped them or left them in the file metadata.
```{r}
class(nlsw_tsv$race) # tab ingested version only has numeric data
```
```{r}
attr(nlsw_original$race, "labels") # original dta has value labels
```
#### Reading a dataset as a binary file
In some cases, you may not want to read the data into your R environment, perhaps because that is not possible (e.g. for a `.docx` file), and simply want to write the files to your local disk. To do this, use the more primitive `get_file_*` commands. The arguments are equivalent, except that we no longer need an `.f` argument.
```{r get_file_by_name}
nlsw_raw <-
  get_file_by_name(
    filename = "nlsw88.tab",
    dataset = "10.70122/FK2/PPIAXE",
    server = "demo.dataverse.org"
  )
class(nlsw_raw)
```
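Writing such a download to disk can be sketched with base R's `writeBin()`, assuming the object returned by `get_file_by_name()` is a raw vector. So that the example runs without a network connection, a short stand-in raw vector is used here in place of `nlsw_raw`, and the destination path is arbitrary.

``` r
# Stand-in for the raw vector returned by get_file_by_name(); in practice,
# use nlsw_raw from the chunk above instead.
raw_bytes <- charToRaw("idcode\tage\n1\t37\n")

# Write the bytes to disk unchanged. The destination path is arbitrary.
dest <- file.path(tempdir(), "nlsw88.tab")
writeBin(raw_bytes, dest)

# Check that the file on disk has the same number of bytes as the vector.
same_size <- file.size(dest) == length(raw_bytes)
```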
#### Reading file metadata
The function `get_file_metadata()` can be used in a similar way. For ingested tabular files, it returns metadata in the `ddi` format. The function `get_dataset()` retrieves the list of files in a dataset.
```{r, get_dataset}
get_dataset(
  dataset = "10.70122/FK2/PPIAXE",
  server = "demo.dataverse.org"
)
```
### Data Discovery
Dataverse supplies a robust search API to discover Dataverses, datasets, and files. The simplest searches consist of a query string:
```{r search1, eval = FALSE}
dataverse_search("Gary King")
```
More complicated searches might specify metadata fields:
```{r search2, eval = FALSE}
dataverse_search(author = "Gary King", title = "Ecological Inference")
```
And searches can be restricted to specific types of objects (Dataverse, dataset, or file):
```{r search3, eval = FALSE}
dataverse_search(author = "Gary King", type = "dataset")
```
The results are paginated using the `per_page` argument. To retrieve subsequent pages, specify `start`.
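As a rough sketch of how paging could be organized, the offsets for successive pages can be computed up front. This assumes that `start` is a zero-based offset of the first result to return, as in the Dataverse Search API. The total of 35 results is an invented figure for illustration, and the search call is left as a comment because it needs a network connection.

``` r
# Assumed values for illustration: 35 matching results, 10 results per page.
total_results <- 35
per_page <- 10

# Offset of the first result on each page, assuming start counts from 0.
starts <- seq(from = 0, to = total_results - 1, by = per_page)

# Each offset would then be passed to a search, for example the second page:
# dataverse_search("Gary King", per_page = per_page, start = starts[2])
```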
### Data Archiving
Dataverse provides two - basically unrelated - workflows for managing (adding, documenting, and publishing) datasets. The first is built on [SWORD v2.0](http://swordapp.org/sword-v2/). This means that to create a new dataset listing, you will have to first initialize a dataset entry with some metadata, add one or more files to the dataset, and then publish it. This looks something like the following:
``` r
# retrieve your service document
d <- service_document()
# create a list of metadata
metadat <-
  list(
    title = "My Study",
    creator = "Doe, John",
    description = "An example study"
  )
# create the dataset
ds <- initiate_sword_dataset("mydataverse", body = metadat)
# add files to dataset
tmp <- tempfile()
write.csv(iris, file = tmp)
f <- add_file(ds, file = tmp)
# publish new dataset
publish_sword_dataset(ds)
# dataset will now be published
list_datasets("mydataverse")
```
The second workflow is called the "native" API and is similar but uses slightly different functions:
``` r
# create the dataset
ds <- create_dataset("mydataverse")
# add files
tmp <- tempfile()
write.csv(iris, file = tmp)
f <- add_dataset_file(file = tmp, dataset = ds)
# publish dataset
publish_dataset(ds)
# dataset will now be published
get_dataverse("mydataverse")
```
Through the native API it is possible to update a dataset by modifying its metadata with `update_dataset()` or file contents using `update_dataset_file()` and then republish a new version using `publish_dataset()`.
### Other Installations
Users interested in downloading metadata from archives other than Dataverse may want to consider Kurt Hornik's [OAIHarvester](https://cran.r-project.org/package=OAIHarvester) and Scott Chamberlain's [oai](https://cran.r-project.org/package=oai), which offer metadata download from any web repository that is compliant with the [Open Archives Initiative](http://www.openarchives.org/) standards. Additionally, [rdryad](https://cran.r-project.org/package=rdryad) uses OAIHarvester to interface with [Dryad](https://datadryad.org/stash). The [rfigshare](https://cran.r-project.org/package=rfigshare) package works in a similar spirit to **dataverse** with <https://figshare.com/>.