---
title: "Data Strategies for Future Us"
author: "Andy Barrett,\n National Snow and Ice Data Center"
date: last-modified
date-format: iso
format:
  revealjs:
    colorlinks: true
    theme: night
    slide-number: c/t
# Handle redirect from the old URL
aliases:
  - "data_strategies_slides.html"
---
## What are Data Strategies?

Data Strategies enhance collaboration and reproducible science:

- Workflows
- Data management best practices
- Documentation

It is good to start from the beginning of a project, but great to start
from where you are now.
## Who is Future Us?

:::: {.columns}
::: {.column}
You.

Your team.

The scientific community.
:::

::: {.column}
![](images/horst-starwars-teamwork.png)
:::
::::
## A simple data workflow

![Source: <http://r4ds.hadley.nz/>](images/wickham-data-science-workflow.png)
## When to cloud?

- What is the data volume?
- How long will it take to download?
- Can you store all that data (cost and space)?
- Do you have the computing power for processing?
- Does your team need a common computing environment?
- Do you need to share data at each step, or just an end product?
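The first two questions above lend themselves to a quick back-of-envelope calculation; the volume and bandwidth figures in this sketch are hypothetical:

```python
# Rough download-time estimate; all numbers here are hypothetical.
def download_hours(volume_gb: float, bandwidth_mbps: float) -> float:
    """Hours needed to download volume_gb at a sustained bandwidth_mbps."""
    megabits = volume_gb * 8 * 1000  # GB -> megabits
    return megabits / bandwidth_mbps / 3600

# 2 TB over a sustained 100 Mbit/s link takes almost two days.
print(f"{download_hours(2000, 100):.1f} hours")
```

If the answer is measured in days, processing next to the data in the cloud starts to look attractive.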
## Workflow Solutions
### Local
```{dot}
digraph G {
  fontname="Helvetica,Arial,sans-serif"
  node [fontname="Helvetica,Arial,sans-serif"]
  edge [fontname="Helvetica,Arial,sans-serif"]

  communicate [shape=plaintext, label="Communicate"];
  cloud [shape=parallelogram, label="Earthdata Cloud"];
  daac [shape=cylinder, label="DAAC"];

  subgraph cluster_incloud {
    style=filled;
    color=lightgrey;
    cloud;
    label="In Cloud";
  }

  subgraph cluster_0 {
    style=filled;
    color=lightgrey;
    node [shape=plaintext, style=filled, color=white];
    Tidy -> Transform -> Visualize -> Model -> Transform;
    label = "Local Machine";
  }

  daac -> Tidy;
  cloud -> Tidy;
  Model -> communicate;
}
```
## Workflow Solutions
### Hybrid
```{dot}
digraph G {
  fontname="Helvetica,Arial,sans-serif"
  node [fontname="Helvetica,Arial,sans-serif"]
  edge [fontname="Helvetica,Arial,sans-serif"]

  communicate [shape=plaintext, label="Communicate"];
  cloud [shape=parallelogram, label="Earthdata Cloud"];
  daac [shape=cylinder, label="DAAC"];

  subgraph cluster_incloud {
    style=filled;
    color=lightgrey;
    tidy_cloud [shape=plaintext, label="Tidy", style=filled, color=white];
    cloud -> tidy_cloud;
    label="In Cloud";
  }

  subgraph cluster_local {
    style=filled;
    color=lightgrey;
    node [shape=plaintext, style=filled, color=white];
    tidy_local [shape=plaintext, label="Tidy", style=filled, color=white];
    tidy_local -> Transform -> Visualize -> Model -> Transform;
    label = "Local Machine";
  }

  daac -> tidy_local;
  tidy_cloud -> Transform;
  Model -> communicate;
}
```
## Workflow Solutions
### All in cloud
```{dot}
digraph G {
  fontname="Helvetica,Arial,sans-serif"
  node [fontname="Helvetica,Arial,sans-serif"]
  edge [fontname="Helvetica,Arial,sans-serif"]

  communicate [shape=plaintext, label="Communicate"];
  cloud [shape=parallelogram, label="Earthdata Cloud"];
  daac [shape=cylinder, label="DAAC"];

  subgraph cluster_incloud {
    style=filled;
    color=lightgrey;
    tidy_cloud [shape=plaintext, label="Tidy", style=filled, color=white];
    node [shape=plaintext, style=filled, color=white];
    cloud -> tidy_cloud -> Transform -> Visualize -> Model -> Transform;
    label="In Cloud";
  }

  daac -> tidy_cloud;
  Model -> communicate;
}
```
## Workflow Solutions
### Use Cloud-based Services
- Cloud services are infrastructure, platforms, and software hosted in
  the cloud and made available via an API, often accessed through a web
  interface.
- NASA's Harmony (<https://harmony.earthdata.nasa.gov/>) can subset,
  reproject, reformat, and serve data.
- Using these services can save you processing steps.
## What does this look like in the cloud?

```{.python}
import earthaccess
import xarray as xr

auth = earthaccess.login(strategy='netrc')

query = earthaccess.granule_query().concept_id(
    'C2153572614-NSIDC_CPRD'
).temporal(
    "2020-03-01", "2020-03-30"
).bounding_box(
    -134.7, 58.9, -133.9, 59.2
)
granules = query.get(4)

files = earthaccess.open(granules)
ds = xr.open_dataset(files[1], group='/gt1l/land_ice_segments')
ds
# Start to do awesome science
```
## How to future-proof workflows and make them reproducible

**FAIR**

- **F**indable
- **A**ccessible
- **I**nteroperable
- **R**eusable

FAIR applies to the future you and your team as well.
## Make sure data are **F**indable and **A**ccessible

Does everyone on your team know where the data are?

Can they access them?

It helps to document this somewhere.
## Data Management

Keep raw data, raw!

Save intermediate data, not just final versions.

Use consistent and descriptive folder and file name patterns.
```
(base) nsidc-442-abarrett:data_strategies_for_a_future_us$ tree Data
Data
├── calibrated
├── cleaned
├── figures
├── final
├── monthly_averages
├── raw
└── results
7 directories, 0 files
```
## Standard file formats make data **I**nteroperable

- GeoTIFF for imagery or 2D raster data
- NetCDF for multi-dimensional data (3D+)
- Shapefiles or GeoJSON for vector data
- CSV for tabular data

Avoid Excel and other proprietary formats.
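GeoJSON in particular needs no special libraries at all. As a sketch, a point of interest can be written with just the standard library (the coordinates and properties below are made up):

```python
import json

# A minimal GeoJSON FeatureCollection; the values are illustrative only.
site = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-134.3, 59.05]},
        "properties": {"name": "study_site", "elevation_m": 1220},
    }],
}

# Any GIS tool that understands GeoJSON can read this text directly.
geojson_text = json.dumps(site, indent=2)
```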
## Metadata makes data **I**nteroperable and **R**eusable

Metadata standards and conventions ensure that standard tools can
read and interpret the data.

Standards also define the meaning of metadata attributes:

- What is the Coordinate Reference System (projection, grid mapping)?
- What are the units?
- What is the variable name?
- What is the source of the data?
- What script produced the data?
## Document the Analysis

Document each step:

- Where did you get the data? Which files? Which version?
- Write it down. Anywhere is good, but a script is better.

Can you (or anyone else) easily reproduce your processing pipeline?

With GUI tools (e.g. ArcGIS, QGIS, Excel), take screenshots or journal
your commands.