README - SMRFFS-EXT4
Seagate Technologies LLC
Lead Engineer: Adrian Palmer
December 2015
Table of Contents
1. About
2. ZAC/ZBD standards
Commands
Challenges
3. Stack Changes
[change, method, backers]
ahci
libata
libsas
SCSI
SD
blk_dev
IO Scheduler
mdraid
lvm
FS
/sys
4. Userland Utilities
hdparm
sdparm
mke2fs
mount.ext4
resize2fs
tune2fs
dumpe2fs
debugfs
e2freefrag
e2image
e2undo
gparted
gdisk
5. Schedule
6. Patch Notes
7. Installation
8. FAQ
9. Use Cases
10. Future Work
11. Feedback
12. Contact info/legal
===============================================
1. About
SMRFFS is an addition to the popular EXT4 filesystem to enable support for devices that use the ZBC or ZAC standards. Project scope includes support for Host Aware (HA) devices, may include support for Host Managed (HM) devices, and will include the ability to restrict behavior to enforce a common ZBC/ZAC command set protocol.
SMR drives have a specific idiosyncrasy: (a) drive managed drives prefer non-interrupted sequential writes through a zone, (b) host aware drives prefer forward writes within a zone, and (c) host managed drives require forward writes within a zone (along with other constraints). By optimizing sequential file layout -- in-order writes and garbage collection (idle-time defragmentation and compaction) -- the file system should work with the drive to reduce non-preferred or disallowed behavior, greatly decreasing latency for applications.
2. ZAC/ZBC standards
Standards:
Zoned Block Commands (ZBC)
Zoned-device ATA Commands (ZAC)
ZAC/ZBC standards arose in T10/T13 in response to SMR drives being developed to enter the market. New methods are being standardized to establish a communication protocol for zoned block devices. ZBC covers SCSI devices, and the standard is being ratified through the T10 organization. ATA standards will be ratified through the T13 organization under the title ZAC.
Latest specifications can be found on www.t10.org and www.t13.org.
ZAC and ZBC command sets cover both Host Aware (HA) and Host Managed (HM) devices. SMR drives are expected to saturate the HDD market over the coming years. Without this modification (ZBC command support), HM will NOT work with traditional filesystems. With this modification, HA will demonstrate performance and determinism -- as found in non-SMR drives -- in traditional & new applications.
ZAC and ZBC specifications are device agnostic. The specifications were developed for SMR HDDs, but can be applied to conventional drives, Flash & SSDs, and even [possibly] optical media.
ZBC was sent to INCITS on 4 September 2015 (INCITS 536). ZAC is expected to be sent to INCITS in December 2015. Additional features are being planned for later drafts.
Commands
REPORT_ZONES
The REPORT_ZONES command is the primary method for gaining information about the zones on a disk. In order to make any meaningful decisions about the IO, this data must first be gathered. The information returned for each zone is as follows:
Zone type: Conventional, Sequential Write Required, Sequential Write Preferred
Zone condition: Not Write Pointer, Empty, Open, Read Only, Full, Offline
Non_seq: a bit that indicates that an out-of-order IO request has been received for the zone.
Zone length: Length of zone in LBAs
Zone start LBA
Write Pointer LBA
Because the REPORT_ZONES command is a non-queued command, issuing a REPORT_ZONES command to the drive will cause all commands in the drive's work queue to be flushed. This will create a significant performance problem in Filesystems and Applications that continually request this information. It is expected that allocation software maintain a mirror cache of this information.
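The mirror-cache idea above can be sketched as a small model. The Zone fields mirror the REPORT_ZONES data listed earlier, but the names and structure below are illustrative, not the kernel's actual data structures:

```python
# Minimal sketch of an allocation-side mirror cache of REPORT_ZONES data.
# The cache is consulted instead of re-issuing REPORT_ZONES, which would
# flush the drive's command queue. Names are illustrative assumptions.
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Zone:
    start_lba: int       # zone start LBA
    length: int          # zone length in LBAs
    write_ptr: int       # write pointer LBA (next writable LBA)
    cond: str = "EMPTY"  # zone condition

class ZoneCache:
    def __init__(self, zones):
        self.zones = sorted(zones, key=lambda z: z.start_lba)
        self.starts = [z.start_lba for z in self.zones]

    def zone_for_lba(self, lba):
        # Binary search the sorted zone list for the zone containing lba.
        i = bisect_right(self.starts, lba) - 1
        if i < 0:
            return None
        z = self.zones[i]
        return z if lba < z.start_lba + z.length else None
```

A lookup against the cache is then an in-memory binary search rather than a queue-flushing command to the drive.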
RESET_WRITE_POINTER
The RESET_WRITE_POINTER command is a successor to the TRIM command for ZAC/ZBC devices. Unlike TRIM, which operates on arbitrary LBA ranges, RESET_WRITE_POINTER clears an entire zone: the forward-only Write Pointer is reset to the beginning of the zone, allowing data to be overwritten without consequence. Like TRIM, this is implemented as DISCARD within the kernel.
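The write-pointer semantics can be shown with a toy model of a single sequential-write-required (HM) zone. This is a simplification for illustration, not driver code:

```python
# Toy model of one sequential-write-required (HM) zone: writes must land
# exactly at the write pointer, RESET_WRITE_POINTER rewinds the pointer,
# and reads ahead of the pointer return a fixed pattern (zeros).
class SeqZone:
    def __init__(self, start_lba, length):
        self.start = start_lba
        self.length = length
        self.wp = start_lba   # write pointer starts at the zone start
        self.data = {}

    def write(self, lba, block):
        if lba != self.wp:    # HM: out-of-order writes are rejected
            raise IOError("write not at write pointer")
        self.data[lba] = block
        self.wp += 1

    def read(self, lba):
        # Data ahead of the write pointer is not readable.
        return self.data[lba] if lba < self.wp else b"\x00"

    def reset_write_pointer(self):
        # Like DISCARD over the whole zone: all data is effectively deleted.
        self.wp = self.start
        self.data.clear()
```

Note that after a reset the old data is unreachable, which is the security idiosyncrasy discussed under Challenges below.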
OPEN_ZONE
CLOSE_ZONE
FINISH_ZONE
These 3 commands are optional, and they manage zone conditions. OPEN_ZONE and CLOSE_ZONE toggle the Zone Condition between EXPLICIT_OPEN and CLOSED. Without these commands, the Zone Condition becomes IMPLICIT_OPEN upon a write to the zone.
There are advisory numbers on the drive, presented through the VPD pages, which limit the number of zones that can be open with EXPLICIT_OPEN and IMPLICIT_OPEN. Once the number of zones in either of these states exceeds this limit, the device will have to close zones. This happens implicitly for zones in IMPLICIT_OPEN, but zones in EXPLICIT_OPEN require host intervention.
The advisory numbers are drive dependent.
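The open-zone accounting can be sketched roughly as below. The eviction policy shown (close the oldest implicitly open zone) is an illustrative assumption; the device is free to choose which implicitly open zone to close:

```python
# Sketch of open-zone limit accounting. max_open stands in for the
# advisory number from the VPD pages; the oldest-first eviction is an
# assumption for illustration.
class OpenZoneTracker:
    def __init__(self, max_open):
        self.max_open = max_open
        self.explicit = set()   # zones opened via OPEN_ZONE
        self.implicit = []      # zones opened by a write, oldest first

    def _evict_if_needed(self):
        while len(self.explicit) + len(self.implicit) > self.max_open:
            if not self.implicit:
                # Only EXPLICIT_OPEN zones remain; the host must intervene.
                raise IOError("too many EXPLICIT_OPEN zones")
            self.implicit.pop(0)  # device implicitly closes a zone

    def write(self, zone):
        # A write implicitly opens the zone.
        if zone not in self.explicit and zone not in self.implicit:
            self.implicit.append(zone)
            self._evict_if_needed()

    def open_zone(self, zone):    # OPEN_ZONE command
        if zone in self.implicit:
            self.implicit.remove(zone)
        self.explicit.add(zone)
        self._evict_if_needed()

    def close_zone(self, zone):   # CLOSE_ZONE command
        self.explicit.discard(zone)
```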
Challenges
ZAC/ZBC paradigms attempt to provide an interface to solve a fundamental problem: SMR is forward-write only. This change violates a long-held notion of storage design: random writes for random access devices. Random write is now separated from random read. Because each level in the storage stack operates on a shared, generally stateless interface, each level is responsible for fulfilling the requirements for ZAC/ZBC. As each layer has [little] knowledge of the other layers, each is responsible for FIFO correctness, preventing race conditions and re-ordering of IO.
ZAC/ZBD also presents more information that has to be passed up the stack. Currently, there are no pathways and no consumers of this data. However, for optimal performance, the information must be consumed.
Besides the idiosyncrasies of SMR that are solved with ZAC/ZBC, the solution brings its own. RESET_WRITE_POINTER has a security idiosyncrasy: because reads ahead of the write pointer return a predetermined pattern (e.g. all zeros), a RESET_WRITE_POINTER renders all data in the zone effectively deleted. On HM drives, this deletion is irreversible -- HM requires sequential writes to advance the write pointer. The REPORT_ZONES command requires that drive activity be finalized to accurately report write pointer locations; this results in a disk flush operation.
3. Stack Changes
For every pathway through the stack, the ZAC/ZBD zone information must be examined and replicated upwards. Furthermore, action commands must be able to find their way down the stack to the driver, and ultimately the drive.
It is expected that over time, the ZAC/ZBD pathways will overtake and replace existing pathways. The ZAC/ZBD standards are compatible with conventional drives (although some of the information will have to be synthesized along the way). Existing acceptance theories require that changes be both minimal and unintrusive. However, the required changes are anything but. Therefore, there will have to be a phase-in approach where conventional and ZAC/ZBD paths are parallel and mostly separate.
AHCI
AHCI is the software equivalent of SATA firmware. AHCI is responsible for exposing the advanced features of the SATA interface. Although AHCI presents a passthrough mode, the addition of ZAC/ZBD commands enables faster, more stable execution and caching of zone information.
Work on AHCI was completed by Seagate Technologies in late 2014.
libata
libata is the library that hosts the commands for ATA communication. ZAC/ZBD commands were added to this library, including sense data for ACS-4. This layer is also responsible for processing translations between SCSI and ATA.
Work on libata was completed by Seagate Technologies in early 2015. Work is based off previous improvements by SUSE.
libsas
libsas is the SCSI equivalent to libata. ZAC/ZBD commands need to be added.
Work is not yet scheduled for libsas.
SCSI
SCSI provides the commands in a non-transferable format to the upper layers. When a command is received here (with its arguments), it is translated and sent to the lower libraries. ZAC/ZBD commands are added to the SCSI layer. Also, as re-ordering can happen at this layer (in alignment with NCQ), a re-queueing algorithm has been added to restore write-pointer order: improper IO requests (i.e. those not at the write pointer) are simply re-queued at the end of the queue, forming a circular list that is iterated until the correct IO is found.
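The circular re-queueing idea can be sketched as a minimal model. The request tuples and function name are hypothetical; this shows only the ordering logic:

```python
# Sketch of the re-queueing algorithm: requests not at their zone's write
# pointer are rotated to the back of the queue and retried once earlier
# requests have advanced the pointer.
from collections import deque

def drain_in_order(requests, write_ptr):
    """requests: list of (zone_id, lba, nblocks); write_ptr: zone_id -> wp.
    Returns the order in which requests were dispatched."""
    q = deque(requests)
    dispatched = []
    stalled = 0
    while q:
        zone, lba, n = q[0]
        if lba == write_ptr[zone]:
            q.popleft()
            write_ptr[zone] += n       # the dispatch advances the pointer
            dispatched.append((zone, lba))
            stalled = 0
        else:
            q.rotate(-1)               # re-queue at the end of the queue
            stalled += 1
            if stalled == len(q):      # a full pass with no progress
                raise IOError("queue blocked: gap at write pointer")
    return dispatched
```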
Work on SCSI is championed by SUSE. This work was integrated into the stack by Seagate. Additional work is being done to ensure both HA and HM have included pathways.
SD
SD (SCSI Device) is the driver for the drive. It provides read, write and ioctl interfaces to higher layers. For ZAC/ZBD, two new interfaces were added: one for reset_wp and another for report_zones. Because the SD driver sees every write, no matter the source, the SD driver now stores the zone information in a memory cache to avoid performance penalties related to issuing a REPORT_ZONES command to the drive's firmware.
Blockdev
The blockdev system receives ioctl commands then issues them on behalf of the caller to the device. This usually provides a cleaner interface, or hides multiple commands. ZAC/ZBD commands have been added.
Work on blockdev has been extensive, started by SUSE, and incorporated by Seagate.
IO scheduler
The IO scheduler elevator is responsible for deciding the order of writes to the disk. Existing elevators seek to rearrange IO with either nothing (noop elevator), or a combination of LBA seeks, priority, process-based scheduling, and time deadline. A new scheduler needs to be added to account for LBA sequentiality.
Work on the IO scheduler is yet to be scheduled.
md/mdraid
md will have 2 purposes: the first is to provide shims that interface between disk and apps; the second is to enable ZAC/ZBC-aware RAID.
There are 3 types of shims: one that provides conventional <--> HA/HM, one that provides a HA/HM <--> conventional interface, and another that provides a HA/HM<-->HA/HM interface. The first is a simulator for HA/HM running on a conventional drive. Of the remaining, the former blocks ZAC/ZBC from rising to upper layers, and the latter passes the information.
The conventional <--> HA/HM shim is the early phase of ZAC/ZBC adoption. It is a simulator, and has little expected value beyond that. The path that this shim represents is expected to be absorbed into the SD driver, allowing conventional disks to be presented as HA/HM.
Work on this shim was started by Seagate, but the project has been shelved because of the advancement of the other kernel work.
The HA/HM <--> conventional shim allows an HA/HM drive to be used with legacy/non-compliant applications (filesystems). As it presents the drive as a conventional drive to everything above it, it eliminates the need for further massive changes. This is a lasting stopgap measure until ZAC/ZBC is fully integrated into the stack and matured. It is also a solution for legacy filesystems that can't yet be obsoleted (e.g. FAT32 for EFI partitions). This shim works by transforming the filesystem into copy-on-write at the block device: what the filesystem believes its allocations are is completely different from what the drive sees as allocations. The shim maintains an LBA mapping table as metadata. During idle times, the shim can clean up the mappings (defragment) to improve read performance.
This shim seeks to allow all layers above it to work. RAID/LVM/FS will work as is.
Work on this shim has been significantly advanced by Seagate, but is not included as part of the SMRFFS project.
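The copy-on-write remapping performed by the HA/HM <--> conventional shim can be modeled minimally as below. The allocation policy (first zone with room) is an illustrative assumption:

```python
# Minimal model of the shim's copy-on-write mapping: every write,
# including an overwrite, lands at some zone's write pointer, and the
# logical->physical LBA table is the shim's persistent metadata.
class CowShim:
    def __init__(self, nzones, zone_len):
        self.zone_len = zone_len
        # Per-zone write pointers, one zone after another on the disk.
        self.wp = [z * zone_len for z in range(nzones)]
        self.map = {}   # logical LBA -> physical LBA

    def _alloc(self):
        # Assumption: allocate at the first zone with room at its pointer.
        for z, wp in enumerate(self.wp):
            if wp < (z + 1) * self.zone_len:
                self.wp[z] = wp + 1
                return wp
        raise IOError("device full: idle-time cleanup (defragment) needed")

    def write(self, logical_lba):
        # The previous physical block, if any, becomes garbage to reclaim.
        self.map[logical_lba] = self._alloc()

    def lookup(self, logical_lba):
        return self.map.get(logical_lba)
```

An overwrite never touches the old physical location, which is why the shim must defragment the mappings during idle time.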
The HA/HM <--> HA/HM shim provides ZAC/ZBC information up the stack. Because most of the functionality of this md shim mirrors what is already implemented in SD, there is little to do, except in combination with LVM/RAID. There is one reason this is needed: with multiple disks (or even one drive), zone information is not guaranteed to be identical. This shim, along with mdraid, will need to mangle (read: change) the reported zone information in a particular way.
This shim may not be strictly necessary, as the functionality of it can be fully absorbed into the consuming layers.
Work on this shim, and the associated layers, is expected to begin in August 2015.
LVM
Logical Volume Manager (LVM) is software that combines disks linearly, allowing the volume to appear to change size. The drives are concatenated end to end, resulting in a JBOD (Just a Bunch Of Disks) array. There is no guarantee that the underlying drives are identical, and in general, LVM doesn't care. However, to be presented as a single volume, the aggregation must be seamless. For ZAC/ZBD, this includes offsetting LBAs (as is current), but also aligning differing zone information (IOW the SAME field in the REPORT_ZONES response cannot be set for the information passed up, although it may be set for each individual drive). The LVM could have a mix of zones of different types and different sizes.
Work on LVM began in October 2015.
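The aggregation constraint above can be illustrated with a small sketch: each member's zone list is shifted by that device's offset in the volume, and the SAME indication is cleared whenever the merged zones differ (names and tuple layout are hypothetical):

```python
# Sketch of LVM-style zone aggregation for a linear (JBOD) volume.
# devices: list of (size_lbas, zones); each zone is (start, length, ztype).
def aggregate_zones(devices):
    """Returns (same_flag, merged zone list with volume-relative LBAs)."""
    merged, offset = [], 0
    for size, zones in devices:
        # Offset each member's zone starts by its position in the volume.
        merged += [(start + offset, length, ztype)
                   for start, length, ztype in zones]
        offset += size
    # SAME can only be reported if every zone has one length and type.
    same = len({(length, ztype) for _, length, ztype in merged}) <= 1
    return same, merged
```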
mdraid
mdraid will require extensive changes. The drives will be arranged in a way that will require a combination of 1) overlapping zones 2) striping zones and 3) parity.
Planning and design has begun.
Page Cache
We currently expect there is nothing to do for ZAC/ZBC in the page cache, except for the possibility of adding ordered stability to the pages as they enter the cache, go through the cache, and exit the cache.
Filesystem (EXT4)
Aside from possibly mdraid, the FS is the lowest application that chooses allocation. Everything below the FS seeks to honor the FS choice, and everything above cares little. The FS is most sensitive to ZAC/ZBD changes. Without the needed changes, existing FSes will either 1) simply fail or 2) have performance degradation. The FS now has a need to know about the logical/physical layout of the disk. FSes of yesteryear sought to optimize based on CHS information from the firmware. However, after FS creation and layout, that information was never queried again, and the FS is essentially drive agnostic. SMRFFS seeks to continue in the same tradition.
Upon creation, the FS is laid out in a way that mimics the underlying device: Block Groups are aligned to match the zones. Once created, the metadata in the FS mirrors the information in REPORT_ZONES at any given time (this removes massive performance penalties). The allocator is changed such that writes are no longer random, but rather follow forward-write-only rules. Upon mount, because following forward-write-only is critical, the allocation bitmaps are scanned and checked for accuracy against the REPORT_ZONES information. This one rule requires multiple algorithm additions and enhancements inside the FS. While this initially introduces 2 control paths in the FS (one for conventional drives, and another for ZAC/ZBC), we expect that the ZAC/ZBC path will absorb all use cases from the conventional path. ZAC/ZBC will work on a conventional drive (although some information -- zone start and length -- needs to be synthesized in SD).
Work on the FS (EXT4) is currently under development by Seagate. This is the SMRFFS project.
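The forward-write-only allocator rule can be sketched as follows. BlockGroup and the fall-through policy are illustrative, not the EXT4 implementation:

```python
# Sketch of the changed allocator: each block group mirrors one zone, and
# blocks are handed out only at the group's cached write pointer, so IO
# within a zone is always forward-write only.
class BlockGroup:
    def __init__(self, first_block, nblocks):
        self.first = first_block
        self.nblocks = nblocks
        self.wp = first_block      # mirrors the zone's write pointer

    def alloc(self, count):
        if self.wp + count > self.first + self.nblocks:
            return None            # extent doesn't fit; try the next BG
        start = self.wp
        self.wp += count           # allocation advances with the zone
        return start

def alloc_extent(groups, count):
    # Assumption: simple first-fit over the block groups.
    for bg in groups:
        start = bg.alloc(count)
        if start is not None:
            return start
    raise IOError("no space: garbage collection/compaction required")
```

Because each group only ever moves its pointer forward, free space behind a pointer is reclaimed by the garbage collector rather than reused in place.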
sysfs
Up to this point, all work has been committed to the kernelspace. There are utilities that work in userspace that will need the ZAC/ZBC information also. Many of these utilities take the place of the FS for a specific purpose. ZAC/ZBC zone information will need to be presented (and maintained) in sysfs from SD.
4. Userland Utilities
mke2fs
Worked on by Seagate
1. Add ZBD flag
Requires packed_meta_data
Requires extents*
Requires bigalloc
2. Query Zone information from disk
lay out BGs accordingly
handle multi-size BGs
3. SB/GD changes
4. New Extent layout
*incompatible with EXT2/3/4 indirect lists
December 2015
Various fixes aligning structures to zone boundaries: block groups and journal location/size.
hdparm
Finished: Reworked by Seagate
1. Query and report Drive type
2. Query and Report Zone Information
sdparm
1. Query and report Drive type
2. Query and Report Zone Information
gdisk (not fdisk)
1. New Defaults?
2. add ZBD flag
3. Query disk and suggest optimizations
4. handle zones with GPT (Not MBR with fdisk)
gparted (not parted)
1. New Defaults?
2. add ZBD flag
3. Query disk and suggest optimizations
4. handle zones with GPT (Not MBR with parted)
EXT4 Library (e2fsprogs)
1. Add ZBD structures
2. Update SB/GD structures
3. Add write-engine for write-in-place utilities
4. Add new journal support for write-in-place utilities
5. Add new allocator routine (same as in FS)
e2freefrag
No major changes
1. Add reporting recommendation to compact
dumpe2fs
1. Add reporting information for ZBD SB & GDs
e2undo
Obsolete. Will REQUIRE journal on ZBD, which makes this redundant.
e2image
1. Will need to write SB using write engine.
e2defrag
Needs to be gutted and rewritten
uses write-engine
uses allocator
1. Add defragmenter compatible with ZBD
2. Add compactor option(s)
Compact within zones (zone pack)
Compact zones (disk pack) (range)
3. Add new journal support (metadata)
tune2fs
use write-engine
1. Multisize BG support
2. Add options for new fields in SB & GDs
3. If needed, move/resize BGs, edit inodes (journal)
4. Re-write SB & GDs
resize2fs
use write-engine
Will not modify partitions
1. Add ZBD flag
2. Add support for multi-size BGs support
3. Will need to re-write SB
tune2fs
use write-engine
1. modify SB for ZBD options
2. Support for multi-size BGs
3. modify GD for size/condition/type
debugfs
use write-engine, and all functions in library
e4fschk
use write-engine
1. Add support for new options in SB & GD
2. Add new inode handling
3. Add new journal support
mdadm
possibly extensive rewrite
1. add ZBD support
5. Schedule
Internally, we have organized the project into 'releases' ranging from v0.1 to v0.8.
v0.1 Superficial changes with existing code (assume 256MiB zones)
mkfs options -b 4096 -C 8192 -E bigalloc,packed_meta_blocks=1,discard,num_backup_sb=0 -O extent,sparse_super2,^has_journal
Simulation of 8k blocks
No journal
v0.2 Minor FS changes
Add ZAC/ZBC bit flag in SB
Add internal structures to support ZAC/ZBC
Forward write only verification/tweaking
v0.3 Kernel IO stack changes
Update AHCI, libata, SCSI, SD
v0.4 Kernel IO stack communication
ioctls from SD to FS
v0.45 Improved updates to v0.4
v0.5 Major FS changes
New block allocator - forward-write only at Write Pointer
New journal
B+trees for metadata
New extents
New Garbage Collector/Defragmenter/Compactor
groundwork for multi-sized BGs
v0.6 Userland utilities
resize2fs
tune2fs
dumpe2fs
debugfs
e2freefrag
e2image
e2defrag
e4fschk
e2undo
mke2fs
hdparm
sdparm
gdisk
gparted
mdadm
others?
v0.7 RAID support
DM shim: HA/HM <--> conventional
DM shim: HA/HM <--> HA/HM
multi-sized BGs
LVM
mdraid
v0.8 Performance/Standards compliance
Add/verify/enforce HM requirements
Completed
v0.1
Developed, tested, released (tweaks still ongoing)
v0.2
Developed, tested, released
v0.3
Developed, tested, released
v0.4
Developed, tested, released, presented at Vault Storage Conference
v0.45
Incorporated code, released
In Progress
v0.5 - expected December 2015
Tweaking B+Tree code
Garbage collection development
To Be Done
v0.6 - expected December 2015
v0.7 - expected December 2015
v0.8 - expected TBA
Presentations/Speaking Engagements
Linux Storage and FileSystems/Memory Management Summit 2015
Linux Vault Conference 2015
Massive Storage Systems and Technology 2015
Linux Plumber's Conference 2015
SNIA Developer's Conference 2015
6. Patch Notes
ATA_IDE
Providing a base for future work, this patch updates files with code needed to provide ZAC support at the ATA layers. These changes allow basic communication with SATA ZAC/ZBD drives. The patches add the new ZAC/ZBD commands to the libraries, detect such drives, and determine to what degree they require management (NONE, Drive Managed -- if reported, Host Aware, Host Managed).
Changes include:
New ZAC/ZBC commands
Changes in taskfile to accommodate commands
Errors for ZAC/ZBC commands
Traces for ZAC/ZBC commands
Translations for ZAC/ZBC commands
Detection of drive type
SCSI_SAS
As the Linux stack assumes SCSI internally, commands for ZAC/ZBC (developed on SATA drives) must be implemented. These changes reside on top of the ATA_IDE changes. This patch only receives commands by code number and passes them along to lower levels; beyond defining the codes, there is no implementation of the SCSI commands.
SD
SD is the driver for the devices. ZAC/ZBD procedures have been added. Upon detection of a ZAC/ZBD device, the SD driver is responsible for issuing the zone commands and storing the results.
Without the lower layer patches, changes here would not take effect.
Changes require the setting of CONFIG_SCSI_ZBC
BLOCKDEV
This patch adds functionality for the management of ZAC/ZBC zones and exposes symbols upward. Compilation requires CONFIG_BLK_DEV_ZONED and CONFIG_BLK_ZONED.
EXT4
The EXT4 patch begins to define the structures needed for ZAC/ZBC use. The goal is to manage the zones on the SMR drive via management of the BGs.
7. Installation
Against kernel 4.2.0, apply each patch with git apply (for other kernel versions, some conflicts may need to be resolved).
Alternatively, compile the provided kernel source, which already includes the patches.
Compile and install the kernel as per normal procedures.
8. FAQs
What's the difference between SMR solutions?
There are 4 formats: No format, Drive Managed (DM), Host Aware (HA) and Host Managed (HM)
No format is conceptual SMR: forward write only. (Period). Think of it like tape.
DM: The drive is presented to the OS as a conventional drive. The drive hides all implementation of the forward-write only work and allows random writes by the OS. Under certain workloads, this has performance problems. Current software will work on these drives.
HM: The drive is presented to the OS as a new device type. The drive requires the OS to make proper IO choices and follow the rules; anything not conforming to the rules is returned as an error. By following the rules, high performance is expected. All currently existing software (filesystems) will break on HM.
HA: The drive uses and expects HM rules, but offers flexibility instead of an error when non-conformant writes are received. Current software will work, but not optimally on these drives.
What's the difference between ZAC and ZBC?
Zoned Block Commands (ZBC)
Zoned-device ATA Commands (ZAC)
Both standards have the same commands and return types, for the same purpose. ZAC is ATA and ZBC is SCSI.
9. Use Cases
SMR ZAC/ZBC drives are currently slated for an archive market, with future work required for other use cases.
a) backup systems:
This case is Write Once, Read Many (if ever) -- WORM. Data is written strictly sequentially, either using no filesystem or a log-structured FS. Data can be written through to the end of the disk.
10. Future/additional Work
Zoned Device Mapper (https://github.com/Seagate/ZDM-Device-Mapper)
A HA/HM drive to conventional mapping. Uses CoW to allow any legacy FS to work on SMR drives.
SMR Multidrive
Building on top of SMRFFS and ZDM, Seagate is looking to incorporate SMR RAID solutions into the stack.
11. Feedback
Skepticism
This is a large project with major changes throughout the IO stack, and it is expected to gain broad acceptance as SMR drives saturate the market. Until then, there is skepticism in the community; this is expected for a change of this scale. We weigh this feedback from the open source community, along with research from proprietary vendors and the direction of other FS projects, and conclude that this is a needed and beneficial project for the next generation of storage technology.
Contact
Post questions on github, or send email to maintainer (adrian.palmer@seagate.com).
12. Legal
Releases will be available at http://www.github.com/seagate/SMR_FS-EXT4
How is Seagate cooperative in this project?
Under the GPLv2 license, Seagate is willing to share code with partners who will contribute to Seagate's efforts as Seagate contributes to the community. Seagate is actively seeking help, from corporations or individuals. Please contact the author to provide assistance.
Seagate seeks no revenue directly from this filesystem. It is given as a gift to the community.
Seagate's modifications to EXT4 are distributed under the GPLv2 license "as is," without technical support, and WITHOUT ANY WARRANTY, without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
You should receive a copy of the GNU General Public License along with any updates/patches. If not, see <http://www.gnu.org/licenses/>.