README - SMRFFS-EXT4
Seagate Technologies LLC
Lead Engineer: Adrian Palmer
December 2015
Table of Contents
1. About
2. ZAC/ZBD standards
Commands
Challenges
3. Stack Changes
[change, method, backers]
ahci
libata
libsas
SCSI
SD
blk_dev
IO Scheduler
mdraid
lvm
FS
/sys
4. Userland Utilities
hdparm
sdparm
mke2fs
mount.ext4
resize2fs
tune2fs
dumpe2fs
debugfs
e2freefrag
e2image
e2undo
gparted
gdisk
5. Schedule
6. Patch Notes
7. Installation
8. FAQ
9. Use Cases
10. Future Work
11. Feedback
12. Contact info/legal
===============================================
1. About
SMRFFS is an addition to the popular EXT4 filesystem to enable support for devices that use the ZBC or ZAC standards. Project scope includes support for Host Aware (HA) devices, may include support for Host Managed (HM) devices, and will include the ability to restrict behavior to enforce a common ZBC/ZAC command set protocol.
SMR drives have a specific idiosyncrasy: (a) drive managed drives prefer non-interrupted sequential writes through a zone, (b) host aware drives prefer forward writes within a zone, and (c) host managed drives require forward writes within a zone (along with other constraints). By optimizing sequential file layout -- in-order writes and garbage collection (idle-time defragmentation and compaction) -- the file system should work with the drive to reduce non-preferred or disallowed behavior, greatly decreasing latency for applications.
2. ZAC/ZBC standards
Standards:
Zoned Block Commands (ZBC)
Zoned-device ATA Commands (ZAC)
ZAC/ZBC standards arose in T10/T13 in response to SMR drives being developed to enter the market. New methods are being standardized to establish a communication protocol for zoned block devices. ZBC covers SCSI devices, and the standard is being ratified through the T10 organization. ATA standards will be ratified through the T13 organization under the title ZAC.
Latest specifications can be found on www.t10.org and www.t13.org.
ZAC and ZBC command sets cover both Host Aware (HA) and Host Managed (HM) devices. SMR drives are expected to saturate the HDD market over the coming years. Without this modification (ZBC command support), HM will NOT work with traditional filesystems. With this modification, HA will demonstrate performance and determinism -- as found in non-SMR drives -- in traditional & new applications.
ZAC and ZBC specifications are device agnostic. The specifications were developed for SMR HDDs, but can be applied to conventional drives, Flash & SSDs, and even [possibly] optical media.
ZBC was sent to INCITS on 4 September 2015 (INCITS 536). ZAC is expected to be sent to INCITS in December 2015. Additional features are being planned for later drafts.
Commands
REPORT_ZONES
The REPORT_ZONES command is the primary method for gaining information about the zones on a disk. In order to make any meaningful decisions about the IO, this data must first be gathered. The information returned for each zone is as follows:
Zone type: Conventional, Sequential Write Required, Sequential Write Preferred
Zone condition: Not Write Pointer, Empty, Open, Read Only, Full, Offline
Non_seq: a bit that indicates that an out-of-order IO request has been received for the zone.
Zone length: Length of zone in LBAs
Zone start LBA
Write Pointer LBA
Because the REPORT_ZONES command is a non-queued command, issuing a REPORT_ZONES command to the drive will cause all commands in the drive's work queue to be flushed. This will create a significant performance problem in Filesystems and Applications that continually request this information. It is expected that allocation software maintain a mirror cache of this information.
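The mirror-cache idea above can be sketched as a small model. The Zone fields mirror the REPORT_ZONES data listed earlier, but the names and structure below are illustrative, not the kernel's actual data structures:

```python
# Minimal sketch of an allocation-side mirror cache of REPORT_ZONES data.
# The cache is consulted instead of re-issuing REPORT_ZONES, which would
# flush the drive's command queue. Names are illustrative assumptions.
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Zone:
    start_lba: int       # zone start LBA
    length: int          # zone length in LBAs
    write_ptr: int       # write pointer LBA (next writable LBA)
    cond: str = "EMPTY"  # zone condition

class ZoneCache:
    def __init__(self, zones):
        self.zones = sorted(zones, key=lambda z: z.start_lba)
        self.starts = [z.start_lba for z in self.zones]

    def zone_for_lba(self, lba):
        # Binary search the sorted zone list for the zone containing lba.
        i = bisect_right(self.starts, lba) - 1
        if i < 0:
            return None
        z = self.zones[i]
        return z if lba < z.start_lba + z.length else None
```

A lookup against the cache is then an in-memory binary search rather than a queue-flushing command to the drive.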
RESET_WRITE_POINTER
The RESET_WRITE_POINTER command is a successor to the TRIM command for ZAC/ZBC devices. Unlike TRIM, which operates on arbitrary LBA ranges, RESET_WRITE_POINTER clears an entire zone: the forward-only Write Pointer is reset to the beginning of the zone, allowing data to be overwritten without consequence. Like TRIM, this is implemented as DISCARD within the kernel.
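The write-pointer semantics can be shown with a toy model of a single sequential-write-required (HM) zone. This is a simplification for illustration, not driver code:

```python
# Toy model of one sequential-write-required (HM) zone: writes must land
# exactly at the write pointer, RESET_WRITE_POINTER rewinds the pointer,
# and reads ahead of the pointer return a fixed pattern (zeros).
class SeqZone:
    def __init__(self, start_lba, length):
        self.start = start_lba
        self.length = length
        self.wp = start_lba   # write pointer starts at the zone start
        self.data = {}

    def write(self, lba, block):
        if lba != self.wp:    # HM: out-of-order writes are rejected
            raise IOError("write not at write pointer")
        self.data[lba] = block
        self.wp += 1

    def read(self, lba):
        # Data ahead of the write pointer is not readable.
        return self.data[lba] if lba < self.wp else b"\x00"

    def reset_write_pointer(self):
        # Like DISCARD over the whole zone: all data is effectively deleted.
        self.wp = self.start
        self.data.clear()
```

Note that after a reset the old data is unreachable, which is the security idiosyncrasy discussed under Challenges below.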
OPEN_ZONE
CLOSE_ZONE
FINISH_ZONE
These 3 commands are optional, and they manage zone conditions. OPEN_ZONE and CLOSE_ZONE toggle the Zone Condition between EXPLICIT_OPEN and CLOSED. Without these commands, the Zone Condition becomes IMPLICIT_OPEN upon a write to the zone.
There are advisory numbers on the drive, presented through the VPD pages, which limit the number of zones that can be open with EXPLICIT_OPEN and IMPLICIT_OPEN. Once the number of zones in either of these states exceeds this limit, the device will have to close zones. This happens implicitly for zones in IMPLICIT_OPEN, but zones in EXPLICIT_OPEN require host intervention.
The advisory numbers are drive dependent.
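The open-zone accounting can be sketched roughly as below. The eviction policy shown (close the oldest implicitly open zone) is an illustrative assumption; the device is free to choose which implicitly open zone to close:

```python
# Sketch of open-zone limit accounting. max_open stands in for the
# advisory number from the VPD pages; the oldest-first eviction is an
# assumption for illustration.
class OpenZoneTracker:
    def __init__(self, max_open):
        self.max_open = max_open
        self.explicit = set()   # zones opened via OPEN_ZONE
        self.implicit = []      # zones opened by a write, oldest first

    def _evict_if_needed(self):
        while len(self.explicit) + len(self.implicit) > self.max_open:
            if not self.implicit:
                # Only EXPLICIT_OPEN zones remain; the host must intervene.
                raise IOError("too many EXPLICIT_OPEN zones")
            self.implicit.pop(0)  # device implicitly closes a zone

    def write(self, zone):
        # A write implicitly opens the zone.
        if zone not in self.explicit and zone not in self.implicit:
            self.implicit.append(zone)
            self._evict_if_needed()

    def open_zone(self, zone):    # OPEN_ZONE command
        if zone in self.implicit:
            self.implicit.remove(zone)
        self.explicit.add(zone)
        self._evict_if_needed()

    def close_zone(self, zone):   # CLOSE_ZONE command
        self.explicit.discard(zone)
```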
Challenges
ZAC/ZBC paradigms attempt to provide an interface to solve a fundamental problem: SMR is forward-write only. This change violates a long-held notion of storage design: random writes for random access devices. Random write is now separated from random read. Because each level in the storage stack operates on a shared, generally stateless interface, each level is responsible for fulfilling the requirements for ZAC/ZBC. As each layer has [little] knowledge of the other layers, each is responsible for FIFO correctness, preventing race conditions and re-ordering of IO.
ZAC/ZBD also presents more information that has to be passed up the stack. Currently, there are no pathways and no consumers of this data. However, for optimal performance, the information must be consumed.
Besides the idiosyncrasies of SMR that are solved with ZAC/ZBC, the solution brings its own. RESET_WRITE_POINTER has a security idiosyncrasy: because reads ahead of the write pointer return a predetermined pattern (e.g. all zeros), a RESET_WRITE_POINTER renders all data in the zone effectively deleted. On HM drives, this deletion is irreversible -- HM requires sequential writes to advance the write pointer. The REPORT_ZONES command requires that drive activity be finalized to accurately report write pointer locations; this results in a disk flush operation.
3. Stack Changes
For every pathway through the stack, the ZAC/ZBD zone information must be examined and replicated upwards. Furthermore, action commands must be able to find their way down the stack to the driver, and ultimately the drive.
It is expected that over time, the ZAC/ZBD pathways will overtake and replace existing pathways. The ZAC/ZBD standards are compatible with conventional drives (although some of the information will have to be synthesized along the way). Existing acceptance theories require that changes be both minimal and unintrusive. However, the required changes are anything but. Therefore, there will have to be a phase-in approach where conventional and ZAC/ZBD paths are parallel and mostly separate.
AHCI
AHCI is the software equivalent of SATA firmware. AHCI is responsible for exposing the advanced features of the SATA interface. Although AHCI presents a passthrough mode, the addition of ZAC/ZBD commands enables faster, more stable execution and caching of zone information.
Work on AHCI was completed by Seagate Technologies in late 2014.
libata
libata is the library that hosts the commands for ATA communication. ZAC/ZBD commands were added to this library, including sense data for ACS-4. This layer is also responsible for processing translations between SCSI and ATA.
Work on libata was completed by Seagate Technologies in early 2015. Work is based off previous improvements by SUSE.
libsas
libsas is the SCSI equivalent to libata. ZAC/ZBD commands need to be added.
Work is not yet scheduled for libsas.
SCSI
SCSI provides the commands in a non-transferable format to the upper layers. When a command is received here (with its arguments), it is translated and sent to the lower libraries. ZAC/ZBD commands are added to the SCSI layer. Also, as re-ordering can happen at this layer (in alignment with NCQ), a re-queueing algorithm has been added to restore write-pointer order: improper IO requests (i.e. those not at the write pointer) are simply re-queued at the end of the queue, forming a circular list that is iterated until the correct IO is found.
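The circular re-queueing idea can be sketched as a minimal model. The request tuples and function name are hypothetical; this shows only the ordering logic:

```python
# Sketch of the re-queueing algorithm: requests not at their zone's write
# pointer are rotated to the back of the queue and retried once earlier
# requests have advanced the pointer.
from collections import deque

def drain_in_order(requests, write_ptr):
    """requests: list of (zone_id, lba, nblocks); write_ptr: zone_id -> wp.
    Returns the order in which requests were dispatched."""
    q = deque(requests)
    dispatched = []
    stalled = 0
    while q:
        zone, lba, n = q[0]
        if lba == write_ptr[zone]:
            q.popleft()
            write_ptr[zone] += n       # the dispatch advances the pointer
            dispatched.append((zone, lba))
            stalled = 0
        else:
            q.rotate(-1)               # re-queue at the end of the queue
            stalled += 1
            if stalled == len(q):      # a full pass with no progress
                raise IOError("queue blocked: gap at write pointer")
    return dispatched
```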
Work on SCSI is championed by SUSE. This work was integrated into the stack by Seagate. Additional work is being done to ensure both HA and HM have included pathways.
SD
SD (SCSI Device) is the driver for the drive. It provides read, write and ioctl interfaces to higher layers. For ZAC/ZBD, two new interfaces were added: one for reset_wp and another for report_zones. Because the SD driver sees every write, no matter the source, the SD driver now stores the zone information in a memory cache to avoid performance penalties related to issuing a REPORT_ZONES command to the drive's firmware.
Blockdev
The blockdev system receives ioctl commands then issues them on behalf of the caller to the device. This usually provides a cleaner interface, or hides multiple commands. ZAC/ZBD commands have been added.
Work on blockdev has been extensive, started by SUSE, and incorporated by Seagate.
IO scheduler
The IO scheduler elevator is responsible for deciding the order of writes to the disk. Existing elevators seek to rearrange IO with either nothing (noop elevator), or a combination of LBA seeks, priority, process-based scheduling, and time deadline. A new scheduler needs to be added to account for LBA sequentiality.
Work on the IO scheduler is yet to be scheduled.
md/mdraid
md will have 2 purposes: the first is to provide shims that interface between disk and apps; the second is to enable ZAC/ZBC-aware RAID.
There are 3 types of shims: one that provides conventional <--> HA/HM, one that provides a HA/HM <--> conventional interface, and another that provides a HA/HM<-->HA/HM interface. The first is a simulator for HA/HM running on a conventional drive. Of the remaining, the former blocks ZAC/ZBC from rising to upper layers, and the latter passes the information.
The conventional <--> HA/HM shim is the early phase of ZAC/ZBC adoption. It is a simulator, and has little expected value beyond that. The path that this shim represents is expected to be absorbed into the SD driver, allowing conventional disks to be presented as HA/HM.
Work on this shim was started by Seagate, but the project has been shelved because of the advancement of the other kernel work.
The HA/HM <--> conventional shim allows an HA/HM drive to be used with legacy/non-compliant applications (filesystems). As it presents the drive as a conventional drive to everything above it, it eliminates the need for further massive changes. This is a lasting stopgap measure until ZAC/ZBC is fully integrated into the stack and matured. It is also a solution for legacy filesystems that can't yet be obsoleted (e.g. FAT32 for EFI partitions). This shim works by transforming the filesystem into copy-on-write at the block device: what the filesystem believes its allocations are is completely different from what the drive sees as allocations. The shim maintains an LBA mapping table as metadata. During idle times, the shim can clean up the mappings (defragment) to improve read performance.
This shim seeks to allow all layers above it to work. RAID/LVM/FS will work as is.
Work on this shim has been significantly advanced by Seagate, but is not included as part of the SMRFFS project.
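The copy-on-write remapping performed by the HA/HM <--> conventional shim can be modeled minimally as below. The allocation policy (first zone with room) is an illustrative assumption:

```python
# Minimal model of the shim's copy-on-write mapping: every write,
# including an overwrite, lands at some zone's write pointer, and the
# logical->physical LBA table is the shim's persistent metadata.
class CowShim:
    def __init__(self, nzones, zone_len):
        self.zone_len = zone_len
        # Per-zone write pointers, one zone after another on the disk.
        self.wp = [z * zone_len for z in range(nzones)]
        self.map = {}   # logical LBA -> physical LBA

    def _alloc(self):
        # Assumption: allocate at the first zone with room at its pointer.
        for z, wp in enumerate(self.wp):
            if wp < (z + 1) * self.zone_len:
                self.wp[z] = wp + 1
                return wp
        raise IOError("device full: idle-time cleanup (defragment) needed")

    def write(self, logical_lba):
        # The previous physical block, if any, becomes garbage to reclaim.
        self.map[logical_lba] = self._alloc()

    def lookup(self, logical_lba):
        return self.map.get(logical_lba)
```

An overwrite never touches the old physical location, which is why the shim must defragment the mappings during idle time.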
The HA/HM <--> HA/HM shim provides ZAC/ZBC information up the stack. Because most of the functionality of this md shim mirrors what is already implemented in SD, there is little to do, except in combination with LVM/RAID. There is one reason this is needed: with multiple disks (or even one drive), zone information is not guaranteed to be identical. This shim, along with mdraid, will need to mangle (read: change) the reported zone information in a particular way.
This shim may not be strictly necessary, as the functionality of it can be fully absorbed into the consuming layers.
Work on this shim, and the associated layers, is expected to begin in August 2015.
LVM
Logical Volume Manager (LVM) is software that combines disks linearly, allowing the volume to appear to change size. The drives are concatenated end to end, resulting in a JBOD (Just a Bunch Of Disks) array. There is no guarantee that the underlying drives are identical, and in general, LVM doesn't care. However, to be presented as a single volume, the aggregation must be seamless. For ZAC/ZBD, this includes offsetting LBAs (as is current), but also aligning differing zone information (IOW the SAME field in the REPORT_ZONES response cannot be set for the information passed up, although it may be set for each individual drive). The LVM could have a mix of zones of different types and different sizes.
Work on LVM began in October 2015.
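The aggregation constraint above can be illustrated with a small sketch: each member's zone list is shifted by that device's offset in the volume, and the SAME indication is cleared whenever the merged zones differ (names and tuple layout are hypothetical):

```python
# Sketch of LVM-style zone aggregation for a linear (JBOD) volume.
# devices: list of (size_lbas, zones); each zone is (start, length, ztype).
def aggregate_zones(devices):
    """Returns (same_flag, merged zone list with volume-relative LBAs)."""
    merged, offset = [], 0
    for size, zones in devices:
        # Offset each member's zone starts by its position in the volume.
        merged += [(start + offset, length, ztype)
                   for start, length, ztype in zones]
        offset += size
    # SAME can only be reported if every zone has one length and type.
    same = len({(length, ztype) for _, length, ztype in merged}) <= 1
    return same, merged
```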
mdraid
mdraid will require extensive changes. The drives will be arranged in a way that will require a combination of 1) overlapping zones 2) striping zones and 3) parity.
Planning and design has begun.
Page Cache
We currently expect there is nothing to do for ZAC/ZBC in the page cache, except for the possibility of adding ordered stability to the pages as they enter the cache, go through the cache, and exit the cache.
Filesystem (EXT4)
Aside from possibly mdraid, the FS is the lowest application that chooses allocation. Everything below the FS seeks to honor the FS choice, and everything above cares little. The FS is most sensitive to ZAC/ZBD changes. Without the needed changes, existing FSes will either 1) simply fail or 2) have performance degradation. The FS now has a need to know about the logical/physical layout of the disk. FSes of yesteryear sought to optimize based on CHS information from the firmware. However, after FS creation and layout, that information was never queried again, and the FS is essentially drive agnostic. SMRFFS seeks to continue in the same tradition.
Upon creation, the FS is laid out in a way that mimics the underlying device: Block Groups are aligned to match the zones. Once created, the metadata in the FS mirrors the information in REPORT_ZONES at any given time (this removes massive performance penalties). The allocator is changed such that writes are no longer random, but rather follow forward-write-only rules. Upon mount, because following forward-write-only is critical, the allocation bitmaps are scanned and checked for accuracy against the REPORT_ZONES information. This one rule requires multiple algorithm additions and enhancements inside the FS. While this initially introduces 2 control paths in the FS (one for conventional drives, and another for ZAC/ZBC), we expect that the ZAC/ZBC path will absorb all use cases from the conventional path. ZAC/ZBC will work on a conventional drive (although some information -- zone start and length -- needs to be synthesized in SD).
Work on the FS (EXT4) is currently under development by Seagate. This is the SMRFFS project.
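The forward-write-only allocator rule can be sketched as follows. BlockGroup and the fall-through policy are illustrative, not the EXT4 implementation:

```python
# Sketch of the changed allocator: each block group mirrors one zone, and
# blocks are handed out only at the group's cached write pointer, so IO
# within a zone is always forward-write only.
class BlockGroup:
    def __init__(self, first_block, nblocks):
        self.first = first_block
        self.nblocks = nblocks
        self.wp = first_block      # mirrors the zone's write pointer

    def alloc(self, count):
        if self.wp + count > self.first + self.nblocks:
            return None            # extent doesn't fit; try the next BG
        start = self.wp
        self.wp += count           # allocation advances with the zone
        return start

def alloc_extent(groups, count):
    # Assumption: simple first-fit over the block groups.
    for bg in groups:
        start = bg.alloc(count)
        if start is not None:
            return start
    raise IOError("no space: garbage collection/compaction required")
```

Because each group only ever moves its pointer forward, free space behind a pointer is reclaimed by the garbage collector rather than reused in place.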
sysfs
Up to this point, all work has been committed to the kernelspace. There are utilities that work in userspace that will need the ZAC/ZBC information also. Many of these utilities take the place of the FS for a specific purpose. ZAC/ZBC zone information will need to be presented (and maintained) in sysfs from SD.
4. Userland Utilities
mke2fs
Worked on by Seagate
1. Add ZBD flag
Requires packed_meta_data
Requires extents*
Requires bigalloc
2. Query Zone information from disk
lay out BGs accordingly
handle multi-size BGs
3. SB/GD changes
4. New Extent layout
*incompatible with EXT2/3/4 indirect lists
December 2015
Various fixes aligning structures to zone boundaries: block groups and journal location/size.
hdparm
Finished: Reworked by Seagate
1. Query and report Drive type
2. Query and Report Zone Information
sdparm
1. Query and report Drive type
2. Query and Report Zone Information
gdisk (not fdisk)
1. New Defaults?
2. add ZBD flag
3. Query disk and suggest optimizations
4. handle zones with GPT (Not MBR with fdisk)
gparted (not parted)
1. New Defaults?
2. add ZBD flag
3. Query disk and suggest optimizations
4. handle zones with GPT (Not MBR with parted)
EXT4 Library (e2fsprogs)
1. Add ZBD structures
2. Update SB/GD structures
3. Add write-engine for write-in-place utilities
4. Add new journal support for write-in-place utilities
5. Add new allocator routine (same as in FS)
e2freefrag
No major changes
1. Add reporting recommendation to compact
dumpe2fs
1. Add reporting information for ZBD SB & GDs
e2undo
Obsolete. Will REQUIRE journal on ZBD, which makes this redundant.
e2image
1. Will need to write SB using write engine.
e2defrag
Needs to be gutted and rewritten
uses write-engine
uses allocator
1. Add defragmenter compatible with ZBD
2. Add compactor option(s)
Compact within zones (zone pack)
Compact zones (disk pack) (range)
3. Add new journal support (metadata)
tune2fs
use write-engine
1. Multisize BG support
2. Add options for new fields in SB & GDs
3. If needed, move/resize BGs, edit inodes (journal)
4. Re-write SB & GDs
resize2fs
use write-engine
Will not modify partitions
1. Add ZBD flag
2. Add support for multi-size BGs support
3. Will need to re-write SB
tune2fs
use write-engine
1. modify SB for ZBD options
2. Support for multi-size BGs
3. modify GD for size/condition/type
debugfs
use write-engine, and all functions in library
e4fschk
use write-engine
1. Add support for new options in SB & GD
2. Add new inode handling
3. Add new journal support
mdadm
possibly extensive rewrite
1. add ZBD support
5. Schedule
Internally, we have organized the project into 'releases' ranging from v0.1 to v0.8.
v0.1 Superficial changes with existing code (assume 256MiB zones)
mkfs options -b 4096 -C 8192 -E bigalloc,packed_meta_blocks=1,discard,num_backup_sb=0 -O extent,sparse_super2,^has_journal
Simulation of 8k blocks
No journal
v0.2 Minor FS changes
Add ZAC/ZBC bit flag in SB
Add internal structures to support ZAC/ZBC
Forward write only verification/tweaking
v0.3 Kernel IO stack changes
Update AHCI, libata, SCSI, SD
v0.4 Kernel IO stack communication
ioctls from SD to FS
v0.45 Improved updates to v0.4
v0.5 Major FS changes
New block allocator - forward-write only at Write Pointer
New journal
B+trees for metadata
New extents
New Garbage Collector/Defragmenter/Compactor
groundwork for multi-sized BGs
v0.6 Userland utilities
resize2fs
tune2fs
dumpe2fs
debugfs
e2freefrag
e2image
e2defrag
e4fschk
e2undo
mke2fs
hdparm
sdparm
gdisk
gparted
mdadm
others?
v0.7 RAID support
DM shim: HA/HM <--> conventional
DM shim: HA/HM <--> HA/HM
multi-sized BGs
LVM
mdraid
v0.8 Performance/Standards compliance
Add/verify/enforce HM requirements
Completed
v0.1
Developed, tested, released (tweaks still ongoing)
v0.2
Developed, tested, released
v0.3
Developed, tested, released
v0.4
Developed, tested, released, presented at Vault Storage Conference
v0.45
Incorporated code, released
In Progress
v0.5 - expected December 2015
Tweaking B+Tree code
Garbage collection development
To Be Done
v0.6 - expected December 2015
v0.7 - expected December 2015
v0.8 - expected TBA
Presentations/Speaking Engagements
Linux Storage and FileSystems/Memory Management Summit 2015
Linux Vault Conference 2015
Massive Storage Systems and Technology 2015
Linux Plumber's Conference 2015
SNIA Developer's Conference 2015
6. Patch Notes
ATA_IDE
Providing a base for future work, this patch updates files with code needed to provide ZAC support at the ATA layers. These changes allow basic communication with SATA ZAC/ZBD drives. The patches add the new ZAC/ZBD commands to the libraries, detect such drives, and determine to what degree they require management (NONE, Drive Managed -- if reported, Host Aware, Host Managed).
Changes include:
New ZAC/ZBC commands
Changes in taskfile to accommodate commands
Errors for ZAC/ZBC commands
Traces for ZAC/ZBC commands
Translations for ZAC/ZBC commands
Detection of drive type
SCSI_SAS
As the Linux stack assumes SCSI internally, commands for ZAC/ZBC (developed on SATA drives) must be implemented. These changes reside on top of the ATA_IDE changes. This patch only receives commands by code number and passes them along to lower levels; beyond defining the codes, there is no implementation of the SCSI commands.
SD
SD is the driver for the devices. ZAC/ZBD procedures have been added. Upon detection of a ZAC/ZBD device, the SD driver is responsible for issuing the zone commands and storing the results.
Without the lower layer patches, changes here would not take effect.
Changes require the setting of CONFIG_SCSI_ZBC
BLOCKDEV
This patch adds functionality for the management of ZAC/ZBC zones and exposes symbols upward. Compilation requires CONFIG_BLK_DEV_ZONED and CONFIG_BLK_ZONED.
EXT4
The EXT4 patch begins to define the structures needed for ZAC/ZBC use. The goal is to manage the zones on the SMR drive via management of the BGs.
7. Installation
Against kernel 4.2.0, apply each patch with git apply (for other kernel versions, some conflicts may need to be resolved).
Alternatively, compile the provided kernel source, which already includes the patches.
Compile and install the kernel as per normal procedures.
8. FAQs
What's the difference between SMR solutions?
There are 4 formats: No format, Drive Managed (DM), Host Aware (HA) and Host Managed (HM)
No format is conceptual SMR: forward write only. (Period). Think of it like tape.
DM: The drive is presented to the OS as a conventional drive. The drive hides all implementation of the forward-write only work and allows random writes by the OS. Under certain workloads, this has performance problems. Current software will work on these drives.
HM: The drive is presented to the OS as a new device type. The drive requires the OS to make proper IO choices and follow the rules; anything not conforming to the rules is returned as an error. By following the rules, high performance is expected. All currently existing software (filesystems) will break on HM.
HA: The drive uses and expects HM rules, but offers flexibility instead of an error when non-conformant writes are received. Current software will work, but not optimally on these drives.
What's the difference between ZAC and ZBC?
Zoned Block Commands (ZBC)
Zoned-device ATA Commands (ZAC)
Both standards have the same commands and return types, for the same purpose. ZAC is ATA and ZBC is SCSI.
9. Use Cases
SMR ZAC/ZBC drives are currently slated for an archive market, with future work required for other use cases.
a) backup systems:
This case is Write Once, Read Many (if ever) -- WORM. Data is written strictly sequentially, either using no filesystem or a log-structured FS. Data can be written through to the end of the disk.
10. Future/additional Work
Zoned Device Mapper (https://github.com/Seagate/ZDM-Device-Mapper)
A HA/HM drive to conventional mapping. Uses CoW to allow any legacy FS to work on SMR drives.
SMR Multidrive
Building on top of SMRFFS and ZDM, Seagate is looking to incorporate SMR RAID solutions into the stack.
11. Feedback
Skepticism
This is a large project with major changes throughout the IO stack, and it is expected to gain broad acceptance as SMR drives saturate the market. Until then, there is skepticism in the community; this is expected for a change of this scale. We weigh this feedback from the open source community, along with research from proprietary vendors and the direction of other FS projects, and conclude that this is a needed and beneficial project for the next generation of storage technology.
Contact
Post questions on github, or send email to maintainer (adrian.palmer@seagate.com).
12. Legal
Releases will be available at http://www.github.com/seagate/SMR_FS-EXT4
How is Seagate cooperative in this project?
Under the GPLv2 license, Seagate is willing to share code with partners who will contribute to Seagate's efforts as Seagate contributes to the community. Seagate is actively seeking help, from corporations or individuals. Please contact the author to provide assistance.
Seagate seeks no revenue directly from this filesystem. It is given as a gift to the community.
Seagate's modifications to EXT4 are distributed under the GPLv2 license "as is," without technical support, and WITHOUT ANY WARRANTY, without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
You should receive a copy of the GNU General Public License along with any updates/patches. If not, see <http://www.gnu.org/licenses/>.