🍞 no_std Unicode case folding comparisons

artichoke/focaccia

focaccia

Unicode case folding methods for case-insensitive string comparisons. Focaccia is used to implement case folding operations on the Symbol and String classes in the Ruby Core implementation in Artichoke Ruby.

Focaccia supports full, ASCII, and Turkic Unicode case folding equality and ordering comparisons.

One of the most common things that software developers do is "normalize" text for the purposes of comparison. And one of the most basic ways that developers are taught to normalize text for comparison is to compare it in a "case insensitive" fashion. In other cases, developers want to compare strings in a case sensitive manner. Unicode defines upper, lower, and title case properties for characters, plus special cases that impact specific language's use of text. (W3C, Case Folding)

focaccia is a flat Italian bread. The focaccia crate compares UTF-8 strings by flattening them to folded downcase. Artichoke goes well with focaccia.

Usage

Add this to your Cargo.toml:

[dependencies]
focaccia = "1.4.0"

Then make case-insensitive string comparisons like:

use core::cmp::Ordering;
use focaccia::CaseFold;

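let fold = CaseFold::Full;
// Full folding expands "ß" to "ss", so these strings compare equal.
assert_eq!(fold.casecmp("MASSE", "Maße"), Ordering::Equal);
// Folding does not strip diacritics: "ã" sorts after "a".
assert_eq!(fold.casecmp("São Paulo", "Sao Paulo"), Ordering::Greater);

assert!(fold.case_eq("MASSE", "Maße"));
assert!(!fold.case_eq("São Paulo", "Sao Paulo"));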

For text known to be ASCII, Focaccia can perform a faster comparison:

use core::cmp::Ordering;
use focaccia::CaseFold;

let fold = CaseFold::Ascii;
assert_eq!(fold.casecmp("Crate: focaccia", "Crate: FOCACCIA"), Ordering::Equal);
assert_eq!(fold.casecmp("Fabled", "failed"), Ordering::Less);

assert!(fold.case_eq("Crate: focaccia", "Crate: FOCACCIA"));
assert!(!fold.case_eq("Fabled", "failed"));

ASCII case-insensitive comparisons also work on byte slices:

use core::cmp::Ordering;
use focaccia::{ascii_casecmp, ascii_case_eq};

assert_eq!(ascii_casecmp(b"Artichoke Ruby", b"artichoke ruby"), Ordering::Equal);
assert!(ascii_case_eq(b"Artichoke Ruby", b"artichoke ruby"));
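
As a rough illustration of what an ASCII-only comparison involves, here is a minimal sketch that uses only core. The function name sketch_ascii_casecmp is hypothetical and is not part of Focaccia's API. It assumes both sides are lowercased before comparing, and the crate's own implementation may differ in detail:

```rust
use core::cmp::Ordering;

/// Hypothetical helper, not part of focaccia: lowercases ASCII letters on
/// both sides and compares the resulting bytes lexicographically.
/// Non-ASCII bytes are left untouched and compared as-is.
fn sketch_ascii_casecmp(left: &[u8], right: &[u8]) -> Ordering {
    let left = left.iter().map(u8::to_ascii_lowercase);
    let right = right.iter().map(u8::to_ascii_lowercase);
    // Iterator::cmp compares element by element, then by length.
    left.cmp(right)
}
```

Because only the bytes for A through Z are remapped, no Unicode lookup is needed, which is why an ASCII-only mode can be cheaper than full folding.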

Turkic case folding is similar to full case folding with additional mappings for dotted and dotless I:

use core::cmp::Ordering;
use focaccia::CaseFold;

let fold = CaseFold::Turkic;
assert_eq!(fold.casecmp("İstanbul", "istanbul"), Ordering::Equal);
assert_ne!(fold.casecmp("İstanbul", "Istanbul"), Ordering::Equal);

assert!(fold.case_eq("İstanbul", "istanbul"));
assert!(!fold.case_eq("İstanbul", "Istanbul"));
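
The results above follow from the two Turkic (T) entries in CaseFolding.txt: U+0049 (I) folds to U+0131 (dotless ı), and U+0130 (İ) folds to U+0069 (i). A minimal sketch of just those two mappings, using a hypothetical helper named sketch_turkic_fold that is not part of Focaccia's API:

```rust
/// Hypothetical helper, not part of focaccia: applies only the two
/// Turkic (`T`) mappings from CaseFolding.txt and passes every other
/// character through unchanged.
fn sketch_turkic_fold(ch: char) -> char {
    match ch {
        'I' => '\u{0131}', // U+0049 LATIN CAPITAL LETTER I -> dotless i
        '\u{0130}' => 'i', // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE -> i
        other => other,
    }
}
```

Under these mappings, İ and i fold to the same character while I folds to a different one, which is why "İstanbul" matches "istanbul" but not "Istanbul".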

Implementation

Focaccia generates conversion tables from Unicode data files. Focaccia implements case folding as defined in the Unicode standard (see CaseFolding.txt).
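
To illustrate the general idea of table-driven folding, here is a minimal sketch built from four entries taken from CaseFolding.txt. The names SKETCH_TABLE and sketch_lookup are hypothetical, and the sorted-slice layout is an assumption made for illustration. Focaccia's generated code may be organized quite differently:

```rust
/// Hypothetical excerpt of a folding table, sorted by code point.
/// Each entry maps one character to its folded replacement string.
static SKETCH_TABLE: &[(char, &str)] = &[
    ('A', "a"),                // 0041; C; 0061
    ('\u{00DF}', "ss"),        // 00DF; F; 0073 0073
    ('\u{0130}', "i\u{0307}"), // 0130; F; 0069 0307
    ('\u{FB01}', "fi"),        // FB01; F; 0066 0069
];

/// Hypothetical helper: binary-searches the sorted table and returns the
/// folded replacement, or `None` if the character has no entry here.
fn sketch_lookup(ch: char) -> Option<&'static str> {
    SKETCH_TABLE
        .binary_search_by_key(&ch, |&(key, _)| key)
        .ok()
        .map(|idx| SKETCH_TABLE[idx].1)
}
```

A static table like this needs no heap allocation, which is compatible with a no_std configuration that does not link to alloc.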

no_std

Focaccia is no_std compatible, with an optional dependency on std that is enabled by default. Focaccia does not link to alloc in its no_std configuration.

Crate features

All features are enabled by default.

  • std - Enable linking to the Rust Standard Library. Enabling this feature adds Error implementations to error types in this crate.
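
To build without std, one option is Cargo's standard default-features switch. This is a sketch of the dependency entry, reusing the version shown in the Usage section:

```toml
[dependencies]
# Turn off the default `std` feature to build in a no_std configuration.
focaccia = { version = "1.4.0", default-features = false }
```

With the std feature disabled, the Error implementations described above should not be available on the crate's error types.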

Minimum Supported Rust Version

This crate requires at least Rust 1.56.0. This version can be bumped in minor releases.

Unicode Version

Focaccia implements Unicode case folding with the Unicode 15.0.0 case folding ruleset.

Each new release of Unicode may bring updates to CaseFolding.txt, which is the source for the folding mappings in this crate. Updates to the case folding rules will be accompanied by a minor version bump.

License

focaccia is licensed under the MIT License (c) Ryan Lopopolo.

focaccia includes Unicode Data Files which are subject to the Unicode Terms of Use and Unicode Data Files and Software License (c) 1991-2022 Unicode, Inc.

Generated files in this repository are marked with // @generated comments and the Unicode copyright. These generated files incorporate data derived from the Unicode Data Files. More details about the generation process can be found in scripts/gen_case_lookups.rb. The generated sources created by this script are subject to both the MIT License contained in this repository and the Unicode Data Files and Software License.