Internationalization self test for the EPUB 3.3 spec

Ivan Herman edited this page Mar 3, 2021 · 18 revisions

Background information for reviewers

The particularity of EPUB is its structure (see also the overview diagram). To a first approximation, an EPUB instance is a packaged Web site. The real content is in XHTML + CSS, SVG + CSS, possibly MathML, images, etc. The EPUB specification does not redefine these content-specific formats; it only refers to them. This also means that most of the internationalization features on, say, typography, text search, writing directions, localization of items like names or dates, etc. depend on the i18n features of those formats, and EPUB takes these for granted. Because those formats are the subjects of W3C specifications with a rigorous i18n review, it is not necessary to perform those i18n reviews for the EPUB specification proper.

EPUB does add additional information and structure to the collection of content files. These are:

  1. A set of XML files on the physical packaging format, called the Open Container Format.
  2. A navigation document that is used by a reading system to display the table of contents. The navigation file is defined to be in XHTML format, using standard markup.
  3. A package document, essentially a set of metadata items that governs the behaviors of “Reading Systems”, i.e., the piece of software and hardware that presents the EPUB content to the end user. This document is defined in XML.

The Open Container Format has no user-facing or textual content; it is therefore irrelevant as far as i18n is concerned. Also, because the navigation document is in XHTML, the aforementioned comment applies to it, too: its internationalization and localization features depend on the i18n features of XHTML + CSS.

The package document, however, does include textual information that, directly or indirectly, does influence the behavior of reading systems and needs an i18n review. In other words this (self) review is done on the i18n features of the package document.

A further fact about the package document that is important for this review: the “textual” elements, i.e., those XML elements within the package document that contain natural text (title, creator, subject, accessibility summary, etc.) are not specified directly by the EPUB specification either. Those elements are all either Dublin Core metadata terms and elements, controlled by the Dublin Core™ Metadata Initiative (DCMI), or schema.org elements, controlled by the schema.org process. The EPUB specification “just” uses them. The “native” elements in the specification, i.e., XML elements defined in a namespace that is controlled by this Specification, are all “structural” elements, e.g., links, so-called spine items, etc., that do not contain textual content.
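To make the split between Dublin Core elements and EPUB-native structural elements concrete, the following minimal sketch parses an illustrative package document with Python's standard library. The sample identifiers, file names, and the helper name `local_name` are assumptions made for this example; they do not come from the specification.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Illustrative package document; all values are placeholders.
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0"
         unique-identifier="pub-id" xml:lang="en">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>An Example Title</dc:title>
    <dc:creator>Jane Example</dc:creator>
    <dc:language>en</dc:language>
    <meta property="dcterms:modified">2021-03-03T00:00:00Z</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="c1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="c1"/>
  </spine>
</package>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.split("}", 1)[-1]


root = ET.fromstring(PACKAGE)

# Elements borrowed from Dublin Core (defined by DCMI, not by EPUB).
dublin_core = [local_name(e.tag) for e in root.iter()
               if e.tag.startswith("{" + DC_NS + "}")]

# Elements defined in the namespace controlled by the EPUB specification.
epub_native = [local_name(e.tag) for e in root.iter()
               if e.tag.startswith("{" + OPF_NS + "}")]

print("Dublin Core:", dublin_core)
print("EPUB native:", epub_native)
```

In this sample the natural-language strings (title, creator) all sit in Dublin Core elements, while the native elements carry structure such as the manifest and the spine.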

Specification structure

The EPUB 3.3 specification consists of three Recommendations in preparation:

  • EPUB 3.3 specifies the content structure of an EPUB 3.3 document. This is the core specification for the authors of an EPUB 3.3 publication, and specifies the features described above.
  • EPUB 3.3 Reading Systems specifies the conformance requirements for EPUB 3.3 reading systems, which comprise standalone reading applications and software embedded in a reading device, as well as browser extensions.
  • EPUB Accessibility 1.1 specifies the content conformance requirements for verifying the accessibility of EPUB publications.

Note that the EPUB 3.3 Reading Systems specification says, as part of the conformance requirements:

It MUST honor all presentation logic expressed through the Package Document [EPUB-33] (e.g., the reading order, fallback chains, page progression direction and fixed layouts).

As for the EPUB Accessibility 1.1 document, it concentrates on accessibility requirements for EPUB Publications; in some sense, its relation to the other specification is a bit like the relationship of the WCAG Recommendations to HTML.

As a consequence, each check refers primarily to the EPUB 3.3 specification itself; unless otherwise stated, the quote above covers the checks for the EPUB 3.3 Reading Systems and EPUB Accessibility 1.1 documents as well.


Short checklist

Using the short i18n review checklist, the following items are relevant for the EPUB specification:

  1. If the spec (or its implementation) contains any natural language text that will be read by a human (this includes error messages or other UI text, JSON strings, etc, etc).
  2. If the spec (or its implementation) deals with time in any way that will be read by humans and/or crosses time zone boundaries.
  3. If the spec (or its implementation) defines markup.
  4. If the spec (or its implementation) deals with names, addresses, time & date formats, etc.
  5. If the spec (or its implementation) describes a format or data that is likely to need localization.
  6. If the spec (or its implementation) makes any reference to or relies on any cultural norms.

These items yield the detailed checklist below.


Detailed checklist

This checklist was extracted from the i18n self-review checklist, using the outcome of the short checklist above.

Language

Language basics

  1. It should be possible to associate a language with any piece of natural language text that will be read by a user. more

    EPUB 3.3 Check: There are two settings:

    1. The XML xml:lang attribute is used for the package document and its enclosed (XML) elements, indicating the language for the metadata items (title, publishers, etc.). See section on shared attributes in the spec. For those, the xml:lang specification applies.
    2. The separate dc:language element specifies the language of the publication, which may control the search and categorization features, but also the user interface provided by the Reading System. This value is not inherited by the content documents, which must set the language locally (according to the HTML5 or SVG rules).
  2. Where possible, there should be a way to label natural language changes in inline text. more

    EPUB 3.3 Check: The xml:lang attribute is applicable to all metadata elements that have a textual content.

  3. Consider whether it is useful to express the intended linguistic audience of a resource, in addition to specifying the language used for text processing. more

    EPUB 3.3 Check: The package document includes the dc:language element: “specifies the language of the content of the EPUB Publication” (as opposed to the language for the metadata entries). This element is REQUIRED in the package document. Note that the package document may contain several dc:language elements; this is used, e.g., for multi-language publications.

  4. A language declaration that indicates the text processing language for a range of text must associate a single language value with a specific range of text. more

    EPUB 3.3 Check: this is covered by the xml:lang attribute (a "range" of text being a single metadata element in this context).

  5. Use the HTML lang and XML xml:lang language attributes where appropriate to identify the text processing language, rather than creating a new attribute or mechanism. more

    EPUB 3.3 Check: xml:lang is used when appropriate.

  6. It should be possible to associate a metadata-type language declaration (which indicates the intended use of the resource rather than the language of a specific range of text) with multiple language values. more

    EPUB 3.3 Check: The package document may contain several dc:language elements; this may be used for multi-language publications (with the first language element considered to be the “primary” language).

  7. Attributes that express the language of external resources should not use the HTML lang and XML xml:lang language attributes, but should use a different attribute when they represent metadata (which indicates the intended use of the resource rather than the language of a specific range of text). more

    EPUB 3.3 Check: See specification of the link element which introduces the hreflang attribute when linking from the package document.
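The interplay of xml:lang, dc:language, and hreflang described in the checks above can be sketched as follows. This is a minimal illustration that assumes the usual XML inheritance rule for xml:lang; the sample metadata, the link target, and the function name `effective_languages` are made up for the example.

```python
import xml.etree.ElementTree as ET

OPF = "{http://www.idpf.org/2007/opf}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative package document fragment; all values are placeholders.
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" xml:lang="en">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Title</dc:title>
    <dc:title xml:lang="fr">Titre d'exemple</dc:title>
    <dc:language>en</dc:language>
    <link rel="record" href="https://example.org/record.xml"
          media-type="application/marc" hreflang="de"/>
  </metadata>
</package>"""


def effective_languages(element, inherited=None):
    """Yield (element name, text, language) for every element with text.

    xml:lang set on an element overrides the value inherited from its
    ancestors; otherwise the inherited value applies.
    """
    lang = element.get(XML_LANG, inherited)
    text = (element.text or "").strip()
    if text:
        yield element.tag.split("}", 1)[-1], text, lang
    for child in element:
        yield from effective_languages(child, lang)


root = ET.fromstring(PACKAGE)
resolved = list(effective_languages(root))
for name, text, lang in resolved:
    print(f"{name}: {text!r} is tagged as {lang}")

# The language of a linked external resource is metadata about that resource,
# so it is carried by hreflang rather than by xml:lang.
link_language = root.find(".//" + OPF + "link").get("hreflang")
```

The first title inherits "en" from the package element, while the second overrides it locally with "fr".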

Defining language values

  1. Values for language declarations must use BCP 47. more

    EPUB 3.3 Check: The value of xml:lang is defined by the relevant section of the XML specification (referring to BCP47). The value of the dc:language element is defined to be BCP47 by DCMI.

  2. Refer to BCP 47, not to RFC 5646. more

    EPUB 3.3 Check: BCP47 is used.

  3. Be specific about what level of conformance you expect for language tags: BCP 47 defines two levels of conformance, "valid" and "well-formed".

    EPUB 3.3 Check: dc:language is specified to be well-formed per BCP47 by DCMI; this is reinforced in the EPUB specification. The same is done for xml:lang.

  4. Specifications may require implementations to check if language tags are "valid", but in most circumstances should only require that the language tags be "well-formed".

    EPUB 3.3 Check: dc:language is specified to be well-formed per BCP47 by DCMI; this is reinforced in the EPUB specification. The same is done for xml:lang.

  5. Specifications should require content and content authors to use "valid" language tags.

    EPUB 3.3 Check (Negative): dc:language is specified to be well-formed per BCP47 by DCMI; this is reinforced in the EPUB specification. There is no requirement to use "valid" language tags. See also issue 1509, which details the reasons (mostly on the role of epubcheck).

  6. Reference BCP47 for language tag matching.

    EPUB 3.3 Check: BCP47 is used.
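The difference between "well-formed" and "valid" can be illustrated with a purely syntactic check. The sketch below is a simplified rendering of the BCP 47 `langtag` grammar as a regular expression; it deliberately omits the grandfathered tags, and it cannot decide validity, because that requires looking up every subtag in the IANA Language Subtag Registry. The name `is_well_formed` is an assumption of this example, not something defined by the EPUB specification or by epubcheck.

```python
import re

# Simplified BCP 47 syntax: language, optional script, optional region,
# any number of variants and extensions, optional private-use sequence.
# Grandfathered tags (e.g. "i-klingon") are intentionally not covered.
LANGTAG = re.compile(r"""
    (?:
        (?: [a-z]{2,3} (?: -[a-z]{3} ){0,3}                 # language + extlang
          | [a-z]{4}                                        # reserved
          | [a-z]{5,8} )                                    # registered language
        (?: -[a-z]{4} )?                                    # script
        (?: -(?: [a-z]{2} | [0-9]{3} ) )?                   # region
        (?: -(?: [a-z0-9]{5,8} | [0-9][a-z0-9]{3} ) )*      # variants
        (?: -[0-9a-wy-z] (?: -[a-z0-9]{2,8} )+ )*           # extensions
        (?: -x (?: -[a-z0-9]{1,8} )+ )?                     # private use
      | x (?: -[a-z0-9]{1,8} )+                             # private-use-only tag
    )
""", re.VERBOSE | re.IGNORECASE)


def is_well_formed(tag):
    """Return True if the tag matches the (simplified) BCP 47 syntax."""
    return LANGTAG.fullmatch(tag) is not None


for candidate in ["en", "zh-Hant-TW", "de-CH-1996", "en_US", "e"]:
    print(candidate, is_well_formed(candidate))
```

A syntactically correct tag may still use subtags that are not registered; only a registry lookup, which this sketch does not attempt, would detect that.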

Declaring language at the resource level

  1. The specification should indicate how to define the default text-processing language for the resource as a whole. more

    EPUB 3.3 Check: For the metadata entries, the xml:lang processing model applies. For the publication as a whole, the package document MUST include a valid dc:language element.

  2. Content within the resource should inherit the language of the text-processing declared at the resource level, unless it is specifically overridden.

    EPUB 3.3 Check: This is the xml:lang processing model.

  3. Consider whether it is necessary to have separate declarations to indicate the text-processing language versus metadata about the expected use of the resource. more

    EPUB 3.3 Check: This is the purpose of the separation among the xml:lang attribute, the dc:language element, and the language settings in the separate content documents.

  4. If there is only one language declaration for a resource, and it has more than one language tag as a value, it must be possible to identify the default text-processing language for the resource. more

    EPUB 3.3 Check: n/a. The xml:lang attribute can only take a single value. For dc:language, in case several values are used, the first one is considered to be the "main". (See spec text.)
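As a minimal sketch of the "first dc:language is the main one" rule mentioned above, the following code collects all dc:language values in document order and treats the first as primary. The function names and the sample package document are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# Illustrative multi-language publication; values are placeholders.
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Bilingual Example</dc:title>
    <dc:language>en</dc:language>
    <dc:language>fr</dc:language>
  </metadata>
</package>"""


def publication_languages(package_xml):
    """Return every dc:language value, in document order."""
    root = ET.fromstring(package_xml)
    return [(el.text or "").strip() for el in root.iter(DC + "language")]


def primary_language(package_xml):
    """Return the first dc:language value, treated as the primary language."""
    languages = publication_languages(package_xml)
    if not languages:
        # dc:language is REQUIRED in the package document.
        raise ValueError("package document has no dc:language element")
    return languages[0]


print(publication_languages(PACKAGE), primary_language(PACKAGE))
```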

Establishing the language of a content block

  1. By default, blocks of content should inherit any text-processing language set for the resource as a whole. more

    EPUB 3.3 Check: n/a for the metadata values, except that the package-level value of xml:lang can be overridden if it is explicitly specified on an (XML) element.

  2. It should be possible to indicate a change in language for blocks of content where the language changes. more

    EPUB 3.3 Check: xml:lang processing does that.

Establishing the language of inline runs

  1. It should be possible to indicate language for spans of inline text where the language changes. more

    EPUB 3.3 Check (Negative): The content of the relevant metadata items (title, authors, accessibility summary, etc.) is defined as a string by DCMI or schema.org. The content is in Unicode, which means that bidi should be used, but no internal structure can be defined.

Text direction

Basic requirements

  1. It must be possible to indicate base direction for each individual paragraph-level item of natural language text that will be read by someone. more

    EPUB 3.3 Check:

    1. The top level element in the package document, as well as the elements with a text content, can use the dir attribute, with possible values of ltr, rtl, or auto.
    2. The spine element, which lists the reading order of the content, has the optional page-progression-direction attribute that sets the direction at the publication level (e.g., for the placement of the table of contents by the Reading System or any other user interface feature).
  2. It must be possible to indicate base direction changes for embedded runs of inline bidirectional text for all natural language text that will be read by someone. more

    EPUB 3.3 Check (Negative): The content of the relevant metadata items (title, authors, etc.) is defined as a string by DCMI or schema.org. The content is in Unicode, which means that bidi should be used, but no internal structure can be defined.

  3. Annotating right-to-left text must require the minimum amount of effort for people who work natively with right-to-left scripts. more

    EPUB 3.3 Check: n/a. EPUB does not define any annotation behavior.
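The two direction settings named above (dir on metadata elements, page-progression-direction on the spine) can be read from a package document as in the following sketch. It assumes that a dir value on the package element applies to descendants that do not set their own, and that an absent page-progression-direction is treated as "default"; the sample content is illustrative.

```python
import xml.etree.ElementTree as ET

OPF = "{http://www.idpf.org/2007/opf}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Illustrative package document fragment; values are placeholders.
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0"
         dir="ltr" xml:lang="en">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title dir="rtl" xml:lang="ar">مثال</dc:title>
    <dc:creator>Jane Example</dc:creator>
  </metadata>
  <spine page-progression-direction="rtl">
    <itemref idref="c1"/>
  </spine>
</package>"""

root = ET.fromstring(PACKAGE)
package_dir = root.get("dir")

# Base direction of individual metadata strings: a local dir wins,
# otherwise the package-level value is used (assumed inheritance).
title_dir = root.find(".//" + DC + "title").get("dir", package_dir)
creator_dir = root.find(".//" + DC + "creator").get("dir", package_dir)

# Direction of the publication as a whole, set on the spine.
progression = root.find(OPF + "spine").get("page-progression-direction", "default")

print(title_dir, creator_dir, progression)
```

Note that the two settings are independent: the first concerns how a metadata string is displayed, the second how the Reading System lays out and turns pages.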

Background information

  1. Do not assume that direction can be determined from language information. more

    EPUB 3.3 Check: this is covered by the definition of the dir attribute
    EPUB 3.3 Reading Systems Check: this is covered by the dir attribute processing

Base direction values

  1. Values for the default base direction should include left-to-right, right-to-left, and auto. more

    EPUB 3.3 Check: this is covered by the definition of the dir attribute

Handling direction in markup

The content of this section is not relevant for EPUB, because the metadata in a package document is only a collection of strings; no markup is defined.

  1. The spec should indicate how to define a default base direction for the resource as a whole, ie. set the overall base direction. more

    EPUB 3.3 Check: n/a.

  2. The default base direction, in the absence of other information, should be LTR. more

    EPUB 3.3 Check: n/a.

  3. The content author must be able to indicate parts of the text where the base direction changes. At the block level, this should be achieved using attributes or metadata, and should not rely on Unicode control characters.

    EPUB 3.3 Check: n/a.

  4. It must be possible to also set the direction for content fragments to auto. This means that the base direction will be determined by examining the content itself.

    EPUB 3.3 Check: n/a.

  5. If the overall base direction is set to auto for plain text, the direction of content paragraphs should be determined on a paragraph by paragraph basis.

    EPUB 3.3 Check: n/a.

  6. To indicate the sides of a block of text relative to the start and end of its contained lines, you should use 'before' and 'after' (maybe block-start/block-end – the terminology is changing), rather than 'top' and 'bottom'.

    EPUB 3.3 Check: n/a.

  7. To indicate the start/end of a line you should use 'start' and 'end' rather than 'left' and 'right'.

    EPUB 3.3 Check: n/a.

  8. Provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.

    EPUB 3.3 Check: n/a.

Handling base direction for strings

  1. Provide metadata constructs that can be used to indicate the base direction of any natural language string. more

    EPUB 3.3 Check: This is the role of the dir attribute.

  2. Specify that consumers of strings should use heuristics, preferably based on the Unicode Standard first-strong algorithm, to detect the base direction of a string except where metadata is provided. more

    EPUB 3.3 Check: covered by the dir attribute specification
    EPUB 3.3 Reading System Check: covered by the dir attribute behavior

  3. Where possible, define a field to indicate the default direction for all strings in a given resource or document. more

    EPUB 3.3 Check: This is the role of the dir attribute.

  4. Do NOT assume that creating a document-level default without the ability to change direction for any string is sufficient. more

    EPUB 3.3 Check: The dir attribute can be set on all elements.

  5. If metadata is not available due to legacy implementations and cannot otherwise be provided, specifications MAY allow a base direction to be interpolated from available language metadata. more

    EPUB 3.3 Check: n/a

  6. Specifications MUST NOT require the production or use of paired bidi controls. more

    EPUB 3.3 Check: The specification does not go into these details.
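The first-strong heuristic referred to in item 2 can be sketched with Python's unicodedata module. This is a simplified version: it looks for the first character whose bidi class is L, R, or AL, and it does not skip characters enclosed in directional isolates as the full Unicode Bidirectional Algorithm would. The function names, and the choice to treat a missing dir value like auto, are assumptions of this example.

```python
import unicodedata


def first_strong_direction(text, fallback="ltr"):
    """Guess the base direction from the first strongly directional character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character found (e.g. digits or punctuation only).
    return fallback


def resolve_base_direction(dir_value, text):
    """Use explicit metadata when present, otherwise fall back to the heuristic."""
    if dir_value in ("ltr", "rtl"):
        return dir_value
    return first_strong_direction(text)


print(resolve_base_direction(None, "שלום world"))
print(resolve_base_direction("ltr", "שלום world"))
```

The second call shows why explicit metadata matters: the author can state a base direction that the heuristic alone would not produce.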

Setting base direction for inline or substring text

There is no mechanism to set inline directionality in the metadata elements beyond what Unicode provides and beyond what can be set for the metadata item as a whole. All relevant elements have been defined by DCMI or schema.org, and this specification cannot change them by adding internal XML or HTML structures. Bidi handling should rely on the Unicode RLM/LRM mark characters.

  1. It must be possible to indicate spans of inline text where the base direction changes. If markup is available, this is the preferred method. Otherwise your specification must require that Unicode control characters are recognized by the receiving application, and correctly implemented.

    EPUB 3.3 Check: The reference is to the core BIDI, which covers this.
    EPUB 3.3 Reading System Check: The processing behavior of dir specifies this.

  2. It must be possible to also set the direction for a span to auto. This means that the base direction will be determined by examining the content itself. A typical approach here would be to set the direction based on the first strong directional character outside of any markup. more

    EPUB 3.3 Check: n/a. There is no extra markup for the metadata items.

  3. If users use Unicode bidirectional control characters, the isolating RLI/LRI/FSI with PDI characters must be supported by the application and recommended (rather than RLE/LRE with PDF) by the spec.

    EPUB 3.3 Check: The reference is to the core BIDI, which covers this.
    EPUB 3.3 Reading System Check: The processing behavior of dir specifies this.

  4. Use of RLM/LRM should be appropriate, and expectations of what those controls can and cannot do should be clear in the spec. more

    EPUB 3.3 Check: The reference is to the core BIDI, which covers this.
    EPUB 3.3 Reading System Check: The processing behavior of dir specifies this.

  5. For markup, provide dedicated attributes for control of base direction and bidirectional overrides; do not rely on the user applying style properties to arbitrary markup to achieve bidi control.

    EPUB 3.3 Check: This is the role of the dir attribute, but only on the full metadata item.

  6. For markup, allow bidi attributes on all inline elements in markup that contain text.

    EPUB 3.3 Check: This is the role of the dir attribute, but only on the full metadata item.

  7. For markup, provide attributes that allow the user to (a) create an embedded base direction or (b) override the bidirectional algorithm altogether; the attribute should allow the user to set the direction to LTR or RTL or the aforementioned Auto in either of these two scenarios.

    EPUB 3.3 Check: This is the role of the dir attribute, but only on the full metadata item.
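Because metadata strings cannot carry inline markup, an author who needs an embedded run with a different direction has only Unicode control characters available. The sketch below wraps a substring in the isolating controls (LRI, RLI, or FSI, closed by PDI) that item 3 recommends over RLE/LRE with PDF. The helper name `isolate` and the sample title are assumptions for illustration.

```python
# Unicode directional isolate controls.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

INITIATORS = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate(text, direction="auto"):
    """Wrap text in an isolate so it does not affect the surrounding text.

    direction is "ltr", "rtl", or "auto" (first-strong detection by the
    consumer of the string).
    """
    return INITIATORS[direction] + text + PDI


# A left-to-right title that embeds a right-to-left phrase.
title = "Review of " + isolate("כותרת", "rtl") + " (2nd edition)"
print(repr(title))
```

The resulting string is plain text and can be stored in a dc:title element as is; correct display still depends on the Reading System implementing the Unicode Bidirectional Algorithm.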

Referencing the Unicode Standard

  1. Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646. more

    EPUB 3.3 Check: the reference is: “The Unicode Standard. Unicode Consortium. URL: https://www.unicode.org/versions/latest/”

  2. A generic reference to the Unicode Standard MUST be made if it is desired that characters allocated after a specification is published are usable with that specification. A specific reference to the Unicode Standard MAY be included to ensure that functionality depending on a particular version is available and will not change over time. more

    EPUB 3.3 Check: see above.

  3. All generic references to the Unicode Standard MUST refer to the latest version of the Unicode Standard available at the date of publication of the containing specification. more

    EPUB 3.3 Check: the reference above is the only reference in the spec.

  4. All generic references to ISO/IEC 10646 MUST refer to the latest version of ISO/IEC 10646 available at the date of publication of the containing specification. more

    EPUB 3.3 Check: the reference above is the only reference in the spec.

Markup & syntax

Defining elements and attributes

  1. Do not define attribute values that will contain user readable content. Use elements for such content. more

    EPUB 3.3 Check: no such attribute is defined.

  2. If you do define attribute values containing user readable content, provide a means to indicate directional and language information for that text separately from the text contained in the element.

    EPUB 3.3 Check: n/a

  3. Provide a way for authors to annotate arbitrary inline content using a span-like element or construct. more

    EPUB 3.3 Check: For metadata items that may have translations and/or alternate script representation, the specification provides a way to repeat the content in different languages and scripts using the refines mechanism, see example in the spec.
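The refines mechanism mentioned in the last check can be sketched as follows: a meta element points at a metadata item by its id and repeats the content in another language or script. The sample uses the alternate-script property; the names, the ids, and the helper function `alternate_scripts` are illustrative assumptions, not text taken from the specification.

```python
import xml.etree.ElementTree as ET

OPF = "{http://www.idpf.org/2007/opf}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative metadata; the creator name is only an example value.
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" xml:lang="en">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:creator id="creator">Haruki Murakami</dc:creator>
    <meta refines="#creator" property="alternate-script" xml:lang="ja">村上 春樹</meta>
  </metadata>
</package>"""


def alternate_scripts(root, target_id):
    """Map language tag -> alternate representation of the refined item."""
    found = {}
    for meta in root.iter(OPF + "meta"):
        if (meta.get("refines") == "#" + target_id
                and meta.get("property") == "alternate-script"):
            found[meta.get(XML_LANG)] = (meta.text or "").strip()
    return found


root = ET.fromstring(PACKAGE)
variants = alternate_scripts(root, "creator")
print(variants)
```

Each variant is a separate element, so it can carry its own xml:lang (and dir) value even though no inline markup is available inside a single string.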

Defining identifiers

  1. Identifiers should be case-sensitive.

    EPUB 3.3 Check: the relevant portion of the spec is based on XML, which is case-sensitive.

Working with plain text

  1. Avoid natural language text in elements that only allow for plain text and in attribute values.

    EPUB 3.3 Check: this approach is followed in the spec.

  2. Provide a span-like element that can be used for any text content to apply information needed for internationalization. more

    EPUB 3.3 Check (Negative): All textual content metadata are defined by DCMI, and EPUB cannot add extra internal structure.

Locales, date and time values, and locally affected formats

Working with locale-affected values

  1. When defining data formats, use locale-neutral serialization forms.

    EPUB 3.3 Check: n/a.

Working with time

Check for all: EPUB includes the dc:date and dcterms:modified elements (also defined by DCMI). The value of:

  • dc:date is “RECOMMENDED that the date string conform to [ISO8601], particularly the subset expressed in W3C Date and Time Formats [DateTime], as such strings are both human and machine readable” (see the DCMI specification).
  • dcterms:modified “MUST be an [XMLSCHEMA-2] dateTime conformant date of the form: CCYY-MM-DDThh:mm:ssZ” (see the DCMI specification).

  1. When defining calendar and date systems, be sure to allow for dates prior to the common era, or at least define handling of dates outside the most common range.
  2. When defining time or date data types, ensure that the time zone or relationship to UTC is always defined.
  3. Provide a health warning for conversion of time or date data types that are "floating" to/from incremental types, referring as necessary to the Time Zones WG Note. more
  4. Allow for leap seconds in date and time data types. more
  5. Use consistent terminology when discussing date and time values. Use 'floating' time for time zone independent values.
  6. Keep separate the definition of time zone from time zone offset.
  7. Use IANA time zone IDs to identify time zones. Do not use offsets or LTO as a proxy for time zone.
  8. Use a separate field to identify time zone.
  9. When defining rules for a "week", allow for culturally specific rules to be applied. more
  10. When defining rules for week number of year, allow for culturally specific rules to be applied.
  11. When non-Gregorian calendars are permitted, note that the "month" field can go to 13 (undecimber).
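The CCYY-MM-DDThh:mm:ssZ form required for dcterms:modified always expresses the time in UTC, which addresses the time zone items above. A minimal sketch of producing and checking such a value follows; the function names are assumptions of this example.

```python
import re
from datetime import datetime, timedelta, timezone

MODIFIED_FORMAT = "%Y-%m-%dT%H:%M:%SZ"
MODIFIED_PATTERN = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z"
)


def format_modified(moment):
    """Serialize an aware datetime as CCYY-MM-DDThh:mm:ssZ (UTC)."""
    if moment.tzinfo is None:
        # A "floating" time has no defined relationship to UTC.
        raise ValueError("a time zone aware datetime is required")
    return moment.astimezone(timezone.utc).strftime(MODIFIED_FORMAT)


def parse_modified(value):
    """Parse a dcterms:modified value; reject any other date form."""
    if not MODIFIED_PATTERN.fullmatch(value):
        raise ValueError("not of the form CCYY-MM-DDThh:mm:ssZ: " + repr(value))
    return datetime.strptime(value, MODIFIED_FORMAT).replace(tzinfo=timezone.utc)


# A local time two hours ahead of UTC is normalized to UTC on output.
local = datetime(2021, 3, 3, 12, 0, 0, tzinfo=timezone(timedelta(hours=2)))
print(format_modified(local))
```

dc:date is only RECOMMENDED to follow ISO 8601, so a consumer cannot assume that it parses the same way.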

Working with personal names

These all relate to the dc:creator and dc:contributor elements, as defined by DCMI.

  1. Check whether you really need to store or access given name and family name separately. more

    EPUB 3.3 Check: The specification does not require separate given and family names.

  2. Avoid placing limits on the length of names, or if you do, make allowance for long strings. more

    EPUB 3.3 Check: There is no limit.

  3. Try to avoid using the labels 'first name' and 'last name' in non-localized contexts. more

    EPUB 3.3 Check: No such labels are used.

  4. Consider whether it would make sense to have one or more extra fields, in addition to the full name field, where users can provide part(s) of their name that you need to use for a specific purpose. more

    EPUB 3.3 Check: This is provided by the refines mechanism combined with the role property.

  5. Allow for users to be asked separately how they would like to be addressed when someone contacts them. more

    EPUB 3.3 Check: n/a.

  6. If parts of a person's name are captured separately, ensure that the separate items can capture all relevant information. more

    EPUB 3.3 Check: n/a.

  7. Be careful about assumptions built into algorithms that pull out the parts of a name automatically. more

    EPUB 3.3 Check: n/a.

  8. Don't assume that a single letter name is an initial. more

    EPUB 3.3 Check: n/a.

  9. Don't require that people supply a family name. more

    EPUB 3.3 Check: n/a.

  10. Don't forget to allow people to use punctuation such as hyphens, apostrophes, etc. in names. more

    EPUB 3.3 Check: n/a.

  11. Don't require names to be entered all in upper case. more

    EPUB 3.3 Check: n/a.

  12. Allow the user to enter a name with spaces. more

    EPUB 3.3 Check: n/a.

  13. Don't assume that members of the same family will share the same family name. more

    EPUB 3.3 Check: n/a.

  14. It may be better for a form to ask for 'Previous name' rather than 'Maiden name' or 'née'. more

    EPUB 3.3 Check: n/a.

  15. You may want to store the name in both Latin and native scripts, in which case you probably need to ask the user to submit their name in both native script and Latin-only form, as separate items. more

    EPUB 3.3 Check: n/a.
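As an illustration of item 4 above, a name is stored as one full string in dc:creator, and additional information is attached through meta elements that refine it. The sketch below reads a role and, as a further assumption of this example, a file-as sorting form; the sample name, the id, and the helper `refinements` are hypothetical.

```python
import xml.etree.ElementTree as ET

OPF = "{http://www.idpf.org/2007/opf}"

# Illustrative metadata; names and ids are placeholders.
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" xml:lang="en">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:creator id="creator01">Jane Q. Example</dc:creator>
    <meta refines="#creator01" property="role" scheme="marc:relators">aut</meta>
    <meta refines="#creator01" property="file-as">Example, Jane Q.</meta>
  </metadata>
</package>"""


def refinements(root, target_id):
    """Collect property -> value for all meta elements refining target_id."""
    collected = {}
    for meta in root.iter(OPF + "meta"):
        if meta.get("refines") == "#" + target_id:
            collected[meta.get("property")] = (meta.text or "").strip()
    return collected


root = ET.fromstring(PACKAGE)
creator_details = refinements(root, "creator01")
print(creator_details)
```

The full name stays intact as a single string, so no assumption about given or family name order is built into the format.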