
Exploring a Document, but Encoding a Text

Rebecca Parker edited this page Mar 19, 2019 · 18 revisions

markup = annotation or other marks within a text intended to instruct a compositor, typist, or web developer how a particular passage should be printed, laid out, or displayed

markup language = set of markup conventions specifying how markup is to be distinguished from text, what markup is allowed, what markup is required, and what the markup means

Generalizing from that sense, we define encoding as any means of making explicit an interpretation of a text using a markup language.

What is XML?

XML stands for eXtensible Markup Language. It is a standard, widely used system for storing and accessing information. For our purposes as researchers, it’s an excellent method for storing information and for preparing to share it with the public. We write XML to form hierarchies (or nested structures) of information so that we can locate and extract that information, whether for presentation as HTML, for creating data visualizations, or simply for searchability. XML is interested in the meaning of data more than in its presentation. While many other markup languages are concerned with how a document appears, XML does not have a fixed set of tags, so its markup can extend beyond presentation. This makes XML documents multipurpose: you can mark up a text once and then use it for many purposes.

image of example XML element: <element attribute="attributeValue">Hello I am the content of this element.</element>

  • A tag is the text between the left angle bracket (<) and the right angle bracket (>), including the brackets themselves. There are starting tags and ending tags. A start tag holds the element name and any attributes, for example <person handle="RJP43">. An end tag holds only the element name, with a forward slash right after the opening angle bracket, for example </person>.

  • An element is the starting tag, the ending tag, and everything in between. This can include text and/or other elements. Here is an example of nested elements: <person handle="RJP43" pronoun="she">Rebecca <surname>Parker</surname></person>. When we talk about an element, we’re referring to the whole thing. The element name is the name written immediately after the opening angle bracket of the start tag (here, person and surname); the end tag repeats it after the slash.

  • An attribute is a name-value pair inside the start tag of an element. Attributes are additional markup that gives supplementary information about an element (they are sort of like adjectives, or descriptive modifiers). Each one consists of an attribute name, an equals sign, and a quoted attribute value; in the example above, handle="RJP43" and pronoun="she" are attributes of person.
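
To see how these three terms map onto a parsed document, here is a minimal sketch that reads the nested person example with Python's standard-library `xml.etree.ElementTree` module. Python is not part of this page's workflow; it is used here only as one convenient way to inspect an element, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The nested example from the list above, stored as a string.
snippet = '<person handle="RJP43" pronoun="she">Rebecca <surname>Parker</surname></person>'

# Parse the string; the result is the outermost element.
person = ET.fromstring(snippet)

element_name = person.tag                # name written in the start and end tags
attributes = person.attrib               # name-value pairs from the start tag
own_text = person.text                   # text before the first child element
child = person.find("surname")           # the nested element
full_text = "".join(person.itertext())   # all text, including nested elements

print(element_name)   # person
print(attributes)     # {'handle': 'RJP43', 'pronoun': 'she'}
print(child.tag, child.text)   # surname Parker
print(full_text)      # Rebecca Parker
```

Note that the parser hands back the element name and the attributes separately from the content, which is the practical payoff of keeping them distinct in the markup.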

image of example self-closing element: <lb n="1251"/>
In special cases, XML elements can have no content at all! These are called self-closing elements (the XML specification calls them empty elements), and they have a special syntax so that they open and close inside a single tag. Self-closing elements:

  • Don't contain text or any other elements.

  • Consist of a single tag that smushes the start and end tag together, with the forward slash placed just before the closing angle bracket.

  • May have attributes.
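
As a quick check of the points above, the following sketch (again using Python's standard-library `xml.etree.ElementTree`, an assumption of this example rather than a tool the page prescribes) parses the self-closing form next to the equivalent start-tag-plus-end-tag form and compares what the parser sees.

```python
import xml.etree.ElementTree as ET

# The same empty element written two ways.
self_closing = ET.fromstring('<lb n="1251"/>')
start_and_end = ET.fromstring('<lb n="1251"></lb>')

for lb in (self_closing, start_and_end):
    # tag name, attributes, text content, number of child elements
    print(lb.tag, lb.attrib, lb.text, len(lb))

# Both forms carry the attribute, and neither has text or children.
same = (
    self_closing.tag == start_and_end.tag
    and self_closing.attrib == start_and_end.attrib
    and self_closing.text is None
    and start_and_end.text is None
)
print(same)
```

To a parser the two spellings are the same element; the self-closing syntax is simply the shorter way to write it.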

This is an XML comment:

<!-- comment text goes here -->
Note: Two dashes in a row (--) are not allowed inside a comment; they may only appear as part of the opening <!-- and the closing -->. When writing XML comments, we recommend that encoders provide their initials and the date the comment is being left. We want you to think of XML comments as breadcrumbs for future encoders and processors of your XML; therefore, be sure to use complete sentences and leave logical comments that can be understood by others even after you are no longer working on the project.
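
The double-dash rule can be checked with any XML parser. The sketch below uses Python's standard-library `xml.etree.ElementTree`; the helper name `parses`, the initials "XYZ", and the date are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the parser accepts xml_text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A comment with (hypothetical) initials and date; single hyphens are fine.
good = '<note><!-- XYZ, 2024-01-15: Checked this line against the manuscript. --><p>text</p></note>'

# Two dashes in the middle of the comment break the document.
bad = '<note><!-- checked -- twice --><p>text</p></note>'

print(parses(good))   # True
print(parses(bad))    # False

# By default this parser drops comments, so only <p> remains as a child.
children = [child.tag for child in ET.fromstring(good)]
print(children)       # ['p']
```

Because many processors discard comments in this way, treat them as notes for human readers rather than as a place to store data.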

These are XML Reserved Characters & How To Escape Them:

< less than - &lt;
> greater than - &gt;
& ampersand - &amp;
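
To illustrate the escapes listed above, here is a small sketch using `escape` from Python's standard-library `xml.sax.saxutils` module, which replaces exactly these three characters. The sample sentence is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Text containing all three reserved characters (an invented sample).
raw = "Smith & Sons: 3 < 5 > 2"

# escape() swaps &, < and > for &amp;, &lt; and &gt;.
escaped = escape(raw)
print(escaped)   # Smith &amp; Sons: 3 &lt; 5 &gt; 2

# Once escaped, the text can sit safely inside an element,
# and a parser turns the escapes back into the original characters.
element = ET.fromstring("<p>" + escaped + "</p>")
print(element.text == raw)   # True

# Without escaping, the bare ampersand makes the document unparseable.
try:
    ET.fromstring("<p>" + raw + "</p>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False
print(unescaped_ok)   # False
```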

Understanding the XML Hierarchy

Elements, when brought together, conform to a particular hierarchy. The following three analogies will help you better understand a well-formed, properly nested XML hierarchy:

  1. Nesting Dolls
    photo of Russian Nesting Dolls with each doll sitting outside of their parent doll
    Elements are Russian Nesting Dolls - “Well-formedness = Nested-ness” - Everything is properly delimited; there is a single root element (“the big doll”) that contains all of the other elements, both structural and contextual in nature; and no elements overlap.

  2. Family Tree
    a data tree with book as the root and page divisions as the youngest child
    Elements form trees - Reference relationships: Ancestor, Descendant, Sibling, Parent, Grandparent. Humanities scholars use XML to represent their documents because the tree model is convenient both as a logical representation (meaning some aspects of the inherent structure of documents are tree-like) and for programming purposes (meaning computers can process tree representations efficiently).

  3. Boxes in Boxes
    a data visualization of a book's content divided out into boxes sitting within other boxes with book being the outer box and page divisions being the smallest most internal box
    Elements are boxes - Attributes distinguish box types
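
The family-tree vocabulary above can be tried out on a small document. In this sketch the book and page elements follow the figures described above, while the chapter level, the `n` attributes, and the helper names are assumptions added for illustration. It uses Python's standard-library `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# A tiny hierarchy: book is the root, pages are the youngest children.
book_xml = """
<book>
  <chapter n="1">
    <page n="1"/>
    <page n="2"/>
  </chapter>
  <chapter n="2">
    <page n="3"/>
  </chapter>
</book>
"""
book = ET.fromstring(book_xml)

def outline(element, depth=0):
    """List element names, indented by how deeply each one is nested."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(book)))

# ElementTree does not store parent links, so build a child-to-parent lookup.
parent_of = {child: parent for parent in book.iter() for child in parent}

first_page = book.find("chapter/page")
parent = parent_of[first_page]           # the chapter holding the page
grandparent = parent_of[parent]          # the book holding that chapter
siblings = [p for p in parent if p is not first_page]

print(parent.tag, parent.get("n"))       # chapter 1
print(grandparent.tag)                   # book
print([s.get("n") for s in siblings])    # ['2']
```

The indented outline is the "boxes in boxes" picture in text form: each extra level of indentation is one more box (or doll) inside its parent.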

Rules for "Well-Formed" XML

  • The XML prolog (more precisely, the XML declaration) is optional; however, if it exists, it must come first in the document, with nothing before it, not even a space or a blank line.
    Example XML prolog: <?xml version="1.0" encoding="UTF-8"?>

  • An XML document must be contained in a single element. That single element is called the root element, and it contains all the text and any other elements.

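  • XML elements can't overlap. Elements must be properly nested: an element that opens inside another element must also close inside it, and every element needs a start tag and a matching end tag (or a single self-closing tag).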

  • XML element names are case sensitive, so the start and end tag must match exactly: <person>, <PERSON>, and <Person> are three different element names.

  • Attributes must have values, and those values must be enclosed within quotation marks (single or double, as long as the opening and closing marks match).
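
Each rule above can be tested by handing a small document to a parser and seeing whether it is accepted. The following is a minimal sketch using Python's standard-library `xml.etree.ElementTree`; the helper `is_well_formed` and the sample documents are invented for this illustration, and an XML editor's built-in well-formedness check does the same job.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# One sample per rule; the labels describe what each sample demonstrates.
samples = {
    "prolog first, single root": '<?xml version="1.0" encoding="UTF-8"?><book><page/></book>',
    "prolog not first": '<book/><?xml version="1.0" encoding="UTF-8"?>',
    "two root elements": '<book/><book/>',
    "overlapping elements": '<b><i>text</b></i>',
    "case mismatch in tags": '<person>Rebecca</PERSON>',
    "attribute without value": '<person handle/>',
    "attribute value without quotes": '<person handle=RJP43/>',
    "single-quoted attribute": "<person handle='RJP43'/>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}

for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'NOT well-formed'}")
```

Only the first and last samples are accepted; every other sample breaks exactly one of the rules listed above.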