Centrum voor Teksteditie en Bronnenstudie
Centre for Scholarly Editing and Document Studies
a research centre of the Royal Academy of Dutch Language and Literature
6. Correlations of logical and physical structures
The encoding of letters, and primary manuscript materials in general, can be problematic for XML-based encoding schemes like TEI and DALF. The problem arises from one of the premises of XML, namely the conception of a document as an Ordered Hierarchy of Content Objects (OHCO). This requires documents to take the form of a properly nesting hierarchy of container elements: every element must be wholly contained within its parent element, as in the following example:
<a>... <b>...</b> </a>
Overlapping start and end tags, like in the following example, are illegal:
<a>... <b>... </a> </b>
In contrast to XML documents, real-life texts seldom satisfy this requirement. Still, with a minimal degree of abstraction, many textual features can be mapped onto properly nesting XML structures without great difficulty. Yet (modern) primary manuscript materials have a very concrete nature, containing all of the peculiarities of the draft stage, such as deletions, scribbles and scars, that are filtered out from the final form of nicely laid-out print texts. Since the TEI scheme is aimed primarily at the representation of logical structure, problems of overlap will easily appear when encoding both the logical and the physical aspects of manuscript materials.
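The nesting constraint described above is enforced by every XML parser. The following minimal sketch uses Python's standard xml.etree.ElementTree module to show this; the helper name is_well_formed is introduced here purely for illustration and is not part of TEI or DALF.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML (illustrative helper)."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        # Raised, among other cases, when an end tag does not match
        # the most recently opened start tag (overlap).
        return False


nested = "<a>... <b>...</b> </a>"
overlapping = "<a>... <b>... </a> </b>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

The parser rejects the second fragment as soon as it meets the </a> end tag while <b> is still open, which is why overlapping structures have to be expressed by other means.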
This example contains logical-physical structures (2 paragraphs, 1 addition), mixed with more logical structures (1 name), and purely physical structures (the shifts in writing direction and paper). Because of the various overlaps, the encoding of this fragment according to the TEI scheme is not straightforward. Chapter 31. Multiple Hierarchies of the TEI P4 Guidelines suggests some strategies to overcome the problem of overlap in the XML version of the TEI DTD. They can be summarised along 3 principles:
Applied to this (admittedly rather complex, but still short) example, it is clear that options 2 and 3 would prove very unwieldy, resulting in either a plethora of split structures or several different encodings. Although syntactically correct from a technical point of view, neither of these approaches seems to produce a faithful representation in the markup of a theoretical view on text that sees those structural levels functioning in a continuum. Therefore, in the design of the DALF DTD, we opted for the first solution: encoding overlapping hierarchies with empty elements. In view of the difficulties that exist with the processing of empty elements, however, the use of this strategy has to be kept to a minimum. The above-mentioned bias of the TEI scheme towards logical structures provided a good rationale for doing so. Most of the logical structures in the example can be encoded with existing TEI elements. No TEI elements exist, however, to mark the start and end of the physical materials that contain those logical structures. The <pb /> TEI element (see http://www.tei-c.org/P4X/ref-PAGEBR.html) is not adequate, as its semantics restrict it to the marking of page sides in a book. Therefore, the DALF DTD has been augmented with a mechanism to express a minimal ‘layer’ view on documents, in a way that avoids both the possibility of overlap with the encoding of logical structures and an overgeneralised use of empty elements. This means that a letter can be seen as a complex of physical layers, that is, pieces of physical material that contain the logical structures. When different layers are distinguished, their boundaries can be indicated with empty elements that do not disturb the proper nesting of other elements.
This ‘layer’ approach is implemented in the DALF DTD as a mechanism for identifying and referring to different physical layers, modelled after the TEI mechanism for identifying and referring to document hands (see http://www.tei-c.org/P4X/ref-HAND.html). In order to identify physical layers, the <layerList> element is added to the <profileDesc> element in the header. It lists the different layer definitions as distinct <layer> elements:
The <layerList> element has only the global attributes (see 3.3. Global attributes). Each <layer> element can have one additional attribute:
In order to categorise the layers, DALF encoders are strongly encouraged to provide each layer definition with a type attribute, with appropriate values such as ‘post-it’, ‘newspaper_article’, etc. Furthermore, although the DTD does not make the id attribute obligatory, it should be present on each <layer> element in order to provide a reference point for the elements in the text that signal the boundaries of the different physical layers in the document.
<profileDesc>
  <handList>
    <hand id="hand2" />
  </handList>
  <layerList>
    <layer id="l2" type="post-it" />
  </layerList>
</profileDesc>
When different layers are defined in the header of a DALF letter, some special text elements can be used to refer to those definitions. Because of the above-mentioned difficulties with overlapping structures and the processing of empty elements, we opted against a ‘milestone’ use of empty elements. Instead, a pair of tags is specified to mark the start and end of physical boundaries; semantically they function as start and end tags, but syntactically they are empty elements. They are <layerStart /> and <layerEnd /> respectively, and since shifts of physical layers can occur at any place in a letter, they are specified as members of the TEI element class Incl (see 7. Modifications to TEI element classes).
In order to refer to a definition of the layer in the header, a special attribute is added to the global ones (defined at 3.3. Global attributes):
The TEI scheme provides several means to link fragmented information (see http://www.tei-c.org/P4X/SA.html). Although these do not provide any straightforward functionality as yet, some pieces of specialised software could make use of them to reconstruct alternative views of the document. In the spirit of the suggestions made in chapter 31. Multiple Hierarchies of the TEI P4 Guidelines, the following example shows how the physical layers in the example at the start of this section can be encoded, and how a link between those layer elements can be made explicit with the TEI <join> element:
<text>
  <body>
    <p>The next word <layerStart layer="l2" id="ls1" />will appear in a distinct physical area.</p>
    <p>If so desired, a new paragraph will complicate matters, as well as a reference to <name>Stijn <layerEnd layer="l2" id="le1" /> Streuvels</name>, and a couple of new <add hand="hand2">tri <layerStart layer="l2" id="ls1b" />cky!</add> boundary crossings to make the story <seg rend="90">com<layerEnd layer="l2" id="le1b" />plete. </seg></p>
    ...
  </body>
  <back>
    <join targets="ls1 le1 ls1b le1b" result="div" desc="physical layer" />
  </back>
</text>
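As an illustration of how specialised software might reconstruct the physical view from such markup, the following Python sketch walks the document in reading order and collects the text enclosed by each <layerStart />/<layerEnd /> pair of a given layer. It is only a sketch under stated assumptions: the function name layer_text is hypothetical, the sample repeats the two paragraphs of the example above without the <back> part, and the standard xml.etree.ElementTree module is used without any DTD validation.

```python
import xml.etree.ElementTree as ET

# The two paragraphs of the example above, wrapped in <text><body>.
SAMPLE = (
    '<text><body>'
    '<p>The next word <layerStart layer="l2" id="ls1" />will appear in a '
    'distinct physical area.</p>'
    '<p>If so desired, a new paragraph will complicate matters, as well as a '
    'reference to <name>Stijn <layerEnd layer="l2" id="le1" /> Streuvels</name>, '
    'and a couple of new <add hand="hand2">tri <layerStart layer="l2" id="ls1b" />'
    'cky!</add> boundary crossings to make the story <seg rend="90">com'
    '<layerEnd layer="l2" id="le1b" />plete. </seg></p>'
    '</body></text>'
)


def layer_text(root, layer_id):
    """Return the text spans between layerStart and layerEnd for one layer.

    Hypothetical helper: elements are visited in document order, and text is
    recorded only while a layerStart for layer_id is open.
    """
    spans = []
    current = []
    inside = False

    def visit(elem):
        nonlocal inside, current
        if elem.tag == "layerStart" and elem.get("layer") == layer_id:
            inside, current = True, []
        elif elem.tag == "layerEnd" and elem.get("layer") == layer_id:
            if inside:
                spans.append("".join(current))
            inside = False
        if inside and elem.text:
            current.append(elem.text)
        for child in elem:
            visit(child)
            # The tail is the text that follows the child's end tag.
            if inside and child.tail:
                current.append(child.tail)

    visit(root)
    return spans


spans = layer_text(ET.fromstring(SAMPLE), "l2")
for span in spans:
    print(repr(span))
```

For this sample the function yields two spans, one per start/end pair, each crossing the boundaries of the logical elements (<p>, <name>, <add>, <seg>) without disturbing their nesting.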
The following lines demonstrate how the DALFExtns.dtd file redefines the <profileDesc> element to include <layerList>, and how the unique DALF elements used to identify and refer to physical layers in a letter are defined:
<!ELEMENT profileDesc %om.RR; (creation?, langUsage*, textDesc*, particDesc*, settingDesc*, handList*, layerList*, textClass*)>
<!ATTLIST profileDesc %a.global;>

<!ELEMENT layerList %om.RR; (layer*)>
<!ATTLIST layerList %a.global;>

<!ELEMENT layer %om.RR; (p)*>
<!ATTLIST layer %a.global;
          type CDATA #IMPLIED>

<!ELEMENT layerStart %om.RR; EMPTY>
<!ATTLIST layerStart %a.global;
          layer IDREF #REQUIRED>

<!ELEMENT layerEnd %om.RR; EMPTY>
<!ATTLIST layerEnd %a.global;
          layer IDREF #REQUIRED>
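Because the layer attribute is declared as IDREF, a validating parser will report any <layerStart /> or <layerEnd /> that points to an id not present in the document. Where no validating parser is at hand, a similar check can be approximated as in the Python sketch below. The function name unresolved_layer_refs and the heavily simplified sample document are assumptions made for illustration only; note also that the sketch is stricter than IDREF validation, since it only accepts ids that occur on <layer> elements.

```python
import xml.etree.ElementTree as ET


def unresolved_layer_refs(root):
    """List layer attribute values that match no <layer id="..."> definition.

    Hypothetical helper; a missing layer attribute is reported as None.
    """
    defined = {layer.get("id") for layer in root.iter("layer") if layer.get("id")}
    return [
        elem.get("layer")
        for elem in root.iter()
        if elem.tag in ("layerStart", "layerEnd")
        and elem.get("layer") not in defined
    ]


# Simplified, non-valid fragment: only the parts relevant to layers are kept.
DOC = (
    '<TEI.2>'
    '<teiHeader><profileDesc><layerList>'
    '<layer id="l2" type="post-it" />'
    '</layerList></profileDesc></teiHeader>'
    '<text><body><p>a <layerStart layer="l2" id="ls1" />b'
    '<layerEnd layer="l3" id="le1" /> c</p></body></text>'
    '</TEI.2>'
)

print(unresolved_layer_refs(ET.fromstring(DOC)))  # ['l3']
```

Here the <layerEnd /> element refers to a layer "l3" that is not defined in the <layerList>, so it is reported.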