Introduction
The XML tree paradigm has several well-known limitations for document modeling and
processing, some of which have received a lot of attention (especially overlap; see
the
overviews in Sperberg-McQueen and Huitfeldt 2000 and DeRose 2004) and
some of which have received less (e.g., discontinuity, simultaneity, transposition,
white
space as crypto-overlap). Many of these have work-arounds, also well known, that—as
is
implicit in the term work-around
—have disadvantages, but because they get the
job done and because XML has a large user community with diverse levels of technological
expertise, it is difficult to overcome inertia and move to a technology that might
offer a
more comprehensive fit with the full range of document structures with which researchers
need
to interact both intellectually and programmatically. Proceeding from a high-level
view of why
XML has the limitations it has, this presentation explores how an alternative model
of Text as
Graph (TAG) might address these types of structures and tasks in a more natural and
idiomatic
way than is available within an XML paradigm.
From an informatic perspective all documents are structured, including those that
are
traditionally identified as plain text. Some of the structural properties of plain-text
documents are expressed through formatting conventions, such as the use of blank lines
to
separate paragraphs, or of indentation to mark the beginning of a paragraph, or of
centering
to mark a header. The sequence of words in a text, delimited in a complex way that
involves
white space, punctuation, and other symbols, constitutes, on a certain level, an implicit
organizational tier above the sequence of characters. The conventions at work in plain text do not formally, completely, unambiguously,
or in a wholly standardized way differentiate the content of a document from the coded
representation of its structure (or, perhaps more accurately, structures), which problematizes
using plain text for document processing (whether for data mining, publication, or
other
purposes). The challenges this poses have come to be addressed by representing the
structural
properties of a document not through plain-text characters (which might be considered
pseudo-markup), but through formal, standardized markup, such as XML.
The XML data model is an ordered tree, or, more precisely, a rooted and ordered directed
acyclic graph that prohibits multiple parentage, which in the document-processing
community
has come to be understood as representing an Ordered Hierarchy of Content Objects
(OHCO). It
is well known that the OHCO model works reasonably well for describing structures
that consist
of single ordered hierarchies, such as the exhaustive tesselated division of a novel
into
chapters and the chapters into paragraphs, but it is not well suited to modeling structures
that cannot be represented fully by a single tree. The markup community has focused intensively on overlapping hierarchies as a
challenge to the OHCO model, and with good reason, but we argue below that overlap is only one manifestation of
a higher-level problem, and this perspective has implications for deciding how best
to
overcome it. If overlap were the problem, projecting multiple
trees over the content might solve it (e.g., through the SGML CONCUR feature), as might the adoption of a model that permits but does not require hierarchy,
including multiple hierarchies, such as the range model exemplified by LMNL. But if the problem is that a tree is inadequate for higher-level reasons that are
only partially exemplified by overlap, we might have more success if we address the
issue at
that higher level. The Text As Graph model that we introduce below is not intended
to be a
solution to the overlap problem in XML
; it is built around a fresh
consideration of the textual structures, both latent and overt, that a data model
will need to
be able to represent. It is nonetheless not accidental that TAG agrees in some respects
with
XML, in others with GODDAG or TexMECS, and in others with LMNL, since all of these
specifications have sought, in partially converging ways, to model text structure.
Below we identify specific situations that pose problems for an OHCO perspective.
With
respect to hierarchy, in addition to overlap, where text may have multiple overlapping
hierarchies, text may not be hierarchical at all. We also identify situations where
text is
not ordered, as well as those where XML creates artifactual content objects that do
not
clearly correspond to what a human would consider a textual content object. In other
words,
TAG seeks to interrogate and address the O, the H, and the CO of OHCO, and not only
the
well-known multiple-hierarchy challenge.
Challenges for text modeling
In this section we illustrate several types of textual structures that have proven
awkward
for XML because they contradict or otherwise are not part of the OHCO tree model.
For each we
provide an abstract description of the problem, of one or more XML workarounds, and
their
GODDAG, TexMECS, and LMNL counterparts (as appropriate), illustrated with examples
drawn from
use cases in Digital Humanities research projects.
Overlap
The challenge to text modeling in XML that has attracted the most attention is overlap.
For example, notice in the image below how the phrase Two vast and trunkless legs of
stone Stand in the desart
begins in the middle of line 2 and ends in the middle of
line 3, an absence of synchronicity between verse lines and sentences that is called
enjambment. :
Piez’s illustration is actually of LMNL ranges, rather than of XML element trees.
The
same structure might be visualized as independent overlapping trees as follows, where
cyan
represents the tree of metrical lines and green represents the tree of linguistic
phrases:
Because it is not possible to represent the preceding structure in XML markup, the
following pseudo-XML is not well-formed:
<line><phrase>Who said —</phrase> <phrase>“Two vast and trunkless legs of stone</line>
<line>Stand in the desart….</phrase> <phrase>Near them,</phrase> <phrase>on the sand</phrase></line>
New XML users often misunderstand the prohibition against overlap as a prohibition
against overlapping tags, but if that were the entire
issue, it could be remedied by simply removing the syntactic prohibition. But the
rule about
tags exists because tags must represent a tree, hierarchy in a tree prohibits multiple
parentage, and overlap would permit a node to have more than one parent. Overlap is
possible
in GODDAG only incidentally because TexMECS permits overlapping tags; at a higher
level it
is because GODDAG permits Text nodes to have multiple parents and TexMECS serializes
the
GODDAG model. LMNL sawtooth syntax may look like XML syntax with the prohibition against
overlapping tags removed, but the real difference is at the level of the data model:
LMNL
ranges can overlap and XML elements cannot because the content between XML start and end tags is a
sequence of descendant nodes in a tree, and not a range of textual atoms.
TAG represents overlap naturally because the TAG counterpart to an XML element is
a
directed hyperedge that associates a head Markup node with a set of tail Text nodes.
To tag
a line of poetry in the example above, TAG would create a hyperedge from a Markup
node with
the name
property value of line
(comparable to a
<line>
element in XML) to a set of Text nodes (comparable to Text nodes
in XML). Sets are unordered, but because the TAG model requires sequence
edges between Text nodes, which record the continuous order of the text stream (comparable
to the sequence of atoms in the LMNL model), the textual content of the line is fully
specified by (= can be retrieved by examining) the membership of the set of tail Text
nodes
and the sequence edges between them. In the illustration below, the black arrows represent
regular edges that connect Text nodes in order, the irregular colored bounding lines
demarcate the sets of tail Text nodes, and a similarly colored arrow points into them
from
their Markup node heads:
Additional use cases involving overlap challenges in XML include pages vs paragraphs
in
publications of novels, folios vs texts in medieval manuscripts, and speeches vs metrical
lines in drama. Overlap in poetic structures has been explored in detail in Piez 2014, which also discusses an unusual structural paradox involving
Chapter 24 of Mary Shelley’s Frankenstein. Overlap
involving word and metrical foot boundaries in poetry is discussed below.
Discontinuity
Sperberg-McQueen and Huitfeldt 2008b offer the following paragraph from Lewis
Carroll’s Alice in Wonderland as an example of
discontinuity:
Alice was beginning to get very tired of sitting by her sister on the bank, and of
having nothing to do: once or twice she had peeped into the book her sister was reading,
but it had no pictures or conversations in it, and what is the use of a
book,
thought Alice without pictures or conversation?
There is no way to mark up this passage in XML without fragmenting the quotation into
two elements (and relying on semantics to stitch together the pieces in the application
layer), yet our human intuition is that there is a single quotation, and that the
model,
therefore, should represent it as a single object. As Sperberg-McQueen and Huitfeldt 2008b also observe, there is a sense in
which book
and without
are adjacent and a different sense in
which book
and thought
are adjacent. XML syntax and the XML
tree cannot represent both of these realities simultaneously, which means that at
least one
of them must be handed off to the application layer.
Sperberg-McQueen and Huitfeldt 2008b situate this type of structure in a GODDAG
context, where it intersects with the distinction between containment and dominance. Concerning LMNL,
they write that
[With respect to] the unity of discontinuous elements: such a unity may be asserted
by
the application layer (that is, by the definition of a LMNL vocabulary), but it is
not
visible on the LMNL level, and thus need not be accounted for at the level of LMNL
itself.
The design of LMNL thus seems to require that any account of dominance (as distinct
from containment), and any account of discontinuous elements, be handled in the
application layer. LMNL itself achieves a degree of simplicity and regularity as a
result,
at the expense of complexity in the application.
Piez 2008 describes discontinuity in LMNL as modeled by the limen, where the example provided
(http://piez.org/wendell/LMNL/Amsterdam2008/presentation-slides.html#page23)
records it through coindexed annotations. That dependency seems to locate discontinuity
in
an application layer, since whether coindexed annotations represent discontinuity
is not an
inherent property of the coindexing. As was noted earlier, TexMECS is capable of modeling
discontinuity directly with suspend-tags and resume-tags, but if a TexMECS document
is to be
parsed into GODDAG, the fragments are then modeled as separate structural components
TAG prioritizes the representation of text structures, including discontinuity, in
the
model, without dependency on application-layer semantics. The example from Alice in Wonderland described above would have the following form
in TAG:
Other examples of discontinuity involve stage directions in dramatic text, such as
the
following example from George Bernard Shaw’s Mrs. Warren’s
profession:
VIVIE. Sit down: I’m not ready to go back to work yet. [Praed sits]. You both think
I
have an attack of nerves. Not a bit of it. But there are two subjects I want dropped,
if
you don’t mind.
One of them [to Frank] is love’s young dream in any shape or form: the other [to
Praed] is the romance and beauty of life, especially Ostend and the gaiety of Brussels.
You are welcome to any illusions you may have left on these subjects: I have none.
If we
three are to remain friends, I must be treated as a woman of business, permanently
single
[to Frank] and permanently unromantic [to Praed].
Here the last two stage directions interrupt not just the speech, but the
sentence.
Hierarchy, containment and dominance
The challenges that have emerged from our experience of XML as a model of text involve
not only the limitations of OHCO, but also its tyranny. If text is understood as an Ordered Hierarchy of Content
Objects, are there aspects of text that are not ordered (O), are there aspects that
are not
hierarchical (H, by which we mean not just that they are not mono-hierarchical, but
that
they are not hierarchical at all), and does the model create content objects artifactually,
that is, where they are not perceived as inherent properties of the text being modeled
(CO)?
XML requires us to model all content as both ordered and hierarchical, and it represents
content objects as elements (at least as content objects are described in DeRose et al. 1990). GODDAG and LMNL both grew out of a recognition that not all
properties of text can be modeled effectively as a single hierarchy, and their focus
is not
limited to that issue, but they differ in the extent to which they interrogate features
of
text that may not be hierarchical at all, that may not be ordered, and that may not
involve
what a human would consider a content object.
As was mentioned earlier, the XML data model does not distinguish containment from
dominance, which Tennison explains and illustrates in LMNL terms as follows:
Containment is a happenstance relationship between ranges while dominance is one
that has a meaningful semantic. A page may happen to contain a stanza, but a poem dominates
the stanzas that it contains.[Tennison 2008]
In XML, an ancestor element both contains (starts
before and ends after in the serialization) its descendants and dominates them (is connected to them by a path that travels only downward in
the tree). In the XML view below of the Shakespearean sonnet the we used as a TAG
example
above, the <poem>
element both contains and dominates three
<quatrain>
elements and one <couplet>
element, and
the <quatrain>
and <couplet>
elements both contain and
dominate <line>
elements:
<poem>
<quatrain>
<line>No longer mourn for me when I am dead</line>
<line>Than you shall hear the surly sullen bell</line>
<line>Give warning to the world that I am fled</line>
<line>From this vile world with vilest worms to dwell:</line>
</quatrain>
<quatrain>
<line>Nay, if you read this line, remember not</line>
<line>The hand that writ it, for I love you so,</line>
<line>That I in your sweet thoughts would be forgot,</line>
<line>If thinking on me then should make you woe.</line>
</quatrain>
<quatrain>
<line>O! if,—I say you look upon this verse,</line>
<line>When I perhaps compounded am with clay,</line>
<line>Do not so much as my poor name rehearse;</line>
<line>But let your love even with my life decay;</line>
</quatrain>
<couplet>
<line>Lest the wise world should look into your moan,</line>
<line>And mock you with me after I am gone.</line>
</couplet>
</poem>
In the earlier TAG example of this sonnet, a Markup-to-Text hyperedge defines the
tail
as a set of Text nodes and labels (datatypes) it. In the TAG version of this example,
all
quatrain and line Markup-to-Text hyperedges point to sets of Text nodes, and containment
is
modeled by subset relations among the Text-node tails of those hyperedges. Where the
Text
nodes that constitute the tail of a Markup-to-Text hyperedge with the name
property (on the Markup node) of line
form a proper subset of the Text nodes
that constitute the tail of a Markup-to-Text hyperedge with the name
property
(on the Markup node) of quatrain
, the quatrain contains the line. In this emphasis on containment, rather than dominance, TAG is similar to flat
LMNL, except that LMNL ranges must be continuous (LMNL handles discontinuity separately),
while contiguity is not relevant in defining the set of Text nodes that may serve
as the
tail of a hyperedge in TAG (see the discussion of Discontinuity, above). In the TAG
version
we have chosen not to make the three quatrains and the couplet what in XML terms would
be
children of a root <poem>
element, but we could, should we wish, create a
Markup-to-Text hyperedge with a name
property value (on the Markup node) of
poem
. This could point, through a hyperedge, to the set of all Text nodes
in the poem, which would let us model containment. It could also serve as the head
of a
Markup-to-Markup hyperedge from it to the Markup nodes with quatrain
and
couplet
name
property values, which would let us model dominance. In other words, Markup-to-Text nodes model containment, rather than dominance
(indirectly, through subset properties of the Text nodes to which they point), and
where it
is important to distinguish dominance from containment, the TAG model supports this
through
Markup-to-Markup hyperedges.
One final consequence of the XML conflation of containment and dominance is that when
exactly the same text must be tagged in two ways simultaneously, XML requires one
of the
elements to contain the other. But, as was noted above, if a Markup node with the
name
property value of paragraph
and a Markup node with the
name
property value of quotation
both point to exactly the
same set of Text nodes, in TAG it does not make sense to ask whether the paragraph
contains
the quotation or the quotation contains the paragraph because containment in TAG is
defined
as a proper subset relationship among sets of Text nodes. Whether a paragraph consists
of a
quotation or a quotation consists of a paragraph is a reasonable question, but in
TAG it is
a question of dominance, expressed through Markup-to-Markup hyperedges, and not of
containment, expressed (indirectly) through Markup-to-Text hyperedges.
Artifactual hierarchy
As we described above, in XML, markup is (among other things) a form of datatyping,
and
the XML spec uses the word type
explicitly in this meaning:
Each element has a type, identified by name, sometimes called its generic
identifier
(GI) [W3C XML §3]
This means that when XML assigns a type to part of a document by making it an
element, it simultaneously creates an element node, which pushes the textual content
down a
level in the document hierarchy. Consider the following XML structure, represented
here in
markup and as hierarchy:
<title><name>Romeo</name> and <name>Juliet</name></title>
If we wish to specify in XML that the first and third words of the title are of type
name
, we can tag them as elements of that type, with the result that the
Text nodes they contain wind up on a different level of the hierarchy than the conjunction
between them. This contradicts our intuition that the title contains three words, two of which
have the type name
, replacing it with a model in which the title contains two
objects of type name
with a word between them, and it is the
name
objects that contain the first and third words.
Because TAG separates the use of markup in hierarchy and its use for datatyping, it
is
possible to assign a type to text without distorting the hierarchy. Here is the TAG
representation of the same content:
As illustrated in the example above, markup of Text nodes in the TAG model, unlike
in
XML, does not create a hierarchical layer as a side effect of datatyping. As we have
seen
earlier, it is possible to represent hierarchy in TAG, but it is not an inescapable
consequence of all markup, as it is in XML.
White space as crypto-overlap
In natural language processing, tokenization is the process of breaking up a string
of
plain text characters into substrings (typically words and punctuation, which may
be
adjacent or separated by white space), often while removing token separators in the
process.
Tokenization of plain text in XML is performed using regular expressions and the
tokenize()
function, but tokenize()
atomizes its first argument,
which means that it cannot be used on tagged text without losing the markup in the
process.
Even tokenization that would not create overlap-based well-formedness violations,
such as
splitting and tagging the words of a line of poetry in which the stressed vowels are
tagged
as <stress>
(see the illustration below), requires intermediary temporary
manipulations, such as converting the markup to text, tokenizing with
tokenize()
, and then converting the temporary text back into markup, or
adding additional markup, tokenizing with <xsl:for-each-group>
, and then
removing the temporary markup.
The reason tokenizing tagged text is awkward in XML even where overlap is not a risk
has
one explanation in terms of the syntax and another in terms of the data model. In
terms of
the syntax, the markup and text are intertwined in a way that makes it impossible
to ignore
markup during tokenization while retaining access to it after the process is complete.
In
terms of the data model, as noted above, tagging the stressed vowels in a line of
verse
pushes their textual content down a level in the hierarchy, so the line no longer
forms a
string. Furthermore, although it is not usually described this way, the use of white
space
to separate words may be understood as pseudo-markup, which means that the words in
tagged
text potentially represent overlapping hierarchies in plain-text disguise.
In TAG, however, Markup nodes on one layer point to Text nodes on another layer, one
that contains nothing but Text nodes, which makes it possible to tokenize the text
without
interference from the markup. The tokenization splits larger Text nodes into smaller
ones,
but they remain in the tail of their old Markup-to-Text hyperedges, while new Markup-to-Text
hyperedges are added to tag the new individual words. In the simplified illustrations
below,
we have created a poem
that consists entirely of a single three-word line
(No longer mourn
). In the firt of these illustrations, the stressed vowels
are tagged but the words are not:
Because stress is marked on a single vowel sound, XML would be capable of tagging
the
individual words while retaining the stress markup, since no overlap would result.
For that
reason, the following XML representation, which tags both words and stressed vowels,
is well
formed:
<line>
<word>No</word>
<word>l<stress>o</stress>nger</word>
<word>m<stress>ou</stress>rn</word>
</line>
Yet if we try to use tokenize()
in a transformation to add the
<word>
markup to a line that already contains the
<stress>
markup, the <stress>
markup will be lost
during atomization.
This situation is not a challenge for TAG. In the example below, we have added
Markup-to-Text nodes to tag the words, which can be determined by tokenizing the text
on
white space. Tokenization is possible because the Text nodes are not interrupted by
the
markup, which points to them without being inserted between them (syntactically) and
without
pushing them to different levels of the hierarchy (in the tree structure):
The additional markup requires additional division of Text nodes, but all modifications
are local, and the only part of the graph that has to be updated is the part to which
the
markup is being added.
The preceding example does not create overlap because the Text nodes that are marked
up
for stress are subsets of those that are marked up as words. But if we also want to
tag
poetic feet, which are needed to identify caesura (a regular coincidence of word and
foot
boundaries in the lines of a poem), overlap would become an issue in XML. One work-around
in
an XML environment has turned out to involve, surprisingly, tagging neither the feet
nor the
words (see [Birnbaum and Thorsen 2015]), deriving both from other properties of the
line during processing, but the fact that we can use white-space pseudo-markup to
escape the
consequences of syntactic overlap doesn’t mean that the the overlap isn’t there. A
data
model that can represent both feet and words explicitly, and that could identify caesura
as
a relationship between those two types of structural components, would represent explicitly
the human understanding of caesura, and the explicit representation of structure is
much of
what markup is all about. In the illustration below, we have added foot markup to
the
previous example:
A corresponding XML-like structure that tags words and feet would not be well formed
because the <foot>
elements would overlap with the
<word>
elements:
<line><foot><word>no</word><word>lon</foot><foot>ger</word><word>mourn</word></foot></line>
In XML, [t]he identification of caesura requires the identification of both feet
and words, which are not coextensive and which frequently overlap. The challenge,
then, is
to locate where foot and line boundaries coincide without employing markup in a way
that
would violate well-formedness overlap constraints.
[Birnbaum and Thorsen 2015] In TAG, where overlap is not an issue, caesura is possible when two adjacent Text
nodes
are in the tails of different Markup-to-Text word
hyperedges and different
Markup-to-Text foot
hyperedges. Caesura is typically 1) at or near the middle
of the line, and 2) implemented consistently, so not every coincidence of word and
foot
boundaries proclaims a caesura; that coincidence is necessary, but not sufficient.
Scope of reference
Footnotes can be understood as annotations on text, but in XML they are typically
represented by elements at the location where the note reference should occur in a
reading
text, as with the <footnote>
element in DocBook or the
<note>
element in TEI. Anchoring a footnote at a point in the text
stream, instead of as an annotation on a string of (possibly tagged) text with a beginning
and an end, is problematic because it does not mark explicitly the scope of the note,
such
as whether a footnote reference at the end of a paragraph points to the preceding
sentence
or the preceding two sentences or more, or to the entire paragraph. The TEI
<note>
element avoids this limitation because it can point to an
arbitrary target with XPointer, but this stand-off strategy is an indirect way of
specifying
what might have been represented more immediately as an attribute if XML attributes
were
able 1) to model rich content, and 2) to annotate something without being forced to
give it
a generic identifier that specifies its type.
TAG avoids the XML prohibition against markup in attribute values because in TAG the
Text nodes of an annotation can be a target of markup, just like those of the main
text. TAG
avoids the scope of reference problem because the annotation can point to a Markup
node with
a name if an appropriate one exists (such as paragraph in a document that marks up
paragraphs). In the example below, because TAG permits anonymous Markup nodes (that
is,
because the name
property of Markup nodes is optional), we annotate arbitrary
text without giving it the equivalent of an XML generic identifier, although in a
revision
currently under development, we are exploring pointing directly from the annotation
to the
Text nodes, which would obviate the need for the anonymous Markup node. With either
of these
approaches, footnote-like relationships can be modeled in TAG as what they are: rich-text
annotations on text regardless of whether the target of the annotation corresponds
to a
Content Object with an identifiable type. TAG is similar to LMNL in this respect,
except
that in TAG text being footnoted that is discontinuous is no different from continuous
text;
it is a set of Text nodes that constitute the tail of a Markup-to-Text hyperedge.
In the simplified example below, we add a footnote to the second and third lines of
a
poem by using an Annotation node (orange) to point to a Markup node (violet) that
is the
head of an anonymous Markup-to-Text hyperedge, and the text of the annotation also
has
markup (a sky blue Markup node with a name
property of emphasis
points to a single Text node). Neither of these features is available with attribute
markup
in XML because elements must have generic identifiers (= cannot be anonymous) and
attribute
values cannot contain markup. And if the footnote target happens to be something that
would
create overlap in XML (e.g., if it runs from the middle of one line to the middle
of another
and the lines have been tagged explicitly), XML is further encumbered by the prohibition
against overlap.
Insofar as a footnote can be considered metadata about text, the structure illustrated
above represents it as an annotation, but it does not require us to assign a type
to the
target of the annotation as a side effect of referring to it, and it allows us to
add markup
to the footnote text itself.
Data model versus syntax
Syntax is not necessarily the same as a data model. A data model could, at least in
principle, be serialized in multiple ways, and syntax developed to represent one data
model
could be coopted to represent a different one. TAG does not at present have its own
serialization syntax, and the Alexandria Markup implementation described below can
read and
write LMNL sawtooth syntax and TexMECS (parsing the results as a representation of
the TAG
data model, rather than of LMNL or GODDAG), and it is intended to be able to do the
same
with XML syntax.
One challenge of comparing TAG to XML, LMNL, GODDAG, and TexMECS is that TAG, like
LMNL
and GODDAG, is a data model, while XML and TexMECS are defined by their syntax. Perhaps
a
bit surprisingly in the context of Balisage, which describes itself as the markup conference
, our focus here is not on markup (that
is, on syntax and serialization), but on the data models that may be expressed through
markup, which means that for comparative purposes we may sometimes need to infer a
data
model from a syntactic specification.
The situation is especially complicated in the case of XML because although it does
not
have a data model, it also has three almost-data-models: XML DOM [W3C DOM], which is an object model and API; the XML InfoSet [W3C XML InfoSet],
which is an information model; and XDM [W3C XDM], which is a data model
for processing XML. Our inferred data model for XML for comparative purposes here
includes
the seven node types specified in XDM (not the twelve of XML DOM or the eleven types
of
information items of the XML InfoSet), along with the structural properties of the
ordered
tree that are relevant for understanding (but not necessarily adequate for processing)
well-formed XML (e.g., attribute nodes on an element are unordered). Our aim is not
to
create a data model for XML, which lies far outside the scope of this paper, but to
identify
features of the way XML models text that can be used comparatively to help elucidate
features of TAG.
The fact that some of our objects of comparison are serializations and others are
data
models matters because, as the etymology of the term implies, serialization is an
ordered
linear expression, which is not a requirement of data models. If, for example, a paragraph
is exactly coextensive with a quotation, in XML syntax, LMNL sawtooth syntax, and
TexMECS
syntax, the start tag of either the paragraph or the quotation must come first in
linear
order. But in LMNL the relative order of the ranges defined by the tags is not an
obligatory
part of the model, which permits two ranges to begin at the same location in the text,
and
the same is true of TAG. In XML, however, one element must be the parent of the other,
and
the order of the start tags reflects both containment and hierarchy. TexMECS negotiates
this
issue by using different start- and end-tag delimiters to distinguish when the relative
order of the tags is informational and when it is not.
Semantics versus application level
Another challenge for text modeling involves distinguishing properties that inhere
in
the structure of the text being modeled from those that depend on semantics that must
be
interpreted at a higher (application) level. A failure to make this distinction may
have two
types of consequences (which are really aspects of the same thing, the delegation
of
information that should be part of the model to the application layer): either the
application must know that some properties of the model are not informational and
are to be
ignored, or the application must know that there is information that is not represented
entirely by the model and must therefore be added during processing. If, however,
the model
explicitly represents the structural properties of the text and nothing else, the
application level is freed from having to supplement the model, and can concentrate
on
features that are truly application-specific.
Moving structural information out of the application layer and into the model is a
priority in the design of TAG, and here are two illustrations of the issue:
-
The pairing of start and end tags in XML markup is inherent in the markup itself,
and is available during parsing with no reference to semantics. In contrast, the
pairing of XML milestones that are used to simulate container tags as a work-around
for overlap (see the discussion of Trojan markup in DeRose 2004)
depends on semantics. XML applications do not need to know that regular start and
end
tags delimit an element because that information is an inalienable feature of all
XML
documents that is fully specified by the syntax, but they do need to know when empty
tags are being used to simulate the beginning and end of a content object and when
they are not, or which pseudo-start-tags are to be associated with which
pseudo-end-tags. A robust and efficient strategy would represent all structural
features as parts of the model itself, instead of requiring that some of them be
handled through semantic information that is available only at the application
level.
-
Because XML models an ordered hierarchy, element always have order, which requires
the application layer to distinguish situations where order is semantically meaningful
from situations where it isn’t. For example, the TEI <choice>
element has the semantics of associating content objects that do not have a natural
order with respect to one another, such as an abbreviation and its expansion or an
error and its correction. How those should be rendered is the proper business of the
application layer, but the XML model requires that one option proceed or follow the
other even when the order does not represent an inherent, informational property of
the text being modeled. This has the undesirable consequence that, incorrectly (from
the perspective of what the marked-up text means), an XML processor will regard two
TEI documents as different if they differ only in the order of the children of their
<choice>
elements unless the processor is given access to TEI
markup semantics. Imposing an arbitrary order as a schema enhancement (for example,
requiring that an abbreviation always precede its expansion inside a TEI
<choice>
element) will avoid the problem of distinguishing when
two documents should be considered the same or different, but at the cost of making
order informational in some situations and arbitrary in others, that is, of imposing
order on something that is not inherently ordered. A more robust and efficient model
would not specify order when it must then be ignored, so that a processor will know
when order is informational and when it is not from the model, without recourse to
semantics.
Concerning the first of these issues, matching up pseudo-start-tags with pseudo-end-tags
during processing does not arise in TAG not only because TAG does not at present have
its
own syntactic expression (although we can represent some features of TAG by borrowing
LMNL
sawtooth syntax or TexMECS), but also because the fact that TAG permits overlap makes
such
workarounds unnecessary. The second issue is more challenging, and because TAG currently
models text as a single chain of Text nodes, it does not yet distinguish situations
where
order is not informational. But because that is a feature of what text is (that is,
because
the first O of OHCO is as much an issue as the H that follows it), it is a design
requirement that we intend to address as development continues (see Appendix B).
Works cited
[Birnbaum and Thorsen 2015] Birnbaum, David J.,
and Elise Thorsen. Markup and meter: Using XML tools to teach a computer to think about
versification.
Presented at Balisage: The Markup Conference 2015, Washington, DC,
August 11–14, 2015. In Proceedings of Balisage: The Markup Conference
2015. Balisage Series on Markup Technologies, vol. 15 (2015). DOI:
10.4242/BalisageVol15.Birnbaum01. https://www.balisage.net/Proceedings/vol15/html/Birnbaum01/BalisageVol15-Birnbaum01.html
[Bleeker 2017] Bleeker, Elli. Mapping
invention in writing: digital infrastructure and the role of the genetic editor.
PhD
dissertation, University of Antwerp, 2017.
[CollateX] CollateX.
https://pypi.python.org/pypi/collatex
[Coombs et al. 1987] Coombs, James H., Allen H.
Renear, and Steven J. DeRose. Markup systems and the future of scholarly text
processing.
Communications of the association for computing machinery,
30.11 (Nov. 1987): 933–47. doi:10.1145/32206.32209
[DeRose 2004] DeRose, Steven J. Markup
overlap: a review and a horse.
Extreme Markup Languages 2004.
http://xml.coverpages.org/DeRoseEML2004.pdf
[DeRose et al. 1990] DeRose, Steven J., David G.
Durand, Elli Mylonas, and Allen H. Renear. What is text, really?
, Journal of computing in higher education, 1.2 (1990): 3–26.
doi:10.1007/BF02941632.
http://www.cip.ifi.lmu.de/~langeh/test/1990%20-%20DeRose%20-%20What%20is%20Text,%20really%3F.pdf
[Hilbert, Schonefeld, and Witt 2005] Hilbert,
Mirco, Oliver Schonefeld, and Andreas Witt. Making CONCUR work.
Proceedings of Extreme Markup Languages 2005.
http://conferences.idealliance.org/extreme/html/2005/Witt01/EML2005Witt01.xml#Horse
[Huitfeldt and Sperberg-McQueen 2003] Huitfeldt,
Claus and C. Michael Sperberg-McQueen. TexMECS. An experimental markup meta-language
for complex documents.
Revision of 5 October 2003.
http://mlcd.blackmesatech.com/mlcd/2003/Papers/texmecs.html
[Ide and Suderman 2007] Ide, Nancy and Keith Suderman.
GrAF: a graph-based format for linguistic annotations.
Proceedings of the
Linguistic Annotation Workshop, held in conjunction with ACL 2007, Prague, June 28–29,
1–8.
https://www.cs.vassar.edu/~ide/papers/LAW.pdf
[LMNL data model] LMNLWiki. LMNL data
model.
From the Lost Archives of LMNL.
http://www.lmnl.net/prose/data-model/data-model-spec.html
[Peroni et al. 2014] Peroni, Silvio, Francesco Poggi
and Fabio Vitali. Overlapproaches in documents: a definitive classification (in OWL,
2!).
Presented at Balisage: The Markup Conference 2014, Washington, DC, August 5 -
8, 2014. In Proceedings of Balisage: The Markup Conference 2014.
Balisage Series on Markup Technologies, 13 (2014). DOI:
10.4242/BalisageVol13.Peroni01.
https://www.balisage.net/Proceedings/vol13/html/Peroni01/BalisageVol13-Peroni01.html
[Piez 2008] Piez, Wendell. LMNL in miniature.
An introduction.
Amsterdam Goddag Workshop, 1–5 December 2008.
http://piez.org/wendell/LMNL/Amsterdam2008/presentation-slides.html
[Piez 2010] Piez, Wendell. Towards hermeneutic
markup. An architectural outline.
Presentation at Digital Humanities 2010, King’s
College, London. http://piez.org/wendell/papers/dh2010/ The screen shot in this
paper is taken from
http://piez.org/wendell/papers/dh2010/clix-sonnets/ozymandias-map.svg.
[Piez 2014] Piez, Wendell. Hierarchies within
range space: From LMNL to OHCO.
Presented at Balisage: The
Markup Conference 2014, Washington, DC, August 5 - 8, 2014. In Proceedings of Balisage: The Markup Conference 2014. Balisage series on markup
technologies, vol. 13 (2014). DOI: 10.4242/BalisageVol13.Piez01.
http://www.balisage.net/Proceedings/vol13/html/Piez01/BalisageVol13-Piez01.html
[Renear, Mylonas, and Durand 1996] Renear, Allen H.,
Elli Mylonas, and David G. Durand. Refining our notion of what text really is: the
problem of overlapping hierarchies.
Research in humanities computing, ed. Nancy Ide and Susan
Hockey. Oxford: Oxford University Press. 1996.
http://cds.library.brown.edu/resources/stg/monographs/ohco.html
[Sperberg-McQueen and Huitfeldt 2008] Sperberg-McQueen, C. M., and Claus Huitfeldt. “Containment and dominance in Goddag
structures”. Talk at conference on "Processing Text-Technological Resources", Center
for
interdisciplinary research, University of Bielefeld, March 2008. http://cmsmcq.com/2008/bielefeld/slides.html#(1)
[Sperberg-McQueen and Huitfeldt 2000] Sperberg-McQueen, C. M. and Claus Huitfeldt. GODDAG: a data structure for overlapping
hierarchies.
Digital documents: systems and principles: 8th international conference
on digital documents and electronic publishing, DDEP 2000, 5th international workshop
on the
principles of digital document processing, PODDP 2000, Munich, Germany, September
13–15,
2000, revised papers, ed. Peter King and Ethan V. Munson. NY: Springer, 2004,
139–60. A revised version is available at
http://cmsmcq.com/2000/poddp2000.html
[Sperberg-McQueen and Huitfeldt 2008a] Sperberg-McQueen, C. M., and Claus Huitfeldt. .Containment and dominance in Goddag
structures
Presented at Processing text-technological resources, Bielefeld, March
13-15, 2008, organized by the Zentrum für interdisziplinäre Forschung der Universität
Bielefeld. Slides (but not full text) available on the Web at
http://www.w3.org/People/cmsmcq/2008/bielefeld/slides.html
[Sperberg-McQueen and Huitfeldt 2008b] Sperberg-McQueen, C. M. and Claus Huitfeldt. Markup Discontinued: Discontinuity in
TexMecs, Goddag structures, and rabbit/duck grammars.
Presented at Balisage: The Markup Conference 2008, Montréal, Canada,
August 12 - 15, 2008. In Proceedings of Balisage: The
Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008).
DOI: 10.4242/BalisageVol1.Sperberg-McQueen01.
http://www.balisage.net/Proceedings/vol1/html/Sperberg-McQueen01/BalisageVol1-Sperberg-McQueen01.html
[Tennison 2008] Tennison, Jeni. Overlap,
containment and dominance. Jeni’s musings,
2008-12-06.
http://www.jenitennison.com/2008/12/06/overlap-containment-and-dominance.html
[TEI Genetic editions] TEI WG-GE. An encoding model
for genetic editions.
s
[W3C DOM] W3C. What is the Document Object
Model?
Document Object Model (DOM). Level 2 Core Specification. Version
1.0
https://www.w3.org/TR/DOM-Level-2-Core/introduction.html
[W3C XML] W3C. Extensible Markup
Language (XML) 1.0 (fifth edition).
http://www.w3.org/TR/xml/
[W3C XML InfoSet] W3C. XML
Information Set (second edition).
https://www.w3.org/TR/xml-infoset/
[W3C XDM] W3C. XQuery and XPath
Data Model 3.1.
https://www.w3.org/TR/xpath-datamodel-3/#Node