WendellPiezTEI2011


Wendell Piez

Slides for my talk are here: Media:Interoperability-10mins-slides.pdf

Affiliation

Mulberry Technologies, Inc.

I've been involved in Humanities Computing and Digital Humanities since the mid-1990s. Since 1998 I have worked for a small consulting firm in Rockville, Maryland, providing design, implementation and training services mostly to not-for-profit organizations in the STM (scientific/technical/medical) publishing industry. My perspective on data interchange and interoperability is influenced not only by theoretical interests, but by our clients' experiences: to them, the adoption of a robust standard can be an important enabler and cost saver -- while, at the same time, it can be risky if the standard is not well fitted to their operations and business needs, or if the costs of adopting the standard exceed the costs of "going it alone", including dealing with interchange requirements ad hoc.

The Paradox of Interoperability

The attraction of interoperability is simple but profound: by agreeing in advance with partners on formats and protocols, we can save significant effort (and money), sometimes enough to make an objective achievable that would otherwise be prohibitively difficult. In particular, when such a standard becomes part of the infrastructure (as has happened with Internet protocols or wireless technology), it enables the development of higher-level technologies whose users do not have to be expert in, or even aware of, the lower-level bases of what they are doing.

Yet at the same time, a "standard" also means that we agree to be constrained indefinitely by a set of rules that may or may not be of any direct use to us, and in fact may be prohibitive in their own ways. If nothing else, standards mean (or should mean) backward compatibility, a design goal that amounts, in effect, to deciding in advance to live with our mistakes, even in the light of better experience. In other words, a requirement for interoperability and even standards in support of data interchange can be significant impediments to experiment and innovation.

Whether this is a problem, or an essential feature of a stable platform (or both), depends on who you are and how happy you find yourself with the status quo. In order to keep this in mind, it helps to recall that one of the most important early motivations for descriptive markup formats in general, and for the Text Encoding Initiative in particular, was to avoid proprietary lock-in, which is to say a passively enforced, universal dependence on a proprietary data format such as Microsoft Word -- a lamentable situation (or so we think) that would not come about if it weren't for a general requirement for interoperability.

Thus it is partly in consequence of its serving another set of requirements, namely to provide an opportunity for individual projects to customize and develop formats to meet their own particular needs, that the TEI has failed to provide for interoperability or even interchange beyond a certain very rudimentary level. To be sure, the interoperability we achieve through being able to use a commodity toolkit, including parsers, transformation processors, code libraries in many languages, and editing tools, is not insignificant (XML and related technologies by themselves give us a great deal of interoperability); and Guidelines and common validation technologies such as schemas are by no means useless. Yet this nonetheless falls far short of the effortless interchange that has sometimes (even if naïvely) been promised as a reward for an investment in the technology. As far as the TEI is concerned, this is the crux of the matter. Exchanging data between our systems with neither loss of semantics, nor the need to negotiate conversions or transformations across the boundaries -- "blind interchange" -- is challenging even for XML users whose goals do not include describing things in XML that have never been properly described before. If it is especially difficult for members of the Text Encoding Initiative, that may be an indication that we are succeeding in (some of) what we have set out to do, even while we may be failing in something else.

In fact, this should not come as a surprise if we reflect that the purpose of a common format or language is to enable us, as individuals, projects and institutions, to communicate something different from what others are saying. Even if a common language makes it intelligible in principle, communication does not typically come for free. To understand the "meaning" of an utterance or text has always required human engagement in a process of interpretation. And this is all the more so if the language is not yet common (or even if the way we are using a purportedly common language is not actually very common). What we find with interoperability in the information processing domain is only the same problem we have in other areas. If we have to reinvent the language, or the medium, to get our point across, we take on the related burdens of being often misunderstood (failing to interoperate); of having to teach others, if only by example, our new form of expression, in order to be able to interoperate on a new basis; and of accepting, sometimes, that we have fallen short, and trying something different.

With this paradox in view, it seems to me that interoperability is achievable to exactly the extent that participants are able to commit to it, and not by rhetorical gestures but in the actual design and implementation of projects and technologies. In other words, interoperability is not something we get by talking about it; instead, we have to build, travel and improve the actual communication pathways that will make it possible. Nor should we be confused about the impact of innovation on data interchange or the potential for it. Interchange is impeded by every new tag or tag usage profile -- even or especially if we promote it as part of the lingua franca, with the result that everyone now has to deal with it, paying the costs of doing so.

Interoperability is not free; it both requires investment (over and above the work we put into local solutions to local problems) and imposes opportunity costs. Given this, it is not difficult to see why many parties who promote interoperability as an abstract goal may not actually want it badly enough to be willing to make the sacrifices required, even if they are able to.

Supporting interoperability with a TEI Core, microformats and an HTML binding

On the other hand, we now have enough experience with XML generally and TEI in particular for us to know what an interoperable format for document description (or at any rate, a more interchangeable one) should look like. A core TEI tag set or "Interchange Profile" could be offered -- if not by the TEI Consortium itself, then by another organization (even an informal one). It would be important that this tag set (a maximum, say, of 100-120 element types with clean content models on a simple architecture) be specifically limited to general purposes. In particular, this means it would exclude tags designed to address requirements for particular applications, or even to describe textual structures not common across the broadest spectrum of documents in scope. Provisions for such special cases could be made through two mechanisms. The format could include "abstract generic elements" (as described in my 2011 Balisage paper at http://balisage.net/Proceedings/vol7/html/Piez01/BalisageVol7-Piez01.html), such as TEI ab and seg, provided as "control points for extension" in the instance. (This phrasing is Michael Sperberg-McQueen's. Such a platform for ad hoc microformats might be considered, if you like, as inside-out architectural forms.) And of course, projects would be free to extend into other namespaces to deploy richer tagging for purposes special to themselves or to any niche communities with which they also wanted to work.
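Purely by way of illustration, here is a minimal sketch of how such control points for extension might look in an instance. The element names ab and seg are TEI's; the type values, and the project conventions they imply, are invented for this example:

 <div xmlns="http://www.tei-c.org/ns/1.0">
   <head>Letter of 12 March 1867</head>
   <!-- generic block element carrying a project-defined microformat via @type -->
   <ab type="letterhead">Office of the Surveyor General</ab>
   <p>Dear Sir, I have the honour to report
     <!-- generic phrase element marking a distinction the core does not name -->
     <seg type="emphatic">at once</seg> on the state of the works.</p>
 </div>

A processor that knows only the core sees well-formed blocks and phrases; a processor that also knows the project's conventions can do more with them.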

Assuming its content models are not prohibitively strict, such a tag set could be both lightweight (or at least relatively so) and versatile in application. But because it would provide a narrow target, projects that used it would be assured of a level of interchange among themselves, if not perfectly "blind", then at least better than that now offered by TEI-all. And when, in order to deploy application-level functionality and/or richer descriptions of content, projects did improvise microformats on top of generic elements or include elements in different namespaces, such usage would be evident, open and readily inspected.
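For example (again only a sketch, with the extension namespace and element name invented for illustration), richer tagging carried in a project's own namespace remains plainly visible to, and safely ignorable by, any processor that understands only the core:

 <p xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:myproj="http://example.org/ns/myproject">
   The survey party reached
   <myproj:placeRef key="local-001">the upper ford</myproj:placeRef>
   on the third day.
 </p>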

Again, assuming its design were solid, this "Core TEI" could also be mapped to HTML (XHTML and/or HTML5), with bindings back to TEI elements expressed in @class attributes as a microformat. (See DougReside2011.) I do not see this as an alternative solution to a reduced TEI, but as a complementary activity. Given well-defined specifications on both sides, transformations in both directions should be feasible. Additionally, an HTML profile or "reflection" of TEI Core (in which most TEI semantics would necessarily be asserted by attribute values rather than element types) could also be validated to an extent, for example with Schematron or XProc (by means of a pipeline of transformation and validation).
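A rough sketch of what such a reflection and one validation rule for it might look like: the class-name convention ("tei-div", "tei-p", "tei-seg"), the data-type attribute and the rule itself are assumptions for illustration, not an existing specification.

 <!-- HTML carrying TEI semantics in @class (hypothetical convention) -->
 <div class="tei-div">
   <p class="tei-p">Dear Sir, I have the honour to report
     <span class="tei-seg" data-type="emphatic">at once</span> on the state of the works.</p>
 </div>

 <!-- a Schematron pattern enforcing one point of that convention -->
 <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
   <sch:ns prefix="h" uri="http://www.w3.org/1999/xhtml"/>
   <sch:pattern>
     <!-- simplified: a real rule would tokenize @class rather than match it whole -->
     <sch:rule context="h:span[@class = 'tei-seg']">
       <sch:assert test="@data-type">
         An element standing for TEI seg should declare which kind of seg it reflects.
       </sch:assert>
     </sch:rule>
   </sch:pattern>
 </sch:schema>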

Ideally, such a TEI Core would be derived from subset formats already defined at TEI for various purposes (there are at least two or three obvious candidates), and actively promoted from within the TEI. It would be important, however, that it be assigned its own namespace, for purposes of transparency (although it is worth noting in passing that wholesale conversion of documents from one namespace into another is not now very difficult, as described in my JATS-Con paper this year; see http://www.ncbi.nlm.nih.gov/books/n/jatscon11/piez/). In support of this activity, TEI could in fact provide its own SIGs and internal initiatives with namespaces for their own experiments, prototypes, and "micro-standards", presenting guidelines for their use alongside the core by any users who need their extended functionalities.
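To give a sense of how mechanical such a wholesale namespace shift can be, here is one possible XSLT sketch (mine for this page, not taken from that paper); the target "core" namespace URI is invented for illustration, while the TEI namespace is the real one:

 <xsl:stylesheet version="2.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:tei="http://www.tei-c.org/ns/1.0">
   <!-- rewrite every element in the TEI namespace into a (hypothetical) core namespace -->
   <xsl:template match="tei:*">
     <xsl:element name="{local-name()}" namespace="http://example.org/ns/tei-core">
       <xsl:copy-of select="@*"/>
       <xsl:apply-templates/>
     </xsl:element>
   </xsl:template>
   <!-- copy everything else (attributes, text, comments, processing instructions) unchanged -->
   <xsl:template match="@*|node()">
     <xsl:copy>
       <xsl:apply-templates select="@*|node()"/>
     </xsl:copy>
   </xsl:template>
 </xsl:stylesheet>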

Yet part of the appeal of this idea in my view is that in its design, it does not actually require support from TEI itself, but could be developed as an entirely independent initiative. Of course, this is merely a formal advantage, the point being that this activity in no way contradicts any others already underway at the TEI, and indeed complements them. It is not, in fact, very likely that developers outside the TEI would undertake it, in part because they also have other options for data formats supporting documentary interchange (including HTML, Docbook, DITA, and the NLM/NISO formats, to name only the most prominent). Yet, since those who require interoperability and interchange will be negotiating such solutions for themselves in any case, it is to be hoped that the TEI Consortium would help to organize and promote it.

Some relevant papers from Balisage 2011

Bauman, Syd. “Interchange vs. Interoperability.” Presented at Balisage: The Markup Conference 2011, Montréal, Canada, August 2-5, 2011. In Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, vol. 7 (2011). doi:10.4242/BalisageVol7.Bauman01 (http://balisage.net/Proceedings/vol7/html/Bauman01/BalisageVol7-Bauman01.html).

Kimber, Eliot. “DITA Document Types: Enabling Blind Interchange Through Modular Vocabularies and Controlled Extension.” Presented at Balisage: The Markup Conference 2011, Montréal, Canada, August 2-5, 2011. In Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, vol. 7 (2011). doi:10.4242/BalisageVol7.Kimber01 (http://balisage.net/Proceedings/vol7/html/Kimber01/BalisageVol7-Kimber01.html).

Piez, Wendell. “Abstract generic microformats for coverage, comprehensiveness, and adaptability.” Presented at Balisage: The Markup Conference 2011, Montréal, Canada, August 2-5, 2011. In Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies, vol. 7 (2011). doi:10.4242/BalisageVol7.Piez01 (http://balisage.net/Proceedings/vol7/html/Piez01/BalisageVol7-Piez01.html).