About text sources
From IntereditionWiki
This was put forward by Peter Robinson as a description of a possible framework for addressing text sources online.
Documenting texts and text sources for exposure and retrieval
1. Summary
This document provides proposed encodings for the following:
1. Unambiguous labelling of all text sources (all physical instances of texts, such as books and manuscripts, and all parts of those text sources), of all texts, and of all parts of all texts (2.1), and typing of all resources associated with the text sources and the texts (2.2)
2. Provision of metadata structures enabling exposure and retrieval of the resources associated with all text sources and all texts so labelled (3)
Among other functions, this system will support the following operations:
1. For any text source: retrieval of descriptions of the text source, of lists of its pages or other parts, with images of those pages, of lists of texts contained in the text source with transcripts of those texts, including the text on any one page or other part of the text source
2. For any text: retrieval of lists of all parts of the text, of lists of all text sources containing all or any part of the text, of transcripts of all or any part of the text in any text source or in any part of a text source.
For example: this system (when fully implemented!) will allow discovery of all manuscripts containing the first verse of the first chapter of St John’s Gospel in the Greek New Testament; all images of the pages of the manuscripts containing this verse; all transcripts of the text of those pages. The data segments so retrieved could be displayed or submitted to further processing (e.g. searching or collation).
The labelling here offered is implemented as an extension of the Text Encoding Initiative guidelines, in terms of the labelling of text sources and of texts, and as an extension of the Dublin Core implementation of OAI-PMH, for exposure of the resources related to the text sources and texts so labelled.
First version: submitted by Peter Robinson 20 October 2008.
2. Labelling and typing
2.1 Text source and text labelling
Following Kahn-Wilensky, we propose a two-part naming specification, for identifiers associated with each text source and each text, and for all segments within the text or text source:
1. A naming authority statement, declaring the body responsible for the naming
2. The name itself, expressed as a hierarchical sequence of key/value pairs.
These could be concatenated into a single string, such as:
“Auth=ITSEE/text=CT”: the whole of the Canterbury Tales, as defined by the naming authority ITSEE.
“Auth=ITSEE/text=CT/part=GP”: the General Prologue of the Canterbury Tales, as defined by the naming authority ITSEE.
“Auth=ITSEE/text=CT/part=GP/L=1”: the first line of the General Prologue of the Canterbury Tales, as defined by the naming authority ITSEE.
“Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=1”: the first verse of the first chapter of the Gospel of John in the Greek New Testament, as defined by the naming authority ITSEEINTF.
The above examples name texts. The same system will be used for identification of text sources, that is, a physical object which may contain text:
“Auth=ITSEE-INTF/textsource=01”: Codex Sinaiticus, as defined by the naming authority ITSEE-INTF
“Auth=ITSEE-INTF/textsource=01/quire=37/page=2r”: the second page recto of quire 37 of Codex Sinaiticus, as defined by the naming authority ITSEE-INTF
Within each naming string, the parts must represent a hierarchy: for texts, books contain chapters which contain verses; for text sources, a manuscript may contain quires which contain pages. We suggest that two fundamental types of object are here specified, as the key to the first element following the Auth key value pair: ‘text’ (in FRBR terms, the ‘work’; see http://www.frbr.org/), and ‘textsource’ (in FRBR terms, an ‘item’, a physical instance of the text).
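The two-part structure and the hierarchy rule can be illustrated with a small parser. This is only a sketch of one possible reading of the scheme: the function names parse_name and contains are hypothetical, and we assume that '/' separates pairs and '=' separates key from value, as in the examples above.

```python
from typing import NamedTuple


class Name(NamedTuple):
    authority: str
    pairs: tuple  # ordered (key, value) pairs, outermost first


def parse_name(label: str) -> Name:
    """Split 'Auth=X/key=value/...' into the authority and its ordered pairs."""
    pairs = []
    for part in label.strip().split("/"):
        if not part:  # tolerate a trailing slash
            continue
        key, sep, value = part.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"not a key=value pair: {part!r}")
        pairs.append((key, value))
    if not pairs or pairs[0][0] != "Auth":
        raise ValueError("a name must begin with an Auth=... statement")
    if len(pairs) < 2 or pairs[1][0] not in ("text", "textsource"):
        raise ValueError("the first key after Auth must be 'text' or 'textsource'")
    return Name(pairs[0][1], tuple(pairs[1:]))


def contains(outer: str, inner: str) -> bool:
    """True if `inner` names `outer` itself or a part nested within it."""
    a, b = parse_name(outer), parse_name(inner)
    return a.authority == b.authority and b.pairs[: len(a.pairs)] == a.pairs


verse = parse_name("Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=1")
print(verse.authority)  # ITSEEINTF
print(verse.pairs[-1])  # ('verse', '1')
print(contains("Auth=ITSEE/text=CT/part=GP", "Auth=ITSEE/text=CT/part=GP/L=1"))  # True
```

Because the pairs are ordered from outermost to innermost, containment reduces to a prefix comparison on the parsed pairs.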
See section 4 for examples of how these labels can be embedded in TEI documents. Alternatively, the key/value statements within the name could be mapped directly to XPath expressions for suitably structured documents.
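As an illustration of the XPath mapping, the sketch below assumes a document in which every level of the text is a nested <div> carrying type and n attributes. That document structure, and the function name name_to_xpath, are assumptions made for this example only; other encodings would need a different mapping.

```python
import xml.etree.ElementTree as ET


def name_to_xpath(label: str) -> str:
    """Map the pairs after Auth= and text= onto nested <div type n> steps."""
    pairs = [p.split("=", 1) for p in label.split("/") if p]
    # pairs[0] is the Auth statement and pairs[1] the text itself;
    # only the parts below the text become path steps.
    steps = [f"div[@type='{key}'][@n='{value}']" for key, value in pairs[2:]]
    return "./" + "/".join(steps) if steps else "."


# A toy document in the assumed structure (not a real transcript).
doc = ET.fromstring(
    "<body>"
    "<div type='book' n='4'><div type='chapter' n='1'>"
    "<div type='verse' n='1'>verse text</div>"
    "</div></div>"
    "</body>"
)

xpath = name_to_xpath("Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=1")
print(xpath)
print(doc.find(xpath).text)  # verse text
```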
2.2 Resource typing
The naming system proposed above allows us to associate an identifier with a resource, but makes no statement about the nature of that resource, beyond the typing ‘text’ and ‘text source’. Precise information is required about the resource so labelled: is it a description? An image, or a set of images? A transcription? An edition? Further, in the domain of scholarly editions, there are many editions, many transcripts, many images: and they are not alike.
Thus, a mechanism should be provided to specify, as exactly as appropriate, the nature of the resource. It is important that the terms of the description are explicitly defined, and do not simply presume a common understanding where none may exist. We propose that the same naming authority mechanism be used to specify the authority responsible for the terms used in the description.
As with resource naming, the naming authority and key/value pairs may be concatenated in a single string, as follows:
A transcript of a manuscript, expressed in XML and conformant to the intf-itsee schema:
“Auth=ITSEE-INTF/type=transcript/form=XML/schema=http://intf-itsee.schema”
A 256-level grey-scale digital image of a manuscript, at 300 dpi against the original, stored in jpg form at 60% compression:
“Auth=ITSEE-INTF/type=facsimile/source=microfilm/color=grey256/resolution=300dpi/form=jpg/compression=60”
A description of the manuscript, in the TEI P5 msDescription form:
“Auth=ITSEE-INTF/type=description/form=XML/schema=http://tei p5 msdescription”
An itemization of the textual contents of a manuscript, in the form of a <msContents> element within a TEI P5 msDescription structure:
“Auth=ITSEE-INTF/type=description/form=XML/schema=http://tei p5 msContents”
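One practical point when reading these type strings is that a schema value is a URL and so contains '/' itself. The sketch below handles this by assuming that schema, when present, is always the last key, as it is in the examples above; that rule and the function name parse_type are our assumptions, not part of the proposal.

```python
def parse_type(label: str) -> dict:
    """Parse a resource-type string into a dict of key/value pairs.

    Assumption: 'schema', when present, is the last key, so its value
    (a URL containing '/') runs to the end of the string.
    """
    head, found, schema = label.partition("/schema=")
    fields = {}
    for part in head.split("/"):
        if not part:
            continue
        key, eq, value = part.partition("=")
        if not eq:
            raise ValueError(f"not a key=value pair: {part!r}")
        fields[key.strip()] = value.strip()
    if found:
        fields["schema"] = schema.strip()
    return fields


facsimile = parse_type(
    "Auth=ITSEE-INTF/type=facsimile/source=microfilm/color=grey256/"
    "resolution=300dpi/form=jpg/compression=60"
)
transcript = parse_type(
    "Auth=ITSEE-INTF/type=transcript/form=XML/schema=http://intf-itsee.schema"
)
print(facsimile["type"], facsimile["resolution"])  # facsimile 300dpi
print(transcript["schema"])                        # http://intf-itsee.schema
```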
3. Metadata structures
This section suggests how the widely supported OAI-PMH mechanisms for exposing resource information might be used with the naming and typing scheme outlined above.
3.1 Identification of naming authority
We envisage that in this system, each naming authority is associated with a domain name and hence a URI, with support for the OAI-PMH protocols directed to that URI.
[Suggestion] The VMR naming authority should identify itself in its response to an OAI-PMH ‘Identify’ request, through use of a ‘description’ container labelled as a ‘name-authority’. The name-authority description container should reference a name authority schema, and should provide one or more <baseURL> elements pointing at URLs offering elaborations of the naming scheme.
[Suggestion] The ORE vocabulary specification mechanism may also be appropriate, as a means of structuring and exposing naming authority declarations.
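For orientation, an OAI-PMH request is an HTTP request to the repository's base URL with a verb parameter and, for some verbs, further arguments. The sketch below only builds the request URLs and makes no network call; the base URL is a hypothetical placeholder and oai_request is an illustrative helper name.

```python
from urllib.parse import urlencode

# Hypothetical base URL for a naming authority's OAI-PMH endpoint.
BASE_URL = "http://vmr.example.org/oai"


def oai_request(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    return f"{base_url}?{urlencode({'verb': verb, **arguments})}"


# Ask the repository to describe itself (including any description containers).
print(oai_request(BASE_URL, "Identify"))

# Ask for one record in Dublin Core form.
print(
    oai_request(
        BASE_URL,
        "GetRecord",
        identifier="oai:vmr:itsee:text:2008.02.0084",
        metadataPrefix="oai_dc",
    )
)
```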
3.2 Metadata for resources: record structure
We propose exposure of this architecture through OAI-PMH Dublin Core formatted records. This will permit satisfactory identification and retrieval of single objects associated with single identifiers within the scheme (in conformance with the standard requirements of the OAI-PMH protocol). It will also support more complex retrieval operations that depend on the mechanisms outlined here.
A full OAI-PMH response, returned in answer to a query or delivered to a harvester, has the following form:
<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2002-05-01T19:20:30Z</responseDate>
<request verb="GetRecord" identifier="oai:arXiv.org:hep-th/9901001" metadataPrefix="oai_dc">http://an.oa.org/OAI-script</request>
<GetRecord>
<record> ... </record>
</GetRecord>
</OAI-PMH>
Each <record> identifies a single resource (which might itself contain many further resources). Here is a sample record using Dublin Core, taken from the OAI-PMH documentation:
<record>
<header>
<identifier>oai:arXiv.org:cs/0112017</identifier>
<datestamp>2001-12-14</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<metadata>
<oai_dc:dc
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Using Structural Metadata to Localize Experience of
Digital Content</dc:title>
<dc:creator>Dushay, Naomi</dc:creator>
<dc:subject>Digital Libraries</dc:subject>
<dc:description>With the increasing technical sophistication of
both information consumers and providers, there is
increasing demand for more meaningful experiences of digital
information. We present a framework that separates digital
object experience, or rendering, from digital object storage
and manipulation, so the rendering can be tailored to
particular communities of users.
</dc:description>
<dc:description>Comment: 23 pages including 2 appendices,
8 figures</dc:description>
<dc:date>2001-12-14</dc:date>
</oai_dc:dc>
</metadata>
</record>
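A harvester would typically read such a record with an XML parser. The following is a minimal sketch using Python's standard library on an abbreviated copy of the sample record; the function name summarize is illustrative. Note that in a full response the <record> element sits inside the OAI-PMH default namespace, so the header paths would then need a namespace prefix as well.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Abbreviated copy of the sample record (descriptions omitted).
RECORD = """<record>
  <header>
    <identifier>oai:arXiv.org:cs/0112017</identifier>
    <datestamp>2001-12-14</datestamp>
    <setSpec>cs</setSpec>
    <setSpec>math</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Using Structural Metadata to Localize Experience of Digital Content</dc:title>
      <dc:creator>Dushay, Naomi</dc:creator>
      <dc:date>2001-12-14</dc:date>
    </oai_dc:dc>
  </metadata>
</record>"""


def summarize(record_xml: str) -> dict:
    """Pull the header fields and a few Dublin Core fields out of a record."""
    record = ET.fromstring(record_xml)
    dc = record.find("metadata/oai_dc:dc", NS)
    return {
        "identifier": record.findtext("header/identifier"),
        "sets": [s.text for s in record.findall("header/setSpec")],
        "title": dc.findtext("dc:title", namespaces=NS),
        "creators": [c.text for c in dc.findall("dc:creator", NS)],
    }


print(summarize(RECORD))
```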
Accordingly, the fragments in the examples below will be embedded in an <oai_dc:dc> element, within a <record> element. The examples below presume the use of an ‘xsi:type’ attribute to subtype Dublin Core elements. The values available for this attribute are determined in a ‘uid’ namespace, associated with a uid terms schema (to be developed), as well as by the dcterms schema. The uid namespace provides three possible values: textsource, text and type, corresponding to the three possible typing values of the naming system here defined. The ‘altname’ element is also used, drawn from the TEI msDesc encoding: this illustrates how the OAI-PMH records can mix elements from different namespaces.
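To show how a consumer might use the xsi:type subtyping, the sketch below groups the values in a fragment by their subtype. Since the uid terms schema is still to be developed, the uid namespace URI used here is a hypothetical placeholder, as are the URL of the transcript and the function name typed_values.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# The uid namespace URI and the transcript URL are placeholders.
FRAGMENT = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:uid="http://example.org/uid-terms">
  <dc:title>Mingana 10</dc:title>
  <dc:identifier xsi:type="uid:textsource">Auth=ITSEE/textsource=Mingana10</dc:identifier>
  <dc:identifier xsi:type="uid:text">Auth=ITSEEINTF/text=GNT</dc:identifier>
  <dc:type xsi:type="uid:type">Auth=ITSEE-INTF/type=transcript/form=XML</dc:type>
  <dc:identifier xsi:type="dcterms:URI">http://vmr.example.org/transcripts/Mingana10</dc:identifier>
</oai_dc:dc>"""


def typed_values(fragment_xml: str) -> dict:
    """Group element text by the literal value of the xsi:type attribute.

    The attribute value is compared as written (e.g. 'uid:text'), so this
    relies on the record using the same prefixes as the reader expects.
    """
    grouped = {}
    for element in ET.fromstring(fragment_xml):
        subtype = element.get(f"{{{XSI}}}type")
        if subtype is not None:
            grouped.setdefault(subtype, []).append(element.text.strip())
    return grouped


for subtype, values in typed_values(FRAGMENT).items():
    print(subtype, values)
```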
3.3 Sample OAI-PMH fragments
3.3.1 A manuscript description
<dc:title>Mingana 10</dc:title> <!-- the sigil or short title used for this ms -->
<msdesc:altname>10</msdesc:altname>
<dc:identifier xsi:type="uid:textsource">Auth=ITSEE/textsource=Mingana10</dc:identifier>
<dc:type xsi:type="uid:type">Auth=ITSEE-INTF/type=description/form=XML/schema=http://tei p5..</dc:type>
<dc:identifier xsi:type="dcterms:URI">http://url for the description</dc:identifier>
A title and an alternative name are given. The <dc:title> gives the conventional catalogue identifier; <msdesc:altname> gives an alternative name (the msdesc namespace must be declared for this record). The value of <dc:identifier xsi:type="uid:textsource"> is set to the identifier of the text source. The TEI element corresponding to this element has the attribute uid:textsource set to the same value, “Auth=ITSEE/textsource=Mingana10” (this is also, in this case, the content of the <idno> in the <teiHeader>). The <dc:type> element declares that the resource here indicated is a description of the manuscript, in TEI format, according to the schema given.
3.3.2 A set of images of the manuscript
<record>
<header>
<identifier>oai:vmr:itsee:text:2008.02.0084</identifier>
<datestamp>2002-05-01T14:16:12Z</datestamp>
</header>
<metadata>
<oai_dc:dc
xmlns:oai_vmr="our very own xml schema"
xsi:schemaLocation="where we can find it">
<dc:title>Mingana 10</dc:title>
<msdesc:altname>10</msdesc:altname>
<dc:identifier xsi:type="uid:textsource">Auth=ITSEE/textsource=Mingana10</dc:identifier>
<dc:type xsi:type="uid:type">Auth=ITSEE-INTF/type=facsimile/source=digitalimage/color=24bit-rgb/resolution=500dpi/form=jpg/compression=60</dc:type>
<dc:identifier xsi:type="dcterms:URI">http://www.vmr-itsee.bham.ac.uk/Mingana/10/</dc:identifier>
</oai_dc:dc>
</metadata>
</record>
Note the use of the same label Auth=ITSEE/textsource=Mingana10 to associate the images with the manuscript; the use of type to declare that these are images; and the full URL for the images, corresponding to the value of the xml:base attribute of the <facsimile> element associated with this manuscript.
3.3.3 An image of one page of the manuscript
<dc:title>Mingana 10</dc:title>
<msdesc:altname>10</msdesc:altname>
<dc:identifier xsi:type="uid:textsource">Auth=ITSEE/textsource=Mingana10/page=2</dc:identifier>
<dc:type xsi:type="uid:type">Auth=ITSEE-INTF/type=facsimile/source=digitalimage/color=24bit-rgb/resolution=500dpi/form=jpg/compression=60</dc:type>
<dc:identifier xsi:type="dcterms:URI">http://www.vmr-itsee.bham.ac.uk/Mingana/10/2</dc:identifier>
3.3.4 The text contained in the manuscript
This asserts that the manuscript contains the text of the Greek New Testament, and that a description of the manuscript is available. Note the provision of two identifiers to assert this.
<dc:title>Mingana 10</dc:title>
<msdesc:altname>10</msdesc:altname>
<dc:identifier xsi:type="uid:textsource">Auth=ITSEE/textsource=Mingana10</dc:identifier>
<dc:identifier xsi:type="uid:text">Auth=ITSEEINTF/text=GNT</dc:identifier>
<dc:type xsi:type="uid:type">Auth=ITSEE-INTF/type=description/form=XML/schema=http://tei p5..</dc:type>
<dc:identifier xsi:type="dcterms:URI">http://url for the description</dc:identifier>
3.3.5 The transcript of the text contained in the manuscript
This asserts that the manuscript contains the text of the Greek New Testament, and that a transcript of this text is available.
<dc:title>Mingana 10</dc:title>
<msdesc:altname>10</msdesc:altname>
<dc:identifier xsi:type="uid:textsource">Auth=ITSEE/textsource=Mingana10</dc:identifier>
<dc:identifier xsi:type="uid:text">Auth=ITSEEINTF/text=GNT</dc:identifier>
<dc:type xsi:type="uid:type">Auth=ITSEE-INTF/type=transcript/form=XML/schema=http://tei p5..</dc:type>
<dc:identifier xsi:type="dcterms:URI">http://url for the transcript</dc:identifier>
3.3.6 The text contained on one page of the manuscript
This asserts that this page of this manuscript contains the text of verses 6 and 7 of the first chapter of St John’s Gospel, but does not assert that a transcript of the text is available, instead pointing to a description of the manuscript.
<dc:title>Mingana 10</dc:title>
<msdesc:altname>10</msdesc:altname>
<dc:identifier xsi:type="uid:textsource">Auth=ITSEE/textsource=Mingana10/page=2</dc:identifier>
<dc:identifier xsi:type="uid:text">Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6</dc:identifier>
<dc:identifier xsi:type="uid:text">Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=7</dc:identifier>
<dc:type xsi:type="uid:type">Auth=ITSEE-INTF/type=description/form=XML/schema=http://tei p5..</dc:type>
<dc:identifier xsi:type="dcterms:URI">http://url for the description</dc:identifier>
3.3.7 The transcript of a text of a page contained in the manuscript
This asserts that this page of this manuscript contains the text of verses 6 and 7 of the first chapter of St John’s Gospel, and that a transcript of the text on this page is available. Note the use of the pagetranscript value to indicate that the transcript is of the text of those verses as it appears on this page, which may not be the full text of those verses.
<dc:title>Mingana 10</dc:title>
<msdesc:altname>10</msdesc:altname>
<dc:identifier xsi:type="uid:textsource">Auth=ITSEE/textsource=Mingana10/page=2</dc:identifier>
<dc:identifier xsi:type="uid:text">Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6</dc:identifier>
<dc:identifier xsi:type="uid:text">Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=7</dc:identifier>
<dc:type xsi:type="uid:type">Auth=ITSEE-INTF/type=pagetranscript/form=XML/schema=http://tei p5..</dc:type>
<dc:identifier xsi:type="dcterms:URI">http://url for the transcript</dc:identifier>
3.3.8 The transcript of a text fragment contained in the manuscript
This asserts that this manuscript contains the text of verse 6 of the first chapter of St John’s Gospel, and that a transcript of this text in this manuscript is available.
<dc:title>Mingana 10</dc:title>
<msdesc:altname>10</msdesc:altname>
<dc:identifier xsi:type="uid:textsource">Auth=ITSEE/textsource=Mingana10</dc:identifier>
<dc:identifier xsi:type="uid:text">Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6</dc:identifier>
<dc:type xsi:type="uid:type">Auth=ITSEE-INTF/type=transcript/form=XML/schema=http://tei p5..</dc:type>
<dc:identifier xsi:type="dcterms:URI">http://url for the transcript</dc:identifier>
3.3.9 An edition of the text
This asserts that a particular edition of this text is available. Note that the edition is treated as a text source.
<dc:title>The Nestle Aland 28th Edition of the Greek New Testament</dc:title>
<msdesc:altname>NA28</msdesc:altname>
<dc:identifier xsi:type="uid:textsource">Auth=ITSEE/textsource=NA28</dc:identifier>
<dc:identifier xsi:type="uid:text">Auth=ITSEEINTF/text=GNT</dc:identifier>
<dc:type xsi:type="uid:type">Auth=ITSEE-INTF/type=edition/form=XML/schema=http://tei p5..</dc:type>
<dc:identifier xsi:type="dcterms:URI" >http://url for the edition</dc:identifier>
4. A TEI encoding of a manuscript
This example shows how the labels and typing described above may be embedded into a TEI document. The metadata records as shown above and the links back into the document or to the associated images could then be generated from this encoding. Note particularly that uid attribute values map directly to elements in the OAI-PMH records; other elements (dc:title, msdesc:altname) may also be derived from the TEI encoded description.
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title>Mingana 10</title>
</titleStmt>
<publicationStmt>
<distributor>University of Birmingham</distributor>
<idno>Auth=ITSEE/textsource=Mingana10</idno>
</publicationStmt>
<sourceDesc>
<msDesc xml:id="Mingana10" xml:lang="en" uid:textsource="Auth=ITSEE/textsource=Mingana10">
<msIdentifier>
<country>United Kingdom</country>
<region>West Midlands</region>
<settlement>Birmingham</settlement>
<institution>University of Birmingham</institution>
<repository>Department of Special Collections</repository>
<collection>Mingana</collection>
<idno>10</idno>
</msIdentifier>
<p>245 x 160 mm. 136 leaves, nineteen lines to the page.</p>
<p>The Gospels according to the Harklean Version</p>
<p>No date. The writing is a clear and bold West Syrian script of about A.D. 1300</p>
</msDesc>
</sourceDesc>
</fileDesc>
<revisionDesc>
<change when="2008-10-09">
</change>
</revisionDesc>
</teiHeader>
<facsimile xml:base="http://www.vmr-itsee.bham.ac.uk/Mingana/10/"
uid:type="Auth=ITSEE-INTF/type=facsimile/source=digitalimage/color=24bit-rgb/resolution=500dpi/form=jpg/compression=60">
<surface start="#Mingana10-2" ulx="0" uly="0" lrx="5460" lry="3192" uid:textsource="Auth=ITSEE/textsource=Mingana10/page=2">
<graphic mimeType="image/jpeg" xml:id="Mingana-10-2" url="Mingana-10-2.jpg"/>
</surface>
</facsimile>
<text>
<body>
<head>The Greek New Testament</head>
<head>The Gospel of John</head>
<head>Chapter 1</head>
....<pb xml:id="Mingana10-2" uid="Auth=ITSEE/textsource=Mingana10/page=2"/>
<ab uid="Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6" n="6">some text</ab>
<ab uid="Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=7" n="7">some text</ab>
</body>
</text>
</TEI>
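As a rough illustration of how records could be generated from such an encoding, the sketch below walks a simplified body in document order and pairs each <ab> with the page break that precedes it. The body is a cut-down stand-in for the example above: page 3 and verse 8 are invented for illustration, the function name page_text_links is hypothetical, and a real TEI file would declare the TEI namespace, so the tag names would need qualifying.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the <body> above; page=3 and verse=8 are invented.
TEI_BODY = """<body>
  <pb xml:id="Mingana10-2" uid="Auth=ITSEE/textsource=Mingana10/page=2"/>
  <ab uid="Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6" n="6">some text</ab>
  <ab uid="Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=7" n="7">some text</ab>
  <pb xml:id="Mingana10-3" uid="Auth=ITSEE/textsource=Mingana10/page=3"/>
  <ab uid="Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=8" n="8">some text</ab>
</body>"""


def page_text_links(body_xml: str) -> list:
    """Pair each <ab> with the most recent preceding <pb>, in document order."""
    links = []
    current_page = None
    for element in ET.fromstring(body_xml).iter():
        if element.tag == "pb":
            current_page = element.get("uid")
        elif element.tag == "ab":
            links.append((current_page, element.get("uid")))
    return links


for page, text in page_text_links(TEI_BODY):
    print(page, "contains", text)
```

Each resulting pair supplies the textsource and text identifiers for one record of the kind shown in 3.3.6 and 3.3.7.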
5. Bibliography; Useful links
Dublin Core metadata terms and extensions: http://dublincore.org/documents/dcmi-terms/
Dublin Core implementation in XML: http://dublincore.org/documents/dc-xml-guidelines/
OAI-PMH: http://www.openarchives.org/OAI/openarchivesprotocol.html
OAI-ORE: http://www.openarchives.org/ore/0.9/toc.html
The ENRICH documentation, for TEI manuscript descriptions (temporary site): http://tei.oucs.ox.ac.uk/ENRICH/Deliverables/referenceManual_en.html
Kahn-Wilensky naming scheme: http://www.cnri.reston.va.us/k-w.html
Resource Description Framework (RDF): http://www.w3.org/RDF/
6. Example uses
This scheme would allow the following activities, through identification of appropriate URLs from the text, textsource and type values associated with each URL in the OAI-PMH records.
6.1 Searching
1. For words in a particular edition of a particular text, e.g.:
text Auth=ITSEEINTF/text=GNT;
textsource Auth=ITSEE/textsource=NA28;
type Auth=ITSEE-INTF/type=edition/form=XML
2. For words in a particular part of a particular edition of a particular text, e.g.:
text Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6;
textsource Auth=ITSEE/textsource=NA28;
type Auth=ITSEE-INTF/type=edition/form=XML
3. For words in all transcripts of a particular text, in all manuscripts and editions for which we have transcripts, by not specifying any particular textsource, e.g.:
text Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6;
type Auth=ITSEE-INTF/type=transcript/form=XML
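The selections above can be sketched as a filter over harvested records. The two records and their URLs below are invented for illustration, and the matching rule is an assumption on our part: a record matches a text or textsource query if one label names the other or a part of it, and matches a type query if its type string extends the queried one.

```python
# Invented sample of harvested records; the URLs are placeholders.
RECORDS = [
    {"textsource": "Auth=ITSEE/textsource=NA28",
     "text": "Auth=ITSEEINTF/text=GNT",
     "type": "Auth=ITSEE-INTF/type=edition/form=XML",
     "url": "http://example.org/editions/NA28"},
    {"textsource": "Auth=ITSEE/textsource=Mingana10",
     "text": "Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6",
     "type": "Auth=ITSEE-INTF/type=transcript/form=XML",
     "url": "http://example.org/transcripts/Mingana10/John.1.6"},
]


def within(whole: str, part: str) -> bool:
    """True if `part` is `whole` itself or a label nested beneath it."""
    return part == whole or part.startswith(whole + "/")


def overlaps(a: str, b: str) -> bool:
    """True if either label names the other or a part of the other."""
    return within(a, b) or within(b, a)


def find_urls(records, text=None, textsource=None, type_=None) -> list:
    """Return the URLs of records matching every criterion that is given."""
    hits = []
    for record in records:
        if text is not None and not overlaps(record["text"], text):
            continue
        if textsource is not None and not overlaps(record["textsource"], textsource):
            continue
        if type_ is not None and not within(type_, record["type"]):
            continue
        hits.append(record["url"])
    return hits


VERSE = "Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6"
# Case 2: one verse in one edition.
print(find_urls(RECORDS, text=VERSE, textsource="Auth=ITSEE/textsource=NA28",
                type_="Auth=ITSEE-INTF/type=edition/form=XML"))
# Case 3: all transcripts of the verse, with no textsource given.
print(find_urls(RECORDS, text=VERSE, type_="Auth=ITSEE-INTF/type=transcript/form=XML"))
```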
6.2 Collation of texts
Following (3) in the last example: retrieve the transcripts. An interface might then ask which are to be collated, and offer options for the collation (output format, regularization tables to use, collation tool, etc.)
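Once the transcripts are retrieved, a very simple word-level comparison could look like the sketch below. It uses Python's general-purpose difflib as a stand-in, not a dedicated collation tool, and applies no regularization; the two readings are invented English strings, not real manuscript text.

```python
from difflib import SequenceMatcher


def collate(base: str, witness: str) -> list:
    """List the word-level differences between a base text and a witness."""
    a, b = base.split(), witness.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is 'replace', 'delete' or 'insert'
            variants.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return variants


# Invented readings for illustration only.
for variant in collate("the scribe wrote this line",
                       "the scribe copied this lyne twice"):
    print(variant)
```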
6.3 Retrieval of images
1. For images of a particular manuscript, e.g.:
textsource Auth=ITSEE/textsource=Mingana10;
type Auth=ITSEE-INTF/type=facsimile/source=digitalimage/color=24bit-rgb/resolution=500dpi/form=jpg/compression=60
2. For images of a particular manuscript page, e.g.:
textsource Auth=ITSEE/textsource=Mingana10/page=2;
type Auth=ITSEE-INTF/type=facsimile/source=digitalimage/color=24bit-rgb/resolution=500dpi/form=jpg/compression=60
3. For all images of manuscript pages containing a particular text fragment, e.g.:
text Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6;
type Auth=ITSEE-INTF/type=facsimile/source=digitalimage/color=24bit-rgb/resolution=500dpi/form=jpg/compression=60
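The third case needs two steps: first find the pages on which the text occurs, then find the images of those pages. The sketch below assumes that page-to-text links are already available (for example from page transcripts) and that images are indexed by page-level textsource label. The data structures and the function name are hypothetical, and the entries for page 3 are invented.

```python
# (page-level textsource, text) links; the page=3 entries are invented.
LINKS = [
    ("Auth=ITSEE/textsource=Mingana10/page=2",
     "Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6"),
    ("Auth=ITSEE/textsource=Mingana10/page=2",
     "Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=7"),
    ("Auth=ITSEE/textsource=Mingana10/page=3",
     "Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=8"),
]

# Page-level textsource label mapped to the URL of its facsimile image.
IMAGES = {
    "Auth=ITSEE/textsource=Mingana10/page=2":
        "http://www.vmr-itsee.bham.ac.uk/Mingana/10/2",
    "Auth=ITSEE/textsource=Mingana10/page=3":
        "http://www.vmr-itsee.bham.ac.uk/Mingana/10/3",
}


def images_for_text(text: str) -> list:
    """Return image URLs for every page holding `text` or any part of it."""
    pages = []
    for page, linked_text in LINKS:
        same_or_part = linked_text == text or linked_text.startswith(text + "/")
        if same_or_part and page not in pages:
            pages.append(page)
    return [IMAGES[page] for page in pages if page in IMAGES]


print(images_for_text("Auth=ITSEEINTF/text=GNT/book=4/chapter=1/verse=6"))
print(images_for_text("Auth=ITSEEINTF/text=GNT/book=4/chapter=1"))
```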
7. To do
There are many points of overlap between the architecture suggested here and other systems already proposed, and indeed in wide use. The bibliography suggests some areas of contact, for example with refinements of the Dublin Core mechanisms, and with the Resource Description Framework (RDF). RDF seems particularly promising, as it is explicitly designed for machine processing of statements such as ‘this digital object contains a transcript of this text in this manuscript’. What is novel in this architecture is the linking between text sources (manuscripts, printed books, any physical instance of any text), the texts they contain, and the type of digital object with which we are dealing. We welcome suggestions as to the most robust and efficient mapping of the linkings we here propose to these existing metadata formats.