From IntereditionWiki

Revision as of 13:45, 24 November 2012 by Paolo (Talk | contribs)

Morning of 21/11/2012

For names, refer to the list of participants.

Discussion on text modelling

Two main alternatives to XML:

  • range based model (for annotations)
  • variant graph (for collation)
    • Tara is using (with CollateX) a graph-based model more complex than Schmidt's.
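To make the variant-graph idea concrete, here is a minimal sketch in Python. It is not the data structure used by CollateX or by Schmidt; the function name `build_variant_graph`, the pre-aligned input format and the sample witnesses are all assumptions made for illustration. Readings that agree share a node, and each edge records which witnesses pass through it.

```python
from collections import defaultdict

START, END = "#START", "#END"  # hypothetical sentinel nodes


def build_variant_graph(witnesses):
    """Build a toy variant graph from pre-aligned witnesses.

    `witnesses` maps a siglum to a list of tokens. All lists have the same
    length, and None marks a gap. Equal tokens in the same column share a node.
    """
    edges = defaultdict(set)  # (from_node, to_node) -> set of sigla
    for siglum, tokens in witnesses.items():
        previous = START
        for column, token in enumerate(tokens):
            if token is None:  # gap: this witness skips the column
                continue
            node = (column, token)  # a node is a (column, reading) pair
            edges[(previous, node)].add(siglum)
            previous = node
        edges[(previous, END)].add(siglum)
    return dict(edges)


# Made-up sample witnesses, already aligned column by column.
witnesses = {
    "A": ["the", "quick", "brown", "fox"],
    "B": ["the", None, "red", "fox"],
}
graph = build_variant_graph(witnesses)
for (source, target), sigla in sorted(graph.items(), key=str):
    print(source, "->", target, sorted(sigla))
```

Traversing the edges labelled with one siglum reconstructs that witness; the places where edges fork are the places of variation.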

Paolo is interested in a text model that can represent textual layers (graphical, alphabetical, linguistic; see Orlandi, Informatica testuale)

  • Tara: we can stretch Schmidt's graph model to represent Orlandi's different layers. They are still graphs.
  • Gregor: variant graphs are not suitable for this, because the layers are not variants. LMNL (a range-based model), on the other hand, seems a good fit for Orlandi's "textual layers".
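Gregor's point can be illustrated with a small range-based sketch, again in Python. The `Range` class, the layer names and the sample text are assumptions for illustration, not part of LMNL or of any existing implementation. Each layer contributes ranges over the same character sequence, and ranges from different layers may cross each other freely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    layer: str  # e.g. "graphical" or "linguistic"
    name: str   # annotation name
    start: int  # first character offset, inclusive
    end: int    # last character offset, exclusive


def crosses(a, b):
    """True if the two ranges overlap without one containing the other."""
    overlap = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return overlap and not (a_in_b or b_in_a)


def covering(ranges, offset):
    """Return all ranges, from any layer, that contain the given offset."""
    return [r for r in ranges if r.start <= offset < r.end]


text = "To be or not to be"
line_1 = Range("graphical", "line", 0, 8)       # "To be or"
line_2 = Range("graphical", "line", 9, 18)      # "not to be"
phrase = Range("linguistic", "phrase", 6, 18)   # "or not to be"
ranges = [line_1, line_2, phrase]

print(text[phrase.start:phrase.end])
print(crosses(line_1, phrase))
print([r.layer for r in covering(ranges, 7)])
```

The phrase starts on one graphical line and ends on the next, which a single XML hierarchy could not express without workarounds; here it is just one more range.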

Gregor's presentation

Gregor describes his implementation of a range-based (LMNL-based) textual model, used so far for the back-end of the Faust Project.

  • the implementation is in Java
  • annotations have a name and are namespaced (this comes from XML)
  • text is a sequence of events
  • layers have names like "TEIW" or "Europeana" and contain texts
  • layers have anchors, so one text can point to another or to ranges of another text
  • one pointer can point to multiple anchors. E.g. the layer 'alignment' points to two different anchors (and aligns them)
  • layers can include any kind of data (JSON, an XML file etc.)
  • a TextRepository is just a collection of those layers. It is something you can query
  • you can create a graph of the layers existing in a text repository
  • TextStream
    • Gregor's model is different from XML
      • the SAX API works with XML trees
      • you can walk through the tree there
      • a range-based model, instead, has no such easy stack of open elements, because ranges may overlap
      • How do you transform XML into the range model? Any element (with opening and closing tags) becomes a range
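The last point, turning every XML element into a range, can be sketched with Python's standard `xml.sax` module. This is an illustration of the idea, not Gregor's Java implementation; the names `RangeBuilder` and `xml_to_ranges` and the sample markup are assumptions. The handler keeps a stack of open elements, which works here precisely because the XML input is properly nested.

```python
import xml.sax


class RangeBuilder(xml.sax.ContentHandler):
    """Collect plain text and one (name, start, end) range per element."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # pieces of character data, in document order
        self.offset = 0    # number of characters seen so far
        self.open = []     # stack of (element name, start offset)
        self.ranges = []   # finished ranges, in the order the elements close

    def startElement(self, name, attrs):
        self.open.append((name, self.offset))

    def endElement(self, name):
        opened_name, start = self.open.pop()
        self.ranges.append((opened_name, start, self.offset))

    def characters(self, content):
        self.chunks.append(content)
        self.offset += len(content)


def xml_to_ranges(xml_source):
    """Return (plain text, list of ranges) for an XML string."""
    handler = RangeBuilder()
    xml.sax.parseString(xml_source.encode("utf-8"), handler)
    return "".join(handler.chunks), handler.ranges


text, ranges = xml_to_ranges("<l>Habe nun, <hi>ach!</hi> Philosophie</l>")
print(text)
for name, start, end in ranges:
    print(name, start, end, repr(text[start:end]))
```

Attributes are ignored in this sketch, and an empty element becomes a zero-length range. Once the text is in this form, further ranges can be added without having to respect the original nesting.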

Each participant's agenda

A round-up of each participant's interests:

  • practicality: what can we build on top of e.g. a range-based model (from the datastore to the presentation layer)
  • query/search functions on top of a text model
  • variant graph vs. range-based models
  • processing (equivalent to XSLT?), querying (equivalent to XPath/XQuery)
  • variant graph: traversal patterns?
  • interfaces, APIs, JS libraries
  • problem of variation and how it is handled on different (conceptual) layers of a text
  • common model? can we find a generalized model incorporating features from both the variant graph and a range-based model?
  • bridge the gap!
  • integration scenarios; import of existing data, multiple use cases on top of those (what is the smallest thing that could possibly work? -- how do we get there)

Division of labour

See TheHague112012DivisionOfLabour

Afternoon of 21/11/2012

Bootcamp's agenda and goals

Interoperability is best achieved through web services. Gregor's Java system (showcased this morning) works this way, and CollateX now does too.

Juxta today exposes its underlying range-based model only through the API. For the standard user, it is just XML in, XML out.

Moritz and Gregor discuss the JavaScript querying prototype they already built for Java. Client-side or server-side?

Ronald: is a DSL (domain-specific language) the way for us to go?

Bootcamp's goals:

  • We want to use the TEI palette (vocabulary) to build a non-XML range-model-based annotation of plain text
  • We want to build tools (a Domain-Specific Language) for texts annotated this way, to foster the development of this kind of text encoding/annotation model (an alternative to XML)
  • Our goal is to make a console on a website, rather than simply an API
    • so the user can test the potential of a querying language that queries texts marked with a range-based model

The point of doing this is:

  • overcoming overlapping
  • exploiting the recursive potential of LMNL (annotating annotations)

We should create a web interface. Gregor sets out an example of its workflow:

  • the text repository:
    • POST
    • <xml/>
  • creation of a new text:
    • GET
    • PUT → 2201 creator
  • creation of an annotation to text 12:
    • (but bear in mind that annotations are texts in their own right)
  • a query on text 13:
  • a query on the whole repository (all texts in the repository)
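The workflow can be mimicked with an in-memory sketch in Python, leaving HTTP aside. It is not the planned web interface; the class and method names (`TextRepository`, `add_text`, `annotate`, `annotations_of`, `search`) and the sample content are assumptions for illustration. The key design point from the list above is kept: an annotation is itself a text, so it can be annotated in turn.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Text:
    id: int
    content: str
    # (target text id, start, end) triples this text points at
    anchors: list = field(default_factory=list)


class TextRepository:
    """A toy repository: a collection of texts that can be queried."""

    def __init__(self, first_id=1):
        self._ids = itertools.count(first_id)
        self.texts = {}

    def add_text(self, content, anchors=()):
        text = Text(next(self._ids), content, list(anchors))
        self.texts[text.id] = text
        return text.id

    def annotate(self, target_id, start, end, content):
        """Create an annotation, which is stored as a text with an anchor."""
        if target_id not in self.texts:
            raise KeyError(f"unknown text {target_id}")
        return self.add_text(content, [(target_id, start, end)])

    def annotations_of(self, target_id):
        """Query on one text: ids of all texts anchored in it."""
        return [t.id for t in self.texts.values()
                if any(anchor[0] == target_id for anchor in t.anchors)]

    def search(self, needle):
        """Query on the whole repository: ids of texts containing `needle`."""
        return [t.id for t in self.texts.values() if needle in t.content]


repo = TextRepository()
base = repo.add_text("Habe nun, ach! Philosophie")
note = repo.annotate(base, 10, 14, "interjection")
meta = repo.annotate(note, 0, 12, "added by an editor")  # annotating an annotation

print(repo.annotations_of(base))
print(repo.annotations_of(note))
print(repo.search("ach"))
```

In a web service each of these methods would map onto an HTTP request against the repository or against a single text resource.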

Textual layers at Faust project

(This is not directly related to this Bootcamp's project)

At the Faust project they are transcribing a manuscript twice: one transcription for the diplomatic layer, one for the 'linguistic' (regularised) layer.

  • Issue: how do you align the two texts? Currently they do it by collating them.
  • Open issue: how will they eventually store this alignment (e.g. in XML)? They have been collaborating with the TEI SIG on genetic editions, but there is no solution yet for storing such a granular (word-level) alignment in XML/TEI.
  • Paolo proposes to create one XML file for each layer (one diplomatic, one 'linguistic', both encoded at 'w'/word level), plus a third XML file containing only the links between single words in transcription file 1 (diplomatic) and words in transcription file 2 ('linguistic').
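Paolo's three-file proposal can be sketched in Python as follows. This is only an illustration under several assumptions: the sample transcriptions are made up, the file names `dipl.xml` and `reg.xml` are hypothetical, the TEI namespace is omitted for brevity, the third file is imagined as a TEI-style `linkGrp` of `link` elements, and `difflib` stands in for a real collation step.

```python
import difflib
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def words(xml_source):
    """Return (xml:id, text) pairs for every <w> element."""
    root = ET.fromstring(xml_source)
    return [(w.get(XML_ID), w.text or "") for w in root.iter("w")]


def link_words(diplomatic_xml, regularised_xml):
    """Build a linkGrp element pairing words of the two transcriptions."""
    dipl = words(diplomatic_xml)
    reg = words(regularised_xml)
    matcher = difflib.SequenceMatcher(
        a=[text.lower() for _, text in dipl],
        b=[text.lower() for _, text in reg],
        autojunk=False,
    )
    link_group = ET.Element("linkGrp", type="alignment")
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        # Pair identical words, and substitutions of equal length, one to one.
        # Insertions, deletions and uneven substitutions are left unlinked.
        if tag == "equal" or (tag == "replace" and a2 - a1 == b2 - b1):
            for (d_id, _), (r_id, _) in zip(dipl[a1:a2], reg[b1:b2]):
                ET.SubElement(link_group, "link",
                              target=f"dipl.xml#{d_id} reg.xml#{r_id}")
    return link_group


# Made-up sample transcriptions, both encoded at word level.
diplomatic = ('<ab><w xml:id="d1">vnd</w> <w xml:id="d2">sprach</w> '
              '<w xml:id="d3">zu</w> <w xml:id="d4">jm</w></ab>')
regularised = ('<ab><w xml:id="r1">und</w> <w xml:id="r2">sprach</w> '
               '<w xml:id="r3">zu</w> <w xml:id="r4">ihm</w></ab>')

links = link_words(diplomatic, regularised)
print(ET.tostring(links, encoding="unicode"))
```

Keeping the links in a separate file leaves both transcriptions untouched, so either layer can be corrected or re-collated without rewriting the other.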

Search for related projects

We look for other (open source) DH projects that may be doing what we want to do, so we can build on what they have already done.

  • Textus
    • connected with Europeana (RDF for cultural artifacts)
    • they let you annotate any web page arbitrarily, using ranges identified by XPath-like expressions
  • Is Standoff Markup Editor doing what we want to do?
  • Marco is working on CATMA