From IntereditionWiki

Revision as of 14:22, 24 November 2012 by Paolo

Morning of 21/11/2012

For names, refer to the list of participants.

Discussion on text modelling

Two main alternatives to XML:

  • range based model (for annotations)
  • variant graph (for collation)
    • Tara is using (with CollateX) a graph-based model more complex than Schmidt's.

Paolo is interested in a text model that can represent textual layers (graphical, alphabetical, linguistic; see Orlandi, Informatica testuale)

  • Tara: we can stretch Schmidt's graph model to represent Orlandi's different layers. They're still graphs.
  • Gregor: variant graphs are not suitable for this, because the layers are not variants. LMNL (a range-based model), instead, seems a good fit for Orlandi's "textual layers".

Gregor's presentation

Gregor describes his implementation of a range-based (LMNL-based) textual model, used so far for the back-end of the Faust Project.

  • the implementation is in Java
  • annotations have a name and are namespaced (this comes from XML)
  • text is a sequence of events
  • layers have names like "TEIW" or "Europeana" and contain texts
  • layers have anchors, so one text can point to another or to ranges of another text
  • one pointer can point to multiple anchors. E. g. layer 'alignment' points to two different anchors (and aligns them)
  • layers can include whatever data (JSON, an XML file etc.)
  • a TextRepository is just a collection of those layers. It's something I can query
  • you can create a graph of the layers existing in a text repository
  • TextStream
    • Gregor's model is different from XML
      • the SAX API works with XML trees
      • you can walk through the tree there
      • a range-based model, instead, does not have such easy stacks
      • How do you transform XML into range-model? Any element (with opening and closing tags) becomes a range
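
The last point can be sketched in a few lines. This is only an illustration in Python (Gregor's implementation is in Java and is not reproduced here); the names `Range`, `RangeBuilder` and `xml_to_ranges` are made up for this sketch. It uses the standard SAX parser: the element stack is needed only while reading the XML, and what comes out is plain text plus a flat list of ranges.

```python
import xml.sax
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    name: str   # element name, reused as the annotation name
    start: int  # offset of the first character covered
    end: int    # offset just past the last character covered


class RangeBuilder(xml.sax.ContentHandler):
    """Collects the plain text and one Range per XML element."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # pieces of character data seen so far
        self.offset = 0   # current length of the plain text
        self.open = []    # stack of (name, start offset) for open elements
        self.ranges = []  # finished ranges, in closing-tag order

    def startElement(self, name, attrs):
        self.open.append((name, self.offset))

    def characters(self, content):
        self.chunks.append(content)
        self.offset += len(content)

    def endElement(self, name):
        opened_name, start = self.open.pop()
        self.ranges.append(Range(opened_name, start, self.offset))


def xml_to_ranges(xml_source):
    """Return (plain_text, ranges) for a well-formed XML string."""
    handler = RangeBuilder()
    xml.sax.parseString(xml_source.encode("utf-8"), handler)
    return "".join(handler.chunks), handler.ranges


text, ranges = xml_to_ranges("<l>Habe nun, ach! <hi>Philosophie</hi>,</l>")
```

Attributes and namespaces are ignored here; a fuller version would carry them over as annotation data.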

Each participant's agenda

A round-up on the interests of each of the participants:

  • practicality: what can we build on top of e.g. a range-based model (from the datastore to the presentation layer)
  • query/search functions on top of a text model
  • variant graph vs. range-based models
  • processing (equivalent to XSLT?), querying (equivalent to XPath/XQuery)
  • variant graph: traversal patterns?
  • interfaces, APIs, JS libraries
  • problem of variation and how it is handled on different (conceptual) layers of a text
  • common model? can we find a generalized model incorporating features from the variant graph and a range-based model
  • bridge the gap!
  • integration scenarios; import of existing data, multiple use cases on top of those (what is the smallest thing that could possibly work? -- how do we get there)

Division of labour

See TheHague112012DivisionOfLabour

Afternoon of 21/11/2012

Bootcamp's agenda and goals

Interoperability is best achieved through web services. Gregor's Java system (which he showcased this morning) works this way, and so does CollateX now.

Juxta today shows its underlying range-based model only if you use the API. For the standard user, it's just XML in, XML out.

Moritz and Gregor discuss the JavaScript querying prototype they already built for Java. Client-side or server-side?

Ronald: is a DSL (domain-specific language) the way for us to go?

Bootcamp's goals:

  • We want to use the TEI palette (vocabulary) to build a non-XML range-model-based annotation of plain text
  • We want to build tools (a Domain Specific Language) for texts annotated this way, so we foster the development of this kind of text encoding/annotation model (alternative to XML)
  • Our goal is to make a console on a website, rather than simply an API
    • so the user can test the potential of a querying language that queries texts marked with a range-based model

The point of doing this is:

  • overcoming overlapping
  • exploiting the recursive potential of LMNL (annotating annotations)

We should create a web interface. Gregor sets out an example of its workflow:

  • the text repository:
    • POST
    • <xml/>
  • creation of a new text:
    • GET
    • PUT → 201 Created
  • creation of an annotation to text 12:
    • (but bear in mind that annotations are texts in their own right)
  • a query on text 13:
  • a query on the whole repository (all texts in the repository)
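
To make the workflow concrete without a web server, here is a minimal in-memory sketch in Python. Everything in it is an assumption made for illustration: the names `Anchor`, `Text`, `TextRepository` and their methods are hypothetical and are not the API of Gregor's system, and no HTTP is involved. What it tries to show is the point made above: an annotation is itself a text that carries anchors into another text, so annotations can be annotated in turn.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class Anchor:
    text_id: int  # the text being pointed at
    start: int    # first character covered
    end: int      # just past the last character covered


@dataclass
class Text:
    id: int
    content: str
    anchors: list = field(default_factory=list)  # empty for a plain text


class TextRepository:
    """Hypothetical stand-in for the repository behind the web service."""

    def __init__(self, first_id=1):
        self._ids = count(first_id)
        self._texts = {}

    def create(self, content, anchors=()):
        text = Text(next(self._ids), content, list(anchors))
        self._texts[text.id] = text
        return text

    def get(self, text_id):
        return self._texts[text_id]

    def annotations_of(self, text_id):
        """Query the whole repository for texts anchored in text_id."""
        return [t for t in self._texts.values()
                if any(a.text_id == text_id for a in t.anchors)]

    def annotated_passages(self, annotation_id):
        """Resolve each anchor of an annotation to the passage it covers."""
        return [self.get(a.text_id).content[a.start:a.end]
                for a in self.get(annotation_id).anchors]


repo = TextRepository(first_id=12)  # so the IDs echo "text 12" and "text 13"
base = repo.create("Habe nun, ach! Philosophie,")
note = repo.create("name of a discipline", [Anchor(base.id, 15, 26)])
meta = repo.create("checked by editor", [Anchor(note.id, 0, 4)])
```

In a web version, `create` would correspond to the creation request and `annotations_of` to a query, but the URL layout is left open here because the notes do not record it.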

Textual layers at Faust project

(This is not directly related to this Bootcamp's project)

At the Faust project they are keying/transcribing each MS twice: one transcription for the diplomatic layer, one transcription for the 'linguistic' (regularised) layer.

  • Issue: how do you align the two texts? They do it now by collating them.
  • Open issue: how will they eventually store this alignment (e. g. in XML)? They have been working with the TEI SIG on genetic editions, but there is no solution yet for storing such a granular (word-level) alignment in XML/TEI.
  • Paolo proposes three XML files: one for each layer (diplomatic and 'linguistic', both encoded at 'w'/word level), and a third containing only the links between single words in transcription file 1 (diplomatic) and the corresponding words in transcription file 2 ('linguistic').
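
A rough sketch of what the contents of such a third, links-only file could be computed from. It is written in Python and uses the standard `difflib` module as a stand-in for a real collator such as CollateX, so it is not what the Faust project actually runs. The tokens are invented for the example (they are not Faust readings), and the `diplomatic#w1` style identifiers are hypothetical.

```python
from difflib import SequenceMatcher


def align_words(diplomatic, regularised):
    """Return (i, j) pairs linking word i of the first list to word j of the second."""
    matcher = SequenceMatcher(a=[w.lower() for w in diplomatic],
                              b=[w.lower() for w in regularised],
                              autojunk=False)
    links = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # Identical runs are linked word by word; a replaced run is linked
        # word by word only when both sides have the same number of words.
        if tag == "equal" or (tag == "replace" and same_length):
            links.extend(zip(range(i1, i2), range(j1, j2)))
    return links


# Invented tokens: two spellings are regularised, nothing is added or lost.
diplomatic = ["Vnd", "leyder", "auch", "Theologie"]
regularised = ["Und", "leider", "auch", "Theologie"]

links = align_words(diplomatic, regularised)

# One record per link, pointing at hypothetical word IDs in the two files.
link_records = [(f"diplomatic#w{i + 1}", f"linguistic#w{j + 1}")
                for i, j in links]
```

Insertions, deletions and replaced runs of unequal length are simply left unlinked in this sketch; deciding how to link those is exactly the hard part of the open issue.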

Morning of 22/11/2012

Recap on yesterday's discussion

  • abstract text model
    • why we need a new one (other than XML)
  • two main issues
    • text variation
      • graph-based model
    • annotation
      • a range-based model
      • overlapping is not a problem any longer
  • goal: we want to build a web service, a text repository implementing a LMNL textual model
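
Why overlapping stops being a problem can be shown with a tiny Python sketch (the sample sentence and all names are made up for this illustration). Verse lines and sentences are stored as independent ranges over the same plain text, so a sentence that runs across a line break needs no workaround: the two ranges simply cross.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    name: str
    start: int  # inclusive
    end: int    # exclusive


text = "He went out and walked. Then he came home."


def range_of(name, passage):
    """Build a Range covering the first occurrence of passage in text."""
    start = text.index(passage)
    return Range(name, start, start + len(passage))


ranges = [
    range_of("line", "He went out and"),
    range_of("line", "walked. Then he came home."),
    range_of("sentence", "He went out and walked."),
    range_of("sentence", "Then he came home."),
]


def crosses(a, b):
    """True when a and b overlap but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


def ranges_at(offset):
    """All ranges covering the character at offset."""
    return [r for r in ranges if r.start <= offset < r.end]
```

Here the second line and the first sentence cross, which a single XML tree could only express with milestones or fragmentation.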

From today, we're splitting into two groups (see TheHague112012DivisionOfLabour):

  • Back-end (Java)
  • Front-end (JavaScript)

"Immutable texts" open issue


  • Gregor's model is based on the assumption that texts in the text repository are immutable. Any time the editor adds a word, the system stores a new text (with a new ID, a new URI)
  • It is hard to migrate annotations on text 1 into corresponding annotations on text 2
  • But editors normally edit one sentence (text 1), then annotate that sentence. Then, they add another sentence (which creates text 2). In this case, what happens to annotations to annotations to text 2?
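
A small Python sketch of the issue, under loud assumptions: the notes only say that every change yields a new text with a new ID/URI, so deriving the ID from a content hash is my choice for the example, and `StoredText`, `Annotation`, `store` and `migrate` are hypothetical names. The sketch handles only the easy case described above, where the editor appends a sentence and all existing offsets stay valid; anything else would need a real diff between the two texts.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredText:
    id: str
    content: str


@dataclass(frozen=True)
class Annotation:
    text_id: str
    start: int
    end: int
    body: str


def store(content):
    """Every distinct content gets its own ID; nothing is edited in place.
    (Hash-based IDs are an assumption made for this sketch.)"""
    digest = hashlib.sha1(content.encode("utf-8")).hexdigest()[:8]
    return StoredText(digest, content)


def migrate(annotations, old, new):
    """Re-target annotations from old to new, only when new merely
    appends to old, so that every existing offset is still correct."""
    if not new.content.startswith(old.content):
        raise ValueError("offsets would shift; a real diff is needed")
    return [Annotation(new.id, a.start, a.end, a.body)
            for a in annotations if a.text_id == old.id]


text1 = store("First sentence.")
ann = Annotation(text1.id, 0, 5, "ordinal")
text2 = store(text1.content + " Second sentence.")
migrated = migrate([ann], text1, text2)
```

Note that `text1` and its annotation are left untouched: migration produces new annotations on the new text rather than moving the old ones.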

Work in the front-end working group

Examining existing open-source software to highlight and annotate text. Interesting candidates (with drawbacks):

  • Textus (usability issues)
  • AnnotateIt (compilation issues)
  • PundIt

And the winner is... Annotator! Here is an installation guide:

Arash installed Annotator on his server. This is what he emailed us: "Here is my annotator page running:

ElasticSearch is running and accessible through port 9200:

Annotator-Store is also running and accessible through port 5000:

I couldn't bind my HTML-Page to the Store yet..."

Problem: how do we mark up ranges in the DOM?

1. By inserting spans in the DOM via JavaScript? But when the user inserts a span (annotation) within another span (annotation), JavaScript will count the character offset from the closest parent (the older annotation/span).
2. A better solution might be to mark the range through milestones, and let JS visualise the annotation.

How should JS visualise the annotation?

1. By highlighting? But what if two annotations overlap?
2. By inserting parentheses? But they count as characters in the DOM (so they mess up the offset calculation).
3. By inserting IMGs (no characters)! The only problem is the page layout (interlinear space etc.): if the user scales up the page font size, the IMG should scale along properly.

Solution 3 seems OK. These SVG IMGs should look like brackets, e. g. "{", with a number on top to differentiate the start of range 1 from the start of range 2, which would otherwise look the same. These numbers must be part of the SVG image (not text: we don't want them to mess up the offset).
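
The reason empty elements are safe can be checked with a short Python sketch (the front-end itself would be JavaScript; the class names and the `data-range` attribute are invented here, and the image source is left out). Milestone `<img>` tags are inserted at the start and end offsets of each range, and a parser that collects only character data gets the original text back unchanged, so offsets still match the back-end's plain text even when ranges overlap.

```python
from html import escape
from html.parser import HTMLParser


def render_with_milestones(text, ranges):
    """ranges: list of (number, start, end) character offsets into text."""
    marks = []
    for number, start, end in ranges:
        marks.append((start, 1, f'<img class="range-start" data-range="{number}" alt="">'))
        marks.append((end, 0, f'<img class="range-end" data-range="{number}" alt="">'))
    # Sort by offset; at the same offset, end marks go before start marks.
    marks.sort(key=lambda m: (m[0], m[1]))

    out, cursor = [], 0
    for offset, _, tag in marks:
        out.append(escape(text[cursor:offset]))  # text up to the milestone
        out.append(tag)                          # the empty element itself
        cursor = offset
    out.append(escape(text[cursor:]))
    return "".join(out)


class TextOnly(HTMLParser):
    """Collects character data only, as an offset calculation would see it."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


text = "Faust & Wagner walk out."
overlapping = [(1, 0, 14), (2, 8, 19)]  # "Faust & Wagner" and "Wagner walk"
markup = render_with_milestones(text, overlapping)
```

The range numbers live only in attributes, in the same spirit as keeping the numbers inside the SVG image rather than in the text.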

When the user hovers over span 1, JS shows a box with the content of the annotation, which can be either a string or a JSON structure.

The display [of the plain text coming from the back-end] [in the browser] should not include any non-empty element, like <tag>...</tag> (to avoid interfering with the offset calculation). So the best solution seems to be:

- one big <pre> wrapping the whole text, so we keep the txt file line breaks;
- using CSS to make sure that the text wraps.

Tomorrow we'll meet in Amsterdam (Joris, Tara and Gregor are presenting there), at the venue of the ESTS conference.