TheHague112012Minutes

From IntereditionWiki

Morning of 21/11/2012

For names, refer to the list of participants.

Discussion on text modelling

Two main alternatives to XML:

  • range based model (for annotations)
  • variant graph (for collation)
    • Tara is using (with CollateX) a graph-based model more complex than Schmidt's.
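
As a rough illustration of the variant-graph idea (this is not CollateX's or Schmidt's actual data structure), the sketch below stores tokens as nodes and labels each edge with the witnesses that pass through it. The class name `VariantGraph`, the sigla `A` and `B`, and the sample readings are all assumptions made for the example.

```python
from collections import defaultdict

START, END = "#START", "#END"

class VariantGraph:
    """Toy variant graph: nodes are tokens, edges carry witness sigla."""

    def __init__(self):
        # edges[source][target] -> set of witness sigla using that edge
        self.edges = defaultdict(lambda: defaultdict(set))

    def add_witness(self, siglum, tokens):
        # Naive merge: identical token strings share one node.
        # Real collation aligns tokens; this only works if no token repeats.
        path = [START, *tokens, END]
        for source, target in zip(path, path[1:]):
            self.edges[source][target].add(siglum)

    def reading(self, siglum):
        """Walk the edges labelled with `siglum` to rebuild that witness."""
        tokens, node = [], START
        while node != END:
            node = next(t for t, sigla in self.edges[node].items() if siglum in sigla)
            if node != END:
                tokens.append(node)
        return tokens

graph = VariantGraph()
graph.add_witness("A", ["the", "quick", "fox"])
graph.add_witness("B", ["the", "brown", "fox"])
```

Here the two witnesses share the nodes "the" and "fox" and diverge only at "quick"/"brown", which is the point where the graph forks.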

Paolo is interested in a text model that can represent textual layers (graphical, alphabetical, linguistic; see Orlandi, Informatica testuale)

  • Tara: we can stretch Schmidt's graph model to represent Orlandi's different layers. They're still graphs
  • Gregor: variant graphs are not suitable for this; the layers are not variants. LMNL (a range-based model), instead, seems a good fit for Orlandi's "textual layers".

Gregor's presentation

Gregor describes his implementation of a range-based (LMNL-based) textual model, used so far for the back-end of the Faust Project.

  • the implementation is in Java
  • annotations have a name and are namespaced (this comes from XML)
  • text is a sequence of events
  • layers have names like "TEIW" or "Europeana" and contain texts
  • layers have anchors, so one text can point to another or to ranges of another text
  • one pointer can point to multiple anchors. E. g. layer 'alignment' points to two different anchors (and aligns them)
  • layers can include any kind of data (JSON, an XML file etc.)
  • a TextRepository is just a collection of those layers. It's something I can query
  • you can create a graph of the layers existing in a text repository
  • TextStream
    • Gregor's model is different from XML
      • the SAX API works with XML trees
      • you can walk through the tree there
      • a range-based model, instead, does not have such easy stacks
      • How do you transform XML into range-model? Any element (with opening and closing tags) becomes a range
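
The last point can be sketched as follows. This is a minimal Python illustration using the standard library's SAX parser, not Gregor's Java implementation: every element becomes a (name, start, end) range over the plain text. The class `RangeCollector`, the function `xml_to_ranges` and the sample line are assumptions made for the example.

```python
import xml.sax

class RangeCollector(xml.sax.ContentHandler):
    """Turns SAX events into plain text plus (name, start, end) ranges."""

    def __init__(self):
        super().__init__()
        self.text = []            # text chunks seen so far
        self.offset = 0           # current character offset in the plain text
        self.open_elements = []   # stack of (element name, start offset)
        self.ranges = []          # finished (name, start, end) ranges

    def startElement(self, name, attrs):
        # An opening tag marks the start of a range at the current offset.
        self.open_elements.append((name, self.offset))

    def characters(self, content):
        self.text.append(content)
        self.offset += len(content)

    def endElement(self, name):
        # A closing tag ends the most recently opened range.
        opened, start = self.open_elements.pop()
        self.ranges.append((opened, start, self.offset))

def xml_to_ranges(xml_string):
    handler = RangeCollector()
    xml.sax.parseString(xml_string.encode("utf-8"), handler)
    return "".join(handler.text), handler.ranges

text, ranges = xml_to_ranges("<l>Habe nun, <hi>ach!</hi> Philosophie</l>")
```

The stack is only needed while reading the XML; once the ranges exist, nothing forces them to nest, which is what separates the range model from the tree.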

Each participant's agenda

A round-up on the interests of each of the participants:

  • practicality: what can we build on top of e.g. a range-based model (from the datastore to the presentation layer)
  • query/search functions on top of a text model
  • variant graph vs. range-based models
  • processing (equivalent to XSLT?), querying (equivalent to XPath/XQuery)
  • variant graph: traversal patterns?
  • interfaces, APIs, JS libraries
  • problem of variation and how it is handled on different (conceptual) layers of a text
  • common model? can we find a generalized model incorporating features from the variant graph and a range-based model
  • bridge the gap!
  • integration scenarios; import of existing data, multiple use cases on top of those (what is the smallest thing that could possibly work? -- how do we get there)

Division of labour

See TheHague112012DivisionOfLabour

Afternoon of 21/11/2012

Bootcamp's agenda and goals

Interoperability is best achieved through web services. Gregor's Java system (showcased this morning) works this way, and CollateX now does so too.

Juxta today shows its underlying range-based model only if you use the API. For the standard user using juxtacommons.org, it's just XML-in/XML-out.

Moritz and Gregor discuss the JavaScript querying prototype they already built for Java. Client-side or server-side?

Ronald: is a DSL (domain specific language) the way for us to go?

Bootcamp's goals:

  • We want to use the TEI palette (vocabulary) to build a non-XML range-model-based annotation of plain text
  • We want to build tools (a Domain Specific Language) for texts annotated this way, so we foster the development of this kind of text encoding/annotation model (alternative to XML)
  • Our goal is to make a console on a website, rather than simply an API
    • so the user can test the potential of a querying language that queries texts marked with a range-based model

The point of doing this is:

  • overcoming overlapping
  • exploiting the recursive potential of LMNL (annotating annotations)

We should create a web interface. Gregor sets out an example of its workflow:

  • the text repository: textrepo.net
    • POST
    • <xml/>
  • creation of a new text: textrepo.net/12
    • GET
    • PUT → 2201 creator
  • creation of an annotation to text 12: textrepo.net/13
    • (but bear in mind that annotations are texts in their own right)
  • a query on text 13: textrepo.net/13?q=(and...())
  • a query on the whole repository (all texts in the repository)
    • textrepo.net/?q=...
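
The workflow above can be mimicked without any HTTP by an in-memory stand-in. This is only a sketch: the class `TextRepository`, its method names, the `target` tuple and the use of a Python predicate in place of the `?q=` query language are all assumptions, and the ids start at 12 purely to mirror the example URLs.

```python
import itertools

class TextRepository:
    """In-memory stand-in for the web service sketched above (no HTTP)."""

    def __init__(self):
        self._ids = itertools.count(12)   # start at 12 to mirror the example URLs
        self.texts = {}                   # id -> {"content": ..., "target": ...}

    def post(self, content, target=None):
        """Like POST on the repository: store a new text, return its id."""
        text_id = next(self._ids)
        self.texts[text_id] = {"content": content, "target": target}
        return text_id

    def get(self, text_id):
        """Like GET on /<id>: return the stored content."""
        return self.texts[text_id]["content"]

    def query(self, predicate, text_id=None):
        """Like ?q=...: filter one text, or the whole repository if no id is given."""
        ids = [text_id] if text_id is not None else list(self.texts)
        return [i for i in ids if predicate(self.texts[i])]

repo = TextRepository()
text_id = repo.post("Habe nun, ach! Philosophie")                # stored as text 12
note_id = repo.post("interjection", target=(text_id, 10, 14))    # annotation, stored as text 13
annotations_of_12 = repo.query(lambda t: t["target"] and t["target"][0] == text_id)
```

Because the annotation is stored through the same `post` call as the base text, it gets its own id and can itself be annotated or queried.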

Textual layers at Faust project

(This is not directly related to this Bootcamp's project)

At the Faust project they are keying/transcribing a MS twice: one transcription for the diplomatic layer, one transcription for the 'linguistic' (regularised) layer.

  • Issue: how do you align the two texts? They do it now by collating them.
  • Open issue: how will they eventually store (e. g. in XML) this alignment? They have been collaborating with the TEI SIG on genetic editions on this, but there is no solution yet as to how to store such a granular (word-level) alignment in XML/TEI.
  • Paolo proposes to create one XML file for each layer (one for the diplomatic layer, one for the 'linguistic' layer, both texts being encoded at 'w'/word level), and a third XML file containing only the links between single words in XML transcription file 1 (diplomatic) and words in XML transcription file 2 ('linguistic').
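
The idea of keeping the alignment apart from both transcriptions, as a list of word-to-word links, can be sketched like this. Python's `difflib` stands in for real collation here; the token lists, the id prefixes `d` and `r`, and the function names are made up for the example and are not Faust project data.

```python
from difflib import SequenceMatcher

diplomatic = ["Habe", "nvn", "ach", "Philosophie"]    # invented sample spelling
regularised = ["Habe", "nun", "ach", "Philosophie"]

def word_ids(prefix, tokens):
    """Give every word an identifier, as a word-level element with an id would."""
    return [f"{prefix}{n}" for n in range(1, len(tokens) + 1)]

def align(tokens_a, tokens_b, ids_a, ids_b):
    """Link word ids of two layers; difflib stands in for real collation."""
    links = []
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and a2 - a1 == b2 - b1):
            # One-to-one stretch: pair the words position by position.
            links.extend(zip(ids_a[a1:a2], ids_b[b1:b2]))
    return links

links = align(diplomatic, regularised,
              word_ids("d", diplomatic), word_ids("r", regularised))
```

The resulting `links` list plays the role of the third file: it holds only pairs of identifiers and no text, so neither transcription needs to change when the alignment does.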


Morning of 22/11/2012

Recap on yesterday's discussion

  • abstract text model
    • why we need a new one (other than XML)
  • two main issues
    • text variation
      • graph-based model
    • annotation
      • a range-based model
      • overlapping is not a problem any longer
  • goal: we want to build a web service, a text repository implementing an LMNL textual model
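
Why overlap stops being a problem in a range-based model can be shown with a small sketch. Ranges are just offsets into a plain text, so two of them may cross without either having to contain the other. The `Range` class, the layer names and the sample text are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    name: str
    start: int   # offset of the first character covered
    end: int     # offset just past the last character covered

text = "Habe nun, ach! Philosophie"
annotations = [
    Range("line", 0, 14),       # "Habe nun, ach!"
    Range("phrase", 10, 26),    # "ach! Philosophie"
]

def overlaps(a, b):
    """True if the ranges share characters but neither contains the other."""
    shares = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return shares and not (a_in_b or b_in_a)

def covering(offset):
    """Names of all ranges that cover the character at `offset`."""
    return [r.name for r in annotations if r.start <= offset < r.end]
```

The two ranges above cross at "ach!", a situation that a single XML tree could only express with workarounds such as milestones or fragmented elements.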

From today, we are splitting into two groups (see TheHague112012DivisionOfLabour):

  • Back-end (Java)
  • Front-end (JavaScript)

"Immutable texts" open issue

Issue:

  • Gregor's model is based on the assumption that texts in the text repository are immutable. Any time the editor adds a word, the system stores a new text (with a new ID, a new URI)
  • It is hard to migrate annotations to text 1 into corresponding annotations to text 2
  • But editors normally edit one sentence (text 1), then annotate that sentence. Then, they add another sentence (which creates text 2). In this case, what happens to annotations to annotations to text 2?
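
To make the issue concrete, here is a small sketch of an immutable store in which every edit yields a new text with a new id, together with a deliberately naive way of carrying ranges over a single insertion. The function names and sample data are assumptions. Only plain insertions are handled; deletions and rewrites are not, which is part of why the migration is considered hard.

```python
import itertools

_ids = itertools.count(1)
texts = {}   # id -> string; an edit never changes an existing entry

def store(content):
    text_id = next(_ids)
    texts[text_id] = content
    return text_id

def insert(text_id, offset, addition):
    """Editing creates a new text with a new id; the old one stays as it was."""
    old = texts[text_id]
    return store(old[:offset] + addition + old[offset:])

def migrate(ranges, offset, length):
    """Shift (start, end) ranges of the old text past an insertion of `length` characters."""
    moved = []
    for start, end in ranges:
        if start >= offset:          # range lies at or after the insertion: shift it
            start, end = start + length, end + length
        elif end > offset:           # insertion falls inside the range: stretch it
            end += length
        moved.append((start, end))
    return moved

first = store("Habe nun, ach!")
second = insert(first, 14, " Philosophie")
third = insert(second, 0, "Faust: ")
ranges_on_second = [(10, 14)]        # "ach!" in the second text
ranges_on_third = migrate(ranges_on_second, 0, len("Faust: "))
```

Note that `migrate` needs to know exactly where and how much was inserted; if only the two finished texts are available, that information first has to be recovered, for instance by collating them.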

Front-end: search for related projects

We look for other (open source) DH projects possibly doing what we want to do, so we can build on what they already have done.

And the winner is... Annotator! Here is an installation guide written by Ronald: InstallAnnotateIt

Arash installed Annotator on his server.

Issue: how do we mark up ranges in the DOM?

  1. By inserting spans in the DOM via JavaScript? But when the user inserts a span (annotation) within another span (annotation), JavaScript will count the character offset from the closest parent (the older annotation/span). A solution might be http://stackoverflow.com/questions/4811822/get-a-ranges-start-and-end-offsets-relative-to-its-parent-container
  2. Better to mark the span through milestones, and let JS visualise the annotation?
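
The offset problem with nested spans comes down to simple arithmetic: an offset measured inside a nested node has to be increased by the length of all the text that precedes that node. The sketch below shows this on a toy tree in Python. It is not the code from the linked answer, and the nested-list representation of the DOM and the function names are assumptions.

```python
def text_length(node):
    """Number of characters below a node; strings are text nodes, lists are elements."""
    if isinstance(node, str):
        return len(node)
    return sum(text_length(child) for child in node)

def absolute_offset(root, path, local_offset):
    """Convert an offset inside the text node at `path` to an offset in the whole text.

    `path` is a list of child indexes leading from the root to a text node.
    """
    total, node = 0, root
    for index in path:
        # Add the length of every sibling that precedes the chosen child.
        total += sum(text_length(sibling) for sibling in node[:index])
        node = node[index]
    return total + local_offset

# <p>Habe nun, <span>ach! <span>Philosophie</span></span></p> as nested lists
dom = ["Habe nun, ", ["ach! ", ["Philosophie"]]]
offset = absolute_offset(dom, [1, 1, 0], 4)   # 4 characters into "Philosophie"
```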

How should JS visualise the annotation?

  1. By highlighting? What if two annotations overlap?
  2. By inserting parentheses? But they count as characters in the DOM (so they mess up the offset calculation).
  3. By inserting IMGs (no characters)! Only problem: the page layout (interlinear space etc.): if the user scales up the page font size, the IMG should scale along properly.

Solution 3 seems OK. These SVG IMGs should look like brackets, e. g. "{", with numbers on top to distinguish the start of range 1 from the start of range 2, which would otherwise look the same. But these numbers must be part of the SVG image (not text: we don't want them to mess up the offset).
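
A sketch of the milestone idea: empty marker elements are placed at range boundaries, so removing them gives back exactly the original characters and offsets. The class names `range-start` and `range-end` and the `data-range` attribute are hypothetical, and HTML escaping of the text is left out for brevity.

```python
import re

def render_milestones(text, ranges):
    """Insert empty marker elements at range boundaries; the text itself is untouched.

    `ranges` is a list of (number, start, end); markers carry the number as an attribute.
    """
    markers = []
    for number, start, end in ranges:
        markers.append((start, 1, f'<img class="range-start" data-range="{number}"/>'))
        markers.append((end, 0, f'<img class="range-end" data-range="{number}"/>'))
    out, position = [], 0
    # Sort by offset; at the same offset, end markers (0) come before start markers (1).
    for offset, _, tag in sorted(markers):
        out.append(text[position:offset])
        out.append(tag)
        position = offset
    out.append(text[position:])
    return "".join(out)

def strip_markers(html):
    """Remove the empty marker elements to recover the plain text."""
    return re.sub(r"<img[^>]*/>", "", html)

text = "Habe nun, ach! Philosophie"
html = render_milestones(text, [(1, 0, 14), (2, 10, 26)])
```

The two sample ranges overlap, yet the output contains no element that wraps text, so there is no nesting to get wrong.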

When the user hovers over span 1, JS shows a box with the content of the annotation, which can be either a string or a JSON structure.

Only empty elements

The HTML that a browser should manage (to display the plain text coming from the back-end) should not include any non-empty element, like <tag>...</tag> (to avoid interfering with the offset calculation). So the best solution seems to be:

23/11/2012

Hacking

On 23/11/2012 the whole team meets in Amsterdam, as Joris, Tara and Gregor are presenting at the ESTS conference.

The back-end group works (in Java) on what Gregor had already built.

The front-end group modifies the Annotator source code.

24/11/2012

Hacking

The first part of the morning is still hacking. Before lunch time, we draw some conclusions and evaluate the state of the work.

Evaluation: what we built, what is still to be built

The back-end now has a functioning API.

The front-end, instead, still needs a lot of work. The main issue: the original Annotator code is based on the assumption that highlighted ranges are <span>s in the HTML DOM, which implies a tree-like (XML-like) DOM data model. Our textual model, however, is not tree-based but range-based. This makes HTML and JavaScript (both based on a tree data model) less useful tools. The Annotator source code should therefore be heavily modified to fit our milestone-based approach. An alternative solution might be to use the http://stackoverflow.com/questions/4811822/get-a-ranges-start-and-end-offsets-relative-to-its-parent-container workaround, and simply put <span>s in the DOM.

Conclusive remarks: new model, new visualisation

Our prototype is based on an "alternative" text model: no longer XML-based (hierarchy, tree), but range-based. This poses basic visualisation issues. As the underlying model is new, we should be prepared for new visualisation solutions.

Issue: how do you show the user a text annotated with annotations that are overlapping and distributed on layers?

Possible solutions:

  • Showing the text on one long line, with parallel lines for annotations (like a musical score). This solution seems appropriate when an annotation layer flows parallel to the text and the annotated ranges are granular, e. g. every word has an annotation (as with the TEI <w> tag, or in linguistic annotation, when every word has an annotation like "noun" or "adjective" attached to it);
  • Showing the text in a page-like rectangle at the centre of a large two-dimensional surface and the annotations as other rectangles around it, imitating a large table where a reader has a book at the centre and other books, notes and papers around it. This solution seems appropriate when the annotations are larger texts in their own right, and comment on larger spans of the original text (e. g. a footnote commenting on a whole verse of a poem).
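
The first, score-like layout can be tried out in plain text. In this sketch each token gets a column and each annotation layer a parallel line below it. The function name `score_view`, the layer name `pos` and the part-of-speech labels are invented for illustration.

```python
def score_view(tokens, layers):
    """Lay out tokens on one line and each annotation layer on a parallel line below.

    `layers` maps a layer name to one label per token; columns are padded to align.
    """
    rows = [tokens] + [labels for labels in layers.values()]
    widths = [max(len(row[i]) for row in rows) for i in range(len(tokens))]
    lines = []
    for row in rows:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append(" ".join(cells).rstrip())
    return "\n".join(lines)

tokens = ["Habe", "nun", "ach", "Philosophie"]
layers = {"pos": ["verb", "adverb", "interjection", "noun"]}
view = score_view(tokens, layers)
print(view)
```

This layout only works while every layer has exactly one label per token; annotations spanning several words, or long commentary, are better served by the second, table-like layout.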