Minutes Pisa 24 11 2010

From IntereditionWiki

Working Group 2


1. Proposal from Canada

2 - Distributed microservices as core principle for tool architecture
- Evaluation of the first set of microservices

3 - Establishing what not to do next (infrastructure best left to others)
- Identification of highest added value microservices Interedition could focus on
- How to sustain the small but viable Open Source Development Community originating

1. Proposal from Canada. Peter Robinson presents a proposal for Canadian funding (add exact name) of something comparable to a COST Action, giving people from Europe, among others, the chance to visit Canada for meetings. Ray Siemens, Geoffrey Rockwell, Dan O'Donnell and others are involved (names will be added). The name of the group/proposal is "Digital Knowledge Communities". The development and maintenance of digital projects is the topic of the application, along with finding a better way of involving scholars. An important topic that will be discussed is the centralization model versus the localized model. Centralization: an institute hosting data and tools scholars can use. Localized: individuals contributing material independently of such central organizations. These two models need, it is thought, to coexist, but we need to find out how they can interact. For this, bootcamps and meetings are planned. The project will run for two years. Development will not be done by the project itself but by the partners, who interact through it. The deadline for submitting the proposal is next Tuesday. The Huygens Instituut will be the co-applicant acting for the European partners.

Peter asks for other reports and ideas about involving the general public in transcription activities. Joris reports on how the Huygens Instituut is presenting Elaborate, their online transcription/annotation/publishing tool, to a select group of scholars. A broader audience has not yet been given access, because there is no money to host everything on the Huygens servers. In the Netherlands many archives have set up a sort of online transcription room for volunteers who want to transcribe charters. Kathryn refers to project (add name), Julianne mentions TextGrid and a blogging project leading to a book (hackingtheacademy.org). The Transcribe Bentham project is also mentioned: new transcribers are continuously signing up, and some volunteers are transcribing quite a lot. Dino refers to a project in Bologna. Asaf reports how easy it is to involve volunteers: people are very enthusiastic. Peter adds that people enjoy contributing because, among other things, their work becomes visible and contributes visibly. He points out that the experience of, for instance, the Elaborate project is that the more volunteers turn up, the more work you yourself will have. How can we organize that and have it funded? Andrea Bozzi remarks that in the projects of his institute the contributors have to follow certain rules, so the institute can guarantee the quality of the transcription. Peter adds that people are already acting and contributing and that we will have to make sure to keep up with them, helping to make everything findable on the web by promoting certain approaches. Hugh reports on the New Testament work. Here again, the centralization creates too much work for the central point. Troy refers to project-organizing tools that were developed to make the whole process easier to manage. E.g. images can be identified in an easy way, creating an index of the available data. These are open tools, which can be used by everyone. Hugh mentions that of about 60 volunteers who have signed up, 5 or 6 are very active.
Elena says she is a bit sceptical about crowdsourcing. You always need a check by a good scholar to guarantee the quality of a transcription. It is great to involve the community, but there are some practical problems related to academic quality. Peter adds that this problem is a key question for which we have to look for an answer. Checking by scholars is a scaling problem; it could lead to less activity by the volunteers. Kathryn suggests that volunteers could be guided to better quality by giving them points for their work, showing their experience and possible progress in quality. Peter says he is thinking about such training suites. Asaf suggests that transcriptions can be vetted and approved by scholars, thus showing which parts are of a scholarly standard. Anissava explains why she is sceptical of offering especially difficult manuscripts for crowdsourced transcription. Asaf sketches how experienced volunteers can start acting as trainers at some point; in this way the whole group evolves in quantity as well as in quality. Asaf refers to how Wikipedia organizes this sort of thing. Anonymity of the contributors is an important element in this model. Julianne asks if something such as a reference desk could be helpful. Troy remarks that for the New Testament, two transcribers always do the same text - redundancy. Kathryn wonders whether a thorough study could be made of how this kind of project is done and what the experiences are. Peter mentions that there are always only a couple of very active volunteers, but they become very skilled in this way. Joris adds that such a project could be initiated by Interedition. It would have to be more than just a paper report. Could we actually link a couple of relevant projects, as a kind of proof of concept? This could show interested scholars and volunteers the possibilities. Is there someone who wants to be the "problem owner" of this, to organize and manage this possible strand of Interedition?
Peter remarks that this could be part of the Canadian project in preparation, in collaboration with Interedition. Joris states that for now, we need to decide on the topic for the next bootcamp. He thinks it is important to have a fixed version of CollateX in any case. He points out that we need a test suite; we have to look for ways to keep the results stable in the future. Joris is of the opinion that Peter may be the best person to do this. Joris adds that we need to find a way to attract and 'attach' new microservices. Joris is thinking about a model for that as well. Andrea Scotti wonders if that could be outsourced to e.g. a Max Planck environment (e-science). Andrea will send Joris the necessary information. Peter points out an advantage of the microservices idea: the usual tool registries are no longer needed in this form. Tara remarks that what is done in the bootcamps also depends on who turns up and what they want to do. How do we want to organize that - or do we? Peter agrees that we want to suggest, but do not want to prescribe.

2. Peter introduces the collation tool Interedition working group 2 is developing. Ronald Haentjens Dekker presents the status of this tool, CollateX. He shows an example from Darwin's On the Origin of Species, with results in an alignment table showing identical parts next to each other and the diverging parts in a separate color. The results, Peter Robinson remarks, are stunning. Ronald also shows a visualization (generated from the [...]) of the variation across the whole episode chosen. Ronald explains that if CollateX works, the editor can really focus on the actual differences and changes, without having to spend a lot of work on the collation. Joris then explains what has been done on the interface level, through the Google App Engine. Tara then demonstrates the interface working for short texts. Larger texts cannot be collated yet because Google App Engine imposes a very short response time limit. Dino asks more about the format, e.g. TEI. Tara shows it on screen. Elena then asks if CollateX can also deal with very complex TEI files. Tara explains that each user can adapt it, or have it adapted, to their own use of TEI, e.g. by developing several TEI tokenizers next to each other. Peter emphasizes that it is key that everything should be easy for others to adapt to their own purposes. Peter asks Ronald when CollateX will be ready to use. Ronald explains that that is a difficult question. He releases a new version every few months and will announce each one through the Interedition e-mail list. Some can already work with it now, but further development is still very necessary. Daniel asks if parallel aligned translations could also be collated. Joris answers that we could look into this. Daniel refers to his Bergen colleague Koenraad de Smedt, who could be interested in collaboration. Ronald stresses that the collation tool itself does not make any assumption about the structure of the texts, so preprocessing stages may be needed for special types of text. We will have to find out which steps have to be taken.
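
As an illustration of the alignment table idea discussed under this item, here is a minimal sketch in Python. It is not CollateX's algorithm or API: it assumes a naive whitespace tokenizer, handles only two witnesses, and borrows the standard library's difflib for the alignment. The names tokenize and align are hypothetical and chosen only for this example.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Split a witness into whitespace-separated tokens (deliberately naive)."""
    return text.split()


def align(witness_a, witness_b):
    """Return alignment-table rows as (tokens_a, tokens_b, is_variant) tuples.

    Hypothetical helper: difflib stands in for a real collation engine here.
    """
    a, b = tokenize(witness_a), tokenize(witness_b)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal" segments are identical in both witnesses; every other
        # opcode (replace, insert, delete) marks a variant.
        rows.append((" ".join(a[i1:i2]), " ".join(b[j1:j2]), tag != "equal"))
    return rows


if __name__ == "__main__":
    # Variant rows are flagged with "*", standing in for the separate color.
    for left, right, is_variant in align("the cat sat on the mat",
                                         "the cat lay on the mat"):
        marker = "*" if is_variant else " "
        print(f"{marker} {left:<12} | {right}")
```

Because the tokenizer is a separate function, a project-specific one (for example one that reads a particular TEI encoding) could be swapped in without touching the alignment step, which is the kind of adaptability the discussion asks for.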

3. Earlier in Interedition several other tools were mentioned as interesting for development into an Interedition demonstrator: image handling, transcription, alignment. Peter states that some things are already done by (many) others. Should we use our limited funding for those tools, or focus on something only we can actually do? Collation certainly is such a tool. Tara wonders if there are certain gaps in the existing tools. Are there tools needed for the next steps of scholarly work after collation? Dino suggests collation of semantic descriptions. Andrea Scotti remarks that the next step could indeed be to include the tool in a bigger tool. Peter remarks that many tools can be added that directly link to the different sides of CollateX. Andrea Scotti points out that it must be clear who is responsible for which tool, and that credit must be given to the developers. It is also important to make ownership of the cloud clear. That goes beyond CollateX. Joris agrees and states he is happy to hear ideas and suggestions, but it is not something we can do with our limited funds. Andrea Scotti remarks that we need not do it now, but our ideas in this direction should be in our Road Map. Peter wonders if we can wait until we produce the Road Map. Andrea Bozzi refers to metalanguage needs, which we will also have to address in the Road Map because we cannot do much for now. Including NLP tools for different languages would be very useful. Even CLARIN does not have these tools yet. Peter suggests that the developers may have preferences for certain tools. Joris will talk to some people during lunch about where to organize the next bootcamp.

raw material to work into the minutes if useful

  • Peter's grant proposal
    • focused on digital knowledge community
    • how to get scholars involved with that
    • eg by offering tools (Bentham Transcription project)
    • problem of institutionalized tools
    • focus also 'localized scholar': any individual scholar anywhere (coral reef)
    • need both models: Interedition is providing, looking for ways in which we can also support individual scholars/models
      • Huygens will head
      • US/NINES (18thConnect)
      • Open Scholarly Annotation
   => Dan Cohen (George Mason University)
  -> Can crowd sourced approach be an Interedition focus?
  -> Is outreach indeed (motivation not a problem)
    -> Bozzi: there is a problem of control of added value and quality --> versioning
    -> Pierazzo: Bentham uses a retired scholar for quality check --> how to generalize this
  -> concrete: software (in the back end, exchange of text)? Workshops on disseminating the way it works on other tools like ITSEE for the New Testament project (Indexing of images by hand)
  -> Exploring a model for the quality check for crowd sourced transcription (Asaf Bartov and Kathryn and Troy want to cooperate on that)
  => Quality control is a very big issue for e.g. Bozzi, Miltenova, Pierazzo
  => vetting is not possible (create backlog of dead crowd sourced material), but if you are explicit about quality level (points system) + redundancy of transcription
  => The answer is in educating the public (tell it what quality is)
  => Meritocracy & anonymity
  => STSM: survey on what models and results there are ---> who is willing to manage this

( => Asaf: need to seed (a few pages), create the impression that the community is alive, 'show progress')

- Distributed microservices as core principle for tool architecture

- Evaluation of the first set of microservices

- Establishing what not to do next (infrastructure best left to others)

- Identification of highest added value microservices Interedition could focus on

  -> TEI/HTML -> nice visualization
  -> Comparator example?
  -> collating parallel translation -> would be massively useful for EU region
  -> image/transcription - linkage (words matching to graphics)
  -> collation for semantic description

- How to sustain the small but viable Open Source Development Community originating

 - make it a question for the developers at a meeting