RegularisationTool

From IntereditionWiki

Regularisation tool

REST Service

Get all types

  • GET /api/regularisationtype/

Get all scopes

  • GET /api/scope/

Basic Query

  • Gives all the relations for a term
    • cat > [kitten, moggy, hairball]
    • hairball > [kitten, moggy, cat]
  • Basic Query: /api/query/moggy/
  • Another Query: /api/query/cat/
    • TODO Filtering by different fields
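
The two example lists above suggest that a query returns every other term linked to the queried one, in either direction. Below is a minimal in-memory sketch of that behaviour, not the service's actual implementation. It assumes that each rule is just a (token, lemma) pair and that links are followed transitively; `build_index` and `query` are hypothetical names.

```python
from collections import defaultdict, deque


def build_index(rules):
    """Build an undirected adjacency map from (token, lemma) pairs."""
    neighbours = defaultdict(set)
    for token, lemma in rules:
        neighbours[token].add(lemma)
        neighbours[lemma].add(token)
    return neighbours


def query(neighbours, term):
    """Return every term reachable from `term`, excluding `term` itself."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        # .get() avoids creating empty entries for unknown terms
        for other in neighbours.get(current, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return sorted(seen - {term})


# Illustrative rules matching the example on this page
rules = [("kitten", "cat"), ("moggy", "cat"), ("hairball", "cat")]
index = build_index(rules)
print(query(index, "cat"))       # ['hairball', 'kitten', 'moggy']
print(query(index, "hairball"))  # ['cat', 'kitten', 'moggy']
```

Following links transitively is a design choice of this sketch; the real service may only return directly related terms.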

Dump Rules

  • Outputs all the rules - /api/dump/
    • TODO filtering by username or context or whatever

Add a rule

  • POST to /api/ with the following as the data:
{'external_user': 'bob',
'scope': 'global',
'context': 'T-NT-20001-4-3-16',
'regularisation_type': 'orthographical',
'description': "I don't care about cat ages",
'token': 'kitten',
'lemma': 'cat'}

  • External user (optional): The user who made the decision. This can be an id or username from the external system; it is not used by the annotation system and does not need to be unique.
  • Scope: how far the decision applies
  • Context: where the user was when the decision was made
  • Type: The type of decision
  • Token: The text for which the decision applies
  • Lemma (optional): The base form for the regularisation decision
  • Description (optional): Description or justification of the decision.
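
The request described above could be assembled as in the following sketch. It makes several assumptions that this page does not confirm: that the service accepts a JSON body (it may expect form-encoded data instead), that it runs at the hypothetical `BASE_URL`, and that every field not marked optional is required. `build_add_rule_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Hypothetical base URL; this page does not say where the service is hosted.
BASE_URL = "http://localhost:8000"

# Assumption: fields not marked "(optional)" in the list above are required.
REQUIRED_FIELDS = ("scope", "context", "regularisation_type", "token")


def build_add_rule_request(rule, base_url=BASE_URL):
    """Build (but do not send) a POST request that adds one rule."""
    missing = [field for field in REQUIRED_FIELDS if field not in rule]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return urllib.request.Request(
        base_url + "/api/",
        data=json.dumps(rule).encode("utf-8"),  # assumed JSON encoding
        headers={"Content-Type": "application/json"},
        method="POST",
    )


rule = {
    "external_user": "bob",
    "scope": "global",
    "context": "T-NT-20001-4-3-16",
    "regularisation_type": "orthographical",
    "description": "I don't care about cat ages",
    "token": "kitten",
    "lemma": "cat",
}
request = build_add_rule_request(rule)
print(request.get_method(), request.full_url)
```

Sending the request would then be a call to `urllib.request.urlopen(request)` against a running instance.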

Apply rule

POST to /api/apply/

Possible Other Tools to explore

Regularisation elsewhere

  • SOLR has a SynonymFilter (format described here: X)
  • OAC Open Annotation: Beta Data Model Guide has useful formalizations for describing who made the decision, when, and on which part of document or document set


Reusable material for lists of synonyms ... well, reusable in theory...

Use cases for the regularisation tool

Six, sometimes quite different, use cases.

1. Search in documents

  • Aim: qualify strings as relevant to the search
  • Relevance in terms of: synonym, hypernym, hyponym, alternate spelling, homograph, abbreviation.
  • Import the rule-set, apply it to the same or another document, modify the rules
  • When applying the rules: choose what type of relevance to use, whose relevance ratings to use

2. Check a collation: define which tokens are equivalent

  • Aim: qualify collated items as needing regularization or not and qualify the type of regularization to apply
  • Types of regularization: normalization (variant spellings in the same document), standardisation (variant spelling across a group of documents), modernisation (variant spellings for other uses).
  • Import the rules, apply changes to witnesses, rerun the collation, check again
  • When applying the rules: choose what type of regularization to apply, and whose regularization choices to use.
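
Choosing the type of regularization and whose choices to use could be sketched as a filter over the rule set before it is applied to a witness. This is only an illustration: `apply_rules` is a hypothetical function, the rule fields reuse the names from the "Add a rule" data, and the example rules and type values are made up.

```python
def apply_rules(tokens, rules, types=None, users=None):
    """Replace each token by its lemma where a selected rule matches.

    `types` and `users` are optional sets; None means "accept all".
    """
    selected = {}
    for rule in rules:
        if types is not None and rule["regularisation_type"] not in types:
            continue
        if users is not None and rule.get("external_user") not in users:
            continue
        if "lemma" not in rule:
            continue  # lemma is optional; without it there is nothing to substitute
        # The first matching rule for a token wins
        selected.setdefault(rule["token"], rule["lemma"])
    return [selected.get(token, token) for token in tokens]


# Made-up rules for illustration only
rules = [
    {"external_user": "bob", "regularisation_type": "orthographical",
     "token": "catt", "lemma": "cat"},
    {"external_user": "alice", "regularisation_type": "modernisation",
     "token": "olde", "lemma": "old"},
]
witness = ["the", "olde", "catt"]

print(apply_rules(witness, rules, types={"orthographical"}))  # ['the', 'olde', 'cat']
print(apply_rules(witness, rules, users={"alice"}))           # ['the', 'old', 'catt']
print(apply_rules(witness, rules))                            # ['the', 'old', 'cat']
```

The regularized witnesses could then be fed back into the collation for another check.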

3. Search in a catalogue (such as GBooks)

  • Aim: qualify hits as relevant to the search or not
  • Relevance: yes or no
  • Scope: hit is relevant as the whole document or just parts of the document
  • Actions: save the rules, reuse the same rule-based search later on
  • When applying the search: choose what scope, and whose definitions

4. Search for rhymes

  • Aim: qualify search results from a rhyming search as relevant or not
  • Relevance: rhyme, near rhyme, visual rhyme, not a rhyme
  • Scope: should usually be global, could be local in some cases
  • Actions: search, look at the suggested rhymes, judge them, write rules, reuse rules on other texts

5. Search for "clausulae"

  • Like rhymes, but with different levels of similarity: similar, less similar, different

6. Search for motifs

  • Search for segments of text, looking for motifs (something like recurring themes), judge results in terms of their relation to the motif
  • Relations: same, variation, opposite, unrelated
  • Action: rewrite the file with only the annotations as a basis for further comparison of other documents treated the same way.

GUI

Troy, Joris and Tara discussed an interface that would allow users to indicate and identify relationships between variants, and to generate rules from these for a regularization tool rule engine, or to give the added information back to any other service. The GUI will basically be an alignment table on top of a 'workspace'. Clicking on a row will populate the workspace with the variants. Dragging variants on top of each other indicates a relationship; a pop-up will then allow the relationship to be categorized.

Thursday afternoon - Tara and Joris began prototyping the web service, based initially on Tara's collection of text collations.

Friday afternoon - Tara & Joris:

  • fixed a long list of bugs in the text tradition library
  • changed alignment table to CollateX JSON format
  • changed SVG rendering to line up variant nodes in the graph that are in the same row
  • changed SVG graph background to transparent
  • changed initial scaling of SVG Graph
  • added a 'visor' that will detect what nodes will be displayed from the overall graph in a workspace
  • work in progress on detection of nodes under visor

Some possible relations

  • Synonym: this token's semantic field is (nearly) identical to that of the target token; example: "domestic cat" is a synonym of "house cat".
  • Antonym: this token's meaning is the contrary of the target token's meaning. Example: "love" is the antonym of "hate".
  • Hyponym: this token's semantic field is included in that of the target token; example: "cat" is a hyponym of "feline", "feline" is a hyponym of "mammal".
  • Hypernym: this token's semantic field includes and goes beyond that of the target token; example: "feline" is a hypernym of "cat", "cat" is a hypernym of "kitten".
  • Homograph: this token's spelling is identical to that of the target token, but their semantic fields are clearly different. Example: "cat" (the Unix command) is a homograph of "cat" (the domestic animal).
  • Spelling variant: this token's spelling is similar but not identical to that of the target token, although the underlying lexical item is identical; the reason may be a typographical error or diachronic or scribal variation. Example: "catt" is an (erroneous) spelling variant of "cat".
  • Abbreviation: this token's spelling represents an abbreviation of the target token; Example: "c." is an abbreviation of "cat".
  • Rhyme: this token rhymes with the target token; Example: "cat" rhymes with "mat".
  • Visual rhyme: this token's last letters are identical to the target token's last letters, but the pronunciation of the two tokens is different. Example: "enough" is a visual rhyme of "through".
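
Several of the relations above imply a relation in the opposite direction: hyponym and hypernym mirror each other, while synonym, antonym, homograph and the two rhyme relations read the same both ways. The sketch below encodes that reading. The `INVERSE` table and the `invert` function are hypothetical; this page does not define inverses, so directional relations such as abbreviation and spelling variant are deliberately left out.

```python
# If A is a <key> of B, then B is a <value> of A.
# Assumption: derived from the definitions above, not specified by the tool.
INVERSE = {
    "synonym": "synonym",
    "antonym": "antonym",
    "hyponym": "hypernym",
    "hypernym": "hyponym",
    "homograph": "homograph",
    "rhyme": "rhyme",
    "visual rhyme": "visual rhyme",
}


def invert(token, relation, target):
    """Return the implied reverse relation, or None if no inverse is known."""
    inverse = INVERSE.get(relation)
    if inverse is None:
        return None
    return (target, inverse, token)


print(invert("cat", "hyponym", "feline"))   # ('feline', 'hypernym', 'cat')
print(invert("cat", "rhyme", "mat"))        # ('mat', 'rhyme', 'cat')
print(invert("c.", "abbreviation", "cat"))  # None
```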