The Gothenburg model and CollateX

Outline

  • The Gothenburg model
    • History
    • Goals
    • Components
  • CollateX
    • What it is CollateX?
    • Collation pipeline in CollateX

The Gothenburg model: history

Goals

Identification of the core components of textual comparison at an abstract level

  • common understanding
  • facilitation of collaboration

Components

  1. Tokenization
  2. Normalization/regularization
  3. Alignmemt
  4. Analysis
  5. Visualization/output

Prerequisite: an electronic text version of each witness

1. Tokenization

  • Division of the continuous text into units to be aligned (tokens)

Would you care for a sherbet lemon?

--> Would | you | care | for | a | sherbet | lemon | ?

1. Tokenization

  • Division of the continuous text into units to be aligned (tokens)
  • Any level of granularity
    • Typically: whitespace-delimited words
    • Other options: syllables, lines, phrases, verses, paragraphs, text nodes...

Tokenization: challenges

  • Ambiguity
  • Punctuation
  • Language specific issues: contractions, superscription, etc.
  • Markup

Tokenization challenges: some examples

  • He remarked, “John said, ‘Bout starts at nine.’”
  • He remarked, “John said, ‘It’s ’bout time.’”
  • Tu es un %#@$!
  • Oh d--n it!
  • MASS.

2. Normalization/regularization

  • Normalization during transcription vs. collation
  • Ignore non-substantive variation for comparison
    • Punctuation
    • Upper/lower case
    • Orthographic variation
      • Allographs (letterforms)
      • Abbreviations
normalization

3. Alignment

  • Find the tokens that match
  • Introduce gap tokens when necessary (“omissions”)
alignment

Alignment of tree witnesses

Alignment: challenges

  • Computational complexity

Alignment: challenges

  • Repetition
  • Transposition
aligment
alignment
alignment

Alignment: challenges

  • Order effects

4. Analysis/feedback

  • Intepretation beyond linear alignment
  • Manual intervention?
alignment

analizer

5. Visualization/output

  • Markup for further processing
    • XML, TEI, JSON, GraphViz DOT, LaTeX, etc.
  • Textual visualization, for examination and analysis
    • Textual alignment table
      • Plain text, HTML, PDF
    • Toolkits with additional functionalities: Juxta †
  • Graphic visualization, for examination and analysis
    • Variant graph

Example

  • W1: Introduction à la collation automatique
  • W2: Cours sur la collation automatique
  • W3: En savoir plus sur la collation automatique

Alignment table

W1 Introduction à la collation automatique
W2 Cours sur la collation automatique
W3 En savoir plus sur la collation automatique

Variant graph

variant graph

Collation and/or visualization tools

CollateX

  1. What it is CollateX?
  2. Collation pipeline in CollateX

Flavors

  • Java
  • Web app
  • Python

Advantages of CollateX

  • Data formats
    • Input: Anything and everything (JSON)
    • Output: Anything and everything (JSON)
  • Control over each step of the pipeline

Collation pipeline in CollateX

  • Default behaviours
  • Parameters

Tokenization in CollateX

  • It divides the text into tokens using whitespaces as delimiter
  • Punctuation is tokenized separately from alphanumeric characters
Example: Peter's cat.
Peter ' s cat .

Normalization in CollateX

By default, it removes trailing white space at the end of tokens.

Pretokenized and normalized input

JSON file as input: Each token may present a normalized version

CollateX alignment parameters

Progressive alignment

  1. start by comparing two versions,
  2. transform the result into a variant graph, then
  3. compare another version against that graph, and
  4. merge the result of that comparison into the graph;
  5. repeate the procedure until all versions have been merged.

Analysis in CollateX

Exact vs. near (fuzzy) matching

  • A: And Ron pulled out a fat grey rat
  • B: And Ronald pulled out a gray rat

Exact matching

A And Ron pulled out a fat grey rat
B And Ronald pulled out a gray - rat

Near matching

A And Ron pulled out a fat grey rat
B And Ronald pulled out a - gray rat

CollateX outputs

  • Alignment table: ASCII, CSV, TSV, HTML,XML, XML-TEI, JSON
  • Variant graph: SVG

Alignment table

W1 Introduction à la collation automatique
W2 Cours sur la collation automatique
W3 En savoir plus sur la collation automatique

Variant graph

variant graph

TEI


<cx:apparatus xmlns:cx="http://interedition.eu/collatex/ns/1.0" 
xmlns="http://www.tei-c.org/ns/1.0">
    <app>
    	<rdg wit="W1">Introduction à</rdg>
    	<rdg wit="W2">Cours</rdg>
    	<rdg wit="W3">En savoir  plus</rdg></app>
    <app><rdg wit="W1"/><rdg wit="W2 W3">sur</rdg></app>
    la collation automatique</cx:apparatus>	
						

Bibliography

  • Bourdaillet J. & Ganascia J.-G. (2007): “Practical block sequence alignment with moves.” LATA 2007 - International Conference on Language and Automata Theory and Applications, 3/2007.
  • Dekker, R. H. & Middell, G. (2011): “Computer-Supported Collation with CollateX: Managing Textual Variance in an Environment with Varying Requirements.” Supporting Digital Humanities 2011. University of Copenhagen, Denmark. 17-18 November 2011.
  • Interedition Development Group (2010-): CollateX - Sofware for Collating Textual Sources. https://collatex.net/
  • Hoover, David L. (2015): “The Trials of Tokenization.” DH2015, University of Western Sydney, Australia, June 29–July 3, 2015.
  • Needleman, Saul B. & Wunsch, Christian D. (1970): A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48 (3), 443–53.