David Bamman, Alison Babeu, Gregory Crane. Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection

We present a method for automatically projecting structural information across translations, including canonical citation structure (such as chapters and sections), speaker information, quotations, markup for people and places, and any other element in TEI-compliant XML that delimits spans of text that are linguistically symmetrical in two languages. We evaluate this technique on two datasets, one containing perfectly transcribed texts and one containing error-laden OCR. We achieve 88.2% accuracy when projecting 13,023 XML tags from source documents to their transcribed translations, and 83.6% accuracy when projecting to texts containing uncorrected OCR. This approach could allow a highly granular multilingual digital library to be bootstrapped by applying the knowledge contained in a small, heavily curated collection to a much larger but unstructured one.

Source: http://hdl.handle.net/10427/70398
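The projection idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a word-level alignment is already available as (source index, target index) pairs, and it projects each tagged source span to the smallest target span that covers all of its aligned words. The names `project_span`, `project_tags`, and `ALIGNMENT`, and the toy Latin/English sentence, are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]          # inclusive token indices: (start, end)
Alignment = List[Tuple[int, int]]  # (source token index, target token index)

# Toy example (assumed, not from the paper):
# source: ["Gallia", "est", "omnis", "divisa"]
# target: ["All", "Gaul", "is", "divided"]
ALIGNMENT: Alignment = [(0, 1), (1, 2), (2, 0), (3, 3)]


def project_span(span: Span, alignment: Alignment) -> Optional[Span]:
    """Project a source span to the smallest covering target span.

    Returns None when no word inside the span has an aligned target word.
    """
    start, end = span
    targets = [t for s, t in alignment if start <= s <= end]
    if not targets:
        return None
    return (min(targets), max(targets))


def project_tags(tags: Dict[str, Span], alignment: Alignment) -> Dict[str, Optional[Span]]:
    """Project every named tag span through the same alignment."""
    return {name: project_span(span, alignment) for name, span in tags.items()}


if __name__ == "__main__":
    source_tags = {"placeName": (0, 0), "section": (0, 3)}
    print(project_tags(source_tags, ALIGNMENT))
```

Taking the minimum and maximum aligned index keeps the projected span contiguous, which suits XML elements that must wrap an unbroken run of text. A span with no aligned words is reported as `None` rather than guessed.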

