TEI-L discussion on grapheme encoding

A recent discussion on the TEI-L mailing list started in November 2014 and continued into December.

The following is a selection of some interesting statements and positions.

Sebastian Rahtz

Full post

People may not realize how easy this is in XSLT. The following template 
is a simple test to see whether the input uses any of the alchemical characters. 

 <xsl:template match="/">
   <xsl:if test="matches(., '\p{IsAlchemicalSymbols}')">
     <xsl:message>Text has alchemical characters</xsl:message>
   </xsl:if>
 </xsl:template>

In a later post, S. Rahtz shared the output of a more complete XSL script (which, however, did not report 'standard' Latin Unicode characters):

you may be amused by this new output from the XSL script 
I referred to earlier, which lists the actual characters used from 
each character range, and the corresponding code point 

     <table> 
         <tr> 
            <td>General Punctuation</td> 
            <td>‘’“—</td> 
            <td>8216 8217 8220 8212</td> 
         </tr> 
         <tr> 
            <td>Latin-1 Supplement</td> 
            <td>ñç&nbsp;þ</td> 
            <td>241 231 160 254</td> 
         </tr> 
         <tr> 
            <td>CJKUnifiedIdeographs</td> 
            <td>日本語中文</td> 
            <td>26085 26412 35486 20013 25991</td> 
         </tr> 
         <tr> 
            <td>Latin Extended-A</td> 
            <td>ę</td> 
            <td>281</td> 
         </tr> 
      </table>

And eventually, in a third post, he linked to a complete XSLT script he had written to determine to which Unicode ranges the characters used in an XML file belong.
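
That script is not reproduced here, but a minimal sketch of the same idea, written for this summary rather than taken from Rahtz's post, might look as follows in XSLT 2.0 (the selection of blocks is arbitrary, and the code assumes an XSLT 2.0 processor such as Saxon):

  <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xsl:output method="xml" indent="yes"/>

    <!-- Unicode blocks to report on (arbitrary selection; extend as needed) -->
    <xsl:variable name="blocks" as="xs:string*"
        select="('GeneralPunctuation', 'Latin-1Supplement',
                 'CJKUnifiedIdeographs', 'LatinExtended-A')"/>

    <xsl:template match="/">
      <!-- every distinct character occurring in the document's text -->
      <xsl:variable name="chars" as="xs:string*"
          select="distinct-values(for $c in string-to-codepoints(string(.))
                                  return codepoints-to-string($c))"/>
      <table>
        <xsl:for-each select="$blocks">
          <xsl:variable name="block" select="."/>
          <!-- the characters of this block actually used in the document -->
          <xsl:variable name="used" as="xs:string*"
              select="$chars[matches(., concat('\p{Is', $block, '}'))]"/>
          <xsl:if test="exists($used)">
            <tr>
              <td><xsl:value-of select="$block"/></td>
              <td><xsl:value-of select="string-join($used, '')"/></td>
              <td><xsl:value-of select="for $c in $used
                                        return string-to-codepoints($c)"/></td>
            </tr>
          </xsl:if>
        </xsl:for-each>
      </table>
    </xsl:template>
  </xsl:stylesheet>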

Janusz S. Bien

Full post

Janusz S. Bien's students prepared a tool to autogenerate such a list of used characters.

Stuart A. Yeates

Full post

He suggested it would be useful to include in the TEI header

an XML fragment that generated a
<langUsage/> tag with <language/> tags and a summary of the characters
used
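
The thread did not settle on a concrete format; purely as an illustration (the languages, usage percentages and block names below are invented), such an autogenerated header fragment might look like this:

      <langUsage>
        <!-- hypothetical, autogenerated summary -->
        <language ident="en" usage="95">English; characters drawn from
          Basic Latin, Latin-1 Supplement and General Punctuation</language>
        <language ident="ja" usage="5">Japanese; characters drawn from
          CJK Unified Ideographs</language>
      </langUsage>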

Martin Holmes

Full post

Here at Council we've discussed several projects with huge files in 
which it's helpful and convenient to generate all this info offline and 
then store it in the header for easy harvesting

Martin Mueller

Full post

He agreed with M. Holmes:

In principle you could
generate them on the fly, but you can say that about a lot of stuff in the
header.

Roberto Rosselli Del Turco

Full post

He makes broader use of <g> than the Guidelines suggest:

         <glyph xml:id="sins">
           <glyphName>LATIN SMALL LETTER INSULAR S</glyphName>
           <mapping type="codepoint">U+A785</mapping>
           <mapping type="diplomatic">ꞅ</mapping>
           <mapping type="normalized">s</mapping>
         </glyph>
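
In the transcription itself, each occurrence of the letter can then point back to this declaration through <g>; a minimal sketch (the surrounding word is invented):

         <!-- hypothetical occurrence in the transcribed text -->
         <w>wæ<g ref="#sins">ꞅ</g></w>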

Sebastian Rahtz

Full post

there is a discussion currently in the TEI Technical Council about the idea
of a TEI "Lint", a tool to assess and profile a TEI instance document or set of documents.
One of the things that it could do is report on the usage of characters outside
specified ranges, and consider how best to report it.
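
Nothing of the kind exists yet; purely as a sketch of the character-profiling part (the two 'permitted' blocks are an arbitrary example, not anything decided by the Council), such a check could be written in XSLT 2.0 along these lines:

  <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- warn about every distinct character outside the permitted blocks -->
    <xsl:template match="/">
      <xsl:for-each select="distinct-values(
            for $c in string-to-codepoints(string(.))
            return codepoints-to-string($c))">
        <xsl:if test="not(matches(., '[\p{IsBasicLatin}\p{IsGeneralPunctuation}]'))">
          <xsl:message>
            <xsl:text>Character outside permitted ranges: </xsl:text>
            <xsl:value-of select="."/>
            <xsl:text> (code point </xsl:text>
            <xsl:value-of select="string-to-codepoints(.)"/>
            <xsl:text>)</xsl:text>
          </xsl:message>
        </xsl:if>
      </xsl:for-each>
    </xsl:template>
  </xsl:stylesheet>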

Paul Schaffner

Full post (very interesting)

The objective was and is to replicate the utility of
SGML character entities in a P5/XML/UTF-8 world.

Frederike Neuber

Full post

I 'just' want to describe their shape. Further I do not want
to refer from <g> to <glyph>, as I do not want to use <g> in <text> at all.

Paolo Monella

Full post

Orlandi suggests that when an encoder creates a digital transcription of a manuscript
or another primary source with a pre-modern writing system, they should give a
formal complete list of (and possibly a description of) _each_ grapheme or glyph
of the writing system of that source. 

[...]

I've been working for a while trying to find a way to encode such a _complete_
list in TEI P5 somehow. I even gave a talk on this at the TEI 2013 conference
in Rome (http://bit.ly/1zKZPsq). But:
    1) the Guidelines prevent you from re-defining characters
    that already exist (and are defined) in Unicode,
    2) and if, nonetheless, you create a <char> or <glyph> element
    for each grapheme (from a to z and beyond), you then have to
    wrap every character in the <text> in a <g> element (see the sketch below).
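
Purely as an illustration of point 2 (the identifier, the mapping values and the note are invented, not taken from any project), such a complete list and its consequence for the transcription might look like this:

         <charDecl>
           <!-- one hypothetical entry; a complete list would declare every
                grapheme of the source's writing system, from a to z and beyond -->
           <char xml:id="g_u">
             <mapping type="codepoint">U+0075</mapping>
             <mapping type="normalized">u</mapping>
             <note>Single grapheme covering both the vocalic and the
               consonantal value (modern u/v).</note>
           </char>
         </charDecl>

         <!-- ...and in the transcription every occurrence of the grapheme would
              then have to be encoded as: -->
         <g ref="#g_u">u</g>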


[...]

I would like to have a TEI way to encode such a complete list in the header and
then have some mechanism to formally bind each character in the <text> with the
elements of that list. Sebastian mentioned a TEI "Lint". Paul mentioned
SGML text entities. I could also mention an old SGML feature that I'll call
"archaeology of methodology" in a talk I'll give on Thursday on the <charDecl>
topic (http://www.unipa.it/paolo.monella/dixit2014/index.html):
the WSD - Writing System Declaration. Not that I want to raise the dead,
but I think that some mechanism like that could be a good practice
for those encoding textual sources on pre-modern writing systems.

A paragraph I didn't include in my post

In Orlandi's "Saussurean" view, all (especially pre-modern) writing systems are "non-standard", not only because they may include "non-standard" graphemes (such as alchemical symbols), but because they make a specific use of punctuation, they have or do not have a Latin u/v or i/j distinction, etc. All these symbols are today more or less easily included in Unicode, but only a complete list of the encoded graphemes (with a short description of each) can define their meaning in that specific (graphical) semiotic system.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.