Paolo Monella, Information and functionality in scholarly digital editions

Patrick Sahle wrote that a scholarly digital (not digitized) edition (SDE) "cannot be given in print without a significant loss of information or functionality". Daniel Kiss argued that, with enough pages, any quantity of information might be "given in print" too. This talk tackles the general question of the "digital added value" of an SDE -- compared to print -- from the perspective of the information/functionality hendiadys. We can perform Hjelmslev's "analysis" on raw character data and formally identify "entities" both on the syntagmatic axis (textual structures and relations) and on the paradigmatic axis (tokens, lemmas, stylistic features, named entities). In the Italian tradition of Digital Humanities this operation is commonly called "formalizzazione" (formalization) or "codifica" (encoding). In English/international terminology, the key terms for it are "markup" or, more generically, "annotation".

1. On this basis, the first possible added value of an SDE is that we can formalize and visualize complex relations within the text (structure, syntax, metatext), at its threshold (paratext) and beyond a monolithic/abstract concept of "Text" (versions, text/document). An important issue arises with the information itself: once the concept of text "explodes" and includes metatext, paratext, parallel versions and material philology, the quantity of information grows exponentially, in a "fractal" way (one paragraph, two versions, four glosses -- one for each version -- and so on). Is it worth encoding digitally? Moreover, visualization is the only function commonly applied here, which exposes these digital philology applications to Kiss's argument. Ultimately, the question is: how much does each area of textual studies (papyrology, epigraphy, classical/medieval/genetic philology etc.) want to invest in the digital recording of such "fractal" information, with the sole purpose of visualizing it?
It depends on how much each area is focussed on the plural nature of the text (versions) and on the documents bearing the texts.

2. The second possible added value lies in the semiotic concept of "isotopy", defined by Greimas as "un ensemble redondant de catégories sémantiques" (a redundant set of semantic categories). If we formally identify entities on the paradigmatic axis (tokens, lemmas, stylistic features, named entities/Linked Open Data), algorithms can identify isotopies throughout a text -- that is, they can track the recurrence of entities of the same class, such as lemmas of the same lexical field, similar linguistic and stylistic features, place names etc. The question now becomes: what do we do with those isotopies?

2.1 We can apply simple algorithms to create a linear visualization of the isotopy, i.e. of the recurrence of elements of the same class (highlighting, search, indices, maps). A possible objection here regards both information and functionality: print editions might theoretically record/visualize trivial information (such as the linguistic annotation of a morphological category) through formatting, but they do not do so because the mere visualization of such basic information would not bring any strong scientific advantage. If, instead, the information is more meaningful (e.g. people, names, concepts), the printed indices of a book can also track it throughout the text. This suggests that mere linear visualization of isotopies does not necessarily provide a compelling added value of SDEs over print editions.

2.2 In addition, we can apply more complex algorithms to further process an isotopy (the recurrence of some elements) and produce secondary data with non-linear outputs. Examples of such algorithms include topic modelling, stylometry, word vectors or, if we use entities/Linked Open Data entities as input, social network analysis (for people) and network analysis (for places and other concepts). Outputs include tables, graphs and other forms of complex data visualization.
In this case, the added value is apparent both in terms of information (the data produced is new, meaningful, and it is not encoded manually, but produced by software, thus removing the issue of limited time/human resources) and in terms of functionality (data is produced dynamically based on analysis algorithms and their adjustable parameters).

3. A third category of fairly apparent added values regards the social dimension of SDEs: the very availability of large plain text corpora (with the connected basic functions of browse and string matching search); social editing (based on shared research infrastructures such as papyri.info); Open Science (resource interoperability based on APIs, data reuse based on Open Data repositories).

In conclusion, compelling arguments for the added value of SDEs certainly come from the functionalities in the third category above (3. social dimension) and from the information and functionalities of category 2.2 (complex algorithms that process isotopies and produce a non-linear output). The advantage produced by category 2.1 (simpler algorithms that produce a linear visualization of isotopies) is less compelling. This suggests that the development of computational text analysis methods (2.2) is a key challenge for digital philology. As for category 1 (visualization of textual relations), only those areas of textual studies which are more deeply concerned with "plural" texts and with the text/document relation currently find it convenient to invest vast resources (in terms of time, training and funding) to encode that kind of potentially "fractal" information.
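
The contrast between categories 2.1 and 2.2 can be made concrete with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy annotated paragraphs, the entity identifiers, and the function names linear_index and cooccurrence_edges belong to no actual edition or library. The first function only re-presents the annotation linearly, much as a printed index would; the second derives secondary data (weighted links between entities) that nobody encoded by hand.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical annotated text: each paragraph is a list of
# (token, entity_class, entity_id); unannotated tokens carry None.
PARAGRAPHS = [
    [("Cicero", "person", "cicero"), ("wrote", None, None),
     ("to", None, None), ("Atticus", "person", "atticus"),
     ("from", None, None), ("Rome", "place", "rome")],
    [("Atticus", "person", "atticus"), ("stayed", None, None),
     ("in", None, None), ("Athens", "place", "athens")],
    [("Cicero", "person", "cicero"), ("and", None, None),
     ("Atticus", "person", "atticus"), ("met", None, None),
     ("in", None, None), ("Rome", "place", "rome")],
]


def linear_index(paragraphs, entity_class):
    """Category 2.1: a print-style index (entity -> paragraph numbers)."""
    index = defaultdict(list)
    for number, paragraph in enumerate(paragraphs, start=1):
        for _token, cls, entity_id in paragraph:
            if cls == entity_class and number not in index[entity_id]:
                index[entity_id].append(number)
    return dict(index)


def cooccurrence_edges(paragraphs, entity_class):
    """Category 2.2: weighted edges between entities sharing a paragraph."""
    edges = Counter()
    for paragraph in paragraphs:
        # Distinct entities of the requested class in this paragraph.
        present = sorted({eid for _t, cls, eid in paragraph
                          if cls == entity_class})
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


if __name__ == "__main__":
    print(linear_index(PARAGRAPHS, "person"))
    print(linear_index(PARAGRAPHS, "place"))
    print(dict(cooccurrence_edges(PARAGRAPHS, "person")))
```

The edge weights produced by the second function are the kind of input that a network analysis or graph visualization would consume. Changing the co-occurrence window (here, the paragraph) changes the output, which is one sense in which such data is produced dynamically from adjustable parameters.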
