S. Bhattacharyya, Words in a World of Scaling-up: Epistemic Normativity and Text as Data

The extracts below summarize the article; my main point of interest is its argument about variant spellings of words from under-represented non-Western languages: "[...] when the apparatuses of knowledge production are computational tools that operate on archives consisting of large corpora of digitized text. Can this kind of apparatus, too, end up being complicit in the production of erasure or loss when reading from the archive? We want to make two broad points pertaining to this complicity. [...] Our second point is that this same complicity also makes its appearance as world literature [...] tends to become, in our time, a hegemonic, universal category. [...] The computational tool that we consider in this paper is the HathiTrust Bookworm. [...] “distant reading,” [...] We draw attention to an interesting problem that arises when the queried word in a tool like the HathiTrust Bookworm is one that is from a non-European language, but which occurs within European-language texts — with the word occurring in the text in the roman alphabet, that is, in transliterated form. We found that the occurrence of such words was being underreported, and sometimes not being reported at all. [...] low-frequency words cannot be put in the index, and have to be treated, effectively, as if they never occurred in the corpus. [...] There is, however, a different and more complex problem, that is of interest to us here. This problem has to do with variant spellings of words. [...] [that is, variant-spelling transliterations from languages not originally written in roman script; contrast languages such as Turkish, in which each word has one standard roman-script spelling] [...] The only way that the substantial content of this heterogeneity can become legible is if it has always already been coded/translated (or, as in this case, transliterated) in terms of the hegemonic forms of knowledge-organization embodied in the apparatuses of knowledge production and storage [...].
the notion of world literature itself raises in its own way questions similar to those relating to the ones we have discussed in this essay: the difficulty of representing sensory continuity through discrete, determinate objects [...] the dilemma facing Nigerian writers (and, by extension, any writers writing from the global periphery) writing in a “global” language like English in making a choice as to whether to describe objects and concepts unfamiliar to a global readership by the use of a single, unglossed, indigenous-language word, which may fail to make its referent legible to the western reader [...] The examples constitute instances of the kind of standardization or grammatization enacted by apparatuses of knowledge".
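
To make the indexing problem in the extracts concrete for myself, here is a minimal sketch of how a frequency cutoff and variant spellings can combine to make a word disappear. It is not the HathiTrust Bookworm's actual implementation. The threshold `MIN_COUNT`, the toy token list, the `VARIANTS` map, and the `build_index` helper are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical cutoff: words seen fewer times than this are left out of the index.
MIN_COUNT = 3

# Toy corpus (invented): one transliterated word spelled three different ways.
tokens = (
    ["the"] * 5
    + ["zamindar"] * 2
    + ["zemindar"] * 2
    + ["jomidar"] * 1
)

# Hand-built map from variant spellings to one chosen form. This is an
# assumption: building such a map is itself the hard and contestable step.
VARIANTS = {"zemindar": "zamindar", "jomidar": "zamindar"}


def build_index(words, min_count=MIN_COUNT):
    """Count words and keep only those that reach min_count."""
    counts = Counter(words)
    return {w: n for w, n in counts.items() if n >= min_count}


# Each spelling is counted on its own, so every variant falls below the cutoff.
raw_index = build_index(tokens)

# Variants are mapped to one form before counting, so the word clears the cutoff.
merged_index = build_index(VARIANTS.get(w, w) for w in tokens)

print(raw_index)
print(merged_index)
```

In this toy case the word occurs five times in total, as often as "the", yet it is absent from `raw_index` because no single spelling reaches the cutoff. The merge step recovers it only by picking one spelling as the standard, which is itself an instance of the standardization the extracts describe.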

