They saw it, onu, 它, coming: An information theoretic study of cross-linguistic variation in personal pronouns
Personal pronouns share with other nominal constructions, such as common nouns and demonstratives, the basic function of selecting and identifying a referent. However, they are special in that they depend on the context and are inherently rooted in interlocutor roles (speaker, addressee, other). Because they stand in for information that is already known to the hearer, we might expect their usage to be similar across languages and to vary little between pronouns within the same language. But as is known from previous studies, many languages of the world do not require overt independent personal pronouns in subject position, and different pronouns play very different roles in spoken and written text, depending most importantly on person and case. In this study, we aim to capture the predictability of personal pronouns using the information-theoretic measure of surprisal, which characterizes the information value of a word given its preceding context. Our data come from mini-CIEP+, a parallel corpus of literary texts, from which we sample 17 languages from 8 language families. We compute the surprisal of personal pronouns in these languages at different context sizes, using mGPT language models. We then fit linear mixed-effects models with surprisal-based response variables and a range of independent variables, including the frequency of the pronoun and various morpho-syntactic parameters (syntactic role, number, person, and others). These parameters are extracted from grammars and from mini-CIEP+, which is automatically annotated in the Universal Dependencies framework using pre-trained models. We find universal effects of frequency and near-universal effects of position on surprisal, but the estimates for other variables differ widely between languages, both in which variables are relevant and in their polarity.
We conclude that this type of quantitative study could shed further light on the usage of different types of nominal referents across languages, a task for which corpus-based typology is ideally positioned.
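
To make the surprisal measure concrete: the surprisal of a word w given its preceding context c is standardly defined as -log2 P(w | c), so that less predictable words carry more bits of information. The sketch below illustrates this definition only. It assumes a toy add-one-smoothed n-gram model as a stand-in for the mGPT language models used in the study, and the function names (`train_ngram`, `surprisal`) and the example sentence are hypothetical, chosen for illustration.

```python
import math
from collections import Counter


def train_ngram(tokens, n):
    """Count n-grams and their (n-1)-token contexts in a token list.

    Toy stand-in for a language model; the study itself uses mGPT.
    """
    spans = range(len(tokens) - n + 1)
    ngrams = Counter(tuple(tokens[i:i + n]) for i in spans)
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in spans)
    return ngrams, contexts


def surprisal(word, context, ngrams, contexts, vocab_size):
    """Surprisal in bits: -log2 P(word | context), with add-one smoothing."""
    context = tuple(context)
    p = (ngrams[context + (word,)] + 1) / (contexts[context] + vocab_size)
    return -math.log2(p)


# Hypothetical toy corpus; a context size of one token (bigram model).
tokens = "they saw it coming and they saw it leaving".split()
vocab_size = len(set(tokens))
ngrams, contexts = train_ngram(tokens, 2)

# "it" always follows "saw" in the toy corpus, so it is less surprising
# in that context than "they", which never follows "saw".
s_it = surprisal("it", ["saw"], ngrams, contexts, vocab_size)
s_they = surprisal("they", ["saw"], ngrams, contexts, vocab_size)
print(f"surprisal(it | saw)   = {s_it:.3f} bits")
print(f"surprisal(they | saw) = {s_they:.3f} bits")
```

Changing the `n` argument of `train_ngram` (and the length of the context passed to `surprisal`) corresponds loosely to varying the context size, although a neural language model conditions on context in a very different way from this count-based toy.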