Although the relationship between sound and meaning in language is mostly arbitrary, there exist pockets of so-called systematicity: clusters in which particular forms recur with particular meanings.
One example of systematicity is the existence of phonaesthemes. Phonaesthemes are recurring patterns of sound and meaning that occur below the morphemic level, which is traditionally considered the most “basic” level of meaning in language. For example, English words beginning with gl- are more likely to have a meaning associated with vision or light (e.g. glitter, glimmer, glisten, glare, etc.), and words beginning with sn- are more likely to have a meaning associated with the nose or mouth (e.g. snarl, snout, sniff). There are obviously exceptions (e.g. gland, snow, etc.), but the associations are still stronger than you’d expect by chance, and there’s even evidence suggesting that speakers internalize these associations to some extent (Bergen, 2004).
In this blog post, I’ll describe a novel approach developed by Noah Smith’s group at the University of Washington to automatically identify phonaesthemes in language.
How it works
In their paper, Liu et al. (2018) introduce a new “two-stage” approach to identifying phonaesthemes. The research problem is as follows: how does one find sounds, or groups of sounds, that aren’t morphemes but that nevertheless co-occur frequently with particular meanings?
The authors use word embeddings, learned vectors based on the contexts in which a given word occurs, as their representation of meaning. The basic idea behind such distributional semantic models is that words occurring in similar contexts are likely to have similar meanings: the “meaning” of a word is represented as a vector summarizing the contexts in which that word occurs, so words that occur in similar contexts will have similar vectors, i.e. similar “meanings”. The advantage of this approach is that words, which are typically discrete symbols, can be compared in continuous vector space: the “similarity” between two words can be approximated as the distance between their vectors.
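To make the “similarity as distance” idea concrete, here’s a minimal sketch using invented toy context-count vectors (the words and numbers are illustrative, not from the paper’s actual embeddings):

```python
import numpy as np

# Toy 4-dimensional "context count" vectors for three words.
# In a real distributional model these would be learned from a corpus;
# the words and numbers here are invented for illustration.
vectors = {
    "glitter": np.array([5.0, 1.0, 0.0, 2.0]),
    "glimmer": np.array([4.0, 2.0, 0.0, 1.0]),
    "snout":   np.array([0.0, 1.0, 6.0, 0.0]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_gl = cosine_similarity(vectors["glitter"], vectors["glimmer"])
sim_cross = cosine_similarity(vectors["glitter"], vectors["snout"])

# Words used in similar contexts end up closer together in vector space.
print(sim_gl > sim_cross)  # True
```

Cosine similarity is just one common choice of vector-space similarity; any distance metric over the embedding space supports the same kind of comparison.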
Avoiding the “morphology trap”
With word meanings in hand, a naive approach to finding form-meaning systematicity would be to correlate distances between word forms with distances between word meanings. The problem with this approach is that words with similar forms are also likely to share similar morphology––which of course we already know. For example, the words walk and walks have the same root morpheme (walk); correspondingly, they have very similar word forms, and we can assume that they have similar word vectors. But while this is an example of systematicity, it’s not a particularly novel one: it’s long been accepted that there are compositional units of meaning in language, which we call morphemes, and that these units contribute similar meanings across the words in which they occur. So how do we find examples of sub-morphemic systematicity?
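A sketch of that naive approach, with edit distance standing in for form distance and cosine distance for meaning distance (the four words and their embeddings are invented for illustration; note how the morphologically related pair walk/walks is close on both measures, inflating the correlation):

```python
import itertools
import numpy as np

# Invented toy embeddings. "walk" and "walks" share a morpheme, so both
# their forms and their vectors are similar -- exactly the confound that
# inflates a naive form-meaning correlation.
emb = {
    "walk":  np.array([1.0, 0.2, 0.1]),
    "walks": np.array([0.9, 0.3, 0.1]),
    "glow":  np.array([0.1, 1.0, 0.8]),
    "snarl": np.array([0.0, 0.1, 1.0]),
}

def levenshtein(a, b):
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cosine_distance(u, v):
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

pairs = list(itertools.combinations(emb, 2))
form = [levenshtein(a, b) for a, b in pairs]
meaning = [cosine_distance(emb[a], emb[b]) for a, b in pairs]

# Pearson correlation between form distance and meaning distance.
r = float(np.corrcoef(form, meaning)[0, 1])
print(round(r, 2))
```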
Most approaches in the past have overcome this issue by side-stepping it, using only monomorphemic words, i.e. words consisting of a single morpheme. If a dataset consists only of monomorphemic words, then no two words should share any morphemes. This approach has been used successfully by others (Dautriche et al., 2017), but it severely limits the data you can work with, and it depends on the dataset truly containing only monomorphemic words.
Liu et al. (2018) overcame this problem using a clever two-stage approach. First, they fit a linear regression model to predict word embeddings from morphological structure––i.e., to what extent does morphology predict the meaning of a word? They then used the residuals from this model, i.e. everything that morphology failed to explain, as the dependent variable in a second regression model (discussed below). The logic here is that the first model will explain all of the variance in word meanings that morphology can explain; anything leftover is “fair game” for sub-morphemic systematicity.
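The first stage can be sketched in a few lines. This is not the authors’ code––just a minimal toy version, with made-up binary morpheme indicators and random embeddings, to show the residualization step:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical setup: 6 words, binary indicators for 3 morphemes,
# and 4-dimensional embeddings. Real data would be far larger.
morph_features = np.array([
    [1, 0, 0],   # e.g. walk
    [1, 1, 0],   # e.g. walk + -s
    [0, 0, 1],   # e.g. glow
    [0, 1, 1],   # e.g. glow + -s
    [1, 0, 1],
    [0, 1, 0],
], dtype=float)
embeddings = rng.normal(size=(6, 4))

# Stage one: predict each embedding dimension from morphological structure.
stage_one = LinearRegression().fit(morph_features, embeddings)

# The residuals are whatever morphology *cannot* explain; they become
# the dependent variable for the second, sparse regression stage.
residuals = embeddings - stage_one.predict(morph_features)
print(residuals.shape)  # (6, 4)
```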
Using regularization to identify phonaesthemes
Now that the authors had a set of residual variance to explain, they used sub-morphemic structure, i.e. vectors encoding which sounds appeared at which parts of a word, as their predictors. But unlike in the first stage, they didn’t simply use OLS regression. Instead, they used sparse regularization, a statistical technique that imposes a penalty on the number of predictors with non-zero coefficients in a model. This is sort of a “culling” procedure: instead of allowing hundreds of different predictors to each explain a little variance in the dependent variable, sparse regularization methods like LASSO regression or elastic net regularization minimize the number of non-zero predictors.
In this case, regularization was useful for identifying which sub-morphemic features earned non-zero coefficients. The features with the largest coefficients are the most likely to be phonaesthemes, meaning the authors could read candidate phonaesthemes directly off the model.
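Here’s a toy sketch of that second stage using scikit-learn’s LASSO (again, not the paper’s actual features or data: I simulate 50 hypothetical onset features where only one truly predicts the residual “meaning,” and let the L1 penalty cull the rest):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# Hypothetical stage-two design: 200 words, 50 binary features marking
# which onsets ("gl-", "sn-", ...) each word begins with.
n_words, n_features = 200, 50
X = rng.integers(0, 2, size=(n_words, n_features)).astype(float)

# Simulate residuals where only feature 0 (say, "gl-") carries signal.
true_coef = np.zeros(n_features)
true_coef[0] = 2.0
y = X @ true_coef + rng.normal(scale=0.1, size=n_words)

# LASSO's L1 penalty drives most coefficients to exactly zero,
# "culling" the predictors that don't earn their keep.
model = Lasso(alpha=0.05).fit(X, y)
nonzero = np.flatnonzero(model.coef_)
print(nonzero)  # the surviving features are the phonaestheme candidates
```

With this setup, the one truly predictive feature survives with a large coefficient while nearly all spurious features are zeroed out, which is exactly the behavior that makes the surviving features directly interpretable as candidate phonaesthemes.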
As a final check, the authors ran an experiment asking whether the phonaesthemes proposed by the model agreed with human judgments. Participants rated how well each word fit its meaning on a scale from 1 to 5; words containing model-selected phonaesthemes were judged to fit their meaning an average of 0.58 points higher than other words.
Liu et al. (2018) presented a relatively simple model (with reproducible code available online!) to automatically identify phonaesthemes––patterns of sub-morphemic form-meaning systematicity in language. This model, along with others (e.g. Gutiérrez et al., 2016), gives computational linguists the tools to investigate sub-morphemic systematicity at a much larger scale, potentially even across languages.
More generally, the existence of sub-morphemic form-meaning systematicity calls into question traditional assumptions in Linguistics about which elements of language contain meaning.
Bergen, B. K. (2004). The psychological reality of phonaesthemes. Language, 80(2), 290-311.
Dautriche, I., Mahowald, K., Gibson, E., & Piantadosi, S. T. (2017). Wordform similarity increases with semantic similarity: An analysis of 100 languages. Cognitive Science, 41(8), 2149-2169.
Gutiérrez, E. D., Levy, R., & Bergen, B. (2016). Finding non-arbitrary form-meaning systematicity using string-metric learning for kernel regression. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 2379-2388).
Liu, N. F., Levow, G. A., & Smith, N. A. (2018). Discovering phonesthemes with sparse regularization. In Proceedings of the Second Workshop on Subword/Character LEvel Models (pp. 49-54).