21st May 2021

One word to rule them all: understanding word embeddings for authorship attribution

Artjoms Šeļa and Maciej Eder (Instytut Języka Polskiego PAN)

With the advent of deep learning in natural language processing, the ways in which a text can be represented have become much more complex and much less transparent. From simple estimates of word frequency distributions, technologies have shifted to context-aware embeddings and neural network generalizations. These opaque representations have made their way into authorship attribution, with clear improvements across different tasks. Yet this improvement has not brought us closer to understanding authorial style. Rather, the opposite has happened: we have obscured the reasons why features are effective in attribution tasks.

To understand the effect of complex representations on authorship attribution, we propose a simple experimental setup with a basic word embedding model that represents words by their context (that is, by their co-occurrence with other words in immediate proximity). Our approach is two-fold. First, we want to use context-aware embeddings for authorship attribution and measure their performance. For this, we use a “text2vec-lite” procedure, in which each text is represented by an identity vector based on the distribution of a single frequent word. Second, we exploit the word embedding model to progressively remove context from the equation, while still using embeddings for attribution. We go on to train models on shuffled texts, randomly insert non-existent words that serve as new identity vectors, and perform other unspeakable atrocities.
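
As a rough illustration of the idea, the sketch below represents a text by the co-occurrence counts around one frequent word and then shuffles the tokens to destroy word order. It is a toy stand-in, not the authors' “text2vec-lite” procedure: the function names, the window size, the choice of “the” as the target word, and the example sentence are all assumptions made for illustration, and raw counts are used in place of a trained embedding model.

```python
import random
from collections import Counter


def identity_vector(tokens, target, vocab, window=2):
    """Count how often each vocab word occurs within `window` tokens of `target`.

    Hypothetical helper: a raw co-occurrence count vector, used here as a
    simplified substitute for a learned embedding of the target word.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return [counts[w] for w in vocab]


def shuffle_tokens(tokens, seed=0):
    """Return a shuffled copy of the tokens: word order is destroyed,
    word frequencies are left untouched."""
    rng = random.Random(seed)
    out = list(tokens)
    rng.shuffle(out)
    return out


# Toy example text (an assumption, not data from the study).
text = "the cat sat on the mat and the dog sat on the rug".split()
vocab = ["cat", "sat", "on", "mat", "dog", "rug"]

vec = identity_vector(text, "the", vocab)
shuffled = shuffle_tokens(text)
vec_shuffled = identity_vector(shuffled, "the", vocab)

print(vec)           # context counts around "the" in the original order
print(vec_shuffled)  # context counts around "the" after shuffling
```

Shuffling changes which words appear near the target, but the frequency profile of the text is exactly preserved, so any signal that survives shuffling has to come from frequencies rather than from local context.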

Our results show that it is possible to completely remove context-dependent semantics from texts, yet the embedded vectors will still perform well in authorship attribution. This strongly suggests that contextual representation in authorship attribution remains dependent on the sheer frequency of the most frequent units of language. This raises a question for modern deep learning approaches in stylometry: yes, we can black-box a multitude of linguistic effects into a text representation, but should we really?


The meeting will take place live on Zoom at 1 pm. To participate, please fill in the survey: https://forms.gle/4K1MJ7V9JW8MDKmq7. The link to the meeting will be sent to the email address provided in the form.

The first part of the meeting (the lecture) will be recorded and later uploaded to our YouTube channel. Although we will only be recording the slides and the speaker’s audio, we kindly ask those of you who do not want to risk accidentally sharing your personal image to turn off your cameras and turn them back on for the second part of the meeting, the discussion, which will not be recorded.