Albert Leśniak and Małgorzata Czachor (IJP PAN)
How small a corpus can be? The efficiency of TF-IDF keyword extraction in relation to the corpus size
Term frequency – inverse document frequency (TF-IDF) is a widely used method for extracting keywords. The score depends on two quantities: the frequency of the word in a document (term frequency, or TF) and the number of documents that contain this word (inverse document frequency, or IDF). Whereas TF is an intrinsic feature of each individual text, IDF is based on the entire corpus (more precisely, on the fraction of texts in the corpus that contain a given word); therefore, the larger the number of documents on which IDF is based, the more reliable the outcome. The aim of the talk is to determine the minimal corpus size for TF-IDF or, more precisely, the extent to which reducing the size of the corpus affects the effectiveness of TF-IDF. The study is based on four corpora: Interia.pl (220 000 texts), weekly magazines (220 000 texts), Gutenberg Library (29 750 texts) and short extracts from Gutenberg Library (29 750 texts). The IDF was first computed using all the texts and then, iteratively, on a decreasing number of them, so that in each iteration IDF is based on a different (smaller) number of texts. Since IDF differs in each iteration, the keyness (the TF-IDF score) differs as well. Still, the results are surprisingly stable: the scores obtained from a very small corpus are not very different from those based on the entire collection of texts considered.
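The procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which IDF formula or which stability measure was used, so the smoothed IDF log((1 + N) / (1 + df)), the top-k overlap measure, the toy corpus and the function names (`idf_table`, `top_keywords`, `overlap`) are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def idf_table(docs):
    """Return (idf, n_docs) for a list of tokenized documents.

    Assumed variant: idf(w) = log((1 + N) / (1 + df(w))).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    idf = {w: math.log((1 + n_docs) / (1 + c)) for w, c in df.items()}
    return idf, n_docs


def top_keywords(doc, idf, n_docs, k=2):
    """Rank the words of one document by TF-IDF and return the top k."""
    tf = Counter(doc)
    unseen = math.log(1 + n_docs)  # IDF of a word absent from the corpus (df = 0)
    scores = {w: (c / len(doc)) * idf.get(w, unseen) for w, c in tf.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


def overlap(a, b):
    """Share of keywords in a that also appear in b (assumed stability measure)."""
    return len(set(a) & set(b)) / max(len(set(a)), 1)


# Toy reference corpus and target text (illustrative data only).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang in the tree".split(),
    "the dog barked at the bird".split(),
]
target = "the cat chased the mouse".split()

# Keywords based on the full corpus serve as the reference result.
full_keywords = top_keywords(target, *idf_table(docs))

# Recompute IDF on progressively smaller random subsets of the corpus.
rng = random.Random(0)
for size in (4, 3, 2):
    subset = rng.sample(docs, size)
    keywords = top_keywords(target, *idf_table(subset))
    print(size, keywords, overlap(full_keywords, keywords))
```

In this sketch only the IDF table changes between iterations; the TF of the target text stays fixed, which mirrors the distinction drawn above between TF as a property of a single text and IDF as a property of the corpus.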
The results show that small corpora constitute a reliable basis for this algorithm. The conclusion is interesting in itself, but it is also vital for practical applications. Processing large corpora is still a labor-intensive and time-consuming process; therefore, provided it does not cause a dramatic drop in efficiency, training the IDF on a smaller corpus is beneficial.
Link to the Zoom discussion after the meeting: https://zoom.us/j/92355384866?pwd=ckl2bmRZYWxmVEs3RFVVVDRuNlQ4dz09