Mandelbrot-Zipf-Rényi law
Marek Czachor (Politechnika Gdańska)
If you take any sufficiently long text (or corpus of texts) and count how often each individual word occurs, you will observe an interesting regularity: for natural language, the graph of word frequencies arranged in descending order always takes the same shape. This is the so-called Zipf distribution (Zipf, 1935).
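The counting procedure described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `rank_frequency`, the toy sentence, and the very simple tokenizer (lowercase letters and apostrophes) are assumptions made for the example, not part of the talk.

```python
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) triples, most frequent word first."""
    # Assumed tokenizer: lowercase the text and keep runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; rank 1 is the most frequent word.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# Toy input; a real test of Zipf's distribution needs a long text or corpus.
sample = "the cat and the dog and the bird"
table = rank_frequency(sample)
for rank, word, count in table:
    print(rank, word, count)
```

Plotting the count against the rank on doubly logarithmic axes, for a sufficiently long text, gives the kind of graph discussed here.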
The following three charts show typical corpus data (Shakespeare and Dickens; top charts) and the binding time of carbon monoxide to myoglobin, the oxygen-storing protein in our muscles (for different temperatures; bottom chart).
The similarity is obvious, which suggests the existence of a general statistical principle that goes beyond linguistics or molecular biology. In each of the above graphs, three areas can be distinguished: left (horizontal, ending in the first bend down), middle (a segment of a straight line) and right (the second bend down). The middle area is described by the classic Zipf’s law (Zipf, 1935). The left and middle areas are jointly described by the Zipf-Mandelbrot law (Mandelbrot, 1965). We are interested in the right area, or rather in a law that would cover all three areas, because the commonly used formulas do not explain why the graph “collapses” at the bottom. What is more, we do not want simply to guess a mathematical formula, but to derive it from general principles.
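For orientation, the two classical laws can be written in their standard textbook forms: Zipf's law says that the frequency is proportional to rank^(-s), and the Zipf-Mandelbrot law says that it is proportional to (rank + b)^(-s). The sketch below compares their slopes on a log-log plot. The parameter values s = 1.0 and b = 2.7 are illustrative assumptions, the curves are left unnormalized, and neither formula reproduces the second bend on the right.

```python
import math

def zipf(rank, s=1.0):
    """Unnormalized Zipf law: a straight line of slope -s on log-log axes."""
    return rank ** (-s)

def zipf_mandelbrot(rank, s=1.0, b=2.7):
    """Unnormalized Zipf-Mandelbrot law; the shift b flattens the low ranks."""
    return (rank + b) ** (-s)

def loglog_slope(f, r1, r2):
    """Slope of f between ranks r1 and r2 on doubly logarithmic axes."""
    return (math.log(f(r2)) - math.log(f(r1))) / (math.log(r2) - math.log(r1))

# Left area: Zipf-Mandelbrot is much flatter than pure Zipf at low ranks.
print(loglog_slope(zipf, 1, 2), loglog_slope(zipf_mandelbrot, 1, 2))
# Middle area: at high ranks both follow nearly the same straight line.
print(loglog_slope(zipf, 1000, 2000), loglog_slope(zipf_mandelbrot, 1000, 2000))
```

Both slopes approach -s at high ranks, so neither law bends down a second time; that is the gap the right area of the graphs exposes.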
It turns out (Czachor-Naudts, 2002) that the “first cause” may be one of the basic principles of thermodynamics, namely the approach to so-called thermodynamic equilibrium, a phenomenon known to us from everyday life as the cooling down of unfinished coffee. In the case of Zipf’s law, the “trick” consists of properly defining the mean value, which ultimately leads to the Rényi entropy (Rényi, 1960). The concept of entropy plays a key role both in thermodynamics (Clausius, 1865) and in information theory (Shannon, 1948). In both of these theories, it measures the level of uncertainty or dispersion of the system.
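As a reminder of the quantities involved, the standard definitions are the Shannon entropy, H = -sum_i p_i ln p_i, and the Rényi entropy of order alpha, H_alpha = ln(sum_i p_i^alpha) / (1 - alpha), which tends to the Shannon entropy as alpha approaches 1. The sketch below only evaluates these textbook formulas; it does not reproduce the derivation referred to above, and the example distribution is an arbitrary assumption.

```python
import math

def shannon_entropy(p):
    """Shannon entropy in nats: -sum p_i ln p_i (zero probabilities skipped)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha > 0 in nats; alpha = 1 gives Shannon."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if abs(alpha - 1.0) < 1e-12:
        # The formula below is 0/0 at alpha = 1; use the limiting value.
        return shannon_entropy(p)
    return math.log(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)

# Arbitrary example distribution (an assumption for illustration only).
p = [0.5, 0.25, 0.25]
print(shannon_entropy(p))
print(renyi_entropy(p, 1.001))  # close to the Shannon value
print(renyi_entropy(p, 2.0))    # smaller: H_alpha does not increase with alpha
```

For a uniform distribution over n outcomes, every order alpha gives the same value, ln n.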
Thus, one can speak of the eponymous Mandelbrot-Zipf-Rényi law, which unifies all three data areas (left: Mandelbrot, middle: Zipf, right: Rényi). This law, presented without going into the mathematical details, will be the subject of our meeting.
The meeting will be held in a hybrid format. To participate online, please sign up here: https://forms.gle/4K1MJ7V9JW8MDKmq7 Please note that this time the meeting is at 12:00!