The MLR score is calculated by summing, over the nine voclists, the quotient of each list’s text coverage rate and its model coverage rate in the corpus. Each quotient is multiplied by the number of lemmas in the voclist and divided by 1000. For voclist 2 and higher, the model coverage rate in the denominator carries a weighting factor. Vermeer uses these weights because most of the texts she investigated contained not two million tokens but only about 1000. She explains this choice by noting that ‘a huge corpus has relatively more hapaxes, and relatively higher coverage percentages in the lower frequency ranges’ (Vermeer, 2004: 181). Table 1 exemplifies the calculation of the MLR.
In the case presented in Table 1, there were 971 tokens in the speech data of a child, 41 of which were not in the lists (e.g. particular names of children); 832 of the remaining 930 tokens were found in the first voclist, giving a text coverage rate of 89.5% for that list. This coverage rate was divided by 85.3 (the model coverage rate), multiplied by 1000 (the number of lemmas in the list), and divided by 1000, yielding a score of 1.00 for voclist 1. The MLR score was calculated by adding up the scores for the nine voclists, resulting in 4.65. An MLR score of 4.65 indicates that this child was estimated to have a productive vocabulary size of 4650.
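The calculation walked through above can be sketched in Python. This is an illustrative reconstruction, not Vermeer’s implementation: the function names are my own, and it assumes the model coverage rates passed in already include the weighting factors for voclist 2 and higher (the full set of weights is not reproduced here).

```python
def mlr_score(text_coverage, weighted_model_coverage, lemmas_per_list=1000):
    """Sum, over the voclists, of
    (text coverage % / weighted model coverage %) * lemmas in list / 1000."""
    return sum((t / m) * lemmas_per_list / 1000
               for t, m in zip(text_coverage, weighted_model_coverage))


def estimated_vocab_size(score, lemmas_per_list=1000):
    """Scale an MLR score to an estimated productive vocabulary size:
    a score of 4.65 corresponds to roughly 4650 lemmas."""
    return score * lemmas_per_list


# Voclist 1 from Table 1: 89.5% text coverage, 85.3% model coverage.
voclist1 = mlr_score([89.5], [85.3])
vocab_estimate = estimated_vocab_size(4.65)
```

With full data, `text_coverage` and `weighted_model_coverage` would each hold nine values, one per voclist.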
To validate the MLR, Vermeer (2004) gathered spontaneous speech data from 16 native Dutch children and 16 ethnic minority children with Dutch as a second language, and analysed it with the MLR. The children’s MLR scores were compared with their scores on a receptive vocabulary task and a definition task, and with various type/token-based measures. The results show that the MLR differentiated between the two groups, which had obvious differences in vocabulary; correlated significantly with the vocabulary tasks administered to the same children; and was independent of syntactic abilities and text length.
Vermeer (2004) does not discuss how she decided the weight of each model coverage rate in the MLR formula (see Table 1). Van Hout and Vermeer (2007: 108) simply state that ‘this formula is explainable, but on the other hand far from elegant. For the time being, we are only interested in the power of the frequency approach in making calculations of lexical richness more reliable and useful.’ The weights in the formula were derived from a 2-million-word Dutch corpus and then applied to texts of about 1000 tokens. It is therefore unclear how to adapt the MLR to English written data in which each text contains only a few hundred tokens. Vermeer’s (2004) idea of estimating productive vocabulary size from language production data is unique; however, the measure is difficult to adapt to different settings.