A reassessment of frequency and vocabulary size in L2 vocabulary teaching 1

The high-frequency vocabulary of English has traditionally been thought to consist of the 2,000 most frequent word families, and low-frequency vocabulary as that beyond the 10,000 frequency level. This paper argues that these boundaries should be reassessed on pedagogic grounds. Based on a number of perspectives (including frequency and acquisition studies, the amount of vocabulary necessary for English usage, the range of graded readers, and dictionary defining vocabulary), we argue that high-frequency English vocabulary should include the most frequent 3,000 word families. We also propose that the low-frequency vocabulary boundary should be lowered to the 9,000 level, on the basis that 8–9,000 word families are sufficient to provide the lexical resources necessary to be able to read a wide range of authentic texts (Nation 2006). We label the vocabulary between high-frequency (3,000) and low-frequency (9,000+) as mid-frequency vocabulary. We illustrate the necessity of mid-frequency vocabulary for proficient language use, and make some initial suggestions for research addressing the pedagogical challenge raised by mid-frequency vocabulary.


Introduction
Frequency has long informed the principled selection of vocabulary 2 in L2 teaching pedagogy. Paul Nation, a long-term exponent of this approach, breaks vocabulary into four categories: high-frequency words, academic words, technical words and low-frequency words (e.g. 2001a: 11-12; 2011: 12-13). His basic message, most recently reiterated in a 2011 'Thinking Allowed' piece (Nation 2011), is that teachers and materials writers need, in effect, to make a cost/benefit analysis of vocabulary to decide whether or not any particular lexical item merits instruction or inclusion (see also Nation 2001b). High-frequency vocabulary is extremely useful for learners, so should be explicitly addressed. Academic vocabulary is worth focusing on for learners wishing to study in English, and the same goes for technical vocabulary for learners focusing on specific-purpose domains. Conversely, in Nation's view, low-frequency vocabulary occurs so infrequently that it is not worth spending classroom time on. Rather, teachers should teach vocabulary learning strategies to learners, so they can learn these rarer words on their own.
1 The ideas in this paper were developed jointly by the two authors. A preliminary conceptualization of the ideas was jointly presented at AAAL 2011, and a revised version was presented by the first author as a plenary talk at Alberta TESL 2011. This paper is a slightly revised version of the plenary talk, with improvements suggested by five reviewers.
2 A serious limitation of the discussion in this paper is that it is based around individual word forms/families, and does not take account of the ubiquitous nature of formulaic language. This is because most vocabulary research to date has only counted individual word forms/families. See Simpson-Vlach & Ellis (2010) and Martinez & Schmitt (in press) for two phrasal lists which aim to address this deficiency.
While we agree with the cost/benefit approach, we feel that recent research has made the four-part categorization untenable as a pedagogic description. The key evidence is a more recent study by Nation (2006), in which he uses a solely frequency-based approach instead of the four-part categorization. In it, he calculates that a reader needs knowledge of 8-9,000 word families to read a diverse range of authentic texts in English without unknown vocabulary being a substantial handicap. This vocabulary size takes us far beyond high-frequency vocabulary; in fact it takes us beyond current definitions of high-frequency, academic and technical vocabulary combined. If it takes this much vocabulary for proficient English use, there clearly needs to be a focus on vocabulary beyond that covered by the high-frequency, academic and technical categories.
If frequency-based descriptions of English are to be of value to language practitioners, the extent and boundaries of high- and low-frequency vocabulary need to be carefully defined. For English, high-frequency vocabulary has traditionally been operationalized as around the first 2,000 most frequent word families. 3 Conversely, low-frequency vocabulary has been characterized in various ways, ranging from anything beyond 2,000 word families all the way up to all of the word families beyond the 10,000 frequency level. However, it is unclear whether these traditional boundaries (which were never established in a rigorous manner) are set at the optimal levels, especially given Nation's higher vocabulary size targets (i.e. 8-9,000 word families for independent proficient use).
Frequency-based descriptions of English will also have to consider how to conceptualize the many thousands of word families which come between the high-frequency level and Nation's 8-9,000 family target (that is, do the academic and technical categories cover these thousands of families?). One problem is that in discussions of frequency, general vocabulary is usually discussed in terms of 1,000-word categories of decreasing frequency. However, academic and technical vocabulary are subsets of general English which cut across these 1,000-word bands, and the four-part categorization stemming from Nation's early work does not take account of this. Thus when analyzing texts or planning what to teach, it is important to recognize that the notions of academic/technical vocabulary do not necessarily fill the gap between high- and low-frequency bands.
This paper attempts to address these issues by reviewing the scope of both high- and low-frequency vocabulary from multiple perspectives and suggesting new boundaries for each which make better sense in terms of what learners can do with various vocabulary sizes.
It will then explore the vocabulary between the high- and low-frequency levels, and argue that the academic and technical categories do not adequately cover it. We introduce a new MID-FREQUENCY category to cover this in-between frequency band, illustrate the benefits of knowing words in this band, and argue that these words need to be addressed in a principled way in language pedagogy.

What is high-frequency vocabulary?
The most frequent 2,000 word families form the traditional cut-off point for high-frequency vocabulary, a tradition widely cited in teacher guidebooks and research publications (e.g. Nation 1990, 2001a; Read 2000; Schmitt 2000; Thornbury 2002). In this section, we will look at the origins of this figure and explore whether it is still appropriate. The figure of 2,000 comes largely from the influence of the General Service List (GSL) (West 1953). The GSL includes a little over 2,000 headwords (essentially word families) and has been an important resource for teachers and material writers for many decades. The 2,000 figure was reinforced by research on oral discourse. Schonell, Meddleton & Shaw (1956) studied the speech of Australian workers and found that approximately 2,000 word families covered around 99% of this discourse. It was thus concluded that around 2,000 word families were sufficient to engage in daily conversation. Basing his decision on this historical background, Nation set the initial frequency level for both his influential vocabulary research tool (VocabProfiler, see the 'Classic VP' on the Lextutor website www.lextutor.ca) and his widely-used vocabulary test (Nation 1990) at 2,000 families, further reinforcing this level as the established initial stage of vocabulary, and by default, high-frequency vocabulary.
As can be seen, the origins of the 2,000 figure largely come from frequency counts and research over 50 years old. Given the increase in vocabulary research over the past 20 years, it seems reasonable to revisit the frequency issue to determine whether 2,000 is still the best boundary for high-frequency vocabulary, or whether an adjusted figure would prove more useful. We will explore this issue from a number of perspectives including frequency, coverage, acquisition and use.

Frequency evidence
The first type of evidence to explore is the nature of the frequency distribution of vocabulary. It is well known that a small number of word types occur very frequently and make up the majority of running words in discourse. Conversely, a very large number of types occur very rarely, and make up the remainder of running words. This is illustrated in Table 1 and Figure 1 which look at Nation's (2006) analyses of nine written and spoken corpora (including the Brown, Kolhapur and Wellington written corpora, and the LUND spoken corpus); the general shape of these distributions would be similar for most other corpora. The written corpora include texts from sources such as novels and newspapers, while the spoken corpora include speech from sources such as everyday conversation with friends and family and people calling in to radio programs.

Table 1
Level              Written coverage (%)    Spoken coverage (%)
1st 1,000          78-81 5                 81-84 5
2nd 1,000          8-9                     5-6
3rd 1,000          3-5                     2-3
4th-5th 1,000      3                       1.5-3
6th-9th 1,000      2                       0.75-1
10th-14th 1,000    <1                      0.5
Proper nouns       2-4                     1-1.

For our discussion, the key feature of Table 1 and Figure 1 is the rapidly declining coverage obtained as vocabulary becomes less frequent. The first 1,000 word families 4 clearly do the bulk of the work in English (in large part due to the extremely high frequency and coverage of function words 5). The second 1,000 contributes a much smaller, but still useful, amount of coverage, as does the third 1,000 to a lesser extent. But by the fourth 1,000 families, the coverage drops substantially, with only a maximum of 3% for 2,000 families (4th and 5th 1,000). Beyond this, the coverage return gets increasingly small. It could be argued that high-frequency vocabulary is that which occurs before the coverage percentages become so small that it is unlikely that the words will occur frequently across a wide range of texts.
There is not a clearly identifiable cut-off point (unless we limit high-frequency vocabulary to the first 1,000), but frequency distributions across a range of corpora (see Tables 1-3) show that beyond the 2-3,000 frequency levels, frequency of occurrence drops off to low levels. This suggests that high-frequency vocabulary would include the most frequent 2-3,000 word families in English.
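The coverage figures discussed above are, in essence, the share of running words in a text that fall within each successive 1,000-level band. The following is a minimal sketch of that calculation, not a reproduction of Nation's procedure: the function name, the toy text and the band assignments are all hypothetical, and the sketch matches word forms directly rather than grouping them into word families as a real profiler would.

```python
from collections import Counter


def cumulative_coverage(tokens, band_of):
    """Cumulative % of running words covered as each 1,000-level band is added.

    band_of maps a word to its frequency band (1 = first 1,000, 2 = second, ...).
    Tokens missing from band_of count as off-list and are never covered.
    """
    band_counts = Counter(band_of.get(token.lower()) for token in tokens)
    total = len(tokens)
    covered = 0
    result = {}
    for band in sorted(b for b in band_counts if b is not None):
        covered += band_counts[band]
        result[band] = 100 * covered / total
    return result


# Hypothetical toy band assignments, for illustration only.
toy_bands = {"the": 1, "cat": 1, "sat": 1, "on": 1, "a": 1,
             "near": 2, "rapid": 4, "penguin": 6}
toy_text = "The cat sat on the mat near a rapid penguin".split()

coverage = cumulative_coverage(toy_text, toy_bands)
print(coverage)
```

Even in this toy case the first band accounts for most of the running words, and each later band adds a smaller increment, which is the pattern the frequency evidence describes.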

Frequency and incidental acquisition
Further insight is provided by a small frequency/acquisition study carried out by Cobb (2007). He was interested in whether vocabulary at various frequency levels occurred often enough to be learned merely from incidental exposure (on the generous operating assumption that six occurrences were sufficient). He looked at 30 target words (10 from each of the 1,000, 2,000 and 3,000 levels) to see how often they occurred in a 517,000-word extract of the Brown written English corpus (divided into three types of discourse: press, academic and fiction).
He found that at least eight out of the ten target words from the first 1,000 and seven from the second 1,000 frequency levels occurred six or more times. At the third 1,000 level, this dropped to between three (academic) and five (fiction) words. This suggests that the 3,000 level is the lowest frequency which we can consider 'high-frequency' in terms of learning opportunities from reading, and even then the frequency is starting to become marginal. Cobb also assembled a 300,000-word corpus of novels from the author Jack London, and found that only 469 (57%) of the 817 3,000 level word families occurred six times or more, further illustrating that at the 3,000 level, learning opportunities begin to taper off quickly. This situation would deteriorate even further for word families at the 4,000 and 5,000 levels and beyond.
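Cobb's check can be thought of as counting how many target words reach a minimum number of occurrences in a corpus. The sketch below illustrates that idea under stated assumptions: the function name and the toy corpus are hypothetical, the threshold of six comes from the operating assumption reported above, and the final line simply recomputes the 469-of-817 proportion quoted for the Jack London corpus.

```python
from collections import Counter


def targets_meeting_threshold(corpus_tokens, targets, min_occurrences=6):
    """Return the target words that occur at least min_occurrences times."""
    counts = Counter(token.lower() for token in corpus_tokens)
    return [word for word in targets if counts[word] >= min_occurrences]


# Hypothetical toy corpus: 'rapid' x6, 'taper' x2, 'dense' x7.
toy_corpus = ("rapid " * 6 + "taper " * 2 + "dense " * 7).split()
toy_targets = ["rapid", "taper", "dense"]

met = targets_meeting_threshold(toy_corpus, toy_targets)
print(met)

# Proportion reported for the Jack London corpus: 469 of 817 families.
london_share = round(100 * 469 / 817)
print(london_share)
```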

Frequency and use
We can also look at the frequency issue from the very practical standpoint of the amount of vocabulary a speaker needs to function in English. In terms of high-frequency vocabulary, this relates to the ability to use English at the basic, but still useful, end of the proficiency continuum (we will address higher levels of proficiency in our discussions of low- and mid-frequency vocabulary later). Little work has been done on the lexical requirements for the productive skills (speaking and writing), but a few studies have been carried out on reading and listening. If learners wish to read a wide range of authentic novels or newspapers without assistance, then Nation (2006) calculates that it takes knowledge of the most frequent 8-9,000 word families to cover 98% 6 of this type of text, based on his wordlists derived from the British National Corpus (BNC). Note that this does not mean a total vocabulary size of 8-9,000 word families, but good knowledge of the word families up to these specific frequency bands: a learner's total vocabulary size may include some word families beyond these bands. If we allow for lower comprehension expectations and use a less stringent coverage figure of 95%, this would still entail knowledge of word families up to the 4-5,000 frequency bands, plus proper nouns (Laufer & Ravenhorst-Kalovski 2010). Even this lower figure would appear well beyond any reasonable definition of high-frequency vocabulary, so it seems that reading a range of authentic texts is not possible with high-frequency vocabulary alone. However, reading would still be possible using graded readers (see section 2.4 below).
Listening at a conversational level (e.g. listening to narrative stories) appears to require a lexical coverage of only 95% 7 (van Zeeland 2010), and this entails a vocabulary size of 2-3,000 families. For example, Adolphs & Schmitt (2003) found that it took a little over 2,000 word families to reach 95% coverage of the five-million-word CANCODE 8 corpus, and around 3,000 individual word forms to reach 95% coverage of the 4.2-million-word conversational sub-section of the spoken component of the BNC. Nation's (2006) analysis of approximately 200,000 words of unscripted speech in the Wellington Corpus of Spoken English showed that 3,000 word families plus proper nouns achieved a coverage of 96+%. Webb & Rodgers (2009a) analyzed the language of 88 television programs and found that knowledge of the most frequent 3,000 word families (plus proper nouns and marginal words (oh, uh, mmm and ah)) provided 95.45% coverage (this ranged from 2,000 to 4,000 word families in different TV genres). They also analyzed 318 film scripts (2009b) and found that the most frequent 3,000 word families provided 95.76% coverage (the range was 3-4,000 word families depending on the movie genre). Given these results, it seems that knowledge of the most frequent 3,000 word families should provide the lexical resources to largely understand (and presumably produce) conversational English. This vocabulary size may still be too small to enable full comprehension and enjoyment, but it seems to make listening texts accessible enough to be useful for many purposes, including using texts for learning English. Overall, if aural competency is believed to be a basic language skill, then this evidence supports the argument for considering the first 3,000 word families as high-frequency vocabulary.

Graded readers
While we have seen that reading authentic texts requires more than just high-frequency vocabulary, graded readers offer a pathway to begin reading with more limited lexical resources. A number of graded reader series are offered by various publishers, generally beginning at the 200-400 word level, and topping out at around 3-3,800 words (the last stage level in the Oxford series reaches 5,000). For example:
The fact that most graded reader series finish at around the 3,000 word family level implies that a vocabulary size of 3,000 word families is an important stage for ESL learners. However, as Tom Cobb (personal communication) notes, graded reader schemes seldom rely in any disciplined way on word frequency for their levels, but instead on the much looser idea of total number of headwords. For example, Oxford Bookworms' Elephant Man is described as containing 400 headwords, but Cobb's informal Lextutor analysis shows that only about three-quarters of headwords (families) come from the first 1,000 frequency band, with the rest being widely distributed through the 2-9,000 frequency bands. Still, despite the lack of a consistent frequency procedure among graded readers, the point remains that in terms of vocabulary size, 3,000 families seems to be a key figure. As a result, it remains a reflection of the basic vocabulary of English, and by extension, informs what might usefully be considered high-frequency vocabulary.
[...] 2010; Schmitt, Jiang & Grabe 2011). Coverage of 95% is workable, but less than ideal. Of course, knowing this amount of vocabulary does not guarantee reading comprehension, as reading involves more than just vocabulary knowledge, but research indicates that if readers know enough words to cover 95-98% of a text, they are likely to obtain 60-68% comprehension of that text (Schmitt, Jiang & Grabe 2011).
7 Participants in this study achieved about 75% comprehension of the listening passages at the 95% lexical coverage rate, compared to 96% comprehension at 100% coverage. Staehr (2009) found evidence that advanced listening (using the Certificate of Proficiency in English (CPE) listening test) requires 98% coverage of the passages.
8 The Cambridge and Nottingham Corpus of Discourse in English, a five million word corpus of unscripted spoken English.

Lexicography and dictionary defining vocabulary
Dictionaries are a key lexical resource, giving access to a vast number of lexical items, but the monolingual dictionaries produced for native speakers can be difficult for learners to use, simply because the vocabulary in the definitions can often be as difficult as the word being looked up. Lexicographers producing learner dictionaries have considered this problem, and a typical solution is to create lists of DEFINING VOCABULARY, with which all of the entries in the dictionary are defined. The words selected for inclusion in these defining lists are judged to have particular utility for describing a wide variety of meanings, and are typically the highest frequency vocabulary in English. The extent of these defining vocabulary lists can give some indication of both (1) the most important vocabulary in English and (2) the extent of the vocabulary which learners need to know towards the beginning of their studies in order to effectively use English-medium learner materials. The lists range from about 2,000 to 3,000 words depending on the publisher, for example:
A Lextutor analysis of these defining vocabulary lists shows that over 90% of their contents come from the first 3,000 most frequent word families, and over 95% from the first 4,000 families. This confirms that word utility (as judged by a variety of lexicographers) is very strongly related to high word frequency. If we accept that the most useful and widely applicable vocabulary is largely captured by these defining vocabularies (which correspond strongly with frequency), this suggests that the first 2-3,000 word families provide a workable definition of high-frequency vocabulary.

Defining high-frequency vocabulary
The goal of this section was to determine the most useful parameters of high-frequency vocabulary. The traditional boundary of high frequency has been 2,000 word families, but according to most of the above perspectives, this seems too low. On balance, it seems that 3,000 word families is a more pedagogically useful criterion. While learners can obviously communicate to some extent with much smaller vocabulary sizes than this, it appears that 3,000 word families represent an important milestone in language development. More vocabulary than this would allow learners to communicate in a wider range of situations, but the rapid decay in frequency of occurrence (Table 1 and Figure 1) makes it very difficult to consider vocabulary beyond the 3,000 level as 'high-frequency'. Therefore, we propose that the first 3,000 word families of English be considered high-frequency (and thus maximally useful) vocabulary. As Cobb (2007: 41) observes, 'The first three of Nation's BNC lists (i.e. the 3000 most frequent word families) represent the current best estimate of the basic learner lexicon of English'. The evidence presented here provides a sound basis for setting the upper limit of high frequency vocabulary at the 3,000 most frequent word families.

What is low-frequency vocabulary?
We now look at the other extreme of the frequency continuum, where vocabulary becomes so infrequent that it has very limited utility. The obvious way of setting the boundary of low-frequency vocabulary is by looking at frequency distributions. However, while the nature of the frequency distribution of English words makes it feasible to suggest a reasonable cut-point for high-frequency vocabulary, this is not the case for low-frequency vocabulary. Nation (2006) used the first fourteen 1,000 level frequency bands from the BNC to determine the percentage of coverage across nine spoken and written corpora (See Table 1). Table  2 illustrates his results for just the Lancaster-Oslo-Bergen (LOB) 1-million-token corpus of written British English, but the other corpora produced similar results. From about the 6,000 level onwards, the additional coverage for each 1,000 band of vocabulary is very small indeed, at just a fraction of a percentage point. This makes it impossible to set a frequency level where the coverage falls off in a noticeable way; rather, at these lower frequency levels there is a gradual and relatively consistent tailing off. This is obvious if we examine the traditional 10,000 level. The coverage gained at this level (0.32%) is not much different than higher (6,000 = 0.70%) or lower (14,000 = 0.10%) frequency levels. Thus frequency information by itself gives little real help in setting a low-frequency boundary.
There are two other common ways of conceptualizing low-frequency vocabulary. The simplest conceptualization (as a high/low frequency dichotomy) is untenable, as the vocabulary immediately beyond the 3,000 high-frequency cut-off point (i.e. at the 4,000 and 5,000 levels) is clearly too useful to be written off as low-frequency vocabulary. The other is related to the selection of vocabulary in pedagogic materials. Here vocabulary is commonly conceptualized as in the graph below:

High-frequency vocabulary      AWL                                 Low-frequency vocabulary
(frequent in all discourse)    (frequent in academic discourse)    (rare in all discourse)

In this conceptualization 9, academic vocabulary (as exemplified by Coxhead's Academic Word List (AWL) 2000, 2011) is the next 'band' to teach after high-frequency vocabulary, and everything after that is de facto low-frequency vocabulary, as it is rarely addressed in any principled manner. A review of textbooks (e.g. Richmond 2007; Beglar & Murray 2009; Smith-Palinkas & Croghan-Ford 2009) aimed at the highest levels of intensive English programs shows that explicit treatment of vocabulary rarely goes beyond the AWL, even though students exiting these programs will progress directly into university study where even introductory textbooks require knowledge of vocabulary up to the 9,000 frequency level (Sutarsyah, Nation & Kennedy 1994).
The AWL was conceived of as academic support vocabulary which exists BEYOND the high-frequency general vocabulary of English, which Coxhead operationalized as the 2,000 word families in the General Service List (GSL) (West 1953) 10 . However, it is easy to see that the above tripartite division of vocabulary is not viable when the AWL is subjected to a Lextutor BNC-20 frequency analysis. In fact, we find that 64.3% of the AWL headwords are from the first 3,000 most frequent words in English, while the 4,000 level gives 81.5% coverage and the 5,000 level 92.1% coverage (Cobb 2010). Thus although high-frequency vocabulary and academic support vocabulary may be considered different conceptual categories of lexis, in reality, the 3,000 word families of high-frequency vocabulary largely subsume the AWL (see also Hancioglu, Neufeld & Eldridge 2008), so low-frequency vocabulary cannot reasonably be defined as the lexis beyond high-frequency+AWL vocabulary, despite what we commonly see in pedagogic materials.
Probably the most fruitful method of establishing a general boundary of low-frequency vocabulary is a usage-based approach. Hazenberg & Hulstijn (1996) analyzed one corpus of contemporary written Dutch and another of academic Dutch in order to determine how much vocabulary was needed to manage university study. They concluded that it took around a minimum of 10,000 base words (essentially word families) to obtain adequate coverage of these corpora. Dutch and English are different (but closely related) languages, but the 10,000 figure began to be cited for English as a figure which would allow advanced language use (such as study at university). It was also given credence by Nation's (1990) setting of the most advanced level on his Vocabulary Levels Test at the 10,000 level, even though the test preceded Hazenberg & Hulstijn's empirical evidence. The result was that anything beyond 10,000 word families (which enabled advanced use in the Dutch context) came to be accepted as a rather impressionistic boundary for English low-frequency vocabulary.
A more recent and relevant empirical study is Nation's (2006) corpus study. He analyzed a range of English authentic texts (novels, newspapers), and calculated that it requires knowledge of the most frequent 8-9,000 word families (+proper nouns) to reach the 98% coverage which is thought to enable efficient reading. It took less vocabulary to cover the spoken corpora at 98% (5-6,000 word families). If 8-9,000 word families is enough to enable both listening to and reading a wide range of texts without being unduly constrained by a lack of vocabulary knowledge, then low-frequency/utility vocabulary can plausibly be defined as anything beyond this frequency level, that is, vocabulary beyond the 9,000 frequency band (9,000+). Support for Nation's 8-9,000 word families figure is given by an analysis of the Corpus of Contemporary American English (COCA) (Davies 2008). The 425+ million token COCA is a very large corpus of current American English, with a substantial spoken component (for the following analysis, numerals, words with apostrophes and proper nouns were excluded, leaving 402,646,672 tokens). In terms of size, balance and currency it is now the best corpus of general English in existence. Using Nation's BNC frequency lists, we find that the most frequent 9,000 word families cover 95.5% of the COCA (Table 3). This means that the most frequent 9,000 word families cover over 95% of a huge amount (400+ million words) of very diverse written and spoken English. The average person would come across much less English than this, and importantly, many fewer different words. Thus the lexical coverage figures would be higher for the amount of language any individual person might be exposed to (Nation 2001b), so Nation's (2006) 8-9,000 figures are likely to get close to 98% coverage for individual users, especially if numerals and proper nouns are assumed to be known.
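The usage-based reasoning above amounts to asking at which frequency level cumulative coverage first reaches a chosen target (95% or 98%). The sketch below shows that lookup only; the function name is hypothetical and the cumulative figures in the example are invented for illustration, not taken from Nation (2006), the COCA analysis or any other corpus.

```python
def level_reaching(cumulative_coverage, target_pct):
    """Return the lowest frequency level whose cumulative coverage meets the target.

    cumulative_coverage maps a level in thousands of word families to the
    cumulative % of running words covered up to that level.
    Returns None if no listed level reaches the target.
    """
    for level in sorted(cumulative_coverage):
        if cumulative_coverage[level] >= target_pct:
            return level
    return None


# Invented figures, for illustration only (not real corpus results).
illustrative = {1: 80.0, 2: 88.0, 3: 92.0, 5: 95.0, 9: 98.0, 14: 99.0}

print(level_reaching(illustrative, 95))
print(level_reaching(illustrative, 98))
```

With real band-by-band coverage figures in place of the invented ones, the same lookup would show how sensitive a proposed boundary is to the coverage target chosen.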
Based on this recent corpus evidence, we therefore propose that the low-frequency boundary be moved down from the traditional 10,000+ level to the 9,000+ level. While this may not seem like a large change, the 'saving' to learners is significant if they do not have to master these additional 1,000 word families.

Mid-frequency vocabulary
The previous sections have argued that high-frequency vocabulary in English extends up to about 3,000 word families, and that low-frequency vocabulary begins at about the 9,000 frequency level. This leaves a great gap between the 3,000 and 9,000 levels which has not been systematically addressed before. We propose to label this in-between frequency band MID-FREQUENCY vocabulary. It is important that this frequency band is given a name, because it allows the field to recognize it as a discrete phenomenon, with its own unique benefits for users, and pedagogical challenges for language practitioners.

                         3,000                        9,000
High-frequency vocabulary  |  Mid-frequency vocabulary  |  Low-frequency vocabulary

The nature and benefits of mid-frequency vocabulary
Perhaps the best way of discussing mid-frequency vocabulary is by giving examples and explaining how mid-frequency vocabulary relates to language use. The list below exemplifies the type of words at each 1,000 level in the mid-frequency band:
3,001-4,000: academic, consist, exploit, rapid, vocabulary
4,001-5,000: agricultural, contemporary, dense, insight, particle
5,001-6,000: cumulative, default, penguin, rigorous, schoolchildren
6,001-7,000: axis, comprehension, peripheral, sinister, taper
7,001-8,000: authentic, conversely, latitude, mediation, undergraduate
8,001-9,000: anthropology, fruitful, hypothesis, semester, virulent
It is definitely worth learning mid-frequency words like these, because research demonstrates that accumulating increasing amounts of vocabulary in the mid-frequency range leads to very clear rewards.
One very important reward is the ability to engage with English for authentic purposes, such as watching movies. For example, Webb & Rodgers determined that knowing 3,000 word families provides a little over 95% coverage for a range of television programs (2009a) and movies (2009b). This may be enough to enable a reasonable degree of comprehension, but there still would be around 4-5% unknown vocabulary items, which translates to about 3.9 unknown words per minute. The authentic purpose for watching television and movies is typically pleasure, and this amount of unknown vocabulary may affect the learners' ease of viewing, and therefore enjoyment. However, L2 listeners who know 98% of the words would face only 1.6 unknown words per minute, which should enhance the viewing experience. Achieving 98% coverage is largely dependent on mastering words in the mid-frequency range: in movies, around 5,000 word families for horror, drama and crime, and up to 9-10,000 families for war and animation. One might expect content-dense television programs such as news broadcasts to require even more vocabulary, and this would be correct: Webb & Rodgers found that it took 4,000 word families to reach 95% coverage and 8,000 families to reach 98%. Because the usual purpose of watching the news is to be informed, it would presumably take nearer the 8,000 figure to fully exploit this information-rich form of communication.
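The link between lexical coverage and unknown words per minute is simple arithmetic: the rate of speech multiplied by the uncovered proportion of running words. The sketch below shows the relationship only. The function name is hypothetical, and the speech rate of 100 words per minute is an assumption made purely for illustration; the rates underlying the 3.9 and 1.6 figures above are not given here, so the output is not expected to reproduce them.

```python
def unknown_words_per_minute(speech_rate_wpm, coverage_pct):
    """Unknown running words heard per minute at a given lexical coverage."""
    return speech_rate_wpm * (100 - coverage_pct) / 100


# Assumed speech rate (hypothetical, not taken from Webb & Rodgers).
ASSUMED_RATE_WPM = 100

for coverage_pct in (95, 98):
    rate = unknown_words_per_minute(ASSUMED_RATE_WPM, coverage_pct)
    print(coverage_pct, rate)
```

Whatever the actual speech rate, moving from 95% to 98% coverage cuts the number of unknown words per minute by 60%, since the uncovered share falls from 5% to 2%.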
'Authentic purpose' rewards also apply to reading. One very common purpose is the reading of novels and magazines for pleasure, and this pleasurable reading should not be overly taxing. Carver (1994) explored the relationship between the relative difficulty of written texts and the amount of unknown words in those texts. The study involved two different text types (fictional and factual) and native English primary school and postgraduate university students. He concluded that easy texts generally contained around 0% unknown words, difficult texts around 2% or more unknown words, and texts that were of an appropriate difficulty level around 1% unknown words. This suggests that a 98% coverage figure is not too demanding for pleasure reading, and Nation's (2006) calculations using this figure indicate a vocabulary requirement of 8-9,000 word families + proper nouns, again entailing a large amount of mid-frequency vocabulary. Likewise, Nation found that a similar level of vocabulary is necessary to read a range of newspapers.
Another very important authentic purpose is the reading of English textbooks in English-medium education. For that matter, even university students who are studying for degrees in their L1 are increasingly finding that their subject textbooks are in English: for example, in Germany, Sweden, Taiwan and Thailand (Pecorari et al. 2011). As the student's purpose is to extract information from these texts, good comprehension is essential. Laufer & Ravenhorst-Kalovski (2010) found that university students in Israel needed enough vocabulary to cover 98% of the examination reading texts (6-8,000 word families) in order to obtain a score on a university entrance examination which indicated they could read academic material independently (with or without the aid of a dictionary). However, even the ability to read with some guidance and help required 95% coverage, entailing knowledge of 4-5,000 word families. Thus even assisted reading in an educational setting requires a considerable progression into mid-frequency vocabulary.
Two other points are of interest in the Laufer & Ravenhorst-Kalovski study. First, while the students with vocabulary sizes of 6-8,000 word families typically achieved examination scores which exempted them from taking an English reading skills class, students with vocabularies of 4-5,000 families achieved scores which required one semester of this class, and those with smaller vocabularies required two or three semesters. However, informal reports from both teachers and learners indicated that many of the students with a vocabulary size of around 3,000 families continued to have difficulties with reading even after they had completed the required three semesters of English support classes. So any time and effort put in by students in learning mid-frequency vocabulary before beginning university was paid back when they did not have to take semesters of English reading classes. Furthermore, if they lack this vocabulary, they may not be able to achieve the necessary levels for reading university academic texts, even with the help of supplementary reading classes.
The second point is that improvement in reading test scores was closely connected with progression through the mid-frequency vocabulary. An increase of vocabulary from the 4,000 to 5,000 frequency levels increased reading scores just as much as an increase from the 3,000 to 4,000 levels. In fact, the best improvement in the reading scores came from vocabulary increases from the 5-6,000 and 5-7,000 levels. Thus even though the percentage of text coverage decreases as one moves through mid-frequency vocabulary (e.g. 2.2% from 3,000 to 4,000 vs. 1.3% from 5,000 to 7,000), the later stages of mid-frequency vocabulary seem just as important, if not more so, for effective reading.
A different kind of reward relates to the fluency with which a learner can use their vocabulary. Laufer & Nation (2001) looked at the relationship between vocabulary size on the Vocabulary Levels Test (Schmitt, Schmitt & Clapham 2001), and the speed at which learners could answer items on that test. They found that increased speeds on higher-frequency 3,000 and AWL sections began only when learners reached a vocabulary size of around 5,000 word families. Furthermore, the more vocabulary known beyond this level, the faster the speed, with the size/speed relationship strongest at the 10,000 frequency level (r = .67). Thus a learner's extension of vocabulary into the mid-frequency levels corresponds with not only knowledge of that lexis, but also improved speed of access for both mid- and high-frequency words. While increased speed in answering a vocabulary test is not the same as accessing vocabulary in the four skills, the results are suggestive. A lack of fluency can have a major impact on the way English can be used, even by highly proficient learners. McMillion & Shaw (2008) contrasted Swedish and British university biology students reading English texts and concluded that the advanced Swedish learners of English could reach virtually the same comprehension levels as the British students. However, the Swedish students consistently read at rates 25% slower than the British students, which means these students may be disadvantaged in two ways. First, they need to spend 25% more time reading in order to reach comprehension levels on a par with L1 readers. Second, when this time is not available (such as under exam conditions), they will not be able to demonstrate comparable levels of comprehension.
Our discussion of mid-frequency vocabulary highlights its importance for operating in English across a range of topics and situations. But what of learners who are specializing in one area; can they make do with specialized English, such as business or medical English? Lists of technical vocabulary have been promoted as a way of focusing the vocabulary study in such specific domains (e.g. Hyland & Tse 2007). These lists vary widely in both their scope and how much coverage they provide of the specialized texts in the target domain (e.g. 113 word families with 3.7% coverage of theology lectures (Lessard-Clouston 2010); 623 word families for 12.24% coverage of medical research articles (Wang, Liang & Ge 2008) and 2,000 word families for 95% coverage of foundation-level engineering textbooks (Ward 1999)). We agree that using such lists can be a useful aid in determining which of the mid-frequency words to focus on first, but it is important to realize that high-frequency + technical words are not enough to cope with domain-specific texts; that is, mid-frequency vocabulary is still required. There are a number of reasons for this:
1. Text coverage of high-frequency + academic + technical vocabulary often does not reach 95-98% (e.g. Chung & Nation 2003; Fraser 2005), so knowledge of mid-frequency vocabulary may be necessary to reach these coverage levels.
2. While a number of technical words have very specialized meanings and are low-frequency, many have more generalized meanings and come from the high- and mid-frequency bands. Thus, learners who know high- and mid-frequency vocabulary have a head start when learning lists of technical vocabulary.
3. Technical words are often defined in texts, but one must know the surrounding words (high- and mid-frequency) in order to understand the definitions.
4. The compilers of technical lists normally take a very narrow approach to defining learners' needs, such as being able to read engineering textbooks or understanding theology lectures, which does not take into account possible wider or longer-term needs, such as speaking English in the workplace or reading the newspaper. Mid-frequency vocabulary is necessary to participate in this wider range of activities.

The lack of a principled approach to teaching mid-frequency vocabulary
We have seen the benefits of developing a relatively large vocabulary, but the three different frequency bands (high, mid and low) have been treated quite differently in teaching. High-frequency vocabulary is already addressed to some extent by teaching pedagogy, as textbooks, word lists, graded readers and learner dictionaries all focus on this vocabulary. Additionally, the high frequency of this vocabulary means that learners will be relatively well-exposed to it in any input they receive. Unfortunately, many learners still do not master high-frequency vocabulary, even after 1,000 hours or more of English instruction (Laufer 2000). We suggest that, as a minimum, English language programs emphasize teaching high-frequency vocabulary up to the 3,000 frequency level.
At the other end of the frequency continuum, low-frequency vocabulary is not typically useful enough to warrant an explicit focus, and Nation (1990) has long argued that it should be left to learners to deal with it themselves through the use of learning strategies. This seems sensible, but the topic-based focus of many materials means that low-frequency vocabulary nevertheless regularly gets explicit attention because it is seen to be necessary for the comprehension of particular reading or listening texts. It would be useful for materials writers to gloss this vocabulary and/or use text-profiling tools (e.g. Lextutor) to minimize the low-frequency vocabulary and replace it where possible with either high-frequency vocabulary (if the task purpose is fluency practice) or mid-frequency vocabulary (if the purpose includes learning new words) (Nation 2009).
This leaves mid-frequency vocabulary, which is much more problematic. It is not often addressed pedagogically, yet we have seen its considerable importance and benefits. We thus have a situation where vocabulary needed by learners is not addressed in any principled way. Some teachers might assume that vocabulary will somehow be 'picked up' from exposure to various language activities within the classroom and from natural input outside the classroom.
Unfortunately, there is some evidence that mid-frequency vocabulary is not typically used or taught in classrooms by teachers to any great extent. Perhaps unsurprisingly, Horst, Collins & Cardoso (2009) found that the vast majority of cases of direct vocabulary teaching in primary ESL classrooms (Grade 6) focused on high-frequency vocabulary, with very little focus on mid-frequency vocabulary. However, there is not necessarily a greater emphasis on mid-frequency vocabulary at later stages of language learning. Tang & Nesi (2003) studied the teacher talk of two secondary school teachers in Guangzhou and Hong Kong and found that only 6% and 12% respectively of their vocabulary went beyond 3,000 word families (in these cases, the first 2,000 + a 1,000-item list made up of words from secondary school and university texts). Horst (2010) analyzed 32 hours of classroom discourse from a high-intermediate/advanced adult ESL class. Of the 121,967 words of teacher talk, 118,330 (97%) were high-frequency vocabulary, and only 2,521 (2%) came from the mid-frequency band. Furthermore, there was generally not enough repetition of these words to facilitate acquisition. Thus across a variety of teaching contexts, the opportunities for learning mid-frequency vocabulary from teacher talk remain surprisingly low. This conclusion is supported by Folse's (2010) finding that not only are cases of explicit vocabulary instruction relatively rare, but when they do occur they are usually not done in a way that facilitates remembering or recycling; for example, the instruction may be given orally with no accompanying visual cues, or without drawing the whole class's attention to the word.
Nor does mid-frequency vocabulary seem to be systematically addressed in textbooks. Matsuoka & Hirsh (2010) analyzed the vocabulary from the best-selling New Headway Upper-intermediate English textbook and found that high-frequency vocabulary (GSL + AWL + proper nouns + 32 other word families that were assumed to be known) provided 95.5% coverage of the textbook's 44,877 running words. Of the 1,005 remaining word families, 66.4% occurred only once and only 12.1% occurred five times or more. While textbooks are typically used under teacher guidance, which may lead to more noticing and engagement with the target vocabulary than with unassisted reading, these figures are still not promising. The authors of the series state that the books contain a 'very strong lexical syllabus' (Soars & Soars 1996: v), but Matsuoka & Hirsh's results show that this upper-intermediate textbook provides few opportunities for learning words at mid-frequency or beyond. But what of the other levels? We submitted the single words from the wordlist in New Headway Intermediate to a Lextutor BNC-20 frequency analysis and found that it includes 440 word families, of which only 110 come from the mid-frequency band. The wordlist from New Headway Advanced includes 782 families, with only 427 mid-frequency families. Given the vocabulary requirements outlined in this paper, both the total number of target words and the number of mid-frequency words seem rather small. While we do not know how much recycling of mid-frequency vocabulary there is throughout these two books, the small amount of recycling in New Headway Upper-intermediate suggests that it is probably not enough to reliably promote acquisition, unless teachers take up this particular vocabulary for active instruction in the classroom. Furthermore, even if there is recycling across the levels in the series, the length of time required to get through even one level means that the time between meetings is too long.
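The kind of recycling count reported above can be illustrated with a short sketch. It is not the procedure used by Matsuoka & Hirsh (2010). It assumes a hypothetical mapping from word forms to word families (`FAMILY_OF`) and a hypothetical set of families treated as already known (`KNOWN_FAMILIES`), both invented here as toy data, and simply counts how often each remaining family recurs in a text.

```python
import re
from collections import Counter

# Hypothetical toy data (assumption): real analyses use full word-family
# lists such as the GSL and AWL rather than a handful of entries.
FAMILY_OF = {
    "walks": "walk",
    "walked": "walk",
    "glaciers": "glacier",
    "eroded": "erode",
    "erodes": "erode",
}
KNOWN_FAMILIES = {"the", "a", "walk", "and", "it", "is", "slowly"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def family_counts(text, known=KNOWN_FAMILIES, family_of=FAMILY_OF):
    """Count occurrences of each word family that is not in the known set."""
    counts = Counter()
    for token in tokenize(text):
        # Forms missing from the mapping are treated as their own family.
        family = family_of.get(token, token)
        if family not in known:
            counts[family] += 1
    return counts


def recycling_summary(counts, threshold=5):
    """Share of families met once and met at least `threshold` times."""
    total = len(counts)
    if total == 0:
        return {"families": 0, "once_pct": 0.0, "recycled_pct": 0.0}
    once = sum(1 for n in counts.values() if n == 1)
    recycled = sum(1 for n in counts.values() if n >= threshold)
    return {
        "families": total,
        "once_pct": 100.0 * once / total,
        "recycled_pct": 100.0 * recycled / total,
    }
```

Applied to a whole textbook with realistic family lists, a summary of this kind would yield figures comparable in form to the 'occurred only once' and 'five times or more' percentages, although the exact values depend heavily on how families and known vocabulary are defined.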
Hsu (2009) examined the 20 international General English (GE) textbooks used at her university in Taiwan (ranging from low intermediate to advanced) in order to determine how much vocabulary was required to achieve 95% coverage of the reading passages. The main articles in each book were analyzed for word frequency using Nation and Heatley's (2002) RANGE program with the BNC lists. Her findings show little uniformity between the level of the textbook and the vocabulary required both within and across textbook series (Table 4). This study illustrates the lack of a standardized approach to vocabulary in language textbooks, particularly in relation to reading difficulty, with materials writers seemingly unaware of vocabulary grading (through frequency) to consistently aid reading comprehension and develop vocabulary through the textbook levels. For example, the advanced Reading for real required 4-4,500 word families to reach 95%, while the low intermediate Reading for success 2 required 7-7,500 families. Hsu reports that the Taiwanese high school curriculum covers 2,000 words. Text coverage of 95% can be considered an appropriate instructional level (leaving 5% of the words available to be learned) for learners aiming to become independent readers (98%+ coverage). However, few of the books in this study offer optimum learning conditions for increasing learners' vocabulary size or improving their reading ability. Clearly there needs to be more consistency across textbook series, but this can only happen if vocabulary grading becomes a primary consideration of textbook writers. Hsu's figures clearly show the importance of mid-frequency vocabulary for reading, because for every textbook except Select readings intermediate, substantial amounts of mid-frequency vocabulary are necessary to get to 95% coverage. 
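The logic of asking how many frequency bands are needed to reach a coverage target can be sketched as follows. This is an illustration only, not the RANGE program or Hsu's (2009) procedure. The mapping `band_of` (from a token to its 1,000-family frequency band) is a hypothetical input that would in practice be built from lists such as the BNC-based ones. The sketch counts off-list tokens as uncovered and, unlike the analyses discussed here, makes no separate allowance for proper nouns.

```python
from collections import Counter


def cumulative_coverage(tokens, band_of, max_band):
    """Cumulative % of running words covered by bands 1..max_band.

    `band_of` maps a token to its frequency band (1 = first 1,000
    families, 2 = second 1,000, and so on). Tokens absent from the
    mapping are off-list and never count as covered.
    """
    band_counts = Counter(band_of.get(token) for token in tokens)
    total = len(tokens)
    coverage = []
    covered = 0
    for band in range(1, max_band + 1):
        covered += band_counts.get(band, 0)
        coverage.append(100.0 * covered / total if total else 0.0)
    return coverage


def band_needed(coverage, target=95.0):
    """First band at which cumulative coverage reaches the target, else None."""
    for band, pct in enumerate(coverage, start=1):
        if pct >= target:
            return band
    return None
```

On this approach, a text that only reaches 95% at band 7 would be described as requiring 6-7,000 word families, which is the form in which the figures for individual textbooks are reported.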
The studies reviewed in this section clearly show that mid-frequency vocabulary is necessary for a wide range of language uses, but also that neither teacher talk nor textbooks appear to address it in a principled manner. This raises a number of pedagogic issues, some of which we will consider in the next section.

Research agenda for mid-frequency vocabulary
Mid-frequency vocabulary poses a serious pedagogic challenge in how to deal with the thousands of word families in the band. We feel that explorations in the following areas would go some way towards providing insight into how to address this challenge.
• What is the total vocabulary input when both teacher input and materials input are combined? Research to date has tended to focus on one or the other.
• At what rates can we reasonably expect learners to acquire vocabulary? Milton (2009: 89) surveys a range of studies and concludes that 'learners, as a very general average, appear to gain about four words per hour from regular classroom contact.' Is this rate a cognitive learning constraint or an artifact of an insufficient focus on vocabulary?
• It takes words to learn words. Many learning strategies (e.g. using dictionaries, keeping vocabulary notebooks) rely on knowledge of high-frequency vocabulary. Is it possible for language programs to set out more ambitious early vocabulary targets and achieve them through a 'vocabulary flood' of the 3,000 high-frequency words?
• To what degree is it feasible to manipulate the occurrences of mid-frequency vocabulary in learning materials to enable sufficient recycling to occur? Is it only possible to do this with computer-based materials or can it be done in traditional textbooks?
• Is it possible to develop a series of more advanced graded readers in which mid-frequency vocabulary is supported through techniques such as glossing or elaboration in the text (e.g. Nation 2009)?
• To what extent can computerized vocabulary-learning programs contribute towards learners' ability to use vocabulary in communicative contexts?
• Should a standard vocabulary size be attached to different textbook levels (e.g. lower intermediate, advanced), so that textbooks can be more comparable across series, and to ensure lexical progression within a series?
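To give a rough sense of scale for the question about acquisition rates, the sketch below works through the arithmetic implied by Milton's (2009) average of about four words per classroom hour. It rests on two simplifying assumptions that are ours rather than Milton's: that the rate stays constant, and that 'words' can be equated with word families. The function name and parameters are illustrative.

```python
def classroom_hours_needed(start_size, target_size, gain_per_hour=4.0):
    """Classroom hours needed to grow vocabulary from start_size to target_size.

    Assumes a constant gain per hour (default: about four items per hour)
    and treats words and word families as equivalent units.
    """
    if gain_per_hour <= 0:
        raise ValueError("gain_per_hour must be positive")
    return max(target_size - start_size, 0) / gain_per_hour


# Covering the whole mid-frequency band (3,000 to 9,000 families):
print(classroom_hours_needed(3000, 9000))  # 1500.0
```

Under these assumptions, classroom contact alone would take on the order of 1,500 hours to carry a learner through the mid-frequency band, which underlines why the source of the four-words-per-hour figure matters.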

Conclusion
The main purpose of this paper has been to provide workable, empirically-based definitions of high-, mid- and low-frequency bands, and to highlight mid-frequency vocabulary so that it can be discussed as a phenomenon in its own right. We have highlighted a number of areas which require further research to determine how mid-frequency vocabulary should be addressed pedagogically. We hope that the concept of mid-frequency vocabulary will lead to more realistic vocabulary size targets in language programs and learner materials, and to classroom research into their effectiveness.