Corpora have revolutionised the way we describe and analyse language in use. The sheer scale of collections of texts, along with the appropriate software for structuring and analysing this data, has led to a fuller understanding of the characteristics of language use in context. However, the development of corpora has been unbalanced. The assembly of collections of written texts is relatively straightforward, and as a result, the field has a number of very large corpora which focus mainly on written texts, although often with some spoken elements included, e.g. COCA (520m words), GloWbE (1.9b) and enTenTen (19b). In addition, a number of corpora now include samples of language used in social media and other web contexts alongside more traditional written and (transcribed) spoken language samples, e.g. the Open American National Corpus (planned corpus size 100m words, mirroring the British National Corpus). Conversely, the development of spoken corpora has lagged behind, mainly due to the time-consuming nature of recording and transcribing spoken content. Most of the spoken corpora that exist consist of material that is easily gathered by automated collection software, such as radio talk show and television news transcripts and other entertainment programming (e.g. the spoken elements of COCA). Although this spoken discourse is described as unscripted, it is certainly constrained, e.g. talk show radio carries certain expectations about how the host will moderate the discussion. While the scripted/constrained oral content in these spoken corpora has proved informative in terms of the nature of spoken discourse (see Adolphs and Carter, 2013; Raso and Mello, 2014; Aijmer, 2002; and Carter and McCarthy, 1999, for notable studies), it is no substitute for spontaneous, unscripted oral discourse.
Furthermore, even the automated collection of scripted/constrained spoken discourse has not yet enabled the development of spoken corpora of a size comparable to the largest written corpora (e.g. the spoken component of the 100m-word British National Corpus is only 10m words, with a further 10m words added in the new spoken BNC2014). The 10m-word subcorpus of the BNC contains 4m words of spontaneous speech, and is controlled for a number of sociolinguistic and contextual variables. There are a number of smaller spoken corpora available, e.g. the Michigan Corpus of Academic Spoken English (MICASE), which, at just under 2m words, is both modest in size and quite specialised in content. This trend is reflected in other corpora of spoken discourse. Spontaneous spoken discourse forms a large part of everyday language use, and the development of larger and more representative corpora of spontaneous oral language is therefore desirable to inform linguistic description. The main constraint on this ambition has always been the time and financial cost involved in compiling such corpora. Spoken corpora provide a unique resource for the exploration of how people interact in real-life communicative contexts. Depending on how spoken corpora are annotated (as discussed below), they present opportunities for examining patterns in, for example, spoken lexis and grammar, pragmatics, dialect and language variation. Spoken corpora are now used in a variety of different fields, from translation to reference and grammar works, to studies of language change. The need for spontaneous unscripted corpora seems uncontroversial; however, compiling such corpora in the traditional way remains a formidable task. Advances have been made in other areas by utilising the power of people volunteering information about what they think and do.
This approach is often referred to as crowdsourcing, and it promises both to overcome some of the difficulties outlined above and to add useful aspects to corpus compilation which traditional methods cannot offer.
This paper thus explores a new approach to collecting samples of naturally occurring spoken language, which may allow researchers to take advantage of the burgeoning area of information crowdsourcing. Instead of relying on the typical recording and transcribing of spoken discourse, crowdsourcing may allow the collection of real-time data ‘in the wild’ by having participants report the language they hear around them. Specifically, we aim to investigate the level of precision and recall of the ‘crowd’ when it comes to reporting language they have heard in certain real contexts, alongside the use of a crowdsourcing toolkit to facilitate this task. This method of ‘reporting’ usage does, of course, come with its own issues, many of which have been highlighted in the literature on Discourse Completion Tasks (Schauer and Adolphs, 2006), and reported language can only be regarded as a proxy for usage. Investigating user memory in this context can therefore only be regarded as a first step in assessing the overall viability of the proposed approach to collecting language samples. As a focusing device for the selection of reported language samples, we draw on the use of formulaic phrases, a topic that has received considerable attention across different areas of applied linguistics.
Adolphs, S., Knight, D., Smith, C., & Price, D. (2020). Crowdsourcing Formulaic Phrases: towards a new type of spoken corpus. Corpora, 15(2), 141-168. https://doi.org/10.3366/cor.2020.0192