Corpus Linguistics for Language Teaching and Learning

    Summary: Readily available collections of written or spoken language and easy-to-use analytical tools offer exciting possibilities for using the techniques of corpus linguistics to improve language teaching and learning. This project focuses on Korean, for which there has been explosive growth in corpora and software but no prior work on pedagogical corpus linguistics. Products will include a volume of studies applying corpus linguistics to problems in Korean language teaching and learning and a technical report on the uses of language corpora in teaching foreign languages in general.

    Corpus linguistics takes advantage of large collections of language production (written or spoken language) in order to investigate a language. It bases its descriptions on the empirical characteristics of language production (rather than chiefly on theory or speaker intuition). The past decade, and especially the past several years, have seen an explosion in corpus linguistic studies. This is due to several causes. First, personal computers now have the speed and storage capacity to process huge corpora (often involving tens or hundreds of millions of words, the equivalent of hundreds or thousands of thousand-page books) in a few seconds. If one wants to find out how a word is used, for example, one can pull up hundreds of examples, in context, in a matter of seconds in a convenient display using readily available and inexpensive tools. Second, there now exist easily accessible and scientifically prepared collections of language (large and well-structured corpora) that an individual can easily use on a personal computer. Third, the World Wide Web itself now contains an enormous amount of language, again readily accessible to the individual user. The Web has also made the distribution of scientific corpora and corpus tools easy and convenient, and has provided a forum for corpus linguists to interact, thus driving the field forward. Fourth, the field of natural language processing by computer (and artificial intelligence in general) has been exploring the ways in which probabilistic models can improve processing; these probabilistic models require tools that investigate the statistical structure of language output, and this necessarily involves corpus studies.

    Foreign language pedagogy is now beginning to see new possibilities for recent advances in corpus linguistics to improve language teaching and learning. To be sure, the results of older “classic” corpus-based studies of word frequency have long been of interest to language teaching, informing decisions about materials development, grading of materials, and assessment. These early studies, from the period before the mid-1990s, were produced by specialists working with mainframe computers at major universities and research institutes. Corpus linguistics was certainly not something that an ordinary classroom teacher or learner could do. What we are seeing now is something quite different and potentially revolutionary. Readily available corpora and easy-to-use tools can now be used on the spot in a language teaching context, by teachers and learners without extensive training in computational linguistics, and studies of linguistic features can be tailored to specific pedagogic contexts and learning requirements. Thus corpus linguistics fits in with the current emphasis on authentic materials and on task-based language teaching, emphases of other Hawai‘i NFLRC projects.

    This NFLRC project, to be accomplished during the first two years of the grant cycle, will have limited aims. We will not attempt to develop either corpora or corpus-analytic tools. Both of these are expensive and time-consuming endeavors that would require substantial funding from additional sources (although such corpora and corpus tools are being developed at a rapid rate for many languages). Rather, the goal of this project is to demonstrate the potential of corpus-based studies for foreign language pedagogy. This will be accomplished through three primary foci:

    • Collection and dissemination of information on available corpora and corpus-analytic tools. This information will be presented in a form designed specifically for language teachers and learners and will be available in Web-accessible documentation. The project will prepare summaries of available materials, together with evaluations of their applicability and references to any existing documented uses. We will immediately develop a website on our server containing this information, and the Web summaries will be continually updated during the course of the project.
    • Training in the use of corpora in language teaching and learning, with emphasis on the context of foreign language instruction in the US. A summer institute is planned for the first summer of the grant cycle (2003). The summer institute, taught primarily by UH experts, but with participation of visiting faculty, will include such topics as the use of tools, concordances, search programs, and text editors; the use of collocation studies in instruction; frequency and the statistical structure of language; genre, field, and register; the use of corpus linguistics in assessment; and direct use of corpora in student-teacher collaboration and by learners alone. 
    • Preparation and dissemination of focused corpus-based practical usage studies in the targeted languages. One of the most used aspects of existing corpus linguistic work (almost exclusively ESL) has been the production of tightly focused practical usage studies of particular linguistic or discourse-pragmatic areas (such as the use of the progressive, hedges in scientific reports, word usage, lexical semantics of argument structure, relative frequency of relative clause types, etc.). We will produce a series of such small-scale studies of specific areas of usage. The choice of areas will be informed in part by work in analysis of learner corpora and in part by teacher and learner perceptions of need. These studies will be produced in draft form to be used as materials for the 2003 summer institute and will be included in a final technical report on the uses of language corpora in teaching foreign languages.

    The project will concentrate on Korean as the primary demonstration language, chosen for several reasons: the presence at UH of large numbers of advanced graduate students, many experienced teachers, and specialists interested in corpus studies of Korean; the fact that Korean is a language of great and growing national importance; and the fact that there is now the beginning of what we confidently expect will be an explosive growth in available corpora and software. Finally, there is virtually no existing work in pedagogical corpus linguistics yet for Korean. In the second year of the project, some attention will also be given to projects in Japanese, for which all of the same arguments can be made. In addition, the NFLRC will provide some assistance in the form of lecturer release time to Professor Yuphaphann Hoonchamlong, Professor of Thai, who is experienced in Thai corpus linguistics and is seeking external funding for a project to create a corpus of spoken Thai that will be useful for Thai language pedagogy.


    First year:
    • Collect information on available corpora and corpus-analytic tools for Korean and disseminate it in a Web-accessible format designed specifically for language teachers
    • Carry out a series of focused, small-scale corpus-based practical studies demonstrating the uses of language corpora for language pedagogy 
    • Prepare a draft version of a manual consisting of the compendium of available resources and small-scale research projects described above, plus a review of how language corpora have been used to inform language pedagogy to date 
    • Offer summer institute for US-based language teachers on the uses of language corpora in language teaching, with a focus on Korean as the demonstration language – Corpus Linguistics for Korean Language Learning and Teaching


    Second year:
    • Collect information on available corpora and corpus-analytic tools for Japanese and Thai and disseminate it in a Web-accessible format designed specifically for language teachers
    • Update information on available corpora and corpus-analytic tools for Korean 
    • Prepare the final version of a technical report on the uses of language corpora in teaching foreign languages