Potentials of Language Documentation: Methods, Analyses, and Utilization
LD&C Special Publication No. 3
Edited by Frank Seifart, Geoffrey Haig, Nikolaus P. Himmelmann, Dagmar Jung, Anna Margetts, and Paul Trilsbeek
1. The threefold potential of language documentation
Frank Seifart, pp. 1-6
In the past 10 or so years, intensive documentation activities, i.e. compilations of large, multimedia corpora of spoken endangered languages, have contributed to the documentation of important linguistic and cultural aspects of dozens of languages. As laid out in Himmelmann (1998), language documentations include as their central components a collection of spoken texts from a variety of genres, recorded on video and/or audio, with time-aligned annotations consisting of transcription, translation, and also, for some data, morphological segmentation and glossing. Text collections are often complemented by elicited data, e.g. word lists, and structural descriptions such as a grammar sketch. All data are provided with metadata which serve as cataloguing devices for their accessibility in online archives. These newly available language documentation data have enormous potential.
PART ONE: METHODS
2. Prospects for e-grammars and endangered languages corpora
Sebastian Drude, pp. 7-16
This contribution explores the potentials of combining corpora of language use data with language description in e-grammars (or digital grammars). We present three directions of ongoing research and discuss the advantages of combining these and similar approaches, arguing that the technological possibilities have barely begun to be explored.
3. Language-specific encoding in endangered language corpora
Jost Gippert, pp. 17-24
The paper addresses problems of corpus building and retrieval resulting from code-switching, which is a characteristic feature of endangered language recordings. The typical appearance of code-switching phenomena is first outlined on the basis of data collected in the DoBeS ‘ECLinG’ project, which dealt with three endangered Caucasian languages spoken in Georgia: Tsova-Tush (Batsbi), Udi, and Svan. The problem of language-specific retrieval is illustrated with examples showing the usage of the word da in Tsova-Tush contexts, which represents, as a homonym, either a native copula form (‘it is’) or the Georgian conjunction ‘and’. The subsequent section discusses the annotation requirements that are necessary to automatically distinguish the languages involved in code-switching, with a focus on the emerging ISO standard 639-6. It is argued that the fine-grained distinction of varieties and subvarieties and their interrelationship – as aimed at in this standard – requires a thorough reconsideration if it is to be applied in the markup of corpus data.
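The retrieval problem described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the annotation scheme of the ECLinG project: the `Token` structure, the `find_form` helper, the language labels, and the placeholder word forms are all hypothetical. It shows that a search for the form da becomes unambiguous once every token carries a language tag.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str  # transcribed word form
    lang: str  # hypothetical per-token language label


def find_form(tokens: List[Token], form: str, lang: str) -> List[int]:
    """Return the positions of `form` that are tagged with language `lang`."""
    return [i for i, t in enumerate(tokens) if t.form == form and t.lang == lang]


# Invented toy utterance with a code-switch; "w1".."w3" are placeholders,
# not real Tsova-Tush or Georgian words.
utterance = [
    Token("w1", "tsova-tush"),
    Token("da", "tsova-tush"),  # native copula reading
    Token("w2", "georgian"),
    Token("da", "georgian"),    # Georgian conjunction reading
    Token("w3", "georgian"),
]

copula_hits = find_form(utterance, "da", "tsova-tush")
conjunction_hits = find_form(utterance, "da", "georgian")
```

Without the `lang` field, both occurrences of da would be returned by the same query, which is the ambiguity the paper illustrates.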
4. Unsupervised morphological analysis of small corpora: First experiments with Kilivila
Amit Kirschenbaum, Peter Wittenburg, and Gerhard Heyer, pp. 25-31
Language documentation involves linguistic analysis of the collected material, which is typically done manually. Automatic methods for language processing usually require large corpora. The method presented in this paper uses techniques from bioinformatics and contextual information to morphologically analyze raw text corpora. This paper presents initial results of the method when applied to a small Kilivila corpus.
5. A corpus linguistics perspective on language documentation, data, and the challenge of small corpora
Anke Lüdeling, pp. 32-38
This paper deals with issues of corpus design that might prove problematic for the study of under-resourced languages, e.g. in language documentation. It argues that it is not yet well understood which linguistic and extra-linguistic (predictor) variables cause linguistic variation (i.e. the response variable), which means that the scope of a linguistic finding cannot always be assessed. In order to deal with this problem, it is argued that we need a flexible corpus architecture with the option of adding meta-data to corpora/sub-corpora at any point in time.
6. Supporting linguistic research using generic automatic audio/video analysis
Oliver Schreer and Daniel Schneider, pp. 39-45
Automatic analysis can speed up the annotation process and free up human resources, which can then be spent on theorizing instead of tedious annotation tasks. We will describe selected automatic tools that support the most time-consuming steps in annotation, such as speech and speaker segmentation, time alignment of existing transcripts, automatic scene analysis with respect to camera motion, face/person detection, and the tracking of head and hands as well as the resulting gesture analysis.
PART TWO: ANALYSES
7. Bilingual multimodality in language documentation data
Marianne Gullberg, pp. 46-53
Most people in the world speak more than one language, making bilingualism the norm rather than the exception. Furthermore, speakers generally also move their hands – they gesture – in coordination with speech and language in nontrivial ways. Bilingualism and multimodality should thus be on research agendas focused on the nature of linguistic systems and language use in context, yet they are often overlooked. Conversely, research and theorizing on bilingualism and multimodality is often based on Western-European, standardized languages, and little is known about other linguistic contexts. This paper makes the point that language documentation data has the potential to inform theoretical and empirical studies of linguistics, bilingualism and multimodality in entirely new ways, and, conversely, that documentation work would benefit from taking the bilingual and multimodal nature of its data into account.
8. Tours of the past through the present of eastern Indonesia
Marian Klamer, pp. 54-63
The past twenty years have seen a variety of data being collected from largely undocumented languages in eastern Indonesia, an area hitherto almost unknown. Such data are valuable in reconstructing the history of this area at a macro-level. In addition, as research in particular areas becomes more fine-grained, it is possible to combine linguistic data with data from oral history and ethnographic observation in order to reconstruct the migration histories of specific speaker groups. A case study of such a micro-level reconstruction is presented here.
9. Data from language documentations in research on referential hierarchies
Stefan Schnell, pp. 64-72
This paper outlines potentials of documentary linguistics for typological research in referential hierarchies. Specifically, I will demonstrate how the analysis of original text data from the Oceanic language Vera’a enhances knowledge about referential hierarchy effects in the domains of number marking and morphosyntactic properties of objects. With this language-specific research as a background, I will outline ways in which original text data from language documentation projects can be used in cross-corpus investigations of aspects of referential hierarchies across languages.
10. Information structure, variation and the Referential Hierarchy
Jane Simpson, pp. 73-82
Silverstein’s (1976) hierarchy of features and ergativity (Referential Hierarchy) was proposed to capture apparent systematic variation with respect to word-class (pronouns versus nouns) in the expression of the grammatical functions Subject and Object and the semantic roles Agent and Undergoer linked to these functions. An assumption of the original hierarchy was obligatoriness of marking, rather than optionality (i.e. choice of marker or its absence). Optionality is often associated with a semantic/pragmatic force additional to straight expression of grammatical function. This additional meaning may determine reanalysis and subsequent change in the morphosyntactic expression of Subject/Object/Agent/Undergoer. Along the way, apparent counter-examples to the Referential Hierarchy may be created. To understand the counter-examples, and test the descriptive adequacy of the Referential Hierarchy, better language documentation is needed.
11. How to measure frequency? Different ways of counting ergatives in Chintang (Tibeto-Burman, Nepal) and their implications
Sabine Stoll and Balthasar Bickel, pp. 83-89
The frequency of linguistic phenomena is standardly measured relative to some structurally defined unit (e.g. per 1,000 words or per clause). Drawing on a case study on the acquisition of ergativity by children in Chintang, an endangered Tibeto-Burman language of Nepal, we propose that from a psycholinguistic point of view, it is sometimes necessary to measure frequencies relative to the length of the time windows within which speakers and hearers use the language, rather than relative to structurally defined units. This approach requires that corpus design control for recording length and that transcripts be systematically linked to timestamps in the audiovisual signal.
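The contrast between the two ways of counting can be made concrete with a small Python sketch. It is a hedged illustration only: the `Session` structure, the function names, and all figures are invented for the example and are not taken from the Chintang corpus. It compares a rate normalized by words with a rate normalized by recording time.

```python
from dataclasses import dataclass


@dataclass
class Session:
    n_words: int          # word tokens in the transcript (invented figure)
    duration_min: float   # recording length in minutes (invented figure)
    n_ergatives: int      # ergative-marked forms counted (invented figure)


def per_thousand_words(s: Session) -> float:
    """Frequency relative to a structurally defined unit (1,000 words)."""
    return 1000 * s.n_ergatives / s.n_words


def per_hour(s: Session) -> float:
    """Frequency relative to the length of the time window (one hour)."""
    return 60 * s.n_ergatives / s.duration_min


# Two hypothetical one-hour recordings that differ in how much is said.
sparse_talk = Session(n_words=2000, duration_min=60.0, n_ergatives=40)
dense_talk = Session(n_words=6000, duration_min=60.0, n_ergatives=90)
```

With these made-up numbers, the sparse session has the higher rate per 1,000 words, while the dense session has the higher rate per hour, so the two measures can rank the same recordings differently. Computing the time-based measure presupposes that recording length is known for each transcript.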
12. On the sociolinguistic typology of linguistic complexity loss
Peter Trudgill, pp. 90-95
The nature of the human language faculty is the same the world over, and has been so ever since humans became human. This paper, however, considers the possibility that, because of the influence which social structure can have on language structure, this common faculty may produce structurally different types of language under different sociolinguistic conditions. Changing sociolinguistic conditions in the modern world are likely to have the consequence that, in time, the only languages remaining in the world will be severely atypical of how languages have been throughout most of human history.
PART THREE: UTILIZATION
13. Visualization and online presentation of linguistic data
Hans-Jörg Bibiko, pp. 96-104
This contribution gives an introduction to state-of-the-art techniques for the visualization and online presentation of linguistic data and world-wide linguistic diversity, such as linguistic maps and online dictionaries, using a software environment called R. The aim is to draw linguists’ attention to the possibilities offered by these techniques and to give some practical hints as to how they can be used specifically for linguistic and language documentation data.
14. Language archives: They’re not just for linguists any more
Gary Holton, pp. 105-110
While many language archives were originally conceived for the purpose of preserving linguistic data, these data have the potential to inform knowledge beyond the narrow field of linguistics. Today language archives are being used by people without formal linguistic training for purposes not necessarily envisioned by the original creators of the language documentation. The DoBeS Archive is particularly well-placed to become an important resource for cultural documentation, since many of the DoBeS projects have been interdisciplinary in nature, documenting language within its broader social and cultural context. In this paper I present a perspective from a legacy archive created well before the modern era of digital language documentation exemplified by the DoBeS program. In particular, I describe two types of non-linguistic uses which are becoming increasingly important at the Alaska Native Language Archive.
In its first two sections this paper briefly discusses two models of language documentation projects: the hierarchical model, in which the language documentation corpus (LDC) serves as a resource for the development of educational materials (EMs), and the integrative model, which integrates the production of EMs into the LDC and makes them a resource for linguistic research. The third and the fourth section describe how the integrative model was applied in the Teop Language Documentation Project and what kind of linguistic research topics it provides.
16. From language documentation to language planning: Not necessarily a direct route
Julia Sallabank, pp. 118-125
In this paper I will consider how documentary linguists can provide support for community language planning initiatives, and I will discuss some of the issues involved. These relate partly to the process of language documentation (what and who we choose to document, how we define ‘language’, and how we deal with language variation and change) and partly to community attitudes and dynamics.
17. Online presentation and accessibility of endangered languages data: The General Portal to the DoBeS Archive
Gabriele Schwiertz, pp. 126-128
Data depositories containing language documentation corpora are generally well structured, well maintained, and include large collections of many under-researched languages. However, they are not yet conceived of as resources that can be easily consulted on scientific or non-scientific questions pertaining to one of those languages. A general portal to the DoBeS archive has been created to facilitate access to the data, to attract more users to the archive, and to lower the threshold for users outside the linguistic community to access the data. The structure and the main features of this portal will be presented in this paper.
18. Using language documentation data in a broader context
Nick Thieberger, pp. 129-134
On the one hand we have never seen as much fieldwork and recording of small and endangered languages as we have over the past decade. On the other hand linguists are now also much more aware of the need to create records that can be reused by the people we record and that will still be available for their descendants. Our own descendants, the future researchers who will use our records, will also need to be able to find and make use of our research. The fragility of digital records means we need to pay attention to their curation over time and create suitable repositories if they do not already exist. In order for these aims to be achieved, we need to establish work practices now that allow the data to move easily from creation to the archive and to community use.