Camp, R. (1993). The place of portfolios in our changing views of writing assessment. In Bennett, R. S., & Ward, W. C. (Eds.). Construction versus choice in cognitive measurement: Issues in constructed response, performance testing and portfolio assessment (pp. 183-212). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
Assessment is shifting away from the use of one-time simulations of real-world performance. The effects of this kind of traditional assessment, which views learning as a mastery of discrete points, have been reflected in educational practice for a number of years. However, the belief that learning is composed of complex, interrelated tasks leads to the need for change in assessment procedures. Assessment should involve real-world tasks that occur within the classroom, but which can be communicated to interested parties beyond the classroom as well. Measurement of these learning contexts and processes has to account for more than traditional validity concerns, and should consider social, cognitive, and consequential variables. For writing assessment along these lines, portfolios offer a viable approach.
Carroll, B. J. (1985). Second language performance testing for university and professional contexts. In Hauptman, P. C., LeBlanc, R., & Wesche, M. B. (Eds.). Second language performance testing (pp. 73-88). Ottawa: University of Ottawa Press.
Carroll stresses the practical aspects involved in performance test development, focusing on the goal of providing test clients with guidance in decision making. Given the instability of any theory of language to date, he supports the creation of tests based on communicative settings where language will be used. Thorough needs analyses divulge the salient features of these settings, and a performance test should reveal the examinee's ability with respect to use of these features. Without a comprehensive theory, validation must be based on the interplay of correlational measures between the test and other means of assessment or estimation, and also on subjective judgments.
Cherryholmes, C. H. (1988). Construct validity and the discourses of research. American Journal of Education, 96(3), 421-457.
Cherryholmes describes the relationship between various approaches to research methodology and construct validity. He borrows from Cronbach and Meehl (1955) in identifying a construct as an attribute of people that is assumed to be tapped by tests, and from Cronbach (1971) in suggesting that a construct is a means of organizing experience into categories. Validation, it follows, is the process of demonstrating the relationship between a construct and the rest of the world. Cherryholmes describes five different approaches to construct validity.
Chittenden, E. (1991). Authentic assessment, evaluation, and documentation of student performance. In Perrone, V. (Ed.). Expanding student assessment (pp. 22-31). Alexandria, VA: Association for Supervision and Curriculum Development.
Chittenden addresses gaps between instructional practice and traditional testing formats (especially as exemplified in elementary education). He sees a need to organize the products and processes of the classroom in a way that all constituencies can understand and accept. Portfolios or work-sample approaches to organizing classroom features offer a means to "implement assessment practices that (a) capitalize on the actual work of the classroom, (b) enhance teacher and student involvement in evaluation, and (c) meet some of the accountability concerns of the district" (p. 23). Through the implementation of more authentic assessment measures and techniques, assessment and classroom practice can be increasingly integrated into a single educational approach.
Cizek, G. J. (1991). Confusion effusion: A rejoinder to Wiggins. Phi Delta Kappan, 73, 150-153.
Cizek points out perceived flaws with Wiggins' call for testing reform along performance assessment lines. He suggests that measurement-driven instruction will continue under performance assessment, and that performance assessment will be unable to cover all instructional goals. He further maintains that performance assessment will be subject to the same problems of corruptibility as those found in multiple-choice or objective assessment. Whereas the face validity of performance assessments may be greater than that of other forms of assessment, Cizek contends that other forms of validity will be more seriously threatened (and that performance assessments will not stand up under fire as demonstrably valid forms of assessment). Performance assessments should also be scrutinized for inherent problems in the reliability of assessment practices (including task and rater score effects). Cizek concludes that, while the education system certainly needs reforming, lessons from past efforts at test development and implementation should not be forgotten, and the uninformed acceptance of performance assessment as a cure-all for instructional woes should not be perpetuated.
Clark, J. L. D., & Grognet, A. G. (1985). Development and validation of a performance-based test of ESL "survival skills." In Hauptman, P. C., LeBlanc, R., & Wesche, M. B. (Eds.). Second language performance testing (pp. 89-110). Ottawa: University of Ottawa Press.
Clark and Grognet describe the development of an English language performance test for refugee survival skills. They sought to create a test based on the English language abilities that were minimally necessary to "survive" in the United States, so the language used in the test itself was kept at a low proficiency level. The test was to be based on language tasks considered essential for surviving in typical situations faced by refugees entering the U.S. The test needed to cover all four language skills in authentic situations, and it needed to function equally well with all L1 backgrounds. It also needed to be administrable by non-specialist volunteers in a short amount of time, and it was to provide diagnostic information for pedagogic purposes.
Cronbach, L. J. (1988). Five perspectives on validity argument. In Wainer, H., & Braun, H. (Eds.). Test validity (pp. 3-17). Hillsdale, NJ: Erlbaum.
Cronbach holds that any validation argument links concepts, evidence, social and personal consequences, and values. In this expanded sense, it is insufficient to rely upon the old tripartite system of validation. Critical validity argument enables defensibility of claims and illuminates areas in need of change. Validation is a never-ending process for educational tests, which are subject to social power and philosophy. Validity arguments must either submit to prevailing beliefs and values or successfully overturn them. Cronbach proposes five perspectives from which questions about tests (and validity) arise:
Previous focus on reliability in scoring procedures, test items, descriptors, etc. detracted from the amount and quality of information provided in writing assessment. Furthermore, the impromptu and limited nature of writing tests did not tell the whole story of student writing. If students are more closely involved in the assessment process (at all levels), then information about written products as well as the writing process can be generated. A more direct type of assessment, wherein students perform with knowledge, could lead to beneficial effects on the nature of writing education.
Portfolios should offer the opportunity to solicit multiple samples of writing across genres and purposes, provide evidence of the real-world processes involved in creating writing, and provide evidence of student awareness of writing processes and strategies through critical self-reflection. Portfolio assessment approaches can be defined along a continuum from conservative (and less challenging to traditional views of testing) to free-form (and much more challenging to traditional testing). More conservative forms attempt to preserve concepts like consistency and reliability in measurement, while more free-form versions stress the importance of tapping the individual nature of each writer and each classroom context.
Three examples of large-scale portfolio assessment are given (The Miami University Writing Portfolio Program, The Vermont Writing Assessment, and The Arts PROPEL Writing Portfolio). Difficulties encountered in these efforts include: the small number of complex and interdependent performances leads to difficulty in sorting out which abilities lead to which performance characteristics; difficulty of performances cannot be graded or compared; intervening conditions cannot be determined very effectively and efficiently; and students have varying degrees of access to resources. Evaluation of portfolios should take its cues from the experiences of holistic evaluation of impromptu writing samples. Communication of the results of portfolio assessment will involve the re-education of society, which has come to view achievement and learning in terms of numeric score reports. Measurement theory will have to change in order to accommodate the very valuable information resource that is offered by portfolio assessment.
Tests should involve language use in the way in which it is actually used in contexts that are specified as necessary, be it in an integrated format, in an order that is difficult to rate, using content areas that interfere with ratings, etc. Although performance tests focus on desirable products, they nonetheless offer valuable information about what testees can and cannot do with respect to given functions and situations. They should be supplemented in educational practice with the use of process assessment instruments, as they cannot be used to tap the processes involved in acquiring language abilities. The detail needed from a needs analysis depends on the given situation and the type of decision-making.
Carroll agrees that traditional statistics of probability will find performance assessment samples wanting, but he suggests that this does not indicate that we should abandon them. Rather, subjectivity and intuition, as well as a basis in sound content coverage and test construction, will have to supplant some of the reliance on traditional statistical approaches to test validation and reliability. Carroll recommends combining more "objective" and more "subjective" measurements in order to develop a performance profile of an examinee.
Although Carroll thinks that such performance testing should be able to inform sound educational practice, with respect to language teaching, he also cautions that "existing language tests give the answer to a question which few people ask and fail to give the answers which everyone seems to expect" (p. 78). Tests are not ever completely successful at demonstrating whether or not an examinee actually has the linguistic ability to do something, nor can they diagnose exactly what learning/teaching needs to take place for the examinee to achieve these goals. Nonetheless, Carroll suggests that by using parallel assessments by teachers as a kind of external validation, and by observing the operational effectiveness of tests, L2 performance assessments can offer a valuable tool to the L2 teaching profession.
Mainstream research methodology uses construct validity as a means of identifying the relationship between a measured identity and a theoretical rationale. Unfortunately, a hard and fast, or a logically deduced, identity is a non-existent entity. Identity is always probabilistic. In order to validate a construct, or an identity, a nomological net (a set of theoretical laws) must be constructed around the construct, and due to the transient nature of nomological net meanings, discourse leading to consensus must occur. Historically, the discourse surrounding construct validity has gone through the following major phases: Campbell and Fiske (1959) -- convergent and discriminant validation with multitrait-multimethod matrices; Cronbach (1971) -- what does the instrument really measure, to validate is to investigate, invention of cleaner networks that account for negative evidence; Messick (1974) -- meaning, validity, and construct validity are interdependent, which measurements are used is a matter of purpose; Cook and Campbell (1979) -- statistical power, sensitivity of measures, reliability, homogeneity, avoidance of social reproduction, etc. Cherryholmes suggests, borrowing from Quine (1953), that "[c]onstruct validity and research discourses are shaped, as are other discourses, by beliefs and commitments, explicit ideologies, tacit world-views, linguistic and cultural systems, politics and economics, and how power is arranged" (p. 428). The belief in validation by systematic quantitative research and methodology represents one way of conceiving of constructs. However, different discourses produce different truths; thus construct validity is not only empirical and logical but also inherently rhetorical. Other methods beyond the positivist-empiricist tradition must be considered.
Phenomenology consists of ethnographic and field research methodologies that look to the subjects of research for the validation of constructs; that is, the discourse of research should emanate from subjects. First-order constructs emerge from the ways in which subjects make sense of the world, and researchers attribute second-order explanatory constructs that attempt to account for the first-order constructs. "The possibility of suggesting precise logical guidelines for what counts as understanding and explanation is rejected. Construct validation moves closer to life as experienced and lived because secondary constructs of social researcher-theorists are explicitly derivative from and thereby validated by first-order constructs of everyday life" (p. 433).
Critical theory addresses construct validity from the perspective of inherent bias and misconception in the discourse of researchers as well as subjects. The introduction of normative and material constraints attempts to minimize distortion from bias, and therefore build towards consensus of those involved. However, within this framework, anyone can speak, as long as the goal is consensual agreement. All sources of information should be considered, including empirical-analytic, phenomenological, systems vindication of value orientations, and rational social choice.
Drawing from Foucault, Cherryholmes suggests: "Interpretive analytics emphasizes history and power in the production of the present and asks why researchers and those they study use the words, utterances, metaphors, analogies, and models they do when they make statements, arguments, and inferences" (p. 438). Power produces knowledge under this conceptualization. Researchers produce that which the dominant norms (power) will reward, thus reproducing society in their work. Constructs come from these social norms and are manifested in authorless discourse -- they reflect the truth of a time as political tools. Construct validity involves determining the most dangerous of constructs. Positivistic devices, such as multitrait-multimethod validation, can serve to distract from this determination of the danger in constructs, through the appeal to explanatory metanarratives. Construct validity cannot be disentangled from history, society, linguistic community, or power.
Deconstruction agrees substantially with ideas brought out, but not focused on, in mainstream views. Following Nietzsche, constructs are in fact the denial or falsification of things, because they try to delimit things to one, given interpretation. With unmediated, or infinite, perceptions of the world, such delimitation is always relative. Deconstruction removes any concrete starting or finishing points from the search for construct meaning. It also dissuades the use of categorical distinctions, which are invariably the result of ideology and power. "Deconstruction demonstrates an ever-present instability in meanings of constructs and measurements" (p. 446).
Cherryholmes concludes that construct validity reduced to the level of debate between quantitative and qualitative approaches is misleadingly simplistic. The validation of a construct, or an approach, is much broader, including investigation of power and ideology, the interests of all groups involved, and the critical and pragmatic interrogation of the discourses involved in construct creation and use.
Chittenden refers to a number of definitions to clarify his authentic assessment approach. Assessment should encompass a breadth and depth of evidence gathering procedures, including: Observations by both teachers and students (potentially the richest information source, involving ratings, narratives, checklists, logs, etc.); Performance samples of student work (documentation of student accomplishments); and Tests or test-like procedures (used to check student learning). This assessment approach involves multiple traits and multiple methods for producing lines of evidence about learning and student progress. Such assessment schemata allow for the liberation of classroom procedures from domination by standardized testing instruments.
In order to enable program and classroom level evaluation, all evidence should be revealed to interested parties (made public). Parameters of assessment devices and techniques should be available to and understood by teachers, students, and any other shareholders in the assessment process. Evaluation should result in public debate about a program, but any participants in such a debate should have access to the available evidence.
Assessment should occur within the classroom, and it should "build on classroom practices while extending them in some directions" (p. 27). Observations and other instances of evidence gathering should occur in a variety of settings and over multiple occasions.
Authentic assessment can be used to keep track of student activities (using work folders, inventories, journals, checklists, etc.), check up on student learning (to observe whether assumed, and often imposed, goals of learning are being reached), and find out about what is going on in the classroom (inquiry into the processes and products of a classroom and their meaningfulness to students and teachers). This assessment approach should always involve a summing-up process whereby evidence is presented to all interested shareholders.
The resulting instrument (the Basic English Skills Test) consisted of a core section that covered the four skill areas and a literacy section that involved more advanced reading and writing situations and tasks. The core section involved a face-to-face administration and used a picture-based test booklet and corresponding rating sheets. Situations and functions covered in the core section included: giving autobiographical information, describing pictures, time-telling, giving instructions, handling currency, comprehending signs, communicating in health care environments, and filling in biographical information. The literacy skills section involved: comprehending dates, understanding trade figures and information, using reference materials (e.g., telephone books), understanding a driver's training manual, and identifying time and quantity implications. Writing involved formulaic tasks (filling in a check) and more open-ended tasks (responding to opinion questions).
All of the test items were rated according to criteria regarding the comprehensibility and fluency of the language produced, as well as the comprehension exhibited in responding to the tasks.
Based on SCALAR item analysis, numerous poorly functioning items were discarded, and final versions of the test showed very high levels of internal consistency (with the exception of the writing section). Interrater reliabilities were quite high as well. Validity was assumed to be high based on general language proficiency ratings assigned by the refugee program staff, which correlated relatively highly with test scores. Future test development should seek to establish greater external validity through collaboration with refugee teachers in order to "develop descriptive scales of English proficiency which are skill and situation specific for use by the instructors in evaluating their students' language performance as exemplified in the training program" (p. 107). Predictive validity should also be confirmed by longitudinal studies of refugee language performances in non-academic settings.
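As an illustration of what an internal consistency estimate involves, the following minimal sketch computes Cronbach's alpha for a small matrix of item scores. The annotation does not say which coefficient Clark and Grognet reported, so the choice of alpha is an assumption, and the function name and the score data are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Estimate internal consistency (Cronbach's alpha).

    scores: list of rows, one per examinee, each row a list of item scores.
    Assumes every row has the same number of items (at least two).
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across examinees.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the examinees' total scores.
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: four examinees, three items scored 0-3.
example_scores = [
    [3, 2, 3],
    [1, 1, 0],
    [2, 2, 2],
    [0, 1, 1],
]
alpha = cronbach_alpha(example_scores)
print(f"alpha = {alpha:.2f}")
```

Values near 1 indicate that the items rank examinees in much the same way. A low value for one section, as reported for the writing section, suggests that its items behave less uniformly.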
(1) The functional perspective approaches the value of tests from their particular worth, rather than their inherent truth. "The worth of an instructional test lies in its contribution to the learning of students working up to the test, or to next year's quality of instruction" (p. 5). Consequences of tests, inferencing, and reporting must be considered in validation, especially those that impinge on the rights and life chances of individuals.
(2) The political perspective demands tests be compatible with prevailing belief systems. However, in democratic or liberal education, validation arguments should remain disputatious and should be sensitive to all individuals involved. Simply matching tests to curriculums is insufficient if curriculums only reinforce the authoritative status quo.
(3) The operationist perspective minimizes interpretation and the infelicities that it can bring about by requiring an adequate set of well-defined behavior. This enables comparison between persons or groups. However, the relationship between domain and goals should be scrutinized. This perspective also perpetuates the predominant conception of a behavior, and could be challenged by measurement of a variable from multiple perspectives.
(4) The economic perspective contains empirical validation efforts (which focus on adequate criterion selection) as well as the attribution of a value to decisions. The validity of these efforts at placement and classification is severely limited by the use of common scales or a common criterion for all examinees. Extrapolation or generalizability from one construct to a broader range or to other constructs is a matter of determining the fit or relevance to new examinees and new situations.
(5) The explanatory perspective looks into alternative interpretations or explanations for test results. This process can be thought of as validation through challenge. Contextualism is a good route to follow. By offering a generalization and locating the conceivable boundaries in which it holds true, the beginnings of construct validation can be set into motion. Closure, however, should not be expected.