W


Wesche, M. B. (1987). Second language performance testing: The Ontario Test of ESL as an example. Language Testing, 4, 28-47.

Wesche discusses the possibilities that performance testing offers to the language assessment field, and references the OTESL as an example of language performance test development. Performance-based language assessment involves the demonstration of L2 proficiency by manipulation of tasks with content and context representative of situations in which learners will likely use the language. Performance assessment has found widespread use in direct (approximating reality) testing of specifiable job skills, the primary rationale being the predictive validity involved. Performance tests may be direct, work sample, or simulations. For language testing, PTs have been administered for certification and achievement testing in areas where examinees must demonstrate the ability to function effectively in a given context. PTs require a restricted set of situations and a homogeneous set of examinee needs within those situations. The extension of PTs to general language testing follows from two sources of impetus: (1) that indirect (multiple-choice, for example) tests fail to make good predictions about real-life language use; and (2) that performance testing has a positive washback effect on teaching curricula.
The OTESL Post-Admission Test takes as its objectives the determination of ESL speaker readiness for participation in field-specific academic studies, and the provision of diagnostic information with respect to examinee skill area needs. The test consists of general reading, a discipline-related reading and listening module (science and technology versus social sciences), and a semi-directed interview. The test creators hoped to provide useful diagnostic information and introduce a washback effect.
The development of the OTESL involved seeking answers to the following issues:
(1) Identification of homogeneous need groups and corresponding procedures and details (also, what to do with outliers). This led to the science/non-science distinction.
(2) Determining the discourse types, subject matter, and authentic tasks to sample, and the degree to which they should be emphasized.
(3) Creation of scoring criteria reflective of authentic, academic scoring criteria, consequences, and judgments.
(4) The implementation of individualized administration and scoring procedures (especially for productive skills testing). This included the training of scorers for all phases of testing, and efforts at reliability maintenance (oral proficiency rating workshops).
(5) Generic determination of reliability and validity (primarily achieved through inferential statistics and correlation with the TOEFL).
Wesche concludes that where washback effect, predictive validity, motivation for examinees and participants, and specific, context-related diagnostic feedback are desirable, performance assessment along the lines of the OTESL is worth the effort.
The article includes proficiency band descriptors and report forms.

Wiggins, G. (1989). A true test: Toward more authentic and equitable assessment. Phi Delta Kappan, 70, 703-713.

Wiggins holds that "a true test of intellectual ability requires the performance of exemplary tasks" (p. 703). Tests should therefore be composed of real-world challenges and standards, and they should be aligned with instruction of individual students in school contexts. An authentic test should check a student's abilities against essential criteria. Since tests inevitably have an effect on what teachers teach and thereby set the instructional standards of a school or district, tests must provide students with intellectually worthwhile activities, and teachers must be involved in test development (since they are the most intimately aware of student needs). This kind of non-standardized testing presents numerous difficulties in terms of reliability, cost, efficiency, etc. Nonetheless, traditional forms of top-down, norm-referenced testing do not offer the depth of information needed about actual student abilities, and instruction related to such testing does not prove intellectually stimulating. Tests are needed that: have educational value, provide useful information about actual student abilities with real-world tasks, show strengths and weaknesses of students when performing such tasks, and communicate and justify themselves to interested parties.
In designing authentic tests, performances must be chosen which reflect what students need to be able to know and do. Evidence of knowledge or ability has to be defined in terms of the given performance tasks, and multiple and varied testing must take place in order to reveal patterns of student performance abilities. Tests must be based on "actual challenges, standards, and habits needed for success in the academic disciplines or in the workplace" (p. 706). Sophisticated performance criteria must be delineated in order to encourage associated abilities. Wiggins offers several examples of tests with such performance criteria (including the ACTFL proficiency guidelines).
In order to ensure quality and legitimate information, assessment must be in depth. Often, the reasons behind an examinee's answer to a test question are not revealed unless further questions are asked. This kind of exploration of examinee choices provides valuable diagnostic information. If teachers are the ones doing the assessing, then they are also better able to associate examinee responses with individual abilities. Of course, in order to ensure reliability in the application of such assessment activities, criteria must be agreed upon, shared, and made public. Norm-referenced testing, however, cannot provide similar kinds of valuable information, and it may be overly subject to variables that are not accounted for (and which cannot be accounted for in one-shot testing that does not follow up on the reasons behind examinee choices).
Authentic tests should therefore adhere to the following criteria. They should involve multiple judgmental criteria that are agreed-upon and public, and that are judged by trained judges. They should have some collaborative elements. They should be contextualized and complex. They should measure real-world task essentials, and standards must be authentic and obvious to students. They should invite and encourage student or examinee input and feedback.

Wiggins, G. (1991). A response to Cizek. Phi Delta Kappan, 72, 700-703.

Wiggins recaps the rationale behind implementation of performance assessment in the classroom. Primarily, performance assessment should be implemented in order to test certain qualities that cannot be measured by objective, standardized tests. Performance assessment has enjoyed a lengthy history of effective decision-making in the professional domains as well as in adult educational contexts, and it has been shown to have positive effects on the alignment of instruction with real-world endeavors. As standardized, objective testing cannot "exemplify, measure, and evoke high-quality performance" (p. 702), direct assessment seems to be the obvious alternative. Direct assessment of performance on desired tasks is the only means of assessment that can provide the necessary kind of feedback for quality decision-making in the classroom as well as on a larger scale. Assessments themselves should be judged based on their performance in having "a measurable effect on teaching and learning" (p. 703).

Wolf, D. (1989). Portfolio assessment: Sampling student work. Educational Leadership, 46, 35-39.

Wolf suggests that evaluation in the schools should be more directed toward the real-world kind of self-observation and informed critique that professionals (citing examples of artists and musicians) engage in. She offers alternatives in response to the problems that she sees inherent in typical once-over, one-time testing (assessment from without, focus only on test performance but not on full range of knowledge/intuition, first-draft nature, achievement over development). We need to be evaluating learning in such a way that we encourage real-world critique and reflection, but simultaneously gather necessary information about student growth (for their purposes as well as for system ends). Portfolios like those used by professionals (artists, writers, inventors, etc.) offer an effective means of encouraging such evaluative processes.
The PROPEL project employs portfolios to sample a range of student works and to encourage reflection and a critical approach to individual and group work. The resulting portfolio-based assessment has the following benefits: increased integrity and validity of information about learners; increase in student responsibility for the learning process; an enlarged view of what learning is and of what should be learned; the opportunity to examine the process of learning; a means of tracking development; a long-term account of what and how students learn. Teachers not only get a better picture of student growth and development, but they also receive richer feedback on their own professional skills and development. This encourages their own critical self-reflection.

Wolf, D., Bixby, J., Glenn, J. III, & Gardner, H. (1991). To use their minds well: Investigating new forms of student assessment. Review of Research in Education, 17, 31-74.

Wolf et al. suggest that multiple-choice, standardized testing has come to dominate educational practice, but that it is incapable of providing the diagnostic kind of information that is needed for monitoring and promoting student learning. They therefore argue for performance assessments that are multidimensional in that they not only measure a student's ability to manipulate relevant information and skills, but also do so in the context of real undertakings. Such assessment should measure students' progress longitudinally, and should enable learners to demonstrate their depth of understanding of a particular context and content. It should offer the opportunity to critique and improve work that is being used as an individual measure. In order to further argue for this direction in educational assessment, Wolf et al. compare two epistemological views on learning, and they discuss the resulting testing or assessment cultures.
The epistemology of intelligence has dominated educational practice and measurement for the past century. This approach to human ability sees intelligence as a unitary, fixed, and immutable trait. Its measurement has traditionally involved the use of standardized instruments that produce a normal distribution of examinees, but which reveal little about learner achievement with respect to any real-world criteria. This kind of intelligence testing has been shown to be based on statistical errors and to favor examinees with access to social norms and knowledge found in Western, middle-class culture. This epistemology also values certain kinds of knowledge (theory-building, acquisition of concepts, symbolic manipulation) over problem-solving knowledge of a practical or situated variety. It further stresses the individual processing of knowledge over group processes (e.g., debate or collaboration).
The epistemology of intelligence has led to a culture of testing that is dominated by the idea that educational progress and learning can be reduced to a matter of scientific measurement. This testing culture is defined by the following characteristics: the use of relative ranking (norm-referencing) as the primary means of separating examinees; testing formats that have no analog in real-world, non-test performances; tests that do not evaluate anything beyond the individual and isolated performance; tests that function only as instruments of measurement, without disturbing any other processes involved in education and learning, and which do not need to do anything beyond show the results of the measure (as in the high-tech means of scoring contemporary standardized tests).
A second epistemology is concerned with thoughtfulness that is not immutable, not subject to rank ordering, and not scalable. This epistemology acknowledges everyday forms of thought and knowledge and views them as equally valid sources of understanding. Learning cannot be rarefied; rather, it must occur in fixed contexts (learners have been shown to respond differentially to motivating, concrete, high-stakes situations). Learning occurs in qualitative and uneven ways, and does not follow an even, linear progression. Learning constitutes an "individual's understanding of how to apply what she or he knows" (p. 50) rather than simply how much someone knows. The act of thinking, along these lines, is also a social act. Understanding is enhanced as an act of community involvement and discussion. Such a view indicates that individuals have a range of knowledge and competence that is dependent on social and community factors to some extent.
This epistemology fits most closely with a culture of assessment (as opposed to a culture of testing). An assessment culture focuses on "criterion-referenced evaluations of student learning in which what students can and cannot do is clearly stated" (p. 52). It is concerned with longitudinal assessments of accomplishment and depth of understanding. As such, performance assessments of students accomplishing genuine tasks take the place of multiple-choice tests of knowledge. Performance assessments allow for a range of possible answers to a task problem, and employ real-world functions like defending one's work, exhibiting a performance, and manipulating resources. Assessment along these lines should provide an instance, or repeated instances, of learning. Portfolio-based assessment gives the opportunity to assess longitudinally, to re-evaluate work, and to reflect on the meaning of collections of student work. Students "deserve sustained opportunities to internalize standards and ways of questioning and improving the quality of their work" (p. 59).
Wolf et al. conclude with precautions for the development and implementation of these alternatives to standardized testing, and they stress the need for a critical tradition relative to the emerging genre of performance assessment. Efficiency in creation and sampling of performances, equity among a variety of learners being assessed, and evidence that assessments are tapping the genuine performance skills that have been identified are all essential.


X


No entries for "X"

