Wesche, M. B. (1987). Second language performance testing: The Ontario Test of ESL as an example. Language Testing, 4, 28-47.
Wesche discusses the possibilities that performance testing offers to the language assessment field, and references the OTESL as an example of language performance test development. Performance-based language assessment involves the demonstration of L2 proficiency by manipulation of tasks with content and context representative of situations in which learners will likely use the language. Performance assessment has found widespread use in direct (approximating reality) testing of specifiable job skills, the primary rationale being the predictive validity involved. Performance tests may take the form of direct assessments, work samples, or simulations. For language testing, PTs have been administered for certification and achievement testing in areas where examinees must demonstrate the ability to function effectively in a given context. PTs require a restricted set of situations and a homogeneous set of examinee needs within those situations. The extension of PTs to general language testing follows from two sources of impetus: (1) that indirect (multiple-choice, for example) tests fail to make good predictions about real-life language use; and (2) that performance testing has a positive washback effect on teaching curricula.
Wiggins, G. (1989). A true test: Toward more authentic and equitable assessment. Phi Delta Kappan, 70, 703-713.
Wiggins holds that "a true test of intellectual ability requires the performance of exemplary tasks" (p. 703). Tests should therefore be composed of real-world challenges and standards, and they should be aligned with instruction of individual students in school contexts. An authentic test should check a student's abilities against essential criteria. Since tests inevitably have an effect on what teachers teach and thereby set the instructional standards of a school or district, tests must provide students with intellectually worthwhile activities, and teachers must be involved in test development (since they are the most intimately aware of student needs). This kind of non-standardized testing presents numerous difficulties in terms of reliability, cost, efficiency, etc. Nonetheless, traditional forms of top-down, norm-referenced testing do not offer the depth of information needed about actual student abilities, and instruction related to such testing does not prove intellectually stimulating. Tests are needed that: have educational value, provide useful information about actual student abilities with real-world tasks, show strengths and weaknesses of students when performing such tasks, and communicate and justify themselves to interested parties.
Wiggins, G. (1991). A response to Cizek. Phi Delta Kappan, 72, 700-703.
Wiggins recaps the rationale behind implementation of performance assessment in the classroom. Primarily, performance assessment should be implemented in order to test certain qualities that cannot be measured by objective, standardized tests. Performance assessment has enjoyed a lengthy history of effective decision-making in the professional domains as well as in adult educational contexts, and it has been shown to have positive effects on the alignment of instruction with real-world endeavors. As standardized, objective testing cannot "exemplify, measure, and evoke high-quality performance" (p. 702), direct assessment seems to be the obvious alternative. Direct assessment of performance on desired tasks is the only means of assessment that can provide the necessary kind of feedback for quality decision-making in the classroom as well as on a larger scale. Assessments themselves should be judged based on their performance in having "a measurable effect on teaching and learning" (p. 703).
Wolf, D. (1989). Portfolio assessment: Sampling student work. Educational Leadership, 46, 35-39.
Wolf suggests that evaluation in the schools should be more directed toward the real-world kind of self-observation and informed critique that professionals (citing examples of artists and musicians) engage in. She offers alternatives in response to the problems that she sees inherent in typical once-over, one-time testing (assessment from without, focus only on test performance but not on full range of knowledge/intuition, first-draft nature, achievement over development). We need to be evaluating learning in such a way that we encourage real-world critique and reflection, but simultaneously gather necessary information about student growth (for their purposes as well as for system ends). Portfolios like those used by professionals (artists, writers, inventors, etc.) offer an effective means of encouraging such evaluative processes.
The OTESL Post-Admission Test has two objectives: determining whether ESL speakers are ready to participate in field-specific academic studies, and providing diagnostic information about examinees' needs in each skill area. The test consists of general reading, a discipline-related reading and listening module (science and technology versus social sciences), and a semi-directed interview. The test creators hoped to provide useful diagnostic information and introduce a washback effect.
The development of the OTESL involved addressing the following issues:
(1) Identification of homogeneous need groups and corresponding procedures and details (also, what to do with outliers). This led to the science/non-science distinction.
(2) Determining the discourse types, subject matter, and authentic tasks to sample, and the degree to which they should be emphasized.
(3) Creation of scoring criteria reflective of authentic, academic scoring criteria, consequences, and judgments.
(4) The implementation of individualized administration and scoring procedures (especially for productive skills testing). This included the training of scorers for all phases of testing, and efforts at reliability maintenance (oral proficiency rating workshops).
(5) Generic determination of reliability and validity (primarily achieved through inferential statistics and correlation with the TOEFL).
Wesche concludes that where washback effect, predictive validity, motivation for examinees and participants, and specific, context-related diagnostic feedback are desirable, performance assessment along the lines of the OTESL is worth the effort.
The article includes proficiency band descriptors and report forms.
In designing authentic tests, performances must be chosen which reflect what students need to know and be able to do. Evidence of knowledge or ability has to be defined in terms of the given performance tasks, and multiple and varied testing must take place in order to reveal patterns of student performance abilities. Tests must be based on "actual challenges, standards, and habits needed for success in the academic disciplines or in the workplace" (p. 706). Sophisticated performance criteria must be delineated in order to encourage associated abilities. Wiggins offers several examples of tests with such performance criteria (including the ACTFL proficiency guidelines).
In order to ensure quality and legitimate information, assessment must be in depth. Often, the reasons behind an examinee's answer to a test question are not revealed unless further questions are asked. This kind of exploration of examinee choices provides valuable diagnostic information. If teachers are the ones doing the assessing, then they are also better able to associate examinee responses with individual abilities. Of course, in order to ensure reliability in the application of such assessment activities, criteria must be agreed upon, shared, and made public. Norm-referenced testing, however, cannot provide similar kinds of valuable information, and it may be overly subject to variables that are not accounted for (and which cannot be accounted for in one-shot testing that does not follow up on the reasons behind examinee choices).
Authentic tests should therefore adhere to the following criteria. They should involve multiple judgmental criteria that are agreed-upon and public, and that are judged by trained judges. They should have some collaborative elements. They should be contextualized and complex. They should measure real-world task essentials, and standards must be authentic and obvious to students. They should invite and encourage student or examinee input and feedback.
The PROPEL project employs portfolios to sample a range of student work and to encourage reflection and a critical approach to individual and group work. The resulting portfolio-based assessment has the following benefits: increased integrity and validity of information about learners; increased student responsibility for the learning process; an enlarged view of what learning is and of what should be learned; the opportunity to examine the process of learning; a means of tracking development; and a long-term account of what and how students learn. Teachers not only get a better picture of student growth and development, but they also receive richer feedback on their own professional skills and development. This encourages their own critical self-reflection.