McClelland, C. (1991). Portfolios: Solution to a problem. In Belanoff, P., & Dickson, M. (Eds.). Portfolios: Process and product (pp. 165-173). Portsmouth, NH: Boynton/Cook Publishers.
McClelland discusses the shift in classroom focus that results from implementation of a portfolio approach to evaluation versus a system of continuous graded response to student writing. Whereas grading forces a teacher to authoritatively impose the what and how of progress on students, portfolios that are not graded until the end of a term allow students and teacher to work together on the evolution and development of student writing. Work can be tracked through letters, conferencing, group response days, and first draft days. Students begin to feel ownership over their own writing, and the teacher becomes something like an expert reader and resource. Criteria for grading work completed throughout the semester can actually evolve from the work itself, and students are able to self-assess according to criteria that they had a hand in developing. Through self-evaluation they can reflect on their body of work as a whole, and they can indicate how they think individual pieces fit into their development as writers. Discrepancies between teacher assessment and student self-assessment of the portfolio product can generally be mediated through examination of other sources of information (like attendance, drafts, homework, conferences, participation).
McNamara, T. F. (1990). Item response theory and the validation of an ESP test for health professionals. Language Testing, 7(1), 52-75.
McNamara employs IRT analysis to investigate the validity of the Occupational English Test for health professionals in Australia. The OET is taken by immigrating health professionals, and is used to assess the workplace communicative ability of candidates. It consists of listening, reading, speaking, and writing sections. The speaking and writing sections consist of profession-specific performance tasks. In order to determine the validity of the rating categories, the rating process, and the relationship between the categories and the constructs purported to be measured, McNamara applies Rasch IRT analysis to the speaking and writing tests.
McNamara, T. F. (1995). Modeling performance: Opening Pandora's box. Applied Linguistics, 16(2), 159-179.
McNamara compares several language competence models and reflects on their applicability to language performance testing. He suggests that the fact that language performance tests have existed for several decades longer than language competence models has led to a stress on operationalization of tests without well-developed theoretical substance. McNamara argues, along the lines of Hymes (1972), that language performance capacities should be well-characterized, as opposed to simply accounting for everything that a person can do with a language in terms of a model of linguistic knowledge.
McNamara, M. J., & Deane, D. (1995). Self-assessment activities: Toward autonomy in language learning. TESOL Journal, 5(1), 17-21.
McNamara and Deane offer three self-assessment activities which, combined, encourage autonomy on the part of university ESL students in approaching language learning. The procedures also enable teachers to individualize classroom learning and acquire a more complete picture of students' views on their own learning. The three activities are:
Mehrens, W. A. (1992). Using performance assessment for accountability purposes. Educational Measurement: Issues and Practice, 11(1), 3-9, 20.
Mehrens critiques the possibilities for extended use of performance assessment in terms of program accountability. He suggests that performance assessment simply represents tests that simulate a criterion situation to a greater degree than other tests, and argues that there is no absolute distinction between PTs and other tests (as all tests consist to some extent of representative performance). Another characteristic of PTs is the "heavy reliance on observation and professional judgment in the evaluation of the response" (p. 3).
Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5-11.
According to Messick, validity is "an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment" (p. 5). Validity bears on qualitative as well as quantitative evaluations, and includes all means (experimental, philosophical, statistical, etc.) by which theory-based hypotheses are interpreted and evaluated. Validation of test inferences (validation cannot be conducted on testing instruments or procedures in isolation) must be based on the theory and data involved. The focus of validity is on: meaning, relevance, and utility of scores; the value implications of using scores for actions; the functional worth of scores; and the consequences of their use.
Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.
In this article, Messick considers the particular demands made by performance assessments on the test validation procedure. He suggests that the enthusiasm surrounding the perceived positive educational consequences of performance testing could result in less than adequate consideration of the evidential basis for validity claims (including the claims about positive consequences). Performance assessments are claimed to be authentic and direct by nature, but these terms have little grounding in terms of demonstrated validity, and they imply that other measurement procedures are not authentic as well as indirect. The question must be asked (and validation seeks to answer) to what criterion and for what purpose a performance assessment is authentic. Two major threats to validity exist for performance assessments, as well as for any evaluation procedure: construct underrepresentation and construct-irrelevant variance.
Miller, M. D., & Legg, S. M. (1993). Alternative assessment in a high-stakes environment. Educational Measurement: Issues and Practice, 12, 9-15.
Miller and Legg consider the use of alternative assessment, as opposed to traditional standardized assessment, under high-stakes testing conditions. Performance-based testing should compensate for the negative impact on teaching practice and the limited content covered by teaching to standardized tests. Performance testing has been effectively used in diagnostic situations and as a part of multi-faceted approaches to information gathering, but the extrapolation to high-stakes environments should be evaluated (i.e., tests resulting in impact decisions on students, comparisons of schools and districts, impact decisions on teachers, etc.). Evaluation requires two approaches: evidence of psychometric properties being assessed, and consequences of the use of alternative assessments in high-stakes situations.
Mishler, E. G. (1990). Validation in inquiry-guided research: The role of exemplars in narrative studies. Harvard Educational Review, 60(4), 415-442.
Mishler addresses the irrelevancy for inquiry-guided research of standard approaches to validation. He suggests that validation needs to be viewed as the social construction of knowledge, with the central concept in validation being that of interpretation. Abstract and logical laws of scientific inquiry and process have little to do with the inexact day-to-day functioning of investigation. Thus, validity cannot be determined by the application of a formalized assessment paradigm; it is dependent on the judgment and interpretation of those involved in the process. Validation must be considered not as a rule-governed process but rather as an inductive account of all available information.
Mohan, B., & Low, M. (1995). Collaborative teacher assessment of ESL writers: Conceptual and practical issues. TESOL Journal, 5(1), 28-31.
Mohan and Low examine the process of collaborative teacher assessment in rating ESL student compositions, referring to the experiences of one group of teacher raters at the university level. The task of developing an integrative evaluation format for academic contexts can benefit from the use of collaborative assessment, or the process of different teachers meeting, discussing, and jointly rating essays in a consistent and agreed-upon fashion. Due to the integrated nature of many university ESL courses, language and content are separately evaluated only with great difficulty. Through the process of collectively defining the criteria for effective use of the language to express content, and then applying these criteria as a group, a shared meaning that serves as a kind of standard is established. Repeated practice and discussion are essential to the success of such collaborative assessment. By repeatedly deliberating and trying to agree on a rating, and by resolving the natural tensions that develop due to different rater backgrounds, a viable shared meaning can be reached. Such collaborative assessment should be further researched, with the inclusion of second language researchers and discourse analysts in the collaborative effort.
Moss, P. A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62(3), 229-258.
Moss initiates a review of validity issues particularly relevant to performance assessment (PA) with a comment on the "potential power of carefully designed performance assessments to document and encourage critical, creative, and self-reflective thought" (pp. 229-230). Generally, PA permits students latitude in approach to the task at hand, results in fewer responses that are based on an integration of skills and knowledge, and requires the use of expert judges for evaluation.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher, 23(2), 5-12.
Moss examines the traditional psychometric property of reliability in assessment from a hermeneutic perspective. Traditionally, reliability is assumed to be a necessary, but not a sufficient, condition for valid testing practice (reliability being essentially the extent to which an assessment is free of measurement error effect, i.e., the consistency of evaluations of examinee performance). Recent alternative forms of assessment (e.g., performance), which encourage individuality in response, integrate skills and knowledge, and rely on expert judgment for evaluation, achieve acceptable levels of traditional reliability only with great difficulty. Nonetheless, alternative forms of assessment are needed for their positive curricular effects and to avoid the drawbacks of standardized assessment (too much teaching to the test). Moss applies hermeneutics to the concept of reliability in order to determine its necessity in valid forms of alternative assessment.
Moss, P. A., Beck, J. S., Ebbs, C., Matson, B., Muchmore, J., Steele, D., Taylor, C., & Herter, R. (1992). Portfolios, accountability and an interpretive approach to validity. Educational Measurement: Issues and Practice, 11(3), 12-21.
Moss et al. offer a rationale for the use of portfolios to communicate with audiences outside the classroom about student learning. Because assessment often directs what teachers teach and students learn, assessment instruments tend to be imbued with the power of determining a curriculum. Although large-scale performance assessment is complementing or replacing standardized multiple-choice assessment, dependence on this one form of alternative assessment could continue to limit learning opportunities. One problem stems from the fact that large-scale assessment approaches impose views of an outside authority on the classroom learning process; students and teachers become disenfranchised from their own teaching and learning. A portfolio assessment model can offer not only the chance for teachers and students to collaborate, reflect, and make choices based on their own goals and interests, but it can also enable the sharing of development and achievement with "stakeholders outside the classroom" (p. 13). In addition, the information provided is richer, and it reflects the process of thinking and revising (in writing, for example). Assessment should be placed in the hands of individual teachers who interpret and reflect on student work which documents their growth over time. Teachers have to be responsible for such efforts, as they are most privy to intimate knowledge of the learning situations involved.
Problems can arise from students who have grown accustomed to grade-driven instruction and who demand grades as a means of gauging their own progress. Although the teacher should explain exactly why there is not a focus on grades, and a focus instead on improvement in writing, one possibility for grade reporting during the semester could involve a mid-semester review (but with the stipulation that students themselves must also complete an extensive mid-semester self-evaluation). Procrastination can also be aggravated by a portfolio approach that does not assign grades until the end of the semester.
The benefits are well worth any potential drawbacks, however. A portfolio approach requires focus on the creation of texts, not on the assignment of grades. Students have time to exert ownership over their work; they can receive extensive feedback from multiple sources; quality of work improves. Portfolio evaluation allows students the time and opportunity to learn something, not just to get a grade on something.
In the 15-minute role-play speaking test, raters assigned ratings on a six-point scale for the following categories: overall communicative effectiveness, intelligibility, fluency, comprehension, appropriateness, and resources of grammar. Rasch analysis provided information on person ability, item difficulty, and the relative difficulty of the different categories. Ratings and person abilities were found to co-vary to a highly reliable degree, and category consistency (over two administrations) for the speaking test was found to be quite high.
The forty-minute writing task was similarly rated for the following categories: overall task fulfillment, appropriateness of language, comprehension of stimulus, control of grammar and cohesion, and control of spelling and punctuation. Again, person-by-item fit and category consistency were determined to be quite high.
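For readers unfamiliar with the Rasch model referred to above, its basic dichotomous form is sketched below. The annotation does not say which Rasch variant was applied to the six-point ratings, so this is only an illustration: the symbols \theta_n (ability of person n) and \delta_i (difficulty of item or category i) are assumed notation, and rating-scale data of this kind would call for a polytomous extension of the same idea.

```latex
P(X_{ni} = 1 \mid \theta_n, \delta_i)
  = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
```

The probability of success depends only on the difference between person ability and item difficulty, which is what allows persons and items to be placed on a common scale.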
Examination of fit statistics, with trends confirmed by step-wise regression, showed that the speaking test had what looked like one dependent variable (the holistic category) and one predictor variable (resources of grammar), the latter accounting for around 70% of the variance in ratings. This result matches claims in the literature that grammatical or structural accuracy has an overbearing effect on ratings of oral production, and it is especially notable considering that raters for the OET were trained to focus on communicative effectiveness at the expense of accuracy.
Similarly, the writing test was found to have a holistic or dependent variable (overall task fulfillment) and a predictor variable (grammar).
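The statement that one category accounts for "around 70% of variance" in the holistic rating corresponds to the R-squared of a regression. The following is a minimal sketch of that quantity for a single predictor. It is not McNamara's procedure (which used step-wise regression over several categories), the function name `r_squared` is hypothetical, and the ratings are invented purely for illustration.

```python
from statistics import mean


def r_squared(predictor, outcome):
    """Proportion of variance in `outcome` explained by a
    one-predictor least-squares line (squared Pearson correlation)."""
    mx, my = mean(predictor), mean(outcome)
    sxy = sum((x - mx) * (y - my) for x, y in zip(predictor, outcome))
    sxx = sum((x - mx) ** 2 for x in predictor)
    syy = sum((y - my) ** 2 for y in outcome)
    return (sxy * sxy) / (sxx * syy)


# Invented six-point ratings for six hypothetical candidates.
grammar = [2, 3, 3, 4, 5, 6]   # "resources of grammar" category
overall = [2, 2, 4, 4, 5, 5]   # holistic category

print(round(r_squared(grammar, overall), 2))
```

A value near 1 would mean the holistic rating is almost fully predictable from the grammar rating alone; a value near 0 would mean the two ratings vary independently.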
The OET findings show what raters are in fact basing their ratings on, as opposed to what rating criteria suggest they are judging. The high reliability of the tests suggests that raters are agreeing on something that is not necessarily made explicit in the rating criteria. A posteriori analysis (as with Rasch analysis) can reveal such disproportions and lead to test design improvements, rater retraining, or even to the rethinking of the validity of language categories and criteria from a theoretical perspective.
Problems for defining and measuring language performance include: (1) the difficulty of generalizing from one observed instance of language behavior to other instances (how to draw inferences); (2) the need to spell out the actual criteria for assessment (importance attributed to different performance aspects); (3) the need to account for the role of non-linguistic factors (performance in an L2 will likely reflect some aspects of L1 speaking ability), together with the recognition that native speaker performance should not be assumed to be representative of the top measure of ability; (4) the need for theoretical background to inform research and implementation.
Comparison of existing models (Hymes, 1972; Chomsky, 1965; Canale & Swain, 1980; Canale, 1983; Bachman, 1990) leads to the postulation of strategic competence as some kind of language-independent variable in language performance which can distinguish between native speakers as well as non-native speakers. Generally, the models build towards accommodation of ability for use with knowledge of a language, but also include crucial general cognitive abilities.
Future modeling of performance in a language should include linguistic and non-linguistic, as well as cognitive and non-cognitive factors. Interaction of numerous applicable variables during the assessment situation must also be accounted for (i.e., what is actually involved in a language performance test?).
(1) Writing letters, both at the beginning and at the end of the semester, in which the students assess their strengths and weaknesses with English. This transfers the language learning responsibility into the hands of the student, but it also can serve as a means of diagnosing student levels of proficiency with the language, perceptions of proficiency, and preconceived ideas about the language learning process.
(2) Daily language logs give students the opportunity to record all class-external experiences with English, including when and where something happened, what was involved (linguistically), and student speculation about and analysis of the cause of various occurrences with the language. The log documents to what extent students are transferring knowledge from the classroom to real world experiences.
(3) An English portfolio enables students to reflect on a semester of English learning and to present their interpretations of particular learning events that were important to them. Students include evidence of key events through direct (samples of writing, videos, or audio tapes) or indirect (descriptions of language events) representations of their English use. These portfolios not only show the range of abilities with English, but also demonstrate progress in the language.
McNamara and Deane conclude: "Using these complementary assessment tools -- traditional measures and student self-assessment information -- we have a more complete picture of our students' ability, effort, and progress. More importantly, students have a greater voice in their language learning process" (p. 21).
Mehrens considers the following factors that are purported to support performance assessment:
(1) Traditional criticisms of multiple-choice tests -- bias in testing (he suggests that charges of bias against well-constructed multiple-choice tests are misguided); irrelevant content, lack of inclusiveness of test content and correlation with curricular goals (he suggests that standardized multiple-choice achievement tests are effective and representative); multiple-choice tests only measure ability to recognize and cannot measure higher order thinking skills (again, Mehrens defends multiple-choice testing with evidence that it can test higher order thinking skills).
(2) Influence of cognitive psychologists -- development of theories of performance, and research into the extent to which procedural knowledge is measurable by multiple-choice versus performance tests.
(3) Delimited domain coverage in multiple-choice tests -- but can performance tests do any better? (Mehrens suggests that they might be able to measure in more depth, but most likely with even more narrow domain coverage).
(4) Lake Wobegon Effects, or teaching too closely to the test -- but Mehrens points out that due to domain specificity and sampling problems, performance assessment will not overcome this problem either.
(5) Deleterious instruction due to multiple-choice formats -- excessive focus on mastering only those abilities needed to score highly on achievement tests (but the assumption that PTs will compensate for this could be misguided).
Mehrens further offers the following definite accountability problems when using performance assessments:
(1) Having fewer questions leads to a lack of exam content security and to the need to develop new questions each year (with accompanying higher costs).
(2) Implementation of every-pupil testing as opposed to matrix sampling might be prohibitively expensive.
(3) Development, administration, and scoring of large scale performance assessments are all more expensive than traditional multiple-choice testing.
(4) Public acceptance will surely go down with increased cost.
(5) Legal defensibility of PTs could be a problem, especially if documentation is limited.
(6) Professional credibility will be determined through interactions between teachers, teacher educators, and psychometricians, who all have disparate concepts of what is a credible test.
(7) Validity of the PT is threatened, considering limited domain sampling and the lack of generalizability of a given performance; it could also be problematic to render precise inferences from lower level examinees with poor scores on a PT (i.e., what was responsible for the low score?).
(8) Reliability of judgments could be threatened in PTs due to the limited number of observations, subjectivity of the scoring process, and limited internal consistency and generalizability.
Mehrens advocates a cautious approach to acceptance of PTs for accountability purposes, especially giving consideration to: (a) who does the scoring (those without a vested interest), (b) the extent to which the domain or construct is well-defined, (c) ability scaling of performance data, (d) the unit of reporting, (e) equating of performance assessment data, (f) problems of bias, and (g) ethnic differences.
Validity evidence can be sought or "configured" along several lines (including: relevance and representativeness of content and domain, response relationships among tasks and items, relationship of scores to examinee variables, the processes underlying tasks and items, uniformity and difference across time and populations, variation as a function of instruction or experimental manipulation, intended and unintended outcomes and side effects). Messick suggests that the traditional approaches of content and criterion validity are insufficient. Content validity only addresses the representativeness of tasks to domain expert judgments, and does not supply evidence for valuing inferences. Criterion validity only refers to the correlation between a set of criteria and the scores on a test that is purported to tap those criteria. As measurements, however, criterion correlations are subject to the same sources of irrelevant variance as any measurement. They are always subject to the validity of the standards.
Messick suggests that the only method for approaching validity is to consider any and all evidence that bears on score interpretation and meaning. Threats to construct validity can be constituted under the rubrics of construct underrepresentation and construct-irrelevant variance. Construct validation offers a means for appraising all types of evidence and the relationships and distortions that proceed from them. This process stems from the consideration of trait characteristics and connotations as well as value connotations associated with score inferencing.
Test interpretation as well as test use can be considered from the perspective of a unified concept of validity, which incorporates content, criterion, and consequences into the framework of construct validation. Messick incorporates these aspects of a unified validity framework into a progressive validation matrix which also includes meaning and value implications of scores and tests. Thus, the trustworthiness, meaningfulness, usefulness, and appropriateness of test scores are accounted for in terms of a unified theory of construct validity.
Performance and product can be both target and vehicle for assessment. Performances should be assessed if procedures have been explicitly taught and deviation from accepted process can be detected, whereas products should be the focus when no commonly accepted (standardized) task procedures exist (there are diverse means of arriving at an end) or have not been explicitly taught. Performance or product can be considered as the target of assessment when they are not held to represent some other ability or competency, when the one time activity is that which is important to the assessment (e.g., in an art show). When inferences are made from the performance or product and applied to another domain, then they become vehicles of assessment. Replicability and generalizability determine the extent to which a performance or product is tied to a construct of interest.
Performance assessment encounters problems of content coverage and construct generalizability: do scores represent processes of approaching tasks; are knowledge and skill required to address the assessment task; how broad is the domain that is being tapped; does the test feasibly cover all processes, skills, and knowledge associated with a task, or is the performance taken as representative of an extended range of tasks? Breadth and depth of coverage by individual tests can be increased through use of multiple and varied tests, although this must be counterbalanced with available time.
Response form and stimulus form can also be altered to affect domain coverage and generalizability. More open-ended versus more structured stimuli (item prompts, task descriptions, etc.) and response forms (checklists, multiple-choice questions, essays, etc.) can be combined to varying degrees to reach acceptable levels of content coverage and generalizability. Transparency and meaningfulness of test exercises should be kept in mind in test construction (i.e., tests should be meaningful educational experiences that motivate and direct learning).
Focus in performance assessment construction can be either on tasks or on constructs. Construct-based construction begins with constructs of interest to societies or institutions and narrows down to task selection based on performance attributes of the construct, scoring necessities, etc. Task-centered construction begins with the determination of desired performances, and scoring criteria and performance rubrics are part and parcel of the performance itself. Issues of competence related to task-based construction can only be derived from the use of multiple and varied tasks. Generalizability to constructs can be severely limited. Construct-centered rubrics and criteria can be so generic as to be unhelpful. A middle ground of scoring rubrics associated with a class of tasks that reflect a particular construct is a possible means of compensating for these deficiencies. Messick draws from Fitzpatrick and Morrison (1971) to suggest: "What is important to simulate are the critical aspects of the criterion situation that elicit those performances from which the focal constructs of knowledge and skill are inferred, at a sufficient level of fidelity to detect relevant differences and changes in the focal performance variables" (p. 17).
Because skills do not necessarily assume the same role in all contexts, either consistencies across tasks and within task solutions must be defined, competence with a particular task must be measured in the context of its practice, or many tasks in varied contexts must be sampled in order to achieve any degree of generalizability to broader domains. Contextualization of tasks can motivate some examinees, but it may also alienate others. Consequences of task contextuality should be investigated. Aggregation of responses to different tasks across different contexts, but measuring the same construct, could enable task and context matching to student background. Of course, contextualization is also a function of test purpose as well as student development.
Complex, integrated skills should be assessed in addition to component skills. Individual processes as well as complex performances are implied in authentic situations, hence both should be measured, instead of one forsaking the other.
The label "direct assessment" is inappropriate for performance assessments, as scores are always mediated by some kind of outside judgment; any kind of measurement is indirect. If "direct assessment" is reinterpreted to indicate open-ended tasks and judgmental scoring, then the primary validity concern becomes one of eliminating construct-irrelevant variance. "In sum, then, a claim that a particular performance assessment is authentic and direct is tantamount to a claim of construct validity and needs to be supported by empirical evidence of construct validity" (p. 21). However, consequences and cost and efficiency must also be considered in indicating comprehensive construct validity.
Messick closes with an admonition that both positive and negative aspects of any kind of test should be addressed in the validation process; otherwise validity is a meaningless social process.
Psychometric properties: Validity of measuring higher-order thinking skills (hypothesis formulation, search for alternatives, ability to judge, self-monitoring) faces several challenges. The effect of method and content on thinking processes is high; therefore, generalizability of alternative assessments is limited, or context bound. Task specificity and scoring criteria must be well-defined, but definition delimits representation. Multiple exercises might compensate for this delimitation. Everyone should understand the scoring criteria, but the criteria must be such that they cannot be compromised by rote learning. That is, the relationship between instruction and assessment must be established. Reliability hinges on rater consistency (standard scoring procedures) and task-specific variance (reduced by multiple tasks). Equating and scaling of various testing formats and tasks must be considered. Fairness of alternative assessments must be established by comparing task relevancies for different populations and through judgmental reviews for bias and offensiveness. True costs should be established by comparing various forms of assessment and the cost of implementation of each, including time, effort, reusability of items, etc.
Consequences: In high-stakes environments, teaching to the test is unavoidable. Teachers are pressured by media, administrations, etc. to raise scores; they teach test-based content at the expense of non-test content, and this weakens the generalizability from domain to domain. Alternative assessment would offer a holistic approach to developing broader pictures of students, would allow for individualization of tests, and should encourage method and content variance. Test preparation and testwiseness would therefore focus on learning the underlying skills and concepts, as alternative assessment techniques should not allow for rote learning of method and content. Test security will be essential, especially due to the more unwieldy nature of many performance assessment techniques, as well as the difficulty in reproduction of limited items. Safety, ethics, and legality of alternative techniques must be established prior to implementation (consider specifically discrepancies between populations). Notice of test requirements must be publicized.
Mishler proposes a move away from the predominance of traditional technical criteria (reliability, falsifiability, objectivity) towards an assessment of the process of claiming that an investigation is trustworthy. "[F]ocusing on trustworthiness rather than truth displaces validation from its traditional location in a presumably objective, nonreactive, and neutral reality, and moves it to the social world -- a world constructed in and through our discourse and actions, through praxis" (p. 420). He suggests that in lieu of the imposition of rules and criteria that are embedded in traditional assumptions and practices, evaluation for claims of trustworthiness should focus on the analysis of relevant "exemplars." Research knowledge, trustworthiness, etc. are gained through practice, not through the memorization of a set of abstract rules of scientific procedures. The analysis of exemplars of how scientific communities build consensus with respect to problem-solving can reveal criteria and procedures for validation of other processes.
In order to choose exemplars, Mishler advises: "Legitimacy cannot be legislated in advance. Neither abstract rules nor appeal to an idealized version of the scientific method will suffice. Rather, the defining features of exemplars are inferred from the actual practices of working scientists" (p. 423). He provides one example of the use of normal, or traditional, scientific methods for validation of the study of narrative, and then contrasts this approach with three alternative approaches, employing exemplars. The three inductive approaches are characterized by the provision of enough primary source information about the object of study (narratives, in this case) and the rationale for interpretation that outside observers could judge for themselves whether or not claims are warranted, adequate, visible, and therefore trustworthy. Direct linkages between data, findings, and interpretation must be given under these circumstances. Displaying the full text is essential.
Mishler equates this approach, using exemplars for validation, with the argument in recent validation theory that interpretation and theory are of primary import for establishing construct validity. He suggests that one vital task for validation research within various communities of investigation is the establishment of sets of relevant exemplars.
Her summary of work by Anastasi (1986, 1990), Cronbach (1980, 1988, 1989), and Messick (1975, 1980, 1989a, 1989b) leads to the following consensus on validity inquiry that should inform validation of PAs. Construct validity emerged from the positivistic tradition of scientific observation that assumed the objectivity of observation. In order for a construct to be considered valid, it had to be located in a frame of reference that explicitly laid out the relationships between theory and observation. The best way to determine validity was by searching for convergent evidence (showing a construct measurement to be related to other, similar variables through observations) and discriminant evidence (showing a construct measurement not to be related to other, distinct variables). The search for rival explanatory hypotheses for a construct constituted early construct validation. Construct validity should explain what a test score means. Whether or not observations of a measurement or a construct are bound by a particular theory or approach (that of the logical positivistic tradition, for instance), however, is subject to debate. The role of community and observations of the consequences of assessment suggest the relativity of a measurement (a crucial aspect of more recent approaches to construct validation). Thus, contrasting multiple perspectives and value systems with respect to a measurement system, as well as looking into all stakeholders and investigating their judgments, is essential. Test use and score interpretation should also be kept within a set of contextual boundaries that are determined by content specificity (with respect to different cultures, for example) and extent of generalizability across time and place.
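The distinction between convergent and discriminant evidence can be illustrated with a small correlation check. The following is only a minimal sketch and is not drawn from the works summarized here: the three score lists, their names, and the choice of Pearson correlation as the index of relatedness are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for six examinees (invented data, illustration only).
writing_task = [2, 3, 3, 4, 5, 5]         # the measure being validated
essay_rating = [1, 3, 2, 4, 4, 5]         # assumed to tap a similar construct
typing_speed = [40, 35, 50, 38, 45, 42]   # assumed to tap a distinct construct

# Convergent evidence: a strong relationship with the similar measure.
convergent = pearson(writing_task, essay_rating)
# Discriminant evidence: a weak relationship with the distinct measure.
discriminant = pearson(writing_task, typing_speed)

print(f"convergent r = {convergent:.2f}")
print(f"discriminant r = {discriminant:.2f}")
```

With these invented numbers the first correlation is high and the second is low, which is the pattern the traditional view treats as support for a construct interpretation. The more recent approaches described above would treat such figures as only one source of evidence among several.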
Extrapolation from general validation arguments to performance assessments has produced some useful components for investigation [cf. Frederiksen and Collins (1989); Haertel (1990, 1991); and Linn, Baker, and Dunbar (1991)]. All are grounded in arguments about the overall beneficial consequences of performance assessment, that washback from PAs will be superior to that from traditional multiple-choice assessment. Also, PA is "likely to produce more valid interpretation of certain complex educational domains" (p. 249). However, the PA genre will suffer in terms of traditional comparisons (efficiency, comparability, and internal consistency). Commonalities in validation of PAs come in the form of: judgmental evidence of content and scope, scoring reliability, response complexity, and consequence (bias, cost, efficiency, and fairness). These authors diverge with respect to generalizability. Focus could be on quality of performance and scoring fairness, or on interpretation with respect to representativeness of tasks for a particular purpose and particular individuals.
Generally, these validity approaches are grounded in "a social science epistemology that emphasizes theory-based hypothesis testing, attention to behavioral consistencies, and control of factors irrelevant to the intended inference" (p. 250). This leads to centralization of assessment authority, standardization, and use of so-called experts. However, "differential access to power and authority over educational interpretations and decisions" (p. 251) could be promoted by the use of many of these standard technical validity criteria. Problems arise, for example, when students take different approaches to achieving excellence -- i.e., not necessarily a linear progression from novice to expert. Perhaps greater validity could be achieved by use of examinees' own world views for definition and explication of performance. Objective observation is, in one sense, entirely impossible due to the imposition of world view. Use of a variety of epistemological perspectives would enhance understanding of effects and performance. Student or examinee profiling, observation across domains and time, varied units of analysis, and varied sources of evidence lead to enhanced interpretive validity.
Hermeneutic theory interprets meaning from the perspective of the particular text (or construct) and includes the knowledge and preconceptions that readers of the text bring with them. The dialectical relationship between text and reader determines meaning. In opposition to traditional psychometric approaches to assessment validation (norm-oriented and criterion-referenced), hermeneutics offers validity along different lines: "Here, the interpretation might be warranted by criteria like a reader's extensive knowledge of the learning context; multiple and varied sources of evidence; an ethic of disciplined, collaborative inquiry that encourages challenges and revisions to initial interpretations; and the transparency of the trail of evidence leading to the interpretations, which allows users to evaluate the conclusions for themselves" (p. 7). Moss presents several examples of assessment approaches that fall under the rubric of hermeneutics (e.g., various portfolio assessment activities).
In terms of concerns with the generalizability of such assessment approaches, the author offers several solutions from a hermeneutic perspective. Documentation of the process used to make decisions about an examinee's performance can account for variation across tasks, and can enable others to generalize from the decision. Latitude in assessment task choice by the examinee (with portfolios, for example) can allow for better representativeness (and generalizability) of the examinee's perceived abilities.
In consideration of the lack of objectivity and certainty of knowledge in all forms of assessment, a lack of generalizability across readers (raters or judges) should be seen as an asset. This encourages critique, dialogue, discussion, and debate which values subjective input from various backgrounds. Better understanding of the decision-making process can evolve, drawing from the idea that no knowledge exists without accompanying prejudice.
From a hermeneutic perspective, then, psychometric reliability (and the idea of "detached and impartial assessment") imposes a unitary, authoritarian viewpoint on test-takers. Individuality is squelched, and valuable information about examinee abilities is lost. Hermeneutics tries to counter this by looking beyond "anonymous authority" and the "idolatry of scientific method" to find the voice of those who have the greatest stake in any assessment. Moss concludes that we should not abandon reliability, even though it is not necessarily a condition of valid assessment in certain circumstances. She sees it as one tool among many that can be used to justify decision-making claims to validity, but a tool that should be faced with critical dialogue from other possibly warranted perspectives.
Qualitative research methodology offers an interpretive strategy for making warranted or valid decisions based on classroom-centered portfolio assessment. In order to interpret and compare classroom learning for accountability purposes (i.e., for the public), the following methods should be employed: particular descriptions of different learning and growth categories, general descriptions displaying a breadth of evidence, and interpretation that frames the whole picture. The path taken for decision making should be well documented so that outside observers can retrace and critique the process.
The PROPEL project offers an example of this kind of assessment and accompanying accountability process in action. Based on narrative profiles of student portfolios, validation of the process could be undertaken. Outside evaluators experienced low levels of consistency in rating student profiles; however, this factor supports increased dependence on teacher interpretations and an audit trail of teachers' decision-making processes. Student autonomy in selection of portfolio pieces led to wide discrepancies in representativeness. This issue could be addressed by instituting genre requirements, although this would undermine, to a certain extent, the goals of the project. Efficiency in the assessment process, both for teachers and for "auditors," seemed to be a concern, although the time involved was not as extensive as predicted. Moss et al. conclude:
"This kind of contextualized performance assessment is likely to provide an important supplement to the standardized sorts of performance assessment typically used at the system level and to suggest directions for curricular reform" (p. 20).