Walk a mile in my shoes: stakeholder accounts of testing experience with a computer-administered test.
Fox, Janna ; Cheng, Liying
If we are to take seriously the argument ... that the test-taker in
particular and validation in general should be at the heart of
development, then tests simply must be built around the test-taker.
(O'Sullivan, 2012, p. 16)
With the global trend toward internationalization of university
campuses and the cultural and linguistic diversity of Canadian
classrooms (Fox, Cheng, & Zumbo, 2014), language tests have become
ever more pervasive and more powerful decision-making tools (Shohamy,
2007). Inferences drawn about test-takers' language abilities based
on language test scores result in life-changing decisions, for example,
university admission, professional certification, immigration, and
citizenship.
Across Canada each year, thousands of students enroll in English
language programs and test preparation courses with the hope of
improving their language and test-taking strategies in order to pass a
high-stakes proficiency test. The present study took place in such a
program, at a mid-sized Canadian university that enrolled students at
basic, intermediate, and advanced levels during each 12-week term. Such
programs have become a ubiquitous feature of the Canadian context (Fox
et al., 2014).
Although test-takers are most directly affected by high-stakes
proficiency testing, their role as the principal stakeholders in
language testing has not always been recognized (Shohamy, 1984). In
recent years, however, language testing validation studies have
increasingly drawn on test-taker feedback in order to better understand
how tests behave, and what they are actually measuring. For example,
test performance has been researched in relation to test-taker accounts
of
* test-taking strategies (Alderson, 1990; Cohen & Upton, 2007;
Phakiti, 2008; Purpura, 1998);
* behaviours and perceptions before, during, and after a test (Doe
& Fox, 2011; Fox & Cheng, 2007; Huhta, Kalaja, &
Pitkanen-Huhta, 2006; Storey, 1997);
* prior knowledge (Fox, Pychyl, & Zumbo, 1997; Jennings, Fox,
Graves, & Shohamy, 1999; Pritchard, 1990; Sasaki, 2000);
* test anxiety (Cassady & Johnson, 2002); and
* motivation (Cheng et al., 2014; Sundre & Kitsantas, 2004).
Studies have also drawn on test-taker accounts in order to examine
what they reveal about a test method (Shohamy, 1984) or a task (Elder,
Iwashita, & McNamara, 2002; Fulcher, 1996). Others have elicited
test-taker responses in consideration of a test itself (Bradshaw, 1990;
Powers, Kim, Yu, Weng, & VanWinkle, 2009; Swain, Huang, Barkhoui,
Brooks, & Lapkin, 2009).
Multiple-Stakeholder Accounts of Testing Experience
Recently, multiple stakeholder accounts of testing experience have
been considered in the testing research literature as part of an ongoing
program of test validation (Cheng, Andrews, & Yu, 2011; DeLuca,
Cheng, Fox, Doe, & Li, 2013; Fox, 2003). For example, Fox (2003)
examined differences in rater and test-taker accounts of a writing task
in the context of the development and trial of a new version of a
high-stakes test. Differences in these two stakeholder accounts led to a
reconsideration of test specifications. Cheng et al. (2011) considered
test-taking students' and their parents' perceptions of
high-stakes assessment in Hong Kong. Qi (2007) compared students'
and test developers' accounts of a writing test and found
differences in their perceptions of what was being measured, suggesting
that understandings of a construct, and of what a test is actually
measuring, may differ in important ways from those intended by the test
developers.
Further, there is an ongoing need to accumulate validation evidence from
operational tests in order to support the chain or network of inferences
(e.g., Kane, 2012; Kane, Crooks, & Cohen, 1999; McNamara &
Roever, 2006). This is what Kane et al. (1999) define as the
"interpretive argument" (p. 6) of a test, that is, evidence
that supports the interpretation or use of test scores. As Kane (2012)
notes, validity itself is at its core an evaluation of the coherence and
plausibility of evidence supporting a test's interpretive argument.
Messick (1996) pointed out that testing researchers and test
developers should pay particular attention to construct
underrepresentation and construct irrelevant variance as potential
threats to the validity of inferences drawn from tests. In language
testing research, however, once a test is operational, further
consideration of these potential threats to validity has often been
limited to an analysis of scores or outcomes alone (Bachman, 2000).
Moss, Girard, and Haniford (2006) argue that validation studies should
include stakeholder perspectives in order to expose sources of evidence
that would otherwise stand to invalidate test inferences and uses.
Bachman and Palmer (1996) also advise testing researchers and developers
to explore test usefulness by eliciting feedback on operational versions
of tests from key stakeholders (e.g., test-takers, raters, and other
groups) who are affected by test decisions.
Over the past 10 years, we have seen an increase in the use of
computers in large-scale language testing (see, for example, the TOEFL
iBT or the Pearson Test of English [PTE] Academic). It is thus essential
for testing researchers and developers to understand how the use of
computers affects the test-taking experience and whether computer
administration formats change the constructs being measured. Our review
of the literature suggests that the role of computer-administered
language tests in test performance is underresearched, in spite of their
exponential growth. There has been some investigation of the impact of
computer-administered testing (e.g., Maulin, 2004; Taylor, Jamieson,
Eignor, & Kirsch, 1998), but it has arguably been insufficient.
Fulcher (2003) noted the lack of published research on
computer-administration interfaces in language testing in his
consideration of a systematic interface design process. Since that time,
some studies have been published that provide evidence to support the
construct validity of various computer-administered tests in comparison
with their paper-based counterparts (e.g., Chapelle, Chung, Hegelheimer,
Pendar, & Xu, 2010; Choi, Kim, & Boo, 2003; Stricker, 2004), but
these studies have tended to take place prior to the implementation of a
new computer-administered test.
Computer-Administered Testing
A growing body of research suggests that we understand far too
little about the implications of computer administration for the testing
experience, the ways in which a computer-administered format may
subtly change a test construct (e.g., Hall, in press; Ockey, 2007), or
the impact of the administration medium on test-taker perceptions and
attitudes (e.g., Huff & Sireci, 2001; Richman-Hirsch,
Olson-Buchanan, & Drasgow, 2000). Further, as Huff and Sireci (2001)
note, "If the ability to interact successfully with a computer were
necessary to do well on a test, but the test was not designed to measure
computer facility, then computer proficiency would affect test
performance." They go on to point out that "given that social
class differences are associated with computer familiarity, this source
of construct irrelevant variance is particularly troubling" (p.
19). Indeed, it is still the case that in a number of countries,
students do not have wide access to computers and are not
accustomed to preparing their assignments using a computer.
As language testing researchers or expert informants (i.e.,
doctoral students and professors in language testing/assessment) who
took the TOEFL iBT (hereafter, iBT; see DeLuca et al., 2013), we
reported that our own experience with computer administration was
generally very positive. For example, we recounted the ease with which
we could respond to the writing task by typing our responses, and noted
this as an improvement over the handwritten responses required of the
paper-based tests we had previously written. Further, we reported that
computer administration increased the overall sound quality of the
listening and speaking sections of the test, and also allowed for the
control of pacing in listening. At the same time, we identified
"practical issues related to the language testing conditions,
question design, and the testing protocol" (DeLuca et al., 2013, p.
663) of the test, which we argued were of potential concern with regard
to construct. For example, we noted the high cognitive demands of the
test (which we speculated were at times beyond those experienced by
undergraduate students at the beginning of their degree programs). The
cognitive demands were particularly evident in the reading section of
the test (which was also the first section of the test). We speculated
that having such a difficult section at the beginning of the test might
undermine the confidence of test-takers. Further, we expressed concern
about the length of the test and the limited amount of time provided to
complete complex tasks (i.e., speededness).
In order to extend and elaborate these findings, the present study
elicited responses of former and current iBT test-takers--the target
population/stakeholders of the test--and was guided by the following
research questions:
1. What characterized the computer-administered testing experience
for former (successful) and current (unsuccessful) iBT test-takers?
2. What did probing the testing experiences of these test-takers
reveal about construct representation and the interpretive argument of
the iBT? How do their accounts compare with those of the language
testing researchers reported in 2013?
Method
The present study used an exploratory concurrent mixed methods
research design (Creswell, 2015), merging or integrating findings from
both qualitative and quantitative research strands. Notices of the study
were posted in the university where the study took place in order to
recruit former iBT test-takers who had recently passed the test (i.e.,
within two months) and were enrolled in their degree programs at the
time of the study. Two students volunteered for semistructured
interviews (see Appendix for the interview questions). At the same time,
375 recent iBT test-takers voluntarily responded to questionnaires
circulated in 15 classes of a preuniversity English for Academic
Purposes (EAP) program in the same university. None of these current EAP
students/test-takers had passed the iBT at the time of the study. All of
the questionnaire respondents had been required to upgrade their English
proficiency in order to meet the minimum language proficiency
requirements for admission to university degree programs.
Once the interview and questionnaire data had been analyzed,
results from the two research strands were extended and explained by
merging or integrating findings (Creswell, 2015)--a critical step in
mixed methods research. In total, 377 test-taker participants
contributed to the development of our understanding of what
characterized these test-takers' testing experience with the
computer-administered iBT, and how their experience differed from or
confirmed the accounts of the language testing researchers reported in
2013.
Participants
Former test-takers (n = 2)
Two university students (former, successful iBT test-takers) were
interviewed about their testing experience. Pseudonyms are used in
reporting their accounts. Li spoke Mandarin as a first language (L1) and
English as a second (L2) or additional language. She had taken the TOEFL
Paper-Based Test (PBT) in China, but her scores were not high enough to
allow her to begin her Canadian university program. When she arrived in
Canada, she completed an intensive, three-month iBT test preparation
course prior to obtaining the TOEFL iBT test scores required for
admission. Juan spoke Spanish (L1), English (L2), and French (L2). He
had taken both the PBT and the iBT in the Dominican Republic prior to
beginning his university program in Canada. Like Li, he had been
unsuccessful on the PBT, but was later successful on the iBT.
Current test-takers (n = 375)
Current iBT test-takers voluntarily completed a questionnaire on
their experiences taking the computer-administered test. Many of these
participants indicated that they had taken a number of different
proficiency tests, for example, the International English Language
Testing System (IELTS) and the Canadian Academic English Language (CAEL)
Assessment. All indicated that they were planning to take another
proficiency test in the near future, but did not indicate which test
they planned to take.
Instruments
The TOEFL iBT
Since its introduction in 2005, the iBT has been administered to
millions of test-takers around the world. Although it is technically an
Internet-based test, the present study focused on the
computer-administered format or administration interface of the test
(Fulcher, 2003). The iBT tests "academic English" in
"reading, listening, speaking, and writing sections" (ETS,
2010, p. 6). It takes approximately 4 1/2 to 5 hours to complete.
Interview questions
Semistructured interviews were conducted with the two former
test-takers, who were asked to account for their testing experience (see
Appendix for interview questions).
Questionnaire
The test-taker questionnaire used in the study combined items based
on test preparation and the role of computers in test administration
(DeLuca et al., 2013) and on the posttest questionnaire developed and
validated by the Testing Unit at the university where the study took
place. This questionnaire was routinely distributed after administration
of high-stakes proficiency tests. The test-taker questionnaire was
designed to elicit both closed and open-ended responses. Closed items
collected information on key grouping variables. Open-ended items were
the primary focus of the study. The questionnaire was controlled for
length and complexity, given that it was administered across a wide
range of language proficiency levels and only a limited amount of time
was allowed by EAP teachers for administration of the questionnaire in
class. The questionnaire is included in the Results and Discussion
section below.
Data Collection and Analysis
As mentioned earlier, notices of the study were posted in the
university where the study took place to recruit former iBT test-takers.
Two students volunteered for semistructured interviews, which were audio
recorded and transcribed for analysis. The questionnaire was circulated
in the preuniversity EAP program at the beginning of a new 12-week term
and across basic, intermediate, and advanced classes. Participants
filled in the questionnaire in their EAP classes. Only participants who
had taken the iBT within the previous two-month period were considered
in the study. None of these participants had received scores high enough
to allow them to enter their university programs at the time of the
study. These current test-takers are the target population of the iBT.
Further, all of these participants wrote a high-stakes paper-based test
(i.e., the Canadian Academic English Language Assessment) under test
conditions during the first week of the term. If their CAEL test scores
had been high enough, they would have been deemed to have met the
language proficiency requirement and admitted to their university
programs.
The responses of the two former test-takers and the open-ended
questionnaire responses of the current test-takers were analyzed using a
modified constructivist grounded theory approach (Charmaz, 2006).
Specifically, interviews were recorded and transcribed. Next, the texts
were sorted and synthesized through coding, "by attaching labels to
segments of data that depict[ed] what each segment is about"
(Charmaz, 2006, p. 3). Through this process, the data were distilled by
"studying the data, comparing them, and writing memos [to define]
ideas that best fit and interpret[ed] the data as tentative analytic
categories" (Charmaz, 2006, p. 3). Subsequently, the categories,
which we identified in analysis of the interview data as typified and
recurrent features (Pare & Smart, 1994) of the computer-administered
testing experience, were compared with categories we identified in the
coding analysis of open-ended responses to the questionnaire. In order
to assess the reliability of the coding procedure, selected samples of
interview and questionnaire responses were subsequently coded by two
other researchers, who were familiar with the coding approach used in
this study, but had not participated in it. Interrater/coder agreement
was considered satisfactory based on Cronbach's alpha (α =
.86).
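To make the reliability index concrete, the brief sketch below shows one common way Cronbach's alpha can be computed over paired codings. It is a minimal illustration only: the coding scheme, data, and function are hypothetical, they are not drawn from the study, and the example does not reproduce the reported value of .86.

import numpy as np

def cronbach_alpha(ratings):
    # ratings: rows = coded segments, columns = coders (numeric codes)
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                               # number of coders
    coder_variances = ratings.var(axis=0, ddof=1)      # variance of each coder's codes
    total_variance = ratings.sum(axis=1).var(ddof=1)   # variance of summed codes
    return (k / (k - 1)) * (1 - coder_variances.sum() / total_variance)

# Hypothetical numeric codes assigned by two coders to the same eight segments
coder_a = [1, 2, 2, 3, 1, 4, 2, 3]
coder_b = [1, 2, 2, 3, 1, 4, 1, 3]
print(round(cronbach_alpha(np.column_stack([coder_a, coder_b])), 2))  # prints 0.97 for this invented data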
Quantitative data drawn from the questionnaires were analyzed using
descriptive statistics (i.e., frequencies and percentages of response)
to identify grouping variables. Open-ended responses were examined in
relation to these variables (e.g., type of test preparation in relation
to reports of test anxiety). Data from each of the research strands were
analyzed and then integrated or merged in reporting the results. Merging
the data from the two strands allowed us to extend and explain the
findings with greater clarity and depth of interpretation. It should be
noted that a distinctive characteristic and essential requirement of
mixed methods studies (Creswell, 2015) is the integration of the
separate findings from quantitative and qualitative strands. Given the
present study's exploratory or naturalistic design, the qualitative
findings are dominant, but they are more meaningful and interpretable
when they are merged with the quantitative findings. Finally, we
compared the language testing researchers' accounts of their iBT
test-taking experience with those of the test-takers considered here, in
relation to construct definition (Messick, 1996) and evidence supporting
the interpretive argument of the test (Kane, 2012).
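As an illustration of the quantitative strand described above, the sketch below tabulates frequencies and percentages for a grouping variable and cross-tabulates a grouping variable against a coded theme. The data frame, variable names, and values are invented for illustration; they simply mirror the kind of summary reported in Figure 1 and in the Results below, not the study's actual dataset.

import pandas as pd

# Hypothetical questionnaire records; variable names and values are invented
df = pd.DataFrame({
    "prepares_in_advance": ["yes", "yes", "no", "yes", "no", "yes"],
    "format_preference": ["paper", "paper", "computer", "paper", "paper", "computer"],
    "reported_test_anxiety": [True, True, False, True, False, False],
})

# Frequencies and percentages for a grouping variable
counts = df["format_preference"].value_counts()
percentages = (counts / counts.sum() * 100).round(1)
print(pd.concat([counts, percentages], axis=1, keys=["n", "%"]))

# Open-ended themes (reduced here to a single flag) examined against a
# grouping variable, e.g., test preparation in relation to reported anxiety
print(pd.crosstab(df["prepares_in_advance"], df["reported_test_anxiety"], margins=True))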
Results and Discussion
Overview
In exploring what characterized the computer-administered testing
experience for both former (successful) and current (unsuccessful) iBT
test-takers, we were also interested in any differences in their
accounts, particularly those that might be considered construct
irrelevant and potentially a threat to the interpretative argument for
test use. We begin by presenting an overview of our findings, drawing on
both the responses in semistructured interviews of the two former
test-takers and the open-ended questionnaire responses of the 375
current test-takers.
Following Pare and Smart (1994), we reduced and synthesized the
categories, identified through multiple rounds of
coding (Charmaz, 2006), into frequent and recurring themes that best
characterize the computer-administered testing experience for the
participants in the present study. The recurring themes across former
and current test-takers are as follows.
* Acknowledgement of the importance of test preparation
* Concerns about speededness
* Positive responses to computer-administered tests of listening
and speaking
* Mixed responses to reading subtests
In the sections that follow, we address our first research
question in relation to these four recurrent themes:
1. What characterized the computer-administered testing experience
for former (successful) and current (unsuccessful) iBT test-takers?
The Importance of Test Preparation
The questionnaire used in the study is presented in Figure 1, along
with a summary of the responses (i.e., frequency and percentage, see
square brackets) of the current test-takers to the closed items on the
questionnaire. As indicated below, of the 362 current test-takers who
responded to this item, 273 (75.4%) indicated that they prepare in
advance for a high-stakes test and 89 (24.6%) indicated they do not; 13
of the 375 did not respond to this item.
Figure 1. Overview of current test-taker responses to the
questionnaire.
TEST-TAKER FEEDBACK QUESTIONNAIRE
Directions: Would you be willing to give us some feedback on taking
English language tests like the TOEFL iBT, IELTS, or CAEL? If yes,
please answer the following questions. Whether you answer or not,
fold and drop this form in the box at the front of the room when
you leave. Do not record your name. Thank you for your feedback.
1. Do you prepare in advance for an English language test like the
TOEFL iBT, IELTS, or CAEL? [n = 362 responses]
( ) YES [273, 75.4%] ( ) NO [89, 24.6%]
If YES, how do you prepare for a test? Check all that apply.
[n = 270 responses]
[200, 70.4%] Look at the online practice tests
[75, 27.8%] Use the published Preparation Guide
[79, 29.3%] Take a preparation course
[85, 31.5%] Talk to friends
[10, 3.7%] All of the above
Other: Please explain:--
2. Have you ever taken an English test on computer? (X) YES [375 or
100%] ( ) NO
If YES, check all that apply:
--TOEFL CBT
X TOEFL iBT [N = 375 or 100%]
--Pearson Test of English (PTE) Academic
--Other (Please explain)--
3. Why did you take the test(s) in #2 above?
X To get into university [N = 375 or 100%]
--For my work
--For practice
--Other (Please explain)?--
4. Which method of testing do you prefer? [n = 355 responses]
--pen and paper [302, 85%]
--computer [53, 15%]
Please explain why you prefer this method of testing:--
5. Do you think you would do better on the writing section of a test
if you could use the computer to type your response? [n = 315
responses]
( ) YES [92, 29%] ( ) NO [223, 71%]
Please explain why:--
In total, 270 (72%) explained how they had prepared for the test,
with 117 (43%) indicating multiple approaches to test preparation. Of
these 270 respondents, 200 (70.4%) mentioned accessing online resources,
85 (31.5%) indicated that they had consulted friends, 75 (27.8%) had
studied a test preparation guide, and 79 (29.3%) reported taking a test
preparation course. Ten (3.7%) identified all of the above approaches as
test preparation they engaged in prior to taking a high-stakes test.
Of the 117 (43%) respondents who indicated that they prepared in a
number of different ways prior to taking a high-stakes test, the most
frequently mentioned combinations of preparation were as follows (a
brief tallying sketch follows the list):
* Look at the online practice tests and talk to friends: 28
(10.4%);
* Look at the online practice tests, use the published preparation
guide, and talk to friends: 15 (5.6%); and
* Look at the online practice tests, use the published preparation
guide, and take a preparation course: 12 (4.4%).
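One way to arrive at combination counts like those above is to tally each respondent's set of checked options. The short sketch below illustrates the idea with invented responses; the option labels echo Figure 1, but the data are not the study's.

from collections import Counter

# Hypothetical "check all that apply" responses: each respondent's set of options
responses = [
    {"online practice tests", "talk to friends"},
    {"online practice tests"},
    {"online practice tests", "preparation guide", "talk to friends"},
    {"online practice tests", "preparation guide", "preparation course"},
    {"online practice tests", "talk to friends"},
]

# Count only multi-option combinations, then report the most frequent ones
combos = Counter(frozenset(r) for r in responses if len(r) > 1)
for combo, n in combos.most_common(3):
    print(sorted(combo), n, f"{n / len(responses):.1%}")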
Because the current test-takers did not report their scores on the
iBT, it was impossible to relate the amount or type of test preparation
to a specific test or test performance. It was clear, however, that test
preparation was an important feature in test performance for most of the
current test-takers who responded to this item on the questionnaire.
One of the former test-takers also reported extensive test
preparation prior to taking the iBT and explained how test preparation
contributed to her use of specific strategies during the test. Li, the
Mandarin-speaking test-taker, took the advice of a test-savvy friend and
enrolled in "an expensive ($300) two-month, intensive iBT test
preparation course" as soon as she arrived in Canada. She
explained, "I recognized that I had to get to know the process and
procedures of the iBT. I really needed lots to prepare. I had to get
used to the computer." She added,
I didn't use much the computer before in China, but when I come
here [Canada] I realized I have to use computer for test. That
actually created a lot of like high anxiety for me, because I don't
know what is going on with the computer.
In addition to taking the course, she purchased a test preparation
book describing the iBT and completed the book's activities and
sample tests; used materials provided in her iBT registration package,
both official (from the test developer's website), and unofficial
materials, available online; and interacted with friends who had already
taken the iBT. She also took workshops on using computers and practiced
frequently in the library to improve her computer skills.
There is a well-reported tension (Cheng & Fox, 2008; Green,
2006) between language teachers' goals to improve their
students' language in substantive ways, and students' goals to
simply pass the test. Many students view such tests as barriers to their
university education rather than as necessary verification that their
language proficiency has reached the threshold essential for their
academic work. There has been considerable concern that test preparation
courses may at times undermine a test's potential to measure the
language constructs of interest, and that test-takers may waste their
time practicing test-taking strategies that are not useful beyond the
bounds of test-taking itself (e.g., Cheng et al., 2011).
The comments of the test-takers in this study suggest that test
preparation is indeed a major pretest focus in this high-stakes context.
Beyond their concerns about passing the test, however, the test-takers
were also concerned with developing computer skills adequate to the
demands of the test itself. It is possible to make
the argument that such skills are essential to academic work (and
therefore part of the construct the iBT is measuring); however, one may
question whether or not all entering university students have such
skills at admission. This issue is discussed further below with regard
to the iBT writing task, and in relation to the accounts of the language
testing researchers reported in DeLuca et al. (2013).
Concerns About Speededness
Additional concerns about computer administration were evident in
the responses of both former and current test-takers with regard to the
amount of time provided to complete tasks on the iBT. Further to Huff
and Sireci's (2001) misgivings regarding computer-administered
tests, test-takers in the current study reported issues relating to
time, the timing of tasks, and increasing test anxiety during the
computer-administered test. Henning (1987) refers to this as
speededness, a label we appropriated for this study. Similar results
were reported in DeLuca et al. (2013) when, as test researchers/expert
informants, we took the iBT and noted how demanding the test was. We
reported feeling increasing anxiety and a sense of declining confidence
as a result of the pressure to complete highly complex tasks within the
imposed time limits even though there were no high stakes attached to
our test-taking. The effects of speededness were frequently commented on
in the present study as well. When current test-takers were asked to
identify and explain their preferences for computer-administered or
paper-administered tests, 302 (85%) indicated they preferred a
paper-based format; 53 (15%) preferred the iBT's
computer-administered format.
Amongst the 302 test-takers who preferred the paper-based format,
time was explicitly mentioned by 31 test-takers in explaining why it was
their preference (e.g., "takes more time on computer" or
"more time to think and review with pen and paper"). Issues of
time in relation to task completion were explicitly mentioned by 55
others, who explained their typing skills were "slow" or
"poor"; 77 reported that paper-based administration was faster
for them ("writing is faster with pen and paper" or "not
as fast to write with computer"). Thus, issues of time and
speededness figured in the preference for paper-based administration
formats in 163 (44%) of the test-takers who responded to the
questionnaire. Of the 53 who preferred computer administration, only 4
mentioned time as a reason; 7 explained they had good typing skills and
13 explicitly mentioned their typing was "faster."
In interviews, both of the former test-takers mentioned that time and
their speed of response were issues for them in sections of the iBT.
These comments are further explained below in relation to
specific sections of the test.
Positive Responses to Computer-Administered Tests of Listening and
Speaking
The benefits of allowing test-takers to control the overall pace of
their work on the listening section of the iBT were frequently mentioned
by both former and current test-takers (as was the case with the
language testing researchers). Such control was not possible in
paper-based formats. However, foremost in their positive responses to
the listening and speaking sections of the computer-administered test
were comments about sound quality and clarity, particularly in relation
to the listening section of the test. For example, Juan (former
test-taker) stated:
I'm an ideal candidate for this study because I've taken the two
versions of the TOEFL. I took both paper-based and iBT TOEFLs. So
obviously I didn't get the score I needed when I took the
paper-based test, but one of the reasons I had to take it again was
the context in which it was administered. It was not fair.
Specifically the listening section, because I was seated behind the
speakers. It affected my concentration, because the quality of the
audio was very bad and also the acoustics of the room.
He reported that his initial negative experience on the listening
section of the PBT undermined his performance overall:
I think my other scores on the PBT were lower because the listening
was administered first. It just destroyed me. I didn't put as much
effort after that. So listening failure triggered really high
anxiety, and I was not motivated to write the other parts of the
test. Well, I was motivated, but I couldn't concentrate. I was
still thinking about the listening.
He received an overall score of 480 on the PBT and, as expected,
his lowest score was in listening. Subsequently, he learned that the iBT
was also being offered. He reported that when he took the iBT,
"listening was my highest score" and "I was successful
overall too, because I scored 105 on the iBT!" He commented on the
"quality of the sound, so easy, so clear" as a result of
wearing headphones.
Improved sound quality and clarity in the listening section of the
computer-administered test were explicitly highlighted by 7 of the 375
current test-takers as the clear advantage of the computer-administered
test format. Like Juan, the 7 current test-takers who singled out sound
quality as an issue for paper-based tests reported test-taking
experiences in which they perceived their performance had been
undermined by conditions in the testing room itself: "I
couldn't hear the sound of the lecture because I was in the back of
the room" (Case 21) or "I couldn't hear the
speakers" (Case 24).
However, not all of the issues related to sound have been resolved
through computer administration. Many test-takers reported noise in the
testing room as a problem in their test performance: "I
couldn't think because the girl near me was on a different part [of
the test] and she is speaking so loud. It is impossible to think and do
my part" (Case 70). Other test-takers also complained about noise
in the testing rooms: "the room too crowded and so noise is
problem" (Case 26); "I can't hear because I hear other
[test-takers] too" (Case 166). Several others reported distractions
in the room: "In middle of test, this other one [test-taker] she
has problem and makes noise and I can't focus on my test. Why they
not take her outside to discuss?" (Case 12); or "new
test-taker come into room to start test, but I not finished. I trouble
then ... can't think" (Case 10). The accounts of the current
and former test-taker groups are very similar to those reported by the
language testing researchers in 2013 with regard to ambient noise,
further discussed below.
Mixed Responses to the Reading Section of the Computer-Administered
Test
In this study, both the former and current test-taker groups
reported that the reading section was not particularly difficult. For
example, the former test-takers stated, "Reading was fine. No
issues" (Juan) and "I liked the reading" (Li). They
pointed out a computer feature that was helpful: "I could click on
unfamiliar or technical terms. That helped me." And Li remarked
that the multiple choice format made the reading section easier for her:
"I guess because we're Chinese, we can use multiple choice. So
you kind of have a strategy to exclude [distracters]. So that's my
active [test] preparation strategy."
The comments of the former test-takers were similar to those of the
current test-takers. It is important to note, however, that 39 (10.4%)
of the 375 current test-takers reported that they "were not used to
reading computer screens" (Case 27) under pressure; "feel
nauseous when I read from computer" (Case 2); or "hate to read
on computer [and] could not focus on the screen" (Case 33). Others
reported they were "unfamiliar with reading on computer like
this" (Case 200). Still others pointed out that they "like to
underline and highlight," "like to circle keywords," and
"write notes in the margin" when they read.
In order to explore this reported practice, and with permission of
the Testing Unit at the university where the study took place, we
examined 50 randomly selected reading booklets with extensive reading
passages from previously administered CAEL Assessments. CAEL allows
test-takers to work with and use the reading booklets while they are
responding to questions on the reading subtest. Test-takers may write on
the reading booklets if they choose to do so. The review suggested that
when test-takers have reading booklets, a majority tend to annotate
their reading in some way: 34 (68%) highlighted, wrote in the margins,
circled, or underlined; 16 (32%) did not annotate the reading in any
way. Further research on the academic reading construct needs to examine
this finding.
Issues of construct representation are the focus of the section
below, which addresses the second research question.
2. What does probing the testing experiences of these test-takers
reveal about construct representation and the interpretive argument of
the iBT? How do their accounts compare with those of the language
testing researchers reported in 2013?
Probing the test-taking experiences of current and former
test-takers (the target population of the iBT) suggests potential
threats to validity that might not otherwise be evident if
test-takers' perspectives are not consulted (Cheng et al., 2011;
Fox, 2003; Fox & Cheng, 2007), or if test performances (scores) are
the sole source of data, as has traditionally been the case with
validation studies of operational tests (Moss et al., 2006). In the
section below, construct-relevant issues identified by the
test-takers' accounts of their computer-administered testing
experience are discussed. Suggestions are made to further investigate
these issues in order to determine how significant they might be, and to
accumulate evidence with regard to the test's interpretive
argument.
Responses to the Computer-Administered Writing Tasks
Ockey (2007) found subtle changes in construct as a result of a
computer-administration format in the case of visual modes. In the
current study, the changes were more dramatic, particularly in relation
to test-takers' accounts of the iBT writing section. For example,
the former test-takers identified the computer-administered writing
section of the test as "the most difficult part" and
potentially "unfair." Juan explained that he was "very
anxious" about his "ability to type fast." He felt the
computer-administered format of the writing section of the test put him
at a particular disadvantage because "keyboarding was such an
important requirement for writing well on the test." He explained:
If I had been an expert in typing I would have performed better.
This is the issue. If you are a slow typer [typist], the time is
consumed by typing, and so you don't have time to go back and read
and think about what you are writing. It steals, in a way, the time
you would have had to proofread and think about what you have
written already. And if you look at it from that perspective, the
ability to type fast on the keyboard, again definitely, I don't
think that is fair in measuring whether you can write in English or
not. Particularly for students who are from countries that do not
have access to computers, or do, but do not type fast.
He pointed out that he drafts all of his university papers with pen
and paper, typing only the final draft, because his "keyboarding is
still very slow." He reported being "very nervous" during
the writing section of the iBT, "because I type so slowly, I
didn't have time to really finish or write what I wanted to
say." He remarked that many of his current classmates and most
students in his home country would not be able to perform well on the
writing section of the computer-administered test. Like Li, they simply
didn't have enough experience with computers or keyboarding. Nor
did the former test-takers agree with the test researchers that
familiarity with computers was ultimately an "advantage for
studying in university." They pointed out that
"familiarity" was different from "fast typing," and
this skill was something that should not be expected of test-takers, who
were just beginning university. They argued "it was unfair to
expect this of second language students, when it is not expected of
English-speaking students."
The former test-takers wondered why test-takers weren't given
a choice to either type or write out their responses by hand. As Juan
suggested, "Why don't they offer a choice for the writing
section? Those who type quickly and write their papers this way could
use the computer; those that don't, could use pen and paper, with
the same amount of time to finish the work."
The responses of the current test-takers were also overwhelmingly
negative about keyboarding requirements and/or the use of the computer
for testing writing. In general, of the 355 test-takers who responded to
a question about administration-format preference, only 53 (15%)
indicated they preferred the computer, whereas 302 (85%) indicated they
preferred paper-based administration for writing. Like Juan, they
pointed out that "they were more used to pen and paper tests of
writing" (Case 6). They argued that paper-based tests gave them
"more freedom and mobility to write and erase manually ... in
computer you have to look at the keyboard" (Case 21).
Many of the current test-takers expressed concerns about
controlling the computer, pointing out that "computer makes me
nervous" (Case 28), and that a paper-based test is "safer than
using computer. Sometime we press keys which can remove all we did"
(Case 34). Still others expressed concern "because I am not fast in
typing" (Case 23). They argued that "it's faster for a
person to write on paper and reduces time" (Case 41) or
"[it's] more natural" (Case 342), adding that "a lot
of time [typing] means a lot of pressure on a person."
What stood out in our analysis of these test-takers' accounts
were the differences in the amount and type of test preparation that
they reported in response to the computer-administered writing tasks.
Li's extensive preparation--particularly her extended emphasis on
computer familiarity and keyboarding speed in preparation for the
iBT--and Juan's comments on increased anxiety as a result of the
demands imposed by keyboarding requirements of the writing task suggest
that keyboarding speed and computer familiarity are part of the
construct being measured by the test.
Juan (unlike Li) did not prepare for the iBT (having been
unsuccessful on the TOEFL PBT and pinpointing listening as the problem,
he registered without delay for the iBT). Although Juan did not do as
well as he had expected on the iBT writing section, and during the test
he experienced higher anxiety as a result of his lack of familiarity
with typing his written work and his limited keyboarding skill, he still
passed the test. Interestingly, within the unsuccessful (current)
test-taker group, there were notable differences of opinion. Of the 315
who responded to the question, "Do you think you would do better on
the writing section of a [high-stakes] test if you could use the computer
to type your response?" 223 (71%) responded "No" and 92
(29%) responded "Yes." In their explanations of their
responses to this question, there were compelling differences between
test-takers who reported their writing performance was undermined by
typing their responses to the writing tasks, and those who reported that
their performance was enhanced. This speaks again to the issue of
construct, and what the test intends to measure. The accounts of the
test-takers in the present study suggest a potential method effect, as
the iBT requirement that they type their responses in the writing
section of the test may differentially impact test performance.
In contrast, as language testing researchers or expert informants
(i.e., doctoral students and/or professors in language testing and
assessment) in the 2013 study, we reported that the iBT writing subtest
was, in our view, the easiest section of the test. We appreciated the
speed with which we could express our thoughts in writing as a result of
the computer interface and complained about paper-based tests, which
required us to write out responses, because we considered handwritten
responses unnecessarily slow and limiting. Our positive accounts, as
professional academics, of typing our responses on the writing subtest
stand in sharp contrast to those of former test-takers, Li and Juan, and
75% of the current test-takers who are hoping to enter undergraduate
programs. These differences may have important implications for
construct definition and fairness.
This study suggests the need to further examine the impact of
required keyboarding/typewritten responses on test performance. It
should be pointed out that keyboarding is not a requirement for
admission to English-medium universities in North America. Although the
testing researchers in DeLuca et al. (2013) reported feeling at ease
with typing their responses, by the time a student reaches graduate
school (particularly at the doctoral level), typing written texts may be
a "natural" part of academic work. This is not the case for
entering first-year students. Most of the test-takers considered in the
present study reported that the requirement to keyboard or type their
writing impeded their performance on the iBT writing tasks. Their
accounts of their testing experience raise construct (ir)relevant
questions, given that typing original text under time pressure is not a
requirement for admission to undergraduate universities in Canada. This
is precisely the issue raised by Huff and Sireci (2001), who note that
students from contexts where computers are not a ubiquitous feature of
education may be disadvantaged by this requirement. One may ask whether
it is fair to require this skill of only one group of entering
undergraduates (L2 applicants) when others, who do not need to submit
evidence of their language proficiency, do not face this requirement.
Ambient Distracting Noise in Testing Rooms
Although overall the test-takers considered in this study responded
positively to the computer-administered listening and speaking sections
of the iBT because of improved sound quality, ambient and distracting
noise in testing rooms, as discussed above, was a frequently reported
issue (also reported in DeLuca et al., 2013). Given that standardized
test administration is foundational to measurement quality in
large-scale high-stakes testing, this is an issue that would be of
concern to test developers and test users, because it speaks to the
interpretation of test results (i.e., the interpretative argument).
It appears that, in some test sites, administration logistics are
well worked out to the advantage of test-takers. In other test sites,
the close proximity of computer stations is a problem for some
test-takers. Based on the varying reports of the test-takers considered
here, there do not appear to be standard requirements for the
positioning of computer stations/test-takers in a testing room (or, if
standards are explicit, they may not be consistently followed). This
needs to be systematically reviewed because it is a potential source of
construct irrelevant variance. This finding could be investigated
through the use of posttest questionnaires that ask test-takers to
comment on their testing experience. Over time, test-taker feedback
would reveal administration issues arising in specific test sites that
could then be addressed. In addition, requirements for test sites may
need to be further detailed to ensure that logistics are comparable
across test administration centres (e.g., that minimum distances between
computer stations are respected, that activity in a test room is
restricted, and so on).
Reading Extended Texts on Computer
Also of concern were the comments of current test-takers who
reported that reading on a computer screen was either physically
challenging (e.g., "made me feel nauseous") or
unrepresentative of how they generally read academic texts (e.g.,
"I underline when I read," "I need to write notes when I
read"). As Fulcher (2003), Huff and Sired (2001), and Ockey (2007)
have found, more research is needed to fully understand the impact and
implications of such computer-administered tasks.
Although the construct of university-level academic reading was
evident in the reported processes, procedures, and responses of the test
researchers reported in DeLuca et al. (2013), it was not evident in the
accounts of the former and current test-takers considered in the present
study, who reported using test-wise (Cohen & Upton, 2007),
multiple-choice test-taking strategies on the test--not academic reading
strategies. Whereas the test researchers found the reading section the
most demanding cognitively and commented on learning through their
reading of the texts (albeit with concerns over insufficient time for
reading in depth, i.e., speededness), most of the former and current
test-takers seemed to take the reading subtest in stride--but not, it
would seem, because they were effective academic readers. Rather, they
reported using the multiple choice distracters strategically to find the
"correct answers" (many of these practiced in test preparation
courses prior to the test).
None of the former or current test-takers reported learning as an
outcome of their testing experience, as had the testing researchers in
DeLuca et al. (2013). The test-takers in the current study (the target
group of the iBT) were reading strategically--for correct test answers,
which one test-taker noted "were there, in the multiple choice
options." This finding coincides with what Cohen and Upton (2007)
found. Similar to Fox (2003), Cheng et al. (2011), and Qi (2007), these
findings suggest that the construct intended by the test developer may
not be the construct operationalized by the test, and may be undermining
the interpretive argument for the test to a degree. This is important
information for test developers, who may want to shorten the reading
test (in keeping with the comments on speededness) and examine
alternative response formats to avoid what appears to be a strong method
effect as a result of the multiple-choice test format. In sum, whereas,
based on the comments of the test researchers (DeLuca et al., 2013), an
academic reading construct appears to have been operationalized by the
reading section of the test, the comments of former and current iBT
test-takers suggest that a different construct (unrelated to academic
reading in university) may be operationalized for many in the iBT's
target population.
Conclusion
This study investigated computer-administered testing experience by
asking former and current test-takers for feedback on their testing
experience. The results suggest that drawing on their insights increases
our understanding of the operational test. However, the findings of this
study must be interpreted with caution. First, the data for the study
were drawn only from test-taker accounts of computer-administered
testing experience. What we can account for and report is limited; so
much of our experience is tacit. Further, our perceptions and accounts
of an experience change over time, and the time between the
test-takers' iBT testing experience and participation in the study
was not fixed. Second, the questionnaire was administered only to
unsuccessful iBT test-takers at the time of the study. All of these
participants were volunteers. Their responses could not be linked to
either iBT results or proficiency levels, which would likely have a
bearing on the participants' views of the testing experience.
Third, all of the data were collected from participants studying in one
Canadian university, in either degree programs or in preuniversity EAP
courses. Finally, only two of the former iBT test-takers who volunteered
to be interviewed for the study met the criteria for selection (i.e.,
that they had not been successful on a high-stakes paper-based
proficiency test, but had passed the iBT within the previous two
months). If more former test-takers had been identified, they could have
provided a much richer and thicker understanding of the
computer-administered testing experience of successful test-takers. The
interviews with the two former (successful) iBT test-takers were,
however, clarified and extended by the questionnaire responses of the
current (unsuccessful) test-taker participants in our study, and threw
new light on the accounts of the iBT test-taking experience reported by
language testing researchers (expert informants) in 2013.
Despite acknowledged limitations, findings from this study suggest
that the impact of computer administration on test performance needs to
be further explored. More research is needed to address the threats to
test performance and score interpretation posed by such issues as
familiarity (test preparation), test method, speededness, and test
anxiety, which we found in the current study, and were also raised by
DeLuca et al. (2013), Huff and Sireci (2001), and Ockey (2007). Such
issues speak to the interpretive argument of the test. As Kane, Crooks,
and Cohen (1999) note, the ongoing collection of evidence drawn from in
vivo or operational tests will either contribute to or lessen the
meaningfulness of score interpretation and considerations of validity,
which is essentially an evaluation of test interpretation and use. If,
as suggested by O'Sullivan (2012) in the framing quote at the
beginning of this article, the test-taker and validation are "at
the heart of test development" (p. 16), then their accounts of
testing experience are an essential source of test validation evidence.
Test developers, test researchers, and other key stakeholders should
also experience test-taking from the perspective of the test-taker.
Walking a mile in test-takers' shoes provides important insights into
how tests measure constructs of interest and how they affect test-taker
performance.
The Authors
Janna Fox, PhD, Associate Professor in Applied Linguistics,
Carleton University, teaches and undertakes research in language testing
and curriculum, with a focus on diagnostic assessment and test
validation. She received a 3M Teaching Fellowship for leadership in
higher education and serves on the Board of Paragon Testing Inc.,
Vancouver, Canada.
Liying Cheng, PhD, is Professor and Director of the Assessment
& Evaluation Group at the Faculty of Education, Queen's
University. Her primary research interests are the impact of large scale
testing on instruction, the relationships between assessment and
instruction, and the academic and professional acculturation of
international and new immigrant students, workers, and professionals to
Canada.
References
Alderson, J. C. (1990). Testing reading comprehension skills (Part
Two): Getting students to talk about taking a reading test. Reading in a
Foreign Language, 7, 465-504.
Bachman, L. F. (2000). Modern language testing at the turn of the
century: Assuring that what we count counts. Language Testing, 17(1),
1-42.
Bachman, L. F., & Palmer, A. (1996). Language testing in
practice. Oxford, UK: Oxford University Press.
Bradshaw, J. (1990). Test-takers' reactions to a placement
test. Language Testing, 7(1), 13-30.
Cassady, J. C., & Johnson, R. E. (2002). Cognitive test anxiety
and academic performance. Contemporary Educational Psychology, 27,
270-295.
Chapelle, C. A., Chung, Y.-R., Hegelheimer, V., Pendar, N. &
Xu, J. (2010). Towards a computer-delivered test of productive
grammatical ability. Language Testing, 27(4), 443-469.
Charmaz, K. (2006). Constructing grounded theory: A practical guide
through qualitative analysis. London, UK: Sage.
Cheng, L., Andrews, S., & Yu, Y. (2011). Impact and
consequences of school-based assessment (SBA): Students' and
parents' views of SBA in Hong Kong. Language Testing, 28(2),
221-249. doi:10.1177/0265532210384253
Cheng, L., & Fox, J. (2008). Towards a better understanding of
academic acculturation: Second language students in Canadian
universities. Canadian Modern Language Review, 65(2), 307-333.
Cheng, L., Klinger, D., Fox, J., Doe, C., Jin, Y., & Wu, J.
(2014). Motivation and test anxiety in test performance across three
testing contexts: The CAEL, CET and GEPT. TESOL Quarterly, 48, 300-330.
doi: 10.1002/tesq.105
Choi, I.-C., Kim, K., & Boo, J. (2003). Comparability of a
paper-based language test and a computer-based language test. Language
Testing, 20(3), 295-320.
Cohen, A., & Upton, T. (2007). "I want to go back to the
text": Response strategies on the reading subtest of the new TOEFL.
Language Testing, 24, 209-250.
DeLuca, C., Cheng, L., Fox, J., Doe, C., & Li, M. (2013).
Putting testing researchers to the test: An exploratory study on the
TOEFL iBT. System, 41, 663-676.
Doe, C., & Fox, J. (2011). Exploring the testing process: Three
test-takers' observed and reported strategy use over time and
testing contexts. Canadian Modern Language Review, 67(1), 29-53.
Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the
difficulty of oral proficiency tasks: What does the test-taker have to
offer? Language Testing, 19, 347-368.
ETS. (2010). TOEFL iBT tips: How to prepare for the TOEFL iBT.
Retrieved from http://www.ets.org/Media/Tests/TOEFL/pdf/TOEFL_Tips.pdf
[Updated information now available at
http://www.ets.org/toefl/ibt/prepare/]
Fox, J. (2003). From products to process: An ecological approach to
bias detection. International Journal of Testing, 3(1), 21-47.
Fox, J., & Cheng, L. (2007). Did we take the same test?
Differing accounts of the Ontario Secondary School Literacy Test by
first and second language test-takers. Assessment in Education, 14(1),
9-26.
Fox, J., Cheng, L., & Zumbo, B. (2014). Do they make a
difference? The impact of English language programs on second language
students in Canadian universities. TESOL Quarterly, 48(1), 57-85.
doi:10.1002/tesq.103
Fox, J., Pychyl, T. & Zumbo, B. (1997). An investigation of
background knowledge in the assessment of language proficiency. In A.
Huhta, V. Kohonen, L. Kurki-Suonio, & S. Luoma (Eds.), Current
developments and alternatives in language assessment (pp. 367-383).
Jyvaskyla, Finland: University of Jyvaskyla Press.
Fulcher, G. (1996). Testing tasks: Issues in task design and the
group oral. Language Testing, 13(1), 23-51.
Fulcher, G. (2003). Interface design in computer-based language
testing. Language Testing, 20(4), 384-408.
Green, A. B. (2006). Washback to the learner: Learner and teacher
perspectives on IELTS preparation course expectations and outcomes.
Assessing Writing, 11(2), 113-134.
Hall, C. (in press). Exploring computer-mediated second language
oral proficiency testing: The test-taker's perspective.
Henning, G. (1987). A guide to language testing. Cambridge, MA:
Newbury House.
Huff, K., & Sireci, S. (2001). Validity issues in computer
based testing. Educational Measurement, Issues and Practice, 20(3),
16-25.
Huhta, A., Kalaja, P., & Pitkanen-Huhta, A. (2006). Discursive
construction of a high-stakes test: The many faces of a test-taker.
Language Testing, 23(3), 326-350.
Jennings, M., Fox, J., Graves, B., & Shohamy, E. (1999). The
test-takers' choice: An investigation of the effect of topic on
language test. Language Testing, 16(4), 426-456.
Kane, M. (2012, March). Validity, fairness, and testing. Paper
presented at conference on conversations on validity around the world.
Teachers College, NY.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures
of performance. Educational Measurement: Issues and Practice, 18(2),
5-17.
Maulin, S. (2004). Language testing using computers: Examining the
effect of test-delivery medium on students' performance. Internet
Journal of e-Language Learning & Teaching, 1(2), 1-14.
McNamara, T. & Roever, C. (2006). Language testing: The social
dimension. Malden, MA: Blackwell.
Messick, S. (1996). Validity and washback in language testing.
Language Testing, 13, 241-256.
Moss, P. A., Girard, B. J., & Haniford, L. C. (2006). Validity
in educational assessment. Review of Research in Education, 30, 109-162.
Ockey, G. (2007). Construct implications of including still image
or video in computer-based listening tests. Language Testing, 24(4),
517-537.
O'Sullivan, B. (2012). A brief history of language testing. In
C. Coombe, P Davidson, B. O'Sullivan, & S. Stoynoff (Eds.), The
Cambridge guide to second language assessment (pp. 9-19). Cambridge, UK:
Cambridge University Press.
Pare, A., & Smart, G. (1994). Observing genres in action:
Towards a research methodology. In A. Freedman & P. Medway (Eds.),
Genre and the new rhetoric (pp. 146-154). London, UK: Taylor &
Francis.
Phakiti, A. (2008). Construct validation of Bachman and
Palmer's (1996) strategic competence model over time in EFL reading
tests. Language Testing, 25, 237-272.
Powers, D. E., Kim, H., Yu, F., Weng, V. Z., & VanWinkle, W.
(2009). The TOEIC[R] speaking and writing tests: Relations to test-taker
perceptions of proficiency in English. ETS Policy and Research Reports,
No. 78. doi:10.1002/j.2333-8504.2009.tb02175.x
Pritchard, R. (1990). The effects of cultural schemata on reading
processing strategies. Reading Research Quarterly, 25, 273-295.
Purpura, J. E. (1998). Investigating the effects of strategy use
and second language test performance with high- and low-ability
test-takers: A structural equation modeling approach. Language Testing,
15, 333-379.
Qi, L. (2007). Is testing an efficient agent for pedagogical
change? Examining the intended wash-back of the writing task in a
high-stakes English test in China. Assessment in Education, 14(1),
51-74.
Richman-Hirsch, W., Olson-Buchanan, J., & Drasgow, F. (2000).
Examining the impact of administration medium on examinee perceptions
and attitudes. Journal of Applied Psychology, 85(6), 880-887.
Sasaki, M. (2000). Effects of cultural schemata on students'
test-taking processes for cloze tests: A multiple data source approach.
Language Testing, 17, 85-114.
Shohamy, E. (1984). Does the testing method make a difference? The
case of reading comprehension. Language Testing, 1, 147-170.
Shohamy, E. (2007). Tests as power tools: Looking back, looking
forward. In J. Fox et al. (Eds.), Language testing reconsidered (pp.
141-152). Ottawa, ON: University of Ottawa Press.
Storey, P. (1997). Examining the test-taking process: A cognitive
perspective on the discourse cloze test. Language Testing, 14, 214-231.
Stricker, L. J. (2004). The performance of native speakers of
English and ESL speakers on the Computer Based TOEFL and the GRE General
Test. Language Testing, 21(2), 146-173.
Sundre, D. L., & Kitsantas, A. (2004). An exploration of the
psychology of the examinee: Can examinee self-regulation and test-taking
motivation predict consequential and non-consequential test performance?
Contemporary Educational Psychology, 29, 6-26.
Swain, M., Huang, L., Barkhoui, K., Brooks, L., & Lapkin, S.
(2009). The speaking section of the TOEFL iBT[TM] (SSTiBT):
Test-takers' reported strategic behaviors (TOEFL iBT Research
Report). Princeton, NJ: ETS.
Taylor, C., Jamieson, J., Eignor, D., & Kirsch, I. (1998). The
relationship between computer familiarity and performance on
computer-based TOEFL test tasks (TOEFL Research Report RR98-08,
TOEFL-RR-61). Retrieved from
http://www.ets.org/research/policy_research_reports/publications/report/1998/hxwk
Appendix
Semistructured Interview Questions
1. I'd like to begin by asking you about your general
experience with the test, your overall feeling, and your overall
experience with this computer-administered test.
2. Were there any real stumbling blocks in your test-taking?
3. Perhaps I could ask you now about specific sections of the test.
4. Could you comment on your experience with the reading section of
the test?
5. Could you comment on your experience with the listening section
of the test?
6. Could you comment on your experience with the writing section of
the test?
7. Could you comment on your experience with the speaking section
of the test?
8. Which section(s) was the most difficult, and why?
9. Any final comments?