Chapter 3: The Reliability of Testing

- The definition of reliability
- The reliability coefficient
- How to make tests more reliable

What is reliability?
Reliability refers to the trustworthiness and stability of candidates' test results. In other words, if a group of students were given the same test twice at different times, the more similar the two sets of scores, the more reliable the test is said to be.

How to establish the reliability of a test?
It is possible to quantify the reliability of a test in the form of a reliability coefficient. Reliability coefficients allow us to compare the reliability of different tests.

The ideal reliability coefficient is 1.
- A test with a reliability coefficient of 1 is one which would give precisely the same results for a particular set of candidates regardless of when it happened to be administered.
- A test which had a reliability coefficient of zero would give sets of results quite unconnected with each other.
It is between the two extremes of 1 and zero that genuine test reliability coefficients are to be found.

How high a coefficient should we expect for different types of language tests?
Lado says: Good vocabulary, structure and reading tests are usually in the 0.9 to 0.99 range, while auditory comprehension tests are more often in the 0.8 to 0.89 range. A reliability coefficient of 0.85 might be considered high for an oral production test but low for a reading test.

The ways to establish the reliability of a test:
1. Test-retest method
This means having two sets of scores for comparison. The most obvious way of obtaining these is to get a group of subjects to take the same test twice.
2. Split-half method
In this method, the subjects take the test in the usual way, but each subject is given two scores. One score is for one half of the test, the second score is for the other half. The two sets of scores are then used to obtain the reliability coefficient as if the whole test had been taken twice.
In order for this method to work, it is necessary for the test to be split into two halves which are really equivalent, through the careful matching of items (in fact, where items in the test have been ordered in terms of difficulty, a split into odd-numbered items and even-numbered items may be adequate).
3. Parallel forms method (the alternate forms method)
This means using two different forms of the same test to measure a group of students consecutively or within a very short time. However, alternate forms are often simply not available.

How to make tests more reliable
As we have seen, there are two components of test reliability: the performance of candidates from occasion to occasion, and the reliability of the scoring.
Here we will begin by suggesting ways of achieving consistent performances from candidates and then turn our attention to scorer reliability.

1. Take enough samples of behavior
Other things being equal, the more items that you have on a test, the more reliable that test will be.
e.g. If we wanted to know
how good an archer someone was, we wouldn't rely on the evidence of a single shot at the target. That one shot could be quite unrepresentative of their ability. To be satisfied that we had a really reliable measure of the ability, we should want to see a large number of shots at the target.
The same is true for language testing. It has been demonstrated empirically that the addition of further items will make a test more reliable.
The additional items should be independent of each other and of existing items.
e.g. A reading test asks the question: "Where did the thief hide the jewels?" If an additional item following that took the form "What was unusual about the hiding place?", would it make a full contribution to an increase in the reliability of the test?
No. Why not? Because it is hardly possible for someone who got the original question wrong to get the supplementary question right. We do not get an additional sample of their behavior, so the reliability of our estimate of their ability is not increased.
Each additional item should as far as possible represent a fresh start for the candidate.

Do you think that the longer a test is, the more reliable it will be?
It is important to make a test long enough to achieve satisfactory reliability, but it should not be made so long that the candidates become so bored or tired that the behavior that they exhibit becomes unrepresentative of their ability.

2. Do not allow candidates too much freedom
In general, candidates should not be given a choice, and
the range over which possible answers might vary should be restricted.
Compare the following writing tasks:
a) Write a composition on tourism.
b) Write a composition on tourism in this country.
c) Write a composition on how we might develop the tourist industry in this country.
d) Discuss the following measures intended to increase the number of foreign tourists coming to this country:
   i) More/better advertising and/or information (Where? What form should it take?)
   ii) Improve facilities (hotels, transportation, communication etc.).
   iii) Training of personnel (guides, hotel managers etc.)
The successive tasks impose more and more control over what is written. The fourth task is likely to be a much more reliable indicator of writing ability than the first.
But in restricting the students we must be careful not to distort too much the task that we really want to see them perform.

3. Write unambiguous items
It is essential that candidates should not be presented with items whose meaning is not clear or to which there is an acceptable answer which the test writer has not anticipated.
The best way to arrive at unambiguous items is, having drafted them, to subject them to the critical scrutiny of colleagues, who should try as hard as they can to find alternative interpretations to the ones intended.

4. Provide clear and explicit instructions
This applies both to written and oral instructions. If it is possible for candidates to misinterpret what they are asked to do, then on some occasions some of them certainly will.
A common fault of tests written for the students of a particular teaching institution is the supposition that the students all know what is intended by carelessly worded instructions. The frequency of the complaint that students are unintelligent, have been stupid, or have willfully misunderstood what they were asked to do reveals that the supposition is often unwarranted. Test writers should not rely on the students' powers of telepathy to elicit the desired behavior.
The best means of avoiding problems is the use of colleagues to criticize drafts of instructions (including those which will be spoken). Spoken instructions should always be read from a prepared text in order to avoid introducing confusion.

5. Ensure that tests are well laid out and perfectly legible
Too often, institutional tests are badly typed (or handwritten), have too much text in too small a space, and are poorly reproduced. As a result, students are faced with additional tasks which are not ones meant to measure their language ability. Their variable performance on the unwanted tasks will lower the reliability of a test.

6. Candidates should be familiar with format and testing techniques
If any aspect of a
test is unfamiliar to candidates, they are likely to perform less well than they would do otherwise. For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them. This may mean the distribution of sample tests (or of past test papers), or at least the provision of practice materials in the case of tests set within teaching institutions.

7. Provide uniform and non-distracting conditions of administration
The greater the differences between one administration of a test and another, the greater the differences one can expect between a candidate's performance on the two occasions. Great care should be taken to ensure uniformity.
e.g. Timing should be specified and strictly adhered to. The acoustic conditions should be similar for all administrations of a listening test. Every precaution should be taken to maintain a quiet setting with no distracting sounds or movements.

How to obtain scorer reliability

1. Use items that permit scoring which is as objective as possible
This may appear to be a recommendation to use multiple choice items, which permit completely objective scoring. This is not intended. While it would be mistaken to say that multiple choice items are never appropriate, it is certainly true that there are many circumstances in which they are quite inappropriate. What is more, good multiple choice items are notoriously difficult to write and always require extensive pretesting.
An alternative to multiple choice is the open-ended item which has a unique, possibly one-word, correct response which the candidates produce themselves. This too should ensure objective scoring, but in fact problems with such matters as spelling which makes a candidate's meaning unclear often make demands on the scorer's judgment. The longer the required response, the greater the difficulties of this kind.
One way of dealing with this is to structure the candidate's response by providing part of it.
e.g. The open-ended question "What was different about the results?" may be designed to elicit the response "Success was closely associated with high motivation." This is likely to cause problems for scoring. Greater scorer reliability will probably be achieved if the question is followed by:
______ was more closely associated with ______.

2. Make comparisons between candidates as direct as possible
This reinforces the suggestion already made that
candidates should not be given a choice of items and that they should be limited in the way that they are allowed to respond. Scoring the compositions all on one topic will be more reliable than if the candidates are allowed to choose from six topics, as has been the case in some well-known tests.

3. Provide a detailed scoring key
This should specify acceptable answers and assign points for partially correct responses. For high scorer reliability the key should be as detailed as possible in its assignment of points. It should be the outcome of efforts to anticipate all possible responses and should have been subjected to group criticism. (This advice applies only where responses can be classed as partially or totally correct, not in the case of compositions, for instance.)

4. Train scorers
This is especially important where scoring is more subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score accurately compositions from past administrations. After each administration, patterns of scoring should be analyzed. Individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.

5. Agree acceptable responses and appropriate scores at the outset of scoring
A sample of scripts should be taken immediately after the administration of the test. Where there are compositions, archetypical representatives of different levels of ability should be selected. Only when all scorers are agreed on the scores to be given to these should real scoring begin.
For short answer questions, the scorers should note any difficulties they have in assigning points (the key is unlikely to have anticipated every relevant response), and bring these to the attention of whoever is supervising that part of the scoring. Once a decision has been taken as to the points to be assigned, the supervisor should convey it to all the scorers concerned.

6. Identify candidates by number, not name
Scorers inevitably have expectations of candidates that they know. Except in purely objective testing, this will affect the way that they score.
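
As a rough illustration of scoring by number rather than name, the Python sketch below gives each candidate a randomly assigned number and strips the name from each script before it goes to the scorers. The function names `assign_candidate_numbers` and `anonymize_scripts`, the sample names and the script texts are all invented for this example; it is one possible approach under those assumptions, not a prescribed procedure.

```python
import random


def assign_candidate_numbers(names, seed=None):
    """Map each candidate name to a unique number in random order.

    The returned dict is the key kept by the test administrator.
    Scorers would only ever see the numbers, never this key.
    """
    rng = random.Random(seed)
    unique_names = sorted(set(names))
    numbers = list(range(1, len(unique_names) + 1))
    # Shuffle so that a number does not reveal alphabetical position.
    rng.shuffle(numbers)
    return dict(zip(unique_names, numbers))


def anonymize_scripts(scripts, key):
    """Replace the name on each (name, text) script with its number."""
    return [(key[name], text) for name, text in scripts]


# Hypothetical scripts: (candidate name, written response).
scripts = [("Maria Lopez", "Essay A ..."), ("Chen Wei", "Essay B ...")]
key = assign_candidate_numbers(name for name, _ in scripts)
blind_scripts = anonymize_scripts(scripts, key)
```

Only `blind_scripts` would be passed to the scorers; the key stays with whoever administers the test, so that scores can be matched back to names once scoring is complete.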

Studies have shown that even where the candidates are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given.
e.g. A scorer may be influenced by the gender or nationality of a name into making predictions which can affect the score given. The identification of candidates only by number will reduce such effects.

7. Employ multiple, independ


