Psychology Reliability homework help

Modules Chapter 5 wk2 p655

C H A P T E R 5

Reliability

In everyday conversation, reliability is a synonym for dependability or consistency. We speak of the train that is so reliable you can set your watch by it. If we’re lucky, we have a reliable friend who is always there for us in a time of need.

Broadly speaking, in the language of psychometrics reliability refers to consistency in measurement. And whereas in everyday conversation reliability always connotes something positive, in the psychometric sense it really only refers to something that is consistent—not necessarily consistently good or bad, but simply consistent.

It is important for us, as users of tests and consumers of information about tests, to know how reliable tests and other measurement procedures are. But reliability is not an all-or-none matter. A test may be reliable in one context and unreliable in another. There are different types and degrees of reliability. A reliability coefficient is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance. In this chapter, we explore different kinds of reliability coefficients, including those for measuring test-retest reliability, alternate-forms reliability, split-half reliability, and inter-scorer reliability.

The Concept of Reliability

Recall from our discussion of classical test theory that a score on an ability test is presumed to reflect not only the testtaker’s true score on the ability being measured but also error.1 In its broadest sense, error refers to the component of the observed test score that does not have to do with the testtaker’s ability. If we use X to represent an observed score, T to represent a true score, and E to represent error, then the fact that an observed score equals the true score plus error may be expressed as follows:

A statistic useful in describing sources of test score variability is the variance (σ2)—the standard deviation squared. This statistic is useful because it can be broken into components. Page 142Variance from true differences is true variance , and variance from irrelevant, random sources is error variance . If σ2 represents the total variance, the true variance, and the error variance, then the relationship of the variances can be expressed as

In this equation, the total variance in an observed distribution of test scores (σ2) equals the sum of the true variance plus the error variance . The term reliability refers to the proportion of the total variance attributed to true variance. The greater the proportion of the total variance attributed to true variance, the more reliable the test. Because true differences are assumed to be stable, they are presumed to yield consistent scores on repeated administrations of the same test as well as on equivalent forms of tests. Because error variance may increase or decrease a test score by varying amounts, consistency of the test score—and thus the reliability—can be affected.

In general, the term measurement error refers to, collectively, all of the factors associated with the process of measuring some variable, other than the variable being measured. To illustrate, consider an English-language test on the subject of 12th-grade algebra being administered, in English, to a sample of 12-grade students, newly arrived to the United States from China. The students in the sample are all known to be “whiz kids” in algebra. Yet for some reason, all of the students receive failing grades on the test. Do these failures indicate that these students really are not “whiz kids” at all? Possibly. But a researcher looking for answers regarding this outcome would do well to evaluate the English-language skills of the students. Perhaps this group of students did not do well on the algebra test because they could neither read nor understand what was required of them. In such an instance, the fact that the test was written and administered in English could have contributed in large part to the measurement error in this evaluation. Stated another way, although the test was designed to evaluate one variable (knowledge of algebra), scores on it may have been more reflective of another variable (knowledge of and proficiency in English language). This source of measurement error (the fact that the test was written and administered in English) could have been eliminated by translating the test and administering it in the language of the testtakers.

Measurement error, much like error in general, can be categorized as being either systematic or random. Random error is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process. Sometimes referred to as “noise,” this source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores. Examples of random error that could conceivably affect test scores range from unanticipated events happening in the immediate vicinity of the test environment (such as a lightning strike or a spontaneous “occupy the university” rally), to unanticipated physical events happening within the testtaker (such as a sudden and unexpected surge in the testtaker’s blood sugar or blood pressure).

JUST THINK . . .

What might be a source of random error inherent in all the tests an assessor administers in his or her private office?

In contrast to random error, systematic error refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured. For example, a 12-inch ruler may be found to be, in actuality, a tenth of one inch longer than 12 inches. All of the 12-inch measurements previously taken with that ruler were systematically off by one-tenth of an inch; that is, anything measured to be exactly 12 inches with that ruler was, in reality, 12 and one-tenth inches. In this example, it is the measuring instrument itself that has been found to be a source of systematic error. Once a systematic error becomes known, it becomes predictable—as well as fixable. Note also that a systematic source of error does not affect score consistency. So, for example, suppose a measuring instrument such as the official weight scale used on The Biggest Loser television Page 143program consistently underweighed by 5 pounds everyone who stepped on it. Regardless of this (systematic) error, the relative standings of all of the contestants weighed on that scale would remain unchanged. A scale underweighing all contestants by 5 pounds simply amounts to a constant being subtracted from every “score.” Although weighing contestants on such a scale would not yield a true (or valid) weight, such a systematic error source would not change the variability of the distribution or affect the measured reliability of the instrument. In the end, the individual crowned “the biggest loser” would indeed be the contestant who lost the most weight—it’s just that he or she would actually weigh 5 pounds more than the weight measured by the show’s official scale. Now moving from the realm of reality television back to the realm of psychological testing and assessment, let’s take a closer look at the source of some error variance commonly encountered during testing and assessment.

JUST THINK . . .

What might be a source of systematic error inherent in all the tests an assessor administers in his or her private office?

Sources of Error Variance

Sources of error variance include test construction, administration, scoring, and/or interpretation.

Test construction

One source of variance during test construction is item sampling or content sampling , terms that refer to variation among items within a test as well as to variation among items between tests. Consider two or more tests designed to measure a specific skill, personality attribute, or body of knowledge. Differences are sure to be found in the way the items are worded and in the exact content sampled. Each of us has probably walked into an achievement test setting thinking “I hope they ask this question” or “I hope they don’t ask that question.” If the only questions on the examination were the ones we hoped would be asked, we might achieve a higher score on that test than on another test purporting to measure the same thing. The higher score would be due to the specific content sampled, the way the items were worded, and so on. The extent to which a testtaker’s score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance. From the perspective of a test creator, a challenge in test development is to maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance.

Test administration

Sources of error variance that occur during test administration may influence the testtaker’s attention or motivation. The testtaker’s reactions to those influences are the source of one kind of error variance. Examples of untoward influences during administration of a test include factors related to the test environment: room temperature, level of lighting, and amount of ventilation and noise, for instance. A relentless fly may develop a tenacious attraction to an examinee’s face. A wad of gum on the seat of the chair may make itself known only after the testtaker sits down on it. Other environment-related variables include the instrument used to enter responses and even the writing surface on which responses are entered. A pencil with a dull or broken point can make it difficult to blacken the little grids. The writing surface on a school desk may be riddled with heart carvings, the legacy of past years’ students who felt compelled to express their eternal devotion to someone now long forgotten. External to the test environment in a global sense, the events of the day may also serve as a source of error. So, for example, test results may vary depending upon whether the testtaker’s country is at war or at peace (Gil et al., 2016). A variable of interest when evaluating a patient’s general level of suspiciousness or fear is the patient’s home neighborhood and lifestyle. Especially in patients who live in and must cope daily with an unsafe neighborhood, Page 144what is actually adaptive fear and suspiciousness can be misinterpreted by an interviewer as psychotic paranoia (Wilson et al., 2016).

Other potential sources of error variance during test administration are testtaker variables. Pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication can all be sources of error variance. Formal learning experiences, casual life experiences, therapy, illness, and changes in mood or mental state are other potential sources of testtaker-related error variance. It is even conceivable that significant changes in the testtaker’s body weight could be a source of error variance. Weight gain and obesity are associated with a rise in fasting glucose level—which in turn is associated with cognitive impairment. In one study that measured performance on a cognitive task, subjects with high fasting glucose levels made nearly twice as many errors as subjects whose fasting glucose level was in the normal range (Hawkins et al., 2016).

Examiner-related variables are potential sources of error variance. The examiner’s physical appearance and demeanor—even the presence or absence of an examiner—are some factors for consideration here. Some examiners in some testing situations might knowingly or unwittingly depart from the procedure prescribed for a particular test. On an oral examination, some examiners may unwittingly provide clues by emphasizing key words as they pose questions. They might convey information about the correctness of a response through head nodding, eye movements, or other nonverbal gestures. In the course of an interview to evaluate a patient’s suicidal risk, highly religious clinicians may be more inclined than their moderately religious counterparts to conclude that such risk exists (Berman et al., 2015). Clearly, the level of professionalism exhibited by examiners is a source of error variance.

Test scoring and interpretation

In many tests, the advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences. However, not all tests can be scored from grids blackened by no. 2 pencils. Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral measures, essay tests, portfolio assessment, situational behavior tests, and countless other tools of assessment still require scoring by trained personnel.

Manuals for individual intelligence tests tend to be very explicit about scoring criteria, lest examinees’ measured intelligence vary as a function of who is doing the testing and scoring. In some tests of personality, examinees are asked to supply open-ended responses to stimuli such as pictures, words, sentences, and inkblots, and it is the examiner who must then quantify or qualitatively evaluate responses. In one test of creativity, examinees might be given the task of creating as many things as they can out of a set of blocks. Here, it is the examiner’s task to determine which block constructions will be awarded credit and which will not. For a behavioral measure of social skills in an inpatient psychiatric service, the scorers or raters might be asked to rate patients with respect to the variable “social relatedness.” Such a behavioral measure might require the rater to check yes or no to items like Patient says “Good morning” to at least two staff members.

JUST THINK . . .

Can you conceive of a test item on a rating scale requiring human judgment that all raters will score the same 100% of the time?

Scorers and scoring systems are potential sources of error variance. A test may employ objective-type items amenable to computer scoring of well-documented reliability. Yet even then, a technical glitch might contaminate the data. If subjectivity is involved in scoring, then the scorer (or rater) can be a source of error variance. Indeed, despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiner/scorers occasionally still are confronted by situations where an examinee’s response lies in a gray area. The element of subjectivity in scoring may be much greater in the administration of certain nonobjective-type personality tests, tests of creativity (such as the block test just described), and certain academic tests (such as essay examinations). Subjectivity in scoring can even enter into Page 145behavioral assessment. Consider the case of two behavior observers given the task of rating one psychiatric inpatient on the variable of “social relatedness.” On an item that asks simply whether two staff members were greeted in the morning, one rater might judge the patient’s eye contact and mumbling of something to two staff members to qualify as a yes response. The other observer might feel strongly that a no response to the item is appropriate. Such problems in scoring agreement can be addressed through rigorous training designed to make the consistency—or reliability—of various scorers as nearly perfect as can be.

Other sources of error

Surveys and polls are two tools of assessment commonly used by researchers who study public opinion. In the political arena, for example, researchers trying to predict who will win an election may sample opinions from representative voters and then draw conclusions based on their data. However, in the “fine print” of those conclusions is usually a disclaimer that the conclusions may be off by plus or minus a certain percent. This fine print is a reference to the margin of error the researchers estimate to exist in their study. The error in such research may be a result of sampling error—the extent to which the population of voters in the study actually was representative of voters in the election. The researchers may not have gotten it right with respect to demographics, political party affiliation, or other factors related to the population of voters. Alternatively, the researchers may have gotten such factors right but simply did not include enough people in their sample to draw the conclusions that they did. This brings us to another type of error, called methodological error. So, for example, the interviewers may not have been trained properly, the wording in the questionnaire may have been ambiguous, or the items may have somehow been biased to favor one or another of the candidates.

Certain types of assessment situations lend themselves to particular varieties of systematic and nonsystematic error. For example, consider assessing the extent of agreement between partners regarding the quality and quantity of physical and psychological abuse in their relationship. As Moffitt et al. (1997) observed, “Because partner abuse usually occurs in private, there are only two persons who ‘really’ know what goes on behind closed doors: the two members of the couple” (p. 47). Potential sources of nonsystematic error in such an assessment situation include forgetting, failing to notice abusive behavior, and misunderstanding instructions regarding reporting. A number of studies (O’Leary & Arias, 1988; Riggs et al., 1989; Straus, 1979) have suggested that underreporting or overreporting of perpetration of abuse also may contribute to systematic error. Females, for example, may underreport abuse because of fear, shame, or social desirability factors and overreport abuse if they are seeking help. Males may underreport abuse because of embarrassment and social desirability factors and overreport abuse if they are attempting to justify the report.

Just as the amount of abuse one partner suffers at the hands of the other may never be known, so the amount of test variance that is true relative to error may never be known. A so-called true score, as Stanley (1971, p. 361) put it, is “not the ultimate fact in the book of the recording angel.” Further, the utility of the methods used for estimating true versus error variance is a hotly debated matter (see Collins, 1996; Humphreys, 1996; Williams & Zimmerman, 1996a, 1996b). Let’s take a closer look at such estimates and how they are derived.

Reliability Estimates

Test-Retest Reliability Estimates

A ruler made from the highest-quality steel can be a very reliable instrument of measurement. Every time you measure something that is exactly 12 inches long, for example, your ruler will tell you that what you are measuring is exactly 12 inches long. The reliability of this instrument Page 146of measurement may also be said to be stable over time. Whether you measure the 12 inches today, tomorrow, or next year, the ruler is still going to measure 12 inches as 12 inches. By contrast, a ruler constructed of putty might be a very unreliable instrument of measurement. One minute it could measure some known 12-inch standard as 12 inches, the next minute it could measure it as 14 inches, and a week later it could measure it as 18 inches. One way of estimating the reliability of a measuring instrument is by using the same instrument to measure the same thing at two points in time. In psychometric parlance, this approach to reliability evaluation is called the test-retest method, and the result of such an evaluation is an estimate of test-retest reliability.

Test-retest reliability is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. The test-retest measure is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait. If the characteristic being measured is assumed to fluctuate over time, then there would be little sense in assessing the reliability of the test using the test-retest method.

As time passes, people change. For example, people may learn new things, forget some things, and acquire new skills. It is generally the case (although there are exceptions) that, as the time interval between administrations of the same test increases, the correlation between the scores obtained on each testing decreases. The passage of time can be a source of error variance. The longer the time that passes, the greater the likelihood that the reliability coefficient will be lower. When the interval between testing is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability .

An estimate of test-retest reliability from a math test might be low if the testtakers took a math tutorial before the second test was administered. An estimate of test-retest reliability from a personality profile might be low if the testtaker suffered some emotional trauma or received counseling during the intervening period. A low estimate of test-retest reliability might be found even when the interval between testings is relatively brief. This may well be the case when the testings occur during a time of great developmental change with respect to the variables they are designed to assess. An evaluation of a test-retest reliability coefficient must therefore extend beyond the magnitude of the obtained coefficient. If we are to come to proper conclusions about the reliability of the measuring instrument, evaluation of a test-retest reliability estimate must extend to a consideration of possible intervening factors between test administrations.

An estimate of test-retest reliability may be most appropriate in gauging the reliability of tests that employ outcome measures such as reaction time or perceptual judgments (including discriminations of brightness, loudness, or taste). However, even in measuring variables such as these, and even when the time period between the two administrations of the test is relatively small, various factors (such as experience, practice, memory, fatigue, and motivation) may intervene and confound an obtained measure of reliability.2

Taking a broader perspective, psychological science, and science in general, demands that the measurements obtained by one experimenter be replicable by other experimenters using the same instruments of measurement and following the same procedures. However, as observed in this chapter’s Close-Up , a replicability problem of epic proportions appears to be brewing.Page 147

CLOSE-UP

Psychology’s Replicability Crisis*

In the mid-2000s, academic scientists became concerned that science was not being performed rigorously enough to prevent spurious results from reaching consensus within the scientific community. In other words, they worried that scientific findings, although peer-reviewed and published, were not replicable by independent parties. Since that time, hundreds of researchers have endeavored to determine if there is really a problem, and if there is, how to curb it. In 2015, a group of researchers called the Open Science Collaboration attempted to redo 100 psychology studies that had already been peer-reviewed and published in leading journals (Open Science Collaboration, 2015). Their results, published in the journal Science, indicated that, depending on the criteria used, only 40–60% of replications found the same results as the original studies. This low replication rate helped confirm that science indeed had a problem with replicability, the seriousness of which is reflected in the term replicability crisis.

Why and how did this crisis of replicability emerge? Here it will be argued that the major causal factors are (1) a general lack of published replication attempts in the professional literature, (2) editorial preferences for positive over negative findings, and (3) questionable research practices on the part of authors of published studies. Let’s consider each of these factors.

Lack of Published Replication Attempts

Journals have long preferred to publish novel results instead of replications of previous work. In fact, a recent study found that only 1.07% of the published psychological scientific literature sought to directly replicate previous work (Makel et al., 2012). Academic scientists, who depend on publication in order to progress in their careers, respond to this bias by focusing their research on unexplored phenomena instead of replications. The implications for science are dire. Replication by independent parties provides for confidence in a finding, reducing the likelihood of experimenter bias and statistical anomaly. Indeed, had scientists been as focused on replication as they were on hunting down novel results, the field would likely not be in crisis now.

Editorial Preference for Positive over Negative Findings

Journals prefer positive over negative findings. “Positive” in this context does not refer to how upbeat, beneficial, or heart-warming the study is. Rather, positive refers to whether the study concluded that an experimental effect existed. Stated another way, and drawing on your recall from that class you took in experimental methods, positive findings typically entail a rejection of the null hypothesis. In essence, from the perspective of most journals, rejecting the null hypothesis as a result of a research study is a newsworthy event. By contrast, accepting the null hypothesis might just amount to “old news.”

The fact that journals are more apt to publish positive rather than negative studies has consequences in terms of the types of studies that even get submitted for publication. Studies submitted for publication typically report the existence of an effect rather than the absence of one. The vast majority of studies that actually get published also report the existence of an effect. Those studies designed to disconfirm reports of published effects are few-and-far-between to begin with, and may not be deemed publishable even when they are conducted and submitted to a journal for review. The net result is that scientists, policy-makers, judges, and anyone else who has occasion to rely on published research may have a difficult time determining the actual strength and robustness of a reported finding.

Questionable Research Practices (QRPs)

In this admittedly nonexhaustive review of factors contributing to the replicability crisis, the third factor is QRPs. Included here are questionable scientific practices that do not rise to the level of fraud but still introduce error into bodies of scientific evidence. For example, a recent survey of psychological scientists found that nearly 60% of the respondents reported that they decided to collect more data after peeking to see if their already-collected data had reached statistical significance (John et al., 2012). While this procedure may seem relatively benign, it is not. Imagine you are trying to determine if a nickel is fair, or weighted toward heads. Rather than establishing the number flips you plan on performing prior to your “test,” you just start flipping and from time-to-time check how many times the coin has come up heads. After a run of five heads, you notice that your weighted-coin hypothesis is looking strong and decide to stop flipping. The nonindependence between the decision to collect data and the data themselves introduces bias. Over the course of many studies, such practices can seriously undermine a body of research.

There are many other sorts of QRPs. For example, one variety entails the researcher failing to report all of the research undertaken in a research program, and then Page 148selectively only reporting the studies that confirm a particular hypothesis. With only the published study in hand, and without access to the researchers’ records, it would be difficult if not impossible for the research consumer to discern important milestones in the chronology of the research (such as what studies were conducted in what sequence, and what measurements were taken).

One proposed remedy for such QRPs is preregistration (Eich, 2014). Preregistration involves publicly committing to a set of procedures prior to carrying out a study. Using such a procedure, there can be no doubt as to the number of observations planned, and the number of measures anticipated. In fact, there are now several websites that allow researchers to preregister their research plans. It is also increasingly common for academic journals to demand preregistration (or at least a good explanation for why the study wasn’t preregistered). Alternatively, some journals award special recognition to studies that were preregistered so that readers can have more confidence in the replicability of the reported findings.

Lessons Learned from the Replicability Crisis

The replicability crisis represents an important learning opportunity for scientists and students. Prior to such replicability issues coming to light, it was typically assumed that science would simply self-correct over the long run. This means that at some point in time, the nonreplicable study would be exposed as such, and the scientific record would somehow be straightened out. Of course, while some self-correction does occur, it occurs neither fast enough nor often enough, nor in sufficient magnitude. The stark reality is that unreliable findings that reach general acceptance can stay in place for decades before they are eventually disconfirmed. And even when such long-standing findings are proven incorrect, there is no mechanism in place to alert other scientists and the public of this fact.

Traditionally, science has only been admitted into courtrooms if an expert attests that the science has reached “general acceptance” in the scientific community from which it comes. However, in the wake of science’s replicability crisis, it is not at all uncommon for findings to meet this general acceptance standard. Sadly, the standard may be met even if the findings from the subject study are questionable at best, or downright inaccurate at worst. Fortunately, another legal test has been created in recent years (Chin, 2014). In this test, judges are asked to play a gatekeeper role and only admit scientific evidence if it has been properly tested, has a sufficiently low error rate, and has been peer-reviewed and published. In this latter test, judges can ask more sensible questions, such as whether the study has been replicated and if the testing was done using a safeguard like preregistration.

Conclusion

Spurred by the recognition of a crisis of replicability, science is moving to right from both past and potential wrongs. As previously noted, there are now mechanisms in place for preregistration of experimental designs and growing acceptance of the importance of doing so. Further, organizations that provide for open science (e.g., easy and efficient preregistration) are receiving millions of dollars in funding to provide support for researchers seeking to perform more rigorous research. Moreover, replication efforts—beyond even that of the Open Science Collaboration—are becoming more common (Klein et al, 2013). Overall, it appears that most scientists now recognize replicability as a concern that needs to be addressed with meaningful changes to what has constituted “business-as-usual” for so many years.

Effectively addressing the replicability crisis is important for any profession that relies on scientific evidence. Within the field of law, for example, science is used every day in courtrooms throughout the world to prosecute criminal cases and adjudicate civil disputes. Everyone from a criminal defendant facing capital punishment to a major corporation arguing that its violent video games did not promote real-life violence may rely at some point in a trial on a study published in a psychology journal. Appeals are sometimes limited. Costs associated with legal proceedings are often prohibitive. With a momentous verdict in the offing, none of the litigants has the luxury of time—which might amount to decades, if at all—for the scholarly research system to self-correct.

When it comes to psychology’s replicability crisis, there is good and bad news. The bad news is that it is real, and that it has existed perhaps, since scientific studies were first published. The good news is that the problem has finally been recognized, and constructive steps are being taken to address it.

Used with permission of Jason Chin.

*This Close-Up was guest-authored by Jason Chin of the University of Toronto.

Page 149

Parallel-Forms and Alternate-Forms Reliability Estimates

If you have ever taken a makeup exam in which the questions were not all the same as on the test initially given, you have had experience with different forms of a test. And if you have ever wondered whether the two forms of the test were really equivalent, you have wondered about the alternate-forms or parallel-forms reliability of the test. The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, which is often termed the coefficient of equivalence .

Although frequently used interchangeably, there is a difference between the terms alternate forms and parallel forms. Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are equal. In theory, the means of scores obtained on parallel forms correlate equally with the true score. More practically, scores obtained on parallel tests correlate equally with other measures. The term parallel forms reliability refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.

Alternate forms are simply different versions of a test that have been constructed so as to be parallel. Although they do not meet the requirements for the legitimate designation “parallel,” alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level of difficulty. The term alternate forms reliability refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.

JUST THINK . . .

You missed the midterm examination and have to take a makeup exam. Your classmates tell you that they found the midterm impossibly difficult. Your instructor tells you that you will be taking an alternate form, not a parallel form, of the original test. How do you feel about that?

Obtaining estimates of alternate-forms reliability and parallel-forms reliability is similar in two ways to obtaining an estimate of test-retest reliability: (1) Two test administrations with the same group are required, and (2) test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy (although not as much as when the same test is administered twice). An additional source of error variance, item sampling, is inherent in the computation of an alternate- or parallel-forms reliability coefficient. Testtakers may do better or worse on a specific form of the test not as a function of their true ability but simply because of the particular items that were selected for inclusion in the test.3

Developing alternate forms of tests can be time-consuming and expensive. Imagine what might be involved in trying to create sets of equivalent items and then getting the same people to sit for repeated administrations of an experimental test! On the other hand, once an alternate or parallel form of a test has been developed, it is advantageous to the test user in several ways. For example, it minimizes the effect of memory for the content of a previously administered form of the test.

JUST THINK . . .

From the perspective of the test user, what are other possible advantages of having alternate or parallel forms of the same test?

Certain traits are presumed to be relatively stable in people over time, and we would expect tests measuring those traits—alternate forms, parallel forms, or otherwise—to reflect that stability. As an example, we expect that there will be, and in fact there is, a reasonable degree of stability in scores on intelligence tests. Conversely, we might expect relatively little stability in scores obtained on a measure of state anxiety (anxiety felt at the moment).Page 150

An estimate of the reliability of a test can be obtained without developing an alternate form of the test and without having to administer the test twice to the same people. Deriving this type of estimate entails an evaluation of the internal consistency of the test items. Logically enough, it is referred to as an internal consistency estimate of reliability or as an estimate of inter-item consistency . There are different methods of obtaining internal consistency estimates of reliability. One such method is the split-half estimate.

Split-Half Reliability Estimates

An estimate of split-half reliability is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. It is a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense). The computation of a coefficient of split-half reliability generally entails three steps:

· Step 1. Divide the test into equivalent halves.

· Step 2. Calculate a Pearson r between scores on the two halves of the test.

· Step 3. Adjust the half-test reliability using the Spearman–Brown formula (discussed shortly).

When it comes to calculating split-half reliability coefficients, there’s more than one way to split a test—but there are some ways you should never split a test. Simply dividing the test in the middle is not recommended because it’s likely that this procedure would spuriously raise or lower the reliability coefficient. Different amounts of fatigue for the first as opposed to the second part of the test, different amounts of test anxiety, and differences in item difficulty as a function of placement in the test are all factors to consider.

One acceptable way to split a test is to randomly assign items to one or the other half of the test. Another acceptable way to split a test is to assign odd-numbered items to one half of the test and even-numbered items to the other half. This method yields an estimate of split-half reliability that is also referred to as odd-even reliability . 4 Yet another way to split a test is to divide the test by content so that each half contains items equivalent with respect to content and difficulty. In general, a primary objective in splitting a test in half for the purpose of obtaining a split-half reliability estimate is to create what might be called “mini-parallel-forms,” with each half equal to the other—or as nearly equal as humanly possible—in format, stylistic, statistical, and related aspects.

Step 2 in the procedure entails the computation of a Pearson r, which requires little explanation at this point. However, the third step requires the use of the Spearman–Brown formula.

The Spearman–Brown formula

The Spearman–Brown formula allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test. It is a specific application of a more general formula to estimate the reliability of a test that is lengthened or shortened by any number of items. Because the reliability of a test is affected by its length, a formula is necessary for estimating the reliability of a test that has been shortened or lengthened. The general Spearman–Brown (rSB) formula is

Page 151

where rSB is equal to the reliability adjusted by the Spearman–Brown formula, rxy is equal to the Pearson r in the original-length test, and n is equal to the number of items in the revised version divided by the number of items in the original version.

By determining the reliability of one half of a test, a test developer can use the Spearman–Brown formula to estimate the reliability of a whole test. Because a whole test is two times longer than half a test, n becomes 2 in the Spearman–Brown formula for the adjustment of split-half reliability. The symbol rhh stands for the Pearson r of scores in the two half tests:

Usually, but not always, reliability increases as test length increases. Ideally, the additional test items are equivalent with respect to the content and the range of difficulty of the original items. Estimates of reliability based on consideration of the entire test therefore tend to be higher than those based on half of a test. Table 5–1 shows half-test correlations presented alongside adjusted reliability estimates for the whole test. You can see that all the adjusted correlations are higher than the unadjusted correlations. This is so because Spearman–Brown estimates are based on a test that is twice as long as the original half test. For the data from the kindergarten pupils, for example, a half-test reliability of .718 is estimated to be equivalent to a whole-test reliability of .836.

Grade	Half-Test Correlation (unadjusted r )	Whole-Test Estimate (rSB)
K	.718	.836
1	.807	.893
2	.777	.875
Table 5–1 Odd-Even Reliability Coefficients before and after the Spearman-Brown Adjustment*

*For scores on a test of mental ability

If test developers or users wish to shorten a test, the Spearman–Brown formula may be used to estimate the effect of the shortening on the test’s reliability. Reduction in test size for the purpose of reducing test administration time is a common practice in certain situations. For example, the test administrator may have only limited time with a particular testtaker or group of testtakers. Reduction in test size may be indicated in situations where boredom or fatigue could produce responses of questionable meaningfulness.

JUST THINK . . .

What are other situations in which a reduction in test size or the time it takes to administer a test might be desirable? What are the arguments against reducing test size?

A Spearman–Brown formula could also be used to determine the number of items needed to attain a desired level of reliability. In adding items to increase test reliability to a desired level, the rule is that the new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured. If the reliability of the original test is relatively low, then it may be impractical to increase the number of items to reach an acceptable level of reliability. Another alternative would be to abandon this relatively unreliable instrument and locate—or develop—a suitable alternative. The reliability of the instrument could also be raised in some way. For example, the reliability of the instrument might be raised by creating new items, clarifying the test’s instructions, or simplifying the scoring rules.

Internal consistency estimates of reliability, such as that obtained by use of the Spearman–Brown formula, are inappropriate for measuring the reliability of heterogeneous tests and speed tests. The impact of test characteristics on reliability is discussed in detail later in this chapter.Page 152

Other Methods of Estimating Internal Consistency

In addition to the Spearman–Brown formula, other methods used to obtain estimates of internal consistency reliability include formulas developed by Kuder and Richardson (1937) and Cronbach (1951). Inter-item consistency refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test. An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test. Tests are said to be homogeneous if they contain items that measure a single trait. As an adjective used to describe test items, homogeneity (derived from the Greek words homos, meaning “same,” and genos, meaning “kind”) is the degree to which a test measures a single factor. In other words, homogeneity is the extent to which items in a scale are unifactorial.

In contrast to test homogeneity, heterogeneity describes the degree to which a test measures different factors. A heterogeneous (or nonhomogeneous) test is composed of items that measure more than one trait. A test that assesses knowledge only of ultra high definition (UHD) television repair skills could be expected to be more homogeneous in content than a general electronics repair test. The former test assesses only one area whereas the latter assesses several, such as knowledge not only of UHD televisions but also of digital video recorders, Blu-Ray players, MP3 players, satellite radio receivers, and so forth.

The more homogeneous a test is, the more inter-item consistency it can be expected to have. Because a homogeneous test samples a relatively narrow content area, it is to be expected to contain more inter-item consistency than a heterogeneous test. Test homogeneity is desirable because it allows relatively straightforward test-score interpretation. Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested. Testtakers with the same score on a more heterogeneous test may have quite different abilities.

Although a homogeneous test is desirable because it so readily lends itself to clear interpretation, it is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality. One way to circumvent this potential source of difficulty has been to administer a series of homogeneous tests, each designed to measure some component of a heterogeneous variable.5

The Kuder–Richardson formulas

Dissatisfaction with existing split-half methods of estimating reliability compelled G. Frederic Kuder and M. W. Richardson (1937; Richardson & Kuder, 1939) to develop their own measures for estimating reliability. The most widely known of the many formulas they collaborated on is their Kuder–Richardson formula 20 , or KR-20, so named because it was the 20th formula developed in a series. Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar. However, KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items). If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method. Table 5–2 summarizes items on a sample heterogeneous test (the HERT), and Table 5–3 summarizes HERT performance for 20 testtakers. Assuming the difficulty level of all the items on the test to be about the same, would you expect a split-half (odd-even) estimate of reliability to be fairly high or low? How would the KR-20 reliability estimate compare with the odd-even estimate of reliability—would it be higher or lower?

Item Number	Content Area
1	UHD television
2	UHD television
3	Digital video recorder (DVR)
4	Digital video recorder (DVR)
5	Blu-Ray player
6	Blu-Ray player
7	Smart phone
8	Smart phone
9	Computer
10	Computer
11	Compact disc player
12	Compact disc player
13	Satellite radio receiver
14	Satellite radio receiver
15	Video camera
16	Video camera
17	MP3 player
18	MP3 player
Table 5–2 Content Areas Sampled for 18 Items of the Hypothetical Electronics Repair Test (HERT)
Item Number	Number of Testtakers Correct
1	14
2	12
3	9
4	18
5	8
6	5
7	6
8	9
9	10
10	10
11	8
12	6
13	15
14	9
15	12
16	12
17	14
18	7
Table 5–3 Performance on the 18-Item HERT by Item for 20 Testtakers

We might guess that, because the content areas sampled for the 18 items from this “Hypothetical Electronics Repair Test” are ordered in a manner whereby odd and even items Page 153tap the same content area, the odd-even reliability estimate will probably be quite high. Because of the great heterogeneity of content areas when taken as a whole, it could reasonably be predicted that the KR-20 estimate of reliability will be lower than the odd-even one. How is KR-20 computed? The following formula may be used:

where rKR20 stands for the Kuder–Richardson formula 20 reliability coefficient, k is the number of test items, σ2 is the variance of total test scores, p is the proportion of testtakers who pass the item, q is the proportion of people who fail the item, and Σ pq is the sum of the pq products over all items. For this particular example, k equals 18. Based on the data in Table 5–3, Σpq can be computed to be 3.975. The variance of total test scores is 5.26. Thus, rKR20 = .259.

An approximation of KR-20 can be obtained by the use of the 21st formula in the series developed by Kuder and Richardson, a formula known as—you guessed it—KR-21. The KR-21 formula may be used if there is reason to assume that all the test items have approximately Page 154the same degree of difficulty. Let’s add that this assumption is seldom justified. Formula KR-21 has become outdated in an era of calculators and computers. Way back when, KR-21 was sometimes used to estimate KR-20 only because it required many fewer calculations.

Numerous modifications of Kuder–Richardson formulas have been proposed through the years. The one variant of the KR-20 formula that has received the most acceptance and is in widest use today is a statistic called coefficient alpha. You may even hear it referred to as coefficient α−20. This expression incorporates both the Greek letter alpha (α) and the number 20, the latter a reference to KR-20.

Coefficient alpha

Developed by Cronbach (1951) and subsequently elaborated on by others (such as Kaiser & Michael, 1975; Novick & Lewis, 1967), coefficient alpha may be thought of as the mean of all possible split-half correlations, corrected by the Spearman–Brown formula. In contrast to KR-20, which is appropriately used only on tests with dichotomous items, coefficient alpha is appropriate for use on tests containing nondichotomous items. The formula for coefficient alpha is

where ra is coefficient alpha, k is the number of items, is the variance of one item, Σ is the sum of variances of each item, and σ2 is the variance of the total test scores.

Coefficient alpha is the preferred statistic for obtaining an estimate of internal consistency reliability. A variation of the formula has been developed for use in obtaining an estimate of test-retest reliability (Green, 2003). Essentially, this formula yields an estimate of the mean of all possible test-retest, split-half coefficients. Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test.

Unlike a Pearson r, which may range in value from −1 to +1, coefficient alpha typically ranges in value from 0 to 1. The reason for this is that, conceptually, coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are. Here, similarity is gauged, in essence, on a scale from 0 (absolutely no similarity) to 1 (perfectly identical). It is possible, however, to conceive of data sets that would yield a negative value of alpha (Streiner, 2003b). Still, because negative values of alpha are theoretically impossible, it is recommended under such rare circumstances that the alpha coefficient be reported as zero (Henson, 2001). Also, a myth about alpha is that “bigger is always better.” As Streiner (2003b) pointed out, a value of alpha above .90 may be “too high” and indicate redundancy in the items.

In contrast to coefficient alpha, a Pearson r may be thought of as dealing conceptually with both dissimilarity and similarity. Accordingly, an r value of −1 may be thought of as indicating “perfect dissimilarity.” In practice, most reliability coefficients—regardless of the specific type of reliability they are measuring—range in value from 0 to 1. This is generally true, although it is possible to conceive of exceptional cases in which data sets yield an r with a negative value.

Average proportional distance (APD)

A relatively new measure for evaluating the internal consistency of a test is the average proportional distance (APD) method (Sturman et al., 2009). Rather than focusing on similarity between scores on items of a test (as do split-half methods and Cronbach’s alpha), the APD is a measure that focuses on the degree of difference that exists between item scores. Accordingly, we define the average proportional distance method as a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.

To illustrate how the APD is calculated, consider the (hypothetical) “3-Item Test of Extraversion” (3-ITE). As conveyed by the title of the 3-ITE, it is a test that has only three Page 155items. Each of the items is a sentence that somehow relates to extraversion. Testtakers are instructed to respond to each of the three items with reference to the following 7-point scale: 1 = Very strongly disagree, 2 = Strongly disagree, 3 = Disagree, 4 = Neither Agree nor Disagree, 5 = Agree, 6 = Strongly agree, and 7 = Very strongly agree.

Typically, in order to evaluate the inter-item consistency of a scale, the calculation of the APD would be calculated for a group of testtakers. However, for the purpose of illustrating the calculations of this measure, let’s look at how the APD would be calculated for one testtaker. Yolanda scores 4 on Item 1, 5 on Item 2, and 6 on Item 3. Based on Yolanda’s scores, the APD would be calculated as follows:

· Step 1: Calculate the absolute difference between scores for all of the items.

· Step 2: Average the difference between scores.

· Step 3: Obtain the APD by dividing the average difference between scores by the number of response options on the test, minus one.

So, for the 3-ITE, here is how the calculations would look using Yolanda’s test scores:

· Step 1: Absolute difference between Items 1 and 2 = 1

· Absolute difference between Items 1 and 3 = 2

· Absolute difference between Items 2 and 3 = 1

· Step 2: In order to obtain the average difference (AD), add up the absolute differences in Step 1 and divide by the number of items as follows:

· Step 3: To obtain the average proportional distance (APD), divide the average difference by 6 (the 7 response options in our ITE scale minus 1). Using Yolanda’s data, we would divide 1.33 by 6 to get .22. Thus, the APD for the ITE is .22. But what does this mean?

The general “rule of thumb” for interpreting an APD is that an obtained value of .2 or lower is indicative of excellent internal consistency, and that a value of .25 to .2 is in the acceptable range. A calculated APD of .25 is suggestive of problems with the internal consistency of the test. These guidelines are based on the assumption that items measuring a single construct such as extraversion should ideally be correlated with one another in the .6 to .7 range. Let’s add that the expected inter-item correlation may vary depending on the variables being measured, so the ideal correlation values are not set in stone. In the case of the 3-ITE, the data for our one subject suggests that the scale has acceptable internal consistency. Of course, in order to make any meaningful conclusions about the internal consistency of the 3-ITE, the instrument would have to be tested with a large sample of testtakers.

One potential advantage of the APD method over using Cronbach’s alpha is that the APD index is not connected to the number of items on a measure. Cronbach’s alpha will be higher when a measure has more than 25 items (Cortina, 1993). Perhaps the best course of action when evaluating the internal consistency of a given measure is to analyze and integrate the information using several indices, including Cronbach’s alpha, mean inter-item correlations, and the APD.

Before proceeding, let’s emphasize that all indices of reliability provide an index that is a characteristic of a particular group of test scores, not of the test itself (Caruso, 2000; Yin & Fan, 2000). Measures of reliability are estimates, and estimates are subject to error. The precise amount of error inherent in a reliability estimate will vary with various factors, such as the sample of testtakers from which the data were drawn. A reliability index published in a test manual might be very impressive. However, keep in mind that the reported reliability was achieved with a particular group of testtakers. If a new group of testtakers is sufficiently Page 156different from the group of testtakers on whom the reliability studies were done, the reliability coefficient may not be as impressive—and may even be unacceptable.

Measures of Inter-Scorer Reliability

When being evaluated, we usually would like to believe that the results would be the same no matter who is doing the evaluating.6 For example, if you take a road test for a driver’s license, you would like to believe that whether you pass or fail is solely a matter of your performance behind the wheel and not a function of who is sitting in the passenger’s seat. Unfortunately, in some types of tests under some conditions, the score may be more a function of the scorer than of anything else. This was demonstrated back in 1912, when researchers presented one pupil’s English composition to a convention of teachers and volunteers graded the papers. The grades ranged from a low of 50% to a high of 98% (Starch & Elliott, 1912). Concerns about inter-scorer reliability are as relevant today as they were back then (Chmielewski et al., 2015; Edens et al., 2015; Penney et al., 2016). With this as background, it can be appreciated that certain tests lend themselves to scoring in a way that is more consistent than with other tests. It is meaningful, therefore, to raise questions about the degree of consistency, or reliability, that exists between scorers of a particular test.

Variously referred to as scorer reliability, judge reliability, observer reliability, and inter-rater reliability, inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure. Reference to levels of inter-scorer reliability for a particular test may be published in the test’s manual or elsewhere. If the reliability coefficient is high, the prospective test user knows that test scores can be derived in a systematic, consistent way by various scorers with sufficient training. A responsible test developer who is unable to create a test that can be scored with a reasonable degree of consistency by trained scorers will go back to the drawing board to discover the reason for this problem. If, for example, the problem is a lack of clarity in scoring criteria, then the remedy might be to rewrite the scoring criteria section of the manual to include clearly written scoring rules. Inter-rater consistency may be promoted by providing raters with the opportunity for group discussion along with practice exercises and information on rater accuracy (Smith, 1986).

Inter-scorer reliability is often used when coding nonverbal behavior. For example, a researcher who wishes to quantify some aspect of nonverbal behavior, such as depressed mood, would start by composing a checklist of behaviors that constitute depressed mood (such as looking downward and moving slowly). Accordingly, each subject would be given a depressed mood score by a rater. Researchers try to guard against such ratings being products of the rater’s individual biases or idiosyncrasies in judgment. This can be accomplished by having at least one other individual observe and rate the same behaviors. If consensus can be demonstrated in the ratings, the researchers can be more confident regarding the accuracy of the ratings and their conformity with the established rating system.

JUST THINK . . .

Can you think of a measure in which it might be desirable for different judges, scorers, or raters to have different views on what is being judged, scored, or rated?

Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a coefficient of inter-scorer reliability . In this chapter’s Everyday Psychometrics section, the nature of the relationship between the specific method used and the resulting estimate of diagnostic reliability is considered in greater detail.Page 157

EVERYDAY PSYCHOMETRICS

The Importance of the Method Used for Estimating Reliability*

As noted throughout this text, reliability is extremely important in its own right and is also a necessary, but not sufficient, condition for validity. However, researchers often fail to understand that the specific method used to obtain reliability estimates can lead to large differences in those estimates, even when other factors (such as subject sample, raters, and specific reliability statistic used) are held constant. A published study by Chmielewski et al. (2015) highlighted the substantial influence that differences in method can have on estimates of inter-rater reliability.

As one might expect, high levels of diagnostic (inter-rater) reliability are vital for the accurate diagnosis of psychiatric/psychological disorders. Diagnostic reliability must be acceptably high in order to accurately identify risk factors for a disorder that are common to subjects in a research study. Without satisfactory levels of diagnostic reliability, it becomes nearly impossible to accurately determine the effectiveness of treatments in clinical trials. Low diagnostic reliability can also lead to improper information regarding how a disorder changes over time. In applied clinical settings, unreliable diagnoses can result in ineffective patient care—or worse. The utility and validity of a particular diagnosis itself can be called into question if expert diagnosticians cannot, for whatever reason, consistently agree on who should and should not be so diagnosed. In sum, high levels of diagnostic reliability are essential for establishing diagnostic validity (Freedman, 2013; Nelson-Gray, 1991).

The official nomenclature of psychological/psychiatric diagnoses in the United States is the Diagnostic and Statistical Manual of Mental Disorders (DSM-5; American Psychiatric Association, 2013), which provides explicit diagnostic criteria for all mental disorders. A perceived strength of recent versions of the DSM is that disorders listed in the manual can be diagnosed with a high level of inter-rater reliability (Hyman, 2010; Nathan & Langenbucher, 1999), especially when trained professionals use semistructured interviews to assign those diagnoses. However, the field trials for the newest version of the manual, the DSM-5, demonstrated a mean kappa of only .44 (Regier et al., 2013), which is considered a “fair” level of agreement that is only moderately greater than chance (Cicchetti, 1994; Fleiss, 1981). Moreover, DSM-5 kappas were much lower than those from previous versions of the manual which had been in the “excellent” range. As one might expect, given the assumption that psychiatric diagnoses are reliable, the results of the DSM-5 field trials caused considerable controversy and led to numerous criticisms of the new manual (Frances, 2012; Jones, 2012). Interestingly, several diagnoses, which were unchanged from previous versions of the manual, also demonstrated low diagnostic reliability suggesting that the manual itself was not responsible for the apparent reduction in reliability. Instead, differences in the methods used to obtain estimates of inter-rater reliability in the DSM-5 Field Trials, compared to estimates for previous versions of the manual, may have led to the lower observed diagnostic reliability.

Prior to DSM-5, estimates of DSM inter-rater reliability were largely derived using the audio-recording method. In the audio-recording method, one clinician interviews a patient and assigns diagnoses. Then a second clinician, who does not know what diagnoses were assigned, listens to an audio-recording (or watches a video-recording) of the interview and independently assigns diagnoses. These two sets of ratings are then used to calculate inter-rater reliability coefficients (such as kappa). However, in recent years, several researchers have made the case that the audio-recording method might inflate estimates of diagnostic reliability for a variety of reasons (Chmielewski et al., 2015; Kraemer et al., 2012). First, if the interviewing clinician decides the patient they are interviewing does not meet diagnostic criteria for a disorder, they typically do not ask about any remaining symptoms of the disorder (this is a feature of semistructured interviews designed to reduce administration times). However, it also means that the clinician listening to the audio-tape, even if they believe the patient might meet diagnostic criteria for a disorder, does not have all the information necessary to assign a diagnosis and therefore is forced to agree that no diagnosis is present. Second, only the interviewing clinician can follow up patient responses with further questions or obtain clarification regarding symptoms to help them make a decision. Third, even when semistructured interviews are used it is possible that two highly trained clinicians might obtain different responses from a patient if they had each conducted their own interview. In other words, the patient may volunteer more or perhaps even different information to one of the clinicians for any number of reasons. All of the above result in the audio- or video-recording method artificially constraining the information provided to the clinicians to be identical, which is unlikely to occur in actual research or Page 158clinical settings. As such, this method does not allow for truly independent ratings and therefore likely results in overestimates of what would be obtained if separate interviews were conducted.

In the test-retest method, separate independent interviews are conducted by two different clinicians, with neither clinician knowing what occurred during the other interview. These interviews are conducted over a time frame short enough that true change in diagnostic status is highly unlikely, making this method similar to the dependability method of assessing reliability (Chmielewski & Watson, 2009). Because diagnostic reliability is intended to assess the extent to which a patient would receive the same diagnosis at different hospitals or clinics—or, alternatively, the extent to which different studies are recruiting similar patients—the test-retest method provides a more meaningful, realistic, and ecologically valid estimate of diagnostic reliability.

Chmielewski et al. (2015) examined the influence of method on estimates of reliability by using both the audio-recording and test-retest methods in a large sample of psychiatric patients. The authors’ analyzed DSM-5 diagnoses because of the long-standing claims in the literature that they were reliable and the fact that structured interviews had not yet been created for the DSM-5. They carefully selected a one-week test-retest interval, based on theory and research, to minimize the likelihood that true diagnostic change would occur while substantially reducing memory effects and patient fatigue which might exist if the interviews were conducted immediately after each other. Clinicians in the study were at least master’s level and underwent extensive training that far exceeded the training of clinicians in the vast majority of research studies. The same pool of clinicians and patients was used for the audio-recording and test-retest methods. Diagnoses were assigned using the Structured Clinical Interview for DSM-IV (SCID-I/P; First et al., 2002), which is widely considered the gold-standard diagnostic interview in the field. Finally, patients completed self-report measures which were examined to ensure patients’ symptoms did not change over the one-week interval.

Diagnostic (inter-rater) reliability using the audio-recording method was very high (mean kappa = .80) and would be considered “excellent” by traditional standards (Cicchetti, 1994; Fleiss, 1981). Moreover, estimates of diagnostic reliability were equivalent or superior to previously published values for the DSM-5. However, estimates of diagnostic reliability obtained from the test-retest method were substantially lower (mean kappa = .47) and would be considered only “fair” by traditional standards. Moreover, approximately 25% of the disorders demonstrated “poor” diagnostic reliability. Interestingly, this level of diagnostic reliability was very similar to that observed in the DSM-5 Field Trials (mean kappa = .44), which also used the test-retest method (Regier et al., 2013). It is important to note these large differences in estimates of diagnostic reliability emerged despite the fact that (1) the same highly trained master’s-level clinicians were used for both methods; (2) the SCID-I/P, which is considered the “gold standard” in diagnostic interviews, was used; (3) the same patient sample was used; and (4) patients’ self-report of their symptoms was very stable (or, patients were experiencing their symptoms the same way during both interviews) and any changes in self-report were unrelated to diagnostic disagreements between clinicians. These results suggest that the reliability of diagnoses is far lower than commonly believed. Moreover, the results demonstrate the substantial influence that method has on estimates of diagnostic reliability even when other factors are held constant.

Used with permission of Michael Chmielewski.

*This Everyday Psychometrics was guest-authored by Michael Chmielewski of Southern Methodist University and was based on an article by Chmielewski et al. (2015), published in the Journal of Abnormal Psychology (copyright © 2015 by the American Psychological Association). The use of this information does not imply endorsement by the publisher.

Using and Interpreting a Coefficient of Reliability

We have seen that, with respect to the test itself, there are basically three approaches to the estimation of reliability: (1) test-retest, (2) alternate or parallel forms, and (3) internal or inter-item consistency. The method or methods employed will depend on a number of factors, such as the purpose of obtaining a measure of reliability.

Another question that is linked in no trivial way to the purpose of the test is, “How high should the coefficient of reliability be?” Perhaps the best “short answer” to this question is: Page 159“On a continuum relative to the purpose and importance of the decisions to be made on the basis of scores on the test.” Reliability is a mandatory attribute in all tests we use. However, we need more of it in some tests, and we will admittedly allow for less of it in others. If a test score carries with it life-or-death implications, then we need to hold that test to some high standards—including relatively high standards with regard to coefficients of reliability. If a test score is routinely used in combination with many other test scores and typically accounts for only a small part of the decision process, that test will not be held to the highest standards of reliability. As a rule of thumb, it may be useful to think of reliability coefficients in a way that parallels many grading systems: In the .90s rates a grade of A (with a value of .95 higher for the most important types of decisions), in the .80s rates a B (with below .85 being a clear B−), and anywhere from .65 through the .70s rates a weak, “barely passing” grade that borders on failing (and unacceptable). Now, let’s get a bit more technical with regard to the purpose of the reliability coefficient.

The Purpose of the Reliability Coefficient

If a specific test of employee performance is designed for use at various times over the course of the employment period, it would be reasonable to expect the test to demonstrate reliability across time. It would thus be desirable to have an estimate of the instrument’s test-retest reliability. For a test designed for a single administration only, an estimate of internal consistency would be the reliability measure of choice. If the purpose of determining reliability is to break down the error variance into its parts, as shown in Figure 5–1, then a number of reliability coefficients would have to be calculated.

Figure 5–1 Sources of Variance in a Hypothetical Test In this hypothetical situation, 5% of the variance has not been identified by the test user. It is possible, for example, that this portion of the variance could be accounted for by transient error, a source of error attributable to variations in the testtaker’s feelings, moods, or mental state over time. Then again, this 5% of the error may be due to other factors that are yet to be identified.

Note that the various reliability coefficients do not all reflect the same sources of error variance. Thus, an individual reliability coefficient may provide an index of error from test construction, test administration, or test scoring and interpretation. A coefficient of inter-rater reliability, for example, provides information about error as a result of test scoring. Specifically, it can be used to answer questions about how consistently two scorers score the same test items. Table 5–4 summarizes the different kinds of error variance that are reflected in different reliability coefficients.

Page 160

Type of Reliability	Purpose	Typical uses	Number of Testing Sessions	Sources of Error Variance	Statistical Procedures
· Test-retest	· To evaluate the stabilityof a measure	· When assessing the stability of various personality traits	· 2	· Administration	· Pearson r or Spearman rho
· Alternate-forms	· To evaluate the relationship between different forms of a measure	· When there is a need for different forms of a test (e.g., makeup tests)	· 1 or 2	· Test construction or administration	· Pearson r or Spearman rho
· Internal consistency	· To evaluate the extent to which items on a scale relate to one another	· When evaluating the homogeneity of a measure (or, all items are tapping a single construct)	· 1	· Test construction	· Pearson r between equivalent test halves with Spearman Brown correction or Kuder-R-ichardson for dichotomous items, or coefficient alpha for multipoint items or APD
· Inter-scorer	· To evaluate the level of agreement between raters on a measure	· Interviews or coding of behavior. Used when researchers need to show that there is consensus in the way that different raters view a particular behavior pattern (and hence no observer bias).	· 1	· Scoring and interpretation	· Cohen’s kappa, Pearson r or Spearman rho
Table 5–4 Summary of Reliability Types

The Nature of the Test

Closely related to considerations concerning the purpose and use of a reliability coefficient are those concerning the nature of the test itself. Included here are considerations such as whether (1) the test items are homogeneous or heterogeneous in nature; (2) the characteristic, ability, or trait being measured is presumed to be dynamic or static; (3) the range of test scores is or is not restricted; (4) the test is a speed or a power test; and (5) the test is or is not criterion-referenced.

Some tests present special problems regarding the measurement of their reliability. For example, a number of psychological tests have been developed for use with infants to help identify children who are developing slowly or who may profit from early intervention of some sort. Measuring the internal consistency reliability or the inter-scorer reliability of such tests is accomplished in much the same way as it is with other tests. However, measuring test-retest reliability presents a unique problem. The abilities of the very young children being tested are fast-changing. It is common knowledge that cognitive development during the first months and years of life is both rapid and uneven. Children often grow in spurts, sometimes changing dramatically in as little as days (Hetherington & Parke, 1993). The child tested just before and again just after a developmental advance may perform very differently on the two testings. In such cases, a marked change in test score might be attributed to error when in reality it reflects a genuine change in the testtaker’s skills. The challenge in gauging the test-retest reliability of such tests is to do so in such a way that it is not spuriously lowered by the testtaker’s actual Page 161developmental changes between testings. In attempting to accomplish this, developers of such tests may design test-retest reliability studies with very short intervals between testings, sometimes as little as four days.

Homogeneity versus heterogeneity of test items

Recall that a test is said to be homogeneous in items if it is functionally uniform throughout. Tests designed to measure one factor, such as one ability or one trait, are expected to be homogeneous in items. For such tests, it is reasonable to expect a high degree of internal consistency. By contrast, if the test is heterogeneous in items, an estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.

Dynamic versus static characteristics

Whether what is being measured by the test is dynamic or static is also a consideration in obtaining an estimate of reliability. A dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences. If, for example, one were to take hourly measurements of the dynamic characteristic of anxiety as manifested by a stockbroker throughout a business day, one might find the measured level of this characteristic to change from hour to hour. Such changes might even be related to the magnitude of the Dow Jones average. Because the true amount of anxiety presumed to exist would vary with each assessment, a test-retest measure would be of little help in gauging the reliability of the measuring instrument. Therefore, the best estimate of reliability would be obtained from a measure of internal consistency. Contrast this situation to one in which hourly assessments of this same stockbroker are made on a trait, state, or ability presumed to be relatively unchanging (a static characteristic ), such as intelligence. In this instance, obtained measurement would not be expected to vary significantly as a function of time, and either the test-retest or the alternate-forms method would be appropriate.

JUST THINK . . .

Provide another example of both a dynamic characteristic and a static characteristic that a psychological test could measure.

Restriction or inflation of range

In using and interpreting a coefficient of reliability, the issue variously referred to as restriction of range or restriction of variance (or, conversely, inflation of range or inflation of variance ) is important. If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower. If the variance of either variable in a correlational analysis is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher. Refer back to Figure 3–17 on page 111 (Two Scatterplots Illustrating Unrestricted and Restricted Ranges) for a graphic illustration.

Also of critical importance is whether the range of variances employed is appropriate to the objective of the correlational analysis. Consider, for example, a published educational test designed for use with children in grades 1 through 6. Ideally, the manual for this test should contain not one reliability value covering all the testtakers in grades 1 through 6 but instead reliability values for testtakers at each grade level. Here’s another example: A corporate personnel officer employs a certain screening test in the hiring process. For future testing and hiring purposes, this personnel officer maintains reliability data with respect to scores achieved by job applicants—as opposed to hired employees—in order to avoid restriction of range effects in the data. This is so because the people who were hired typically scored higher on the test than any comparable group of applicants.

Speed tests versus power tests

When a time limit is long enough to allow testtakers to attempt all items, and if some items are so difficult that no testtaker is able to obtain a perfect score, then the test is a power test . By contrast, a speed test generally contains items of Page 162uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all testtakers should be able to complete all the test items correctly. In practice, however, the time limit on a speed test is established so that few if any of the testtakers will be able to complete the entire test. Score differences on a speed test are therefore based on performance speed because items attempted tend to be correct.

A reliability estimate of a speed test should be based on performance from two independent testing periods using one of the following: (1) test-retest reliability, (2) alternate-forms reliability, or (3) split-half reliability from two separately timed half tests. If a split-half procedure is used, then the obtained reliability coefficient is for a half test and should be adjusted using the Spearman–Brown formula.

Because a measure of the reliability of a speed test should reflect the consistency of response speed, the reliability of a speed test should not be calculated from a single administration of the test with a single time limit. If a speed test is administered once and some measure of internal consistency, such as the Kuder–Richardson or a split-half correlation, is calculated, the result will be a spuriously high reliability coefficient. To understand why the KR-20 or split-half reliability coefficient will be spuriously high, consider the following example.

When a group of testtakers completes a speed test, almost all the items completed will be correct. If reliability is examined using an odd-even split, and if the testtakers completed the items in order, then testtakers will get close to the same number of odd as even items correct. A testtaker completing 82 items can be expected to get approximately 41 odd and 41 even items correct. A testtaker completing 61 items may get 31 odd and 30 even items correct. When the numbers of odd and even items correct are correlated across a group of testtakers, the correlation will be close to 1.00. Yet this impressive correlation coefficient actually tells us nothing about response consistency.

Under the same scenario, a Kuder–Richardson reliability coefficient would yield a similar coefficient that would also be, well, equally useless. Recall that KR-20 reliability is based on the proportion of testtakers correct (p) and the proportion of testtakers incorrect (q) on each item. In the case of a speed test, it is conceivable that p would equal 1.0 and q would equal 0 for many of the items. Toward the end of the test—when many items would not even be attempted because of the time limit—p might equal 0 and q might equal 1.0. For many, if not a majority, of the items, then, the product pq would equal or approximate 0. When 0 is substituted in the KR-20 formula for Σ pq, the reliability coefficient is 1.0 (a meaningless coefficient in this instance).

Criterion-referenced tests

A criterion-referenced test is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective. Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in hierarchical fashion. For example, the would-be pilot masters on-ground skills before attempting to master in-flight skills. Scores on criterion-referenced tests tend to be interpreted in pass–fail (or, perhaps more accurately, “master-failed-to-master”) terms, and any scrutiny of performance on individual items tends to be for diagnostic and remedial purposes.

Traditional techniques of estimating reliability employ measures that take into account scores on the entire test. Recall that a test-retest reliability estimate is based on the correlation between the total scores on two administrations of the same test. In alternate-forms reliability, a reliability estimate is based on the correlation between the two total scores on the two forms. In split-half reliability, a reliability estimate is based on the correlation between scores on two halves of the test and is then adjusted using the Spearman–Brown formula to obtain a reliability estimate of the whole test. Although there are exceptions, such traditional procedures of Page 163estimating reliability are usually not appropriate for use with criterion-referenced tests. To understand why, recall that reliability is defined as the proportion of total variance (σ2) attributable to true variance (σ2th). Total variance in a test score distribution equals the sum of the true variance plus the error variance (σe2)

A measure of reliability, therefore, depends on the variability of the test scores: how different the scores are from one another. In criterion-referenced testing, and particularly in mastery testing, how different the scores are from one another is seldom a focus of interest. In fact, individual differences between examinees on total test scores may be minimal. The critical issue for the user of a mastery test is whether or not a certain criterion score has been achieved.

As individual differences (and the variability) decrease, a traditional measure of reliability would also decrease, regardless of the stability of individual performance. Therefore, traditional ways of estimating reliability are not always appropriate for criterion-referenced tests, though there may be instances in which traditional estimates can be adopted. An example might be a situation in which the same test is being used at different stages in some program—training, therapy, or the like—and so variability in scores could reasonably be expected. Statistical techniques useful in determining the reliability of criterion-referenced tests are discussed in great detail in many sources devoted to that subject (e.g., Hambleton & Jurgensen, 1990).

The True Score Model of Measurement and Alternatives to It

Thus far—and throughout this book, unless specifically stated otherwise—the model we have assumed to be operative is classical test theory (CTT) , also referred to as the true score (or classical) model of measurement. CTT is the most widely used and accepted model in the psychometric literature today—rumors of its demise have been greatly exaggerated (Zickar & Broadfoot, 2009). One of the reasons it has remained the most widely used model has to do with its simplicity, especially when one considers the complexity of other proposed models of measurement. Comparing CTT to IRT, for example, Streiner (2010) mused, “CTT is much simpler to understand than IRT; there aren’t formidable-looking equations with exponentiations, Greek letters, and other arcane symbols” (p. 185). Additionally, the CTT notion that everyone has a “true score” on a test has had, and continues to have, great intuitive appeal. Of course, exactly how to define this elusive true score has been a matter of sometimes contentious debate. For our purposes, we will define true score as a value that according to classical test theory genuinely reflects an individual’s ability (or trait) level as measured by a particular test. Let’s emphasize here that this value is indeed very test dependent. A person’s “true score” on one intelligence test, for example, can vary greatly from that same person’s “true score” on another intelligence test. Similarly, if “Form D” of an ability test contains items that the testtaker finds to be much more difficult than those on “Form E” of that test, then there is a good chance that the testtaker’s true score on Form D will be lower than that on Form E. The same holds for true scores obtained on different tests of personality. One’s true score on one test of extraversion, for example, may not bear much resemblance to one’s true score on another test of extraversion. Comparing a testtaker’s scores on two different tests purporting to measure the same thing requires a sophisticated knowledge of the properties of each of the two tests, as well as some rather complicated statistical procedures designed to equate the scores.

Another aspect of the appeal of CTT is that its assumptions allow for its application in most situations (Hambleton & Swaminathan, 1985). The fact that CTT assumptions are rather easily met and therefore applicable to so many measurement situations can be Page 164advantageous, especially for the test developer in search of an appropriate model of measurement for a particular application. Still, in psychometric parlance, CTT assumptions are characterized as “weak”—this precisely because its assumptions are so readily met. By contrast, the assumptions in another model of measurement, item response theory (IRT), are more difficult to meet. As a consequence, you may read of IRT assumptions being characterized in terms such as “strong,” “hard,” “rigorous,” and “robust.” A final advantage of CTT over any other model of measurement has to do with its compatibility and ease of use with widely used statistical techniques (as well as most currently available data analysis software). Factor analytic techniques, whether exploratory or confirmatory, are all “based on the CTT measurement foundation” (Zickar & Broadfoot, 2009, p. 52).

For all of its appeal, measurement experts have also listed many problems with CTT. For starters, one problem with CTT has to do with its assumption concerning the equivalence of all items on a test; that is, all items are presumed to be contributing equally to the score total. This assumption is questionable in many cases, and particularly questionable when doubt exists as to whether the scaling of the instrument in question is genuinely interval level in nature. Another problem has to do with the length of tests that are developed using a CTT model. Whereas test developers favor shorter rather than longer tests (as do most testtakers), the assumptions inherent in CTT favor the development of longer rather than shorter tests. For these reasons, as well as others, alternative measurement models have been developed. Below we briefly describe domain sampling theory and generalizability theory. We will then describe in greater detail, item response theory (IRT), a measurement model that some believe is a worthy successor to CTT (Borsbroom, 2005; Harvey & Hammer, 1999).

Domain sampling theory and generalizability theory

The 1950s saw the development of a viable alternative to CTT. It was originally referred to as domain sampling theory and is better known today in one of its many modified forms as generalizability theory. As set forth by Tryon (1957), the theory of domain sampling rebels against the concept of a true score existing with respect to the measurement of psychological constructs. Whereas those who subscribe to CTT seek to estimate the portion of a test score that is attributable to error, proponents of domain sampling theory seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score. In domain sampling theory, a test’s reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample (Thorndike, 1985). A domain of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test. In theory, the items in the domain are thought to have the same means and variances of those in the test that samples from the domain. Of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory.

In one modification of domain sampling theory called generalizability theory, a “universe score” replaces that of a “true score” (Shavelson et al., 1989). Developed by Lee J. Cronbach (1970) and his colleagues (Cronbach et al., 1972), generalizability theory is based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation. Instead of conceiving of all variability in a person’s scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation or universe leading to a specific test score. This universe is described in terms of its facets , which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration. Page 165According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained. This test score is the universe score , and it is, as Cronbach noted, analogous to a true score in the true score model. Cronbach (1970) explained as follows:

“What is Mary’s typing ability?” This must be interpreted as “What would Mary’s word processing score on this be if a large number of measurements on the test were collected and averaged?” The particular test score Mary earned is just one out of a universe of possible observations. If one of these scores is as acceptable as the next, then the mean, called the universe score and symbolized here by Mp (mean for person p), would be the most appropriate statement of Mary’s performance in the type of situation the test represents.

The universe is a collection of possible measures “of the same kind,” but the limits of the collection are determined by the investigator’s purpose. If he needs to know Mary’s typing ability on May 5 (for example, so that he can plot a learning curve that includes one point for that day), the universe would include observations on that day and on that day only. He probably does want to generalize over passages, testers, and scorers—that is to say, he would like to know Mary’s ability on May 5 without reference to any particular passage, tester, or scorer… .

The person will ordinarily have a different universe score for each universe. Mary’s universe score covering tests on May 5 will not agree perfectly with her universe score for the whole month of May… . Some testers call the average over a large number of comparable observations a “true score”; e.g., “Mary’s true typing rate on 3-minute tests.” Instead, we speak of a “universe score” to emphasize that what score is desired depends on the universe being considered. For any measure there are many “true scores,” each corresponding to a different universe.

When we use a single observation as if it represented the universe, we are generalizing. We generalize over scorers, over selections typed, perhaps over days. If the observed scores from a procedure agree closely with the universe score, we can say that the observation is “accurate,” or “reliable,” or “generalizable.” And since the observations then also agree with each other, we say that they are “consistent” and “have little error variance.” To have so many terms is confusing, but not seriously so. The term most often used in the literature is “reliability.” The author prefers “generalizability” because that term immediately implies “generalization to what?” … There is a different degree of generalizability for each universe. The older methods of analysis do not separate the sources of variation. They deal with a single source of variance, or leave two or more sources entangled. (Cronbach, 1970, pp. 153–154)

How can these ideas be applied? Cronbach and his colleagues suggested that tests be developed with the aid of a generalizability study followed by a decision study. A generalizability study examines how generalizable scores from a particular test are if the test is administered in different situations. Stated in the language of generalizability theory, a generalizability study examines how much of an impact different facets of the universe have on the test score. Is the test score affected by group as opposed to individual administration? Is the test score affected by the time of day in which the test is administered? The influence of particular facets on the test score is represented by coefficients of generalizability . These coefficients are similar to reliability coefficients in the true score model.

After the generalizability study is done, Cronbach et al. (1972) recommended that test developers do a decision study, which involves the application of information from the generalizability study. In the decision study , developers examine the usefulness of test scores in helping the test user make decisions. In practice, test scores are used to guide a variety of decisions, from placing a child in special education to hiring new employees to Page 166discharging mental patients from the hospital. The decision study is designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use. Why is this so important? Cronbach (1970) noted:

The decision that a student has completed a course or that a patient is ready for termination of therapy must not be seriously influenced by chance errors, temporary variations in performance, or the tester’s choice of questions. An erroneous favorable decision may be irreversible and may harm the person or the community. Even when reversible, an erroneous unfavorable decision is unjust, disrupts the person’s morale, and perhaps retards his development. Research, too, requires dependable measurement. An experiment is not very informative if an observed difference could be accounted for by chance variation. Large error variance is likely to mask a scientifically important outcome. Taking a better measure improves the sensitivity of an experiment in the same way that increasing the number of subjects does. (p. 152)

Generalizability has not replaced CTT. Perhaps one of its chief contributions has been its emphasis on the fact that a test’s reliability does not reside within the test itself. From the perspective of generalizability theory, a test’s reliability is very much a function of the circumstances under which the test is developed, administered, and interpreted.

Item response theory (IRT)

Another alternative to the true score model is item response theory (IRT; Lord & Novick, 1968; Lord, 1980). The procedures of item response theory provide a way to model the probability that a person with X ability will be able to perform at a level of Y. Stated in terms of personality assessment, it models the probability that a person with X amount of a particular personality trait will exhibit Y amount of that trait on a personality test designed to measure it. Because so often the psychological or educational construct being measured is physically unobservable (stated another way, is latent) and because the construct being measured may be a trait (it could also be something else, such as an ability), a synonym for IRT in the academic literature is latent-trait theory . Let’s note at the outset, however, that IRT is not a term used to refer to a single theory or method. Rather, it refers to a family of theories and methods—and quite a large family at that—with many other names used to distinguish specific approaches. There are well over a hundred varieties of IRT models. Each model is designed to handle data with certain assumptions and data characteristics.

Examples of two characteristics of items within an IRT framework are the difficulty level of an item and the item’s level of discrimination; items may be viewed as varying in terms of these, as well as other, characteristics. “Difficulty” in this sense refers to the attribute of not being easily accomplished, solved, or comprehended. In a mathematics test, for example, a test item tapping basic addition ability will have a lower difficulty level than a test item tapping basic algebra skills. The characteristic of difficulty as applied to a test item may also refer to physical difficulty—that is, how hard or easy it is for a person to engage in a particular activity. Consider in this context three items on a hypothetical “Activities of Daily Living Questionnaire” (ADLQ), a true–false questionnaire designed to tap the extent to which respondents are physically able to participate in activities of daily living. Item 1 of this test is I am able to walk from room to room in my home. Item 2 is I require assistance to sit, stand, and walk. Item 3 is I am able to jog one mile a day, seven days a week. With regard to difficulty related to mobility, the respondent who answers true to item 1 and false to item 2 may be presumed to have more mobility than the respondent who answers false to item 1 and true to item 2. In classical test theory, each of these items might be scored with 1 point awarded to responses indicative Page 167of mobility and 0 points for responses indicative of a lack of mobility. Within IRT, however, responses indicative of mobility (as opposed to a lack of mobility or impaired mobility) may be assigned different weights. A true response to item 1 may therefore earn more points than a false response to item 2, and a true response to item 3 may earn more points than a true response to item 1.

In the context of IRT, discrimination signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured. Consider two more ADLQ items: item 4, My mood is generally good; and item 5, I am able to walk one block on flat ground. Which of these two items do you think would be more discriminating in terms of the respondent’s physical abilities? If you answered “item 5” then you are correct. And if you were developing this questionnaire within an IRT framework, you would probably assign differential weight to the value of these two items. Item 5 would be given more weight for the purpose of estimating a person’s level of physical activity than item 4. Again, within the context of classical test theory, all items of the test might be given equal weight and scored, for example, 1 if indicative of the ability being measured and 0 if not indicative of that ability.

A number of different IRT models exist to handle data resulting from the administration of tests with various characteristics and in various formats. For example, there are IRT models designed to handle data resulting from the administration of tests with dichotomous test items (test items or questions that can be answered with only one of two alternative responses, such as true–false, yes–no, or correct–incorrect questions). There are IRT models designed to handle data resulting from the administration of tests with polytomous test items (test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct). Other IRT models exist to handle other types of data.

In general, latent-trait models differ in some important ways from CTT. For example, in CTT, no assumptions are made about the frequency distribution of test scores. By contrast, such assumptions are inherent in latent-trait models. As Allen and Yen (1979, p. 240) have pointed out, “Latent-trait theories propose models that describe how the latent trait influences performance on each test item. Unlike test scores or true scores, latent traits theoretically can take on values from −∞ to +∞ [negative infinity to positive infinity].” Some IRT models have very specific and stringent assumptions about the underlying distribution. In one group of IRT models developed by the Danish mathematician Georg Rasch, each item on the test is assumed to have an equivalent relationship with the construct being measured by the test. A shorthand reference to these types of models is “Rasch,” so reference to the Rasch model is a reference to an IRT model with very specific assumptions about the underlying distribution.

The psychometric advantages of IRT have made this model appealing, especially to commercial and academic test developers and to large-scale test publishers. It is a model that in recent years has found increasing application in standardized tests, professional licensing examinations, and questionnaires used in behavioral and social sciences (De Champlain, 2010). However, the mathematical sophistication of the approach has made it out of reach for many everyday users of tests such as classroom teachers or “mom and pop” employers (Reise & Henson, 2003). To learn more about the approach that Roid (2006) once characterized as having fostered “new rules of measurement” for ability testing ask your instructor to access the Instructor Resources within Connect and check out OOBAL-5-B2, “Item Response Theory (IRT).” More immediately, you can meet a “real-life” user of IRT in this chapter’s Meet an Assessment Professional feature.Page 168

MEET AN ASSESSMENT PROFESSIONAL

Meet Dr. Bryce B. Reeve

Iuse my skills and training as a psychometrician to design questionnaires and studies to capture the burden of cancer and its treatment on patients and their families… . The types of questionnaires I help to create measure a person’s health-related quality of life (HRQOL). HRQOL is a multidimensional construct capturing such domains as physical functioning, mental well-being, and social well-being. Different cancer types and treatments for those cancers may have different impact on the magnitude and which HRQOL domain is affected. All cancers can impact a person’s mental health with documented increases in depressive symptoms and anxiety… . There may also be positive impacts of cancer as some cancer survivors experience greater social well-being and appreciation of life. Thus, our challenge is to develop valid and precise measurement tools that capture these changes in patients’ lives. Psychometrically strong measures also allow us to evaluate the impact of new behavioral or pharmacological interventions developed to improve quality of life. Because many patients in our research studies are ill, it is important to have very brief questionnaires to minimize their burden responding to a battery of questionnaires.

… we … use both qualitative and quantitative methodologies to design … HRQOL instruments. We use qualitative methods like focus groups and cognitive interviewing to make sure we have captured the experiences and perspectives of cancer patients and to write questions that are comprehendible to people with low literacy skills or people of different cultures. We use quantitative methods to examine how well individual questions and scales perform for measuring the HRQOL domains. Specifically, we use classical test theory, factor analysis, and item response theory (IRT) to: (1) develop and refine questionnaires; (2) identify the performance of instruments across different age groups, males and females, and cultural/racial groups; and (3) to develop item banks which allow for creating standardized questionnaires or administering computerized adaptive testing (CAT).

Bryce B. Reeve, Ph.D., U.S. National Cancer Institute © Bryce B. Reeve/National Institute of Health

I use IRT models to get an in-depth look as to how questions and scales perform in our cancer research studies. [Using IRT], we were able to reduce a burdensome 21-item scale down to a brief 10-item scale… .

Differential item function (DIF) is a key methodology to identify … biased items in questionnaires. I have used IRT modeling to examine DIF in item responses on many HRQOL questionnaires. It is especially important to evaluate DIF in questionnaires that have been translated to multiple languages for the purpose of conducting international research studies. An instrument may be translated to have the same words in multiple languages, but the words themselves may have entirely different meaning to people of different cultures. For example, researchers at the University of Massachusetts found Chinese respondents gave lower satisfaction ratings of their medical doctors than non-Chinese. In a review of the translation, the “Excellent” response category translated into Chinese as “God-like.” IRT modeling gives me the ability to not only detect DIF items, but the flexibility to correct for bias as well. I can use IRT to look at unadjusted and adjusted IRT scores to see the effect of the DIF item without removing the item from the scale if the item is deemed relevant… .Page 169

The greatest challenges I found to greater application or acceptance of IRT methods in health care research are the complexities of the models themselves and lack of easy-to-understand resources and tools to train researchers. Many researchers have been trained in classical test theory statistics, are comfortable interpreting these statistics, and can use readily available software to generate easily familiar summary statistics, such as Cronbach’s coefficient α or item-total correlations. In contrast, IRT modeling requires an advanced knowledge of measurement theory to understand the mathematical complexities of the models, to determine whether the assumptions of the IRT models are met, and to choose the model from within the large family of IRT models that best fits the data and the measurement task at hand. In addition, the supporting software and literature are not well adapted for researchers outside the field of educational testing.

Read more of what Dr. Reeve had to say—his complete essay—through the Instructor Resources within Connect.

Used with permission of Bryce B. Reeve.

Reliability and Individual Scores

The reliability coefficient helps the test developer build an adequate measuring instrument, and it helps the test user select a suitable test. However, the usefulness of the reliability coefficient does not end with test construction and selection. By employing the reliability coefficient in the formula for the standard error of measurement, the test user now has another descriptive statistic relevant to test interpretation, this one useful in estimating the precision of a particular test score.

The Standard Error of Measurement

The standard error of measurement, often abbreviated as SEM or SEM, provides a measure of the precision of an observed test score. Stated another way, it provides an estimate of the amount of error inherent in an observed score or measurement. In general, the relationship between the SEM and the reliability of a test is inverse; the higher the reliability of a test (or individual subtest within a test), the lower the SEM.

To illustrate the utility of the SEM, let’s revisit The Rochester Wrenchworks (TRW) and reintroduce Mary (from Cronbach’s excerpt earlier in this chapter), who is now applying for a job as a word processor. To be hired at TRW as a word processor, a candidate must be able to word-process accurately at the rate of 50 words per minute. The personnel office administers a total of seven brief word-processing tests to Mary over the course of seven business days. In words per minute, Mary’s scores on each of the seven tests are as follows:

52 55 39 56 35 50 54

If you were in charge of hiring at TRW and you looked at these seven scores, you might logically ask, “Which of these scores is the best measure of Mary’s ‘true’ word-processing ability?” And more to the point, “Which is her ‘true’ score?”

The “true” answer to this question is that we cannot conclude with absolute certainty from the data we have exactly what Mary’s true word-processing ability is. We can, however, make an educated guess. Our educated guess would be that her true word-processing ability is equal to the mean of the distribution of her word-processing scores plus or minus a number of points accounted for by error in the measurement process. We do not know how many points are accounted for by error in the measurement process. The best we can do is estimate how much error entered into a particular test score.

The standard error of measurement is the tool used to estimate or infer the extent to which an observed score deviates from a true score. We may define the standard error of Page 170measurement as the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests. Also known as the standard error of a score and denoted by the symbol σmeas, the standard error of measurement is an index of the extent to which one individual’s scores vary over tests presumed to be parallel. In accordance with the true score model, an obtained test score represents one point in the theoretical distribution of scores the testtaker could have obtained. But where on the continuum of possible scores is this obtained score? If the standard deviation for the distribution of test scores is known (or can be calculated) and if an estimate of the reliability of the test is known (or can be calculated), then an estimate of the standard error of a particular score (or, the standard error of measurement) can be determined by the following formula:

where σmeas is equal to the standard error of measurement, σ is equal to the standard deviation of test scores by the group of testtakers, and rxx is equal to the reliability coefficient of the test. The standard error of measurement allows us to estimate, with a specific level of confidence, the range in which the true score is likely to exist.

If, for example, a spelling test has a reliability coefficient of .84 and a standard deviation of 10, then

In order to use the standard error of measurement to estimate the range of the true score, we make an assumption: If the individual were to take a large number of equivalent tests, scores on those tests would tend to be normally distributed, with the individual’s true score as the mean. Because the standard error of measurement functions like a standard deviation in this context, we can use it to predict what would happen if an individual took additional equivalent tests:

· approximately 68% (actually, 68.26%) of the scores would be expected to occur within ±1σmeas of the true score;

· approximately 95% (actually, 95.44%) of the scores would be expected to occur within ±2σmeas of the true score;

· approximately 99% (actually, 99.74%) of the scores would be expected to occur within ±3σmeas of the true score.

Of course, we don’t know the true score for any individual testtaker, so we must estimate it. The best estimate available of the individual’s true score on the test is the test score already obtained. Thus, if a student achieved a score of 50 on one spelling test and if the test had a standard error of measurement of 4, then—using 50 as the point estimate—we can be:

· 68% (actually, 68.26%) confident that the true score falls within 50 ± 1σmeas (or between 46 and 54, including 46 and 54);

· 95% (actually, 95.44%) confident that the true score falls within 50 ± 2σmeas (or between 42 and 58, including 42 and 58);

· 99% (actually, 99.74%) confident that the true score falls within 50 ± 3σmeas (or between 38 and 62, including 38 and 62).

The standard error of measurement, like the reliability coefficient, is one way of expressing test reliability. If the standard deviation of a test is held constant, then the smaller the σmeas, the more reliable the test will be; as rxx increases, the σmeas decreases. For example, when a reliability coefficient equals .64 and σ equals 15, the standard error of measurement equals 9:

Page 171

With a reliability coefficient equal to .96 and σ still equal to 15, the standard error of measurement decreases to 3:

In practice, the standard error of measurement is most frequently used in the interpretation of individual test scores. For example, intelligence tests are given as part of the assessment of individuals for intellectual disability. One of the criteria for mental retardation is an IQ score of 70 or below (when the mean is 100 and the standard deviation is 15) on an individually administered intelligence test (American Psychiatric Association, 1994). One question that could be asked about these tests is how scores that are close to the cutoff value of 70 should be treated. Specifically, how high above 70 must a score be for us to conclude confidently that the individual is unlikely to be retarded? Is 72 clearly above the retarded range, so that if the person were to take a parallel form of the test, we could be confident that the second score would be above 70? What about a score of 75? A score of 79?

Useful in answering such questions is an estimate of the amount of error in an observed test score. The standard error of measurement provides such an estimate. Further, the standard error of measurement is useful in establishing what is called a confidence interval : a range or band of test scores that is likely to contain the true score.

Consider an application of a confidence interval with one hypothetical measure of adult intelligence. The manual for the test provides a great deal of information relevant to the reliability of the test as a whole as well as more specific reliability-related information for each of its subtests. As reported in the manual, the standard deviation is 3 for the subtest scaled scores and 15 for IQ scores. Across all of the age groups in the normative sample, the average reliability coefficient for the Full Scale IQ (FSIQ) is .98, and the average standard error of measurement for the FSIQ is 2.3.

Knowing an individual testtaker’s FSIQ score and his or her age, we can calculate a confidence interval. For example, suppose a 22-year-old testtaker obtained a FSIQ of 75. The test user can be 95% confident that this testtaker’s true FSIQ falls in the range of 70 to 80. This is so because the 95% confidence interval is set by taking the observed score of 75, plus or minus 1.96, multiplied by the standard error of measurement. In the test manual we find that the standard error of measurement of the FSIQ for a 22-year-old testtaker is 2.37. With this information in hand, the 95% confidence interval is calculated as follows:

The calculated interval of 4.645 is rounded to the nearest whole number, 5. We can therefore be 95% confident that this testtaker’s true FSIQ on this particular test of intelligence lies somewhere in the range of the observed score of 75 plus or minus 5, or somewhere in the range of 70 to 80.

In the interest of increasing your SEM “comfort level,” consider the data presented in Table 5–5. These are SEMs for selected age ranges and selected types of IQ measurements as reported in the Technical Manual for the Stanford-Binet Intelligence Scales, fifth edition (SB5). When presenting these and related data, Roid (2003c, p. 65) noted: “Scores that are more precise and consistent have smaller differences between true and observed scores, resulting in lower SEMs.” Given this, just think: What hypotheses come to mind regarding SB5 IQ scores at ages 5, 10, 15, and 80+?

	Age (in years)
IQ Type	5	10	15	80+
Full Scale IQ	2.12	2.60	2.12	2.12
Nonverbal IQ	3.35	2.67	3.00	3.00
Verbal IQ	3.00	3.35	3.00	2.60
Abbreviated Battery IQ	4.24	5.20	4.50	3.00
Table 5–5 Standard Errors of Measurement of SB5 IQ Scores at Ages 5, 10, 15, and 80+

The standard error of measurement can be used to set the confidence interval for a particular score or to determine whether a score is significantly different from a criterion (such as the cutoff score of 70 described previously). But the standard error of measurement cannot be used to compare scores. So, how do test users compare scores?Page 172

The Standard Error of the Difference Between Two Scores

Error related to any of the number of possible variables operative in a testing situation can contribute to a change in a score achieved on the same test, or a parallel test, from one administration of the test to the next. The amount of error in a specific test score is embodied in the standard error of measurement. But scores can change from one testing to the next for reasons other than error.

True differences in the characteristic being measured can also affect test scores. These differences may be of great interest, as in the case of a personnel officer who must decide which of many applicants to hire. Indeed, such differences may be hoped for, as in the case of a psychotherapy researcher who hopes to prove the effectiveness of a particular approach to therapy. Comparisons between scores are made using the standard error of the difference , a statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant. As you are probably aware from your course in statistics, custom in the field of psychology dictates that if the probability is more than 5% that the difference occurred by chance, then, for all intents and purposes, it is presumed that there was no difference. A more rigorous standard is the 1% standard. Applying the 1% standard, no statistically significant difference would be deemed to exist unless the observed difference could have occurred by chance alone less than one time in a hundred.

The standard error of the difference between two scores can be the appropriate statistical tool to address three types of questions:

1. How did this individual’s performance on test 1 compare with his or her performance on test 2?

2. How did this individual’s performance on test 1 compare with someone else’s performance on test 1?

3. How did this individual’s performance on test 1 compare with someone else’s performance on test 2?

As you might have expected, when comparing scores achieved on the different tests, it is essential that the scores be converted to the same scale. The formula for the standard error of the difference between two scores is

where σdiff is the standard error of the difference between two scores, is the squared standard error of measurement for test 1, and is the squared standard error of measurement for test 2. If we substitute reliability coefficients for the standard errors of measurement of the separate scores, the formula becomes

Page 173

where r1 is the reliability coefficient of test 1, r2 is the reliability coefficient of test 2, and σ is the standard deviation. Note that both tests would have the same standard deviation because they must be on the same scale (or be converted to the same scale) before a comparison can be made.

The standard error of the difference between two scores will be larger than the standard error of measurement for either score alone because the former is affected by measurement error in both scores. This also makes good sense: If two scores each contain error such that in each case the true score could be higher or lower, then we would want the two scores to be further apart before we conclude that there is a significant difference between them.

The value obtained by calculating the standard error of the difference is used in much the same way as the standard error of the mean. If we wish to be 95% confident that the two scores are different, we would want them to be separated by 2 standard errors of the difference. A separation of only 1 standard error of the difference would give us 68% confidence that the two true scores are different.

As an illustration of the use of the standard error of the difference between two scores, consider the situation of a corporate personnel manager who is seeking a highly responsible person for the position of vice president of safety. The personnel officer in this hypothetical situation decides to use a new published test we will call the Safety-Mindedness Test (SMT) to screen applicants for the position. After placing an ad in the employment section of the local newspaper, the personnel officer tests 100 applicants for the position using the SMT. The personnel officer narrows the search for the vice president to the two highest scorers on the SMT: Moe, who scored 125, and Larry, who scored 134. Assuming the measured reliability of this test to be .92 and its standard deviation to be 14, should the personnel officer conclude that Larry performed significantly better than Moe? To answer this question, first calculate the standard error of the difference:

Note that in this application of the formula, the two test reliability coefficients are the same because the two scores being compared are derived from the same test.

What does this standard error of the difference mean? For any standard error of the difference, we can be:

· 68% confident that two scores differing by 1σdiff represent true score differences;

· 95% confident that two scores differing by 2σdiff represent true score differences;

· 99.7% confident that two scores differing by 3σdiff represent true score differences.

Applying this information to the standard error of the difference just computed for the SMT, we see that the personnel officer can be:

· 68% confident that two scores differing by 5.6 represent true score differences;

· 95% confident that two scores differing by 11.2 represent true score differences;

· 99.7% confident that two scores differing by 16.8 represent true score differences.

The difference between Larry’s and Moe’s scores is only 9 points, not a large enough difference for the personnel officer to conclude with 95% confidence that the two individuals have true scores that differ on this test. Stated another way: If Larry and Moe were to take a parallel form of the SMT, then the personnel officer could not be 95% confident that, at the next testing, Larry would again outperform Moe. The personnel officer in this example would have to resort to other means to decide whether Moe, Larry, or someone else would be the best candidate for the position (Curly has been patiently waiting in the wings).

JUST THINK . . .

With all of this talk about Moe, Larry, and Curly, please tell us that you have not forgotten about Mary. You know, Mary from the Cronbach quote on page 165—yes, that Mary. Should she get the job at TRW? If your instructor thinks it would be useful to do so, do the math before responding.

Page 174

As a postscript to the preceding example, suppose Larry got the job primarily on the basis of data from our hypothetical SMT. And let’s further suppose that it soon became all too clear that Larry was the hands-down absolute worst vice president of safety that the company had ever seen. Larry spent much of his time playing practical jokes on fellow corporate officers, and he spent many of his off-hours engaged in his favorite pastime, flagpole sitting. The personnel officer might then have very good reason to question how well the instrument called the Safety-Mindedness Test truly measured safety-mindedness. Or, to put it another way, the personnel officer might question the validity of the test. Not coincidentally, the subject of test validity is taken up in the next chapter.

Self-Assessment

Test your understanding of elements of this chapter by seeing if you can explain each of the following terms, expressions, and abbreviations:

· alternate forms

· alternate-forms reliability

· average proportional distance (APD)

· classical test theory (CTT)

· coefficient alpha

· coefficient of equivalence

· coefficient of generalizability

· coefficient of inter-scorer reliability

· coefficient of stability

· confidence interval