Great Careers in Testing

Dr. Robert Brennan

Dr. Robert L. Brennan is E. F. Lindquist Chair in Measurement and Testing, and Director of the Center for Advanced Studies in Measurement and Assessment (CASMA), in the College of Education of The University of Iowa.

Dr. Brennan received his doctorate from Harvard Graduate School of Education in 1970. He was on the faculty at the State University of New York at Stony Brook until 1976. From then until 1994, he held various positions at ACT, including Distinguished Research Scientist. In 1994 Dr. Brennan was appointed E. F. Lindquist Chair in Measurement and Testing and Director of the Iowa Testing Programs. He resigned the latter position in 2002 and founded CASMA.

Dr. Brennan is the author or co-author of numerous journal articles and several books, including Generalizability Theory (2001) and Test Equating, Scaling, and Linking: Methods and Practices (2004). He is the editor of the fourth edition of Educational Measurement.

Dr. Brennan has served as President of the Iowa Academy of Education, Vice-President of Division D of the American Educational Research Association (AERA), and President of the National Council on Measurement in Education (NCME).

Dr. Brennan was the co-recipient of the 1997 NCME Award for Outstanding Technical Contribution to the Field of Educational Measurement, recipient of the 2000 NCME Award for Career Contributions to Educational Measurement, recipient of the 2004 E. F. Lindquist Award for Contributions to the Field of Educational Measurement sponsored by the AERA and ACT, and recipient of the 2011 Career Achievement Award from the Association of Test Publishers.

Dr. Brennan’s Career

How did you first get interested in measurement and testing? Was there a light bulb moment where you knew that was the direction you were headed?
In 1967 I was one of about 30 persons in the US who were awarded a one-year NSF Fellowship to attend Harvard Graduate School of Education (HGSE) to study for an MAT in mathematics. The program requirements were quite lenient; indeed, we were encouraged to explore a wide variety of course offerings. Soon, I began taking courses in statistics and computers, rather than pure math. After the first year, I was admitted into the EdD program and began to veer even further away from math, largely because of an assistantship I received that involved evaluating the effectiveness of computer-aided and multimedia instruction. To do so, I needed some background in educational measurement, but HGSE had no courses in that area at the time. So, I began studying measurement on my own. To this day, I have never had a formal course in measurement, except for an undergraduate course that I was required to take to get certified to teach in Massachusetts. By the time I was ready to write a dissertation in the late 1960s, my professional focus was almost entirely educational measurement, and that has remained the case throughout my career. From my days at HGSE until now, I have had a particular interest in conceptualizing, quantifying, and reducing sources of error in measurement procedures.

What were some of your strongest vocational influences?
When I was at HGSE, my assistantship was in the Computer-Aided Instruction Laboratory under the direction of a psychologist named Lawrence M. Stolurow. Larry was definitely not a measurement person, but he was a superb mentor who gave me a great deal of latitude to learn and to experiment with numerous ideas (some of which actually worked!). I doubt that I would be in the measurement field today if Larry had not supported my efforts.

In the early 1970s I was a consultant on several federally funded projects that involved early childhood education. The feds mandated that a fraction of that funding be used for evaluating the projects. A persistent focus of the evaluations involved using rating instruments to assess the effectiveness of various educational innovations at the classroom level. This raised the question of how reliable such instruments were when the focus was on class means rather than individual students. Those of us working on this problem came up with three answers, but no basis for deciding which answer was "right" or even preferable.

In 1972, I decided that I was going to devote the entire summer to trying to solve this problem. I went to the library at SUNY Stony Brook where I was a brand new assistant professor, and I found a just-published book by Lee J. Cronbach and some of his colleagues on the subject of generalizability (G) theory. I read the book cover to cover twice (it is a tough book to read). Then, in the middle of the third reading, I had a "eureka" moment. I knew immediately why we had three answers, what the differences were among them, and how to think about which answer was preferable. That experience hooked me on G theory, which still remains a principal focus of my research and strongly influences how I think about measurement. Lee Cronbach taught me a lot even though I was never in even one of his classes.

In 1973 I chaired a committee to hire a new assistant professor of measurement and statistics at Stony Brook. We hired Michael T. Kane, a Stanford PhD who actually had taken a course or two from Cronbach. Quickly I discovered that Mike also had been working on the problem of reliability of group means, but from a different perspective. That led to a joint publication on the topic. At about the same time he and I also jointly authored two papers on criterion-referenced indices of reliability. We have maintained a strong professional relationship and personal friendship for nearly 40 years. Michael’s research and his critiques of mine have substantially influenced my own research.

In 1981 I hired Michael J. Kolen to work with me in the Measurement Research Department of ACT. Our work focused primarily on operational equating and scaling for both resident ACT programs and external contracts. This work led to numerous publications. In particular, we jointly authored a book published by Springer-Verlag in 1995 on the subject of equating and scaling. The second edition appeared in 2004, and the third edition is in progress. This book and the associated research have been a central focus of my career since the 1980s. Working with Michael in these areas has been invaluable, challenging, and rewarding.

What have been the most enjoyable areas of your career in measurement and testing? What do you most enjoy thinking about?
The part of my career that I enjoy the most centers on research. I very much enjoy the challenge of trying to solve a measurement problem, to understand some aspect of measurement more deeply, and to communicate (through writing or teaching) an understanding of measurement issues. Probably the one area that I have attended to more than any other is conceptualizing, quantifying, and (to the extent possible) reducing measurement error.

Tell us about your work in Paradoxes in Educational Measurement.
Throughout my career I have had a strong interest in paradoxes in educational measurement. This interest reflects, in part, my belief that advances in science generally occur through efforts to resolve paradoxes. A well-known measurement paradox is the so-called reliability-validity paradox which occurs when an increase in reliability is associated with a decrease in validity. Generalizability theory provides a rather elegant explanation of this paradox.

About 20 years ago, in 1991, I wrote a paper about paradoxes in educational measurement that summarized my thinking about this topic up to that time. I recently reread that paper and still believe that it is pretty much on target. More recently, I have been attempting to understand some of the paradoxes through an in-depth consideration of similarities and differences among measurement models such as classical theory, generalizability theory, and item response theory.

What is Classical Test Theory?
In my opinion, classical test theory (CTT) is the most basic measurement theory, and still one of the most useful theories. (The foundations of CTT began to be formulated over 100 years ago.) CTT postulates that any observed score (X) can be decomposed into two parts: a true score (T) and an error score (E); i.e., X = T + E. As simple as this equation is, it has one very real complexity: T and E are both unobservable quantities. To get around this problem, we typically define T for a person as the expected (i.e., average) observed score that the person would get if that person were tested a very large number of times. Then, with some additional assumptions, it is possible to derive a surprisingly large number of very useful results, including formulas for standard errors of measurement and reliability coefficients.
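
As a brief illustration of the kind of results referred to above, the standard textbook CTT identities can be sketched as follows. This is not taken from the interview. It assumes that error scores are uncorrelated with true scores over a population of persons, and the symbols beyond X, T, and E (the variances, the reliability coefficient \rho_{XX'}, and the standard error of measurement \sigma_E) are conventional notation introduced here.

```latex
\begin{align*}
X &= T + E, \qquad T = \mathbb{E}(X)
  && \text{(expectation over repeated testing of one person)} \\
\sigma^2_X &= \sigma^2_T + \sigma^2_E
  && \text{(variances over persons; $T$ and $E$ uncorrelated)} \\
\rho_{XX'} &= \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}
  && \text{(reliability coefficient)} \\
\sigma_E &= \sigma_X \sqrt{1 - \rho_{XX'}}
  && \text{(standard error of measurement)}
\end{align*}
```

The last line follows directly from the third by solving for \sigma^2_E, which is why an estimate of reliability and the observed-score standard deviation are enough to estimate the standard error of measurement.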

What are some of the key ideas of your book Generalizability Theory?
Generalizability (G) theory liberalizes classical test theory in a number of ways. At its simplest level, this liberalization occurs through allowing the undifferentiated error (E) in CTT to be decomposed into various distinguishable parts. This permits a G theory analysis to specify how much of the total error variance is attributable to various sources (e.g., sampling of items, raters, occasions, etc.). G theory also differs from CTT in that G theory has a very rich conceptual framework and statistical machinery that permits it to address problems that are outside (or at the periphery) of CTT.
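
To make the idea of decomposing error concrete, here is a minimal sketch of the simplest G theory case: a single-facet design in which every person answers every item (persons crossed with items). It is not the GENOVA program and not code from the book. The function names `g_study_p_by_i` and `d_study_coefficients` are hypothetical, and the estimators are the usual ANOVA expected-mean-square formulas for this design.

```python
def g_study_p_by_i(scores):
    """Estimate variance components for a crossed persons x items design.

    `scores` is a list of rows: one row per person, one column per item,
    with no missing data (an assumption of this sketch).
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[i] for row in scores) / n_p for i in range(n_i)]

    # Mean squares for persons, items, and the residual (interaction + error).
    ms_p = n_i * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_i = n_p * sum((m - grand) ** 2 for m in item_means) / (n_i - 1)
    ss_res = sum(
        (scores[p][i] - person_means[p] - item_means[i] + grand) ** 2
        for p in range(n_p)
        for i in range(n_i)
    )
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Solve the expected-mean-square equations for the variance components.
    # Estimates can come out negative in small samples; they are returned raw.
    return {
        "p": (ms_p - ms_res) / n_i,   # persons (universe-score variance)
        "i": (ms_i - ms_res) / n_p,   # items
        "pi,e": ms_res,               # person-by-item interaction plus error
    }


def d_study_coefficients(components, n_items):
    """Return (generalizability coefficient, dependability index Phi)
    for a test of `n_items` items, given G study variance components."""
    rel_error = components["pi,e"] / n_items
    abs_error = (components["i"] + components["pi,e"]) / n_items
    g_coef = components["p"] / (components["p"] + rel_error)
    phi = components["p"] / (components["p"] + abs_error)
    return g_coef, phi


if __name__ == "__main__":
    # Illustrative data only: 3 persons by 2 items.
    data = [[1, 3], [2, 4], [3, 5]]
    parts = g_study_p_by_i(data)
    print(parts)
    print(d_study_coefficients(parts, n_items=2))
```

The two coefficients differ only in which error sources they count: the item component enters the error term for absolute decisions but not for relative (rank-order) decisions, which is one small example of what distinguishing sources of error makes possible.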

As mentioned earlier, the 1972 seminal monograph on G theory by Cronbach and his colleagues was extraordinarily comprehensive but very difficult to read. In 1982 I wrote a monograph entitled "Elements of Generalizability Theory" (EGT) that was less comprehensive than the Cronbach et al. monograph but simpler to read. I also co-authored a computer program named GENOVA that was coordinated with EGT, which was revised slightly in 1992. Then in 2001 Springer-Verlag published a book I authored entitled "Generalizability Theory" which is currently the most comprehensive book on the topic.

Equating is one of the courses you teach to graduate students – please tell us about it.
For testing programs that consist of multiple forms, which is usually the case for large-scale testing programs, the various forms are never perfectly comparable in difficulty, no matter how well developed the forms may be. Equating is a statistical process that adjusts for differences in difficulty among test forms so that the scores reported to examinees are as comparable as possible. The ideal goal is that it should be a matter of indifference to each and every examinee which form of a test he/she takes. That ideal is never reached perfectly, but equating (in conjunction with good test development) can help ensure that scores are comparable enough for practical purposes.
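
One of the simplest equating methods, linear equating, gives a feel for what "adjusting for difficulty" means in practice. The sketch below is illustrative only. It assumes a random-groups design (comparable groups of examinees took Form X and Form Y), and the function name `linear_equate` and the score lists are hypothetical. Operational programs typically use more elaborate designs and methods.

```python
from statistics import mean, pstdev


def linear_equate(x, form_x_scores, form_y_scores):
    """Convert a Form X raw score `x` to the Form Y scale.

    Linear equating sets standardized deviation scores on the two forms
    equal, so the converted score sits the same number of standard
    deviations from the Form Y mean as `x` does from the Form X mean.
    """
    mu_x, mu_y = mean(form_x_scores), mean(form_y_scores)
    sd_x, sd_y = pstdev(form_x_scores), pstdev(form_y_scores)
    return (sd_y / sd_x) * (x - mu_x) + mu_y


if __name__ == "__main__":
    # Made-up scores: Form X appears harder (lower scores) than Form Y.
    form_x = [10, 20, 30]
    form_y = [12, 24, 36]
    for raw in form_x:
        print(raw, "on Form X is treated as", linear_equate(raw, form_x, form_y), "on Form Y")
```

Under the random-groups assumption, differences between the two score distributions are attributed to the forms rather than to the examinees, which is what justifies the conversion.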

What are measurement models?
I use the phrase "measurement models" to refer to statistical-type models that are used in the field of measurement. Under this definition, probably the most frequently discussed models are classical test theory, generalizability theory, and item response theory. In my lexicon, I also refer to models used in equating and scaling as measurement models. Measurement models have one or more latent variable components (e.g., true score, error score, ability, proficiency, etc.). The various models often differ substantially, however, with respect to assumptions and with respect to how latent variables are conceptualized.

What was your inspiration behind founding CASMA? What is the mission of CASMA?
The primary purpose of CASMA is to pursue research-based initiatives that lead to advancements in the methodology and practice of educational measurement and assessment. Inter-disciplinary measurement and assessment activities, as well as international activities, may be pursued when they relate to the primary mission.

The University of Iowa has a long tradition and international reputation in the field of educational measurement. It is intended that CASMA build upon and extend that reputation. The long-term vision for CASMA is that it be recognized nationally and internationally as one of the premier interdisciplinary centers that performs, promotes, fosters, and disseminates high quality research in measurement and assessment.

Testing Careers and Education

For people considering a career in psychometrics, what are important qualities, skills or talents they should possess to be successful?
For persons who wish to pursue a career in psychometrics, strong mathematical skills and coursework are almost always a necessary prerequisite (calculus is useful, although not necessarily required). Also, knowledge of statistical programs such as SAS and R is very helpful, and experience in a programming language such as C, C++, or C# can be very useful. Good English writing skills are essential for positions that lead to advancement in most testing companies.

In the United States, how many people are currently employed in testing? From that number, what are the largest specialties?
I have spent my career almost exclusively in the area of educational testing, as well as licensure and certification testing. There is also considerable work on testing done in numerous industrial and organizational settings, as well as governmental environments. Broadly speaking, there are probably three main specialty areas: test development, psychometric support and research, and administration. It is very rare for someone to go into administration directly out of graduate school; almost always someone in administration started out in test development or psychometrics. There are also, of course, faculty positions in measurement at universities, but there are relatively few open positions these days.

I do not believe there is currently any count of the number of persons in the testing field, largely because there are so many different environments within which testing occurs. In the area of educational testing, the primary organization is the National Council on Measurement in Education (NCME), which has under 2000 members, but there are many people who work in educational testing environments who are not NCME members. It is undoubtedly true that there are not nearly as many trained measurement professionals as there are positions in measurement. This is one of the major challenges faced by the field.

What fields of testing are growing in demand and which are declining?
I don’t think there are any fields of testing that are declining in demand. In the last ten years, there has been a growing demand for measurement professionals in the PreK-12 testing arena. I expect this demand to increase, and perhaps dramatically so, in the future given initiatives such as Race-to-the-Top (RTTP), the Common Core State Standards (and associated assessment work by the PARCC and Smarter-Balanced consortia), and likely revisions in No Child Left Behind (NCLB).

What is the difference between measurement and assessment?
Many people use the words "measurement" and "assessment" pretty much synonymously. I tend to use "measurement" in a global sense that encompasses both measurement models and associated instruments, and I reserve the word "assessments" for the instruments themselves.

What types of assessment degrees are offered?
There are both MA degrees and PhD degrees in measurement. I do not know of any programs that award a BA in measurement.

Testing Today

What can you tell us about incorporating learning into assessment? Is there a time and place for this or should the two areas stay separate?
I don’t think there is much talk about incorporating learning into assessment, but there is a great deal of interest in incorporating assessment into learning. Much of this interest focuses on what is called "formative assessment." I think that virtually all measurement professionals acknowledge the importance of formative assessment, but good formative assessment is challenging, often expensive, and frequently takes a back seat to summative assessment.

As a society, do we test too much or not enough? How is the U.S. different from other developed countries?
I don’t think it would be correct to say that we test too much in the U.S., or that we do not test enough. Rather, I would say that much of the testing in PreK-12 in the U.S. that resulted largely from NCLB has not been nearly as good as I would like. Almost all developed countries have a national testing program of one type or another. The U.S. is the exception. At the present time, every state has its own testing program. Almost certainly that will change in a few years when assessments being developed by the PARCC and Smarter-Balanced consortia (based on the Common Core State Standards) are ready for use.

If you had the power to change the world of assessment, what would be the first 3 things you would do?
Require that all teachers and school administrators demonstrate substantial knowledge of measurement and testing.

Require that all large-scale testing programs have timely, well-written, informative technical documentation that addresses important measurement issues such as validity, reliability, equating, and scaling using appropriate, up-to-date methodologies.

Increase the number of measurement professionals and diversify the routes whereby they become measurement professionals. Develop certification procedures for measurement professionals.

What are the most important elements of a well-developed test?
The quick answer is: a well-developed, detailed, and unambiguous table of specifications.

What do you consider to be among the top achievements or innovations in the history of assessment?
The developments of classical test theory, generalizability theory, and item response theory; the application of technology (especially optical scanning and computers) to testing.

What percentage of tests are currently taken online, and what do you feel that percentage will be in 5 years and in 20 years?
I don’t think anyone knows the answer to this question, except that the percentage is substantially greater than it used to be and will likely get much greater in the coming decades.

Please give us a glimpse into how you see the future of assessments.
Almost certainly there will be an increase in the use of technology in testing, not only for the administration of tests but also for the scoring of examinee responses to many different types of items and prompts.

Historically, testing has been an important vehicle that developing countries have used to help advance their economies and status within the world. I think that will be even more true in the future.

International testing for the purpose of comparing the educational attainments of students in different countries has increased dramatically in recent decades. My guess is that such testing will increase even more in the future.

In general, I think there will be more testing in both developed and developing countries in the future. I am concerned, however, that the quality of testing may not be as good as it needs to be, for two reasons. First, the public and politicians routinely underestimate how time-consuming and difficult it is to create good tests and, consequently, testing initiatives too often suffer from unrealistic resource constraints. Second, even now we do not have nearly as many well-trained measurement professionals as we need. Solving the latter problem is particularly important.