Interim Assessments: Why Evaluating Them Is So Hard 

By Marianne Perie, Senior Research Director, Assessment and Accountability at WestEd. She provides deep measurement expertise that draws on her more than two decades of experience working to improve educational equity through high-quality research.

Education Week recently published an article regarding an attempted evaluation of interim assessment products (Schwartz, 2023). Building on its work evaluating curriculum products, EdReports worked with the Center for Assessment to develop a framework for evaluating the technical quality of interim assessments in the context of their intended use(s). Their protocol would have provided ratings of interim assessments on three criteria (Landl & Lyons, 2023):

  • alignment of the assessment to the expectations of college and career-ready standards and evidence that the assessment is fair and accessible for all students in the intended test-taking population
  • evidence of technical quality based on the types of information vendors provide related to student performance (i.e., achievement, predictive, subscores, and growth) and the ways in which they intend for that information to be used
  • information for interested parties about score reports and supports that ensure assessment data are interpreted correctly and appropriately for use in multiple contexts

The framework, now released as the District Assessment Procurement Protocol (DAPP), includes criteria beyond what is typically evaluated by organizations such as the National Center on Intensive Intervention (NCII) or the Buros Center for Testing at the University of Nebraska-Lincoln (Buros), and it adds educator feedback to the reviews. Unfortunately, most vendors did not agree to participate in the EdReports evaluation work, citing the resource-intensive process of providing evidence, the overly strict requirements, and the potential risk of negative findings. Two vendors did sign on initially: Curriculum Associates for its i-Ready product and Smarter Balanced for its Interim Assessment Blocks. However, Smarter Balanced later backed out, citing concerns with the rigidity of the required evidence, and the review project was ultimately tabled in May 2023.

“Most districts lack the capacity and the expertise to put these different component pieces together, interpret the technical manuals, and put these in a synthesized report.” —Eric Hirsch, Executive Director, EdReports

EdReports’ proposed evaluation, which was public, transparent, and focused on purpose and use, is ideal for helping potential users make informed decisions when choosing the right assessment. Why? Because assessments are tools, and tools work best when used for their intended purposes. For instance, a basic cellphone is great at making calls and sending text messages but only mediocre at taking pictures of animals in motion.

Even though the EdReports evaluation has been paused, vendors could still use the resulting framework for internal and proprietary evaluations, comparing the claims their sales staff make in marketing conversations with the available evidence of the assessment’s validity for those stated purposes and uses. In addition, districts could use parts of the framework to decide which assessment is best for them.

Evaluations Based on Purpose and Use Matter

Districts purchase interim assessments, and educators use them for multiple purposes, including

  • to get periodic updates on student learning,
  • to set learning goals within an instructional period,
  • to monitor progress and demonstrate growth within a school year, and
  • to find areas of strengths and weaknesses within or across classrooms to adjust teaching strategies.

Each of these purposes requires specific evidence for that use. For instance, a teacher who wants to know the specific areas of strength and weakness in her fourth graders’ ability to add and subtract fractions and decimals will need an assessment with questions that target specific knowledge and skills within those standards. However, a teacher who wants to determine how much their fourth graders have grown in mathematics as a whole from fall to winter will want items that represent the domains and standards of interest more broadly and that are sensitive to growth. The first teacher will want a report that breaks the standards for fractions and decimals down into smaller knowledge and skills and highlights which of those skills students have mastered and which they have not.

The second teacher will be better served by a graphic tracking growth over time for the classroom as a whole and for each student with comparisons to a normed sample. Thus, determining the degree to which an assessment can meet either of these two purposes and uses requires different evidence. Furthermore, assessments that purport to focus on only one purpose and use should not automatically be evaluated on every possible purpose and use.

Information Currently Available to Districts for Internal Evaluation

Districts are the most common users of interim assessments, but they often lack the information they need to make informed decisions about the best assessment for their specific purpose and use case. All interim assessment vendors have marketing materials they share and sales representatives who will provide information to districts. However, districts often need an unbiased source to help them compare products.

Some information of interest to districts is typically easy to discover, such as:

  • which assessments neighboring districts are using,
  • computer platform requirements,
  • ease of loading student data and producing reports,
  • time needed/length of assessment,
  • accommodations available,
  • cost, and
  • professional development for end users provided by the vendor.

However, more technical information specific to the assessment’s intended purpose and use can be difficult to find. Moreover, tying the technical information to the intended purpose and proposed use is rarely done.

Currently, there are some high-quality external reviews provided by NCII and Buros. These reviews include many of the necessary criteria for a strong interim assessment; however, they lack focus on intended purpose and use and do not include feedback from educators who have used the assessment. Reviews conducted by NCII tend to provide information on the quality of the tools based on technical categories, such as

  • validity, reliability, and bias analyses of total scores;
  • validity and reliability of growth scores; and
  • usability information regarding administration format, time of testing, scoring format, and available benchmarks.

In its reviews, NCII asks each vendor to provide its stated purpose and its validity evidence. However, the two are rarely linked. Validity evidence typically focuses on concurrent validity (correlations with results from other assessments) and predictive validity (correlations with state assessments given in the spring).

Reviews provided by Buros also contain detailed information; however, they do not evaluate as many assessments, and the annual yearbook cannot keep up with the influx of products in the K–12 market.

States often attempt to support their districts in making decisions about interim assessments by providing information such as

  • alignment of the test items to the state standards,
  • accessibility and fairness of the assessment,
  • degree of accuracy in projecting proficiency on the state assessment,
  • reliability to make student- or classroom-level decisions, and
  • inclusion of growth measures that allow schools or districts to examine within-year growth and set meaningful targets for students.

Some states go further and create “approved” lists of interim assessments that have met certain state criteria. Oftentimes, states with approved lists allow districts to use state funds to purchase only those assessments on the list. And while these lists may provide some useful information, they rarely address the purposes for which the districts may want to use the data.

For example, the state may indicate that the assessment is aligned to the state standards but not provide detail on the level of the alignment. Are the items aligned at the domain level, or do they include items aligned to specific points within a learning progression of those standards? Likewise, unless one does a deep dive for each operational definition, a label like “fair” could be interpreted in multiple ways.

Evidence could reflect simple item statistics or the inclusion of diverse item writers and reviewers. Moreover, evidence on the interpretability and usability of score reports is sorely lacking. Current reviews rarely indicate the degree to which educators understand the data, interpret it correctly, or use it appropriately. Those elements are critical in establishing the validity of an assessment for a given district’s use.

Hope for the Future: Vendor Self-Evaluations and District Evaluations

So, how can the field still take advantage of the EdReports effort to externally evaluate interim assessments? The Center for Assessment and Lyons Assessment Consulting recently released guidance for district leaders based on the EdReports review process. This tool, DAPP, is publicly available for districts or others to use.

First, the vendors themselves should review the DAPP framework and required evidence. Even if the data are not shared publicly, vendors should take it upon themselves to determine the degree to which they could assemble the evidence and conduct internal reviews of their assessments. The insights they gain from using the protocol for self-evaluation would go a long way toward refining and improving their products.

Curriculum Associates, a participant in the original EdReports process, noted that it found the process to be “resource intensive” but “well worth it.” According to Kristen Huff, Vice President of Assessment and Research at Curriculum Associates, the process “forced us to take all of these rich rationales that we had in our minds and in our words and put them on paper.”

Likewise, although Renaissance, the interim assessment company that produces the Star assessments, ultimately chose not to submit evidence on its assessments as part of the EdReports process, Darice Keating, its Senior Vice President of Government Affairs, stated, “There’s value in this particular type of project because districts, states, others are very interested in how they evaluate one assessment provider over another. So, I think it’s helpful to look at what those claims would be and whether those claims are met.”

“Our intention is to arm district people with resources so they can ask their test vendor good questions about what the product is and its technical quality.” —Susan Lyons, DAPP co-author

Second, districts could also use the protocol to request relevant evidence from potential vendors. They could then assemble committees of district and school personnel to review the evidence to determine the best product for their purpose. The DAPP framework recommends interviews with current users of each assessment focused on specific questions, which could be immensely helpful to districts as they select tools for their teachers.

Erika Landl and Susan Lyons, DAPP protocol authors, noted that the protocol was revised specifically to guide district leaders in their thoughtful selection of an interim assessment (Gewertz, 2023). Landl indicated that the original work was extremely comprehensive and technical but that they revised it to create the current DAPP for district leaders.

Using the DAPP as a guide, district leaders can first work with their educators to determine the need for the assessments and clarify how they will be used. Armed with that information, district leaders should be able to ask better questions about topics ranging from technical quality to the usability of score reports and make their best judgment about the test with the strongest evidence that it meets their intended purpose and use.

If this tabled effort to thoroughly evaluate interim assessments still leads vendors to reflect on the quality of their products and district leaders to ask more informed questions about the purpose and use of assessments, it will not have been wasted.

References

Gewertz, C. (2023). How to choose local assessments: A guide for district (and state) leaders. National Center for the Improvement of Educational Assessment. Retrieved August 4, 2023, from https://www.nciea.org/blog/how-to-choose-local-assessments/

Landl, E., & Lyons, S. (2023). Choosing the right tests: The District Assessment Procurement Protocol (DAPP). EdReports.org, Center for Assessment, Lyons Assessment Consulting.

Schwartz, S. (2023, July 24). ‘Interim’ tests are used everywhere: Are they good measures of student progress? Education Week. https://www.edweek.org/teaching-learning/interim-tests-are-used-everywhere-are-they-good-measures-of-student-progress/2023/07
