
Comparing Early Literacy Assessments: What Really Matters  


By Mariann Lemke, Dan Murphy, Aaron Soo Ping Chow, and Angela Acuña

Research shows that children who are not proficient early readers are at risk of poor social, educational, and economic outcomes. Identifying students who may be at risk of reading difficulty and proactively intervening with instructional supports helps ensure that all students have the foundational skills needed to become successful readers.

Many states now mandate that literacy assessments be administered to all early elementary students (typically in grades K–2 or K–3) to identify those who may need additional support. States take different approaches to providing screening assessments—many have approved lists of commercial products from which districts can choose; a few others require a single assessment.

Recently, WestEd analyzed data from two states that allow their local education agencies to decide which literacy screening assessments to use: Colorado and, more recently, Massachusetts. This blog explores similarities and differences between early literacy screening assessments and how they matter for student learning.

Lists of “Universal Screeners” Include Apples and Oranges

To decide which assessments to approve for their state lists, Massachusetts and Colorado state leaders evaluated each assessment against a state-specific set of criteria. Each assessment vying for a spot on the state’s approved list had to meet requirements related to technical soundness, such as reliability and accuracy, as well as checks that its content aligned with state guidelines. Criteria could also include other factors, such as whether assessments were available in languages other than English, like Spanish. See Massachusetts criteria and assessment summary and Colorado criteria and assessment summary.

However, even when consistent criteria are applied, considerably different assessments can end up on the same list. For example, several of the approved assessments in each state are administered one-on-one by an assessor, with students responding aloud or in writing to the same set of prompts. Others are online computer-adaptive assessments administered in a group setting, where the questions students answer vary based on their performance (and the algorithms that determine which questions are shown can themselves vary).

For example, a “phonics” task on one assessment might include asking students to decode and read aloud nonsense words, whereas on another assessment, a “phonics” task might mean looking at a letter or word shown onscreen and choosing which one matches a sound or word read aloud by the computer (see sample phonics items and WestEd’s report for Colorado, although the state has since updated its list of approved assessments).

Sample “phonics” items

Even if the content of a particular assessment aligns with a set of minimum state requirements, the assessment might also cover other topics or skills. One assessment, for example, includes a measure of “visual discrimination,” which requires students to identify words that match or differ, while other commonly used assessments include no such measure. Assessments also vary in whether, and in which grades, they measure comprehension (listening or reading), vocabulary, or other topics. As a result, whether students are identified as at risk can depend on their performance on different content and different tasks, administered in different ways.

So putting them all on a single list of approved “universal screeners” is a bit like comparing apples and oranges: all carry the same label despite their differences.

Lists of “Universal Screeners” Even Include Different Kinds of Apples and Different Kinds of Oranges

In each state, approved assessments are intended to be used for a very specific purpose—identifying students whose performance suggests they are at risk for future reading challenges so that schools and districts can intervene early. That means the tests need to define a specific level of performance that indicates risk. These definitions, like the assessments themselves, vary.

For example, one assessment takes a normative approach to setting risk levels: within each grade and testing period, scores below the 25th percentile indicate students who need “intervention,” and scores below the 10th percentile indicate students who need “urgent intervention.” Benchmarks for other assessments are based on data about students’ performance later in that grade or in future grades. One assessment’s benchmarks, for example, are the scores that the publisher’s research found most accurately predict student performance on a different assessment at the end of the school year.
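To make the normative approach concrete, here is a minimal sketch in Python. It uses simulated scores and treats the local sample itself as the norm group; actual publishers derive percentiles from their own norm samples by grade and testing window, and their specific rules differ, so the function and thresholds below are purely illustrative.

```python
import numpy as np

def classify_by_percentile(scores, urgent_pct=10, intervention_pct=25):
    """Flag students by where their scores fall in a norm distribution.
    Illustrative only: real screeners use publisher norm samples
    (by grade and testing window), not the local sample used here."""
    scores = np.asarray(scores, dtype=float)
    urgent_cut = np.percentile(scores, urgent_pct)
    intervention_cut = np.percentile(scores, intervention_pct)
    labels = []
    for score in scores:
        if score <= urgent_cut:
            labels.append("urgent intervention")
        elif score <= intervention_cut:
            labels.append("intervention")
        else:
            labels.append("not flagged")
    return labels

# Hypothetical fall screening scores for one grade
rng = np.random.default_rng(0)
fall_scores = rng.normal(loc=450, scale=40, size=200)
print(classify_by_percentile(fall_scores)[:5])
```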

The meaning of the risk levels can even vary within an assessment. For example, kindergarten benchmarks on one assessment were set based on research using an earlier version of the assessment itself. So, in essence, kindergarteners identified as at risk are those who would be expected to perform poorly, compared to their peers, on very similar tasks at the end of the year. For other grades, benchmarks are set based on research using a very different type of test (a group-administered multiple-choice test rather than an individually administered set of tasks where students respond aloud).

The lists of approved assessments in the two states for which WestEd has analyzed data include assessments that are, in some cases, qualitatively very different, even though they are intended to serve the same purpose. That means individual students could be classified as at risk on one assessment but not another. So if states combine the numbers of “at-risk” students from various assessments to get a summary picture of risk across the state, there is inevitably some “noise” in those summary numbers.

Do Differences Between Assessments on an Approved List Matter?

In each state, part of WestEd’s work has included analysis that links screening assessment benchmarks to the state assessment in English language arts (ELA) in grade 3. This analysis gives us an empirical way to compare definitions of risk across assessments using state assessments as a common metric. Figure 1 shows how benchmark scores for literacy screening assessments used in the state map to the Massachusetts Comprehensive Assessment System (MCAS) in ELA. A similar figure for Colorado can be found in WestEd’s report linked from the state’s website.

Figure 1: Literacy Screening Assessment Benchmark Scores Mapped to the MCAS Scale (Grade 3)

Figure 1 maps the benchmark scores that indicate risk on each literacy screening assessment to the MCAS scale. For each assessment, risk is indicated by scores below the following benchmark: FastBridge aReading, Low Risk; mCLASS, At Benchmark; DIBELS 8th Edition, At Benchmark; i-Ready, Early On Grade Level; Star Reading, At or Above Benchmark; Lexia RAPID, High Likelihood of Success; MAP Growth, No Intensive Intervention.
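As a rough illustration of how a benchmark score on a screener can be expressed on a state assessment scale, the sketch below applies a simple equipercentile-style link to hypothetical paired data. The variable names, simulated scores, and benchmark value are invented for illustration, and the actual analyses may rely on different linking or prediction methods.

```python
import numpy as np

def benchmark_on_state_scale(screener_scores, state_scores, benchmark):
    """Find the percentile rank of a screener benchmark, then return the
    state assessment score at that same percentile (equipercentile-style
    link). Simplified sketch on hypothetical paired data; not necessarily
    the method used in the analyses described in this post."""
    screener_sorted = np.sort(np.asarray(screener_scores, dtype=float))
    state_sorted = np.sort(np.asarray(state_scores, dtype=float))
    pct_rank = np.searchsorted(screener_sorted, benchmark) / len(screener_sorted)
    return float(np.quantile(state_sorted, pct_rank))

# Hypothetical paired scores for the same group of students
rng = np.random.default_rng(1)
screener = rng.normal(loc=400, scale=50, size=1000)
state_ela = 500 + 0.6 * (screener - 400) + rng.normal(0, 20, size=1000)
print(round(benchmark_on_state_scale(screener, state_ela, benchmark=380)))
```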

In both states, benchmarks that indicate risk generally correspond to a level below the state’s “Meeting Expectations” cut score, although there is some variation among them. In Massachusetts, benchmarks indicating risk correspond to the “Partially Meets” performance level. For example, DIBELS 8th Edition’s “At Benchmark” cut score maps to a value of about 490 on the MCAS, while the cut score to meet state expectations is 500. Other screeners all fall in the “Partially Meets” range. So despite their differences in content, administration, and definitions of risk, screening assessment benchmarks appear to identify students with relatively similar types of state assessment performance.

This finding gives states and districts that are choosing assessments, whether for an approved list or from one, a reason to think more carefully about when similarities and differences matter. The purpose of early literacy screening is to identify students who might need additional support in reading. If all the assessments that met the state’s bar for approval identify students with broadly similar performance, then districts and schools choosing among them should weigh which similarities and differences matter most in their own context.

For some schools, perhaps the length of the test really matters. Or maybe a district needs a group-administered assessment because one-to-one testing isn’t feasible. Or maybe a school prefers an assessment where students respond aloud so they can hear examples of where or when students struggle. Maybe teachers find one assessment’s reports easiest to interpret and use.

For schools and districts using screening assessments, identifying students at risk is just a first step. Schools then need effective supports available to help those students, and they need to monitor whether those supports are working. Perhaps even more important, schools, districts, states, families, and communities must provide strong early learning opportunities so that fewer students are identified as at risk in the first place. States, districts, families, and others using data from literacy assessments should not ignore the differences among assessments. But everyone using them should keep the focus on making sure all children have access to what they need to become proficient readers.

Mariann Lemke is a Senior Associate at WestEd who has over 20 years’ experience managing assessment and evaluation projects at the federal, state, and local levels. As a Senior Research Associate at WestEd, Dan Murphy works on assessment research and development, with a focus on advanced measurement techniques and innovative assessment.  Aaron Soo Ping Chow is a Research Associate in Literacy at WestEd. Angela Acuña is a Research Assistant with English Learner and Migrant Education Services at WestEd. 
