Equity in Measuring School Quality:
Comparing the Robust and Equitable Measures to Inspire Quality Schools (REMIQS) Framework and State Accountability Systems
Eric Toshalis, Mary Rauner, Bradley Quarles, Cailyn Torpie, and Noman Khanani
Eric Toshalis:
Welcome, everyone. Thank you so much for joining us today for this presentation about the report that we are releasing today entitled “Comparing the Robust and Equitable Measures to Inspire Quality Schools, the REMIQS framework, and State Accountability Systems.” I’m joined by my partners from the WestEd team who will be introduced shortly today to present to you the model and the analysis that we have conducted and its implications for the field. The objectives for today, next slide please, are as follows. We intend to introduce the model and the project.
A little bit of background about how we got to where we are today. Outline the process and the content for the REMIQS framework. And then describe the analysis that we conducted with a level of granularity that won’t be as detailed as you’ll find in the report, but will be enough to give you an idea of why we did what we did, how we did what we did, and what we revealed in that process. We’ll summarize some of the study findings. And then at the end we’ll provide some suggestions for revisiting state accountability systems based upon what we learned. And we’ll finish with a question and answer period for those of you who are attending.
With that, I just wanna jump into a little bit of background about how we got to where we are today with the REMIQS Project. Basically, this project began back in late 2017 as an inquiry that was generated by the funders in what was then the Student-Centered Learning Research Collaborative that I directed at Jobs for the Future. Those funders were the Barr Foundation, the Nellie Mae Education Foundation, the Oak Foundation, the Schusterman Family Philanthropies, and the Carnegie Corporation of New York. The Barr Foundation was our anchor funder for this work.
We formed an exploratory group made up of Ezekiel Dixon-Román, Jack Schneider, Kara Finnigan, Paul Leather, and Hunter Gehlbach to look at the possibility of redesigning or influencing the way that school quality was being measured across states and at the federal level in order to try to get us away from an over-reliance on test scores. Our original questions in that research were: How might we better measure school quality beyond our over-reliance on test scores to capture where equity is being achieved, not just to identify failing schools and therefore participate in new accountability measures, but to find out where the good sites are and where equity is actually being achieved? How might we do that by leveraging, merging, and matching multiple existing data sets?
And then, where are the high schools that tend to rise above the rest when it comes to achieving that equity? We wanted to use that model to locate those schools and then eventually to study them to determine what those schools do to achieve that outlier level of equity compared to other similar schools. The phases involved a feasibility study with leading scholars and the Urban Institute in 2018 and 2019. And I put in the chat, as you’ll see there, a summary report of what we learned during that feasibility study. Ultimately, we determined, yes, there’s a there there, and we should go deeper into this and figure out what kind of model we could build based upon available data.
We then refined the model with the Urban Institute from 2018 to 2020. A series of reports were released from that, and I will put those in the chat. And then there were in-depth case studies of participating state data systems after that. This is when the REMIQS project moved from Jobs for the Future over to KnowledgeWorks, and when we transitioned our partnership from the Urban Institute over to WestEd. We began to look at the various state data systems and which ones would be most applicable for this model, making some determinations. The WestEd team will talk a little bit more about how they got to that decision.
And then we entered into the quantitative filtering phase, where we actually ran the model based upon data from five different states. We used that model to identify a host of schools that we wanted to study in depth by actually sending research teams to those sites to study what makes them tick and how they were able to achieve these more equitable outcomes given the populations they were serving. Unfortunately, because of the pandemic and due to the political blowback, the gubernatorial and state legislature elections, and some of the turnover in those states, it was not a conducive environment for districts to say yes to us poking around in all of their approaches, their instructional strategies, and their relationships with kids and families.
It was too much for the schools at that point, particularly given some of the rather pronounced blowback against equity issues around the country. We just couldn’t get school sites, and in some cases even states, to say, yes, we want you to investigate these things. And so we had to pull back from that investigation and basically just stick with what we learned from this quantitative filtering model. And that’s what brings us to today: the comparative analysis of state accountability systems in KnowledgeWorks’ partnership with WestEd.
There are two other corollary items that resulted from the REMIQS project. One is the research guidebook, and this is essentially what we were prepared to do when we set out to investigate each of the identified schools across the five different states. It includes all of our framework, the model, the consent forms, the interview protocols, and the stages of the research project. It’s everything, ready to go off the shelf, for states, districts, consortia, even schools and researchers that might wanna go and investigate those things on their own. We really wish we could have implemented it, but we didn’t. But we hope you will consider implementing it or working with people who may do that.
In addition, and this is something that we’ll be releasing between now and the end of the year, we also worked with a school site that had been chosen in our research in Massachusetts to implement a yearlong youth participatory action research curriculum. It worked with high school students to take a lot of the REMIQS learnings and the REMIQS model and investigate issues of equity in their neck of the woods, with students leading those investigations using rigorous methods that we helped them construct and supported them in throughout their yearlong process. That curriculum and guidebook will be forthcoming by the end of the year, will be released broadly, and will then be re-publicized in the first quarter of 2024. We look forward to sharing that with you.
With that, I will kick it over to the next slide and to the WestEd team. Just quickly, you’ll see on the screen now the project goals, as I’ve mentioned before. And on the next slide, please, you’ll see the whole host of partners that we worked with, including the stakeholder committee, the advisors, and our funders. You’ll see all of those named in the actual report. And I’m not gonna spend the time to go through that very long list of partners, but I will say this.
There’s no way we could have made all of the interpretive turns, all of the thorny statistical decisions that had to be made, all the dataset questions that had to be resolved without consulting with our esteemed advisors and with our fantastic stakeholder committee members, all of whom are listed in the report, and, of course, our funders who supported all this work. So I recommend that you please look for that listing there because these are some really high quality people that helped us get to where we are today. So with that, I’ll kick it to the WestEd team and to Mary to lead us into how we did what we did in the quantitative filtering phase. Thanks so much. Mary Rauner:
Thank you, Eric. I’m Mary Rauner from WestEd, and Bradley Quarles and I led this project from the WestEd side. And we have a fantastic team, many of whom will be joining us in presenting today. So, I’m gonna get started by describing the REMIQS framework. The framework addresses concerns regarding the use of standardized test scores as the dominant measure of school quality for state accountability systems because these scores are highly correlated with family background. We developed the framework to provide a broader and more holistic framing of school quality, but we’re not proposing it as a, you know, specific alternative accountability system.
Instead, we hope that the REMIQS framework and the analyses that we’ll describe here today will encourage and inform discussions around accountability systems, including ways to measure school quality and support schools that serve vulnerable student populations. Next slide. So now I’m gonna briefly discuss how we developed the REMIQS framework. First, we identified schools with strong performance among resilient and historically marginalized students in five states: Texas, Virginia, Arizona, Massachusetts, and Kentucky. We selected these states because they have state longitudinal data systems that gave us access to the data we needed, and because they offer demographic and geographic diversity.
I also wanna take a moment to talk about the term resilient and historically marginalized students. The students that we’re referring to are Black or African American, Hispanic or Latino, Indigenous, multiracial, from low-income families, qualified for special education services, or designated as English language learners. We use the term resilient to highlight the achievements and contributions of members of these groups, despite the systems that may undermine them. We also acknowledge that there are systemic barriers for many other students based on racial, ethnic, socioeconomic, linguistic, religious, ability, immigration status, gender expression, and sexual orientation differences. The groups I listed are in the analysis simply because the data were available.
The terminology we used is based on ESSA, the Every Student Succeeds Act terminology, and we acknowledge the deficit framing of some of these terms and the importance of using asset-based alternatives. As one example, we prefer to describe students whose first language is not English as emerging bilingual students, because it highlights that these students are developing more than one vocabulary, linguistic structure, culture, and history that enhances their understanding and their contributions to learning environments. Finally, throughout the webinar, we may shorten resilient and historically marginalized students to historically marginalized students or simply marginalized students.
So back to developing the framework. After we identified these schools, we gathered student-level data from the state longitudinal data systems, and we then identified the schools that improved student outcomes as opposed to schools that simply served high-performing students. And I’ll describe how we made this distinction in just a minute. The statistical model that we employed estimated school-level effects on our outcomes of interest and measured multiple outcomes, not simply grades and test scores. So this whole process allowed us to compare school performance within each state. We had also hoped to make comparisons across states, but as we dug into the data, we found that data across states were not compatible or commensurate. Next slide, please.
So, to identify schools that were improving outcomes, not just those that served high-performing students, we used four high-level selection criteria. We included public or charter high schools. We didn’t include private or selective schools because they have different administrative structures, and they don’t operate under the same state mandates. We also included only schools in continuous operation from 2011 to 2014, because this allowed us to track students from their first year in high school through post-secondary enrollment. Finally, we included schools where at least 25% of students were from resilient and historically marginalized backgrounds, and schools with a 9th grade cohort that included at least 100 students from historically marginalized backgrounds.
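To make the filtering step concrete, here is a minimal sketch of how those four criteria could be applied to a school-level table. The column names and example values are hypothetical illustrations, not the actual REMIQS data dictionary.

```python
import pandas as pd

# Hypothetical school-level table; column names and values are illustrative only.
schools = pd.DataFrame({
    "school_id": [1, 2, 3],
    "sector": ["public", "private", "charter"],
    "years_open": [{"2011", "2012", "2013", "2014"},
                   {"2012", "2013", "2014"},
                   {"2011", "2012", "2013", "2014"}],
    "pct_marginalized": [0.40, 0.60, 0.20],       # share of resilient/historically marginalized students
    "grade9_marginalized_n": [150, 300, 80],      # marginalized students in the 9th grade cohort
})

required_years = {"2011", "2012", "2013", "2014"}
eligible = schools[
    schools["sector"].isin(["public", "charter"])               # public or charter only
    & schools["years_open"].apply(required_years.issubset)      # continuous operation 2011-2014
    & (schools["pct_marginalized"] >= 0.25)                     # at least 25% marginalized students
    & (schools["grade9_marginalized_n"] >= 100)                 # at least 100 marginalized 9th graders
]
print(eligible["school_id"].tolist())  # -> [1]
```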
So, after implementing the criteria across states, we finalized the school and student sample, as you see on the right-hand side of this slide. And as the table shows, the analysis comprised over a million students. Next slide. Now, I’m gonna spend a few minutes outlining the statistical approach that we used and some of the methodological decisions that we made along the way. Next slide, please. Conventional measures of school accountability may not holistically reflect school quality. And this is because they’re typically based on the data that’s already collected, like standardized test scores, rather than what should be collected to develop an accurate picture of quality, and because student performance is aggregated.
So, if some student groups experience school systems differently than the majority of the students at that school, the outcomes and experiences for these groups can be masked. What we’ve seen in practice is that school-level accountability rankings often correlate with the demographics of the students served. This is problematic because we know a school’s quality should be independent of the students who attend it. So, we created the REMIQS framework to measure school quality in a way that hinges less on demographic composition and more on the school’s outcomes for all of its students. In the framework, we measured school quality for a specific population of students, those who are historically marginalized, and we conceptualized school quality as high-quality instruction, a supportive school environment, and a school in which all students are supported to reach their full human potential. Next slide, please.
So, before jumping into some of the specifics, I’ll briefly contextualize some of the challenges that we encountered while doing this work. As it relates to data, as I mentioned before, the data across states were measured and collected differently, so we couldn’t accurately generate cross-state comparisons. And there were some instances when data were missing, erroneous, or poorly defined. We were also working with lagged data, which just means that it was from previous years. So, if a school performed well with a 2010 cohort of students, there’s no guarantee that the school was still performing well when we conducted the analysis. This is due to all the reasons you’d expect, like administrative and staff turnover, as well as the impact of the pandemic.
And there were times when we faced challenges coordinating with the entities that controlled the state education data, so SEAs, or state education agencies, and higher education coordinating boards. There were also other challenges that we had to overcome. And the COVID-19 pandemic was the biggest one for sure, making the project more difficult for a number of reasons. First, the timeline had to be completely revised. Second, we originally planned to travel to the states to gain access to data, but we were not able to do that. So that meant that we had to find some ingenious technical solutions within pretty big bureaucracies. Lastly, and importantly, practitioners were responding to the crisis of the pandemic. So understandably, research took a much lower priority for them. Next slide.
Now I’m gonna briefly describe our statistical approach. We fit a hierarchical linear model to measure school effects. The specific model was a random intercepts model, and this allowed us to control for student- and school-level factors so we could isolate the school’s effect on student achievement. We included a number of controls, also called covariates. First, and arguably our most important covariate, was 8th grade assessment scores. This allowed us to control for the academic preparedness of incoming students. Without it, a school might be considered high quality simply because it served highly prepared students. We also controlled for student demographics, including gender and historically marginalized student categories. We controlled for school enrollment as well. And, finally, we controlled for cohort demographics like the percentage of students from historically marginalized groups, students of color, low-income students, English language learners, and special education students.
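To make that concrete, here is a minimal sketch of a random-intercepts model fit to simulated data with Python’s statsmodels. The variable names, the simulated values, and the single outcome are illustrative assumptions rather than the actual REMIQS specification; the idea is that the school-level random intercept, estimated after controlling for incoming preparation and demographics, plays the role of the school effect.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate students nested in schools (purely illustrative data).
rng = np.random.default_rng(0)
n_schools, n_per_school = 50, 200
school = np.repeat(np.arange(n_schools), n_per_school)
true_school_effect = rng.normal(0, 0.3, n_schools)[school]
grade8_score = rng.normal(0, 1, school.size)        # incoming preparation (covariate)
marginalized = rng.binomial(1, 0.4, school.size)    # student demographic indicator (covariate)
outcome = (0.6 * grade8_score - 0.1 * marginalized
           + true_school_effect + rng.normal(0, 1, school.size))

df = pd.DataFrame({"outcome": outcome, "grade8_score": grade8_score,
                   "marginalized": marginalized, "school": school})

# Random intercept per school; fixed effects for the covariates.
fit = smf.mixedlm("outcome ~ grade8_score + marginalized", df, groups=df["school"]).fit()

# The estimated random intercepts act as value-added-style school effects.
school_effects = {grp: float(re.iloc[0]) for grp, re in fit.random_effects.items()}
print(sorted(school_effects, key=school_effects.get, reverse=True)[:5])  # top five schools
```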
So, the table on this slide shows the outcomes that we measured in the school-level summative scores of school quality, and the weights that we assigned to each measure. As you’ll see in more detail later on in the presentation, these are notably different from those of state approaches to measuring school quality. And as a quick preview, I’ll say that college enrollment and persistence, or success as written here, have the two highest weights, while more conventional metrics like assessment scores have much lower weighting. Next slide, please. Now that I’ve described how the REMIQS framework was developed, I’m gonna hand it over to my colleague at WestEd, Cailyn Torpie, who will describe the analysis that we conducted using this framework, the REMIQS framework, and state accountability data.
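As a small sketch of how measures like these can be rolled up into a single school composite, the snippet below applies fixed weights to standardized school-level outcome measures. The weights and measure names are placeholders chosen only to echo the pattern described above (postsecondary outcomes weighted most, assessment scores least); the actual weights are in the report’s table.

```python
# Placeholder weights echoing the pattern described above; the real weights are in the report.
weights = {
    "postsecondary_enrollment": 0.25,   # among the highest-weighted outcomes
    "postsecondary_persistence": 0.25,
    "advanced_coursework": 0.15,
    "attendance": 0.15,
    "graduation": 0.10,
    "assessment_scores": 0.10,          # conventional metric, weighted lowest
}
assert abs(sum(weights.values()) - 1.0) < 1e-9

def composite_score(school_measures: dict[str, float]) -> float:
    """Weighted sum of already-standardized school-level outcome measures."""
    return sum(weights[name] * school_measures[name] for name in weights)

example_school = {name: 0.5 for name in weights}  # illustrative standardized school effects
print(round(composite_score(example_school), 3))  # -> 0.5
```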
Cailyn Torpie:
Thank you, Mary. So, the research that we’re going to describe now builds on the REMIQS framework and composite scores that Mary discussed by comparing them with the state accountability system methodologies and school ratings in the same five states, which are Arizona, Kentucky, Massachusetts, Texas, and Virginia. Our research was guided by two questions. First, we wanted to know how the methodologies across each state compared to the REMIQS framework. Next, we wondered how states’ summative school ratings compared to the REMIQS school composite scores. Throughout, we explored the equity implications of the similarities and differences for assessing high school quality.
The methodologies of the statewide accountability systems we compared and contrasted with REMIQS were developed under the statutory framework of the Every Student Succeeds Act, or ESSA. Authorized in 2015, ESSA requires states to collect and publicly report school performance on several indicators, which are academic achievement, academic progress, graduation rate, progress in achieving English language proficiency, and school quality and student success, or SQSS. States must collect and report measurements on these indicators for all enrolled students and for economically disadvantaged students, students with disabilities, students designated as English learners, and students of every major racial or ethnic group.
States often report on additional student groups of interest based upon their local contexts. The flexibility offered by ESSA has resulted in considerable variation in how states hold schools accountable for school performance. Namely, states have flexibility in the metrics they choose to hold schools accountable for, and in the weight assigned to each indicator measure. For example, some states may use a five-year graduation rate instead of a four-year rate when holding schools accountable for high school completion. Additionally, regarding math and reading performance, some states may weigh student growth more heavily than current achievement. These decisions impact the ratings that schools receive, which in turn may have implications for how equity is conceptualized and achieved. Next slide.
Because ESSA affords states broad latitude in developing their accountability systems, there were differences in the student groups, metrics, and weights that states used in calculating their school ratings. In addressing research question one, where we were comparing the methodologies across frameworks, the research team summarized the ways REMIQS and each state accountability framework measure school quality and explored the equity implications of those differences. In this first set of analyses, we examined the accountability metrics used by REMIQS and each state, the approaches to aggregating metrics into summative school ratings, and the equity implications of how each state or REMIQS factors in the performance of individual student groups.
For the comparative analysis of methodologies, we used the frameworks that states employed under their initial approved ESSA plans submitted in 2017, to align as closely as we could to the timing and context of when the REMIQS framework was developed. We acknowledge that the methodologies used in these states have since changed. Across all five ESSA indicators, there is considerable variation across the states and REMIQS in their metrics and business rules. We’ll now highlight some of the main differences across states and REMIQS for select indicators. As examples, I’ll highlight this variation in two of the five indicators: academic achievement and school quality and student success. Next slide. ESSA requires states to assess academic achievement through performance on annual state assessments and to weigh ELA and math equally.
There were, however, differences in the ways that the five states and REMIQS defined the student universe and in the subjects that were measured, with some states including science and/or US history in addition to ELA (or RLA) and math. There was also variation in the metrics used, with REMIQS using the value-added models that Mary spoke to earlier in this presentation, some states using proficiency rates, and others opting for alternative metrics such as average scale scores or proficiency weighted at the student level, which gives partial credit to students approaching proficiency and additional points to students who exceed proficiency.
The weight attributed to the academic achievement indicator in summative ratings ranged from 10% to 45% across all frameworks. The REMIQS methodology weighted academic achievement the lowest, at 10%, in its summative school ratings. ESSA allows states to use measures outside of traditional accountability metrics in their school ratings as a part of the SQSS indicator. Though there is considerable flexibility, the SQSS indicator measures must still meaningfully differentiate school performance, be valid and reliable, be used within each grade span, be comparable across schools statewide, and be reported annually for all students and student groups.
Given states’ flexibility in developing SQSS metrics, there were large differences across the states in the student universe, metrics, and weight allocated for this indicator. As shown here, there is tremendous variation in the metrics that states used within their SQSS indicator. Further complicating the analyses, some states included multiple measures that may have had different student universes. REMIQS used data exclusively from high school students from historically marginalized groups, as Mary previously discussed. Arizona and Texas included students from the most recent graduating cohort, whereas Virginia included all students. Kentucky and Massachusetts included all students for some measures but included only certain grades or student groups for others. Next slide.
The decisions that states make in their accountability systems have important equity implications. To begin, understanding the student universe, which defines the student population of a given indicator, is critical to interpreting the meaning of the metrics used for any indicator. Systematically omitting underrepresented student groups from metric calculations can undermine the validity of the metric. The REMIQS framework only includes students from historically marginalized and resilient backgrounds, so by design, ratings in the REMIQS framework are centered on how schools serve those students. Additionally, the selection of metrics used for accountability is an opportunity for states to elevate equity within their systems.
For example, including more subjects and/or multiple measures across indicators can capture a broader and more diverse set of concepts, competencies, understandings, and skills, which could be considered more equitable. Raw measures of proficiency can obscure important differences and therefore may not capture the most valid and valued measures of student academic performance. The REMIQS framework used a value-added model to measure the impact of schools on student academic performance. To isolate the effect of schools on performance, this model controlled for prior performance on assessments as well as student- and school-level characteristics such as student demographics and schools’ demographic composition.
Although measuring the improvement of individual student groups could be considered an equitable aim, relying on aggregate improvements produces a less precise measure of school impact than student-level progress. If metrics compare the performance of different groups of students year over year, there is a risk of a higher degree of noise, or external factors influencing changes in performance, that may not be directly a result of changes in school quality. Relative measures of growth communicate individual students’ growth and, in the aggregate, schools’ growth in academic performance compared to other students and schools within the same system. And since these models control for prior performance, they can more effectively capture school impact.
However, these measures do not capture students’ progress toward attaining proficiency, nor do they explicitly reward progress among historically marginalized student groups that results from narrowing opportunity gaps. And finally, the weight attributed to the various metrics and indicators within state systems is consequential for the equity of schools’ overall ratings. Next slide. Now that we’ve shown some examples of the variation among states, and between states and REMIQS, at the indicator level, we’ll discuss how these indicators are translated into composite school ratings. So, I’ll turn it over to my colleague, Noman.
Noman Khanani:
Thanks, Cailyn. Hi, everyone, my name is Noman Khanani, and I’m a researcher at WestEd. Before getting into how the different states calculate summative ratings, I’ll provide some legislative context. So, in addition to reporting performance on each accountability indicator through the public report cards that Cailyn described, ESSA also requires that state accountability systems identify schools in need of improvement for one of two forms of support: comprehensive support and improvement, and targeted support and improvement. CSI, or comprehensive support and improvement schools, are those that are in the bottom 5% of Title I schools for all students, or those that have a graduation rate of 67% or lower.
TSI, targeted support and improvement schools, are those that are consistently underperforming for any group of students as defined by each state. Now, each state uses distinct formulas that account for the aforementioned accountability indicators to identify schools. Under ESSA, states are required to weigh academic achievement measures more heavily than other indicator measures in their formulas. So that’s all that’s required by ESSA. But many states still choose to publicly report a single aggregate measure of school quality, what we refer to throughout this presentation as state school ratings. And these are weighted composites of their accountability measures. Next slide, please.
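As a simplified illustration of the CSI rule just described, the sketch below flags Title I schools in the bottom 5% of summative scores, or any school with a graduation rate at or below 67%. The table and column names are hypothetical, and real identification processes involve additional business rules.

```python
import pandas as pd

# Hypothetical school-level data; columns and values are illustrative only.
schools = pd.DataFrame({
    "school_id": range(1, 6),
    "title_i": [True, True, False, True, True],
    "summative_score": [32.0, 78.0, 55.0, 41.0, 90.0],
    "grad_rate": [0.62, 0.88, 0.71, 0.80, 0.93],
})

title_i = schools[schools["title_i"]]
cutoff = title_i["summative_score"].quantile(0.05)        # bottom 5% of Title I schools
is_bottom_5pct = schools["title_i"] & (schools["summative_score"] <= cutoff)
is_low_grad = schools["grad_rate"] <= 0.67                # graduation rate of 67% or lower

schools["csi"] = is_bottom_5pct | is_low_grad
print(schools[["school_id", "csi"]])
```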
To give you an idea of how state accountability systems may weigh different indicators to generate school rating scores, this slide depicts how Massachusetts and Arizona calculate school ratings, as examples, compared to REMIQS. In the full report, you can find these weights for Kentucky and Texas as well, but not for Virginia, because it is one of a handful of states that do not publicly report school ratings or their methodology. In the figure on this slide, the X-axis indicates the five different accountability indicators that Cailyn discussed. And the Y-axis represents the percentage of weight that each indicator accounts for in each accountability framework, which are distinguished by the different colored bars.
So blue represents REMIQS, gray represents Arizona, and light blue represents Massachusetts. So, as an example of interpreting this figure for Massachusetts, 40% of their state summative rating is accounted for by academic achievement, 20% by academic progress, 20% by graduation rate, and 10% each by English language proficiency and school quality and student success. So that all adds up to 100%. And as you can see, Massachusetts and Arizona are quite similar to one another in how much weight each indicator accounts for as part of the school summative score. But we can see a major difference between the states and REMIQS. Namely, academic achievement is weighted at only about 10% for REMIQS.
And for REMIQS, most of the school composite score is made up of the school quality and student success domain. Now, REMIQS doesn’t actually refer to the measures it uses as school quality and student success, but based upon the measures that are captured by the state frameworks, we’ve classified those measures as school quality and student success for REMIQS. So, this includes measures like attendance, advanced course placement, and post-secondary enrollment and success. Conversely, state scores tend to emphasize measures that are based upon student performance on standardized academic achievement tests, as is the case in Massachusetts, where 40% of the weight is on academic performance alone and 20% is on academic progress. Next slide, please.
Now, how do states translate their composite scores into intuitive measures for their respective audiences? The table on this slide describes the form that state school ratings take when shared publicly, for example letter grades or star ratings. Some states also adjust a school’s rating based on how particular student groups perform, for instance when a group’s performance is greater than or equal to 80% of the statewide average. So even though technically a school’s score may not fall within the range of what would be an A, if it’s at, like, an 88 or 89, and it has a good enrollment of students with disabilities, it can get bumped up to an A. Similarly, states may drop ratings for schools that have significant achievement gaps. So even though on average a school may be performing really well, has a score that’s above 90, and would be given a five-star rating or a letter grade of A in some states, if there is a significant achievement gap for a given group that’s specified in the ESSA plan, it can get dropped down a letter grade.
Lastly, generated scores may be either normative or criterion based. Normative means that scores represent how schools are performing relative to others in the state. Criterion based means that the scores are based upon meeting specific targets set by the state. Massachusetts, for example, uses the normative way of reporting accountability scores. If a school has a score of 95%, it doesn’t necessarily mean that they are performing really, really well on all of their accountability measures. It just means that they’re in the 95th percentile; they’re doing better than the other schools. But theoretically speaking, the schools can also all be performing pretty poorly, and so that score wouldn’t mean that much.
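As a minimal sketch of the distinction between normative and criterion-based reporting (the latter is described next), the snippet below converts the same illustrative composite scores into percentile ranks and into a simple met/not-met target flag. The scores and the 70-point target are made up purely for illustration.

```python
import numpy as np

scores = np.array([55.0, 61.0, 72.0, 80.0, 88.0, 93.0])   # illustrative composite scores

# Normative: a school's reported number is its standing relative to other schools,
# so a "95" means better than 95% of schools, not 95% of targets met.
percentile_rank = np.array([(scores < s).mean() * 100 for s in scores])

# Criterion based: the reported rating reflects whether fixed state targets are met.
target = 70.0                                              # purely illustrative target
meets_target = scores >= target

print(percentile_rank.round(1))  # -> [ 0.  16.7 33.3 50.  66.7 83.3]
print(meets_target)              # -> [False False  True  True  True  True]
```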
And conversely, a criterion-based system is about meeting specific targets that are, again, specified by each state in its plan. Next slide, please. So, I’ll conclude the section by talking briefly about the equity implications as they relate to the ways that summative school ratings are constructed and communicated. And some of these are similar to what Cailyn was talking about. As mentioned earlier, states are not required to report a single aggregate school rating measure, but many still do so because it offers an easy, intuitive way to communicate school quality to parents, educators, policymakers, and others in the community, such as prospective home buyers.
And so how these scores are calculated can be really consequential for the ways that equity or inequity are conceptualized and revealed in schools and within systems of education. The weights of specific indicators, of course, are really influential. School quality may simply reflect a single indicator or two if that is what is weighted the most. So in states where academic achievement makes up a significant component of ratings, which is often the case under ESSA, the school rating may really just be reiterating performance on standardized tests. So, in other words, the school rating score is not really a rating of school quality in a holistic sense. It’s really just a rating of how well the school’s doing on academic achievement.
And as Mary mentioned earlier, we know from prior research that performance on state standardized tests is highly correlated with school demographics. So, this begs the question: are ratings in states that heavily weight academic achievement simply communicating who the school serves, or are they really a measure of school quality? A single score collapsed into one of five categories may not appropriately communicate school performance. There may not even be a great way to do this when you reduce school quality to a single score. Similarly, in some states, the performance of historically marginalized and resilient student groups is less emphasized, if included at all. And this leads to questions about whether measures of school quality are reflecting how well all students are served in each school.
Now, if our school ratings are biased by not accounting for other dimensions of school quality and do not appropriately account for the performance of historically marginalized students, there are risks of perpetuating bias, which research has increasingly documented. The types of schools parents choose for their children, where educators decide to apply for work, some of the funding restrictions that apply to schools, and other factors are all impacted by school rating scores that are publicly reported by states, even though that reporting is not required by ESSA. So, it’s really crucial that if states decide to report ratings like these, they get them right. I’ll pass it on to Brad, who’ll talk more about the comparison of school summative ratings across the states and REMIQS.
Brad Quarles:
Thanks, Noman. I’m Brad Quarles. I’m a senior research associate at WestEd, and I co-led the work with Mary. So, I think Noman did a great job of articulating some of the broader conceptual tensions of the work, so now we’re gonna talk through a few pairwise correlations we ran to help us compare REMIQS and state school ratings in Kentucky, Massachusetts, and Texas. As Mary mentioned earlier, data access was a challenge throughout the work, and we lacked the data to include Virginia and Arizona in these analyses. So, first we tested the association between the REMIQS school composite score and the school ratings in the three states. We also tested the association between the REMIQS school composite scores and the indicators for each state.
So, this analysis centered on the component parts that fed into each methodology’s ratings. And so, these analyses sort of shed light on how differences in indicator-level associations may drive school-level rating differences. And then finally, we looked at the correlations between the percentage of students from historically resilient and marginalized backgrounds within a given school and that school’s ratings under the REMIQS framework and the state systems. So, though not causal, these analyses show the extent to which school composite ratings may be related to the proportion of historically resilient and marginalized students in a given school. The other thing I want to mention before we move on is, as Eric mentioned in the setup, some of our analyses are really in the weeds.
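For readers who want to see the shape of these pairwise correlation analyses, here is a minimal sketch on simulated school-level data. The use of Pearson correlations, the variable names, and the simulated values are illustrative assumptions, not the report’s actual data or results.

```python
import numpy as np
from scipy.stats import pearsonr

# Simulate school-level values (illustrative only).
rng = np.random.default_rng(1)
n = 300
pct_marginalized = rng.uniform(0.25, 1.0, n)                       # share of marginalized students
state_rating = 80 - 25 * pct_marginalized + rng.normal(0, 6, n)    # simulated state school rating
remiqs_score = rng.normal(0, 1, n) - 0.3 * pct_marginalized        # simulated REMIQS composite

r_state_remiqs, p1 = pearsonr(state_rating, remiqs_score)          # analysis 1: REMIQS vs. state rating
r_demo_state, p2 = pearsonr(pct_marginalized, state_rating)        # analysis 3: demographics vs. state rating
r_demo_remiqs, p3 = pearsonr(pct_marginalized, remiqs_score)       # analysis 3: demographics vs. REMIQS

print(f"state vs REMIQS: r={r_state_remiqs:.2f} (p={p1:.3f})")
print(f"% marginalized vs state rating: r={r_demo_state:.2f}")
print(f"% marginalized vs REMIQS score: r={r_demo_remiqs:.2f}")
```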
So, what I talk through in the next two slides is largely gonna center on the big-picture takeaways from the first and third analyses. We can go on to the next slide. So, when assessing how well each state’s school rating correlated with the respective REMIQS school composite score, we observed weak to moderate statistically significant associations across those three states. And this is good; we’d hoped to develop a broader model of school quality, and these findings suggest that REMIQS is in fact measuring something different about school quality and is in fact a departure from the state school ratings. Obviously, you know, these differences are in large part driven by the methodological decisions of the respective methodologies, the metrics each rating system uses, and their respective student universes.
So REMIQS, of course, centers historically resilient and marginalized students, whereas the states utilize broader inclusion criteria with different restrictions and business rules under ESSA. So, given the REMIQS framework’s theoretical grounding in equity and in elevating the experience of historically resilient and marginalized students, we think that this finding reinforces the notion that this approach to measuring school quality may offer insights on how states can make equity more visible in their accountability systems. Next slide. So as expected, the presence of historically resilient and marginalized students was less correlated with REMIQS composite scores than with state school ratings.
So, there was still a statistically significant and negative relationship, but it was weaker, particularly for two of the states, and comparable for Texas. If you recall, Kentucky and Massachusetts are the two states whose school ratings heavily weighted student achievement on standardized assessments. Conversely, the correlations between the share of historically resilient and marginalized students and Texas’s school ratings and REMIQS scores were relatively comparable to each other and weaker than the associations in Kentucky and Massachusetts. So, it requires a little bit of further investigation, but we think a few things are going on here. For one, the close correlations between REMIQS composite scores and Texas school ratings suggested to us that REMIQS and the Texas state system are at least more similar measures of school quality than those of Kentucky and Massachusetts.
And particularly, we want to highlight that Texas’s summative ratings explicitly account for the performance of individual student groups through its Closing the Gaps metric. And so, the report discusses the fact that these sorts of metrics in particular may be a means of ensuring that state systems are holding schools accountable for serving historically resilient and marginalized students. Additionally, Texas, unlike those other two systems, will substitute academic progress for academic achievement in its school ratings when the progress scores are higher than the achievement scores. And so we think that may be explaining some more of the differences.
We also think that some of the innovative ways that Texas utilizes its school quality and student success metric, and Cailyn walked through a number of the flexibilities of that metric, may also account for some of what we observed. Next slide. So, tying all of this together, what can states do? Findings from our investigation uncovered four methodological decisions that we think may influence how well states measure the performance and experience of historically resilient and marginalized students. So first, they can actively and rigorously examine outcomes for specific student groups rather than relying on unitary measures that center an all-students group.
And so, we also think that this sort of focus on disaggregation sort of speaks to a lot of what I imagine many folks on this call live in, sort of this notion of equity and prioritizing lived experiences of those who are most marginalized and maligned throughout social life. And then I think another key here is the flexibility that ESSA affords states under the school quality and student success metrics. States can take advantage of this flexibility to foreground equity in ways that might get short shrift under more traditional metrics. And so, finally I saw a couple comments in the chat about this tension between growth and achievement.
And we’re certainly not the first to acknowledge it, but it’s important to state that there’s a high correlation between academic achievement and school and neighborhood demographics. And so, we think it’s important that states explore options for measuring academic progress, because we think that this might ensure that state systems are isolating the school effects, the effects that schools are having on students, rather than just measuring the demographics of the building. The last thing I’ll add before turning it back over to Eric, who’s gonna guide us through a quick Q&A in the time that we have left, is that the REMIQS model also controls for 8th grade performance, and we think that that may also be contributing to some of the differences that we observe between the REMIQS framework and the state systems. So, Eric, I’ll turn it over to you to lead us through some Q&A.
Eric Toshalis:
Fantastic, thank you, WestEd team. Again, much, much appreciation for all of the hard work and the incredible toil it took during a pandemic to pull all of these data sets together and make meaning of them. Really, really powerful work being done under some crazy circumstances. So props to you all for where we got to with this. I’m gonna try to go with these questions. Here’s the QR code to scan if you want to download the report now. I’ll try to go to these questions more or less in the order they were received. Carol Kirstead has a question: “Why focus only on college enrollment and success and not on jobs and careers?”
I can briefly try to answer that and would invite folks on the WestEd team to complexify it if needed. That really came down to the fact that we didn’t have mergeable, matchable student-level data for jobs and careers across the sampled states that would’ve made that a robust measure. We just couldn’t match to a sufficient level that would provide us with outcome measures and conclusions that were really worth exploring. And we tried. If you look in the report that I mentioned earlier, the link that I provided at the top of the presentation to the original model that was proposed by the Urban Institute, the number of variables that we had intended to include was massive, but what’s actually available out there and can go all the way to student-level data is far smaller than that.
And this is one of those that fell by the wayside. Would anyone from the WestEd team add to that at all?
Brad Quarles:
I’ll just briefly add. So, Cailyn and I were on the team that helped build DC’s accountability system, and we had lots of conversations about including wage data, but the n counts were always super small, so we weren’t actually getting a robust number, and we were limited to the data that DC had on earners. So, you know, utilizing those sorts of measures robustly requires a lot of data sharing across states, and different cohorts are working through that, but the business rules for it are just really, really challenging.
Mary Rauner:
Yeah, and I will just add that we did include employment data, but it was only for the students who did not enroll in college. So we added two years of earned wages after high school when possible, which is, you know, not what you’re asking, but we did attempt to get at that a little bit for the students who didn’t go directly to college.
Eric Toshalis:
Another question from Jim asking about the growth measures and value-added measures and how those were sort of de rigueur maybe 5, 10 years ago, you know, what happened to those? Yeah, Jim, I’m not sure whether you’re asking what happened to them in the field or what happened to them in this particular model. There were multiple statistical turns we made where it didn’t make sense for us, given the trade-offs, to continue to prioritize that type of analysis. And there are multiple critiques in the field of value-added models for what they show, but also what they conceal. So, you know, not being as deep in the statistics as the WestEd team, are there any reasons why or how we might answer that one, Mary or Brad or anyone else?
Brad Quarles:
I would invite Noman and Cailyn to also chime in. But the only other thing that I would add, in addition to what Eric said, is there’s a really big challenge in communicating those data and methodologies outward to key audiences. Again, in DC, we had a number of conversations about value-added models, but you wanna make sure that these data are usable to families, that they have a clear sense of them for making decisions about their families, and making that information digestible is really challenging.
Eric Toshalis:
Catherine has a question. How did we define college quote/unquote success, and over what time period? For instance, was it just persistence through year one? Was it a longer timeline? And did we include any post-secondary programs, career training, community colleges, four-year colleges, et cetera?
Mary Rauner:
So, we had four measures that we included. One is enrollment, and it was enrollment in a two- or four-year post-secondary institution within two years of high school graduation. And I do not believe it included technical, like career training programs. It had to be in an institution. College success is the percent of students who had a 3.0 GPA after their first year of college. Persistence was retention from year one to year two, and graduation was earning a post-secondary degree. So those are the four measures that we included. And, you know, it’s important to note that not every state had all of those measures. So if you look at the technical report, and I think it’s appendix C, it describes at least three states where at least one of these measures was not included and what we did about it.
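As a rough sketch of how those four measures could be computed as school-level rates from student-level records, assuming hypothetical column names and business rules simplified relative to the technical report:

```python
import pandas as pd

# Hypothetical student-level records; columns and values are illustrative only.
students = pd.DataFrame({
    "school_id": [1, 1, 1, 2, 2],
    "enrolled_2yr_or_4yr_within_2yrs": [1, 1, 0, 1, 0],
    "first_year_gpa": [3.2, 2.5, None, 3.6, None],
    "persisted_year1_to_year2": [1, 0, 0, 1, 0],
    "earned_postsecondary_degree": [1, 0, 0, 0, 0],
})

by_school = students.groupby("school_id").agg(
    enrollment_rate=("enrolled_2yr_or_4yr_within_2yrs", "mean"),
    success_rate=("first_year_gpa", lambda g: (g.dropna() >= 3.0).mean()),  # 3.0+ GPA after year 1
    persistence_rate=("persisted_year1_to_year2", "mean"),
    completion_rate=("earned_postsecondary_degree", "mean"),
)
print(by_school)
```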
Eric Toshalis:
Next question is from Drew. “Was the value-added model used for all outcome measures or just the academic achievement measures?” My understanding is that the model was fit to all of the measures together, to find the schools that had promoted the greatest outcomes in aggregate across all of those particular post-secondary and secondary outcomes. So, the value-added model wasn’t just academic; it was also college persistence and college enrollment. Well, those are also academic outcomes. Unless I’m mistaken in that understanding of the model. Yeah, great. Another question from Catherine about slide 26. “How can REMIQS weight academic achievement at only 10%? I thought ESSA…”
The quick answer to that one is we don’t have to follow ESSA. We’re doing our own jam over here. And what we’re trying to do is to show what’s possible and how you might be able to reveal the ways that schools might be achieving greater levels of equity if we consider things beyond very reductive test scores. Hence, the way that we weighted the model is partly a response to ESSA restrictions. The things that we put in the model are partly a response to the stuff that doesn’t show up under some of those ESSA restrictions. So in many ways, we’re in conversation with ESSA about that as a way to show how some states could still adhere to what ESSA requires, as they need to, but may also play around with some of their weightings to show some different things and maybe shine a better light on where equity is being achieved and where it isn’t.
If anyone from the WestEd team wants to add to that, please do.
Mary Rauner:
Well, I think it is aligned with, and forgive me, I don’t remember who posted this question, but somebody asked well, given that ESSA requires that states weight academic performance fairly high, how do we go about making some shifts? And, you know, we recommend looking at the more flexible areas like the SQSS and what you can do in that space. So there are ways to do it, and a lot of it too is the control factors, right? So, are you controlling for preparation of students?
Brad Quarles:
Another critical piece is the utilization of student groups. So, a lot of states rely on the all-students group, which subsumes everyone, and that’s the driving academic achievement measure that they use, which obviously, in different schools, is gonna conceal a lot of things depending on the size of those n counts for different kinds of kids.
Eric Toshalis:
Fantastic. Pamela asked the question, “Did you consider looking at freshmen on track data?” That’s a good one. That might be more in the report. I don’t remember whether that was one of the original variables.
Brad Quarles:
So we’re talking about high school freshmen or college freshmen? I don’t see the question.
Eric Toshalis:
It’s in the Q&A. Pamela, maybe you can clarify in the Q&A whether you meant freshmen in high school or college. I’m thinking it’s high school.
Brad Quarles:
So I think that’s, again, another data availability thing. Like, we didn’t have access to that data, and frankly some SEAs don’t have access to that data.
Eric Toshalis:
And then Mitch asked the question, “Given that ESSA requires that states weigh academic performance significantly more than other factors, how would we recommend that state agencies use the concepts behind REMIQS to introduce more equity into their accountability systems under current federal guidelines?” So yeah, Mitch, you’re cutting right to the key thing here, and that was some of the implications that I think you heard toward the end of the presentation that WestEd was giving there. Would any of your team wanna add to that, Mary, Brad, Noman?
Mary Rauner:
Yeah, that was the one I was referring to. So just, you know, really trying to lean into the SQSS, and think about preparation, you know, student preparedness measures to incorporate.
Brad Quarles:
I also think that, again, states have to evaluate all schools, but there’s also a critical piece about what they’re messaging, right? So, it’s one thing to just put out an A-to-F grade; it’s another thing to put out that grade and also to publish a handful of other data points that tell the tale that the rating, which again is heavily reliant on academic achievement, may not tell.
Eric Toshalis:
Right on, thank you. Mark asked a question: “Talk more about the 8th grade component.” Just real quickly, and this may or may not answer your question, but the reason we included the cohort of 8th graders is to control for the level of preparation and readiness coming into the high school. So that, for example, if rising 9th graders were coming into a particular high school and they were all doing very, very well as a result of the resources given to them, the surrounding community, and all the factors that we might suppose, then the “value add” for the high school might not be that much.
We’d just basically be measuring that these were kids who were already very well prepared to be successful, versus kids who were not necessarily well prepared but then really started to hit it out of the park as a result of their experience in high school. That high school would be doing technically better in our model because it would be taking the raw material of the students it received and doing better with it. There’d be a greater value add over the four years in high school and into post-secondary than there would’ve been otherwise. The reason we included the 8th grade cohorts in the model is to control for what the high school was getting on the front end, if that makes sense. Great.
Pamela, “For the students who went to college, did you find it worth exploring match? In other words, did students end up in schools below the level they were qualified for?” I don’t know if we were able to answer that.
Brad Quarles:
Yeah, this again, that’s another data availability thing. Again, most don’t get that data. They’re just getting National Student Clearinghouse data on enrollment and graduation, unless states… Yeah, there’s a lot of other variables that you’d have to put in place and data share agreements that you need to have in order to have the data to run that analysis.
Mary Rauner:
But, Pamela, it gets to the core of so much of what we do, because we care so much about access, and match is so important to that. So thank you for bringing that up. It’s the next study.
Eric Toshalis:
And then, I’m guessing it’s Christiane, you’re asking whether we sort of triangulated with LPI’s Design Principles report. That was influential for us in designing what would have been the qualitative investigation of the identified schools that showed up in our model. We had used multiple frameworks from other folks, the Chicago Consortium, LEAP Innovations, the New England School Consortium, and several others that created these models of how we know a school is a quality school. We sort of brought those models together and created our own framework that included the LPI model and would’ve been instrumental in actually studying what makes these schools tick.
As I mentioned at the beginning of the presentation, we were unable to conduct those investigations because we couldn’t get enough schools to say yes and open their doors to us. Hence, the turning around and just sticking with the statistical model that we’ve been talking about today. So, yes, we would’ve been including it had we been able to do it, and it was, you know, part of the literature review that defined how we would do this statistically, but it wasn’t actually the model that we implemented, if that makes sense. And then Drew, “How much did the sample of students affect the results?” Yeah, like, yeah, totally, a lot. “Did you compare results based on all students versus using only marginalized students?”
We, again, prioritized students defined during the presentation as being those from resilient and marginalized backgrounds. And so, we’re primarily looking at how those students fared over that time. And so, whether they fared better or worse than, you know, the average was really less material to the model than whether they performed better at that school than at other schools with similar demographics. So you can see the details of that in the full report if you want to dive into that. And I think we’re getting to the hour. So, apologies if we weren’t able to get to all of the rest of the questions there.
You’ll see at the end here, there’s another QR code, and you’ll see on the next slide that you’ll have information there where you can contact KnowledgeWorks and/or WestEd to go deeper if you would like to. I would recommend that you check out the report first, because the weeds are definitely in there, along with some of the links that were provided before. Everyone that attended this will get a follow-up recording of this presentation and the slide deck, I think. We will try to make that available in a PDF format, along with a copy or a link to a copy of the report, to make sure those of you who want to go deeper can get your questions answered. Again, I want to thank all of the people that have been involved in this, particularly our funders, particularly the Barr Foundation for its unending support of this project throughout its many phases.
I want to thank the WestEd team for their incredible work on all of the many turns that we had to make to make this really an impactful report. And I want to thank all of my colleagues at KnowledgeWorks for continuing to support these types of questions about school quality, about how we measure school quality, and how we inspire school quality around the country. And with that, really excited and thanks so much for everyone for attending, and please keep the questions coming, and reach out to us if you want to know more. Have a great rest of the day, and thanks for being here.