The Every Student Succeeds Act (ESSA) requires schools to use “evidence-based interventions” to improve. The law also defines what counts as evidence, and recent guidance from the Department of Education has clarified what qualifies as “evidence-based.” Mathematica has also published a brief guide that uses categories similar to the Department of Education’s, but it additionally explains types of data we may see in the media or from academic researchers that do not qualify as rigorous evidence yet can still help us understand policies and programs.
What follows is a brief summary of what qualifies as “evidence-based,” starting with the strongest:
Experimental Studies: These are purposefully created experiments, similar to medical trials, that randomly assign students to treatment or control groups, and then determine the difference in achievement after the treatment period. Researchers also check to make sure that the two groups are similar in demographics. This is considered to be causal evidence because there is little reason to believe the two similar groups would have had different outcomes except for the effect of the treatment. Studies must involve at least 350 students, or 14 classrooms (assuming 25 students per class) and include multiple sites.
Quasi-experimental Studies: These still have some form of comparison group, which may compare students, schools, or districts with similar demographic characteristics. However, even groups that seem similar on paper may still have systematic differences, which makes evidence from quasi-experimental studies slightly less reliable than randomized studies. Evidence from these studies is often (but not always) considered to be causal, though study design and fidelity can greatly affect how reliably these conclusions carry over to other student groups. Studies must involve at least 350 students, or 14 classrooms (assuming 25 students per class) and include multiple sites.
Correlational Studies: Correlational studies can show that a specific intervention is associated with a positive or negative outcome, but they can’t prove that the intervention caused it. For example, if Middle School X requires all teachers to participate in Professional Learning Communities (PLCs), and it ends up with greater student improvement than Middle School Y, we can say that the improved performance was correlated with PLC participation. However, other changes at the school, such as greater parental participation, could be the true cause of the improvement, so we cannot say that the improvement was caused by PLCs, only that further study should be done to see if there is a causal relationship. Researchers still have to control for demographic factors; in this example, Middle School X and Middle School Y would have to be similar in both their teacher and student populations.
With all studies, we also have to consider who was involved and how the program was implemented. A good example of this is the class-size experiment performed in Tennessee in the 1980s. While that randomized controlled trial found positive effects of reducing class size by an average of seven students per class, when California reduced class sizes in the 1990s, it did not see effects as strong. Part of this was implementation – reducing class sizes means hiring more teachers, and many inexperienced, uncertified teachers had to be placed in classrooms to fill the gap, which could have reduced the positive effect of smaller classes. Also, students in California may be different from students in Tennessee; while this seems less likely for something like class size, it could be true for more specific programs or interventions.
An additional consideration when looking at evidence is not only statistical significance (whether we can be reasonably confident that the true effect of a program wasn’t zero), but also the effect size. If an intervention has an effect size of 0.01 standard deviations* (or other units), it may only translate to the average student score changing a fraction of a percentage point. We also have to consider whether that effect is really meaningful, and whether it’s worth our time, money, and effort to implement, or whether we should look for a different intervention with greater effects. Some researchers would say that an effect size of 0.2 standard deviations is the gold standard for making truly meaningful changes for students. However, I would also argue that it depends on the cost, in both time and money, of the program. If making a small schedule tweak could garner 0.05 standard deviations of positive effect, and cost virtually nothing, then we should do it. In conjunction with other effective programs, we can truly move the needle for student achievement.
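To make effect sizes like 0.01, 0.05, and 0.2 standard deviations more concrete, the sketch below (assuming scores follow a standard bell curve, as the footnote describes) converts each one into the percentile an average student would move to, using only Python’s standard library:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Cumulative probability of a standard normal distribution at x."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Where a student starting at the 50th percentile lands after an
# improvement of the given effect size, in standard deviation units.
for effect_size in (0.01, 0.05, 0.2):
    percentile = 100 * normal_cdf(effect_size)
    print(f"effect size {effect_size}: 50th percentile -> {percentile:.1f}th")
```

Running this shows an effect of 0.01 moves the average student to only about the 50.4th percentile (a fraction of a percentage point, as noted above), 0.05 to about the 52nd, and 0.2 to about the 58th.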
School administrators should also consider the variation in test scores. While most experimental studies report on the mean effect size, it is also important to consider how high- and low-performing students fared in the study.
Evidence is important and should guide policy decisions. However, we have to keep its limitations in mind and be cautious consumers of data, making sure we truly understand how a study was done so we can judge whether its results are valid and translate to other contexts.
*Standard deviations are standardized units used to help us compare programs, considering that most states and school districts use different tests. The assumption is that most student achievement scores follow a bell curve, with the average score at the top of the curve. In a standard bell curve, a change of one standard deviation for a student at the 50th percentile would move him/her up to roughly the 84th percentile, or down to roughly the 16th percentile, depending on the direction of the change. A reported effect size for a program typically indicates how much the mean of the students who participated in the program changed from the previous mean, or how it differed from the mean of the group of students who didn’t receive the program.
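The one-standard-deviation example in the footnote can be checked directly with the standard normal distribution; a minimal sketch, again assuming bell-curve-shaped scores and using only Python’s standard library:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Cumulative probability of a standard normal distribution at x."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# A student at the 50th percentile who moves one standard deviation
# up or down on a bell-curve-shaped score distribution:
up = 100 * normal_cdf(1.0)     # about 84.1
down = 100 * normal_cdf(-1.0)  # about 15.9
print(f"+1 SD: {up:.0f}th percentile, -1 SD: {down:.0f}th percentile")
```

This confirms the rough rule of thumb: one standard deviation up from the middle lands near the 84th percentile, and one down lands near the 16th.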