Editors’ Note: Nicole P. Marwell and Jennifer E. Mosley discuss their new book, Mismeasuring Impact: How Randomized Controlled Trials Threaten the Nonprofit Sector (Stanford University Press, 2025).
Recent scholarship has offered varying interpretations of what the appropriate function of foundations should be within a democracy. One dominant perspective highlights foundations’ contributions as drivers of social innovation, arguing that their financial capacity and independence allow foundations to experiment with promising approaches that have the potential to benefit society. To make this argument convincingly, one tool at foundations’ disposal is to support rigorous evaluation of program effectiveness, thereby drawing on scientific norms to demonstrate the societal benefit of the innovations they have supported. In particular, foundations might be drawn to the use of randomized controlled trials (RCTs), which are experiments that compare the outcomes of a randomly selected “treatment group” of people who receive a program with a randomly selected “control group” of people who don’t. RCTs are widely regarded as the “gold standard” for evidence of program impact, and foundations that use them often do so to bolster the case that foundation investments are effective ones.
In our new book, Mismeasuring Impact: How Randomized Controlled Trials Threaten the Nonprofit Sector, we report on what three key groups of nonprofit sector professionals—nonprofit managers, program evaluators, and foundation program officers—think about the growing use of RCTs to evaluate nonprofit social programs. In interviews with these key stakeholders, we found a surprising degree of concern regarding this development, and we draw on these data to identify five problems with relying on RCTs for nonprofit evaluation. Foundation program officers were particularly worried about the consequences of widespread RCT use in the nonprofit sector. In what follows, we discuss how and why RCTs have grown in stature in the sector, and what the program officers we interviewed saw as some of the main challenges.
RCTs have a long history as a tool for evaluating social programs, both in the United States and internationally. Many people concerned with making sure social programs are improving people’s lives have embraced the idea of a “hierarchy of evidence.” This formulation suggests that there are better and worse types of evidence, and that the RCT naturally sits atop the hierarchy. But our reading of the history suggests that the RCT did not find its place at the top of the evidence hierarchy simply on its own merits. Rather, it took significant work for this outcome to be achieved—efforts we refer to as the “Gold Standard movement.”
The Gold Standard movement has been shaped over the last three decades through activities occurring on two separate, but related, battlefields. First, what we call the evidence battle focused on the question of what kind of evidence best determines whether or not a social program “works.” The evidence battle eventually yielded the construction of the “hierarchy of evidence,” with causal evidence, especially from RCTs, believed by many to sit at the top. Second, what we refer to as the funding battle concerns efforts to put the evidence hierarchy into action, tying the allocation of financial resources for social programs to RCT evidence.
The simplest story of the evidence battle goes like this. Sometime around 1980, a growing number of economists became dissatisfied with the then-dominant approach to doing microeconomics, leading to what has been called the “credibility revolution.” At the time, most microeconomic research that sought to inform public policy decisions relied on building econometric models from theory, and then testing the models with observational data. Critics of this approach argued that its results were highly contingent on a model’s underlying theoretical assumptions, and showed that quite different results occurred when different assumptions were used. This new wave of research argued that if we wanted to determine whether the changes observed in social program participants were in fact the result of participating in that program, experimental (that is, RCT) and quasi-experimental research designs would be required.
Nailing down whether a program is causing change in its participants is a hard question because of what is referred to as the “counterfactual” scenario. When someone takes part in, for example, a job training program, we can only observe what happens to them afterwards: if they got a job, what kind of job, at what wages, and so on. We cannot also observe the counterfactual: what would have happened to them in terms of employment if they had not participated in the program. This is where credibility revolution scholars argued that an RCT can give us a “credible” answer. This perspective on how to evaluate program effectiveness arguably reached its zenith in 2019, when the Nobel Prize in Economic Sciences was won by three of the most high-profile practitioners and promoters of RCTs. All told, this simple version of the evidence battle story tells us that RCTs rose to prominence because they provide the best evidence for understanding whether or not a social program works.
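The counterfactual logic described above can be illustrated with a small simulation. This is a minimal sketch using entirely synthetic data: the job-training framing, employment rates, effect size, and sample size below are illustrative assumptions, not figures from any study discussed here.

```python
import random

def simulate_rct(n_per_arm=10_000, control_rate=0.30, true_effect=0.15, seed=0):
    """Sketch of a hypothetical job-training RCT with a binary 'got a job' outcome."""
    rng = random.Random(seed)
    # Random assignment means the two groups differ only by chance, so the
    # control group stands in for the unobservable counterfactual: what would
    # have happened to participants had they not received the program.
    treated_jobs = sum(rng.random() < control_rate + true_effect
                       for _ in range(n_per_arm))
    control_jobs = sum(rng.random() < control_rate
                       for _ in range(n_per_arm))
    # The difference in group means estimates the average program effect.
    return treated_jobs / n_per_arm - control_jobs / n_per_arm

estimate = simulate_rct()  # lands near the built-in true effect of 0.15
```

Because the simulation builds the true effect in by construction, running it shows why credibility revolution scholars found randomization persuasive: with a large enough sample, the simple treatment-minus-control difference recovers the effect without any modeling assumptions.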
Scholars who have delved into the history of economics during this period, however, offer a second version of the evidence battle story, one that is decidedly more complex. In this story, significant points of contention have always existed regarding the reliability and validity of RCT evidence—notwithstanding the “credibility revolution.” Indeed, objections to the idea that RCTs necessarily offer superior evidence have been ongoing in multiple fields, including economics.
The use of RCTs to evaluate social policy in the U.S. picked up momentum in the 1960s, assisted in part by fast-rising allocations of federal funds: by 1968, each time a new program received federal funding, 1 percent of its cost was allocated to the evaluation of its results. Over the next several decades, interested parties put the building blocks of the Gold Standard movement into place, taking steps to ensure causal evidence would play an increasingly important role in policymaking. Members of that movement refer to their work as advancing “evidence-based policy.” This moniker is misleading, however, because many scholars and practitioners outside the Gold Standard movement agree that policy should be evidence-based—they just advocate for a wider range of evidence to be considered.
In 2001, a new organization emerged to aggressively pursue the goal of driving government spending towards social programs with causal evidence of effectiveness: the Coalition for Evidence-Based Policy. According to its mission statement, “[t]he Coalition advocates many types of research to identify the most promising social interventions. However, a central theme of our advocacy, consistent with the recommendation of a recent National Academy of Sciences report, is that evidence of effectiveness generally cannot be considered definitive without ultimate confirmation in well-conducted randomized controlled trials.”
Three separate external evaluations conducted during the Coalition’s first ten years of operation praised the Coalition’s single-minded focus on promoting RCT evidence, but tempered that assessment with advice that it might want to consider the virtues of other forms of evidence as well. Ultimately, the Coalition continued to promote only RCT evidence, becoming the standard-bearer for the Gold Standard movement and ratcheting up the funding battle.
In 2002, the Coalition played a key role in the Bush administration’s efforts to privilege RCT studies in allocating federal funding for education research. Many of the Coalition’s recommendations made their way into the operating mandate for the newly formed Institute of Education Sciences, which the Bush administration created to replace what it identified as the ineffective existing research arm of the Department of Education. Education researchers loudly registered their opposition to limiting research funding in this way, arguing that it made no sense to preemptively decide what research method to use when the questions driving the research were still unformed.
As the Bush administration gave way to the Obama administration, the Gold Standard movement consolidated its influence inside the executive branch. Until then, much of the success of the Gold Standard movement had been happening in the domain of international development. Searching for new ways to attack the problem of global poverty, researchers worked with private philanthropy and international nongovernmental organizations to conduct RCTs of promising programs. Drawing inspiration from this work, some of President Obama’s staff turned their attention to the U.S. nonprofit sector, a major recipient of the public funds spent on social programs.
Between 2010 and 2016, President Obama’s Social Innovation Fund (SIF) made hundreds of millions of dollars in grants to thirty-nine intermediary organizations—nonprofits whose principal work is funding or supporting service-providing nonprofits—which in turn made sub-grants to just under three hundred nonprofits that were operating promising programs in local communities across the country. Built into the grants to support these nonprofits’ program work was a requirement that they undertake rigorous evaluation—generally, RCT or quasi-experimental evaluation—of their program impacts.
The SIF’s articulated goal was that SIF-funded programs that showed rigorous evidence of impact would be scaled up, either to meet additional need in the local area where the nonprofit operated, or by replicating the program in other nonprofits in other places. The SIF, then, represented a watershed moment in which a new standard of evaluation was held up to the U.S. nonprofit sector.
The SIF also continued to funnel federal funds into the professional evaluation industry. Because the SIF required independent evaluations of programs it funded, it sparked growth in U.S. nonprofits seeking to hire evaluation firms. Intermediary organizations and philanthropic foundations raised funds to help support SIF-mandated research, while evaluation firms took up the new challenge of conducting RCTs in nonprofits that mostly were not well-prepared to do them. As articulated in numerous policy papers and communications from the SIF and the Obama administration more broadly, the initiative’s goal was to make rigorous evaluation a key part of the process of distributing the government funds that support so many of the nation’s nonprofits. Over time, these advocates argued, if the SIF vision was implemented correctly, public dollars would flow only to those organizations providing rigorous—often RCT—evidence of their effectiveness.
The SIF evaluation experience revealed that conducting RCTs was far more challenging for nonprofits than previously understood. To begin, while a 2016 report by Xiaodong Zhang and Jing Sun on the SIF indicates that the initiative had some three hundred sub-grantees, this and subsequent reports show that only around eighty evaluations were actually completed.[i] Of these eighty or so evaluations, only thirty-two assessed program outcomes or impacts, and only half of those thirty-two were adequately powered (that is, had a large enough sample size in both comparison groups) to provide credible evidence on at least one outcome. To sum up: three hundred nonprofit organizations were asked by the SIF to conduct a high-quality evaluation study, and only sixteen of them delivered.
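The "adequately powered" criterion has a standard back-of-the-envelope form that helps explain why so few evaluations cleared it. The sketch below uses the textbook normal-approximation formula for a two-arm trial; the effect size, significance level, and power target are conventional illustrative choices, not parameters taken from the SIF evaluations themselves.

```python
import math
from statistics import NormalDist

def n_per_arm(effect_size=0.3, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a standardized
    mean difference `effect_size` in a two-arm trial (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance test
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# Even a modest standardized effect at conventional thresholds demands
# hundreds of participants per group.
required = n_per_arm()  # 175 per arm for a 0.3 standardized effect
```

Since many service-providing nonprofits enroll fewer participants than this in a year, recruiting both a treatment group and a control group of adequate size is often infeasible, which is consistent with the small share of SIF evaluations that ended up adequately powered.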
The experience of the SIF underlines what many evaluation experts have repeatedly warned: that RCTs are a poor match to evaluate the complex activities of nonprofit organizations. Nevertheless, due to the success of the Gold Standard movement, many nonprofit sector stakeholders feel compelled to discuss and advocate for the use of RCTs to evaluate nonprofit programs and organizations. This is despite the ongoing evidence battle over whether RCTs of social programs actually deliver the scientific results their advocates claim they do.
Foundation Program Officer Perspectives on RCTs
The world of philanthropy, as might be expected, has divergent views on the importance of RCT evidence for nonprofit organizations. On the whole, however, we were surprised by the degree to which the foundation program officers we talked to for our book seemed to see through the mystique when it comes to RCTs.[ii] We initially assumed that the legitimacy that government funders accord to RCTs would be replicated among private foundations. Instead, while a few of the program officers we interviewed echoed the ideas behind the Gold Standard movement and saw RCTs as a path to legitimacy for social programs and the nonprofits that deliver them, many more foundation officials told us that they saw problems with how RCTs are being used in the nonprofit sector. This latter group of foundation program officers often decried what they saw as the inordinate pressures driving nonprofits towards conducting RCTs and told us about how their foundations were discussing and choosing different approaches to their support of nonprofits.
For those who supported the growth of RCTs, it was often because of how important RCTs have become in the public policymaking process. “We’re very influenced by policy priorities,” one program officer told us. “The Commission on Evidence-Based Policymaking was an important thing for us. So was the passage of the Evidence Act. We want to be able to get public systems to adopt practices and reforms that are consistent with our [foundation’s] principles and our objectives, and in order to do that we think there’s a rising interest in having hard evidence to the effect that they work.” By “hard evidence,” this respondent meant RCT evidence for specific social programs, which she noted was particularly important as a marker of legitimacy in government circles.
Other program officers saw RCTs as having a certain utility, but emphasized their limitations as well. For example, one program officer told us, “I don’t think it’s a bad thing for people to accumulate [RCT] knowledge about the programs they are using and how they affect the people that they serve; I think that’s a good thing.” At the same time, they also said, “I think devising policy that privileges only studies that have evidence that has been accumulated from RCTs is a dangerous thing.” In a similar vein, a different program officer emphasized that while RCTs can sometimes provide useful evidence, they do not provide the answer to every important question about social programs:
I think that there’s funders in the world that tend to take the view that if you don’t validate it with an RCT, then you don’t know it works and it’s not worth doing. We don’t buy that… We’ve done a lot of work on community change, and that’s not easily evaluable using random assignment methods. We find ourselves with kind of a foot in both worlds, frankly. I mean, we’re arguing in favor of rigorous impact evaluation including RCTs to our own staff, who are kind of afraid that it is just too much of a medical model, a scientific intrusion on a process that’s just as much art as science. And then we turn around and talk to our public agency partners and say, “Well, you know, you’re setting standards of evidence that are ultimately going to restrict how you’re able to intervene and what problems you’re able to address,” and that’s not appropriate either. So, you know, the two far ends of the continuum are both wrong, and we’re trying to occupy the middle ground.
One issue some program officers were particularly concerned about was that the RCT method does not reflect the plurality of voices of the communities nonprofits serve. These program officers told us that they try to improve nonprofit effectiveness by bringing in the voices of those affected by the social issues a nonprofit is targeting, in order to develop relevant knowledge embedded in specific local contexts. As one program officer put it, RCTs “just can be really convoluted and really messy with a lot of bias—researcher bias—that’s built in upfront around who gets to create the knowledge, who gets to create the question, ultimately who gets to determine what the answer is and whose voice is really valid.”
Others did not go so far as calling RCTs “dangerous,” but did agree that they “would like to see RCTs be complemented by really hearing the perspectives of people… RCTs are not designed to return the learning back to the system, or to assume that people who are actually experiencing the intervention could actually understand their own experience.” This respondent went on to say that the foundation they worked for was increasingly asking itself the question of how to use evaluation to enhance equity both at the nonprofits receiving the foundation’s support, and in the communities served by those nonprofits. “[A]s an example of how that is showing up for us,” the respondent explained, “we are, with every evaluation, thinking about what the benefit is to the grantee partner. What do our grantee partners, what do nonprofit organizations actually make from this evaluation activity?”
Overall, these program officers shared their sense that RCTs were not well-suited to their philanthropic goals because the method requires an arm’s-length approach to the nonprofit organizations and participants on whom the RCT is being performed. They expressed instead a commitment in their philanthropy to building close, trusting relationships with their grantees, believing that such relationships were key to helping nonprofits make meaningful progress in combating social problems, as well as to rebalancing the power relationships between foundations and nonprofits. In this model, helping grantees realize their own goals is the point of evaluation, not just speaking back to policy or demonstrating that the foundations’ investments can meet the “gold standard.”
-Nicole P. Marwell and Jennifer E. Mosley
Nicole P. Marwell and Jennifer E. Mosley are professors at the Crown Family School of Social Work, Policy, and Practice at the University of Chicago. Their research on nonprofit organizations has been published widely in leading journals in the fields of nonprofit studies, sociology, public administration, and social work.
[i] When research for our book was conducted, the SIF reports referenced here were publicly available on the website of the federal government’s AmeriCorps program. As of this writing, none of these reports are available online any longer. Authors have copies of the reports.
[ii] The foundations in our study were drawn from the population of private foundations based in the United States with: (1) an existing research and evaluation funding portfolio; (2) assets and funding capacity that would allow a foundation to fund an RCT if it desired (given the expense of RCTs, supporting one is relatively rare among foundations; we defined the funding capacity as at least $10 million in annual giving); and (3) a focus on funding social programs. We ultimately developed a population of sixteen foundations, with multiple program officers identified at a few of the larger foundations. We reached out to twenty-one individuals at fourteen foundations. Program officers at eleven of the foundations agreed to be interviewed, with multiple interviews at several foundations. In all, we spoke to sixteen program officers at eleven foundations.