This Campbell systematic review examines the effectiveness of certification schemes in improving the welfare of farmers and workers. The review summarises findings from 43 quantitative studies, and 136 qualitative studies. There is not enough evidence on the effects of CS on a range of intermediate and final socioeconomic outcomes for agricultural producers and wage workers. There are positive effects on prices. But workers’ wages do not seem to benefit from the presence of CS. Income from the sale of produce is higher for certified farmers, but overall household income is not. Context matters substantially for the causal chain between interventions of certification schemes and the well being of producers and workers. Generally, the quality of the studies is mixed, with a significant number of studies that are weak on a number of methodological fronts.
Plain language summary
Certification schemes have unclear impact on the well being of farmers and workers
Certification schemes (CS) set and monitor voluntary standards to make agricultural production socially sustainable and agricultural trade fairer for producers and workers. The evidence base is very limited and inconclusive. Certification increases prices and income from produce, but not wages or total household income. Certification agencies should adopt simpler programmes adapted to local context and rigorously test their impact.
What did the review study?
Certification sets and monitors voluntary standards, and can encompass systems engaging in a wider range of activities in policy, advocacy, and capacity building, and in building markets and supply chains, to make agricultural production socially sustainable and agricultural trade fairer. Certification is meant to affect a wide range of socioeconomic and environmental outcomes, to improve the well being of farmers and agricultural workers employed by corporate plantations or individual producers.
Certification schemes use a combination of standard‐setting actions, training, different types of market interventions, and the application of adequate labour standards. This review assesses whether certification schemes work for the well being of agricultural producers and workers in low‐ and middle‐income countries.
What studies are included?
Included studies evaluate the effects of CS on socioeconomic outcomes for agricultural producers and workers. Eligible CS are based on second‐ (industry‐level) or third‐party certifications, and exclude own‐company standards. For the effectiveness review, studies must use experimental or non‐experimental methods demonstrating control for selection bias. Qualitative studies are included to answer questions about barriers, facilitators and contextual factors; these report on relevant outcomes, have sufficient reporting on methods, and provide substantive evidence on relevant themes. The review includes 43 studies used for analysing quantitative effects, and 136 qualitative studies for synthesizing barriers, enablers and other contextual factors.
What is the aim of this review?
This Campbell systematic review examines the effectiveness of certification schemes in improving the welfare of farmers and workers. The review summarises findings from 43 quantitative studies, and 136 qualitative studies.
What are the main findings of this review?
There is not enough evidence on the effects of CS on a range of intermediate and final socioeconomic outcomes for agricultural producers and wage workers. There are positive effects on prices. But workers’ wages do not seem to benefit from the presence of CS. Income from the sale of produce is higher for certified farmers, but overall household income is not. Context matters substantially for the causal chain between interventions of certification schemes and the well being of producers and workers.
Generally, the quality of the studies is mixed, with a significant number of studies that are weak on a number of methodological fronts.
What do the findings of this review mean?
For farmers and workers the results show there is no guarantee that living standards improve through certification. To have a positive impact, CS need favourable conditions and the support of other factors. Some of these conditions depend on deeply rooted socioeconomic factors that, in the short to medium run, will not likely be altered substantially by certification. For CS practitioners and businesses, there are several lessons to learn. Claims about impact should match what is achievable and verifiable. Standards and interventions could be revised, moving away from multiple standards towards fewer overlaps between systems and a rationalisation of interventions. Impact evaluation standards should be given more attention. CS need to develop a deeper understanding of context, and adapt and pre‐test the type and range of interventions. Researchers and evaluators should consider using a range of methods for different kinds of research questions, and have a clear understanding of what kind of design is more appropriate for each question. They should also use a more consistent, rigorous approach in reporting methods and results.
How up‐to‐date is this review?
The review authors searched for studies published until July 2016. This Campbell Systematic Review was published in February 2017.
Executive summary
BACKGROUND
The rise of voluntary standards and their associated certification for agricultural products is a well‐established phenomenon in the contemporary dynamics of agricultural trade. Supply chain management is increasingly influenced by a proliferation of standards, and by the organisations setting and monitoring them over a growing number of products.
While the objectives of standards and certification schemes (CS) vary, the focus of this review is on social sustainability standards, which are closely related to ethical trading and to schemes that focus on socio‐economic outcomes of participants, essentially agricultural producers (particularly smallholders) and wage workers, whether employed by corporate plantations or individual agricultural producers.
OBJECTIVES
This systematic review addresses the extent to which, and under what conditions, CS for agricultural products result in higher levels of socio‐economic well being for agricultural producers and workers in low‐ and middle‐income countries (L&MICs). The primary review question is: What are the effects of certification schemes for sustainable agricultural production, and their associated interventions, in terms of endpoint socio‐economic outcomes for household/individual well being in low and middle income countries? The subsidiary review question is: Under what circumstances and why do certification schemes for agricultural commodities have the intended and/or unintended effects? What are the barriers and facilitators to such certification's intended and/or unintended effects?
SEARCH METHODS
We systematically searched for available literature from a wide range of sources. Several bibliographical databases were consulted. A very significant amount of time was devoted to a systematic search for relevant items through hand searching in targeted databases and websites, including consultation with relevant stakeholders in the community of standard‐setting organisations. In this field the ‘grey’ literature is very important. Thus, the standard bibliographic databases would not be enough to find all relevant material. Papers in English, French, Spanish, German and Portuguese were considered. The references retrieved for this review are up‐to‐date as of November 2015.
Some key references were added in July 2016 as a result of consultations with the ISEAL Alliance.
SELECTION CRITERIA
We included studies that evaluated the effects of CS on socio‐economic outcomes for agricultural producers and workers. We defined eligible CS as those based on second (industry‐level) or third‐party certifications thereby excluding own‐company standards. We examined the main types of interventions usually implemented by CS, organized around four groups: (a) capacity building; (b) market interventions (including price interventions, credit support, guaranteed market outlets, etc.); (c) premium‐funded social investments; and (d) labour standards. In most cases CS adopt combinations of these groups of interventions. We included studies that report at least one intermediate or final outcome of interest. For the effectiveness review, we selected studies that use experimental and quasi‐experimental methods, and other studies that demonstrated control for selection bias and sufficient confounders. We selected studies that provided relevant comparisons with non‐certified groups. For questions on barriers and facilitators and contextual factors we searched for and screened qualitative studies that reported on relevant outcomes, that had sufficient reporting on methods, and provided substantive evidence on key selected themes to complement the effectiveness review. We used a combination of single screening with substantial piloting and supervision in initial stages, and double screening with arbitration for disagreements in coding and inclusion/exclusion decisions for full‐text review.
DATA COLLECTION AND ANALYSIS
We developed separate coding tools according to the requirements of our two review questions. To compare effects on variable outcomes across studies we calculated standardised mean differences. The quantitative results were synthesised using inverse variance‐weighted random effects meta‐analysis.
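As an illustration of the synthesis step described above, the following is a minimal sketch of how a standardised mean difference and an inverse variance‐weighted random effects pooled estimate can be computed. It assumes the DerSimonian and Laird estimator for the between‐study variance and a normal‐approximation 95% confidence interval; the review text does not state which estimator was used, so both are assumptions. The function names and the example numbers are hypothetical and are not taken from the included studies.

```python
import math


def smd(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardised mean difference (Cohen's d) and its approximate variance.

    Assumption: pooled-SD standardisation without small-sample correction.
    """
    pooled_sd = math.sqrt(
        ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    )
    d = (mean_t - mean_c) / pooled_sd
    var = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    return d, var


def random_effects(effects, variances):
    """Inverse variance-weighted random effects pooling.

    Assumption: DerSimonian-Laird estimate of the between-study variance tau^2.
    Returns (pooled effect, CI lower bound, CI upper bound, tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random effects weights add tau^2 to each study's own variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, tau2


if __name__ == "__main__":
    # Hypothetical study summaries: (mean, mean, sd, sd, n, n) for
    # certified versus non-certified groups.
    studies = [
        (105.0, 100.0, 20.0, 20.0, 120, 130),
        (98.0, 100.0, 25.0, 24.0, 80, 85),
        (112.0, 100.0, 30.0, 28.0, 200, 210),
    ]
    results = [smd(*s) for s in studies]
    pooled, lo, hi, tau2 = random_effects(
        [d for d, _ in results], [v for _, v in results]
    )
    print(f"pooled SMD {pooled:.2f} (95% CI {lo:.2f} to {hi:.2f}), tau^2 {tau2:.3f}")
```

Because each study contributes one effect size per outcome, the inputs to `random_effects` are a single list of effects and a matching list of variances per synthesis.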
Only one effect size per outcome per study was included in any given synthesis. The analysis of qualitative material was organised around three main thematic areas: barriers and enablers in implementation dynamics; distributional dynamics, including gender equity issues; other internal and external contextual factors and barriers and enablers.
RESULTS
The initial search returned 10,753 studies, which, after dropping duplicates, a large number of irrelevant papers, and applying the selection criteria, were reduced to a final sample of 43 studies from 44 papers for review question 1 (effectiveness), and 136 studies from 114 papers for review question 2. All were published between 1990 and 2016. The majority of our material comes from research reports, working papers, book chapters, and theses. The included studies for the quantitative and qualitative syntheses provide evidence on a range of rural settings in L&MICs, with dominance of cases from Latin America. Despite the fact that there are many CS operating with agricultural commodities, included studies only cover a group among them (12 CS), which have attracted more research in the form of impact evaluations. Fairtrade certification is particularly well represented in the literature, with over half of the total number of included studies. Several agricultural products are covered by the included studies but coffee (38%) and fruits (17%) combined account for more than half of studies. In terms of population, a large majority of studies (77%) focus on agricultural producers, whereas the research on employment outcomes is rather limited. The quality of the included studies is mixed. The proportion of quantitative studies with high risk of bias ratings was relatively large. There are no randomized controlled trials (RCTs) but there is a range of quasi‐experimental designs employing different techniques of data analysis.
Given the paucity of calculable effect sizes per outcome and the variety of methods used in different studies the meta‐analysis encountered difficulties, and the number of studies with low or moderate risk of bias included for the synthesis of effects for each outcome is very small. Although there are many included qualitative studies of high quality, especially ethnographic research, the overall quality of this group is mixed as well. Several studies, especially non‐ethnographic contributions, are only borderline in terms of minimum reporting standards. In terms of quantitative results, we find that the available quantitative evidence does not give a clear picture of the impact – or lack thereof – of certification schemes. The synthesised effects for our key intermediate and final outcomes are summarised below. For each outcome we present the difference between certified groups and control groups in standardised percentages, with a central estimate and a likely range around the estimate, which reflects the uncertainty inherent in the estimate, added in parentheses.
• Yields: We found no clear effect on yields. While certification is associated with a decrease in yields of 20%, the overall effect is not statistically significant (central estimate ‐20%, range from ‐52% to 19%; SMD ‐0.42, 95%‐CI from ‐1.23 to 0.39). The five studies synthesised for this outcome range from negative to positive in their effect sizes. One study was rated as having low risk of bias, and two studies each were rated as moderate and high, respectively.
• Price: Prices for certified producers were 14% higher than for non‐certified producers (range from 4% to 24%; SMD 0.28, 95%‐CI from 0.09 to 0.49). Three of the four studies we synthesised for this outcome provided positive effect sizes. One study was rated as having high risk of bias while the other three were rated as moderate. The overall effect is statistically significant.
• Income from certified production: Incomes from the sale of produce were 11% higher if the produce was certified (range from 2% to 20%; SMD 0.22, 95%‐CI from 0.03 to 0.41). For this outcome we synthesised ten studies whose individual effect sizes ranged from negative to positive, though none of the negative effect size estimates were statistically significant. Half of the studies were rated as having moderate risk of bias and the other half as high. The overall effect is statistically significant.
• Wages: We find that wages for workers engaged in certified production were 13% lower than for workers working for uncertified employers (central estimate ‐13%, range from ‐22% to ‐3%; SMD ‐0.26, 95%‐CI from ‐0.46 to ‐0.06). Of the eight studies synthesised all but two provide negative effect size estimates and the positive effect size estimates are not statistically significant. One of the studies was rated as having low risk of bias, while five were rated as moderate and two as high risk. The overall effect is statistically significant.
• Total household income: Effects on the total household income of farmers are unclear. While household incomes of farmers engaged in certified production were 6% higher than those of households not engaged in certified production, the overall effect is not statistically significant (range from ‐3% to 16%; SMD 0.13, 95%‐CI from ‐0.06 to 0.32). The effect size estimates for individual studies range from negative to positive, though all statistically significant studies provided positive estimates. Four of the studies synthesised were judged to be of moderate risk of bias, while the other four were rated as high risk.
• Assets/wealth: We found no statistically significant effect on wealth.
Certified producers on average had slightly higher wealth levels than uncertified producers who had been selected to be similar to them, and the overall effect was a 3% increase in assets, but this effect was not statistically distinguishable from zero (range from ‐7% to 13%; SMD 0.05, 95%‐CI from ‐0.15 to 0.26). For this outcome we had just two studies, both of which provided positive effect sizes. One study was rated as having high risk of bias, the other as moderate.
• Illness: We also found no clear effect on producers’ health. Pooling the included studies suggests a 7% lower incidence of illness in certified producers compared to non‐certified producers, but the overall effect is not statistically significant (central estimate ‐7%, range from ‐16% to 2%; SMD ‐0.15, 95%‐CI from ‐0.32 to 0.03). Please note that, as these findings concern illness, a negative synthesised effect means an improvement in health. Just two studies provided estimates for this outcome, both of which pointed towards a lower incidence of illness. Both studies were rated as having high risk of bias, though.
• Schooling: Children in households of certified producers receive 6% more schooling than children in households of non‐certified producers (range from 0% to 12%; SMD 0.12, 95%‐CI from 0.01 to 0.24). The individual effect sizes provided by included studies range from negative but not statistically significant to positive. Three of the five studies synthesised for this outcome were rated as having high risk of bias, the other two as moderate. The overall effect is statistically significant.
In most cases, disaggregation by type of CS did not yield conclusive results, although for some CS results were more mixed than for others. Such is the case of Fairtrade for yields and income measures.
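The note defining the "standardised percentages" reported for each outcome above is not reproduced in this text. As a hedged illustration only, one conversion that reproduces the reported figures to within about one percentage point is the standard transformation from a standardised mean difference d to a correlation‐type effect size, r = d / sqrt(d^2 + 4), expressed as a percentage. Treating this as the conversion actually used in the review is an assumption, and the function names below are hypothetical. The sketch also shows the usual reading of a 95% confidence interval: an effect is called statistically significant when the interval excludes zero.

```python
import math


def smd_to_percent(d):
    """Convert an SMD to r = d / sqrt(d^2 + 4), expressed in percent.

    Assumption: this is one plausible reading of the reported percentages,
    not a conversion confirmed by the review text.
    """
    return 100.0 * d / math.sqrt(d ** 2 + 4.0)


def excludes_zero(ci_low, ci_high):
    """True when a confidence interval lies entirely above or below zero."""
    return ci_low > 0 or ci_high < 0


if __name__ == "__main__":
    # SMDs and 95% confidence intervals as reported in the summary above.
    reported = {
        "Price": (0.28, 0.09, 0.49),
        "Wages": (-0.26, -0.46, -0.06),
        "Total household income": (0.13, -0.06, 0.32),
    }
    for outcome, (d, lo, hi) in reported.items():
        flag = "significant" if excludes_zero(lo, hi) else "not significant"
        print(f"{outcome}: {smd_to_percent(d):.1f}% ({flag})")
```

Small discrepancies between the converted values and the reported percentages are to be expected, since the SMDs in the text are themselves rounded to two decimal places.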
The qualitative synthesis discussed a wide array of factors affecting the causal chain in different nodes along the chain, such as: producer organisations (POs) and their characteristics, particularly heterogeneity and power relations within them; relations with buyers and exporters; business models linking buyers and producers (whether open spot markets, contract farming or a mix); national institutions shaping the dynamics of agricultural trade and labour relations; barriers imposed by direct and indirect certification costs, which negatively affect adoption or the size of benefits accruing to producers; availability of additional external support, often critical for adoption and sustained maintenance of standards; inconsistency in monitoring and auditing practices; heterogeneity of participant groups and the effects of inequality on PO management and the sharing of benefits; difficulties in addressing deep‐rooted structures of inequality based on gender; the relative invisibility of large segments of agricultural wage workers, notably those employed by small farmers. The mixed and inconclusive quantitative effects, combined with the wide range of contextual factors to take into consideration, underline that CS operate in complex environments with multiple interventions, goals, actors and contexts, and as such they do not operate in a social, institutional and economic vacuum.
AUTHORS’ CONCLUSIONS
Overall, we found mixed results and a dominance of weak or not statistically significant effects. There were both positive and negative effects for different outcomes. Even within a given CS there is substantial variation in effects across different outcomes. Thus, it is hard to conclude anything about whether any particular CS performs better compared to others over a range of outcomes. Without more systematic high‐quality quantitative evidence on intermediate and final outcomes it is difficult to draw meaningful conclusions with actionable findings.
Context matters hugely, as the range of contextual factors and barriers and enablers is vast. This is not surprising and most Theories of Change developed for selected CS acknowledge the centrality of context specificity. Nonetheless, the reviewed qualitative research reveals a number of key barriers and facilitators or contextual features that seem important to understanding the impact of CS. Practitioners can extract some lessons about the kinds of contextual factors that seem prominent in mediating the impact of their interventions, such as: the characteristics of POs with which they partner; the deep‐rooted social relations of inequality, including gender dynamics, in rural areas of L&MICs; the direct and indirect certification costs, and their determinants; the specificities of each supply chain and especially existing relations between established buyers and producers; and the national and local contexts of regulation and economic development. There are various implications for researchers. First, there is a scarcity of high‐quality impact evaluations, with disproportionate attention to some CS and almost none to several others. The volume of research with rigorous study designs has fortunately expanded in the last 10 years but this review calls for more studies and on more outcomes, especially on employment effects, which have received less attention so far. Second, mixed‐methods theory‐based evaluations with appropriate counterfactual designs are likely to generate more valuable findings, given the importance of context and the need to link effects with barriers and facilitators in each study. Third, reporting standards must be improved, so published papers should devote more space and attention to reporting details of how research was conducted, limitations and all the relevant statistical information. Many studies had to be excluded from this review or from effect size calculations because of basic reporting gaps.