Research Into Practice 2005
Shoestring Evaluation: Designing Impact Evaluations Under
By Michael Bamberger, Jim Rugh, Dr. Mary Church, Lucia Fort
This paper discusses the shoestring evaluation approach, which was developed to assist evaluators in conducting evaluations that are as methodologically sound as possible when operating with budget and time constraints and with limitations on the types of data to which they have access. The approach is used in two main scenarios. In the first scenario, the evaluator is not called in until the project or program has been operating for some time and typically no baseline data had been collected on the project population or a control group. As managers, policy makers, and funding agencies often only start to focus on assessing impacts when the time to make decisions on future funding is approaching, the evaluator is frequently required to work without an adequate timeline and often with a limited budget. In the second scenario, the evaluator is called in at the start of the project but for budget, political, logistical, or methodological reasons it was not possible to collect baseline data on a control group, or in some cases even on the project population itself, using methodologies that will be comparable to later evaluation. In response to the growing demand for evaluation with these budget and time constraints, a number of rapid and economical assessment methods have been developed. Unfortunately, in an effort to deliver evaluation results on time and within budget, many of the basic principles of sound evaluation design, such as random sampling, specification of the program theory, instrument development, control for researcher bias, and general quality control may be compromised.
The shoestring evaluation approach provides tools for working within limitations of budget, time, and data, while at the same time providing a framework for the identification of threats to the validity or adequacy of the evaluation findings, and guidelines for addressing the different threats once they have been identified. The approach was originally developed to assist evaluators working in developing countries, where the budget, time, and data constraints are often most severe; but feedback from colleagues working in the U.S. and other industrial nations suggests that the approach may be more widely applicable. However, all of the case studies in the paper are taken from the authors’ experience in developing countries. Most of the tools and methods used in the approach will be familiar to experienced evaluators. What is new is the way the tools are combined into a six-step strategy to ensure the best quality evaluation under the particular budget, time, and data constraints affecting an evaluation. Consequently most of the data collection and analysis methods are only referenced briefly. We do, however, discuss some of the less familiar methods, such as the use of recall and other methods for reconstructing baseline data and control groups, the strengths and weaknesses of different quasi-experimental designs for addressing the three sets of constraints, the development of an integrated framework for assessing the validity and adequacy of multi-method evaluation designs, and the strategies for addressing the different threats to validity and adequacy. The goal is to conduct evaluations that are credible and adequately meet the needs of key stakeholders, given the conditions under which such evaluations need to be undertaken.
SHOESTRING EVALUATION SCENARIOS: TYPICAL TIME, DATA, AND BUDGET CONSTRAINTS FACING THE SHOESTRING EVALUATOR
Table 1 describes typical evaluation scenarios in which the evaluator is faced with constraints related to budget, time, and data.
Sometimes the evaluator is faced with a single constraint. For example, in some cases the budget is limited but the evaluator does not face excessive time constraints, while in other cases the main constraint is time. Or sometimes the evaluation can be planned before the project begins and there is an adequate budget but the evaluator is told that for political or ethical reasons it will not be possible to collect data on a control group. Many unlucky evaluators find themselves simultaneously contending with two or all three constraints! The following paragraphs discuss some of the most common problems encountered under each of these constraints.
TABLE 1
Time Constraints. The most common time constraint is when the evaluator is not called in until the project is already well underway and the evaluation has to be conducted within a much shorter period of time than the evaluator considers necessary, either in terms of a longitudinal perspective over the life of the project, or in terms of the time allotted for conducting the end-of-project evaluation. Under this scenario it is not possible to conduct a baseline study using methodology that will be comparable with the preplanned final evaluation. The time available for planning stakeholder consultations, site visits and fieldwork, and data analysis may also have to be drastically reduced in order to meet the report deadline. These time pressures are particularly problematic for an evaluator who is not familiar with the area, or even the country, and who does not have time for familiarization and for building confidence with the communities and the agencies involved with the study. The combination of time and budget constraints frequently means that foreign evaluators can only be in the country for a short period of time—often requiring them to use shortcuts that they recognize as methodologically questionable. Budget Constraints. Frequently, funds for the evaluation are not included in the original project budget, so the evaluation must be conducted with a much smaller budget than would normally be allocated. As a result, it may not be possible to apply the desirable data collection instruments (tracer studies or sample surveys, for example), or to apply the methods for reconstructing baseline data or creating control groups. Lack of funds is also the cause of several of the time constraints discussed earlier. Data Constraints. When the evaluation does not start until late in the project cycle, there is usually little or no comparable baseline data available on the conditions of the target group before the start of the project. 
Even if project records are available, they are often not organized in the form required for comparative before and after analysis. Project records and other secondary data often suffer from systematic reporting biases or poor record keeping standards. Even when secondary data is available for a period close to the project starting date it usually does not fully match the project populations. For example, employment data may only cover larger companies, whereas many project families work in smaller firms in the informal sector, or school records may only cover public schools, etc. Another problem is that survey data is often aggregated at the household level, so that information is not available on individual household members. This is a particular problem for gender analysis. Most agencies are only interested in collecting data on the groups with which they are working. They may also be concerned that the collection of information on non-beneficiaries might create expectations of financial or other compensation for these groups, which further discourages the collection of data on a control group. It is also often difficult to identify a control group even if funds are available. Many project areas have unique characteristics that make it difficult to find comparable control areas. For example, the project may cover all of the poorest communities, or it may have selected all of the most dynamic communities, or it may only be organized in districts where there is strong political support and a commitment of local government funds. In other cases, the project impacts concern sensitive topics such as women’s empowerment, contraceptive usage, domestic or community violence, or corruption, on which information is difficult to collect even when funds are available. Similar data problems can arise when the project is working with difficult to reach groups such as drug addicts, criminals, ethnic minorities, migrants, illegal residents, or in some cases women. 
THE SHOESTRING EVALUATION APPROACH
The shoestring evaluation approach proposes six steps for ensuring maximum possible methodological rigor in impact evaluations conducted under time, budget, or data constraints (see Figure 1). The approach can be used by evaluation practitioners, managers, or funding agencies and can be applied at the start, mid-term, or end of the project (see Table 2). Managers can use the approach to identify ways to reduce the cost and time required for the evaluation (steps 2 and 3 of the approach). If the evaluation is being subcontracted to outside consultants, managers may also use the checklist in Table 7 to assess the strengths and weaknesses of the proposed evaluation design (step 5). Funding agencies may also find the approach useful when assessing the validity of the conclusions and recommendations produced by the evaluation, and in some cases to suggest measures to address and correct some of the weaknesses identified in the evaluation (step 6). Evaluation practitioners, on the other hand, will often be asked by managers and/or funding agencies to propose the minimum costs and time required to conduct the evaluation (steps 2 and 3). In some cases they will use the threats checklists (step 5) to negotiate with managers on the need to relax one or more of the budget or time constraints in order to avoid some of the major threats to validity and adequacy (for example they may make the case for conducting a household survey, or for the inclusion of a control group). Once all key stakeholders have agreed on these issues, the evaluators will then use steps 2, 3, and 4 to develop the best and most robust evaluation design within these constraints.
FIGURE 1
TABLE 2
STEP 1: PLANNING AND SCOPING THE EVALUATION
Understanding Client Information Needs
A clear understanding of the client’s priorities and information needs is an essential first step in the design of any evaluation and also an effective way for the shoestring evaluator to eliminate unnecessary data collection and analysis, hence reducing the cost and time of the evaluation. The timing, focus, and level of detail of the evaluation should be determined by the client information needs and the types of decisions to which the evaluation must contribute (Patton 1997). While it is usually a simple matter to define the client (the agency commissioning the evaluation), a more difficult issue is to define the range of stakeholders whose concerns should be taken into account in the evaluation design, implementation, and dissemination. This question is not unique to shoestring evaluations and consequently is not discussed here, other than to point out that time and budget constraints will often create pressures to limit the range of stakeholders who can be consulted and involved. The evaluator should assess early on whether these constraints may eliminate some important groups, particularly vulnerable groups that are often more difficult and expensive to reach. The shoestring evaluator should meet as early as possible with clients and key stakeholders to ensure that the reasons for commissioning the evaluation are fully understood. The discussion of the program theory model with clients can help focus on these critical information needs. It is particularly important to understand the policy and operational decisions to which the evaluation will contribute and to agree on the level of precision required in making these decisions. Typical questions that decision makers must address include the following:
Many of these questions do not require a high level of statistical precision, but they do require reliable answers to questions, including the following:
The shoestring evaluator must understand which are the critical issues that must be explored in depth and which are less critical and can be studied less intensively. It is also essential to understand when the client needs rigorous (and expensive) statistical analysis to legitimize the evaluation findings to members of congress or parliament or funding agencies critical of the program, and when more general analysis and findings will be acceptable. The answer to these questions can have a major impact on the evaluation budget, particularly on the required sample design, size, and level of rigor.
Defining the Program Theory Model on Which the Project Is Based
Once the priorities and information needs of the clients and stakeholders have been defined, the evaluator should ascertain the program theory model or hypotheses on which the project is based. While program theory models can be used in all evaluations, they are particularly useful for shoestring evaluations to identify the critical areas and issues on which the limited evaluation resources or time should focus, and help define the ways in which triangulation methods can be used most effectively. A program theory “consists of an explicit theory or model of how the program causes the intended or observed outcomes” (Rogers, Petrosino, Huebner, & Hacsi, 2000, p. 5). All projects and programs are based on an implicit theory about the most effective way to achieve the intended program outputs and impacts, and the factors constraining or facilitating their achievement. In some cases the program theory is spelled out in project documents, and may be summarized in the form of a logic model (e.g., log frame), while in other cases it can only be elicited by the evaluator through consultations with program staff, participants, and partner agencies.
This will often be an iterative process in which an initial theory model is constructed by the evaluator on the basis of preliminary consultations and is then discussed and modified through further consultations. Leeuw (2003) identifies three ways to reconstruct the underlying program theory: the policy-scientific approach, the strategic assessment approach, and the elicitation methodology. The second and third approaches are probably the most useful for shoestring evaluations, and one of the two can be applied in most of them. In the strategic assessment approach, the theory model is identified in group discussions with key stakeholders. In the elicitation methodology, strategic documents are reviewed, managers are consulted, and decision making processes are observed. The program theory model also helps assess whether failure to achieve program objectives is due to an inadequate program theory or to ineffective implementation procedures (Lipsey, 1993; Weiss, 1997). Many theory models define four stages of the project cycle: inputs, implementation, outputs, and outcomes or impacts. However, we consider it useful to add two additional stages to evaluate how the project was designed (for example: Was the project designed top-down or were participatory planning methods used? Was the project designed around a set of interventions that are expected to produce certain outcomes, or by identifying the desired impact and then determining what the appropriate interventions should be?), and how effectively the stream of outcomes was sustained. The model should also identify contextual factors (the setting) that can affect implementation and outcomes. Contextual factors should include the economic, political, and organizational context as well as the socioeconomic characteristics of the affected population groups. Patton (2002) and Hentschel (1999) describe qualitative approaches for the analysis of many of these contextual factors.
A key element of theory models is the identification and monitoring of the critical assumptions on which the choice of inputs, the selection of implementation processes, and the expected linkages between the different stages of the program cycle are based. Logical Framework Analysis (sometimes more generically referred to as Logic Models) is a widely used program theory approach that requires the critical assumptions to be identified and their validity assessed at each stage of project implementation.
Identifying the Constraints to Be Addressed by the Shoestring Approach
Step 1 also identifies the constraints facing the evaluation and determines which of steps 2, 3, and 4 (tools for addressing budget, time, and data constraints) will be required.
STEPS 2 AND 3: ADDRESSING BUDGET AND TIME CONSTRAINTS
This section describes five strategies for addressing typical budget and time constraints that evaluators face (see Table 3). While most of these strategies apply to both time and budget constraints, the following section discusses some additional strategies that can be used when adequate resources are available but where the evaluator is working under time constraints.
TABLE 3
Simplifying the Evaluation Design
In their review of meta-analyses in the fields of psychological, educational, and behavioral treatments, Lipsey and Wilson (1993) report on 74 meta-analyses where the same topic was addressed with both randomized experiments and other types of study designs that are usually simpler and less costly. The average effect sizes did not differ between the two types of study, suggesting that the randomized experiments and simpler non-experiments tend to produce similar causal conclusions in the fields examined. This provides qualified support to the careful use of shoestring evaluation designs as one way to reduce impact evaluation costs, while recognizing that this increases the risk of threats to the validity of the conclusions. When budget and time are not constraints, many impact evaluations would use one of the two robust designs described in Table 4. However, the shoestring evaluator must frequently select among the following five less robust designs in Table 4, which are less demanding in terms of time or budget. All of these less robust designs eliminate one or more of the pretest or posttest observations on the project or control group, and consequently are vulnerable to more of the threats to the validity of the evaluation conclusions (described in step 5 of the model). Step 4 discusses ways to strengthen some of these designs. Table 4 begins by describing the two most methodologically robust quasi-experimental designs (QED), which can be used when evaluators have adequate time and resources, and when there are no major problems of access to data. Table 4 also describes five alternative models that can be used when one or more of the shoestring constraints are factors. It should be noted, however, that even the two most robust designs are subject to a number of threats to validity (see Shadish, Cook, & Campbell, 2002, for an extended discussion).
In model 2, which is probably the most widely used evaluation design when budget and time are not major constraints, observations are conducted on a randomly selected sample of project beneficiaries (P1) and a matched control group (C1) before the project (X) begins. The observations are repeated on both groups (P2 and C2) at the completion of the project. The impact of the project intervention X is estimated as the difference of means or proportions between the observed change in the project and control groups. This can either be measured by a test for difference of differences of means or proportions or by using multivariate analysis to control for attributes such as income, age, education, and family size. It is normally not possible to randomly assign subjects to the project and control groups, so some element of subjectivity will be involved in selecting the most comparable control group. Each of the five less robust models involves eliminating one or more of the pretest or posttest observations on the control or project groups:
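The difference-of-differences estimate described for model 2 can be sketched in a few lines of Python. This is only an illustration: the function name `difference_in_differences` and the school attendance figures are hypothetical and do not come from the paper, and a real analysis would add a significance test or the multivariate controls mentioned above.

```python
from statistics import mean


def difference_in_differences(p1, p2, c1, c2):
    """Estimate impact as (P2 - P1) - (C2 - C1), using group means.

    p1, p2: pretest and posttest observations on project participants.
    c1, c2: pretest and posttest observations on the control group.
    """
    project_change = mean(p2) - mean(p1)
    control_change = mean(c2) - mean(c1)
    return project_change - control_change


# Hypothetical data: 1 = child attends school, 0 = does not.
p1 = [0, 1, 0, 1]  # project group before the project: 50% attend
p2 = [1, 1, 1, 0]  # project group after the project: 75% attend
c1 = [0, 1, 0, 1]  # control group before: 50% attend
c2 = [1, 0, 1, 0]  # control group after: 50% attend

impact = difference_in_differences(p1, p2, c1, c2)
print(impact)  # 0.25
```

With binary observations the means are proportions, so the same function covers both the difference of means and the difference of proportions. Dropping any one of the four observation sets, as the less robust models do, removes one term from this calculation.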
TABLE 4
Note. P = project participants; C = control group; P1, P2, C1, C2, etc. = first, second (third and fourth) observations of the project or control groups in a particular evaluation design; X = project intervention (this is normally a process rather than a discrete event).
Each of models 3 through 7 can produce significant time and cost savings. However, there is a price to pay, as all of these designs are less able than the two more robust designs to address threats to validity and are thus more likely to lead to wrong conclusions concerning the contribution of the project intervention to the observed outcomes. Nevertheless, when their strengths and weaknesses are fully understood and addressed, these designs can provide an acceptable level of precision for many, if not most, management needs at a much reduced cost.
Clarifying Client Information Needs
The costs and time required for data collection can sometimes be significantly reduced through a clearer definition of the information required by the client and the kinds of decisions to which the evaluation will contribute (see step 1A). For example, if the evaluand is a pilot microcredit project targeting women farmers in a region with strong social controls on women’s ability to control productive resources, the client’s main interest may be to assess whether it is possible for women to apply for loans and to control how money is used. In this context it may be possible to convince the client that it is not necessary to invest time and resources in a control group in order to compare outcomes until the basic question, “Can the program be implemented as planned?” has been answered. As another example, clients will often say, “It would be interesting to compare the outcomes of the project for (different religious groups, recent urban migrants vs. those who have been living in the city for a long time, etc.).” Often, discussions with the client will reveal there is no particular reason to think all of these factors will be important for the project. Once it is understood that each additional level of stratification of the sample will imply a significant increase in sample size and cost, the client will often agree that some or all of these factors could be eliminated from the sample design, resulting in significant reductions in sample size, cost, and time.
Reducing Sample Size
In many cases, a clearer understanding of the kinds of decisions to be made by clients and the level of precision required for these decisions can result in a significant reduction in sample size. As an illustration let us assume that an evaluator is asked to estimate whether there has been a significant change in the proportion of the target population attending school after an experimental school meals program has been introduced. The change would be assessed by a comparison of two groups: either a before and after comparison of the project population or a comparison of the ex-post attendance rates for the project and a control group. If the project is only expected to produce a small change, the client may require that the sample be large enough to determine whether a change as small as, say, 5% is statistically significant. Using certain simplifying assumptions, it would be necessary, in this case, to interview a sample of 1,536 respondents. However, if a minimum difference of 10% is acceptable then the corresponding sample size could be reduced to 384 families, and for a 20% change only 96 interviews would be required. To simplify the computations, the above examples are based on the estimation of differences between proportions.
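The sample sizes quoted above can be reproduced with a short Python sketch. The paper does not state its simplifying assumptions, so the ones used here are a guess that happens to match the quoted figures: a 95% confidence level (z = 1.96), a proportion of 0.5 (the most conservative case), and a margin of error on each estimate equal to half of the smallest difference to be detected. The names `sample_size` and `stratified_total` are hypothetical.

```python
import math

Z_95 = 1.96  # assumed two-sided 95% confidence level


def sample_size(min_difference, p=0.5, z=Z_95):
    """Interviews needed under the assumptions stated in the lead-in.

    min_difference: smallest change the client needs to detect (0.05 = 5%).
    """
    margin = min_difference / 2  # assumed margin of error per estimate
    n = (z ** 2) * p * (1 - p) / margin ** 2
    # Truncated only to match the figures quoted in the text;
    # rounding up is the more cautious choice in practice.
    return math.floor(n)


def stratified_total(min_difference, n_strata):
    """Total sample if every stratum must reach the same precision
    (an assumption made for illustration)."""
    return sample_size(min_difference) * n_strata


for d in (0.05, 0.10, 0.20):
    print(f"{d:.0%} difference -> {sample_size(d)} interviews")
# 5% difference -> 1536 interviews
# 10% difference -> 384 interviews
# 20% difference -> 96 interviews
```

Because the required sample varies with the inverse square of the minimum difference, doubling the acceptable difference cuts the sample to a quarter. Under the same assumptions, `stratified_total(0.10, 2)` gives 768, which shows how quickly each extra level of stratification raises the cost.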
The same general principles apply when estimating differences between means, but in this case the required sample size is harder to estimate in advance as it depends on knowing the standard deviations of the two populations (for a more detailed discussion of the determinants of sample size see Lipsey, 1990). It is important to avoid arbitrary reductions in sample size simply to save money and time; and the estimation of sample size must always be based on a full understanding of the client’s information needs and the required level of precision. It is also important to explain to the client that each additional level of disaggregation of the estimates (e.g., by different regions, sex of household head, the type of benefits received) will require a corresponding increase in the sample size.
Reducing Costs of Data Collection and Analysis
Considerable savings in costs and time can often be achieved through reducing the length and complexity of the survey instrument. A ruthless pruning of survey instruments to eliminate nonessential information can often significantly reduce the length of the survey. Areas in which the amount of information to be collected can often be reduced include: demographic information on each household member; sources of household income and expenditures; and behavior such as travel patterns, time use, and agricultural activities. It is again important to define information requirements with the client and not to arbitrarily eliminate information simply to produce a shorter survey instrument. Be clear on what indicators are likely to significantly contribute to the results being evaluated. Be ruthless in leaving out what would simply be interesting to know for a researcher, but unaffordable for a shoestring evaluation. Some additional ways to reduce the costs of data collection include the following:
Box 1 presents three case studies illustrating ways to cut the cost and time of data collection.
Integrating Quantitative and Qualitative Approaches
While mixed method approaches are recommended for all evaluation designs, the integration of quantitative and qualitative data collection and analysis methods is particularly important for shoestring evaluators faced with budget and time constraints. The triangulation of several independent estimators can help validate information collected from smaller samples or when using the cost saving methods described above (Bamberger, 2000a, 2000b).
Specific Ways to Reduce the Time Required to Collect and Analyze Data
While most of the above methods can save both money and time, there are a number of additional ways to economize on the time required to collect and analyze data. Some of these methods may increase costs, so it is important to clarify with the client the relative importance of the budget and time constraints:
STEP 4: ADDRESSING DATA CONSTRAINTS
The shoestring evaluator may be faced with at least four sets of problems resulting from a lack of critical evaluation data:
Reconstructing Baseline Data on the Project or Control Groups
When the evaluation does not begin until midway through the project or even until the end of the project, the evaluator will frequently find that no reliable information is available on the conditions of project participants or control groups before the project interventions began. The following approaches can be used to reconstruct the baseline conditions. Though not as accurate as would be obtained from a good baseline study, the data may perhaps be sufficiently reliable for the purposes of a shoestring evaluation (see also examples in Box 2).
Using secondary data. Secondary data on previous years is often available on factors such as morbidity, access to health services, school attendance, farm prices, and travel time and mode from government agencies, central statistical bureaus, nongovernmental organizations (NGOs), and university researchers. While these sources can provide a useful (and often the only available) approximation to baseline conditions, it is essential to assess their strengths and weaknesses with respect to: differences in time periods (which are particularly important when economic conditions may have changed between the survey date and the project launch), differences in the population covered (e.g., did the surveys include employment in the informal as well as the formal sectors and were both women and men interviewed?), whether information was collected on key project variables and potential impacts, and whether or not the secondary data is statistically valid for the particular target population addressed by the project being evaluated. Records from other projects in the same area can often provide information on conditions before the current project began. For example, surveys are often conducted to estimate the number of children not attending school, sources and costs of water supply, or availability of microcredit. An assessment must be made of the reliability and utility of these data for the purpose of the evaluation. Using recall. Recall is a potentially valuable, although somewhat treacherous, way to estimate conditions prior to the start of the project and hence to reconstruct or strengthen baseline data. The limited available evidence suggests that while estimates from recall are frequently biased, the direction, and sometimes the magnitude, of the bias are often predictable so that usable estimates can often be obtained. 
Schwarz and Oyserman (2001) provide a useful review of cognitive and behavioral factors affecting recall and of ways to design data collection instruments to reduce some of the potential biases. Recall is therefore a potentially useful tool, particularly in the many situations where no other systematic baseline data is available. The utility of recall can often be enhanced if two or more independent estimates can be triangulated. While recall is generally unreliable for collecting precise numerical data such as income, incidences of diarrhea, or farm prices, it can be used to obtain information on major changes in the welfare conditions of households. For example, families can usually recall which children attended a school outside the community before the village school opened, how children traveled to school, and travel time and cost. Also families can often provide reliable information on access to health facilities, where they previously obtained water, how much they used, and how much it cost. On the other hand, families might be reluctant to admit that their children had not been attending school or that they had been using certain kinds of traditional medicine. They might also deliberately underestimate how much they had spent on water if they are trying to convince planners they are too poor to pay the water charges proposed in a new project. Two common sources of recall bias have been identified. First, the underestimation of small and routine expenditures increases as the recall period increases. Second, there is a telescoping of recall concerning major expenditures, such as the purchase of a cow, bicycle, or item of furniture, so that expenditures made outside of the recall period (e.g., the past 12 months) will often be reported as having been made within the reference period. While most of the research on recall bias has been carried out by U.S. 
studies such as the Expenditure Surveys, the general results are potentially relevant to developing countries. The Living Standards Measurement Survey (LSMS) program has conducted some assessments on the use of recall for estimating consumption in developing countries. The LSMS program was launched in the 1980s by the World Bank to develop standard survey methodologies and questionnaires for comparative analysis of poverty and welfare in developing countries (Grosh & Glewwe, 2002). For a review of the recall bias literature, see Deaton & Grosh (2000). The most systematic assessments of the reliability of recall data in developing countries probably come from demographic studies on the reliability of reported contraceptive usage and fertility. The existence of a number of large-scale comparative studies such as the World Fertility Survey means national surveys using comparable data collection methods are available for different points in time. For example, similar surveys were conducted in the Republic of Korea in 1971, 1974, and 1976, each of which obtained detailed information on current contraceptive usage and fertility, as well as detailed historical information based on recall for a number of specific points in the past. This permitted a comparison of recall in 1976 for contraceptive usage and fertility in 1974 and 1971 with exactly the same information collected from surveys in those two earlier years. It was found that recall produced a systematic underreporting, but that the underestimation could be significantly reduced through the careful design and administration of the surveys (Pebley, Goldman, & Choe, 1986). Similar findings are available from demographic analysis in other countries. The conclusion from these studies is that recall can be a useful estimating tool with predictable and to some extent controllable errors. 
Unfortunately, it is only possible to estimate the errors where large-scale comparative survey data is available, and there are few, if any, other fields with a similar wealth of comparative data. Interestingly, there are a number of studies suggesting that recall can provide better estimates of behavioral changes in areas such as primary prevention programs for child abuse, vocational guidance, and programs for delinquents than conventional pretest and posttest comparisons based on self-assessment (Pratt, McGuigan, & Katzev, 2000). This is due to the fact that before entering a program, subjects often overestimate their behavioral skills or knowledge through a lack of understanding of the nature of the tasks being studied and the required skills. After completing the program they may have a better understanding of these behaviors and may be able to provide a better assessment of their previous level of competency or knowledge and how much these have changed. The present authors are not aware of any “response shift” studies examining things like self-assessment of poverty, empowerment, or community organizational capacity in developing countries, but these are all areas where the response shift concept could potentially be applied to reconstruction of baseline data for shoestring evaluations. Working with key informants. Key informants such as community leaders, doctors, teachers, local government agencies, NGOs, and religious organizations may be able to provide useful reference data on baseline conditions. However, many of these sources have potential biases (such as health officials or NGOs wishing to exaggerate health or social problems, or community leaders downplaying community problems in the past by romanticizing conditions in the “good old days”). Caldwell (1985), reviewing lessons from the World Fertility Survey, uses some of these considerations to express reservations about the use of key informants for retrospective analysis in fertility surveys. 
Using participatory methods. Participatory methods such as many of the Participatory Rural Appraisal (PRA) tools can be used to help the community reconstruct past conditions and identify critical incidents in the history of the community or region (Rietbergen-McCracken & Narayan, 1997).

Reconstructing Control Groups

There are additional difficulties in constructing control groups, as this entails identifying appropriately comparable control areas as well as measuring the conditions in these areas. With few exceptions, project areas are selected purposively, rather than randomly, to target the poorest areas or those with the greatest development potential, so it can be a challenge to identify control locations that are reasonably similar to the project areas. Randomization is sometimes used when demand significantly exceeds supply and participants are chosen by lottery or another random procedure. This sometimes occurs with social funds (Baker, 2000) or with community supported schools (Kim, Alderman, & Orazem, 1999). It will frequently be necessary to complement the limited quantitative data with judgment when deciding what is a good or acceptable control group. When the statistical data is available, cluster analysis can provide a powerful tool for selecting a comparison group that can be matched on the variables of most interest to the project (Weitzman, Silver, & Dillman, 2002). It cannot be assumed that the control group is “pure.” Rarely, if ever, in society are all factors equal between a project group and a control group (or comparison community), other than the project intervention itself. It is important to look for and document interventions by other organizations in the control community. The analysis at the time of the evaluation should then try to determine the relative influence of changes brought about by the project’s interventions compared to different internal and external influences in the comparison group.
It is sometimes possible to construct an internal control group within the project area. Households or individuals who did not participate in the project or who did not receive a particular service or benefit can be treated as the control for the project in general or for a particular service (for example, subjects may be categorized according to such factors as their distance from a road or water source, whether any family member attended literacy classes, or the amount of food aid they received). When projects are implemented in phases, it is also possible to use households selected for the second or subsequent phases as the control group for the analysis of the impacts of the previous phase. For example, the economic status of a new cohort of women about to receive their microfinance loans might serve as a control group to compare with those who received loans during the past year. Selection bias. Throughout the discussion of control groups it is important to constantly check for potential selection bias. With respect to internal control groups, the families in a project area who did not participate in the project are likely to be different in potentially important ways from those who did participate. In some cases nonparticipants may have been excluded or discouraged on the basis of their political affiliation (or lack thereof), sex, ethnicity, or religion. In other cases they may not have had the motivation or self-confidence to apply or get selected. Similar factors may explain why some communities were not selected. The following section discusses some of the statistical procedures that can be used to at least partially address the selection bias issue and to improve the comparability of the control group. How effective the statistical controls are will depend on the adequacy of the control model and the reliability of the measurement of the control variables (Shadish et al., 2002). 
Problems in Working with Nonequivalent Control Group Data from Surveys

Evaluations frequently compare the project population with nonequivalent control areas selected to match the project population as closely as possible. When subjects were not randomly assigned to the project and control groups, it is possible to strengthen the analytical value of available control groups by statistically matching subjects from the project and control areas on a number of relevant characteristics such as education, income, and family size. The evaluations of Ecuador’s cut flower export industry and the Bangladesh microcredit programs are examples of this approach (see Box 3). If differences in the dependent variables (the number of hours men and women spend on household tasks, men’s and women’s savings and expenditure on household consumption goods, etc.) are still statistically significant after controlling for these household characteristics, this provides preliminary indications that the differences in the dependent variables may be due, at least in part, to the interventions of the project. While this type of multivariate analysis is a powerful analytical tool, one important weakness is that the evaluation design does not provide any information on the initial conditions or attributes of the two groups prior to the project intervention. For example, the higher savings rates of women in the communities receiving microcredit in Bangladesh might be due to the fact that they had previously received training in financial management or that they already had small business experience. These nonequivalent control group designs can be strengthened if they incorporate some of the methods discussed above for reconstructing baseline data.
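The idea of matching project and control subjects on household characteristics can be illustrated with a small sketch. It uses nearest-neighbor matching on standardized covariates, which is only one simple option and is not claimed to be the procedure used in the Ecuador or Bangladesh evaluations. The field names and all figures are hypothetical, and, as noted above, a comparison of this kind says nothing about differences that existed before the project began.

```python
from statistics import mean, pstdev


def standardize(rows, keys):
    """Rescale each covariate in `keys` to mean 0 and standard deviation 1."""
    scaled = [dict(r) for r in rows]  # copy, so the outcome stays unscaled
    for k in keys:
        values = [r[k] for r in rows]
        mu, sd = mean(values), pstdev(values)
        for s, r in zip(scaled, rows):
            s[k] = (r[k] - mu) / sd if sd > 0 else 0.0
    return scaled


def matched_difference(project, control, keys, outcome):
    """Mean outcome gap between project households and their nearest control match.

    Matching is with replacement, using squared Euclidean distance over the
    standardized covariates listed in `keys`.
    """
    scaled = standardize(project + control, keys)
    proj, ctrl = scaled[:len(project)], scaled[len(project):]
    gaps = []
    for p in proj:
        nearest = min(ctrl, key=lambda c: sum((p[k] - c[k]) ** 2 for k in keys))
        gaps.append(p[outcome] - nearest[outcome])
    return mean(gaps)


# Hypothetical households, invented for illustration only.
project = [
    {"education": 4, "income": 100, "family_size": 5, "savings": 30},
    {"education": 8, "income": 200, "family_size": 3, "savings": 50},
]
control = [
    {"education": 4, "income": 110, "family_size": 5, "savings": 20},
    {"education": 8, "income": 190, "family_size": 3, "savings": 45},
    {"education": 12, "income": 400, "family_size": 2, "savings": 90},
]

gap = matched_difference(
    project, control, ["education", "income", "family_size"], "savings"
)
print(f"Mean savings gap after matching: {gap}")
```

In this sketch the third control household is never used as a match because it differs sharply from both project households, which is the point of matching: dissimilar cases do not distort the comparison.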
Collecting Data on Sensitive Topics or from Groups Who Are Difficult to Reach

A third set of problems, not unique to shoestring evaluations, concerns the collection of data on sensitive topics such as domestic violence, contraceptive usage, or teenage violence, or from difficult-to-reach groups such as sex workers, drug users, ethnic minorities, the homeless, or in some cultures, women (Bamberger, Blackden, Fort, & Manoukian, 2001). These situations require the use of appropriate qualitative methods such as participant observation, focus groups, and key informants. These issues are particularly important for the shoestring evaluator, as budget and time constraints may create pressures to ignore these sensitive topics or difficult-to-reach groups. Box 4 presents three case studies on the collection of data on sensitive topics.
STEP 5: IDENTIFYING THREATS TO THE VALIDITY AND ADEQUACY OF THE EVALUATION DESIGN AND CONCLUSIONS

In their efforts to reduce time and costs and to overcome data limitations, evaluators have frequently ignored some of the basic principles of evaluation design, such as random sampling, specification of the evaluation model, instrument development, and full documentation of the data collection and analysis process. As a consequence, many shoestring evaluations suffer from serious methodological weaknesses that threaten the validity or generalizability of evaluation findings. The analysis of threats to the validity of conclusions from quasi-experimental designs has been familiar to quantitative evaluators at least since Cook and Campbell’s 1978 publication. However, there is a continuing debate concerning the extent to which similar criteria can, and even should, be applied to qualitative evaluations. The challenge for the shoestring approach is to develop guidelines for assessing the validity and adequacy of multi-method evaluation designs. Shadish et al. (2002) have updated Cook and Campbell’s (1978) four categories of threats to conclusion validity, namely threats to statistical conclusion validity, internal validity, construct validity, and external validity.
Other writers such as Miles and Huberman (1994) and Guba and Lincoln (1989) have proposed additional criteria of reliability and objectivity. Table 5 presents a checklist based on Shadish, Cook, and Campbell’s four categories of threats to conclusion validity that can be used to assess potential weaknesses in all of the seven shoestring designs presented in Table 4. Additional subcategories pertinent to shoestring evaluations have been added to the checklist by the present authors.

TABLE 5

Note. Items in italics were added by present authors.

While this kind of checklist is often used for assessing the validity of quantitative evaluations (Cook & Campbell, 1978; Shadish et al., 2002), there is a continuing debate on the appropriate criteria for judging the adequacy or quality of conclusions drawn from qualitative evaluations. Schwandt (1990) argues that it is not possible to specify criteria for assessing qualitative research, while Patton (2002) proposes five different sets of criteria for judging the quality and credibility of different types of qualitative enquiry. However, other writers believe it is possible to establish uniform criteria for assessing qualitative evaluations. Guba and Lincoln (1989) proposed the use of four sets of “parallel” or “foundational” criteria for judging goodness or quality of qualitative evaluations, which parallel the post-positivist criteria (see Table 6):

TABLE 6
Guba and Lincoln were not completely comfortable with the use of their parallel criteria; they considered them primarily methodological and preferred “authenticity criteria” such as fairness, ontological authenticity, educative authenticity, catalytic authenticity, and tactical authenticity. Nevertheless, other authors such as Miles and Huberman (1994) and Yin (2003) have proposed the use of these parallel criteria as a way to move towards comparable criteria for assessing the validity and adequacy of quantitative and qualitative evaluation designs. Table 7 presents a first attempt to develop an integrated checklist for assessing the validity and adequacy of multi-method shoestring evaluation designs. It uses the four sets of parallel criteria proposed by Guba and Lincoln plus a fifth category, utilization, proposed by Miles and Huberman. For evaluations that include a quasi-experimental design component, there is a cross-reference to the threats to internal, statistical, construct, and external conclusion validity given in Table 6. The shoestring evaluator can also find additional guidance from sources such as the American Evaluation Association’s “Guiding Principles for Evaluators,” which includes 23 points for assessing the quality of evaluations in terms of (1) systematic enquiry, (2) competence, (3) integrity/honesty, (4) respect for people, and (5) responsibilities for general and public welfare (Shadish, Newman, Scheirer, & Wye, 1995).

TABLE 7
STEP 6: ADDRESSING AND REDUCING THREATS TO VALIDITY AND ADEQUACY OF THE EVALUATION DESIGN

A key element of the shoestring approach is that it recommends practical measures to correct or reduce threats to validity and adequacy once they have been identified. The following are examples of approaches for addressing problems identified in each of the four sets of threats to validity of quantitative evaluation designs presented in Table 5.
The following are examples of ways to address threats to validity and adequacy of evaluation designs in each of the five categories of the integrated checklist given in Table 7:
SUMMARY: SHOESTRING EVALUATION IN A NUTSHELL

Unfortunately, planning for most international development impact evaluations probably does not begin until a project or program is well underway, and most of the evaluations must be conducted under budget and time constraints, often with limited access to baseline data and control groups. Despite these constraints, there is a growing demand for systematic assessments of the impacts of development projects and their potential replicability. Fortunately, most policy makers, managers, and funding agencies need answers to relatively straightforward questions, which do not require great methodological sophistication. Typical questions include: (1) Is the project achieving its basic objectives? (2) Who has benefited and who has not? (3) Is the project sustainable? and (4) Is this approach replicable? The straightforward nature of these questions makes it possible to provide reasonably robust and useful answers, even within the typical constraints under which evaluators operate.

The pressures of working under budget and time constraints have often resulted in a lack of attention to sound research design, with limited attention given to identifying and addressing factors affecting the validity of the findings. The shoestring evaluation approach is being developed to respond to the demand for ways to work within budget, time, and data constraints while at the same time ensuring the maximum possible methodological rigor within the given evaluation context. The shoestring evaluation guidelines discussed in this paper are summarized below. While many of these principles can be applied to all impact evaluations, each of them has a specific application to scenarios where the evaluator is subject to budget, time, or data constraints.
For example, while all evaluations must understand the client’s information needs (Step 1A of the guidelines), a careful prioritization of these needs can help define areas in which time and costs can be reduced, and other areas where this is not possible without compromising the goals of the evaluation. Similarly, the checklists of threats to validity and adequacy (Step 5) should be applied to all evaluations, but they have particular relevance when assessing the implications of using designs that lack critical baseline or control group data.

Step 1. Planning and Scoping the Evaluation
Step 2. Addressing Budget Constraints
Step 3. Addressing Time Constraints (in Addition to the Methods Discussed in Step 2)
Step 4. Addressing Data Constraints
Step 5. Identifying Threats to the Validity and Adequacy of the Evaluation Design and Conclusions
Step 6. Addressing Identified Weaknesses and Strengthening the Evaluation Design and Analysis
In conclusion, despite the fact that major challenges arise when evaluators must work under serious budget, time, and data constraints, evaluators are constantly asked to address important operational and policy questions under these constraints. While the required methodological adjustments inevitably increase the range and seriousness of threats to the validity of the evaluation conclusions, it is hoped that the shoestring approach described in this paper can help support efforts to produce adequately robust and useful evaluation findings when working with real world constraints.

REFERENCES

Baker, J. (2000). Evaluating the impacts of development projects on poverty: A handbook for practitioners. Washington, DC: World Bank.

Bamberger, M. (Ed.). (2000a). Integrating quantitative and qualitative research in development projects. Washington, DC: World Bank.

Bamberger, M. (2000b). The evaluation of international development programs: A view from the front. American Journal of Evaluation, 21, 95–102.

Bamberger, M., Blackden, M., Fort, L., & Manoukian, V. (2001). Gender. In J. Klugman (Ed.), A sourcebook for poverty reduction strategies (Vol. 1, pp. 333–376). Washington, DC: World Bank. Retrieved October 22, 2004, from www.worldbank.org/poverty

Caldwell, J. (1985). Strengths and limitations of the survey method approach for measuring and understanding fertility change: Alternative possibilities. In J. Cleland & J. Hobcraft (Eds.), Reproductive change in developing countries: Insights from the World Fertility Survey (pp. 45–63). Oxford, United Kingdom: Oxford University Press.

Cook, T., & Campbell, D. (1978). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.

Dayal, R., van Wijk, C., & Mukherjee, N. (2000). Methodology for participatory assessments with communities, institutions and policy makers: Linking sustainability with demand, gender and poverty. Washington, DC: World Bank.

Deaton, A., & Grosh, M. (2000). Consumption. In M. Grosh & P. Glewwe (Eds.), Designing household survey questionnaires for developing countries: Lessons from 15 years of the Living Standards Measurement Study (pp. 91–134). Washington, DC: World Bank.

Dimitrov, T. (in press). Enhancing the performance of a major environmental project through a focused mid-term evaluation: The Kombinat za Cvetni Metali environmental improvement project in Bulgaria. In Influential evaluations: Detailed case studies. Washington, DC: World Bank.

Gomez, L. (2000). Gender Analysis of Two Components of the World Bank Transport Projects in Lima, Peru: Bikepaths and Busways. World Bank internal report.

Grosh, M., & Glewwe, P. (Eds.). (2002). Designing household survey questionnaires for developing countries: Lessons from 15 years of the Living Standards Measurement Study. Washington, DC: World Bank.

Guba, E., & Lincoln, Y. (1989). Fourth generation evaluation. Thousand Oaks, CA: Sage Publications.

Hashemi, S., Schuler, S. R., & Riley, A. P. (1996). Rural credit programs and women’s empowerment in Bangladesh. World Development, 24, 635–653.

Hentschel, J. (1999). Contextuality and data collection methods: A framework and application to health service utilization. The Journal of Development Studies, 35(4), 64–94.

The impact of social funds in Eritrea. (n.d.). Unpublished report.

Justice, J. (1986). Policies, plans and people: Culture and health development in Nepal. Berkeley: University of California Press.

Kim, J., Alderman, H., & Orazem, P. (1999). Can private school subsidies increase schooling for the poor? The Quetta Urban Fellowship Program. World Bank Economic Review, 13, 443–466.

Khandker, S. (1998). Fighting poverty with microcredit: Experience in Bangladesh. Oxford, United Kingdom: Oxford University Press.

Kumar, K. (Ed.). (1993). Rapid appraisal methods. Washington, DC: World Bank.

Leeuw, F. (2003). Reconstructing program theories: Methods available and problems to be solved. American Journal of Evaluation, 24, 5–20.

Lipsey, M. W. (1990). Design sensitivity: Statistical power for experimental research. Newbury Park, CA: Sage Publications.
Reprinted with permission from “Shoestring Evaluation: Designing Impact Evaluations under Budget, Time and Data Constraints,” by M. Bamberger, J. Rugh, M. Church, and L. Fort, 2004, The American Journal of Evaluation, vol. 25, no. 1. Copyright 2004 by Elsevier.
Published by
Pacific Resources for Education and Learning All rights reserved, including the right of reproduction in whole or in part in any form. ISBN 9742816-1-1 This product was supported in part by awards from the
U.S. Department of Education (U.S. ED) and other