OUR lesson today will provide an overview of data management and analysis. Data management and data analysis form a multi-step process, and here we have laid out six key steps. First, we start with data cleaning. Next we move to calculating key variables, then merging the final dataset. Our fourth step is to archive the final dataset before moving on to step 5, which is indicator generation. Then finally we have step 6, data analysis. Today we are going to give an overview of these six steps. As we talk about data management and analysis, it is important to remember that each household survey is unique. As such, the number of datasets and the variables in each dataset may vary depending on the survey objectives and the survey design. While the general principles of data management and analysis are applicable to all household surveys, the technical process to carry them out will need to be customized for each individual household survey. The survey questionnaires, modules, and variables that we will mention when discussing data management and analysis are typical of household surveys, but there may be differences from one survey to another. Let's talk about the first step, data cleaning. Using a statistical software package of your choice, the following steps should be done during data cleaning: cleaning variable names; cleaning any operational errors, which include, for example, dropping empty observations, dropping test observations, checking for and resolving duplicate household IDs, making corrections to administrative and sub-administrative area information, making corrections to household numbers, and dropping households that should not have been surveyed; labeling variables and value sets; calculating dates as century month codes (CMC) or century day codes; and checking variable formats, meaning string versus numeric, and reformatting where needed.
Recoding variables as needed, for example recoding multi-response variables or recoding variables as binary (0/1) variables, and dropping administrative and photo variables that are not needed in the dataset. Data cleaning also includes the following additional steps, again using a statistical software package of your choice: checking for survey completeness, identifying and removing any duplicate records, identifying incomplete records, checking that skip patterns have been consistently followed, checking that responses are within range based on the questionnaire, and then performing consistency checks. Consistency checks include, for example, checking that the GPS coordinates are within the range of a household's region and district. Once the data cleaning process is complete, we can move to step 2, which is calculating key variables. In determining which key variables you may need to calculate, it's helpful to go back to the analysis plan and to look at which variables were proposed as stratifiers for the analysis. For example, things like wealth, education, age, marital status, literacy, and ethnicity are all potential stratifiers for your analysis. Then you want to determine if the variables in the dataset can be used as is, or if they need some transformation to create the stratifiers needed for your analysis. You can create new variables as needed. For most household surveys, you will need to calculate wealth quintiles to use as stratifiers, and you'll need to collapse levels of education and age groups to create the stratifiers needed for your analysis. You will also need to create variables related to the survey design, including your sampling strata and your survey weights. Let's talk a little bit more in detail about survey weights. You may need to calculate multiple survey weights depending on the survey design and objectives.
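As a quick aside before we look at weights in detail: some of the cleaning checks above, such as resolving duplicate household IDs and range checks, are easy to express in code. This is a minimal Python sketch using pandas, with hypothetical variable names (hh_id, num_rooms) and made-up data, not the actual RADAR variables.

```python
import pandas as pd

# Toy household records. The variable names (hh_id, region, num_rooms) are
# hypothetical stand-ins, not the actual RADAR variables.
households = pd.DataFrame({
    "hh_id":     ["A01", "A02", "A02", "A03", "A04"],
    "region":    ["North", "North", "North", "South", "South"],
    "num_rooms": [2, 3, 3, 99, 1],  # 99 falls outside the questionnaire's range
})

# Check for and resolve duplicate household IDs.
dupes = households[households.duplicated("hh_id", keep=False)]
households = households.drop_duplicates("hh_id", keep="first")

# Range check: responses should fall within the range allowed by the
# questionnaire (here we assume 1-20 rooms is the allowed range).
out_of_range = households[~households["num_rooms"].between(1, 20)]
```

In practice the flagged records would be cross-checked against the paper or electronic questionnaires before being corrected or dropped.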
These may include, for example, household, woman, man, child, or anthropometry weights. The information needed to calculate survey weights includes the cluster and strata variables, the number of households in each cluster, and the proportion of completed interviews for households and individuals, for example for the women's or child questionnaires. The main steps to calculate the household weight variable include the following. First, create the strata variable if applicable, then calculate the probability of sampling the primary sampling unit, or PSU (for example, the cluster), within each stratum. Next, calculate the design weight, which is based on the probability of sampling the households within the PSU. Then calculate the household response variable, or the proportion of household questionnaires completed. Finally, calculate the household weight variable using the probability of sampling the PSU, the design weight, and the household response. The main steps to calculate an individual weight variable, such as for the women's questionnaire, include the following. First, calculate the women's response variable, which is the proportion of women's questionnaires completed. Then calculate the women's weight variable using the household weight variable and the women's response variable. For the RADAR questionnaire, you would need to calculate a household weight as well as woman, man, and child weights if all of those modules were implemented. Once the key variables have been calculated, we can move to Step 3, which is merging the final dataset. Household survey data is generally collected as a number of separate data files that need to be merged. For example, the RADAR household coverage survey data is exported from ODK as eight separate data files. All data files can be linked via unique household and respondent ID variables. For the RADAR survey, ODK auto-generates the unique link variable.
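Stepping back to the survey weights for a moment, the household and women's weight steps described above can be sketched numerically. This is an illustrative Python sketch with made-up numbers; the exact formulas for a real survey depend on its sampling design.

```python
# Illustrative survey-weight calculation for one cluster (PSU), following the
# steps described above. All numbers are made up for the example.

# Probability of sampling this PSU within its stratum (probability-
# proportional-to-size style): clusters sampled in the stratum times the
# cluster's size, over the total households in the stratum.
clusters_sampled_in_stratum = 10
households_in_cluster = 200
households_in_stratum = 8000
p_psu = clusters_sampled_in_stratum * households_in_cluster / households_in_stratum

# Design weight: inverse of the overall probability of sampling a household,
# i.e. P(PSU sampled) * P(household sampled within the PSU).
households_sampled_in_cluster = 25
p_hh_in_psu = households_sampled_in_cluster / households_in_cluster
design_weight = 1 / (p_psu * p_hh_in_psu)

# Household response: proportion of household questionnaires completed.
hh_response = 24 / 25

# Household weight: design weight adjusted for household non-response.
hh_weight = design_weight / hh_response

# Women's weight: household weight adjusted for the women's response rate.
women_response = 30 / 32
women_weight = hh_weight / women_response
```

The same response-rate adjustment would be repeated for the man and child weights where those modules are implemented.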
For other surveys where a unique link variable is not auto-generated, however, it can be created manually by combining the household and respondent identifiers. Why do we need to merge datasets? Well, household data is captured across multiple datasets, and merging these datasets makes data management much easier. For example, in the RADAR questionnaire we have household information in the household dataset; the household characteristics dataset, which includes, for example, assets and water and sanitation information; the household mosquito nets dataset; and the household members under mosquito nets dataset. In addition, the household dataset contains key variables which are required to set up your survey design for analysis across all datasets, and this includes variables such as the cluster, strata, and survey weights. Finally, the household dataset contains additional key variables of interest that are required for disaggregated analysis across all datasets, such as the wealth index. What do we need to merge? We recommend that you merge together all of the household-level datasets. For the RADAR survey that includes the household, household characteristics, household mosquito nets, and household members under mosquito nets datasets, and together these will become the merged household dataset. We also recommend that you merge key variables from this merged household dataset into all other datasets. For the RADAR questionnaire, this would mean merging key variables into the household members, woman, man, and child datasets. The key variables to merge include household information, such as administrative area, sub-administrative area, village, and household ID; the date of the household interview; the wealth index or wealth quintile variable; and variables related to the sample design, such as cluster, stratification, and weights. Now, just a little bit about incomplete interviews.
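Before turning to incomplete interviews, the merge steps just described can be sketched with pandas. The variable names and values here are hypothetical stand-ins, and the real datasets contain many more variables.

```python
import pandas as pd

# Hypothetical household-level datasets, keyed on a unique household ID.
household = pd.DataFrame({
    "hh_id":     [1, 2, 3],
    "cluster":   [10, 10, 11],
    "strata":    ["urban", "urban", "rural"],
    "hh_weight": [1.2, 1.2, 0.9],
})
household_characteristics = pd.DataFrame({
    "hh_id": [1, 2, 3],
    "water_source": ["piped", "well", "piped"],
})

# Step 1: merge all household-level datasets into one merged household dataset.
merged_household = household.merge(household_characteristics, on="hh_id", how="left")

# Step 2: merge the key variables (household ID, sample design variables,
# weights, and so on) from the merged household dataset into each
# individual-level dataset, such as the woman dataset.
woman = pd.DataFrame({
    "hh_id": [1, 1, 3],
    "respondent_id": ["1-1", "1-2", "3-1"],
    "age": [27, 31, 44],
})
key_vars = ["hh_id", "cluster", "strata", "hh_weight"]
woman = woman.merge(merged_household[key_vars], on="hh_id", how="left")
```

After this step, each individual record carries the design variables it needs for a correctly weighted analysis.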
It is important to assess response rates and the proportion of completed questionnaires, both household and individual, and to use this information on the number of complete and incomplete questionnaires to generate the weight variables. Once the weight variables are generated, all incomplete interviews can be removed from the dataset. You do want to make sure to save the raw dataset with the incomplete interviews in case you need to return to this information in the future, but it's important to remember that data analysis should be carried out only on completed questionnaires. When all data cleaning and finalization steps are complete, the dataset should be ready for archiving, distribution, and analysis. Together, the final dataset and the questionnaire should provide all of the information an analyst needs to work with the data. You should now be ready to move on to Step 4, archiving the dataset. The final dataset should be properly archived for future use. The first step in the archiving process is to remove identifiers from the dataset, and this can include names, dates of birth, and displacing or removing GPS coordinates, depending on institutional policies. Next, create a ReadMe file that explains the contents of the dataset, and then you'll likely want to export the final dataset to various file formats, for example, Excel, R, Stata, et cetera. Finally, the last step in the data archiving process is saving the dataset, the ReadMe file, and the questionnaire that was used to collect the data together in a single place. You may want to consider saving the dataset in a publicly accessible repository, but at a minimum, you want the data saved in a shared folder; it shouldn't be saved on a single computer. Now that the data has been archived, we can begin Step 5, indicator generation. To begin the indicator generation process, first, use the RADAR indicators sheets to identify indicators of interest.
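Backing up to the archiving step for a moment, removing identifiers and exporting the de-identified dataset might look like the following sketch. The column and file names here are hypothetical, and whether GPS coordinates are displaced or dropped entirely depends on institutional policy.

```python
import pandas as pd

# Hypothetical final dataset containing identifiers that must be removed
# (or, for GPS coordinates, displaced) before archiving.
final = pd.DataFrame({
    "hh_id": [1, 2],
    "respondent_name": ["Jane", "Ama"],  # identifier: remove
    "gps_lat": [9.05, 9.07],             # remove or displace per policy
    "gps_lon": [7.49, 7.51],
    "anc4": [1, 0],
})

identifiers = ["respondent_name", "gps_lat", "gps_lon"]
archived = final.drop(columns=identifiers)

# Export to the formats analysts are likely to use.
archived.to_csv("final_dataset.csv", index=False)  # readable by Excel and R
# archived.to_stata("final_dataset.dta")           # Stata format, if needed
```

The exported files would then be saved alongside the ReadMe file and the questionnaire in a shared folder or repository.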
The RADAR indicators sheets provide information on each indicator's numerator and denominator. Next, using a statistical software package of your choice, generate indicators using a four-step process. The first step will be to look at the data. The second step will be to generate the indicator accounting for numerator and denominator. The third step will be to examine your indicator and the don't know and non-response values. You'll want to ask yourself: is the share of don't know and missing-due-to-non-response values more than five percent? If yes, don't use this indicator in your analysis. If no, you can continue to use this indicator in your analysis, and you'll want to set the don't know responses to missing to exclude them from the analysis. Finally, the fourth step is to finalize the indicator accounting for your don't know and non-response values. Just to note here, there are alternative options for handling don't know responses: you could also impute a value or assign the modal value, et cetera. For the RADAR questionnaire, the approach presented here is the approach we have taken and the one we recommend. Let's walk through an example of indicator generation using the indicator ANC 4+, or antenatal care four or more visits. Here we have the definition of ANC 4+ from the RADAR indicators sheets, which gives us information on the numerator, denominator, and the questions required to calculate ANC 4+. The numerator is the number of women 15-49 years of age with a live birth in the two years prior to the survey, attended at least four times during pregnancy by any provider for reasons related to the pregnancy. The questions required for calculating the numerator are question CB1, "While you were pregnant with [name of the child], did you see anyone for antenatal care?", as well as question CB3, "How many times did you receive antenatal care when you were pregnant with [name of the child]?"
The denominator is the total number of women aged 15-49 years surveyed with a live birth in the two years prior to the survey, and the question required for calculating the denominator is question FE3, "What is the month and year of your most recent birth?" Keeping in mind the definition of antenatal care four-plus visits and the questions required to calculate it, we're going to start with Step 1, look at the data, by looking at the antenatal care variables in the RADAR questionnaire. First, on the left, we have a tabulation of question CB1, which is, "While you were pregnant, did you see anyone for antenatal care?" We can see that 1,060 out of 1,102 women sought antenatal care during their last pregnancy. On the right we have a tabulation of CB3, which gives us the distribution of the number of ANC visits that women received. If we look at the distribution of visits, we can see that there are three don't know responses, and we'll come back to how to handle these don't know responses in just a few minutes. Now for Step 2, we will generate the ANC 4+ indicator accounting for numerator and denominator. In this example, we're going to be using Stata to generate our indicator. I'm going to walk through the code that we've created to do this so you can see how we generate the indicator accounting for both numerator and denominator. In this example code, first we create a variable called recentbirth to convert the year and month of the most recent birth into a single date variable in CMC (century month code) format. Then we generate a variable called interviewdate and set it equal to the variable HHdoi, which is the date the interview was conducted, and which happens to also be in CMC format. Next, we create a variable called under2 to identify births in the last two years. We set the variable under2 equal to 1 if the interview date minus the recent birth, divided by 12, is less than or equal to 2.
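The CMC date arithmetic just described is easy to reproduce outside of Stata: a century month code counts months since January 1900, so CMC = (year − 1900) × 12 + month. Here is a small Python sketch with illustrative dates, not the survey data.

```python
def to_cmc(year, month):
    """Century month code: months elapsed since January 1900 (Jan 1900 = 1)."""
    return (year - 1900) * 12 + month

# Date of the most recent birth and of the interview (illustrative values).
recent_birth   = to_cmc(2021, 6)   # June 2021
interview_date = to_cmc(2023, 3)   # March 2023

# Flag births in the two years before the interview, mirroring the logic
# described above: (interview CMC - birth CMC) / 12 <= 2 years.
under2 = 1 if (interview_date - recent_birth) / 12 <= 2 else 0
```

Because both dates are simple month counts, subtracting them gives the elapsed months directly, which is why CMC format is convenient for this kind of eligibility check.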
A quick reminder: our denominator is women 15-49 with a live birth in the last two years. This age range, 15-49 years, was an eligibility criterion for the questionnaire, so for the RADAR questionnaire all of the women who responded meet that criterion. But not all women will have had a live birth in the past two years, and we need to make sure we account for that in our indicator calculation, which is why we've gone through these steps thus far. Now we can move on to generating the anc4 indicator. The next line of code says cap drop anc4. This will drop any variable in the dataset called anc4 if it already exists. Next, we generate a variable called anc4 and set it equal to 0 if under2 equals 1. This ensures that only births in the last two years are given a value for the anc4 variable. Next, we replace anc4 and set it equal to 1 if anc4 equals 0, meaning that the birth happened in the last two years; CB1 equals 1, meaning yes, the woman saw someone for ANC; and CB3 is greater than or equal to 4 and not equal to 8 (don't know) or missing, meaning the woman received at least four ANC visits during the pregnancy. Next, we replace anc4 and set it equal to 8 if anc4 equals 0, meaning the birth happened in the last two years, and CB1 equals 8, meaning the woman did not know if she received ANC, or if CB1 equals 1 and CB3 equals 8 or missing, meaning that the woman received ANC but did not know the number of visits. Finally, we label the variable and label the value set. Now that the indicator has been generated, we can tabulate our anc4 indicator with and without the missing values and examine the percentage that the don't know and missing values account for. Remember, if don't know and missing due to non-response account for less than five percent, we can set those don't know and missing responses to missing and exclude them from the analysis.
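The Stata logic described above translates roughly into the following Python sketch. The toy data and variable names (under2, CB1, CB3) mirror the lecture's description, with 8 as the don't-know code as described; this is an illustration of the logic, not the actual RADAR code.

```python
import numpy as np
import pandas as pd

# Toy records: under2 = birth in the last two years; CB1 = saw anyone for ANC
# (1 = yes, 8 = don't know); CB3 = number of ANC visits (8 = don't know, per
# the coding described in the lecture). NaN marks missing / not applicable.
df = pd.DataFrame({
    "under2": [1, 1, 1, 1, 0],
    "CB1":    [1, 1, 1, 8, np.nan],
    "CB3":    [5, 2, 8, np.nan, np.nan],
})

# Step 1: anc4 = 0 only for births in the last two years (the denominator).
df["anc4"] = np.where(df["under2"] == 1, 0.0, np.nan)

# Step 2: anc4 = 1 if the woman saw someone for ANC and had at least four
# visits (excluding the don't-know code).
got4 = (df["anc4"] == 0) & (df["CB1"] == 1) & (df["CB3"] >= 4) & (df["CB3"] != 8)
df.loc[got4, "anc4"] = 1

# Step 3: anc4 = 8 (don't know) if the woman did not know whether she received
# ANC, or received ANC but did not know the number of visits.
dk = (df["anc4"] == 0) & (
    (df["CB1"] == 8) | ((df["CB1"] == 1) & ((df["CB3"] == 8) | df["CB3"].isna()))
)
df.loc[dk, "anc4"] = 8

# Step 4: if don't-know responses account for under five percent, set them to
# missing so they are excluded from the analysis.
df.loc[df["anc4"] == 8, "anc4"] = np.nan
```

In this toy data, only the first two rows end up with a 0/1 value; the don't-know rows and the woman without a recent birth are excluded from the indicator.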
The last line of code you see here replaces the don't know responses with missing, under the assumption that these account for less than five percent of the responses. Now let's move to Step 3, examining our indicator and the don't know and non-response values. We're going to take a look at this anc4 indicator and examine our don't know and non-response values. The first tabulation, on the upper left, is a tabulation of our new anc4 variable. It shows us that our anc4 variable is being calculated among 1,102 women, which aligns with what we learned from our earlier exploration of the data. We can explore that further in the second tabulation, on the bottom left, by tabulating anc4 against the under2 variable we created, to see that there are 1,102 women in the dataset with a child under two. The remaining 1,427 women in the dataset have not had a child in the last two years, and thus we are excluding them from this indicator calculation. Finally, we can investigate the don't know responses further using the tabulation on the right, which tabulates the responses to question CB3, the number of ANC visits, against our new anc4 indicator. We can see here there are three women who did not know how many ANC visits they received, and we've created a don't know category for them in our anc4 variable. We now need to assess whether these don't know responses account for more or less than five percent of our sample of respondents. Looking back at our first tabulation on the upper left, we can see that the don't know responses account for 0.27 percent of respondents, which is less than our five percent threshold. Just to note: everything we've looked at on this slide is an unweighted tabulation. We will address weights a little bit later in the lecture. Now we move to Step 4, finalizing the indicator accounting for the don't know and non-response values.
Based on the information that the don't know and non-response values account for less than five percent, we can continue to include this indicator in our analysis and set the don't knows to missing. You can see in the tabulation here that we now have 1,099 women in our ANC 4+ indicator instead of 1,102, now that the three women who responded don't know to the number of ANC visits have been set to missing. Our indicator construction is now complete, and we can move forward with our analysis. Again, just to note, this is an unweighted tabulation to explore the indicator construction; we will address weights a little bit further on in the lecture when we get to analysis. Now that we have generated indicators, we can move to the last step, which is data analysis. The first step in the data analysis process is generally to develop a tabulation plan. The data tabulation plan provides model tables that set forth the major findings of a survey in a manner that will be useful to policymakers and program managers. It also helps provide guidance concerning the most important indicators that should be presented in the survey report, as well as the level of analysis expected, and it ensures timely dissemination of survey results. Both the DHS and MICS have example tabulation plans for reference if you would like to see what a detailed tabulation plan looks like. The example table provided here is one table from the DHS tabulation plan, as an example of the level of detail specified in a tabulation plan. This example tabulation plan table for ANC lays out how the ANC results will be presented. We can see here that the results will be presented by age at birth, birth order, residence, region, education, and wealth quintile. In addition, the results will be presented by the type of ANC provider, as well as the proportion of women who received no ANC, and ANC by a skilled provider.
The data analysis for your survey should closely follow the tabulation plan to create results tables and figures. The analysis should include tabulations of indicators; cross tabulations by key strata such as geographic area, education, wealth, etc.; and data visualizations to display key findings. More in-depth analyses to assess associations should only be considered after a review of the initial findings. It's important to remember that all analyses must account for the complex survey design. Recall that the survey design for a household survey is generally multi-stage cluster sampling. You may have stratified your sample by urban/rural and/or administrative area, you sampled clusters within the strata, and you sampled households within the clusters. For the woman and child questionnaires, all eligible individuals in sampled households were interviewed. For the man questionnaire, the design can be to interview all eligible individuals in a certain percentage of sampled households, typically 50 percent, depending on the indicators of interest. A quick note on the man's questionnaire: its sampling depends on the survey objectives and variables. Most of the time, the man's variables do not require a large sample size, leading to surveying a proportion of households, typically around 50 percent. However, there are some surveys for which the man's questionnaire applies to all of the households. Depending on how you stratified and how you sampled clusters, different households may have had different probabilities of being sampled. As a result, data analysis must account for the survey design to produce accurate estimates. This includes accounting for survey weights, stratification, and clustering. Why do we need survey weights? Weighting your analysis allows estimates, or indicators, to be representative of your study area. An analysis must be weighted when sampling units do not have the same probability of selection.
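The mechanics of weighting are straightforward: a weighted proportion divides the weight-sum of the responses of interest by the total weight. Here is a toy Python sketch with made-up numbers, purely to show the mechanics, not taken from any real survey.

```python
# Toy example: ORS receipt (1 = yes, 0 = no) with survey weights. The numbers
# are made up to show the mechanics of weighting only.
ors     = [1, 0, 0, 1, 0]
weights = [2.0, 0.5, 0.5, 0.5, 1.0]

# Unweighted proportion: a simple mean, treating every response equally.
unweighted = sum(ors) / len(ors)

# Weighted proportion: each response counts in proportion to its weight.
weighted = sum(o * w for o, w in zip(ors, weights)) / sum(weights)
```

Here the unweighted proportion is 0.40, while the weighted proportion is about 0.56, because the first respondent represents many more households than the others.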
Let's look at an example from the Nigeria 2013 DHS data, which compares unweighted and weighted estimates. On the left, we have an unweighted tabulation showing that the proportion of children with diarrhea receiving ORS is 32.95 percent, whereas on the right, we have a weighted tabulation showing that the proportion of children with diarrhea receiving ORS is 33.71 percent. We can see here that applying the survey weights results in a different point estimate than looking at an unweighted tabulation. In this example the difference is less than one percentage point, but in other cases it can be quite large. Why do we need stratification? Stratification during sampling must be taken into account when calculating variance and confidence intervals around estimates. Stratification reduces variance, and this should be accounted for in the analysis phase. Let's look at an example from the Nigeria 2013 DHS, which compares standard errors with and without the strata specified. On the left, we have a tabulation of the proportion of children with diarrhea receiving ORS without the strata specified, and the standard error is 0.151, whereas on the right we have a tabulation with the strata specified, and the standard error is 0.136. As we can see in this example, standard errors without strata specified are larger than standard errors with strata specified. If you are conducting your analysis using the statistical software Stata, Stata implements a set of commands that will take the survey design into account for you. These are the svy commands. To use the svy commands in Stata, you must first survey-set your data using the svyset command: svyset psu_variable [pweight=weight_variable], strata(strata_variable) singleunit(centered). In this command, the PSU is the primary sampling unit, so this is generally the cluster.
The pweight is the survey weight, which has been calculated based on the sample, and the strata variable is the variable used for stratification in the survey design. Let's look at an example using the svy commands with the same Nigeria 2013 DHS data from earlier in the lecture. Our first step is to survey-set the data, which we can see here in the upper right. Now we can use either svy: tab, on the left, or svy: prop, on the right, to get a point estimate of the proportion of children with diarrhea receiving ORS. The difference between svy: tab and svy: prop is that svy: prop will give you the point estimate along with the standard error and a confidence interval, whereas svy: tab will just give you the point estimate. You can also use the svy commands to cross tabulate variables. In this example we have a cross tabulation of ORS by region. The command for this is svy: tab (or svy: prop) followed by the two variables you want to cross tabulate. Again, you can use the tab command or the prop command, and with the prop command you get a standard error and a confidence interval in addition to the point estimate. In this example, we now have the proportion of children with diarrhea receiving ORS by region, and we can see this varies across regions from a low of 28 percent to a high of 44 percent. Finally, we can use the data from this svy: prop cross tabulation to create a figure that helps interpret the findings. In this figure the height of each bar is our point estimate for the proportion of children with diarrhea receiving ORS in each region, and we have also added error bars to represent the confidence intervals. We ordered the regions in descending order so that it's easier to see which regions have the highest coverage of ORS and which have the lowest. If you're implementing a RADAR survey, there are some resources available to help facilitate the data management and analysis process. Data management and analysis tools for the RADAR survey are available as a series of Stata .do files.
We use Stata because it's relatively straightforward to use, and there is an easier learning curve relative to other software such as R. Provided that the variable names don't change, these .do files can run on outputs from the RADAR ODK questionnaire. In addition to the Stata .do files, there is also accompanying documentation that specifies where country-specific adaptations must be made for the files to run. Some data management processes can be automated, while others are survey-specific and must be adjusted for each survey implementation. All of these resources are available on the RADAR coverage survey web page.