4.1 Data Cleaning Overview
4.1.1 Analysis Development
All data cleaning should be done in a single notebook that you clarify and expand over time. Use `dplyr` and `janitor` for data cleaning in short, well-organized pipelines. Your final data set should be stored in a `.csv` file in the `data/` folder along with the original raw data.
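As a rough illustration of that workflow, here is a minimal sketch of a short cleaning pipeline that ends by writing a `.csv` file into `data/`. The raw data frame, its column names, and the output file name `survey_clean.csv` are all hypothetical; in your own notebook the raw data would be read from a file rather than typed in.

```r
library(dplyr)
library(janitor)

# Hypothetical raw data standing in for a file you would normally read
# from data/ (for example with read.csv()). All names are invented.
raw <- data.frame(
  `Respondent Age`   = c(34, 51, 27),
  `Household Income` = c(42000, 58000, 39000),
  check.names = FALSE
)

# One short pipeline: standardize the names, then shorten them
clean <- raw %>%
  clean_names() %>%
  rename(age = respondent_age, income = household_income)

# Create data/ if it does not exist yet, then store the cleaned data there
dir.create("data", showWarnings = FALSE)
write.csv(clean, file.path("data", "survey_clean.csv"), row.names = FALSE)
```

Keeping each pipeline this short makes it easier to expand the notebook later without losing track of what each step does.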
4.1.2 General Approach
The amount of data cleaning needed will vary significantly based on the data set you are using and the measures you have selected. Thus there are no “one size fits all” instructions for cleaning your data. In general, you will want to focus on making sure a number of criteria are met:
- Variables should have clear, intuitive names.
- All missing data have been recoded to `NA` values (if they were not already coded that way). You will need to refer to your data set’s code book to determine how missing data are handled and in which variables they may be a concern.
- Categorical variables that have numerous categories should be checked carefully for categories that have only a few observations, and those should (generally speaking) be folded into an `other` category.
- All categorical variables should be stored in two ways: as a single categorical variable and as a series of “dummy” logical variables. Imagine a categorical measure for race that has three categories: `(1) white`, `(2) black`, and `(3) asian`. Your final cleaned data should contain this variable as well as a variable named `white` that is `TRUE` if the respondent is white and `FALSE` if they are not. Your data should have similar logical variables for `black` and `asian` as well. These measures become relevant when we start calculating difference of means and regression lines.
- If your data set does not include identification numbers for each row, you will need to create them to facilitate assumption checks for your regression models.
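
The criteria above can be sketched in a single `dplyr` pipeline. This is only an illustration under stated assumptions: the data frame, the `-9` missing-data code, and the cutoff of fewer than two observations for a "rare" category are all invented here. Your code book and your own judgment determine the real values.

```r
library(dplyr)

# Hypothetical survey data; variable names and the -9 missing code are
# assumptions made for this example only.
survey <- data.frame(
  race   = c("white", "black", "asian", "white",
             "black", "asian", "white", "pacific islander"),
  income = c(42000, -9, 58000, 39000, 61000, -9, 47000, 52000),
  stringsAsFactors = FALSE
)

clean <- survey %>%
  # recode the (assumed) missing-data code to NA
  mutate(income = na_if(income, -9)) %>%
  # count observations per category, then fold rare ones into "other"
  add_count(race, name = "n_race") %>%
  mutate(race = if_else(n_race < 2, "other", race)) %>%
  select(-n_race) %>%
  # keep the categorical variable and add logical "dummy" variables
  mutate(
    white = race == "white",
    black = race == "black",
    asian = race == "asian"
  ) %>%
  # create a row identifier, since this data set does not include one
  mutate(id = row_number())
```

Calling `count(clean, race)` afterwards is a quick way to confirm that the rare category was folded into `other` and that the remaining categories are unchanged.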