Cleaning up the data is a key step in the data analysis process. In almost all the job interviews I’ve landed, data quality has been mentioned as an important component of the position. A mistake in the data set can certainly ruin a research project, and the data we get isn’t always in the ideal format. Here are a few tips I’ve gathered while learning about cleaning up data in Excel. These are especially relevant for text data, but can be helpful for numerical data as well.

  • Before you clean your data, be sure to save a copy so you can easily reverse any unintended changes.
  • Be careful when importing data. For instance, setting the column data format to “general” will delete any leading zeros. If the zeros in front are important, such as in a zip code, make sure the data is under the “text” format instead.
  • Sometimes text is a garbled mess, difficult to read and in the wrong location. Information may not look the same across cells, with one column with text to the right and another with a number to the left. In these cases, you can use the format painter to ensure all your cells have the correct format. For example, format text to be centered and the appropriate size and font, and then easily apply that format to all your cells.
  • Random white space and invalid characters can be a problem. The function TRIM removes extra spaces, while CLEAN removes all non-printable characters from text. PROPER will make the first letter in each word uppercase. UPPER or LOWER are also options, which will make all the letters upper or lower case, respectively. Combining these, you can use the function =PROPER(CLEAN(TRIM(cell reference))) to fix up the text in a cell. Be sure to copy and special paste your new column as values (rather than copying the function).
  • If you have multiple items in one cell, such as a list, you can separate this out into multiple columns using text to columns. Alternatively, if you need to combine items in separate cells, you can utilize CONCATENATE. For example,  =CONCATENATE(A1&” “,A2&” “) will combine two different cells, adding spaces between the words.
  • Delete what you don’t need, such as blank rows or columns. Consider what you’ll do about missing data. You can easily highlight missing data cells with conditional formatting, setting a style where the cell value is equal to ISBLANK().
  • To prevent future data entry mistakes and invalid data, you can employ data validation to restrict the type of data that can be entered.

Further Reading:

School of Data: Using a spreadsheet to clean up a dataset

Knight Digital Media Center: Intro to Cleaning Data

Breaking Into Wall Street: Cleaning Up Data (VIDEO, ~14 min)

Tips for Cleaning Up Data in Excel

Leave a Reply

Your email address will not be published.