For the second part of my cleaning data series, I’ll go over some simple techniques used to clean data in Python. Python is a powerful tool that constantly amazes me, both in its ability and simplicity. Reviewing data with it can go a long way once you know it.
First, how can Python help you look at your data? Some simple functions you can use on your data set to diagnose the data are:
- .head() — which will let you take a look at the first 5 lines of a table to get a view of how the table is laid out.
- .info() — will give you a concise summary of the data frame.
- .describe() — is used to generate descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
- .columns — will give you a breakdown of the columns in the table.
- .value_counts() — will numerically count the values in a row. This is great when you want to know how many times something comes up.
- .dtypes — to see what kind of property make up a column. Be it a string, integer, float, or object.
Python also has the ability to open multiple files at once using the glob module, and glob function. Once a glob has opened the files, I can loop through the variable and append it to a list, and from there use the panda’s - .concat() function to combine the list into one large table.
What about those times when we need to reevaluate our table, and the way it looks? Change some of the columns into variables. That’s when panda’s .melt() function enters stage right.
new_table = pd.melt(frame= “old_table”, id_vars=[‘columns’, ‘to’, ‘stay’, ‘the’, ‘same’], value_vars=[‘columns’, to_change’], value_name= “new column”, var_name= “column name for value_vars”)
It’s a bit confusing at first but a few practices can really display the power of this function.
Often you’ll find a lot of duplicates in a raw table. Python makes it really easy to identify these as well as to remove them using the: .duplicated() & .drop_duplicates() function.
.drop_duplicates() will only work if every cell in a row is the same so sometimes you may want to specify which row to drop by using the subset attribute — .drop_duplicates(subset=[‘row1’]).
At times there may be columns with multiple types of information in them that you want to separate, like a column with gender and age (M18 or F19). A way to do this would be to grab the string using the .str function and split it by the index.
Say the column was formatted like ‘Male_18’ and ‘Female_19’. We’d want to split our string by the ‘_’. Please enter the .split() function. Inside the split function, enter the place where a character should be split by – be it an underline, dash, or space — .split(‘_’).
A great way to grab any strings you may have split into a series object is the .get() function. E.g. info.str.get(0).
There are times when an overzealous employee enters too much data in a column (guilty as charged). And we need to change the string ‘$280.00’ to an integer using the pd.to_numeric() function. But first, we have to get rid of that pesky dollar sign. Who’s that entering stage right now? Why it’s the .replace() function!
The replace function takes three parameters: first the object we wish to replace. The second with what we will replace it with, and a regex attribute set to True.
Lastly, I want to talk about what to do with missing values in a table. To first see what is missing, use the .isnull() function which will tell you if you indeed have anything missing. To locate where you have values missing, attach .values.any() to the end of it, or .sum() to count the missing values.
The .dropna() function will drop all rows missing a value, but if that’s too broad we can ask the function to drop only rows that have certain columns missing information using the subset attribute. For example:
new_df=old_df.dropna(subset=[‘column1’]).
To fill in these missing datas with say the average of all sales use the .fillna() function. For example:
bill_df = bill_df.fillna(value={‘bill’: bill_df.bill.mean()})
Genius!
So as you can see, a lot of programmers have run into the problems you’re just beginning to encounter. And there are so many great tools to choose from. These are just some of the essentials that’ll take you a long way.