
Working with Pandas in Python has given me a totally different view of data…

I come from a PHP/MySQL, HTML-forms type of development background. Back in those days, before mobile phones, SQL injections, and Facebook ruined the Interwebs (j/k…), we would often create forms that captured very specific data. We would then stick that info into the appropriate slots in the database, and life was good.

Lately I have been working with large data sets. It turns out that when you get data out in the wild, it can be messy. Even dirty. It seems to me that a lot of one's time in Pandas is spent cleaning up data. In my experience, cleaning up dirty data is as much an art as it is a science. Yes, you have to know and understand the appropriate commands, but how you go about it is completely wide open.

If you are anything like me, you might want to avoid some headaches and change all of the column names to lowercase…

df.columns = df.columns.str.lower()
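Here's a quick, self-contained sketch of what that looks like, using a little made-up frame (the column names are just for illustration):

```python
import pandas as pd

# A tiny, hypothetical frame with inconsistently cased headers.
df = pd.DataFrame({"Name": ["Ada"], "AGE": [36]})

# Lowercase every column label in one shot.
df.columns = df.columns.str.lower()

print(list(df.columns))  # ['name', 'age']
```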

 

Most of the time I like to get rid of all the non-ASCII characters. Here I strip them out entirely, replacing each run of them with an empty string:

df.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
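To see it in action, here's a toy column (the values are made up) containing an accented character and a pair of smart quotes, all of which fall outside the ASCII range:

```python
import pandas as pd

# Hypothetical column with an accented character and smart quotes.
df = pd.DataFrame({"note": ["café", "\u201cquoted\u201d"]})

# Each run of non-ASCII characters becomes an empty string.
df.replace({r"[^\x00-\x7F]+": ""}, regex=True, inplace=True)

print(df["note"].tolist())  # ['caf', 'quoted']
```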

 

Or if you have a particular character you want to trim off the ends of all the entries in a column (note that str.strip only removes leading and trailing characters; use str.replace if you need to remove it everywhere in the string):

df['ColumnName'] = df['ColumnName'].str.strip('%')
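For instance, with a made-up column of percentage strings, stripping the trailing '%' looks like this:

```python
import pandas as pd

# Hypothetical column of percentage strings.
df = pd.DataFrame({"growth": ["5%", "12%", "7%"]})

# str.strip('%') trims the character from each string's ends only.
df["growth"] = df["growth"].str.strip("%")

print(df["growth"].tolist())  # ['5', '12', '7']
```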

 

Most of the time I'll drop some columns that I don't need. Keep in mind that drop() returns a new DataFrame unless you pass inplace=True, so assign the result back. Here is how I drop one column, several, or a range of columns:

df = df.drop('column_name', axis=1)          # drop a single column by name
df = df.drop(df.columns[[6, 7, 8]], axis=1)  # drop several columns by position
df = df.drop(df.columns[5:14], axis=1)       # drop a range (slice) of columns
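A miniature, runnable version of the same moves, on an invented four-column frame:

```python
import pandas as pd

# Hypothetical frame with four throwaway columns.
df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})

# drop() returns a new frame, so capture the result.
df = df.drop("d", axis=1)                 # drop one column by name
df = df.drop(df.columns[[1, 2]], axis=1)  # drop columns by position

print(list(df.columns))  # ['a']
```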

 

Addresses are tricky, dirty data items, no? Sometimes you will get an address all broken up and need all the info in a single column. In this example I'm grabbing the separate city, state, and country columns, sticking a comma between each item, and creating my revised address column:

df["address"] = df["address"] + "," + df["city"] + "," + df["state"] + "," + df["country"]
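Here's that concatenation on a small invented record (all values are placeholders), showing that the string operations run across the whole column at once:

```python
import pandas as pd

# Hypothetical broken-up address record.
df = pd.DataFrame({
    "address": ["1 Main St"],
    "city": ["Tallahassee"],
    "state": ["FL"],
    "country": ["USA"],
})

# String concatenation is vectorized over every row.
df["address"] = df["address"] + "," + df["city"] + "," + df["state"] + "," + df["country"]

print(df["address"].iloc[0])  # 1 Main St,Tallahassee,FL,USA
```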

 

It turns out that the opposite is much trickier, but for a two-part value like "Tallahassee, FL" it works something like this:

df['city'], df['state'] = zip(*df['address'].apply(lambda x: x.split(', ', 1)))

or

df['city'] = df['address'].apply(lambda address: address.split(',')[0])  # index 0 grabs the FIRST item in the list
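If the address has more than two pieces, str.split with expand=True is often the tidiest route, since it hands back one new column per piece. A sketch, assuming comma-separated addresses like the one built above:

```python
import pandas as pd

# Hypothetical combined address column.
df = pd.DataFrame({"address": ["1 Main St,Tallahassee,FL,USA"]})

# expand=True returns a DataFrame with one column per split piece.
parts = df["address"].str.split(",", expand=True)
df["city"] = parts[1]
df["state"] = parts[2]

print(df["city"].iloc[0], df["state"].iloc[0])  # Tallahassee FL
```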

 

In a similar vein, what if you wanted to separate a timestamp into some of its component parts? (This assumes the column already holds real datetimes; run it through pd.to_datetime first if it doesn't.)

df['hour'] = df['timeStamp'].dt.hour
df['month'] = df['timeStamp'].dt.month
df['day'] = df['timeStamp'].dt.dayofweek
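End to end, with a single made-up timestamp string, that conversion-then-extraction looks like:

```python
import pandas as pd

# Hypothetical raw timestamp, stored as a string.
df = pd.DataFrame({"timeStamp": ["2015-12-10 17:40:00"]})

# Convert to a real datetime dtype before extracting parts.
df["timeStamp"] = pd.to_datetime(df["timeStamp"])
df["hour"] = df["timeStamp"].dt.hour
df["month"] = df["timeStamp"].dt.month
df["day"] = df["timeStamp"].dt.dayofweek  # Monday == 0

print(df["hour"].iloc[0], df["month"].iloc[0], df["day"].iloc[0])
```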

 

One of the biggest headaches in dirty data is missing values. Let's say you want to clean up missing values in multiple columns. You could of course do something generic, à la:

df.address = df.address.fillna('')

 

Or you could do something awesome like this:

df = df.fillna({
    'address': 'missing',
    'phone': '999-99-999',
    'age': 99,
    'income': 999999
})
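A runnable miniature of the same idea, on an invented frame with holes in two columns (the defaults are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two columns.
df = pd.DataFrame({
    "address": [np.nan, "1 Main St"],
    "age": [np.nan, 30.0],
})

# One fillna call, a different default per column.
df = df.fillna({"address": "missing", "age": 99})

print(df["address"].iloc[0], df["age"].iloc[0])  # missing 99.0
```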

 

Filling the numeric columns with actual numbers, rather than quoted strings, saves you from ending up with text in a field that is supposed to contain a number. You can also head off type problems when you first import your dirty data into Pandas:

df = pd.read_csv('black_book.csv', dtype={'age': int})

That little addition to your import tells Pandas to ensure the age column contains integers (though it will raise an error if the column has missing values). But if something does slip by and you need to change the data type of a column, no worries:

df[column_name] = pd.to_numeric(df[column_name])
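When the column contains junk that won't parse, pd.to_numeric raises by default; passing errors='coerce' turns the unparseable entries into NaN instead, which you can then fill or drop. A small sketch with invented values:

```python
import pandas as pd

# Hypothetical column of ages, polluted with a non-numeric entry.
df = pd.DataFrame({"age": ["21", "35", "unknown"]})

# errors='coerce' converts unparseable entries to NaN instead of raising.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

print(df["age"].tolist())  # [21.0, 35.0, nan]
```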

 

And while we are talking about files, this is as good a time as any to tell you to save(as) your cleaned up data:

df.to_csv('cleaned-up-file.csv', encoding='utf-8')
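One small extra I usually throw in is index=False, so the row index doesn't end up as a stray "Unnamed: 0" column when the file is read back in later. A round-trip sketch with made-up data, written to a temp directory:

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned-up frame.
df = pd.DataFrame({"name": ["Ada"], "age": [36]})

# index=False keeps the row index out of the file.
path = os.path.join(tempfile.gettempdir(), "cleaned-up-file.csv")
df.to_csv(path, index=False, encoding="utf-8")

# Reading it back gives the same two columns, no extras.
back = pd.read_csv(path)
print(list(back.columns))  # ['name', 'age']
```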

 

Whew! And all that is just to get me started trying to massage some data into telling me its secrets…

If you don’t want to do all this but still want to learn the secrets in your data, let me know and we can discuss how I can help you clean your data.