Real-world data is NEVER as clean as the nicely prepared files that data science nuts like to download from Kaggle and the like for projects. But the upside to real-world data is that there is no limit to what is available, as long as you know how to wrangle it.
New data is being produced at rates many consider alarming. And it is happening every single second of every single day. We live in an age where data reigns, and those of us who can wrangle it like a wild bull (not that I could ever imagine wrangling a bull, but data often feels like one) will come out on top instead of drowning underneath it all. This article is an introduction on how to do just that.
These two links contain all of the code for this project, along with many explanations that supplement this article:
For the examples in this article, I chose to pull my data from the FBI's Most Wanted API because of how easy it is for new users to work with.
For a data scientist or data engineer who knows how to utilize them, web APIs are like Willy Wonka's chocolate factory to Augustus Gloop. (And now I wish I had chosen a better metaphor.) What I mean is: for the data-hungry, you can find just about anything you want out there and gobble it up like an Austrian glutton! (Again, I find myself unhappy with my metaphor...)
First, let's look at what APIs are. API stands for "application programming interface". Often APIs are used by developers to build websites with features their clients desire, such as a Google Maps interface. So instead of having to write all the code for such an interface, a developer can use an API to incorporate such functionality.
Some APIs are open and free, whereas others require fees. Most require an API key, which is basically your own personal identification/authentication when you access the API.
Top API Examples (with documentation links):
- Google - 17 different APIs under the umbrella of Google Maps
- IBM Watson Discovery - Contains online search content and a wealth of real-world data
- Amazon ML - Good for understanding customers and predicting customer activity
- Census.gov - Important information about demographics and economics in the US
- Twilio - Communications API that allows messaging and calling capabilities
- Yummly - Information about food and recipes
- Reddit - Offers many read and write capabilities for the social network giant
- Spotify - Access music information from Spotify
- Weather.gov - Weather information
- SkyWatch - Access satellite imagery
- Instagram - Data about Instagram users and posts
- New York Times - Access data provided by the New York Times
- Twitter - Access data about users and posts
- Zillow - Home sale and purchasing information
The main functionality of an API for a data scientist rests in the request-response relationship. When we work with an API, we make a request for data, and the data we receive based on that request is the response. To use this functionality within Python, it is important to become familiar with the Requests library. It is fairly intuitive and simple to use once you have perused the docs and tried a few requests for data.
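A minimal sketch of that request-response cycle with the Requests library, against the FBI endpoint used in this project (the response keys such as "total", "items", and "title" reflect what the API returned at the time of writing; treat them as assumptions):

```python
import requests

# Endpoint for the FBI's Most Wanted API (no API key required at the time of writing)
API_URL = "https://api.fbi.gov/wanted/v1/list"

def get_page(page=1):
    """Make one request; the parsed JSON body is the response we care about."""
    response = requests.get(API_URL, params={"page": page}, timeout=30)
    response.raise_for_status()  # surface any HTTP error instead of parsing bad data
    return response.json()

if __name__ == "__main__":
    data = get_page(1)
    print(data["total"])              # total number of records in the database
    print(data["items"][0]["title"])  # name field of the first record on this page
```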
One of the best things about APIs is that the data is dynamic, for the most part. It is not a snapshot of data saved in a CSV file. It is real-life, living, breathing data. However, that does have its challenges, as you will see when we get to the data cleaning portion of this article.
For Python users, a first interaction with JSON just looks like a slightly odd Python dictionary. But when you look closer, there is often more nested data than one usually finds in a Python dictionary. And that is where data cleaning can get challenging. In the data I will use for our examples today, there is not an extensive amount of deeply nested data, but there is enough to see a thorough example of how this all works.
So let's look at this in code. These first cells are the importing of the data, followed by an example of what raw JSON formatted data looks like, via the example of one record from the API. Throughout the code, I try to thoroughly explain my steps as well, which I supplement with links to the documentation for the functions I use, as well as short explanations of the operation each performs.
Since this data only came in one page at a time, with the equivalent of 20 rows of data per page, I created a function to compile all the data from the FBI Most Wanted API into one dataframe:
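The compiling function can be sketched like this. I have made the page-fetcher an argument so the stacking logic stands on its own; the "total"/"items" response shape matches what the FBI API returns, but treat the details as assumptions rather than the project's exact code:

```python
import pandas as pd

def compile_all_pages(fetch_page):
    """Stack every page of API results into one DataFrame.

    `fetch_page(n)` must return the parsed JSON for page n as a dict with
    "total" (overall record count) and "items" (list of records) keys.
    """
    first = fetch_page(1)
    records = list(first["items"])   # roughly 20 records per page
    total = first["total"]
    page = 2
    while len(records) < total:
        items = fetch_page(page)["items"]
        if not items:                # defensive stop if the API runs dry early
            break
        records.extend(items)
        page += 1
    return pd.DataFrame(records)

# With the real API, this would be driven by something like:
# most_wanted = compile_all_pages(lambda n: requests.get(
#     "https://api.fbi.gov/wanted/v1/list", params={"page": n}).json())
```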
At this point, there is a working Pandas dataframe, and the data cleaning can commence. Of course, it is always a good idea to take a very good look at the data before you begin any cleaning, so you know what you are really dealing with and can plan your cleaning steps to be efficient and require as little backtracking as possible.
A quick look at the most_wanted.info() overview makes it clear that data cleaning will be a fair amount of work. There are many missing values, a good number of columns with nested values, and many columns with wordy text values. So let's get to work cleaning!
The order in which I chose to clean the data was well planned, but as you will see as I go through the process, it was not totally perfect. But again, this is real-world data, and real-world data be like that!
I chose to keep only a small fraction of the data. There were initially 53 columns, but many of them contained values such as the minimum and maximum of identifying factors like an individual's height. For the purposes of this project, I kept only the most useful data. So as you can see in the following cells, I narrowed the data down from 53 cumbersome columns to 12 useful ones:
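As a sketch, the narrowing step is just column selection. The 12 names below are the columns the rest of this article cleans; treat the exact list as a reconstruction from those steps, not the project's verbatim code:

```python
import pandas as pd

# The 12 kept columns (reconstructed from the cleaning steps in this article)
KEEP = [
    "title", "files", "images", "details", "dates_of_birth_used",
    "subjects", "field_offices", "nationality", "sex", "race",
    "warning_message", "reward_text",
]

def narrow_columns(df, keep=KEEP):
    """Drop the ~41 unneeded columns and keep only the ones worth cleaning."""
    return df[keep].copy()
```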
The first column that struck me as odd was the title column, which for most entries contains the name of the individual. It was a good place to start with renaming the column and checking for missing values, because no name = not very useful.
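On a toy frame, that first step looks roughly like this (the new column name, name, is my choice here, not necessarily the one used in the project):

```python
import pandas as pd

# toy stand-in for the real dataframe
most_wanted = pd.DataFrame({"title": ["JOHN DOE", None, "JANE ROE"]})

# rename the column, count nameless records, then drop them
most_wanted = most_wanted.rename(columns={"title": "name"})
missing_names = most_wanted["name"].isna().sum()
most_wanted = most_wanted.dropna(subset=["name"]).reset_index(drop=True)
```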
The next column that I found to be an issue was the files column, which contains the FBI Most Wanted poster for each individual and a label for the language the poster is in. The poster link was buried within a dictionary inside of a list, so it was clear that the first step with this column was extracting the poster path and saving it as the only value to keep. It then also made sense to rename the column poster. Unfortunately, the posters are in PDF format, and I was unable to render them visually in the Jupyter notebook.
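The extraction can be sketched like this; the "url" key and the sample links are illustrative of the shape I found in the data, not guaranteed field names:

```python
import pandas as pd

# toy rows mimicking the API's shape: each `files` value is a list of dicts
most_wanted = pd.DataFrame({
    "files": [
        [{"url": "https://www.fbi.gov/example-poster-1.pdf", "name": "English"}],
        [{"url": "https://www.fbi.gov/example-poster-2.pdf", "name": "English"}],
    ]
})

# dig the first file's URL out of the list-of-dicts, then rename the column
most_wanted["files"] = most_wanted["files"].apply(
    lambda files: files[0]["url"] if files else None
)
most_wanted = most_wanted.rename(columns={"files": "poster"})
```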
The next column that made sense to tackle was the images column, which contained anywhere from a single image to a whole list of images for each individual. I decided to keep only the one main image per individual and save it to an image column. The image, like the poster path mentioned above, was deeply buried in dictionaries and lists. In addition to extracting the image, I wrote a short function to create the HTML needed to display sample images directly in the dataframe.
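A sketch of both steps; the "original" key is the full-size URL field I found in this data (an assumption here), and the commented lines show one common way to render the HTML in a notebook:

```python
import pandas as pd

# toy rows: each `images` value is a list of dicts
most_wanted = pd.DataFrame({
    "images": [
        [{"original": "https://www.fbi.gov/example-a.jpg", "caption": None},
         {"original": "https://www.fbi.gov/example-b.jpg", "caption": None}],
    ]
})

# keep only the first (main) image per record
most_wanted["image"] = most_wanted["images"].apply(
    lambda imgs: imgs[0]["original"] if imgs else None
)

def to_img_tag(url):
    """Wrap a URL in the HTML needed to show a thumbnail in the notebook."""
    return f'<img src="{url}" width="80">'

# In a notebook, something like this renders the thumbnails:
# from IPython.display import HTML
# HTML(most_wanted[["image"]].to_html(escape=False,
#                                     formatters={"image": to_img_tag}))
```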
The next thing I noticed as I worked with the data was that there seemed to be a large number of children in it, as you can see in the images above. Digging a bit deeper, I found that a large number of missing persons are also included in the FBI's Most Wanted. Originally, I was not aware of this facet of the database at all; I thought it would all be individuals accused of horrible criminal activity. So I decided to parse the text in the details column and try to find an accurate count of the records that reflected missing persons rather than criminal activity.
Spoiler alert! I found out later that this could have been done more easily. But we will get to that. At least here, we get a good example of how to use masks for filtering data!
I created a number of masks to filter for data containing phrases that, at first glance, indicated the person was missing rather than wanted for criminal activity. I then extracted all the missing persons found this way into their own dataframe to get a total count. We will see later how this number was not very representative of the actual total.
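Mask-based filtering over the details text looks like this; the phrases below are examples of the kind of indicator I mean, not the project's exact list:

```python
import pandas as pd

most_wanted = pd.DataFrame({
    "details": [
        "John Doe is wanted for his alleged involvement in an armed robbery.",
        "Jane Roe was last seen leaving her school on March 3.",
        "The child went missing from her home and may have been abducted.",
    ]
})

# each mask is a boolean Series flagging rows whose text matches a phrase
last_seen = most_wanted["details"].str.contains("last seen", case=False, na=False)
went_missing = most_wanted["details"].str.contains("missing", case=False, na=False)

# OR the masks together and filter the matches into their own dataframe
missing_persons = most_wanted[last_seen | went_missing]
missing_count = len(missing_persons)
```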
dates_of_birth_used was a bit of a nightmare of a column. Even though I did not use the datetime information from this column in any calculations, in my opinion it is always a good idea to convert dates into a datetime format. The tricky thing here is that each value in the column was a list: some records listed many dates, while many listed just one. But since the values were lists, I could not just send the whole column to pd.to_datetime() and call it good. I had to come up with a better way.
And after spending what felt like an hour just extracting the dates, converting them to datetime objects, and putting them back, I realized this made no sense unless I also separated the records with multiple dates so that each date got its own row. That is when I realized the whole thing could have been taken care of in about 30 seconds, and that is the route I ultimately went with. However, I learned a great deal from my mistakes, and that is always a good thing!
I used the DataFrame.explode() method to expand the list data so that each date gets its own row while keeping the same index. So while individuals who have used more than one birthdate have multiple rows reflecting that, they are still connected as one record by their index. After expanding the dates to their own rows, it was easy to convert the entire column to datetime objects. I also simply HAD to change the name of the column, because, like variable names, column names are the LAST place you want to be unnecessarily wordy:
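The 30-second version, on a toy column (the shortened name birth_date is my choice here; the real rename may differ):

```python
import pandas as pd

most_wanted = pd.DataFrame({
    "dates_of_birth_used": [
        ["January 5, 1970", "January 5, 1971"],  # an individual with two used birthdates
        ["March 12, 1985"],
        None,                                    # no birthdate on record
    ]
})

# explode() gives each listed date its own row but repeats the original
# index, so multi-date individuals stay tied to a single record
exploded = most_wanted.explode("dates_of_birth_used")
exploded["dates_of_birth_used"] = pd.to_datetime(
    exploded["dates_of_birth_used"], errors="coerce"  # missing values become NaT
)
exploded = exploded.rename(columns={"dates_of_birth_used": "birth_date"})
```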
This is where I realized there was an easier way to get the total number of missing persons. The subjects column contains the category for each record. Many of the categories are types of offenses, but the records for missing persons are clearly labeled here. So I cleaned up this column in a manner similar to the birth dates, since some records contained more than one category, and I renamed the column "categories" to more clearly reflect the data it contains:
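Sketched on a toy column (the category strings are examples in the style the FBI uses; the labels in the live data may differ):

```python
import pandas as pd

most_wanted = pd.DataFrame({
    "subjects": [
        ["Kidnappings and Missing Persons"],
        ["Violent Crimes - Murders"],
        ["Kidnappings and Missing Persons", "ViCAP Missing Persons"],
    ]
})

# same explode-and-rename pattern as the birthdates
categories = most_wanted.explode("subjects").rename(columns={"subjects": "categories"})

# the missing-person total is now a simple category filter; counting unique
# index values keeps a record with two matching categories from counting twice
is_missing = categories["categories"].str.contains("Missing Persons", na=False)
missing_total = categories[is_missing].index.nunique()
```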
One column whose very unfortunate, yet oddly uniform, format I could not comprehend was the field_offices column. The names of the FBI field offices are all lowercase, and multi-word city names are run together as one word, which makes them very annoying to read. Again, each record contained this data as a list. So to cover all bases for the records listing more than one field office, I used DataFrame.explode(). I then capitalized the city names and created a decoding dictionary to give the multi-word city names their correct spelling:
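In miniature, with a two-entry decoding dictionary (the real project needs an entry for every run-together office name):

```python
import pandas as pd

most_wanted = pd.DataFrame({
    "field_offices": [["losangeles"], ["newyork", "miami"], None]
})

# one office per row, same index-preserving explode as before
offices = most_wanted.explode("field_offices")

# decoding dictionary for the run-together multi-word city names
decode = {"Losangeles": "Los Angeles", "Newyork": "New York"}

# capitalize first, then swap in the corrected multi-word names
offices["field_offices"] = (
    offices["field_offices"].str.capitalize().replace(decode)
)
```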
The nationality column was the first one that was nice and clean. So I just did a bit of data visualization and went along my merry way.
The sex column was not too bad, as there is very little that can go wrong there. However, some records were left blank in this column, while others were listed as None. So I decided to change all blank and None values to NaN, since that is an easier format to work with inside of Pandas.
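A minimal sketch of that cleanup, assuming the blanks arrive as empty strings and the Nones as the literal string "None":

```python
import numpy as np
import pandas as pd

most_wanted = pd.DataFrame({"sex": ["Male", "", "Female", "None", None]})

# map both empty strings and the string "None" to a real NaN, which pandas
# then treats uniformly in isna(), dropna(), value_counts(), and so on
most_wanted["sex"] = most_wanted["sex"].replace({"": np.nan, "None": np.nan})
```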
The race column was a total hot mess. There were so many unique values and descriptions of races that very little data analysis could be done with such labels. So I created a variety of masks that I applied to the data via the mask() method and relabeled the entries as best I could, taking into account the descriptions given. I tried to apply the masks in an order that led to the most accurate labeling, which was a challenge. But given the state of the data as it came in, I must say I think I did the best I could.
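The mask-and-relabel pattern in miniature; Series.mask() overwrites the rows where the condition is True, and the patterns and labels below are illustrative, not the project's actual set:

```python
import pandas as pd

race = pd.Series([
    "white (hispanic)",
    "black",
    "white, of eastern european descent",
    "asian",
])

# apply the most specific mask first, so "white (hispanic)" is relabeled
# before the broader "white" mask can claim it
hispanic = race.str.contains("hispanic", case=False, na=False)
race = race.mask(hispanic, "Hispanic")

white = race.str.contains("white", case=False, na=False)
race = race.mask(white, "White")
```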
I must reiterate: I do not believe the race column adequately reflects the racial demographics of the data. There are many records that clearly fit into another demographic that are just labeled as "white", for example. So this data is fairly skewed at best. The only way to make it more accurate would be to go through the database and label individuals based on their names and images. And while I am all for accuracy, I did not feel it a good idea to take on that moral responsibility, nor did I really want to spend the time doing so. The dataset became fairly depressing to work with pretty early on. So I decided to accept what I had done so far with this column. If I were to use this data more extensively and for more consequential purposes, I definitely would have gone through and cleaned this column further.
The warning_message column was the next one I decided to tackle and filter with masks. There were a great many unique values describing the individuals, but after reading through a couple hundred or so, I noticed trends that helped me create masks to filter and relabel by. I again applied the masks in an order meant to produce the most accurate representation of the data, while labeling in a way that makes the data more usable for analysis.
The reward_text column truly contained fairly unique values. So I decided to do no masking or filtering and instead created a representation of the distribution of the data via a WordCloud, showing the 150 words that appear most frequently across the entire corpus of the reward_text column. Most of the entries include fairly detailed information about the severity of the case and the reward offered for helping apprehend the individual. But some words that appeared in the word cloud still surprised me.
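The idea can be sketched with standard-library frequency counting, with the drawing call from the third-party wordcloud package shown as a comment; the sample texts below are made up in the style of the column:

```python
from collections import Counter

# stand-in entries in the style of the reward_text column
reward_texts = [
    "The FBI is offering a reward of up to $100,000 for information",
    "Reward of up to $50,000 for information leading to the arrest",
]

# join everything into one corpus, then take the 150 most frequent words
corpus = " ".join(reward_texts).lower()
top_words = Counter(corpus.split()).most_common(150)

# the cloud itself comes from the wordcloud package, roughly:
# from wordcloud import WordCloud
# cloud = WordCloud(max_words=150, width=800, height=400).generate(corpus)
# cloud.to_image()  # or plt.imshow(cloud); plt.axis("off")
```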
And lastly, I created a simple visual representation of a random sample of some of the "most dangerous" individuals, exhibiting the difference in the data from import to post-cleaning. It is always surprising how far some hard work goes. Maybe it should not be that surprising, given how tough data wrangling sometimes can be. But let's just say I was impressed and pleased nonetheless!
Thank you for reading and following this data journey! I hope you found something new, something inspiring, or something interesting. Check back for more soon, and happy data hunting!