(To view the complete notebooks related to this project, please visit my resources page, and navigate to the 12 Advanced Projects: Data Science with Pandas section.)
There is A LOT to Pandas, and because the possibilities are so vast when working with this essential library, sometimes it is hard to know the very best way to go about executing a particular task. So I would like to share some of the most useful practices I have found in my own experience.
To do this, I will be utilizing the TMDB movies dataset, which includes 44,691 films from between the years 1884 to 2018. This dataset is excellent for this purpose because 1) it is interesting to just about everyone 2) it contains a wide variety of data and types 3) there are many opportunities to manipulate the data and provide examples of the extremely useful functionality of Pandas.
- Initial Data Inspection
- Feature Engineering
- Creating and Applying Masks and Filters
- Aggregating Data
- Essential Functions and Chaining
Initial Data Inspection:
Starting a new project generally goes the same way almost every time: import the data, see how much needs to be cleaned up and decipher the best method for doing so, check out what data types are involved, and see if any initial correlations can be derived from the data. And of course within each of these are many sub-tasks. So a project will generally start with:
df = pd.read_csv('filename.csv', parse_dates = ['date_column])- import the data and parse any dates so you have a clean datetime to work with
df.info()- get a look at the columns, the datatypes, and a rough idea of missing values
df.describe()- get a general idea of the distribution of the data. I also like to use
describe()specifically on columns that I think will be most important:
So here is what those steps look like in this particular project:
Of course, it is always a good idea to then have a look at a few rows of data, just to get a mental image of the beast you are working to tame. I generally only get 2 or 3 rows, because I know that it will not be long before I am having to scroll back and forth through my notebook. And I do not like having to scroll any farther than absolutely necessary. So I only print data to screen that I actually need to make my work the most effective and efficient moving forward or that is necessary to communicate the important aspects of the data depending on the goal of a project. This way, it is easy to always be able to go back and see exactly how the columns are configured, etc.
To get a more clear idea of the overall layout of the data, I find it invaluable to plot the various features early on. This definitely can save time and help guide a project in the right direction from the very beginning:
I also have a habit of taking each seemingly important feature and using
.value_counts() to get a better idea of the overall contents of a column. This helps me have an idea in the back of my mind of how exactly the values of any given column are distributed. Knowing how many different values you are working with and the general counts for each can be very useful in moving forward with steps such as feature engineering and data cleanup.
With this dataset, there are a great deal of
NaN values. But since this one is not going into a machine learning model or undergoing a vast number of numeric computations, it was not a very big problem. Pandas does an amazing job of accommodating
NaN values, which makes it possible, for example, in this case to have movies going all the way back to 1884, when very little of the data we keep track of today was tracked, and therefor their rows contain
Feature Engineering: top
This is one of my favorite parts of working with data. By this point in the process, having followed the steps above, I generally have an idea of a few features, combinations of existing features, I want to create. In this dataset, it was clear that a profit column would make sense, as well as a return on investment column. The original data provides the budget and overall revenue for each film, so these two new features are not challenging at all to create:
As I worked with the data further, I also found it useful to create category columns for some columns of continuous data, such as the
release_date column, which I expanded and created a
decade column for further data investigation. This, however, like a number of other engineered features, required some custom functions to get just right:
I also found that it was important to create a new categorical column for the
runtime column, in order to do adequate data analysis on the films' runtimes. This is often when I find surprises in the data. For example, I never would have guessed that films in the "long" runtime category would necessarily be more popular than the "average" runtime films. But the data doesn't lie!
One aspect I associate with feature engineering is feature decoding. For example, in this dataset, the original languages for the films were encoded rather than the language being listed outright. And that was not useful when I was working with the feature of original language, so I created another custom function to clean up the language values for the top 10 languages by number of films in each. First, I had to find the top ten languages by getting the value counts and selecting the top ten with
.head(10). Then I had to create a dictionary with keys being the encoded languages and values being the actual language names. This was a little tricky in a couple of spots, because I had no idea that 'zh' was an abbreviation for Chinese, for example:
Creating and Applying Masks and Filters: top
Masking / filtering is essential when working with data, especially when you want to really dig into the details. For example, if I wanted to get a subset from the giant dataset of films that either fit into the action or thriller genre, that are in English, and have ratings of at least 7.5 stars ordered from most recently released, there is really no simpler way to do so. So I create a mask for each specification and then filter based on the combination of all required masks for the search:
Using masks and filtering, finding a needle in haystack is really no big deal. Another example is if I wanted to narrow down the films to only include films that are Disney / Pixar productions, between the years 2010 and 2015, sorted from highest revenue to lowest, I can just make the following masks and filter:
Aggregating Data: top
Using the Pandas
.agg() function is by far one of the most powerful tools when working with data in Python and Pandas. It allows a user to not only narrow in on important data but to also quickly and efficiently make important calculations based on that narrowed down data, which is accurate and thereby actionable.
For example, the following cell contains an aggregation based on the genres of the films. From this aggregation, I can further investigate the total revenue, average revenue, return on investment, average rating, and popularity all centered around the genre feature of the films. This can make plotting data even more powerful! I choose to create entirely new dataframes with their own new column headings with my aggregations so that I can more easily work with the data. But this is not necessary with smaller aggregation.
Now that all of that data has been aggregated in this manner, I can easily plot based on each aggregation contained:
You can view more aggregations like this in the full notebooks in the 12 Advanced Projects: Data Science with Pandas section of my resources page.
Essential Functions and Chaining: top
Pandas has some of the most wonderfully useful functions and chains of functions a user can find in Python. It is the NumPy for tabular data. Here are some of my most valued functions and descriptions thereof. Be sure to check out the official documentation from Pandas as well. I have included links to each.
df.loc()(docs) - this function takes labels for data and gives you back the rows of information based thereupon. It is incredibly useful when working with masks and filtering, for example, because it is the function a user feeds the masks to in order to acquire the filtered data. For example, in the following cell, I use it to locate films where the genre column contains the label "Horror". The chained functions that follow will be discussed below. (Note: before using
.loc()I convert any NaN values to strings so that the data is searchable and does not throw errors.)
df.column.str.contains()(docs) - this chain is incredibly useful in dealing with textual data like this TMDB, where a number of columns contain multiple values, such as the genre column, which contains each genre the film is considered to fit into, or the actors column which contains a series of all the main actors in the film. So this allows a user to search within that data and not be required to locate an exact field that matches the query or have to unpack data and make a dataframe more lengthy. In the cell above, I am searching in the movies dataframe for locations where the genres column of string data the specific string "Horror" somewhere within it.
df.groupby()(docs) - this is a function that is NECESSARY when working with large amounts of data. For many function chains, it is the first step, kind of like the stage upon which a user can create a query, narrowing huge amounts of data down to very specific values and information. For example, in the cell above, after using
.loc()to get the data that contains the genre "Horror", I group the resulting data by the "director" column, which is followed by further chaining described below.
df.groupby().agg()(docs) - This chain is VERY useful in creating narrowed-down aggregations of data. For example, in the code below I am creating a subset of data centered upon the production companies that produce the films, so that I can find more meaningful information on that facet of the data. I should mention that before doing so, because a film very often lists more than one production company, I took the column of production companies and extracted each individual company, creating a new row representing each company and their relation to each movie in which they were involved. After creating that
productiondataframe, in this cell, I am creating the aggregated production data that contains the mean and summed values for a number of different features. I cannot emphasize enough how important these steps are in fully understanding the data. For example, one might think that just because Warner Bros. is number one for the number of films it has produced, as well as the total revenue from films, that it would be the number one production company for average revenue by film. But it did not even make the top ten in that category. So there is always much to be learned from this particular aggregation process.
df.datetime.between()(docs) - The function is incredibly useful when working with datetime data, because it allows the user to pass two strings that represent datetimes and returns the data that falls between the two. The datetime data can be passed in a vast variety of ways. For example, May 1, 2015 can be passed as
2015 May 1or
2015/05/01, etc. In the cells below, which you have seen earlier in this article, I am filtering the data to the Disney / Pixar films between the years 2010 and 2015:
df.nlargest()(docs) - This function is incredibly useful with aggregated data to get the top data from within the aggregation. For example, in the following code, from the aggregated actors data, I am getting the top 10 actors who have participated in the highest number of films:
There are many more useful functions in Pandas that I would love to include, but these are the main ones I found most helpful working with this dataset. Check back for more! I will be sharing all sorts of great tips and tricks for working with Pandas, complete with explanations and code!
While this is a bit of a tangent from the main Pandas discussion, one of my favorite parts of this project was taking all of the words contained in the titles of the films and making a word cloud out of it using the WordCloud module. It takes a bit of prep on the data to do so, but the results can be very telling. A user specifies the number of top words of all the words passed that they want returned, as well as a few other parameters for how they would like the returned image to look, and it gives you a very nice image of the top words present with the more frequent words printed larger. The following code shows how I did this for the titles, the taglines, and the overviews of the for the films. The same few most frequent words seem to be ever-present, as you will see below:
Another slightly entertaining Pandas tangent as we reach the end here, is the
to_html() function, which you have already seen some examples of above. This is the function that allows me to display the poster images for the films. In the code and example below I have extracted the highest rated movies out of the most popular genres. First, I display them in tabular form, and following that, I display them with their images. Because the
to_html() function does not work within another function, i.e., I could not loop through the genres and display the top films by genre for each, I concatenated the data into a large dataframe containing all of the top films for all of the top genres, separated by a small heading that specifies which genre the data directly following represents:
So that is all for now! Again, keep checking back. As I complete more projects, I will be sharing my work and explaining the processes I used in each. Have fun! Code well! And hug a Panda!