In the world of big data, it is inefficient to work with individual, large files containing the entirety of our data; they become slow and unmanageable. It is often more efficient and economical to store large amounts of data in a long-term, relational database, such as an SQL database, and retrieve only the data we need, when we need it. Such databases are composed of numerous tables of information that connect to one another on a common field, such as a unique id number, a title, a name, etc.
One of the most convenient ways of accessing such a database is with Python and Pandas. In this way it is easy to both create an SQL database and then later access that database with queries for the particular data needed for immediate analysis. This creates a more focused and organized manner of analyzing data and allows for massive data storage at the same time. It is quite literally the best of all possible worlds.
In this article, I will walk you through the steps of a project of my own: taking in JSON data from the internet, converting it to a Pandas dataframe, separating that data into smaller dataframes, creating an SQL relational database, converting the separated dataframes into tables within that database, and querying the database.
In this project, I utilize the
SQLite3 module. It is free and open-source, serverless, self-contained, requires no installation or configuration, and is the most widely used SQL relational database engine. Best of all, it works perfectly with Pandas and Python! That is why I call them "The Happy Family"!
For this project, our data comes from the UNECE's API. The UNECE is the United Nations Economic Commission for Europe and is one of the five regional commissions of the United Nations. The data taken for this project covers North America and western and eastern Europe. It contains information for 52 countries over the span of years 2000 to 2016. There is a separate record (row) for each country for each year within the original data.
The data contains 79 columns, ranging from simple data such as the country, the year for the data, the population and land area of the country to very specific data concerning economics, the workforce, various industrial sectors, etc. So this data is perfect for creating an SQL relational database. There is SO much data, and there are very clear delineations upon which we can create separated but linked tables within a relational database and see how all this functionality really works.
Importing the JSON data:
The first step in working with data from APIs is to import the JSON data and take the necessary steps to convert it into a Pandas / Python-friendly format. For the purposes of this project, I chose data that did not require much cleaning in its original JSON form, since the main focus here is working with SQL. But keep in mind that data is rarely this friendly. For more information about bringing in data from APIs and cleaning it up to be as useful as this data is, please check out my article on the subject here: APIs, JSON, and Data Cleaning.
Here, you can see the libraries I import are few and simple. All we need to import is
pandas. I set the Pandas options to accommodate the vastness of this dataset's features, which, as I mentioned, span 79 columns. And when you investigate the raw JSON format above, you can see the one little detail about the data and the cleaning thereof that I had to confront first: the column names. That's right: the majority of ALL that text in the cell is simply COLUMN NAMES. I honestly found it mind-blowing that anyone would come up with such names. They were incredibly long, daunting, and cumbersome, and as labels/column names within code and SQL queries, they just will not work. After converting to a Pandas dataframe, I took the time to very carefully relabel each column so its contents were comprehensible and, at the same time, named in a way that works in code.
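As a sketch of those display settings (the exact option values here are my own illustrative choices, not necessarily the ones used in the project):

```python
import pandas as pd

# Widen pandas' display limits so all 79 columns can be inspected at once.
# (These particular values are illustrative, not from the original project.)
pd.set_option("display.max_columns", None)  # show every column
pd.set_option("display.width", 200)         # allow wide rows before wrapping
pd.set_option("display.max_colwidth", 60)   # truncate very long cell text

print(pd.get_option("display.max_columns"))  # None
```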
JSON to Pandas and initial data investigation:
Since the data is already quite clean, the transition from JSON to Pandas was as simple as using
pd.json_normalize(). Upon doing so, it was easier to get a clear idea of the data at hand.
This is the point where I took the time to deal with those unwieldy column names. I created the dictionary
new_column_names and assigned each column a more concise description that will work with queries in code.
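Here is a minimal sketch of that flow, using made-up records in place of the real UNECE payload (the original column names are far longer than even these):

```python
import pandas as pd

# Illustrative stand-ins for the UNECE API records
records = [
    {"Country": "Sweden", "Year": 2013,
     "Life expectancy at birth, men, years": 80.1},
    {"Country": "Sweden", "Year": 2014,
     "Life expectancy at birth, men, years": 80.4},
]

# Flatten the JSON records into a dataframe
df = pd.json_normalize(records)

# Map each unwieldy label to a short, query-friendly name
new_column_names = {
    "Country": "country",
    "Year": "year",
    "Life expectancy at birth, men, years": "life_exp_m_birth",
}
df = df.rename(columns=new_column_names)

print(list(df.columns))  # ['country', 'year', 'life_exp_m_birth']
```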
The next step is the initial data investigation. I always like to start this step by getting an idea of what the rows actually look like, so I begin with a
df.head() and get a look at just a few rows of data. This is often all I need to know how to begin.
Then, as I showed earlier, I got a count of the countries represented and the number of years' worth of data (see the code above), and after noticing that there were a significant number of missing values, I wanted to get a count of just how many were missing.
24.06% missing values seems like a lot, but there is still a great deal of data here to work with. And we will deal with the missing values within our queries so that they do not affect our returned results too badly.
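A missing-value percentage like that can be computed in one line; here is a toy sketch with invented values:

```python
import pandas as pd

# Toy frame with deliberate gaps (values are made up for illustration)
df = pd.DataFrame({
    "country": ["Sweden", "Norway", "Turkey", "France"],
    "population": [9_600_000, None, 76_500_000, None],
})

# Total NaN cells divided by total cells, as a percentage
pct_missing = df.isna().sum().sum() / df.size * 100
print(f"{pct_missing:.2f}% missing values")  # 25.00% missing values
```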
Next, I check for duplicates in the data, of which there were none. I then assign a unique id to each record, so that I have a common column I can include in each SQL table on which I can join the different tables for complex queries later on. For this I use the module
ShortUUID (source), which allowed me to create a short, length-specific, and unique id for each row.
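A sketch of the duplicate check and id assignment. The project uses the ShortUUID package; as a standard-library stand-in with the same effect, this example truncates uuid4's hex digest to eight characters:

```python
import uuid
import pandas as pd

# Tiny illustrative frame; the real data has 52 countries over 17 years
df = pd.DataFrame({"country": ["Sweden", "Sweden"], "year": [2013, 2014]})

# Confirm there are no duplicate records
print(df.duplicated().any())  # False

# Assign a short unique id to each row. The project uses the ShortUUID
# package for this; truncating uuid4's hex digest is a stdlib stand-in.
df["id"] = [uuid.uuid4().hex[:8] for _ in range(len(df))]

print(df["id"].is_unique)  # True
```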
From Pandas / Python to SQL:
This is where things start to get fun and exciting! (I would assume you love data too if you are reading this; however, I realize not everyone gets as giddy about it as I do.) So the next step is separating out the columns of data and making separate Pandas dataframes that will be converted into database tables. I tried to organize the data in such a way as to categorize the columns for query convenience, for example by minimizing the need to join tables. These are the categories I decided upon:
Let's have a look at the first three rows of data from each dataframe before converting them into tables in the database.
The step of converting these to tables is so easy! All we have to do is use
df.to_sql() and supply the name we want the table to have and the SQL connection we have created. The following cells are the creation of the SQL database connection and the creation of the tables of the database. (The code for creating the tables is commented out, because as I was working on the project, any time I would accidentally run those cells again, I would get errors, since the database had already been created.)
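A minimal sketch of that database-creation step, with a made-up two-row stand-in for the real Main dataframe (an in-memory database here; the project writes to a file):

```python
import sqlite3
import pandas as pd

# Illustrative stand-in for the Main dataframe
main_df = pd.DataFrame({
    "id": ["a1", "a2"],
    "country": ["Sweden", "Norway"],
    "year": [2013, 2013],
    "population": [9_600_000, 5_100_000],
})

# Create the database connection (":memory:" keeps it in RAM;
# a filename like "unece.db" would create a file on disk)
connection = sqlite3.connect(":memory:")

# One call per dataframe turns it into a table. The default
# if_exists="fail" raises an error if the table already exists,
# which is why re-running the creation cells caused errors.
main_df.to_sql("Main", connection, index=False)

print(pd.read_sql("SELECT COUNT(*) AS n FROM Main", connection))
```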
The tables of the SQL database are organized as follows:
Main contains the main information for each row: the country, the year of the data, the country's land area in square kilometers, and the population and population density columns. This will be the table that begins most queries.
Pop contains the various columns pertaining to population calculations, many of which are different computations of population densities, gender separations, etc.
Life contains data about lifestyle and life expectancy calculations. For example, there is data concerning the amount of computer usage for different age groups, separated by gender, the average age of women at the birth of their first child, etc.
GenWork contains columns pertaining to gender splits, ratios in the workforce, and female representation in the workforce. This table becomes very interesting when we get to queries!
Econ contains the columns that pertain to the economic and industrial activity and ratios of employment in each field for each nation for each year, as well as employment and unemployment rates for various age groups, also separated out by gender.
RoadRail contains a few columns that pertain to the deaths and injuries on the roadways and railways for each country, as well as the total kilometers of roadways and railways within each nation.
We can get confirmation that all of these tables are now a part of our SQL database by using
connection.execute() as shown in the following cell.
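The check looks something like this (sketched against a fresh in-memory database with two stand-in tables):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Main (id TEXT, country TEXT)")
connection.execute("CREATE TABLE Pop (id TEXT, pop_density REAL)")

# sqlite_master is SQLite's built-in catalog of schema objects;
# filtering on type = 'table' lists every table in the database.
tables = connection.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

print(tables)  # e.g. [('Main',), ('Pop',)]
```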
The most fundamental SQL keywords:
Now that our database has been created, we can start running queries. I will include the cells from my project which explain the query as well as the code for the query and the returned data results. Whenever I use a new SQL keyword, I highlight it in the heading for the code cell for reference.
Here is a quick rundown of the main SQL query keywords that I use in this project and what they do in order of the most frequently utilized. For full detail, ALWAYS be sure to check out the docs. Becoming a doc-extraordinaire makes you a tech wizard! There are MANY more keywords and functionalities to learn about. But this will give you a quick introduction to this powerful tool!
SELECT- This is the first keyword a user must know. It is necessary in order to get any information from the database. It is simply the way you say, "I want to get the following data as described...", which you follow with whatever query you need to make to gather the appropriate data for analysis.
FROM- This is the second most important keyword, because after using
SELECT, you have to tell from where you would like the data selected. For example, one of the simplest queries would be:
SELECT * FROM Main. This would return all data (hence the
* for all) from the table with the name Main.
MAX, MIN, AVG, SUM- These are aggregate functions you can use on your data to very quickly and easily get the highest values, the lowest values, averages, and totals. They are very useful when you need to get an idea of the limits of your data, the overall distribution, etc. For example, if I queried
SELECT MAX(population), country FROM Main, I would get back the country with the highest population and that population count.
WHERE- This is an incredibly necessary keyword for filtering data down to only the very specific values you are looking for. For example, if you only wanted the data for a certain country like Turkey:
SELECT * FROM Main WHERE country = "Turkey". This will return every record that has Turkey listed as the country.
AND- This is important for chaining
WHERE filters together. For example, similar to the query above, suppose I wanted just the data for the year 2013 for Turkey:
SELECT * FROM Main WHERE country = "Turkey" AND year = 2013.
=- As you saw in the last two examples,
= is used to filter fields and values to get very specific results, but what if you wanted a range of data below or above a certain value? Then you could run a query like:
SELECT * FROM Main WHERE population > 10000000. This will return to you all the data contained within the
Main table for countries whose population is over 10 million in any given year.
< can be used in exactly the same way.
IS NOT NULL- This can be used with
WHERE and a column name to filter out
NULL values. As I mentioned above, 24.06% of the values in this data are
NULL. And it really is unfortunate when you get results with so many
NULL values that the query becomes meaningless. So it is nice to throw a
WHERE column_name IS NOT NULL into the query if you find that this is the case with your data.
DISTINCT- This is used when you want the unique values in a column. For example, if I only wanted the unique names of all the countries in the database, I could query:
SELECT DISTINCT country FROM Main. This would give me a table of just the distinct countries present. Remember that in this data, each country appears for the years 2000-2016, so this is useful for limiting the results and eliminating duplicates. One thing to keep in mind, however, is that this will apply to all columns you query. So with more complicated queries, some workarounds must be performed, such as limiting the year to
WHERE year = 2013.
JOIN ... ON- These are used for combining tables to get results from across different tables that are connected by a common column; in our case here, that column is
id, the unique id I described above. So if I wanted to get the columns for the country as well as the life expectancy for men, I would run the following query:
SELECT Main.country, Life.life_exp_m_birth FROM Main JOIN Life ON Main.id = Life.id. In this query, I specify the two columns I want (
Table.column_name), the tables that must be joined, and the column on which to join them. This would return to me all records for every country, for every year, but only the country column and the male life expectancy column. Joins can be chained together, one after the other, for however many tables are present in the database. You will see a few examples of chained joins in the queries below.
ORDER BY ... ASC / DESC- These do just what they say. You tack ORDER BY onto the end of a query with a column name and a sort direction, and your results come back in the order you specify. For example, if I performed the following query:
SELECT country, population FROM Main ORDER BY population ASC, I would get back a list of all of the countries (for all years included in the data) ordered by population from lowest to highest. Throwing a
DISTINCT in there would filter that down to just one record per country. But honestly, for operations like that, I find it more convenient and comprehensible to also filter by a particular year.
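To tie the keyword rundown together, here is a small runnable demo against a toy Main table (the data values are invented, not the real UNECE figures):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Main (id TEXT, country TEXT, year INTEGER, population INTEGER)"
)
connection.executemany(
    "INSERT INTO Main VALUES (?, ?, ?, ?)",
    [("a1", "Turkey", 2013, 76500000),
     ("a2", "Turkey", 2014, 77700000),
     ("a3", "Sweden", 2013, 9600000),
     ("a4", "Norway", 2013, None)],   # a NULL population to filter out
)

# WHERE + AND: one country, one year
row = connection.execute(
    "SELECT population FROM Main WHERE country = 'Turkey' AND year = 2013"
).fetchone()
print(row)  # (76500000,)

# DISTINCT: each country once, despite multiple years
countries = connection.execute("SELECT DISTINCT country FROM Main").fetchall()
print(len(countries))  # 3

# IS NOT NULL + ORDER BY ... DESC: skip missing values, biggest first
ranked = connection.execute(
    "SELECT country, population FROM Main "
    "WHERE population IS NOT NULL AND year = 2013 "
    "ORDER BY population DESC"
).fetchall()
print(ranked)  # [('Turkey', 76500000), ('Sweden', 9600000)]
```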
Simple SQL Queries:
More advanced SQL queries, including table joining:
The following queries start with simple joining of tables, sorting, and so forth and progress to more complicated queries.
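As a taste of what's ahead, here is a chained two-join sketch on tiny stand-in tables, linked on the shared id column (column names follow the conventions above; the values are invented):

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE Main (id TEXT, country TEXT, year INTEGER);
    CREATE TABLE Life (id TEXT, life_exp_m_birth REAL);
    CREATE TABLE Pop  (id TEXT, pop_density REAL);
    INSERT INTO Main VALUES ('a1', 'Sweden', 2013);
    INSERT INTO Life VALUES ('a1', 80.1);
    INSERT INTO Pop  VALUES ('a1', 23.5);
""")

# Two joins chained on the shared id pull columns from three tables at once
result = connection.execute("""
    SELECT Main.country, Life.life_exp_m_birth, Pop.pop_density
    FROM Main
    JOIN Life ON Main.id = Life.id
    JOIN Pop  ON Main.id = Pop.id
""").fetchall()

print(result)  # [('Sweden', 80.1, 23.5)]
```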
A note about plotting:
One aspect of working with data that I thoroughly enjoy (although, I honestly enjoy everything about data analysis and manipulation) is data visualization. That is one facet of using SQL that is better done after converting query results back to a Pandas dataframe. Going straight from SQL query results to
.plot() gives very little opportunity for customization, so labeling and other visual necessities are lacking. Compare the following two plots. For the first, I plot directly from the SQL query data; for the second, I import Matplotlib and plot with all the customization options available to me, although I use only the main and necessary customizations in this example.
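The second approach looks roughly like this (invented numbers, with off-screen rendering so it runs anywhere):

```python
import os
import sqlite3
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE Main (country TEXT, year INTEGER, population INTEGER);
    INSERT INTO Main VALUES ('Sweden', 2013, 9600000);
    INSERT INTO Main VALUES ('Sweden', 2014, 9700000);
    INSERT INTO Main VALUES ('Sweden', 2015, 9850000);
""")

# Pull the query results back into a DataFrame first...
df = pd.read_sql("SELECT year, population FROM Main", connection)

# ...then plot with full control over labels, titles, and markers
fig, ax = plt.subplots()
ax.plot(df["year"], df["population"], marker="o")
ax.set_title("Population of Sweden by year (illustrative data)")
ax.set_xlabel("Year")
ax.set_ylabel("Population")
fig.savefig("population_plot.png")

print(os.path.exists("population_plot.png"))  # True
```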
So there you have it! This concludes my overview of "The Happy Family": JSON, Pandas, Python, and SQL. With these, the world of data is at your fingertips and ready to be explored and analyzed. And with a quick
.to_csv() tacked onto the end, the results of any of the above queries can be quickly whisked away to CSV format and shared as desired. This is just the tip of the iceberg. But the great thing is that once you get this tip, the whole iceberg can easily be yours! So, good luck! And happy data wrangling!
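For instance, a sketch of that query-to-CSV chain (the table contents and filename here are illustrative):

```python
import sqlite3
import pandas as pd

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE Main (country TEXT, population INTEGER);
    INSERT INTO Main VALUES ('Sweden', 9600000);
""")

# Query -> DataFrame -> CSV in one short chain
pd.read_sql("SELECT * FROM Main", connection).to_csv("main_export.csv", index=False)

print(open("main_export.csv").read())
```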