import pandas as pd
import numpy as np
car_data = 'car_financing.xlsx'
df = pd.read_excel(car_data)
# Select one column using double brackets
df[['car_type']].head()
| | car_type |
|---|---|
| 0 | Toyota Sienna |
| 1 | Toyota Sienna |
| 2 | Toyota Sienna |
| 3 | Toyota Sienna |
| 4 | Toyota Sienna |
# Select multiple columns using double brackets
df[['car_type', 'Principal Paid']].head(1)
| | car_type | Principal Paid |
|---|---|---|
| 0 | Toyota Sienna | 484.3 |
# Select multiple columns using double brackets
# Keep in mind that you can't select multiple columns using single brackets:
# df['car_type', 'Principal Paid'] would raise a KeyError
df[['car_type', 'Principal Paid']].head(1)
| | car_type | Principal Paid |
|---|---|---|
| 0 | Toyota Sienna | 484.3 |
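The KeyError behaviour can be demonstrated on a toy DataFrame (the values below are made up, not from the car-financing file):

```python
import pandas as pd

# Toy DataFrame (made-up values)
df_demo = pd.DataFrame({'car_type': ['Toyota Sienna'],
                        'Principal Paid': [484.30]})

# Single brackets with several labels look for ONE column named
# ('car_type', 'Principal Paid'), which does not exist -> KeyError
try:
    df_demo['car_type', 'Principal Paid']
    raised = False
except KeyError:
    raised = True

# Double brackets (a list of labels) select both columns
subset = df_demo[['car_type', 'Principal Paid']]
```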
# This is a Pandas DataFrame
type(df[['car_type']].head())
pandas.core.frame.DataFrame
# This is a pandas series
type(df['car_type'].head())
pandas.core.series.Series
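The DataFrame-versus-Series distinction can be checked directly on a small toy DataFrame (hypothetical values, standing in for the car-financing data):

```python
import pandas as pd

# Toy DataFrame standing in for the car-financing data (values are made up)
df_demo = pd.DataFrame({'car_type': ['Toyota Sienna', 'Honda Odyssey'],
                        'Principal Paid': [484.30, 611.19]})

# Double brackets pass a list of labels -> the result is a DataFrame
frame = df_demo[['car_type']]

# Single brackets pass a single label -> the result is a Series
series = df_demo['car_type']
```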
# Preferred way to select columns in Pandas
# Get the dataframe
df.loc[:, ['car_type']].head()
| | car_type |
|---|---|
| 0 | Toyota Sienna |
| 1 | Toyota Sienna |
| 2 | Toyota Sienna |
| 3 | Toyota Sienna |
| 4 | Toyota Sienna |
# Get the Pandas series
df.loc[:, 'car_type'].head()
0    Toyota Sienna
1    Toyota Sienna
2    Toyota Sienna
3    Toyota Sienna
4    Toyota Sienna
Name: car_type, dtype: object
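The same DataFrame/Series behaviour carries over to `.loc`; here is a minimal sketch on a toy DataFrame (made-up values):

```python
import pandas as pd

# Toy DataFrame (made-up values)
df_demo = pd.DataFrame({'car_type': ['Toyota Sienna'],
                        'Principal Paid': [484.30]})

# .loc[rows, columns]: a list of labels yields a DataFrame,
# a bare label yields a Series
as_frame = df_demo.loc[:, ['car_type']]
as_series = df_demo.loc[:, 'car_type']
```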
# Notice that the filter produces a pandas series of True and False values
car_filter = df['car_type'] == 'Toyota Sienna'
# Filter dataframe to get a DataFrame of only 'Toyota Sienna'
df.loc[car_filter, :]
# Filter dataframe to get a DataFrame of only 'Toyota Sienna'
df = df.loc[car_filter, :]
df.head(2)
| | Month | Starting Balance | Repayment | Interest Paid | Principal Paid | New Balance | term | interest_rate | car_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 34689.96 | 687.23 | 202.93 | 484.30 | 34205.66 | 60 | 0.0702 | Toyota Sienna |
| 1 | 2 | 34205.66 | 687.23 | 200.10 | 487.13 | 33718.53 | 60 | 0.0702 | Toyota Sienna |
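The filtering steps above can be sketched end to end on a toy DataFrame (the 'VW Golf R' row is made up for contrast):

```python
import pandas as pd

# Toy DataFrame; the 'VW Golf R' row is made up for contrast
df_demo = pd.DataFrame({'car_type': ['Toyota Sienna', 'VW Golf R']})

# The comparison produces a boolean Series aligned with the rows
car_filter = df_demo['car_type'] == 'Toyota Sienna'

# Using the boolean Series as the row indexer keeps only the True rows
only_sienna = df_demo.loc[car_filter, :]
```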
# Creating interest rate filter
print(df['interest_rate'].value_counts(), "\n")
# Notice that the filter produces a pandas series of True and False values
df['interest_rate']==0.0702
# Create the interest rate filter
interest_filter = df['interest_rate']==0.0702
# limit the dataframe by the filter
df = df.loc[interest_filter, :]
# dropna = False means it will include where there are NaN values
df['interest_rate'].value_counts(dropna = False)
0.0702    60
0.0359    60
Name: interest_rate, dtype: int64

0.0702    60
Name: interest_rate, dtype: int64
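The effect of `dropna=False` can be reproduced on a small hypothetical series containing a NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical interest-rate series with one missing value
s = pd.Series([0.0702, 0.0702, np.nan])

# dropna=False gives NaN its own row in the counts
counts = s.value_counts(dropna=False)
```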
A more concise way to combine multiple filters is to use bitwise logic operators.
| Bitwise Logic Operator | Meaning |
|---|---|
| & | and |
| \| | or |
| ^ | exclusive or |
| ~ | not |
df.loc[car_filter & interest_filter, :].head()
| | Month | Starting Balance | Repayment | Interest Paid | Principal Paid | New Balance | term | interest_rate | car_type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 34689.96 | 687.23 | 202.93 | 484.30 | 34205.66 | 60 | 0.0702 | Toyota Sienna |
| 1 | 2 | 34205.66 | 687.23 | 200.10 | 487.13 | 33718.53 | 60 | 0.0702 | Toyota Sienna |
| 2 | 3 | 33718.53 | 687.23 | 197.25 | 489.98 | 33228.55 | 60 | 0.0702 | Toyota Sienna |
| 3 | 4 | 33228.55 | 687.23 | 194.38 | 492.85 | 32735.70 | 60 | 0.0702 | Toyota Sienna |
| 4 | 5 | 32735.70 | 687.23 | 191.50 | 495.73 | 32239.97 | 60 | 0.0702 | Toyota Sienna |
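These operators can be exercised on a toy version of the loan table (the values and the 'VW Golf R' row are made up):

```python
import pandas as pd

# Toy version of the loan table (made-up values)
df_demo = pd.DataFrame({'car_type': ['Toyota Sienna', 'Toyota Sienna', 'VW Golf R'],
                        'interest_rate': [0.0702, 0.0359, 0.0702]})

car_filter = df_demo['car_type'] == 'Toyota Sienna'
interest_filter = df_demo['interest_rate'] == 0.0702

both = df_demo.loc[car_filter & interest_filter, :]        # and
either = df_demo.loc[car_filter | interest_filter, :]      # or
neither = df_demo.loc[~(car_filter | interest_filter), :]  # not
```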
Here are two popular ways to rename dataframe columns.
.rename(columns={})
# Approach 1: dictionary substitution using the rename method
df = df.rename(columns={'Starting Balance': 'starting_balance',
'Interest Paid': 'interest_paid',
'Principal Paid': 'principal_paid',
'New Balance': 'new_balance'})
df.columns
Index(['Month', 'starting_balance', 'Repayment', 'interest_paid', 'principal_paid', 'new_balance', 'term', 'interest_rate', 'car_type'], dtype='object')
.columns =
# Approach 2: list replacement
# Only changing Month -> month, but we need to list the rest of the columns
df.columns = ['month',
'starting_balance',
'Repayment',
'interest_paid',
'principal_paid',
'new_balance',
'term',
'interest_rate',
'car_type']
df.columns
Index(['month', 'starting_balance', 'Repayment', 'interest_paid', 'principal_paid', 'new_balance', 'term', 'interest_rate', 'car_type'], dtype='object')
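Both renaming approaches can be sketched on a toy DataFrame (a hypothetical subset of the loan columns):

```python
import pandas as pd

# Toy DataFrame with a hypothetical subset of the loan columns
df_demo = pd.DataFrame({'Starting Balance': [34689.96], 'Month': [1]})

# Approach 1: a dictionary renames only the columns it mentions
renamed = df_demo.rename(columns={'Starting Balance': 'starting_balance'})

# Approach 2: assigning a list replaces every name, so all must be listed in order
df_demo.columns = ['starting_balance', 'month']
```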
.drop(columns=[]) and del
# Approach 1
# This approach allows you to drop multiple columns at a time
df = df.drop(columns=['term'])
# Approach 2 use the del command
del df['Repayment']
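A minimal sketch of both column-removal approaches, on a made-up DataFrame:

```python
import pandas as pd

# Toy DataFrame (made-up values)
df_demo = pd.DataFrame({'month': [1, 2],
                        'term': [60, 60],
                        'Repayment': [687.23, 687.23]})

# drop returns a new DataFrame and can remove several columns at once
df_demo = df_demo.drop(columns=['term'])

# del removes exactly one column, in place
del df_demo['Repayment']
```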
# sum the values in a column
# total amount of interest paid over the course of the loan
df['interest_paid'].sum()
6450.2699999999995
# sum all the values across all columns
df.sum()
month                     1830
starting_balance    1118598.13
interest_paid          6450.27
principal_paid        34690.29
new_balance         1083907.84
interest_rate            4.212
car_type      Toyota SiennaToyota SiennaToyota Sienna...
dtype: object
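The column and whole-frame sums can be sketched on a two-row toy table (made-up payment values):

```python
import pandas as pd

# Two made-up months of payments
df_demo = pd.DataFrame({'interest_paid': [202.93, 200.10],
                        'principal_paid': [484.30, 487.13]})

total_interest = df_demo['interest_paid'].sum()  # one column
column_totals = df_demo.sum()                    # every column at once
```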
Values can be missing from the original dataset or arise as a by-product of data manipulation. In pandas, missing values are typically represented as NaN or None.
Missing data matters: before you graph data, you should make sure you aren't trying to plot missing values, as that can cause an error or a misinterpretation of the data.
df.info() # 1 missing value
<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 0 to 59
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   month             60 non-null     int64
 1   starting_balance  60 non-null     float64
 2   interest_paid     59 non-null     float64
 3   principal_paid    60 non-null     float64
 4   new_balance       60 non-null     float64
 5   interest_rate     60 non-null     float64
 6   car_type          60 non-null     object
dtypes: float64(5), int64(1), object(1)
memory usage: 3.8+ KB
Two common methods to indicate where values in a DataFrame are missing are isna and isnull. They are exactly the same methods, but with different names.
# Notice we have a Pandas Series of True and False values
df['interest_paid'].isna().head()
0    False
1    False
2    False
3    False
4    False
Name: interest_paid, dtype: bool
interest_missing = df['interest_paid'].isna()
# Looks at the row that contains the NaN for interest_paid
df.loc[interest_missing,:]
| | month | starting_balance | interest_paid | principal_paid | new_balance | interest_rate | car_type |
|---|---|---|---|---|---|---|---|
| 35 | 36 | 15940.06 | NaN | 593.99 | 15346.07 | 0.0702 | Toyota Sienna |
# Keep in mind that we can use the not operator (~) to negate the filter
# every row that doesn't have a nan is returned.
df.loc[~interest_missing,:].head(1)
| | month | starting_balance | interest_paid | principal_paid | new_balance | interest_rate | car_type |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 34689.96 | 202.93 | 484.3 | 34205.66 | 0.0702 | Toyota Sienna |
# The code counts the number of missing values
# sum() works because Booleans are a subtype of integers.
df['interest_paid'].isna().sum()
1
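The count-by-sum trick works because each True contributes 1; a minimal sketch on a hypothetical series with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing payment
s = pd.Series([202.93, np.nan, 197.25], name='interest_paid')

mask = s.isna()         # boolean Series: True where the value is missing
n_missing = mask.sum()  # each True counts as 1
```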
You can remove missing values by using the dropna
method.
# You can drop entire rows if they contain 'any' nans in them or 'all'
# this may not be the best strategy for our dataset
# df[30:40].dropna(how = 'any').head(3)
# Looking at where missing data is located
# df['interest_paid'][30:40]
# Filling in the nan with a zero, which is probably a bad idea.
# df['interest_paid'][30:40].fillna(0)
# back fill: fill the NaN with the value that follows directly after it
# df['interest_paid'][30:40].bfill()
# forward fill: fill the NaN with the value that comes directly before it
# df['interest_paid'][30:40].ffill()
# linear interpolation (filling in of values), uses a linear model to come up
# with value to use to fill in
# df['interest_paid'][30:40].interpolate(method = 'linear')
# Fill in with the actual value
interest_missing = df['interest_paid'].isna() # filter to find missing value
df.loc[interest_missing,'interest_paid'] = 93.24
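The fill strategies above can be compared side by side on a tiny hypothetical series (10.0, NaN, 30.0):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical series with one gap
s = pd.Series([10.0, np.nan, 30.0])

filled_zero = s.fillna(0)                # replace the NaN with a constant
back = s.bfill()                         # backward fill: next valid value (30.0)
forward = s.ffill()                      # forward fill: previous valid value (10.0)
linear = s.interpolate(method='linear')  # straight line between neighbours (20.0)
```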
# Approach 1 => returns NumPy array
# df.to_numpy()
# Approach 2 => returns NumPy array
# df.values
# Converts to dictionary, preserving indices, returns dictionary
# df.to_dict()
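A quick sketch of how to_dict keeps the index labels while to_numpy drops them (toy DataFrame with made-up index labels):

```python
import pandas as pd

# Toy DataFrame with made-up, non-default index labels
df_demo = pd.DataFrame({'month': [1, 2]}, index=[10, 11])

as_dict = df_demo.to_dict()    # column -> {index label -> value}
as_array = df_demo.to_numpy()  # plain array; the index labels are dropped
```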
# Export DataFrame to csv File, index = False means do not export the indices
df.to_csv(path_or_buf='table_i702t60.csv',
index = False)
# Export DataFrame to excel File
df.to_excel(excel_writer='table_i702t60.xlsx',
index=False)
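To check that the export round-trips cleanly without writing to disk, you can target an in-memory buffer (a sketch, not part of the original lesson):

```python
import io

import pandas as pd

# Toy DataFrame (made-up values)
df_demo = pd.DataFrame({'month': [1, 2],
                        'car_type': ['Toyota Sienna'] * 2})

# Write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
df_demo.to_csv(buf, index=False)  # index=False: row labels are not written
buf.seek(0)

round_trip = pd.read_csv(buf)
```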
# Keep in mind that if you don't know a method's parameters,
# you can look them up using the help command.
# help(df.to_csv)
How is the NumPy array approach to initializing a pandas DataFrame different than the nested list approach?
Which DataFrame method or attribute can help you check whether the data contains nulls?
.info()
A DataFrame has two columns: "user" and "phone" (in this order). Which command can you use to rename the first column to "person"?
df.columns=['person','phone']
The command below produces a series of true/false values to identify missing data. How can you use this series to show only the rows where no data is missing?
filter=df['my_column'].isna()
df.loc[~filter,:]
Which DataFrame conversion method preserves the indexes?
df.to_dict()