# Python Course 4 - Data Analysis

Application data (from the *real world*) usually does not come in the form we would like. It is not nicely organized into vectors, matrices, and tensors. Missing values, which appear quite often, pose a further problem.

The goal of this exercise is to look into a few basics of data analysis using Python.

We start by loading the relevant libraries.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt 

%matplotlib inline 
sns.set(color_codes=True)

Data is often provided in CSV format. Here, we use a data set about cars. Loading the data into a pandas data frame is one of the first steps in exploratory data analysis (EDA). Since the values in the file are comma-separated, all we have to do is read the CSV file with `pd.read_csv`, which returns a data frame.

In [None]:
df = pd.read_csv("data.csv") 
df.head(5)  # display the first 5 rows
#df.tail(5) # display bottom 5 rows 
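
To see what `pd.read_csv` does without depending on `data.csv`, the following sketch reads a small CSV text from memory. The column names and values are made up for illustration only; note how the empty field shows up as a missing value.

```python
import io
import pandas as pd

# Hypothetical CSV content, used only for illustration (not the cars data set).
csv_text = """Make,HP,Price
A,100,20000
B,,35000
C,250,50000
"""

# read_csv accepts a path or a file-like object; the separator defaults to a comma.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)         # (rows, columns)
print(toy.isna().sum())  # number of missing values per column
```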

Datatypes may also pose a problem, since columns might have the *wrong* type assigned to them. For example, values may be stored as strings even though we need integers or floats to work with.

Here, we check the datatypes because the MSRP (the price of the car) is sometimes stored as a string. In that case, we would have to convert the strings to numbers before we could plot the data. In this data set, the prices are already stored as integers, so there is nothing to worry about.


In [None]:
df.dtypes
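
If a price column were stored as strings, it could be converted roughly as follows. This is a minimal sketch on a hypothetical frame (the cars data set does not need it); `pd.to_numeric` with `errors="coerce"` turns entries that cannot be parsed into `NaN`.

```python
import pandas as pd

# Hypothetical frame in which the prices were read as strings.
toy = pd.DataFrame({"Price": ["20000", "35000", "n/a"]})
print(toy.dtypes)  # Price has a string/object dtype here

# Convert to numbers; entries that cannot be parsed become NaN.
toy["Price"] = pd.to_numeric(toy["Price"], errors="coerce")
print(toy.dtypes)  # Price is now float64 (NaN forces a float column)
```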

The next step is to drop irrelevant columns. This is needed most of the time, since we rarely use all of the data. In this case, the columns Engine Fuel Type, Market Category, Vehicle Style, Popularity, Number of Doors, and Vehicle Size don't interest us, so we drop them.


In [None]:
df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors', 'Vehicle Size'], axis=1)
df.head(5)

Now, we want to rename the columns to shorten the names and remove spaces where possible.

In [None]:
df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode","highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price" })
df.head(5)

In big data sets it might also happen that we have duplicate rows. The next step is to remove those.

In [None]:
df.shape

In [None]:
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape[0])  # shape[0] is the row count

In [None]:
df.count() 

As seen above, there are 11914 rows, and we are removing 989 duplicate rows.

In [None]:
df = df.drop_duplicates()
df.head(5)

In [None]:
df.count()

We return to the problem of missing (or NULL) values. There are different ways to deal with this. For our data set it makes sense to just drop rows with missing values.

Depending on the problem, it is also possible to replace the missing values with the average of the column.

In [None]:
print( df.isnull().sum() )

In [None]:
df = df.dropna() # dropping the missing values.
df.count()

In [None]:
print(df.isnull().sum()) # after dropping the values
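
As an alternative to `dropna`, missing values can be replaced with the column average, as mentioned above. A minimal sketch on a small hypothetical frame (the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in a numeric column.
toy = pd.DataFrame({"HP": [100.0, np.nan, 200.0]})

# Replace missing entries with the mean of the observed values.
# mean() skips NaN by default, so the mean here is (100 + 200) / 2.
toy["HP"] = toy["HP"].fillna(toy["HP"].mean())
print(toy)
```

Whether this is appropriate depends on the problem: filling with the mean keeps the rows but reduces the variance of the column.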

Before we visualize the data, one problem remains: outliers. Outliers are, in general, points that do not fit with the rest of the data. The meaning and importance of outliers depend, of course, on the kind of problem you are considering.

The outlier detection and removal technique that we are going to use is called the IQR score technique. Outliers can often be spotted visually with a box plot. Shown below are the box plots of Price (MSRP), HP, and Cylinders. The points that lie outside the whiskers are the ones we consider outliers.

In [None]:
sns.boxplot(x=df['Price'])

In [None]:
sns.boxplot(x=df['HP'])

In [None]:
sns.boxplot(x=df['Cylinders'])

In [None]:
Q1 = df.quantile(0.25, numeric_only=True)  # quartiles of the numeric columns only
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print(IQR)

In [None]:
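numeric = df[Q1.index]  # restrict the comparison to the numeric columns
outliers = ((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)
df = df[~outliers]
df.shape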

This did not completely remove the outliers, but we dealt with most of them. Please read up on the IQR score technique if you are interested in the idea behind it.
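
The idea can be illustrated on a single column. The following sketch uses made-up values with one obvious outlier; the factor 1.5 is the same conventional choice as in the cell above.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data

# Points further than 1.5 * IQR outside the quartiles count as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

kept = values[(values >= lower) & (values <= upper)]
print(lower, upper)
print(kept.tolist())
```

Only the value 100 falls outside the interval between `lower` and `upper`, so it is the only point removed.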

# Visualization

Visualization is an important part of data analysis. It helps you to get a better understanding of your data and it surely is necessary if you want to present your results. 

**Histogram**

A histogram shows how often values occur in each interval (or, for categorical data, in each category). The data set contains cars from a number of different manufacturers, and it is often useful to know which of them has the most cars. A bar chart of the counts per make is one of the simplest ways to see the number of cars from each manufacturer.

In [None]:
df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
plt.title("Number of cars by make")
plt.ylabel('Number of cars')
plt.xlabel('Make')
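
The counting step behind this plot can be seen in isolation on a small example. The makes below are hypothetical placeholders, not values from the cars data set.

```python
import pandas as pd

# Hypothetical makes, used only for illustration.
makes = pd.Series(["A", "B", "A", "C", "A", "B"])

# value_counts() counts how often each value occurs, sorted by frequency;
# nlargest(2) keeps the two most frequent makes.
top = makes.value_counts().nlargest(2)
print(top)
```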

**Heat Maps**

A heat map is a type of plot that is useful when we want to find out which variables are related to each other. It is one of the best ways to see relationships between the features at a glance. From the heat map below, we get the idea that the price is related to HP and Cylinders.


In [None]:
plt.figure(figsize=(10,5))
c = df.select_dtypes(include="number").corr()  # correlate the numeric columns only
sns.heatmap(c,cmap="BrBG",annot=True)
c
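
By default, `corr` computes the Pearson correlation coefficient. The sketch below computes it by hand for two hypothetical columns and compares the result with pandas; the numbers are made up and only serve to show what the entries of the heat map mean.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only for illustration.
toy = pd.DataFrame({"HP": [100, 150, 200, 250], "Price": [20, 31, 39, 52]})

# Deviations from the column means.
x = toy["HP"] - toy["HP"].mean()
y = toy["Price"] - toy["Price"].mean()

# Pearson correlation: covariance divided by the product of the standard deviations.
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy["HP"].corr(toy["Price"])

print(r_manual, r_pandas)
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 no linear relationship.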

**Scatterplot**

We generally use scatter plots to look for a correlation between two variables. Here, we plot Price against HP.


In [None]:
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(df['HP'], df['Price'])
ax.set_xlabel('HP')
ax.set_ylabel('Price')
plt.show()

# Do it yourself!

It is time to try a data set for yourself! The code below loads data about the AirBnb usage in New York City.

Sheet adapted from content in https://github.com/Tanu-N-Prabhu/Python/.