The essentials of data preprocessing for tabular data
In this blog post we will explore the essential steps you need to take to preprocess your data and ensure that it is in the right form before you dump it in your "fancy" ML/DL model.
- Downloading dataset from Kaggle
- Exploring the data
- The essential preprocessing techniques
- Conclusion
If you ask data scientists which part of the ML pipeline takes the most time, they will probably tell you it's the data preprocessing stage. Ensuring your data is in the right form before you dump it into your ML/DL model is of paramount importance: if you feed garbage into your model, you will get garbage out. In this blog post we will look at some of the essential techniques you can use to preprocess your data properly.
But first, we need to get some data.
This notebook is built using Google Colab. It is a notebook server provided by Google for free. You can also use other services to run the code below but you will have to figure out how to get the dataset. The dataset that we use here is present on Kaggle and you can directly download it from here.
In this notebook, we are going to download the dataset from Kaggle into Google Colab and store it in a directory in our Google Drive. Storing your data in the Drive saves you from the trouble of downloading the dataset every time you start a new session in Colab.
For further guidance read this wonderful article by Mrinali Gupta: How to fetch Kaggle Datasets into Google Colab
So let's get to it!
First, we need to mount our Google Drive so that we can access all the folders in the drive.
from google.colab import drive
drive.mount('/content/gdrive')
Then we will use the following code to provide a config path for the Kaggle API (the directory that contains your kaggle.json credentials file).
import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/kaggle/StudentPerformance"
We will change our current directory to where we want the dataset to be downloaded
%cd /content/gdrive/My Drive/kaggle/StudentPerformance
Now we can download the dataset from Kaggle.
!kaggle datasets download -d spscientist/students-performance-in-exams
Let's unzip the files
!unzip \*.zip && rm *.zip
What files are present in the current directory?
!ls
You can see that there is a "StudentsPerformance.csv" file present in the directory. That is our dataset.
Before we apply any preprocessing steps to the data, we need to know what kind of data the dataset contains. Is it textual data? Is it numerical data? Are there any dates present? What about geographic locations?
There are a lot of questions we can ask about the data in our dataset. So before we move further, we need to get a sense of what hidden knowledge our dataset contains.
In the code cells below you will see some of the most common steps you can apply to gather information about your data.
import pandas as pd
df=pd.read_csv("StudentsPerformance.csv")
First 5 rows of the dataset
Seeing the first and last few rows can be really helpful in creating a mental picture of the data. It also allows you to map out the possible roadblocks you are going to face in achieving your end goal.
df.head(5)
Last 5 rows of the dataset
df.tail(5)
Information about data type of columns and null values
Knowing the data type of each column is crucial in choosing the right preprocessing step for that column as well as understanding what the values in the column represent.
Another crucial piece of information is the number of non-null values. It helps you in deciding which columns need imputation.
df.info()
Checking if the dataset has null values
In the last cell, we saw one way to check whether there are null values in your dataset. Below is another method to confirm whether your data has any null/missing values. As you can see, this dataset does not have any null values.
for i in list(df.columns):
    bool_series = pd.isnull(df[i])
    print("Column: {} has {} null values.".format(i, df[bool_series].shape[0]))
Describing the numerical columns in the dataset
If you are applying regression or even classification, knowing the summary statistics might help you in deciding how you want to handle the numerical features. Maybe you have to apply some transformations before you apply regression. Maybe the numerical features can be dropped in case they do not contribute much.
df.describe()
How many unique values does each column have?
You might want to know how many unique values each column has. This is helpful when you have a big dataset and you are thinking of generating more features from the existing features. This is also important when you are dealing with cases where the Curse of Dimensionality becomes relevant.
df.nunique()
In this section, we will cover the essential preprocessing techniques you can apply to your data before feeding it into your model. This is by no means an exhaustive list of techniques you can apply, but it covers the most common techniques applied in the ML industry.
The order in which we apply these techniques is very important since each preprocessing step transforms the data such that it may become incompatible for another preprocessing step.
We are going to apply our preprocessing techniques in the following order:
- Label Encoding (if needed)
- One-hot Encoding
- Imputation
- Normalization or Standardization
We saw in the previous section that our data contains columns which have string values. Although we can understand what these string values represent, a machine does not. So, to make these values machine-readable we have to find a way to represent them numerically. Label Encoding is one such method of accomplishing this.
In Label Encoding, we assign a unique integer value to each class in the data. This means that the gender column in our dataset will be encoded like so:
| Original value | Label encoded value |
|---|---|
| female | 0 |
| male | 1 |

(LabelEncoder assigns the integers in alphabetical order of the class labels.)
Let's see this in action. We are going to label encode the following columns in our dataset:
- gender
- race/ethnicity
- parental level of education
- lunch
- test preparation course
from sklearn.preprocessing import LabelEncoder
data=df.copy() # creating a copy of our data since we do not want to change the original dataset
for i in list(data.columns)[:5]:
    data[i] = LabelEncoder().fit_transform(data[i])
data.head(5)
Although Label Encoding can be useful in many cases, there is a caveat. An algorithm will not be able to differentiate between a numerical variable and a label encoded variable. Due to this limitation, the values in the label encoded columns might be misunderstood. So in our gender column, '1' might be given a higher priority than '0' even when no numerical relation exists between the two classes.
Due to this limitation, we need to find a better way of representing our categorical variables in a numerical form.
Because of these shortcomings, Label Encoding is often not the right way to turn categorical variables into numbers. Instead, we use One-Hot Encoding. In One-Hot Encoding, we take the n categories in a column and create n (or n-1) columns from them. The new columns contain only binary values (0 or 1). So, if our gender column is one-hot encoded, we get two new columns: one for male and one for female. A row that originally had 'female' as the gender gets a 1 in the female column and a 0 in the male column, and vice versa.
Let's see how to implement this.
df=pd.get_dummies(data=df,drop_first=True) #drop_first can also be False
df.head(5)
In the above code, the function has a parameter drop_first. What does this mean? In one-hot encoding, we usually drop one of the columns created for each categorical variable, since that column is fully determined by the others (the so-called dummy variable trap). But in many cases dropping columns can have a negative impact on model performance. So it is always better to experiment and see what works in your case.
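If you want to see concretely what drop_first changes, one quick check is to re-encode an untouched copy of the raw data both ways and compare the shapes. A small sketch, assuming the CSV is still in the current directory:

# Re-read the raw CSV so we start from the unencoded data
raw_df = pd.read_csv("StudentsPerformance.csv")

keep_all = pd.get_dummies(data=raw_df, drop_first=False)
drop_one = pd.get_dummies(data=raw_df, drop_first=True)

# drop_one has exactly one column fewer per categorical variable
print(keep_all.shape, drop_one.shape)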
One-hot encoding is applied in almost all cases where we have to deal with categorical variables. But when the number of categories in a column gets too big, we cannot use this technique since the resulting dataset might become very large and difficult to handle. In that case, you might want to consider alternatives such as dropping the high-cardinality columns, using label encoding, or applying dimensionality reduction techniques.
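As a rough illustration of the first alternative, here is one way you could flag high-cardinality columns before encoding. This is only a sketch: the threshold of 15 is an arbitrary choice, and raw_df is the unencoded data re-read from the CSV.

# Arbitrary threshold for this sketch: treat object columns with more
# than 15 distinct values as high-cardinality candidates for dropping
MAX_CATEGORIES = 15

raw_df = pd.read_csv("StudentsPerformance.csv")

high_card_cols = [
    col for col in raw_df.select_dtypes(include="object").columns
    if raw_df[col].nunique() > MAX_CATEGORIES
]
reduced_df = raw_df.drop(columns=high_card_cols)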
Dealing with missing data is another important data preprocessing step. Missing values can have a huge impact on how your model performs. If a significant portion of the values in a column are missing, you might consider dropping that column. But dropping columns might actually result in leaving out essential information. So, we apply imputation.
In imputation, we replace missing data with substituted values. Here, we will discuss some of the common ways in which imputation is done:
- Replace by 0: Sometimes replacing numerical values with 0 can work. Suppose you have an Age column in your dataset. Filling 0 in the places where age is missing might not affect the model's accuracy.
- Replace by mean: You can also take the mean of all the values in the column and use that to fill the missing places. This is the most common imputation approach. But it is very sensitive to outliers.
- Replace by most_frequent: You can replace the missing values with the most frequent value. This can work for both categorical and numerical data.
- Replace using custom function: If you know how a particular column was generated, you can use that process to fill the missing values. This approach is usually not applicable when you did not generate the data yourself (for example, when it was downloaded from an external source).
In our dataset, we do not have any missing values so we do not need to apply imputation.
For further guidance: Imputation of missing values
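Even though our dataset does not need it, here is a minimal sketch of the mean and most-frequent strategies using scikit-learn's SimpleImputer on a small, made-up DataFrame (the toy data below is not part of our dataset):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up toy data with gaps, just to demonstrate the two strategies
toy = pd.DataFrame({
    "age": [22, np.nan, 35, 29],
    "city": ["Delhi", "Mumbai", np.nan, "Delhi"],
})

# Replace missing numerical values with the column mean
toy["age"] = SimpleImputer(strategy="mean").fit_transform(toy[["age"]]).ravel()

# Replace missing categorical values with the most frequent value
toy["city"] = SimpleImputer(strategy="most_frequent").fit_transform(toy[["city"]]).ravel()

print(toy)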
Even after converting our data into a machine-readable format, our work is not done. In any dataset, you will have parameters that are measured in different units. For example, you might have Time (measured in hours) and Distance (measured in miles). The values of these parameters will have different distributions and different min/max values. You cannot combine these different parameters in an ML model without taking their measurement units into account. So, we use standardization and normalization.
Both of these techniques transform the data in a way such that it either becomes dimensionless (in terms of measurement units) or the parameters end up having similar distributions.
The biggest difference between standardization and normalization is as follows:
- Standardization typically rescales the values to have a mean of 0 and a standard deviation of 1 (unit variance).
- Normalization typically rescales the values into a range of [0,1].
Because of the difference in the way they transform the values we get a different output in each case.
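Before we reach for scikit-learn, it helps to see what the two transforms look like written out by hand for a single numeric column (here the math score column from this dataset; the scalers below apply the same idea to every column at once):

# Manual versions of the two transforms for one column
col = df["math score"]

# Standardization: subtract the mean, divide by the standard deviation
# (ddof=0 matches StandardScaler, which uses the population standard deviation)
standardized = (col - col.mean()) / col.std(ddof=0)

# Normalization: squash the values into the [0, 1] range
normalized = (col - col.min()) / (col.max() - col.min())

print(standardized.head(3))
print(normalized.head(3))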
Let's scale our data.
from sklearn.preprocessing import StandardScaler
data=df.copy()
data=StandardScaler().fit_transform(data)
data[:3]
Now let's look at normalization.
Normalization typically rescales the values into a range of [0,1]. As you will see below, there is a notable difference between the output of standardization and normalization. Normalization will not transform the values of your categorical/one-hot encoded variables in any way. On the other hand, standardization transforms all the columns in the dataset.
So, let's normalize our data.
from sklearn.preprocessing import MinMaxScaler
data=df.copy()
data=MinMaxScaler().fit_transform(data)
data[:3]
Again, it is very important to experiment with both of these techniques to see which one works for you.
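One practical note before wrapping up: if you want to scale only the numeric columns and leave the one-hot encoded ones untouched, scikit-learn's ColumnTransformer can do that. A small sketch (the score column names are the ones from this dataset):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

numeric_cols = ["math score", "reading score", "writing score"]

# Scale only the numeric columns and pass every other column through unchanged
# (note: the scaled columns come first in the transformed output)
scaler = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",
)

data = scaler.fit_transform(df.copy())
data[:3]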
In this blog post we have seen the essentials of data preprocessing for tabular data. My goal was to provide a condensed overview of what data preprocessing looks like, with useful code snippets that anyone can use in their projects. Applying proper preprocessing can be extremely helpful in improving model performance and ensuring that your model is ready for the real world.
I hope you liked this blog post. Please share it with other ML enthusiasts.
See you on the next adventure.