If you ask data scientists which part of the ML pipeline takes the most time, they will probably tell you that it's the data preprocessing stage. Ensuring your data is in the right form before you feed it into your ML/DL model is of paramount importance. If you feed garbage to your model, you will get garbage as output. In this blog post we will look at some of the essential techniques you can use to preprocess your data properly.

But first, we need to get some data.

Downloading dataset from Kaggle

This notebook is built using Google Colab. It is a notebook server provided by Google for free. You can also use other services to run the code below but you will have to figure out how to get the dataset. The dataset that we use here is present on Kaggle and you can directly download it from here.

In this notebook, we are going to download the dataset from Kaggle into Google Colab and store it in a directory in our Google Drive. Storing your data in the Drive saves you from the trouble of downloading the dataset every time you start a new session in Colab.

For further guidance read this wonderful article by Mrinali Gupta: How to fetch Kaggle Datasets into Google Colab

So let's get to it!

First, we need to mount our Google Drive so that we can access all the folders in it.

from google.colab import drive
drive.mount('/content/gdrive')
Mounted at /content/gdrive

Then we will use the following code to provide the config path where our kaggle.json API token is stored.

import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/kaggle/StudentPerformance"

We will change our current directory to where we want the dataset to be downloaded.

%cd /content/gdrive/My Drive/kaggle/StudentPerformance
/content/gdrive/My Drive/kaggle/StudentPerformance

Now we can download the dataset from Kaggle.

!kaggle datasets download -d spscientist/students-performance-in-exams
Downloading students-performance-in-exams.zip to /content/gdrive/My Drive/kaggle/StudentPerformance
  0% 0.00/8.70k [00:00<?, ?B/s]
100% 8.70k/8.70k [00:00<00:00, 1.14MB/s]

Let's unzip the files

!unzip \*.zip  && rm *.zip
Archive:  students-performance-in-exams.zip
  inflating: StudentsPerformance.csv  

What files are present in the current directory?

!ls
kaggle.json  StudentsPerformance.csv

You can see that there is a "StudentsPerformance.csv" file present in the directory. That is our dataset.

Exploring the data

Before we apply any preprocessing steps to the data, we need to know what kind of data the dataset contains. Is it textual data? Is it numerical data? Are there any dates present? What about geographic locations?

There are a lot of questions we can ask about the data in our dataset. So before we move further, we need to get a sense of what hidden knowledge our dataset contains.

In the code cells below you will see some of the most common steps you can apply to gather information about your data.

import pandas as pd
df=pd.read_csv("StudentsPerformance.csv")

First 5 rows of the dataset
Seeing the first and last few rows can be really helpful in creating a mental picture of the data. It also allows you to map out the possible roadblocks you are going to face in achieving your end goal.

df.head(5)
gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
0 female group B bachelor's degree standard none 72 72 74
1 female group C some college standard completed 69 90 88
2 female group B master's degree standard none 90 95 93
3 male group A associate's degree free/reduced none 47 57 44
4 male group C some college standard none 76 78 75

Last 5 rows of the dataset

df.tail(5)
gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
995 female group E master's degree standard completed 88 99 95
996 male group C high school free/reduced none 62 55 55
997 female group C high school free/reduced completed 59 71 65
998 female group D some college standard completed 68 78 77
999 female group D some college free/reduced none 77 86 86

Information about data type of columns and null values
Knowing the data type of each column is crucial in choosing the right preprocessing step for that column as well as understanding what the values in the column represent.

Another crucial piece of information is the number of non-null values. It helps you in deciding which columns need imputation.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB

Checking if the dataset has null values
In the last cell, we saw one way to check whether there are null values in your dataset. Below is another method to confirm whether your data has any null/missing values. As you can see below, this dataset does not have any null values.

for i in list(df.columns):
  bool_series=pd.isnull(df[i])
  print("Column:{} has {} null values.".format(i,df[bool_series].shape[0]))
Column:gender has 0 null values.
Column:race/ethnicity has 0 null values.
Column:parental level of education has 0 null values.
Column:lunch has 0 null values.
Column:test preparation course has 0 null values.
Column:math score has 0 null values.
Column:reading score has 0 null values.
Column:writing score has 0 null values.
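
If you prefer a one-liner, pandas can produce the same per-column counts directly:

# count the missing values in every column in a single call
df.isnull().sum()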

Describing the numerical columns in the dataset
If you are applying regression or even classification, knowing the summary statistics can help you decide how to handle the numerical features. Perhaps you need to apply some transformations before fitting a regression, or perhaps some numerical features can be dropped because they do not contribute much.

df.describe()
math score reading score writing score
count 1000.00000 1000.000000 1000.000000
mean 66.08900 69.169000 68.054000
std 15.16308 14.600192 15.195657
min 0.00000 17.000000 10.000000
25% 57.00000 59.000000 57.750000
50% 66.00000 70.000000 69.000000
75% 77.00000 79.000000 79.000000
max 100.00000 100.000000 100.000000

How many unique values does each column have
You might want to know how many unique values each column has. This is helpful when you have a big dataset and you are thinking of generating more features from the existing features. This is also important when you are dealing with cases where the Curse of Dimensionality becomes relevant.

df.nunique()
gender                          2
race/ethnicity                  5
parental level of education     6
lunch                           2
test preparation course         2
math score                     81
reading score                  72
writing score                  77
dtype: int64

The essential preprocessing techniques

In this section, we will cover the essential preprocessing techniques you can apply to your data before feeding it into your model. This is by no means an exhaustive list, but it covers the most common techniques used in the ML industry.

The order in which we apply these techniques is very important, since each preprocessing step transforms the data in a way that may make it incompatible with another preprocessing step.

We are going to apply our preprocessing techniques in the following order:

  1. Label Encoding (if needed)
  2. One-hot Encoding
  3. Imputation
  4. Normalization or Standardization

Label Encoding

We saw in the previous section that our data contains columns which have string values. Although we can understand what these string values represent, a machine does not. So, to make these values machine-readable, we have to find a way to represent them numerically. Label Encoding is one such method of accomplishing this.

In Label Encoding, we assign a unique integer value to each class in the data. This means that the gender column in our dataset will be encoded like so:

Original values        female    male
Label encoded values   0         1

Note that LabelEncoder assigns the integers in alphabetical order of the class names, which is why 'female' maps to 0 and 'male' maps to 1.

Let's see this in action. We are going to label encode the following columns in our dataset:

  • gender
  • race/ethnicity
  • parental level of education
  • lunch
  • test preparation course

from sklearn.preprocessing import LabelEncoder
data=df.copy() # creating a copy of our data since we do not want to change the original dataset
# label encode the first five columns, which hold the categorical variables
for i in list(data.columns)[:5]:
  data[i]=LabelEncoder().fit_transform(data[i])
data.head(5)
gender race/ethnicity parental level of education lunch test preparation course math score reading score writing score
0 0 1 1 1 1 72 72 74
1 0 2 4 1 0 69 90 88
2 0 1 3 1 1 90 95 93
3 1 0 0 0 1 47 57 44
4 1 2 4 1 1 76 78 75

Although Label Encoding can be useful in many cases, there is a caveat. An algorithm cannot tell a genuinely numerical variable apart from a label encoded one, so it may treat the encoded integers as if they carried an order or magnitude. In our gender column, '1' might be treated as greater than '0' even though no such numerical relationship exists between the two classes.

Due to this limitation, we need to find a better way of representing our categorical variables in a numerical form.

One-hot encoding

Because of these shortcomings of Label Encoding, it is often not the right way to turn categorical variables into numerical values. Instead, we use One-Hot Encoding. In One-Hot Encoding, we take the n categories in a column and create n or n-1 new columns from them. The new columns contain only binary values (0 or 1). So, if our gender column is one-hot encoded, we get two new columns: Male and Female. A row that previously had 'female' as the gender gets a 1 in the Female column and a 0 in the Male column, and vice versa.

Let's see how to implement this.

df=pd.get_dummies(data=df,drop_first=True) #drop_first can also be False
df.head(5)
math score reading score writing score gender_male race/ethnicity_group B race/ethnicity_group C race/ethnicity_group D race/ethnicity_group E parental level of education_bachelor's degree parental level of education_high school parental level of education_master's degree parental level of education_some college parental level of education_some high school lunch_standard test preparation course_none
0 72 72 74 0 1 0 0 0 1 0 0 0 0 1 1
1 69 90 88 0 0 1 0 0 0 0 0 1 0 1 0
2 90 95 93 0 1 0 0 0 0 0 1 0 0 1 1
3 47 57 44 1 0 0 0 0 0 0 0 0 0 0 1
4 76 78 75 1 0 1 0 0 0 0 0 1 0 1 1

In the above code, the function has a parameter drop_first. What does this mean? In one-hot encoding, we usually drop one of the columns created for each categorical variable to avoid perfect correlation among the resulting dummy columns (the so-called dummy variable trap). But in many cases dropping columns can have a negative impact on model performance. So it is always better to experiment and see what works in your case.

One-hot encoding is applied in almost all cases where we have to deal with categorical variables. But when the number of categories in a column gets too big, this technique becomes impractical, since the resulting dataset may be very large and difficult to handle. In that case, you might want to consider alternatives such as dropping the high-cardinality columns, using label encoding, or applying dimensionality reduction techniques.
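
As a rough sketch of how you might screen for such columns (the variable name raw_df and the threshold of 15 categories are arbitrary choices for illustration, not something this particular dataset needs):

# re-read the raw, unencoded data, since df has already been one-hot encoded above
raw_df = pd.read_csv("StudentsPerformance.csv")

# list the categorical columns whose number of categories exceeds an arbitrary threshold
high_cardinality = [col for col in raw_df.select_dtypes(include="object").columns
                    if raw_df[col].nunique() > 15]
print(high_cardinality)  # empty for this dataset, since every column has only a few categories

# one simple option: drop the wide columns before one-hot encoding the rest
encoded = pd.get_dummies(raw_df.drop(columns=high_cardinality), drop_first=True)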

Imputation

Dealing with missing data is another important data preprocessing step. Missing values can have a huge impact on how your model performs. If a significant portion of values in a column is missing, you might consider dropping that column. But dropping columns can also throw away essential information. So, we apply imputation.

In imputation, we replace missing data with substituted values. Here, we will discuss some of the common ways in which imputation is done:

  • Replace by 0: Sometimes replacing numerical values with 0 can work. Suppose you have an Age column in your dataset. Filling 0 in the places where age is missing might not affect the model's accuracy.
  • Replace by mean: You can also take the mean of all the values in the column and use that to fill the missing places. This is the most common imputation approach. But it is very sensitive to outliers.
  • Replace by most_frequent: You can replace the missing values with the most frequent value. This can work for both categorical and numerical data.
  • Replace using custom function: If you know how a particular column was generated, you can use that process to fill in the missing values. This approach is often not applicable, though, since the data usually comes from an external source.

In our dataset, we do not have any missing values so we do not need to apply imputation.
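
Still, for reference, here is a minimal sketch of how imputation could look with scikit-learn's SimpleImputer. The tiny age DataFrame below is made up purely for illustration:

import numpy as np
from sklearn.impute import SimpleImputer

# a made-up frame with one missing age value
example = pd.DataFrame({"age": [22, np.nan, 35, 29]})

# strategy can be "mean", "median", "most_frequent" or "constant"
imputer = SimpleImputer(strategy="mean")
example["age"] = imputer.fit_transform(example[["age"]]).ravel()
print(example)  # the NaN is replaced by the mean of the other ages (about 28.67)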

For further guidance: Imputation of missing values

Standardization

Even after converting our data into a machine-readable format, our work is not done. In any dataset, you will have parameters that are measured in different units. For example, you might have Time (measured in hours) and Distance (measured in miles). The values of these parameters will have different distributions and different min/max values. You cannot combine these different parameters in an ML model without taking their measurement units into account. So, we use standardization and normalization.

Both of these techniques transform the data in a way such that it either becomes dimensionless (in terms of measurement units) or the parameters end up having similar distributions.

The biggest difference between standardization and normalization is as follows:

  • Standardization typically rescales the values to have a mean of 0 and a standard deviation of 1 (unit variance).
  • Normalization typically rescales the values into a range of [0,1].

Because of the difference in how they transform the values, we get a different output in each case.
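
To make the difference concrete, here is a hand-rolled version of both formulas applied to a single column. One caveat: pandas' .std() defaults to the sample standard deviation, while StandardScaler uses the population version, so the manual z-scores will differ very slightly from the scaler's output below.

# z-score standardization: (x - mean) / std
math_std = (df["math score"] - df["math score"].mean()) / df["math score"].std()

# min-max normalization: (x - min) / (max - min)
math_norm = (df["math score"] - df["math score"].min()) / (
    df["math score"].max() - df["math score"].min())

print(math_std.head(3))
print(math_norm.head(3))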

Let's scale our data.

from sklearn.preprocessing import StandardScaler
data=df.copy()
data=StandardScaler().fit_transform(data)
data[:3]
array([[ 0.39002351,  0.19399858,  0.39149181, -0.96462528,  2.0647416 ,
        -0.68441857, -0.59583014, -0.40347329,  2.73396713, -0.49374193,
        -0.2503982 , -0.54036068, -0.4669334 ,  0.74188112,  0.74674788],
       [ 0.19207553,  1.42747598,  1.31326868, -0.96462528, -0.4843221 ,
         1.46109419, -0.59583014, -0.40347329, -0.36576885, -0.49374193,
        -0.2503982 ,  1.85061578, -0.4669334 ,  0.74188112, -1.33914006],
       [ 1.57771141,  1.77010859,  1.64247471, -0.96462528,  2.0647416 ,
        -0.68441857, -0.59583014, -0.40347329, -0.36576885, -0.49374193,
         3.99363901, -0.54036068, -0.4669334 ,  0.74188112,  0.74674788]])

Now let's look at normalization.

Normalization

Normalization typically rescales the values into a range of [0,1]. As you will see below, there is a notable difference between the output of standardization and normalization. Normalization will not change the values of your categorical/one-hot encoded variables at all, since their minimum is already 0 and their maximum is already 1. Standardization, on the other hand, transforms every column in the dataset.

So, let's normalize our data.

from sklearn.preprocessing import MinMaxScaler
data=df.copy()
data=MinMaxScaler().fit_transform(data)
data[:3]
array([[0.72      , 0.6626506 , 0.71111111, 0.        , 1.        ,
        0.        , 0.        , 0.        , 1.        , 0.        ,
        0.        , 0.        , 0.        , 1.        , 1.        ],
       [0.69      , 0.87951807, 0.86666667, 0.        , 0.        ,
        1.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 1.        , 0.        , 1.        , 0.        ],
       [0.9       , 0.93975904, 0.92222222, 0.        , 1.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        1.        , 0.        , 0.        , 1.        , 1.        ]])

Again, it is very important to experiment with both of these techniques to see which one works for you.

Conclusion

In this blog post we have covered the essentials of data preprocessing for tabular data. My goal was to provide a condensed overview of what data preprocessing looks like, with useful code snippets that anyone can use in their projects. Applying proper preprocessing can go a long way toward improving model performance and ensuring that your model is ready for the real world.

I hope you liked this blog post. Please share it with other ML enthusiasts.

See you on the next adventure.