EDA (Exploratory Data Analysis) is the process of investigating and analyzing data to summarize its main characteristics and gain a deep understanding of what it represents. We typically use data visualization techniques to create plots that are easier to understand and help us spot patterns in the data. In this blog post, we are going to explore how EDA can be automated to reduce the manual effort and time needed to understand our data.

Currently, there are many Python libraries that let us automate the EDA process, but here we are going to use the following three:

  • Sweetviz
  • Pandas Profiling
  • AutoViz

We will be using the HR Analytics dataset from Kaggle to experiment with Auto-EDA and see how useful it really is.

First things first, let's install our Python libraries and download our data.

Initial Setup

Installing the libraries

First, we are going to install our Python libraries. For detailed installation instructions, you can check out each library's GitHub repository.

We are going to use the 'pip' command as usual to install the libraries.

!pip install sweetviz autoviz

Collecting sweetviz
  Downloading https://files.pythonhosted.org/packages/6c/ab/5ef84ee47e5344e9a23d69e0b52c11aba9d8be658d4e85c3c52d4e7d9130/sweetviz-2.1.0-py3-none-any.whl (15.1MB)
Collecting autoviz
  Downloading https://files.pythonhosted.org/packages/89/20/8c8c64d5221cfcbc54679f4f048a08292a16dbad178af7c78541aa3af730/autoviz-0.0.81-py3-none-any.whl
Collecting tqdm>=4.43.0
  Downloading https://files.pythonhosted.org/packages/72/8a/34efae5cf9924328a8f34eeb2fdaae14c011462d9f0e3fcded48e1266d1c/tqdm-4.60.0-py2.py3-none-any.whl (75kB)
Installing collected packages: tqdm, sweetviz, autoviz
  Found existing installation: tqdm 4.41.1
    Uninstalling tqdm-4.41.1:
      Successfully uninstalled tqdm-4.41.1
Successfully installed autoviz-0.0.81 sweetviz-2.1.0 tqdm-4.60.0

Pandas Profiling provides some specific installation instructions, so let's follow those.

import sys
!{sys.executable} -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension

Collecting pandas-profiling[notebook]
  Downloading https://files.pythonhosted.org/packages/3b/a3/34519d16e5ebe69bad30c5526deea2c3912634ced7f9b5e6e0bb9dbbd567/pandas_profiling-3.0.0-py2.py3-none-any.whl (248kB)
Collecting PyYAML>=5.0.0
Collecting requests>=2.24.0
Collecting htmlmin>=0.1.12
Collecting phik>=0.11.1
Collecting tangled-up-in-unicode==0.1.0
Collecting pydantic>=1.8.1
Collecting visions[type_image_path]==0.7.1
Collecting jupyter-client>=6.0.0; extra == "notebook"
Collecting multimethod==1.4
Collecting imagehash; extra == "type_image_path"
Building wheels for collected packages: htmlmin, phik
  Building wheel for htmlmin (setup.py) ... done
  Building wheel for phik (setup.py) ... done
Successfully built htmlmin phik
ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: phik 0.11.2 has requirement scipy>=1.5.2, but you'll have scipy 1.4.1 which is incompatible.
Installing collected packages: PyYAML, requests, htmlmin, phik, tangled-up-in-unicode, pydantic, multimethod, imagehash, visions, jupyter-client, pandas-profiling
  Found existing installation: pandas-profiling 1.4.1
    Uninstalling pandas-profiling-1.4.1:
      Successfully uninstalled pandas-profiling-1.4.1
Successfully installed PyYAML-5.4.1 htmlmin-0.1.12 imagehash-4.2.0 jupyter-client-6.1.12 multimethod-1.4 pandas-profiling-3.0.0 phik-0.11.2 pydantic-1.8.2 requests-2.25.1 tangled-up-in-unicode-0.1.0 visions-0.7.1
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK

Downloading dataset from Kaggle

This blog post was built using Google Colab, a free notebook server provided by Google. You can use other services to run the code below, but you will have to figure out how to get the dataset. The dataset we use here is hosted on Kaggle, and you can download it from there directly.

In this notebook, we are going to download the dataset from Kaggle into Google Colab and store it in a directory in our Google Drive. Storing your data in Drive saves you the trouble of downloading the dataset every time you start a new Colab session.

For further guidance read this wonderful article by Mrinali Gupta: How to fetch Kaggle Datasets into Google Colab

So let's get to it!

First, we need to mount our Google Drive so that we can access all of its folders.

from google.colab import drive
drive.mount('/content/gdrive')
Mounted at /content/gdrive

Then we will use the following code to provide a config path for the Kaggle JSON API.

import os
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/kaggle/Insurance"

We will change our current directory to where we want the dataset to be downloaded.

%cd /content/gdrive/My Drive/kaggle/Insurance
/content/gdrive/My Drive/kaggle/Insurance

Now we can download the dataset from Kaggle.

!kaggle datasets download -d giripujar/hr-analytics
Downloading hr-analytics.zip to /content/gdrive/My Drive/kaggle/Insurance
  0% 0.00/111k [00:00<?, ?B/s]
100% 111k/111k [00:00<00:00, 15.9MB/s]

Let's unzip the files

!unzip \*.zip  && rm *.zip
Archive:  hr-analytics.zip
  inflating: HR_comma_sep.csv        

What files are present in the current directory?

!ls
AutoViz_Plots  HR_comma_sep.csv  kaggle.json

Our directory has a file named 'HR_comma_sep.csv'. That is our dataset.

Auto-EDA

What does our data look like?

import pandas as pd
df=pd.read_csv('HR_comma_sep.csv')
df.head(5)
   satisfaction_level  last_evaluation  number_project  average_montly_hours  time_spend_company  Work_accident  left  promotion_last_5years Department  salary
0                0.38             0.53               2                   157                   3              0     1                      0      sales     low
1                0.80             0.86               5                   262                   6              0     1                      0      sales  medium
2                0.11             0.88               7                   272                   4              0     1                      0      sales  medium
3                0.72             0.87               5                   223                   5              0     1                      0      sales     low
4                0.37             0.52               2                   159                   3              0     1                      0      sales     low

From the above, we can see that we have a tabular dataset with 10 columns, one of which is the target feature ('left'). Apart from the target, we have 4 categorical features ('Department', 'salary', 'Work_accident', 'promotion_last_5years') and 5 numerical features.
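This categorical/numerical split can also be checked directly with pandas. Below is a minimal sketch on a tiny synthetic frame that mimics a few of the HR columns (the values are invented for illustration); it treats object-dtype columns and low-cardinality 0/1 flags as categorical:

```python
import pandas as pd

# Tiny synthetic stand-in for the HR dataset (values are illustrative only)
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "average_montly_hours": [157, 262, 272],
    "Work_accident": [0, 0, 0],
    "left": [1, 1, 1],
    "Department": ["sales", "sales", "sales"],
    "salary": ["low", "medium", "medium"],
})

# Columns with object dtype, or very few distinct values, are likely categorical
categorical = [c for c in df.columns
               if df[c].dtype == "object" or df[c].nunique() <= 2]
numerical = [c for c in df.columns if c not in categorical]
print(categorical)  # the 0/1 flags plus Department and salary
print(numerical)
```

The `nunique()` threshold is only a heuristic: a binary flag like Work_accident is numeric in dtype but categorical in meaning, which is exactly the case it catches.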

The goal of EDA is to gain insights which are not apparent just by looking at the data. To do this, we use different kinds of visualizations and statistical techniques which transform our data into information that we can interpret and analyze. With these insights, we can decide what to do with our data, how to process it, how to model it and how to present it in an understandable form.

In this practical example, we are going to run each of our Auto-EDA tools and note down the insights gained about the data with each one. This will help us examine their utility and limitations.

SweetViz

SweetViz allows us to create a report which contains a detailed analysis of the data in each column along with information about the associations between the features. Let's use SweetViz to create a report and see what insights we can gather from it.

import sweetviz as sv
my_report = sv.analyze(df,target_feat='left')
my_report.show_notebook(w='100%')

Insights:

  • There are no missing values in any of the columns. This is an important insight, since it tells us that we do not need to perform any imputation during our preprocessing steps. It is also important for deciding which columns to keep when creating an ML model, since missing data can have a significant impact on the model's performance.

  • For each numerical column, we can see various summary statistics such as the range, standard deviation, median, average and skewness. This allows us to get a sense of the data in these columns, which can be very helpful in regression problems where some columns need to be transformed for effective modelling.

  • For each column, we can also see the distribution of the data as a histogram, with additional information about the target feature embedded in it. Visualizations like these help us build a tangible narrative and ask questions that might reveal something important. For example, the plot for the average_montly_hours column shows us that people working the longest hours leave at the highest rate. This might raise questions about the work environment and the reward system within the company.

  • Finally, we can also see the associations/correlations between the features in our dataset. This will help us remove redundant features when creating our ML model, and it also paints a better picture of what is really happening. For instance, number_project is strongly correlated with satisfaction_level: people who are given more projects to work on are, surprisingly, less satisfied with their jobs.
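The headline checks from the report above (missing values per column and pairwise correlations) can be reproduced by hand in pandas. Here is a minimal sketch on a synthetic frame, with column names borrowed from the dataset and invented values:

```python
import pandas as pd

# Invented values; column names follow the HR dataset
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
    "number_project": [2, 5, 7, 5],
    "average_montly_hours": [157, 262, 272, 223],
})

# Missing-value count per column (SweetViz reports this per feature)
missing = df.isnull().sum()
assert missing.sum() == 0  # nothing to impute in this toy frame

# Pairwise Pearson correlations, analogous to the report's association matrix
corr = df.corr()
print(corr.round(2))
```

SweetViz additionally uses association measures suited to categorical columns, so its matrix covers more than the plain Pearson correlations shown here.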

Pandas Profiling

Pandas Profiling is an extension of the pandas describe() function. This library allows us to conduct some quick data analysis on a pandas dataframe, and it is very useful for an initial investigation into the mysteries of our dataset.

Let's see what we can do with it.

from pandas_profiling import ProfileReport
ireport = ProfileReport(df,explorative=True)
ireport.to_notebook_iframe()

Insights:

  • Pandas Profiling gives a lot of information about each column in our dataset. It also tells us how many zero, negative and infinite values are present in each numerical column. This allows us to think about the preprocessing steps with much more clarity. For instance, negative values in a column can be a sign of data corruption; it could also be the case that negative values have a specific meaning in our dataset.

  • On toggling the details of a column, we can see more information like the common and extreme values in that column. Extreme values can be particularly problematic since they may be outliers that make modelling difficult. We might want to remove these outliers so that our ML model can learn and perform better.

  • We can also see the duplicate rows in our data. Duplicate rows can mean a lot of things; depending on the scenario and how they affect the quality of our analysis, we might decide to remove them or keep them. In our example, the duplicate rows represent employees who have left the company. Since our dataset is imbalanced, these duplicate rows might be an attempt at oversampling the minority class to create a more effective model.
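
The duplicate-row and class-imbalance figures are easy to verify by hand in pandas. A minimal sketch, again with a toy dataframe standing in for the HR data:

```python
import pandas as pd

# Toy stand-in for the HR dataframe `df`; the real notebook loads it
# from the Kaggle HR Analytics CSV.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.38, 0.72, 0.11, 0.38],
    "left":               [1, 1, 0, 1, 1],
})

# How many exact duplicate rows does the profiling report flag?
n_dupes = df.duplicated().sum()
print(n_dupes)

# Class balance of the target, to judge whether duplicates might be
# deliberate oversampling of the minority class.
print(df["left"].value_counts(normalize=True))

# Inspect the duplicates themselves before deciding to keep or drop them.
print(df[df.duplicated(keep=False)])
```

Looking at the duplicated rows alongside the target balance is what lets us make the keep-or-drop decision deliberately rather than by default.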

We have seen some of the insights we can gather using Pandas Profiling. Now, let's look at a more visual approach to EDA. Sweetviz and Pandas Profiling give us a lot of numbers to think about, but AutoViz relies purely on visualizations to help us understand our data.

AutoViz

In his book “Good Charts,” Scott Berinato writes: “A good visualization can communicate the nature and potential impact of information and ideas more powerfully than any other form of communication.”

AutoViz gives us a lot of different visualizations that can be useful in understanding and communicating the story behind our data. This can be tremendously useful when your audience comprises people from non-technical backgrounds who respond better to pictures than to raw numbers.

Let's try AutoViz.

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
sep = ','
dft = AV.AutoViz(filename="",
                 sep=sep,
                 depVar='left',
                 dfte=df, header=0, verbose=1, 
                 lowess=True, chart_format='jpg',
                 max_rows_analyzed=150000, max_cols_analyzed=30)
Shape of your Data Set: (14999, 10)
############## C L A S S I F Y I N G  V A R I A B L E S  ####################
Classifying variables in data set...
    Number of Numeric Columns =  2
    Number of Integer-Categorical Columns =  3
    Number of String-Categorical Columns =  2
    Number of Factor-Categorical Columns =  0
    Number of String-Boolean Columns =  0
    Number of Numeric-Boolean Columns =  2
    Number of Discrete String Columns =  0
    Number of NLP String Columns =  0
    Number of Date Time Columns =  0
    Number of ID Columns =  0
    Number of Columns to Delete =  0
    9 Predictors classified...
        This does not include the Target column(s)
        No variables removed since no ID or low-information variables found in data set

################ Binary_Classification VISUALIZATION Started #####################
Data Set Shape: 14999 rows, 10 cols
Data Set columns info:
* satisfaction_level: 0 nulls, 92 unique vals, most common: {0.1: 358, 0.11: 335}
* last_evaluation: 0 nulls, 65 unique vals, most common: {0.55: 358, 0.5: 353}
* number_project: 0 nulls, 6 unique vals, most common: {4: 4365, 3: 4055}
* average_montly_hours: 0 nulls, 215 unique vals, most common: {156: 153, 135: 153}
* time_spend_company: 0 nulls, 8 unique vals, most common: {3: 6443, 2: 3244}
* Work_accident: 0 nulls, 2 unique vals, most common: {0: 12830, 1: 2169}
* promotion_last_5years: 0 nulls, 2 unique vals, most common: {0: 14680, 1: 319}
* Department: 0 nulls, 10 unique vals, most common: {'sales': 4140, 'technical': 2720}
* salary: 0 nulls, 3 unique vals, most common: {'low': 7316, 'medium': 6446}
* left: 0 nulls, 2 unique vals, most common: {0: 11428, 1: 3571}
--------------------------------------------------------------------
   Columns to delete:
'   []'
   Boolean variables %s 
"   ['Work_accident', 'promotion_last_5years']"
   Categorical variables %s 
("   ['Department', 'salary', 'number_project', 'average_montly_hours', "
 "'time_spend_company', 'Work_accident', 'promotion_last_5years']")
   Continuous variables %s 
"   ['satisfaction_level', 'last_evaluation']"
   Discrete string variables %s 
'   []'
   Date and time variables %s 
'   []'
   ID variables %s 
'   []'
   Target variable %s 
'   left'
Total Number of Scatter Plots = 3
Time to run AutoViz (in seconds) = 22.712

 ###################### VISUALIZATION Completed ########################

Insights:

  • Scatter plots are an important part of EDA. They show us how our data is spread across two or three feature dimensions. From the plots above, we can see how the satisfaction level of employees who have left the company differs from that of employees who are still working there.

  • Visualizing the distribution of each feature based on the value of the target variable can also help in spotting important differences. For example, the normed histogram of average_montly_hours for the people who left the company is very different from that of the people still working there.

  • AutoViz also gives us other useful plots like bar charts, correlation heatmaps and box plots to uncover the patterns hidden in our data. In our example, the bar chart of average satisfaction_level by number_project shows that people who have worked on the most projects (7) have the lowest satisfaction level. On the other hand, they have the highest average evaluation scores.
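
That bar-chart insight can be reproduced numerically with a quick pandas groupby; the sketch below uses a toy dataframe standing in for the HR data:

```python
import pandas as pd

# Toy stand-in for the HR dataframe `df` from the post.
df = pd.DataFrame({
    "number_project":     [2, 2, 5, 7, 7],
    "satisfaction_level": [0.40, 0.44, 0.75, 0.10, 0.12],
    "last_evaluation":    [0.50, 0.54, 0.85, 0.90, 0.92],
})

# Average satisfaction and evaluation score per project count, the same
# quantities AutoViz plots as bar charts.
summary = df.groupby("number_project")[["satisfaction_level",
                                        "last_evaluation"]].mean()
print(summary)
```

In this toy data, as in the real dataset, the 7-project group sits at the bottom on satisfaction and at the top on evaluation scores.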

Now that we have explored all three Auto-EDA tools, let's talk about their utility and limitations.

Conclusion

EDA lets us understand our data and develop a deeper understanding of the real world it represents. In the previous sections, we have looked at different Auto-EDA tools and noted some of the insights they can provide. Although these tools are tremendously useful, they have their limitations.

Firstly, all of the EDA tools we have seen work very well on tabular datasets, but they aren't useful for data from other domains like computer vision and NLP.

Secondly, they don't always give us the best information and plots. AutoViz uses some internal heuristics to decide which plots are the most useful, but it is not clear whether these heuristics always work. It is still very hard to answer the question: what information is the most useful for a data scientist?

Finally, it's very difficult to configure these tools in a way that is best suited to your data. They are not very flexible: although we can customize them to some degree, our problems will always have specific requirements that these tools cannot anticipate.

In conclusion, Auto-EDA is very useful to launch an initial investigation into the data. It gives us the basic information necessary to delve deeper into the data and ask more interesting questions. More importantly, it gives us a way of conducting quick data analysis with much less manual effort.

I hope to see more such tools that will help data scientists tell better stories through their data.

See you on the next adventure!!!