Python for Data Analysis: A Practical Guide for Beginners

Post by **tnplpramanik** » Thu Dec 05, 2024 4:54 am

A data analyst uses programming tools to work through large amounts of complex data and find meaningful insights in it. In short, an analyst is someone who extracts meaning from messy data. An analyst who uses Python for data analysis needs skills in the following areas to be effective:

Domain Knowledge
To extract the data that is relevant to their workplace, an analyst needs to understand the domain they work in.

Programming Skills
As a data analyst, you will need to know the right libraries to use in order to clean the data, filter it, and get results from it.

Statistics
An analyst may need statistical tools to help draw conclusions from the data.

Data Visualization Skills with Python
A data analyst needs to have strong data visualization skills in order to summarize and present data to others.

Results
Finally, an analyst needs to communicate their findings to a stakeholder or client. This means they will need to explain the story behind the data and be able to narrate it.

In this article, I provide a complete process for using Python for data analysis.

If you follow this tutorial and write the code as I did, you will be able to reuse the code and tools in future data analysis projects.

What we will see in this article:
The Power of Python Professional for Data Analysis

Prerequisites

The analysis

Reading data

Pandas Profiling

Data visualization

Data storytelling

We’ll start by downloading and cleaning the dataset, then move on to analyzing and visualizing it. Finally, we’ll tell a story around our discoveries from this data.

I will be using a dataset from Kaggle called the Pima Indians Diabetes Database, which you can download to perform the analysis.

Prerequisites
For this entire analysis, I will be using Jupyter Notebook. You can use any Python IDE you like.

You will need to install libraries along the way, and I will provide links that will walk you through the installation process.

The Analysis

Once you have downloaded the dataset, you will need to read the .csv file as a DataFrame in Python. You can do this using the Pandas library.

If you don't have it installed, you can do so with a simple “pip install pandas” in your terminal. If you face any difficulties with the installation or simply want to learn more about the Pandas library, check out the Pandas library documentation.

Read Data
To read the data in Python, you will need to import Pandas first. Then, you can read the file and create a DataFrame with the following lines of code:

import pandas as pd
df = pd.read_csv('diabetes.csv')

To preview the first few rows of the DataFrame, run:

df.head()

The output shows 9 columns, each holding a variable related to patient health.
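
Besides `head()`, a couple of standard pandas attributes help confirm that the file loaded as expected. The sketch below uses a tiny made-up CSV string in place of `diabetes.csv` so that it runs on its own; the three sample rows and the name `sample_df` are illustrative assumptions, not real records from the dataset.

```python
import io

import pandas as pd

# Made-up sample rows standing in for diabetes.csv (illustrative only).
sample_csv = io.StringIO(
    "Pregnancies,Glucose,Age,Outcome\n"
    "2,120,31,0\n"
    "0,95,24,0\n"
    "5,160,47,1\n"
)

sample_df = pd.read_csv(sample_csv)

n_rows, n_cols = sample_df.shape        # (number of rows, number of columns)
column_names = list(sample_df.columns)  # header names as a plain list

print(n_rows, n_cols)
print(column_names)
print(sample_df.head(2))                # first two rows only
```

With the real file, the same calls on `df` would report the full row count and all 9 column names.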

As an analyst, you will need to have a basic understanding of these variables:

Pregnancies: The number of pregnancies the patient has had
Glucose: The patient's glucose level
BloodPressure: The patient's blood pressure
SkinThickness: The patient's skin thickness
Insulin: The patient's insulin level
BMI: The patient's Body Mass Index
DiabetesPedigreeFunction: History of diabetes mellitus in relatives
Age: The patient's age
Outcome: Whether or not the patient has diabetes

Numerical variables
These are variables that are measured and have a numerical meaning. All variables in this dataset, except “Outcome”, are numerical.

Categorical variables
They are also called nominal variables, and have two or more categories into which they can be classified.

The “Outcome” variable is categorical: “0” represents the absence of diabetes, and “1” represents the presence of diabetes.
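
To see how pandas has typed each column and how the categories are distributed, `dtypes` and `value_counts()` are usually enough. This is a minimal sketch on made-up rows (the values are illustrative, not taken from the dataset), and it assumes the column names `Glucose`, `BMI` and `Outcome` match the CSV header.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "Glucose": [120, 95, 160, 110],
    "BMI": [28.1, 22.4, 35.0, 30.2],
    "Outcome": [0, 0, 1, 1],
})

# The 0/1 column is loaded as an integer even though it is categorical in meaning.
print(toy_df.dtypes)

# Count how many rows fall into each category.
outcome_counts = toy_df["Outcome"].value_counts()
print(outcome_counts)

# Optionally tell pandas to treat the column as categorical.
toy_df["Outcome"] = toy_df["Outcome"].astype("category")
n_categories = len(toy_df["Outcome"].cat.categories)
print(n_categories)
```

Converting to the `category` dtype is optional; it mainly makes the intent explicit and stops the column from being treated as a measurement.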

A Brief Note

Before continuing with the analysis, I would like to make a quick note:

Analysts are human, and we often come with preconceived notions about what we expect to see in the data.

For example, you might expect an older person to be more likely to have diabetes. You may look for this correlation in the data, but it will not always be there.

Keep an open mind during the analysis process and don't let your preconceptions affect your decision making.

Pandas Profiling

Pandas Profiling is a very useful tool for analysts. It generates an analysis report on the DataFrame and helps to better understand the correlation between the variables. The package has since been renamed ydata-profiling, where the equivalent import is “from ydata_profiling import ProfileReport”.

To generate a Pandas Profiling report, run the following lines of code:

import pandas_profiling as pp
pp.ProfileReport(df)

This report gives you some general statistical information about the dataset.

The overview section of the report shows that there are no missing or duplicate cells in our DataFrame.

Finding this information would normally require us to run a few lines of code ourselves, but Pandas Profiling generates it in one step.
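
For comparison, the manual check for missing and duplicate values might look like the sketch below. It uses plain pandas calls (`isna()` and `duplicated()`) on a small made-up frame that deliberately contains one missing value and one repeated row; the data is illustrative only and does not come from the diabetes dataset.

```python
import pandas as pd

# Made-up rows: one missing Glucose value, and the last row repeats the second.
toy_df = pd.DataFrame({
    "Glucose": [120.0, 95.0, float("nan"), 95.0],
    "Age": [31, 24, 47, 24],
})

# Number of missing cells in each column, then across the whole frame.
missing_per_column = toy_df.isna().sum()
total_missing = int(missing_per_column.sum())

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(toy_df.duplicated().sum())

print(missing_per_column)
print(total_missing, duplicate_rows)
```

On a clean dataset both totals would be zero, which is what the report indicates for this one.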

Pandas Profiling also provides more information about each variable. Here’s an example:

This is the information generated for the variable called “Pregnancies”.

For an analyst, this report saves a lot of time, as we don't have to go through each individual variable and execute many lines of code.

From here we can see this:

The variable “Pregnancies” has 17 distinct values.
The smallest number in the pregnancies column is “0”, and the largest is “17”.
The share of “zero” values in this column is fairly low (only 14.5%). This means that over 80% of the patients in the dataset have been pregnant at least once.
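
The same per-variable figures can be computed by hand for a single column. The sketch below shows one way to do it with standard pandas methods on a short made-up series; the eight values are illustrative and are not the real `Pregnancies` column.

```python
import pandas as pd

# Made-up values standing in for the Pregnancies column (illustrative only).
pregnancies = pd.Series([0, 1, 1, 3, 0, 6, 2, 1], name="Pregnancies")

distinct_values = pregnancies.nunique()        # number of different values
smallest = pregnancies.min()
largest = pregnancies.max()
zero_share = (pregnancies == 0).mean() * 100   # percentage of rows equal to 0

print(distinct_values, smallest, largest, zero_share)
```

The share of non-zero rows is simply `100 - zero_share`, which is how the "pregnant at least once" figure above is derived.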

In the report, there is information like this provided for each variable. This helps us a lot in our understanding of the dataset and all the columns contained in it.

The report also includes a correlation matrix. It helps us better understand the correlation between the variables in the dataset.

There is a slight negative correlation between the “Age” and “Skin Thickness” variables, which can be further examined in the visualization section of the analysis.
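
A correlation matrix like this can also be produced directly with `DataFrame.corr()`, which computes Pearson correlation by default. The following is a sketch on made-up numbers chosen so that the sign of the relationship is easy to see; it does not reproduce the strength of the correlations in the real dataset, and the column names are assumed to match the CSV header.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame({
    "Age": [21, 30, 42, 55, 63],
    "SkinThickness": [35, 30, 28, 20, 18],
    "Glucose": [90, 105, 120, 140, 150],
})

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = toy_df.corr()

# Look up a single pair; a negative value means one falls as the other rises.
age_skin = corr_matrix.loc["Age", "SkinThickness"]

print(corr_matrix.round(2))
print(age_skin < 0)
```

Values close to 0 indicate a weak linear relationship, so a "slight" correlation in the report corresponds to a coefficient of small magnitude.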

Since there are no missing rows or duplicates in the dataset as seen above, we do not need to do any additional data cleaning or adjustments.