Hi everyone!!
Recently, the Microsoft Student Partner Community launched an exclusive contest for the Microsoft Student Partners of India. For this contest, we are supposed to write a blog on the topic of Machine Learning or Artificial Intelligence. So, as a part of the “MSP Developer Stories” initiative, here is my small contribution to the contest.
I am a newbie at Data Science and Machine Learning. I tried to learn Machine Learning a year or so ago, but things didn’t work out well for me then, because I couldn’t find any good resources to learn from. Much of the code available on the internet was too complex for a beginner to digest.
Since I had to participate in this contest, I decided to give it another try. I learned the basics of Machine Learning and Data Science from the Learning Paths on Microsoft Learn, and the results were unexpectedly great this time. I strongly recommend the Microsoft Learning Paths to beginners, as I had an amazing experience with them.
In this blog, I will be sharing my first ever Machine Learning project, built from scratch.
Introduction to Machine Learning:
Some people consider Machine Learning a subset of Artificial Intelligence, and the Internet is flooded with definitions of that kind. But I feel that Machine Learning is a bridge between Data Analysis and Artificial Intelligence. It is a tool that uses Data Analytics to build models, and those models in turn power Artificial Intelligence.
So, let's break our project into 3 modules:
1. Analysis of the data
2. Building a Machine Learning model
3. Implementing Artificial Intelligence for predictions
Coronavirus is the hot topic right now, so I have decided to choose COVID-19 as the topic of my project. So, let’s begin…
Application of Data Science on the Analysis of COVID-19 patients of Italy:
Data is like the food provided to the AI to help it grow. The more the data provided, the better the AI model is expected to be. Can we improve the quality of the food? Data Science is used to clean the data to remove its anomalies and imperfections, thus improving the quality of data to be served to the model.
In the first part of our project, we will visualize the data using some statistical tools provided by various Python libraries. We will use Azure Notebooks as the IDE. We will also clean the data if any anomalies are found.
Making the environment ready:
1. Go to https://notebooks.azure.com/. Sign in to your Microsoft account.
Students can get a free account with $100 of credit. Check more details here: https://azure.microsoft.com/en-in/free/students/
We will be creating a Jupyter Notebook hosted on Azure.
2. Click on My Projects.
3. Click on +New Project.
4. Write a catchy project name like COVID-19 Analysis or My First ML Project. Click on create.
5. The project directory will open. Click on +New and then select Notebook.
6. Write a name for your notebook and select Python 3.6. An extension “.ipynb” will be added automatically after the name of your notebook, which suggests that this document is an interactive python notebook. Click on the New button to create a new notebook.
7. Now let’s download a dataset from Kaggle. Here is the direct link of the dataset:
Download the zip file and extract it on your local machine.
8. In Azure Notebook created above, click on File-> Upload. Select the national_data.csv file.
9. Make sure you select /project before clicking on Start Upload.
Let the Code begin:
Note: The code line in each step can be copy-pasted directly into the Jupyter notebook for convenience, though I strongly suggest typing the code yourself.
1. We will first import the pandas library. It is an open-source Python library and is very helpful for data analytics.
import pandas as pd
Press Shift+Enter to execute the selected cell. You will not see any output; the library is simply imported.
2. Now create a DataFrame by reading the uploaded CSV file. Executing this cell produces no visible output either.
df=pd.read_csv('national_data.csv')
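As a small aside, read_csv can also convert a date column while loading. The sketch below is only an illustration: the CSV text is a made-up two-row sample that I typed in by hand (it is not the Kaggle file), and it assumes the 'data' column holds ISO-style timestamps.

```python
import io

import pandas as pd

# Toy CSV text standing in for national_data.csv; the numbers are invented.
sample_csv = io.StringIO(
    "data,stato,totale_casi,deceduti\n"
    "2020-03-01T17:00:00,ITA,100,5\n"
    "2020-03-02T18:00:00,ITA,150,8\n"
)

# parse_dates converts the 'data' column from plain text to datetime values.
sample_df = pd.read_csv(sample_csv, parse_dates=["data"])

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # the 'data' column is now a datetime type
```

With the real file you would pass the file name instead of the StringIO object.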
3. Let’s see what our data looks like. The head() function shows the top 5 rows of our data.
df.head()
After executing, you should see the first five rows of the dataset.
4. OOPS!! The column headings are in Italian. In a new cell, I set the cell type to Markdown. I googled the meaning of all these headings and noted them in the Markdown cell for reference.
data: Date of notification
deceduti: Deceased people
dimessi_guariti: Discharged, people healed
isolamento_domiciliare: People in home isolation
note_en: note English
note_it: note Italian
nuovi_positivi: Total amount of current positive cases (note: the Italian literally means "new positives", i.e. newly reported positive cases)
ricoverati_con_sintomi: New amount of current positive cases (note: the Italian literally means "hospitalized with symptoms")
stato: State of reference
tamponi: tests/swabs
terapia_intensiva: Hospitalized in intensive care
totale_casi: Total amount of positive cases
totale_ospedalizzati: Total hospitalized patients
totale_positivi: total positive
variazione_totale_positivi: total change positive
5. Why not rename the columns of the dataframe? Ok, let’s do this.
df.rename(columns={
    'data':'Date of notification',
    'deceduti':'Deceased people',
    'dimessi_guariti':'Discharged, people healed',
    'isolamento_domiciliare':'People in home isolation',
    'note_en':'note english',
    'note_it':'note italian',
    'nuovi_positivi':'Total amount of current positive cases',
    'ricoverati_con_sintomi':'New amount of current positive cases',
    'stato':'State of reference',
    'tamponi':'tests/swabs',
    'terapia_intensiva':'Hospitalized in intensive care',
    'totale_casi':'Total amount of positive cases',
    'totale_ospedalizzati':'Total hospitalised patients',
    'totale_positivi':'total positive',
    'variazione_totale_positivi':'total change positive'
}, inplace=True)
6. Let’s see how our data looks now.
df.head()
Yeah, the data looks good now. Note that the rename function does not change the original CSV file. With inplace=True, it only modifies the DataFrame held in memory.
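To make the effect of inplace=True a bit more concrete, here is a minimal sketch on a tiny hand-made DataFrame (the values are invented, and the name toy is just for illustration).

```python
import pandas as pd

# Tiny hand-made frame; the values are invented for illustration.
toy = pd.DataFrame({"deceduti": [5, 8], "tamponi": [1000, 1500]})

# Without inplace=True, rename returns a new DataFrame and leaves 'toy' untouched.
renamed = toy.rename(columns={"deceduti": "Deceased people"})
print(list(toy.columns))      # ['deceduti', 'tamponi']
print(list(renamed.columns))  # ['Deceased people', 'tamponi']

# With inplace=True, 'toy' itself is modified and the call returns None.
toy.rename(columns={"tamponi": "tests/swabs"}, inplace=True)
print(list(toy.columns))      # ['deceduti', 'tests/swabs']
```

Either style works; assigning the returned DataFrame to a variable is often easier to follow when several steps are chained.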
7. Let’s proceed towards data cleaning. Now check if any NULL value is present in any column of the data.
df.isnull().sum()
8. Yes, 2 columns contain a lot of null values. So let’s drop these columns of the dataframe.
df=df.drop(['note english','note italian'],axis=1)
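If you would rather not type the column names by hand, the mostly empty columns can also be found programmatically. This is only a sketch on invented data; the 0.5 threshold is my own assumption, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Invented toy data with two mostly empty note columns.
toy = pd.DataFrame({
    "totale_casi": [100, 150, 220],
    "note_it": [np.nan, np.nan, "text"],
    "note_en": [np.nan, np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing values in each column.
null_fraction = toy.isnull().mean()

# Assumed cut-off: drop any column that is more than half empty.
threshold = 0.5
mostly_empty = null_fraction[null_fraction > threshold].index.tolist()

cleaned = toy.drop(columns=mostly_empty)
print(mostly_empty)           # ['note_it', 'note_en']
print(list(cleaned.columns))  # ['totale_casi']
```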
9. Now, check if the columns have been dropped or not.
df.head()
10. Yes, the columns have been removed. But I am going to check once again if the data is really clean or not.
df.isnull().sum()
11. None of the columns has any NULL value now. The data is ready to be processed and is usable for various calculations. Before we visualize it, let’s look at its summary statistics.
df.describe()
It gives an overall statistical description of each numeric column: count, mean, standard deviation, minimum, quartiles and maximum. This function is very useful as it builds our understanding of the data. We get to see what kind of values are present and what ranges we need to keep in mind.
12. Now we are going to plot a curve to see the increase in total corona positive cases in Italy over time. I made a small dataframe dfl which includes a copy of one column of our main dataframe.
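Individual numbers can be pulled out of the describe() table too. Here is a small sketch on invented values; it also adds a quick sanity check of my own (not part of the original steps) that a cumulative column never decreases.

```python
import pandas as pd

# Invented cumulative case counts.
toy = pd.DataFrame({"totale_casi": [100, 150, 220, 300]})

# describe() returns a DataFrame indexed by statistic name.
summary = toy.describe()
print(summary.loc[["min", "max", "mean"], "totale_casi"])

# A cumulative total should never go down from one day to the next.
print(toy["totale_casi"].is_monotonic_increasing)  # True
```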
dfl=df[['Total amount of positive cases']]
dfl.plot(title='Increase in total positive corona cases found in Italy by the end of March-20')
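The plot above uses the row number on the x-axis. One possible variation is to put the dates there instead by making the date column the index. This is a sketch on invented numbers; the Agg backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Invented values, using the renamed column headings from the steps above.
toy = pd.DataFrame({
    "Date of notification": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "Total amount of positive cases": [100, 150, 220],
})

# With the dates as the index, pandas uses them for the x-axis.
cases_by_date = toy.set_index("Date of notification")["Total amount of positive cases"]
ax = cases_by_date.plot(title="Total positive cases over time (toy data)")
ax.set_ylabel("Cumulative positive cases")
```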
13. Let’s plot a bar graph to visualize total corona positive cases found each day in Italy.
dfb=df[['Total amount of current positive cases']]
dfb.plot.bar(title="COVID+ cases each day")
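Daily counts and cumulative totals are closely related: the difference between two consecutive cumulative totals is the number of cases added that day. A minimal sketch with invented numbers, using the diff() method:

```python
import pandas as pd

# Invented cumulative totals for four consecutive days.
cumulative = pd.Series([100, 150, 220, 300], name="totale_casi")

# diff() subtracts the previous row from each row.
# The first value is NaN because it has no previous day to compare with.
daily_new = cumulative.diff()
print(daily_new)
```

This can be handy for cross-checking a daily column against a cumulative one.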
14. Now let’s visualize two pieces of information in a single figure. For instance, let’s compare the total people who died vs total people recovered from coronavirus in Italy over time.
df1=df[['Deceased people','Discharged, people healed']]
df1.plot.area(title="People died VS People healed in Italy")
I feel that this is enough visualization for one blog. I am going to explore various other methods of visualization. You too should try to build a deep understanding of the data so that we can perform appropriate calculations in the next step of our project.
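One thing to keep in mind: plot.area stacks the columns on top of each other by default, so the upper edge of the chart shows the sum of both columns. If you want to compare the two curves directly, stacked=False overlays them instead. A sketch on invented numbers (again, the Agg line is only for running outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Invented values, using the renamed column headings from the steps above.
toy = pd.DataFrame({
    "Deceased people": [5, 8, 12],
    "Discharged, people healed": [2, 6, 15],
})

# stacked=False draws each column from zero, as overlapping translucent areas.
ax = toy.plot.area(stacked=False, title="Died vs healed (toy data, unstacked)")
```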
See you in the next blog. Until then have a good time.
Stay Healthy, Stay Safe!