Takeaways from the article
- This article helps you understand the cases in which Machine Learning can be used, where it is relevant, and where it is not.
- It discusses the basic steps involved in a machine learning problem, along with code in Python.
- It discusses how the data involved in a Machine Learning problem can be visualized using certain Python packages.
Machine Learning has been a hot topic for many years, yet not everyone knows how to make sense of it or where it can actually be used. It is not a universal solution to every challenging problem out there. It can only be used when certain conditions are satisfied; only then does a problem qualify to be solved using a Machine Learning algorithm. Python is one of the most widely used languages for working with Machine Learning algorithms.
Introduction to Machine Learning
Machine Learning (ML) is a subfield of Artificial Intelligence (AI). ML is the art of understanding or designing an algorithm that can process large or small amounts of data. The rules are not explicitly defined for the machine; instead, the machine learns them from the data on its own. There are no hand-written ‘if’ or ‘else’ statements to guide the machine.
This is very similar to how humans learn from their day-to-day experiences: how a child learns to ride a bike, or learns to read letters, then words, then sentences, and finally to hold conversations.
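To make the contrast between hand-written rules and learned rules concrete, here is a minimal sketch. The temperature data, the labels and the `rule_based` function are hypothetical and exist only for illustration; a one-level decision tree from scikit-learn is used because the rule it learns is easy to inspect.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: temperatures in degrees C, labelled 0 = "cold", 1 = "hot".
temperatures = np.array([[5], [10], [12], [18], [25], [28], [31], [35]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Hand-written approach: the programmer picks the cut-off.
def rule_based(temp_c):
    return 1 if temp_c > 20 else 0

# Learned approach: the model infers a cut-off from the labelled examples.
model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(temperatures, labels)

learned_threshold = model.tree_.threshold[0]
print("Threshold learned from the data:", learned_threshold)
print("Hand-written rule for 30 degrees:", rule_based(30))
print("Learned model for 30 degrees:", model.predict([[30]])[0])
```

Nobody typed the threshold into the second approach; it was derived from the examples, which is the sense in which the machine "learns on its own".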
Getting started with Machine learning in Python
Python is widely used to implement machine learning algorithms, since it is open-source, extremely popular and well supported by its community. In addition, there are many Python packages that support machine learning algorithms for a variety of applications.
These algorithms can be used in Python by calling simple functions (methods) that are defined inside classes. These classes are, in turn, organized into modules that are distributed as a package.
The ‘scikit-learn’ package is one of the most popular for Python and has most of the common machine learning algorithms pre-implemented and organized into modules. To use an algorithm, the module (or a specific class from the module) is imported, an object of the class is created, and its methods are accessed using the dot operator. In general, the following steps can serve as a blueprint for implementing any machine learning algorithm:
- Define the problem: Confirm that it can be solved using machine learning (that is, it is not a trivial problem that a fixed set of rules can solve).
- Prepare the data: In this step, the data needed for the model is collected from various sources. Alternatively, data can be generated using functions available in Python packages. In either case, the data has to be cleaned, structured and analysed, and the outliers have to be identified. The data also has to be pre-processed so that it is easy for the algorithm to build a model from it. Certain irrelevant columns may be removed, and missing data should be handled.
- Train and tune the model: The model needs to be trained on the data, and its hyperparameters need to be tuned so as to get better prediction accuracy.
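The blueprint above could look roughly like the following sketch. It is only one possible arrangement under stated assumptions: the data is synthetic, missing values are simulated and filled with the column mean, and `Ridge` with a small grid of `alpha` values stands in for "a model with a hyperparameter to tune".

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

# Prepare the data: synthetic features and target (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 200)

# Simulate missing entries so that there is something to clean.
X[rng.random(X.shape) < 0.05] = np.nan

# Hold back part of the data to check the model on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Pre-processing (fill missing values) and the model are chained together.
pipeline = make_pipeline(SimpleImputer(strategy="mean"), Ridge())

# Train and tune: try several values of the hyperparameter with cross-validation.
search = GridSearchCV(pipeline, {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

best_alpha = search.best_params_["ridge__alpha"]
test_score = search.score(X_test, y_test)
print("Best alpha:", best_alpha)
print("R2 score on held-out data:", test_score)
```

Keeping the imputation inside the pipeline means the cleaning step is fitted only on the training folds, so the held-out score is not influenced by the test data.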
Note: It is understood that the users have Python 3.5 or a higher stable version installed on their workstations before beginning to execute the code in the upcoming sections. Other packages can be installed as and when required.
Where can Machine Learning be used (and where not)?
- The simplest case: when no prediction or complex data insight is needed, Machine Learning need not be used.
- Machine Learning algorithms are built by humans to help understand data better, make predictions, and so on. When we try to solve a problem ourselves, we rely on certain principles as a foundation (in physics, for example, gravity and Newton’s laws), but the algorithms do not. Many of them are stochastic (random) in nature.
- Not every problem with a large amount of data is suited to Machine Learning algorithms. It is important to recognize when a problem is deterministic, and to avoid solving such problems using Machine Learning.
Machine Learning in Python
Let us jump into a simple problem of linear regression. Linear regression is a simple machine learning algorithm that predicts the value of a variable based on certain other values. There are many variations of Linear Regression, including multivariate regression.
Before jumping into the algorithm, let us understand what linear regression means. ‘Linear’ basically means a straight line, and ‘regression’ refers to estimating the relationship between variables so that a numeric value can be predicted. Like other machine learning methods, it learns this relationship from data rather than being explicitly programmed with it.
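As a sketch of what "fitting a straight line" means, simple linear regression with one input can be written as below. The symbols are notation assumed here, not taken from the article: $x$ is the known input, $y$ the value to predict, $\beta_0$ the intercept, $\beta_1$ the slope and $\varepsilon$ the error. The second line gives the standard least-squares estimates computed from $n$ data points.

```latex
y = \beta_0 + \beta_1 x + \varepsilon

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

Here $\bar{x}$ and $\bar{y}$ are the averages of the inputs and outputs. A prediction for a new input $x$ is then $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.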
There are various kinds of machine learning algorithms, and Linear Regression is just the beginning. The main kinds are supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning.
Why should Machine Learning be used?
Certain tasks need intricate detailing, and patterns might not be fully unveiled if manual or simple methods are used to extract them. Machine learning, on the other hand, can extract important hidden patterns and keep working well even as the amount of data grows rapidly. It also becomes easier to improve pattern recognition, deliver results in a timely manner, and get deeper and better insights into the data at hand.
The results computed using a Machine Learning algorithm can be more accurate than those of traditional methods, and the models built can serve as a foundation for other data as well. Machine learning algorithms can be classified in different ways. Based on the amount of supervision involved, the 4 basic classifications are:
- Supervised learning algorithms
- Semi-supervised learning algorithms
- Unsupervised learning algorithms
- Reinforcement learning algorithms
Machine learning algorithms can also be classified into 2 types based on how they learn, either incrementally on the fly or from the data in batches:
- Online learning
- Batch learning
Machine learning algorithms can also be classified based on how they detect patterns- whether they detect patterns in data or compare new data values with previously seen data values:
- Model-based learning
- Instance-based learning
Supervised learning algorithms
- Most popular
- Easy to understand
- Easier to implement
- Gives decent results
- Expensive, since human intervention is required
Supervised learning involves human supervision. In practice, supervision is present in the form of labelled data, a feedback loop (insights on whether the machine predicted correctly and, if not, what the correct prediction should have been) and so on.
Once the algorithm is trained on such data, it can often predict outputs with high accuracy for never-before-seen inputs.
Applications of supervised learning:
- Spam classification: Classifying emails as spam or important.
- Face recognition: Detecting faces, mapping them to a specific face in a database of faces.
Supervised algorithms can further be classified into two types:
- Classification algorithms: They classify the given data into one of the given classes or group of data. This basically deals with data grouping/data mapping into specific classes.
- Regression algorithms: These fit a model to the given data and predict continuous numeric values.
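The difference between the two types can be seen in a small sketch. The study-hours data below is hypothetical; `LogisticRegression` and `LinearRegression` from scikit-learn are used simply as one example of a classifier and one example of a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: hours studied by eight students.
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])

# Classification target: a class label (0 = fail, 1 = pass).
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])
classifier = LogisticRegression()
classifier.fit(hours, passed)

# Regression target: a continuous value (exam score).
scores = np.array([35.0, 41.0, 48.0, 55.0, 61.0, 68.0, 74.0, 81.0])
regressor = LinearRegression()
regressor.fit(hours, scores)

predicted_class = classifier.predict([[7.5]])[0]
predicted_score = regressor.predict([[7.5]])[0]
print("Predicted class for 7.5 hours:", predicted_class)
print("Predicted score for 7.5 hours:", predicted_score)
```

The same input produces a label in one case and a number in the other; which one is appropriate depends on what the problem asks for.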
Semi-supervised learning algorithms
- Comes in between the supervised and unsupervised learning algorithms.
- Created to bridge the gap between dealing with fully labelled and fully unlabelled data.
- Input is a combination of unlabelled (more) and labelled (less) data.
Applications of semi-supervised learning algorithms:
- Speech analysis, sentiment analysis
- Content classification
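A minimal sketch of learning from mostly unlabelled input follows. It assumes synthetic data from `make_blobs` and uses scikit-learn's `SelfTrainingClassifier` wrapped around `LogisticRegression`; self-training is only one of several semi-supervised techniques and is chosen here for brevity.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic data: two well-separated groups of points (assumed for illustration).
X, true_labels = make_blobs(
    n_samples=100, centers=[[-3, -3], [3, 3]], cluster_std=0.8, random_state=0
)

# Keep only five labels per class; scikit-learn marks unlabelled samples with -1.
partial_labels = np.full(len(true_labels), -1)
for cls in (0, 1):
    keep = np.where(true_labels == cls)[0][:5]
    partial_labels[keep] = cls

# Self-training: fit on the labelled part, then label confident samples step by step.
semi_model = SelfTrainingClassifier(LogisticRegression())
semi_model.fit(X, partial_labels)

labelled_count = int((partial_labels != -1).sum())
accuracy = (semi_model.predict(X) == true_labels).mean()
print("Labelled samples used:", labelled_count)
print("Accuracy on all samples:", accuracy)
```

Only 10 of the 100 samples carry a label here, which mirrors the "more unlabelled, less labelled" input described above.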
Unsupervised learning algorithms
- No data labelling
- No human intervention
- May not be very accurate
- Can’t be applied to a broad variety of situations
- Algorithm has to figure out how and what to learn from the data
- Similar to real-world unstructured data
Applications of unsupervised learning:
- Anomaly detection
Unsupervised learning algorithms can be classified into two categories:
- Clustering algorithms
- Association algorithms
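As an illustration of the clustering category, the sketch below groups unlabelled points with scikit-learn's `KMeans`. The data is synthetic and the number of clusters (3) is assumed to be known in advance, which is often not the case in practice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabelled data: the true group labels are discarded on purpose.
X, _ = make_blobs(
    n_samples=150, centers=[[0, 0], [6, 6], [0, 6]], cluster_std=0.6, random_state=0
)

# The algorithm only sees X and has to find the groups on its own.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

cluster_sizes = np.bincount(cluster_ids)
print("Cluster sizes:", cluster_sizes)
print("Cluster centres:")
print(kmeans.cluster_centers_)
```

No labels are passed to `fit_predict`; the cluster numbers it returns are arbitrary identifiers, not meaningful class names.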
Reinforcement learning algorithms
- It is a ‘punish and reward’ mechanism.
- Learns from its surroundings and experience.
- An agent decides the next relevant step to arrive at the desired result.
- If the algorithm learns correctly, it is rewarded, indicating that it is on the right path.
- If the algorithm makes a mistake, it is punished to indicate the mistake so that it can learn from it.
A supervised learning algorithm differs from a reinforcement learning algorithm: the former has a known correct value to compare its prediction against, whereas the latter has to decide the next action, take it, bear the result and learn from it.
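The reward-and-punish loop can be sketched with a very small example. The environment below is hypothetical: three possible actions, each with a different chance of being rewarded. The agent follows an epsilon-greedy strategy, which is one simple way of balancing trying new actions against repeating the best one found so far.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment: chance of a reward for each of three actions.
reward_probability = np.array([0.2, 0.5, 0.8])

value_estimates = np.zeros(3)  # the agent's running estimate of each action's value
action_counts = np.zeros(3)
epsilon = 0.1                  # fraction of steps spent exploring at random

for step in range(5000):
    if rng.random() < epsilon:
        action = int(rng.integers(3))             # explore: try a random action
    else:
        action = int(np.argmax(value_estimates))  # exploit: use the best known action

    # Reward (+1) when the action succeeds, punishment (-1) when it fails.
    reward = 1.0 if rng.random() < reward_probability[action] else -1.0

    # Learn from the result: update the running average for the chosen action.
    action_counts[action] += 1
    value_estimates[action] += (reward - value_estimates[action]) / action_counts[action]

best_action = int(np.argmax(value_estimates))
print("Estimated value of each action:", value_estimates)
print("Action the agent prefers:", best_action)
```

The agent is never told which action is correct; it only sees the reward or punishment that follows each choice.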
Applications of reinforcement learning:
- Robotics in automation
- Machine learning and data processing
Other types of learning algorithms
- Online learning
- Batch learning: It has two different categories: Model-based learning, and instance-based learning
Online learning
- Also known as incremental or out-of-core learning.
- The assumption is that the learning environment changes constantly.
In online learning, the model is trained continuously on new data as it arrives, while it is also being used to predict outputs. Whenever the model sees a new example, it quickly has to learn from it and adapt to it. This way, even the newly learnt example becomes part of the trained model and contributes to the prediction/output.
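A minimal sketch of this idea uses scikit-learn's `SGDRegressor`, whose `partial_fit` method updates an existing model with a new chunk of data. The data stream is simulated, and the constant learning rate of 0.05 is an assumption chosen so that this small example settles quickly.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Constant learning rate chosen for this toy stream (an assumption, not a recommendation).
online_model = SGDRegressor(learning_rate="constant", eta0=0.05, random_state=0)

# Simulated stream: data arrives in small chunks and the model is updated each time.
for _ in range(200):
    X_chunk = rng.random((10, 1))
    y_chunk = 5.89 + 2.45 * X_chunk[:, 0] + rng.normal(0, 0.1, 10)
    online_model.partial_fit(X_chunk, y_chunk)

prediction = online_model.predict([[0.5]])[0]
print("Slope learned so far:", online_model.coef_[0])
print("Prediction for x = 0.5:", prediction)
```

At no point is the full dataset held in memory; each chunk is used once and can then be discarded, which is why this style is also called out-of-core learning.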
Batch learning
This is also known as learning from data in a group.
Data is grouped into different batches.
These batches are used to extract different patterns, since every batch would be considerably different from the others. These patterns are learned by the model over time.
Model-based learning
The specifications associated with a problem in a domain are converted into a model. When this model sees new data, it detects patterns in it, and these patterns are used to make predictions on the newly seen data.
Instance-based learning
It is the simplest form of classification and regression algorithms.
They either group the data into different classes (classification) or give numeric values as output (regression).
Classification and regression are based on how similar or different the queries are with respect to the values in the data.
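Comparing a query with stored values can be sketched with a nearest-neighbours classifier. The height and weight figures and the labels below are hypothetical; `KNeighborsClassifier` from scikit-learn is used as one common example of this style of learning.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stored instances: [height in cm, weight in kg] with a class label.
stored_points = np.array(
    [[150, 50], [155, 55], [160, 58], [180, 85], [185, 90], [190, 95]]
)
stored_labels = np.array(["small", "small", "small", "large", "large", "large"])

# Fitting mainly stores the instances; the comparison happens at prediction time.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(stored_points, stored_labels)

# A new query is classified by looking at the most similar stored instances.
query = np.array([[158, 56]])
distances, neighbour_idx = knn.kneighbors(query)
predicted_label = knn.predict(query)[0]

print("Nearest stored instances:")
print(stored_points[neighbour_idx[0]])
print("Predicted class:", predicted_label)
```

The three nearest neighbours of the query all carry the same label here, so the vote is unanimous; with mixed neighbours the majority label would be returned.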
Linear Regression
In this algorithm, we will work with two different variables: one is an independent variable, and the other is a dependent variable. We will take a basic problem of finding the price of a house when its area is given. Assume that we have the dataset below:
|Price of house (dependent value)|Area of the house (independent value)|
|---|---|
|356|500 sq m|
|578|1000 sq m|
|890|1500 sq m|
|1300|2000 sq m|
|1800|2500 sq m|
|?|3000 sq m|
Given the above data, the price of the house in the last row is to be found from its area. Simple linear regression, which gives a decent amount of accuracy, can be used for this. When plotted on a graph, the data yields an almost straight line, which means the dependent value depends on the independent value, i.e. the area of the house matters when the price of the house is being fixed.
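As a quick sketch, the five known rows of the table can be passed to scikit-learn's `LinearRegression` to estimate the missing price. The variable names are chosen here for illustration, and the price is left in whatever unit the table uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Known rows of the table: area is the input, price is the value to predict.
area_sq_m = np.array([[500], [1000], [1500], [2000], [2500]])
price = np.array([356, 578, 890, 1300, 1800])

house_model = LinearRegression()
house_model.fit(area_sq_m, price)

# Estimate the price for the last row of the table (3000 sq m).
predicted_price = house_model.predict(np.array([[3000]]))[0]

print("Slope:", house_model.coef_[0])
print("Intercept:", house_model.intercept_)
print("Estimated price for 3000 sq m:", predicted_price)
```

For this data the fitted line has a slope of about 0.722 and an intercept of about -98.2, which gives an estimate of roughly 2067.8 for the 3000 sq m house. Since the prices rise a little faster than a straight line, this is best read as a rough estimate.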
The basic steps involved in a machine learning problem:
- Identify the problem: See if it qualifies to be solved using a Machine Learning algorithm.
- Gather the data: The data required can be collected from a single source or from various sources, or it can be generated randomly (if it is for a specific purpose) using certain formulas and methods.
- Data cleaning: The data gathered may not be clean or structured. Make sure it is cleaned and is in a structured or at least semi-structured format.
- Package installation: Install the packages that are required to work with the data.
- Data loading: Load the data into the Python environment using any IDE (Spyder is one common choice). This is done so that the machine learning algorithm can access the data and perform operations on it.
- Data cleaning: Data can be cleaned after it has been placed in the Python environment using certain packages and methods, or it can be cleaned before (manually or by applying some logic).
- Summarize the data: Understand the columns we are looking at, perform some operations on them, and get the type of each value along with the mean, median, variance, and standard deviation, which give insights into the data. This can be done easily by importing packages that have these functions.
- Data training: In this step, the model is trained by passing the input (training) dataset as a parameter to the respective algorithm. This is done so that it can predict the output for never-before-seen data, also known as the testing dataset.
- Linear Regression application: Apply the Linear Regression algorithm to this data.
- Data visualization: The data and the fitted linear regression model are visualized using Python packages such as ‘matplotlib’.
- Prediction: The predictions are made with the help of the trained model, and are then displayed on the console.
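The "summarize the data" step in the list above can be sketched with a few NumPy calls. The numbers are the known house prices from the earlier table; using the sample variance (`ddof=1`) is a choice made here, not something the article specifies.

```python
import numpy as np

# Known house prices from the earlier table.
prices = np.array([356, 578, 890, 1300, 1800])

summary = {
    "type of value": str(prices.dtype),
    "mean": float(np.mean(prices)),
    "median": float(np.median(prices)),
    "variance": float(np.var(prices, ddof=1)),        # sample variance
    "standard deviation": float(np.std(prices, ddof=1)),
}

for name, value in summary.items():
    print(name, ":", value)
```

For tabular data with many columns, the `describe` method of a pandas DataFrame gives a similar overview in a single call.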
Code to implement linear regression using Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
#A seed is set so that the same random data set is generated on every run
np.random.seed(0)
#Random data set generated: x is the independent variable, y is the dependent variable
#Uniform noise in [0, 1) is added, so the fitted intercept will be close to 5.89 + 0.5
x_indep = np.random.rand(100, 1)
y_dep = 5.89 + (2.45) * x_indep + np.random.rand(100, 1)
#The model is initialized using LinearRegression that is present in the scikit-learn package
model_of_regression = LinearRegression()
#The data is fit on the model, with the help of training
model_of_regression.fit(x_indep, y_dep)
#The output is predicted
predicted_y_val = model_of_regression.predict(x_indep)
#The model built is evaluated using the root mean squared error and the R2 score
rmse = np.sqrt(mean_squared_error(y_dep, predicted_y_val))
r2 = r2_score(y_dep, predicted_y_val)
print("The value of slope is: ", model_of_regression.coef_)
print("The intercept value is: ", model_of_regression.intercept_)
print("The Root Mean Squared Error value (RMSE) is: ", rmse)
print("The R2 score is: ", r2)
#The data is visualized using the matplotlib library
plt.scatter(x_indep, y_dep, s=8)
#The predicted values are plotted as a line on the same graph and displayed on the screen
plt.plot(x_indep, predicted_y_val, color='r')
plt.show()
Code review: explanation of every step
- The required packages are imported using the ‘import’ keyword.
- Make sure that the ‘scikit-learn’ package is installed before working on this code.
- Instead of using ready-made data, we are generating data here, using the ‘rand’ function from NumPy’s ‘random’ module.
- A seed is defined, and a formula is created that uses random values for the variables to generate random data.
- The ‘LinearRegression’ class, present in the ‘scikit-learn’ package, is instantiated so as to create a model, and one of its methods, namely ‘fit’, is called by passing the independent and the dependent values.
- The ‘predict’ method of LinearRegression is used to predict the unknown dependent value for a given independent value.
- After the model is built with the data, it is important to see how it has fared.
- Hence, a metric named RMSE (Root Mean Squared Error) is used to measure the difference between the values that should have been predicted and the values that were actually predicted.
- Next, the data is visualized on the screen using a package named ‘matplotlib’.
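The walkthrough above evaluates the model on the same data it was trained on. To check how the model does on never-before-seen data, part of the data can be held back as a testing dataset. The sketch below assumes the same kind of random data as the example and uses `train_test_split` from scikit-learn; the 80/20 split is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Random data generated with the same formula as in the example.
rng = np.random.default_rng(0)
x = rng.random((100, 1))
y = 5.89 + 2.45 * x + rng.random((100, 1))

# Hold back 20% of the data; the model never sees it during training.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

held_out_model = LinearRegression()
held_out_model.fit(x_train, y_train)

# Evaluate only on the held-back testing dataset.
test_predictions = held_out_model.predict(x_test)
test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
test_r2 = r2_score(y_test, test_predictions)

print("RMSE on the testing dataset:", test_rmse)
print("R2 score on the testing dataset:", test_r2)
```

An error on the testing dataset that is much worse than on the training data is a sign that the model has memorized the training examples rather than learned the underlying pattern.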
In all, Machine Learning is a game changer when its use cases are identified correctly and the right kind of algorithm is applied in the right place, with the right amount of data and the right computational resources. Linear Regression is just a simple algorithm with which Machine Learning begins to show what it can do. Python is commonly used to implement Machine Learning algorithms, but other languages can be used as well.