
Scikit-learn is a Python module used for data analysis and data mining. It can be installed with pip:
pip install scikit-learn
The following are the steps in implementing a learning algorithm using scikit-learn:
A dataset can be collected and loaded, or a pre-defined dataset can be used. Every dataset has two components: features and responses.
Features are the attributes or variables present in the dataset. Collectively they are represented as a 'feature matrix'.
The response, also known as the target variable or label, is the output that depends on the feature variables. The single column of responses is known as the 'response vector'.
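To make these two components concrete, here is a minimal sketch (with made-up numbers) of what a feature matrix and a response vector look like as numpy arrays:
import numpy as np

#hypothetical feature matrix: 3 samples, 2 features each
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9]])
#response vector: one label per sample
y = np.array([0, 0, 1])

print(X.shape)  #(3, 2) -> (n_samples, n_features)
print(y.shape)  #(3,)   -> (n_samples,)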
Data can be loaded in different ways; some of them are demonstrated below.
Using the csv module
Python's built-in 'csv' module provides a reader function that can be used to read the data present in a CSV file. The file is opened in read mode, and the reader function is applied to it. Below is an example demonstrating the same:
import numpy as np
import csv

path = "path to csv file"  #placeholder: replace with the actual path to the CSV file
with open(path, 'r') as infile:
    reader = csv.reader(infile, delimiter=',')
    headers = next(reader)  #the first row contains the column names
    data = list(reader)     #the remaining rows contain the values
data = np.array(data).astype(float)
The headers or the column names can be printed using the following line of code:
print(headers)
The dimensions of the dataset can be determined using the shape attribute as shown in the following line of code:
print(data.shape)
Output:
(250, 302)
The nature of the data can be determined by examining the first few rows of the dataset using the line of code below:
data[:2]
Using the numpy package
The numpy package has a function named ‘loadtxt’ that can be used to read CSV data. Below is an example demonstrating the same using StringIO.
from numpy import loadtxt
from io import StringIO
c = StringIO("0 1 2 \n3 4 5")
data = loadtxt(c)
print(data.shape)
Output:
(2, 3)
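Note that loadtxt splits on whitespace by default, which is why the space-separated example above works. For an actual comma-separated file, the delimiter parameter can be passed explicitly, as in this small sketch:
from numpy import loadtxt
from io import StringIO

c = StringIO("0,1,2\n3,4,5")
#delimiter must match the separator used in the file
data = loadtxt(c, delimiter=',')
print(data.shape)
Output:
(2, 3)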
Using the pandas package
The pandas package reads a CSV file directly into a DataFrame; its read_csv function takes the file path and a sep parameter specifying the delimiter used in the file.
Let us look at an example to understand how a CSV file is read as a DataFrame.
import numpy as np
import pandas as pd
#Obtain the dataset
df = pd.read_csv("path to csv file", sep=",")
df[:5]
Output:
   id  target      0      1      2  ...    295    296    297    298    299
0   0     1.0 -0.098  2.165  0.681  ... -2.097  1.051 -0.414  1.038 -1.065
1   1     0.0  1.081 -0.973 -0.383  ... -1.624 -0.458 -1.099 -0.936  0.973
2   2     1.0 -0.523 -0.089 -0.348  ... -1.165 -1.544  0.004  0.800 -1.211
3   3     1.0  0.067 -0.021  0.392  ...  0.467 -0.562 -0.254 -0.533  0.238
4   4     1.0  2.347 -0.831  0.511  ...  1.378  1.246  1.478  0.428  0.253
Loading a pre-defined dataset
It can be done using the code below.
from sklearn.datasets import load_iris
iris = load_iris()
#the feature matrix and the response vector are stored in two variables
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
#feature names and targets are printed
print("Feature names:", feature_names)
print("Target names:", target_names)
#X and y are numpy arrays
print("\nType of X is:", type(X))
#first 5 input rows are printed to understand the type of data present in the dataset
print("\nFirst 5 rows of X:\n", X[:5])
The next important step in implementing a learning algorithm is to split the dataset into training, testing, and validation datasets.
Data is split into different sets so that the model can be trained on one part, validated on another, and tested on a third.
Training data: This is the input dataset that is fed to the learning algorithm once it has been pre-processed and cleaned. Predefined datasets are readily available on many websites and can be downloaded and used; some need to be cleaned and verified, while others come cleaned beforehand. The machine learning model learns from this data and tries to fit a model to it.
Validation data: This is similar to the test set, but it is used on the model frequently so as to know how well the model performs on never-before-seen data. Based on the results obtained by passing the validation set to the learning algorithm, decisions can be made about how the algorithm can learn better: the hyperparameters can be tweaked so that the model gives better results on the validation set in the next run, or features can be combined or new features created that better describe the data, thereby yielding better results.
Test data: This is the data on which the model's performance, i.e. its ability to generalize, is judged. In the end, the model's performance is determined by how well it reacts to never-before-seen data. It is a way of knowing whether the model actually understood and learnt the patterns, or simply overfit or underfit the data.
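Since train_test_split (shown below) produces only two partitions per call, one common way to obtain all three sets is to apply it twice. The sketch here uses illustrative 60/20/20 proportions, which are an assumption rather than something prescribed above:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
#first split: hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
#second split: 25% of the remaining 80% (i.e. 20% of the total) becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=1)
print(len(X_train), len(X_val), len(X_test))
Output:
90 30 30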
It is important to understand that good-quality data (little to no noise, redundancy, or discrepancy) in large amounts yields great results when the right learning algorithm is applied to it.
The dataset needs to be split into training and test datasets so that, once training is completed on the training dataset, the performance of the learning model can be tested on the test dataset. Usually, 80 percent of the data is used for training and 20 percent for testing. This can be achieved using the scikit-learn library, which provides a function named train_test_split; its test_size parameter controls how the dataset is divided.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Feature scaling
This is one of the most important steps in data pre-processing. It refers to standardizing the range of the independent variables or features present in the dataset. When all the variables are on the same scale, it is easier to work with machine learning equations. This can be achieved using the 'StandardScaler' class present in the scikit-learn library. The scaler is first fit on the training dataset and used to transform it; the test dataset is then only transformed, using the parameters learned from the training data.
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  #fit the scaler on the training data, then transform it
X_test = sc_X.transform(X_test)  #transform the test data with the same learned parameters
Model training
Let us look at how a model can be trained using the KNN algorithm.
from sklearn.datasets import load_iris
iris = load_iris() #loading the iris dataset
#the feature matrix and response vector are stored
X = iris.data
y = iris.target
#X and y are split into training and testing datasets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
#model is trained on the training data
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
#predictions are made on the test data
y_pred = knn.predict(X_test)
#the actual and predicted responses are compared
from sklearn import metrics
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))
#predictions for sample data
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
#the model is saved
import joblib  #in recent scikit-learn versions; older versions used 'from sklearn.externals import joblib'
joblib.dump(knn, 'iris_knn.pkl')
Output:
kNN model accuracy: 0.9833333333333333
Predictions: ['versicolor', 'virginica']
['iris_knn.pkl']
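A saved model can later be restored with joblib.load; here is a short sketch, assuming the 'iris_knn.pkl' file produced above exists:
import joblib

#reload the persisted model and reuse it for prediction
knn_loaded = joblib.load('iris_knn.pkl')
print(knn_loaded.predict([[3, 5, 4, 2]]))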
Conclusion
In this post, we saw how scikit-learn can be used to load data, split it into training and test sets, scale features, and train and persist machine learning models with ease.