
Data Science blog posts

Install Python on Ubuntu

Introduction
This article will help you install Python 3 on Ubuntu. You will learn the basics of configuring the environment to get started with Python.

Brief introduction to Python
Python is an interpreted programming language that is very popular these days due to its easy learning curve and simple syntax. Python is used in many applications and for programming the backend code of websites. It is also very popular for data analysis across industries ranging from medical and scientific research to retail, finance, entertainment and media. When writing a Python program, or a program in any other language, people usually use an IDE (Integrated Development Environment), which includes everything you need to write a program: a built-in text editor to write the code and a debugger to debug it.

Latest version of Python
The latest version of Python is Python 3; at the time of writing, the latest release is Python 3.9.0.

Installation links
To download Python and its documentation for Ubuntu, visit the official website https://www.python.org/downloads/ and go to the downloads section, from where you can download the latest Python version for Ubuntu.

Key terms (pip, virtual environment, path, etc.)

pip: pip is a package manager that simplifies the installation of Python packages. To install pip, run the below command in the terminal:
sudo apt install python3-pip
Once the installation is done, install a package by running:
pip install <package-name>

virtual environment: The purpose of virtual environments is to have a separate space where you can install packages that are specific to a certain project. For example, if you have a lot of Flask or Django based applications and not all of them use the same version, a virtual environment lets each project keep its own version. To use a virtual environment you need to be on Python 3.x. You do not need any extra library, as virtual environment support comes with the standard Python installation. If you do not have virtualenv installed, use this command to install it:
pip3 install virtualenv
To create a new virtual environment, run the below command (env is the name of the virtual environment):
virtualenv env
This will create a virtual environment and install some standard packages as part of the environment creation. To activate the virtual environment on Ubuntu, use:
source env/bin/activate
To deactivate it, run the below command inside the environment:
deactivate
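Once an environment is activated (or whenever you are unsure which interpreter a terminal is using), a short script can confirm the version and path. This is a minimal sketch, not part of the installation itself; the file name check_python.py is just an example:

# check_python.py - print which interpreter is running and its version
import sys

print("Python version:", sys.version.split()[0])
print("Interpreter path:", sys.executable)

# Inside an activated virtual environment, the interpreter path should point
# into the environment folder (for example .../env/bin/python).

Run it with python3 check_python.py; inside the activated environment the reported path should point into the env folder created above.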
Getting and installing Python on Ubuntu
To download Python, visit the official website https://www.python.org/downloads/ and go to the downloads section, from where you can download the latest Python version for Ubuntu. Download the tarball and untar it. After untarring, you will see a number of files; the one to look at first is the README, which contains the instructions for installing Python on an Ubuntu machine. Open the terminal, change into the untarred Python directory with cd ~/<python untarred folder>, and run the following commands to build and install Python:
./configure
make
make test
sudo make install
This will install Python as python3. If ./configure fails with an error such as "no compiler found", install the build tools first:
sudo apt-get install build-essential
Likewise, if you get an error saying make is not found, install make:
sudo apt install make
Once make and build-essential are installed, the build commands above should work.
The other way of installing Python is through apt. Open the terminal and run:
sudo apt-get update
This makes sure the repositories on Ubuntu are updated to the latest versions. Then install Python by running:
sudo apt-get install python3

Setting the path
To find the existing system path set on your machine, run:
echo $PATH
If you want to add the directory of your Python executable to the path, use the export command and give it the directory path:
export PATH=$PATH:<path to executable directory>
Running the export command on its own will not persist the change across terminals; if you close that terminal and open it again, the change is lost. To make it persistent, add the above line to the ~/.bashrc file in the home directory of the Ubuntu system.

How to run Python code
To run a Python file, just run the command:
python <pythonfile.py>

Installing additional Python packages
To see which packages are installed in the environment, run pip3 list, which lists the packages currently installed. To install another package, for instance the requests library, run pip3 install requests. Running pip3 list again will show the requests library installed in the environment.

Directory as package for distribution
Inside a Python project directory that you want to treat as a package, there should be a file called __init__.py. You can create it with a simple touch command. The file does not need to contain anything; it only has to exist inside the directory for the directory to work as a package.

Documentation links for Python
https://www.python.org/doc/

Conclusion (Summary)
This article has provided step-by-step instructions for installing Python on Ubuntu OS.

How to Install Python on Mac

Introduction
This article will help you install Python 3 on macOS. You will learn the basics of configuring the environment to get started with Python.

Brief introduction to Python
Python is an interpreted programming language that is very popular these days due to its easy learning curve and simple syntax. Python is used in many applications and for programming the backend code of websites. It is also very popular for data analysis across industries ranging from medical and scientific research to retail, finance, entertainment and media. When writing a Python program, or a program in any other language, people usually use an IDE (Integrated Development Environment), which includes everything you need to write a program: a built-in text editor to write the code and a debugger to debug it. PyCharm is a well-known IDE for writing Python programs.

Latest version of Python
The latest version of Python is Python 3; at the time of writing, the latest release is Python 3.9.0.

Installation links
To download Python and its documentation for macOS, visit the official website https://www.python.org and go to the downloads section, from where you can download the latest Python version for macOS.

Key terms (pip, virtual environment, path, etc.)

pip: pip is a package manager that simplifies the installation of Python packages. To install pip, run the below command in the terminal:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
If you install Python using brew (a package manager that simplifies installation of software on macOS), it installs dependent packages such as pip along with python3.

virtual environment: The purpose of virtual environments is to have a separate space where you can install packages that are specific to a certain project. For example, if you have a lot of Flask or Django based applications and not all of them use the same version, a virtual environment lets each project keep its own version. To use a virtual environment you need to be on Python 3.x. You do not need any extra library, as it comes with the standard Python installation. To create a new virtual environment, run the below command:
python3 -m venv demo
The -m flag expects a module name, in this case venv, so Python searches sys.path and executes that module as the main module. venv expects the name of the environment you want to create. You should now have a new environment called demo; activate it by running:
source demo/bin/activate
After running this, the environment is activated and you can see its name in the terminal prompt. Another way to check that the environment is active is to run which python: you will see the Python belonging to this project environment, and its version is the same one you used to create the environment.
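The same thing the python3 -m venv command does can also be driven from Python itself through the standard-library venv module. This is a minimal sketch for illustration only; the environment name demo2 is arbitrary:

# create_env.py - create a virtual environment programmatically
import venv

# Equivalent in spirit to `python3 -m venv demo2`; with_pip=True also installs pip.
venv.create("demo2", with_pip=True)
print("Created ./demo2 - activate it with: source demo2/bin/activate")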
Getting and installing Python on macOS
On macOS, Python usually comes pre-installed. To check whether it is installed, open the terminal and run python --version. You can also see which Python version is installed by default, which is usually Python 2.x. However, Python 2.x is being deprecated, and since everyone is moving to Python 3.x, we will go with the latest Python 3 installation.

Installation steps
To download Python, visit the official website https://www.python.org and go to the downloads section, from where you can download the latest Python version for macOS. This downloads a pkg file; click on it to start the installation wizard. You can continue with the default settings; if you want to change the install location, change it and then continue and finish the installation with the rest of the defaults. Once the installation is finished, a Python 3.x directory is created in the Applications folder; open the Applications folder to verify this. You now have Python 3.x installed.
To verify it from the terminal, check the version of Python with the python --version command. You will still see the old default Python version; if instead you use python3 explicitly, as in python3 --version, you will see the Python 3 version that you have just installed.
You can also install Python 3 on macOS using brew, a package manager that simplifies installation of software on macOS:
brew install python3
brew will install other dependent packages, such as pip, along with python3.

Setting the path
Suppose you have installed the new Python 3 version, but when you type python it still shows the default Python 2 version that comes with macOS. To solve this, add an alias by running:
alias python=python3
Add this line to the file called .bash_profile in your home directory. If this file is not present, create it, save the changes and restart the terminal by closing it. Then open the terminal again, run python and hit enter; you should see the latest Python 3 that you have installed.
Sometimes when you type python or python3 explicitly, it does not work even though Python is installed, and you get the message "command not found". This means the command is not present in the directories the machine uses for lookup. You can check the directories where the machine looks for commands by running:
echo $PATH
It lists all the directories where the machine looks for commands, which vary from machine to machine. If the command you are trying to run is not under one of the directories listed by echo, it will not work, and you will keep getting a "command not found" error until you provide the full path of the directory where it is installed.
To fix this, open the .bash_profile file and add the directory where Python is installed to the PATH environment variable. For example, the following lines add the directory below to the current PATH; the exact path can vary from machine to machine based on the install location:
PATH="/Library/Frameworks/Python.framework/Versions/3.7/bin:${PATH}"
export PATH
Save the changes and restart the terminal. Run echo $PATH again and you will see the path you added for Python 3; when you now type the python3 command, it should work.
Also, if you are trying to import a package that you have installed and Python says it cannot find it, this usually means pip installed the package into a different Python version's directory. Make sure the package's location is in the site-packages directory of the Python version you are using.
You can see the location of a package that you are trying to import by running pip show <package-name>. The output includes a Location field, which you can use to cross-verify the path.

How to run Python code
To run a Python file, just run the command:
python <pythonfile.py>

Installing additional Python packages
To see which packages are installed in the environment, run pip3 list, which lists the packages currently installed. Let's say you want to install the requests library; you can install it by running pip3 install requests. Running pip3 list again will show the requests library installed in the environment.

Directory as package for distribution
Inside a Python project directory that you want to treat as a package, there should be a file called __init__.py. You can create it with a simple touch command. The file does not need to contain anything; it only has to exist inside the directory for the directory to work as a package.

Documentation links for Python
https://www.python.org/doc/

Conclusion
This article has provided step-by-step instructions for installing Python on macOS.

Essential Steps to Mastering Machine Learning with Python

One of the world's most popular programming languages today, Python is a great tool for Machine Learning (ML) and Artificial Intelligence (AI). It is an open-source, reusable, general-purpose, object-oriented, interpreted programming tool. Python's key design ideology is code readability, ease of use and high productivity. The trend over the past five years shows that interest in Python has grown significantly, and Python is the top choice for ML/AI enthusiasts compared to other programming languages.
[Image: Google Trends comparison of Python with other tools in the market]

What makes Python a perfect recipe for Machine Learning?
Python can be used to write Machine Learning algorithms and it computes accurately. Python's concise, readable syntax allows reliable code to be written quickly. Another reason for its popularity is the availability of versatile, ready-to-use libraries: it has an excellent library ecosystem and is a great tool for developing prototypes. Unlike R, Python is a general-purpose programming language which can also be used to build web applications and enterprise applications.
The Python community has developed libraries that address particular areas of data science. For instance, there are libraries for handling arrays, performing numerical computation with matrices, statistical computing, machine learning, data visualization and much more. These libraries are highly efficient and make coding much easier, with fewer lines of code. Some of the important Python libraries used for developing machine learning models are:
NumPy: one of the fundamental packages for numerical and scientific computing; a mathematical library for working with n-dimensional arrays in Python.
Pandas: provides the highly efficient, easy-to-use DataFrame for data manipulation and Exploratory Data Analysis (EDA).
SciPy: a library for scientific and high-performance computation; it contains modules for optimization and for several statistical distributions and tests.
Matplotlib: a complete plotting package that provides 2D as well as 3D plotting, with both static and interactive plots.
Seaborn: built on Matplotlib; used to produce more elegant statistical visualizations.
StatsModels: provides functionality for estimating various statistical models and conducting statistical tests.
Scikit-learn: built on NumPy, SciPy and Matplotlib; free to use, powerful, and providing a wide range of supervised and unsupervised machine learning algorithms.
One should also take into account the importance of IDEs designed for Python and Machine Learning. The Jupyter Notebook is an open-source web-based application that enables ML enthusiasts to create, share, visualize and live-code their projects. Other IDEs that can be used include PyCharm, Spyder, Vim and Visual Studio Code; for beginners, there is a simple online compiler available, Programiz.

Roadmap to master Machine Learning using Python
1. Learn Python: Learn Python from basic to advanced, and practice the features that matter for data analysis, statistical analysis and Machine Learning. Start from declaring variables, conditional statements, control flow statements, functions, collection objects, modules and packages, and get comfortable with the libraries listed above (a minimal sketch of a few of them follows below).
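By way of illustration, here is a minimal, hypothetical sketch of how NumPy and Pandas are typically used together; the column names and values are made up:

import numpy as np
import pandas as pd

# A tiny, made-up dataset: house area (square feet) and price.
data = pd.DataFrame({
    "area_sqft": [650, 800, 1200, 1500, 2000],
    "price": [70000, 85000, 130000, 160000, 210000],
})

print(data.describe())                                       # summary statistics (mean, std, quartiles)
print(np.corrcoef(data["area_sqft"], data["price"])[0, 1])   # correlation between the two columns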
Deep dive into the various libraries that are used for statistical analysis and for building machine learning models.
2. Descriptive Analytics: Learn the concept of descriptive analytics, understand the data, and learn to load structured data and perform Exploratory Data Analysis (EDA). Practice data filtering, ordering, grouping and joining multiple datasets. Handle missing values and prepare visualization plots in 2D or 3D (with libraries like Seaborn and Matplotlib) to find hidden information and insights.
3. Take a break from Python and learn statistics: Learn the concept of the random variable and its important role in analytics. Learn to draw insights from measures of central tendency and dispersion (mean, median, mode, quartiles) and other statistical measures like confidence intervals and distribution functions. The next step is to understand probability, the various probability distributions and their crucial role in analytics. Understand the concept of hypothesis tests such as the t-test, z-test, ANOVA (Analysis of Variance), ANCOVA (Analysis of Covariance), MANOVA (Multivariate Analysis of Variance), MANCOVA (Multivariate Analysis of Covariance) and the chi-square test.
4. Understand the major Machine Learning algorithms
Different algorithms solve different tasks, so it is advisable to understand the context and select the right algorithm for the right task. The main types of ML problem are:
- Classification: pick one of N labels (e.g. predict whether a loan will default or not).
- Regression: predict numerical values (e.g. predict a property price).
- Clustering: group similar examples (e.g. the most relevant documents).
- Association rule learning: infer likely association patterns in data (e.g. if you buy butter you are likely to buy bread); unsupervised.
- Structured output: create complex output (e.g. natural language parse trees, bounding boxes in image recognition).
- Ranking: identify a position on a scale or status (e.g. search result ranking).
A. Regression (prediction): Regression algorithms are used for predicting numeric values, for example property prices, vehicle mileage or stock prices.
B. Linear Regression: predicts a numeric response variable using one or more features (variables). The linear regression model is represented as y = b0 + b1x1 + b2x2 + ... + bnxn + e, where y is the response, the x's are the features, the b's are the coefficients and e is the error term.
Various regression algorithms include: Linear Regression, Polynomial Regression, Exponential Regression, Decision Tree, Random Forest and Neural Network. As a note to new learners, it is suggested to understand the concepts of regression assumptions, the Ordinary Least Squares method, dummy variables (n-1 dummy encoding, one-hot encoding) and performance evaluation metrics (RMSE, MSE, MAD); a minimal scikit-learn sketch for linear regression appears just after this subsection.
Classification: classification algorithms are used for predicting a set of items' classes, i.e. a categorical feature, for example predicting loan default (yes/no) or predicting cancer (yes/no). Various classification algorithms include: Binomial Logistic Regression, Fractional Binomial Regression, Quasibinomial Logistic Regression, Decision Tree, Random Forest, Neural Networks, K-Nearest Neighbors and Support Vector Machines.
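As referenced above, here is a minimal, hypothetical linear regression sketch using scikit-learn; the data is made up purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: house area (sq. ft.) vs. price.
X = np.array([[650], [800], [1200], [1500], [2000]])   # feature matrix, one column
y = np.array([70000, 85000, 130000, 160000, 210000])   # numeric response

model = LinearRegression()
model.fit(X, y)

print("coefficient:", model.coef_[0], "intercept:", model.intercept_)
print("predicted price for 1000 sq. ft.:", model.predict([[1000]])[0])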
Some of the classification algorithms are explained here:
K-Nearest Neighbors: a simple yet frequently used classification algorithm (a minimal sketch appears at the end of this subsection).
- It is a non-parametric algorithm (it does not make any assumption about the underlying data distribution).
- It memorizes the training instances.
- The output is a class membership.
- There are three key elements in this approach: a set of labelled objects (e.g. a set of stored records), a distance between objects, and the value of k, the number of nearest neighbours.
- The distance measure that K-NN typically uses is Euclidean distance (the square root of the sum of the squared differences between the new point and an existing point across all the input attributes). Other distances include Hamming distance, Manhattan distance and Minkowski distance.
As an example of K-NN classification, suppose a test sample (a green dot) should be classified either into a group of blue squares or a group of red triangles. If k = 3 (the inner circle), it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle; in other words, the triangles outnumber the squares. If k = 5 (the outer circle), it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle). Note that to avoid tied votes, the value of k should be odd, not even.
Logistic Regression: a supervised algorithm used for binary classification. The basis for logistic regression is the logit function, also known as the sigmoid function, which takes any real value and maps it between zero and one. In other words, Logistic Regression returns a probability value for the class label. If the output of the sigmoid function is more than 0.5 we can classify the outcome as 1 (YES), and if it is less than 0.5 we classify it as 0 (NO). For instance, in cancer prediction, if the output of the Logistic Regression is 0.75, we can say in terms of probability that "there is a 75 percent chance that the patient will suffer from cancer."
Decision Tree: a type of supervised learning algorithm most commonly used for classification problems. Decision Tree algorithms can also be used for regression problems, i.e. to predict a numerical response variable; in other words, Decision Trees work for both categorical and continuous input and output variables. Each branch node of the decision tree represents a choice between alternatives and each leaf node represents a decision. As an early learner, it is suggested to understand the concepts of the ID3 algorithm, Gini index, entropy, information gain, standard deviation and standard deviation reduction.
Random Forest: a collection of multiple decision trees. It is a supervised learning algorithm that can be used for both classification and regression problems. While algorithms like a single Decision Tree can suffer from overfitting, where a model performs well on training data but not on testing or unseen data, Random Forest helps avoid overfitting. It achieves uncorrelated decision trees through bootstrapping (i.e. sampling with replacement) and feature randomness. As a new learner it is important to understand the concept of bootstrapping.
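As referenced in the K-Nearest Neighbors description above, here is a minimal, hypothetical scikit-learn sketch with k = 3; the points and labels are invented for illustration:

from sklearn.neighbors import KNeighborsClassifier

# Made-up 2D points with two classes: 0 = "blue square", 1 = "red triangle".
X = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, Euclidean distance by default
knn.fit(X, y)

print(knn.predict([[2, 2], [7, 6]]))        # expected: [0 1]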
Support Vector Machine: a supervised learning algorithm used for classification problems. Another flavour of Support Vector Machines (SVM) is the Support Vector Regressor (SVR), which can be used for regression problems. In this approach, each data item is plotted as a point in n-dimensional space, where n is the number of features, and the value of each feature is the value of a particular coordinate. Classification is performed by finding the hyperplane that best differentiates the two classes. It is important to understand the concepts of margin, support vectors, hyperplanes and tuning the hyper-parameters (kernel, regularization, gamma, margin), and to get to know the various types of kernels, such as the linear kernel, the radial basis function kernel and the polynomial kernel.
Naive Bayes: a supervised learning classifier which assumes that features are independent and that there is no correlation between them. The idea behind the Naive Bayes algorithm is the Bayes theorem.
C. Clustering: Clustering algorithms are unsupervised algorithms used for dividing data points into groups such that the data points in each group are similar to each other and very different from those in other groups. Some of the clustering algorithms include:
K-means: an unsupervised learning algorithm in which the items are grouped into k clusters. The elements of a cluster are similar or homogeneous. Euclidean distance is used to calculate the distance between two data points. Each cluster has a centroid that represents the cluster, and the objective is to minimize the intra-cluster variation, or squared error function. (A minimal sketch follows below.)
Other types of clustering algorithms include DBSCAN, Mean Shift and Hierarchical clustering.
D. Association: Association algorithms, which are also unsupervised, associate co-occurring items or events. They are rule-based methods for finding interesting relationships in large sets of data, for example a relationship between products that are bought together, say, people who buy butter also buy bread. Some of the association algorithms are:
Apriori: the most popular algorithm for mining strong associations between variables. To understand how it works, the concepts of Support, Confidence and Lift should be studied.
ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal): another popular algorithm for association problems; it is an enhanced, more efficient version of the Apriori algorithm.
FP Growth (Frequent Pattern Growth): another very efficient and scalable algorithm for mining associations between variables.
E. Anomaly Detection: anomaly detection is recommended for discovering abnormal activities and unusual cases, such as fraud detection. An algorithm that can be used for anomaly detection is Isolation Forest, an unsupervised algorithm that isolates anomalies from huge volumes of data.
F. Sequence Pattern Mining: sequential pattern mining is used for predicting the next data events between data examples in a sequence, for example predicting the next dose of medicine for a patient.
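As referenced under K-means above, a minimal scikit-learn clustering sketch on made-up 2D points might look like this; k = 2 is chosen arbitrarily:

from sklearn.cluster import KMeans

# Made-up 2D points forming two loose groups.
X = [[1, 2], [1, 4], [2, 3], [8, 8], [9, 10], [10, 9]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print("labels:", kmeans.labels_)              # cluster assignment for each point
print("centroids:", kmeans.cluster_centers_)  # one centroid per cluster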
G. Dimensionality Reduction: Dimensionality reduction is used to reduce the dimension of the original data. The idea is to reduce the set of random features by obtaining a set of principal components or features; the key point is that these components retain or represent some meaningful properties of the original data. It can be divided into feature extraction and feature selection. An algorithm that can be used for dimensionality reduction is Principal Component Analysis (PCA), which reduces the number of dimensions or variables in large datasets that have a very high number of variables. Although PCA transforms a very large set of features or variables into a smaller set, it retains most of the information in the dataset. While the reduction of dimensions can come at a cost in model accuracy, the idea is to bring simplicity into the model by reducing the number of variables or dimensions.
H. Recommendation Systems: Recommender systems are used to build recommendation engines. Recommender algorithms are used in various business areas, including online stores that recommend the right product to buyers (like Amazon), content recommendation for online video and music services (like Netflix and Amazon Prime Music) and social media platforms (like Facebook and Twitter). Recommender engines can be broadly categorized into the following types:
- Content-based methods: recommend items to a user based on their profile history; they revolve around the customer's taste and preferences.
- Collaborative filtering methods: these can be further subdivided into two categories. Model-based methods learn user and item interactions from an interaction matrix, while memory-based methods rely on the similarity between users and items.
- Hybrid methods: mix content-based and collaborative filtering approaches.
Examples include movie recommendation systems, food recommendation systems and e-commerce recommendation systems.
5. Choose the algorithm: Several machine learning models can be used in a given context. The model is chosen depending on the data (images, numerical values, text, sound) and its distribution.
6. Train the model: Training the model is the process in which the machine learns from historical data and produces a mathematical model that can be used for prediction. Different algorithms use different computation methods to compute the weights for each of the variables; some algorithms, like Neural Networks, initialize the weights of the variables at random. These weights are the values that shape the relationship between the actual and the predicted values.
7. Evaluate the model with evaluation metrics: The evaluation process comprises understanding the output of the model and evaluating its accuracy. There are various metrics to evaluate model performance: regression problems use metrics like MSE, RMSE, MAD and MAPE, while classification problems use metrics like the Confusion Matrix, Accuracy, Sensitivity (True Positive Rate), Specificity (True Negative Rate), AUC (Area Under the ROC Curve), Kappa value and so on. It is only after evaluation that the model can be improved or fine-tuned to get more accurate predictions. A few more concepts worth knowing for classification are: True Positive, True Negative, False Positive, False Negative, Confusion Matrix, Recall, F1 Score, ROC AUC and Log Loss. For regression, the most commonly used metrics are: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Root Mean Squared Logarithmic Error (RMSLE), Mean Percentage Error (MPE) and Mean Absolute Percentage Error (MAPE). We must know when to use which metric; it depends on the kind of data and the target variable you have. A minimal sketch of computing a few classification metrics follows below.
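As referenced above, here is a minimal sketch of computing classification metrics with scikit-learn; the true and predicted labels are invented for illustration:

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, recall_score

# Made-up true labels and model predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))
print("recall:  ", recall_score(y_true, y_pred))
print("f1 score:", f1_score(y_true, y_pred))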
8. Tweak the model (hyperparameter tuning): With great models comes the great problem of optimizing hyperparameters to build an improved and accurate ML model. Tuning certain parameters, called hyperparameters, is important to ensure improved performance. The hyperparameters vary from algorithm to algorithm, and it is important to learn the hyperparameters for each algorithm.
9. Make predictions: the final nail in the coffin. With all the aforementioned steps followed, one can tackle real-life problems with advanced Machine Learning models.
Steps to remember while building the ML model:
- Data assembling or data collection: generally represents the data in the form of a dataset.
- Data preparation: understanding the problem statement. This includes data wrangling for building or training models, data cleaning, removing duplicates, checking for missing values, data visualization for understanding the relationship between variables, checking for imbalanced (biased) data, and other exploratory data analysis. It also includes splitting the data into train and test sets.
- Choosing the model: the ML model which answers the problem statement; different algorithms serve different purposes.
- Training the model: the idea is to ensure that the prediction is accurate more often than not.
- Model evaluation: an evaluation metric measures the performance of the model. How does the model perform against previously unseen data?
- The train/test splitting ratio: (70:30) or (80:20), depending on the dataset. There is no exact rule to split the data 80:20 or 70:30; it depends on the data and the target variable. Some data scientists use a range of 60% to 80% for training and the rest for testing the model.
- Parameter tuning: ensures improved performance by controlling the model's learning process. The hyperparameters have to be tuned so that the model can optimally solve the machine learning problem. For parameter tuning we either specify a grid of parameters, known as grid search, or randomly select combinations of parameters, known as random search (a minimal sketch follows after this list).
  GridSearchCV: the process of searching for the best combination of parameters over a grid. For instance, n_estimators could be 100, 250, 350 or 500; max_depth could be 2, 5, 11 or 15; and the criterion could be gini or entropy. Though this does not look like a lot of parameters, imagine the scenario when the dataset is very large: the grid search has to run in a loop and calculate the score on the validation set for every combination.
  RandomizedSearchCV: we randomly select combinations of parameters and then calculate the cross-validation score. It computes faster than grid search.
  Note: cross-validation is one of the first and most essential steps when it comes to building ML models. If the cross-validation score is good, we can say that the validation data is representative of the training or real-world data.
- Finally, making predictions: using the test data to see how the model will perform in real-world cases.
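As referenced under parameter tuning above, here is a minimal, hypothetical GridSearchCV sketch over a small random forest grid; the grid values mirror the ones mentioned above, trimmed so the example runs quickly:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A small synthetic dataset purely for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_grid = {
    "n_estimators": [100, 250],
    "max_depth": [2, 5],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validation score:", search.best_score_)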
Conclusion
Python has an extensive catalogue of modules and frameworks. It is fast and less complex, so it saves development time and cost, and it keeps programs readable, particularly for novice users. This makes Python an ideal recipe for Machine Learning. Both Machine Learning and Deep Learning involve complex algorithms and several workflows; when using Python, the developer can worry less about the coding and focus more on finding the solution. It is open source, has an abundance of resources and step-by-step documentation, and has an active community of developers who are open to knowledge sharing and networking. The benefits and the ease of coding make Python the go-to choice for developers.
We saw how Python has an edge over other programming tools, and why knowledge of Python is essential for ML right now. Summing up, we covered the benefits of Python, the way ahead for beginners, and the steps required in a machine learning project. This article can be considered a roadmap to your mastery of Machine Learning.

The Role of Mathematics in Machine Learning

Introduction
Automation and machine learning have changed our lives. From the most technologically savvy people working at leading digital platform companies like Google or Facebook to someone who is just a smartphone user, very few of us have not been impacted by artificial intelligence or machine learning in some form, whether through social media, smart banking, healthcare or even Uber. From self-driving cars, robots, image recognition, diagnostic assessments and recommendation engines to photo tagging and fraud detection, the future of machine learning and AI is bright and full of untapped possibilities.
With the promise of so much innovation and so many path-breaking ideas, anyone remotely interested in futuristic technology may aspire to build a career in machine learning. But how can you, as a beginner, learn about the latest technologies and the diverse fields that contribute to them? You may have heard of many cool-sounding job profiles like Data Scientist, Data Analyst, Data Engineer and Machine Learning Engineer that are not just rewarding monetarily but also allow one to grow as a developer and creator, and to work at some of the most prolific technology companies of our times. But how do you get started if you want to embark on a career in machine learning? What educational background should you pursue, and what skills do you need to learn? Machine learning is a field that encompasses probability, statistics, computer science and algorithms used to create intelligent applications that can glean useful, insightful information from data and turn it into business insights. Since machine learning is all about the study and use of algorithms, it is important that you have a base in mathematics.

Why do I need to learn math?
Math has become part of our day-to-day life; from the time we wake up to the time we go to bed, we use math in every aspect of our lives. But you may wonder about the importance of math in machine learning, and whether and how it can be used to solve real-world business problems. Whatever your goal, whether it is to be a Data Scientist, Data Analyst or Machine Learning Engineer, your primary area of focus should be mathematics. Math is the basic building block for solving business and data-driven applications in real-world scenarios. From analyzing company transactions to understanding how to grow in the market, from making stock predictions to forecasting sales, math is used in almost every area of business. Applications of math are used in many industries, such as retail, manufacturing and IT, to provide a company overview in terms of sales, production, goods intake, wages paid, predictions of market position and much more.

Pillars of Machine Learning
To get a head start and familiarize ourselves with technologies like machine learning, data science and artificial intelligence, we have to understand the basic concepts of math so that we can write our own algorithms and implement existing algorithms to solve real-world problems. There are four pillars of machine learning on which most real-world business problems are solved, and many machine learning algorithms are written using these pillars. They are:
- Statistics
- Probability
- Calculus
- Linear Algebra
Machine learning is all about dealing with data.
We collect the data from organizations or from repositories like Kaggle, UCI and others, and perform various operations on the dataset, like cleaning and processing the data, visualizing it and predicting outputs from it. For all the operations we perform on data, there is one common foundation that helps us achieve all of this through computation, and that is math.

Statistics
Statistics is used for drawing conclusions from data. It deals with the methods of collecting, presenting, analyzing and interpreting numerical data. Statistics plays an important role in machine learning because it deals with large amounts of data and is a key factor behind the growth and development of an organization. The typical stages are:
- Collection: data can come from a census, samples, or primary or secondary data sources. This stage helps us identify our goals for the steps that follow.
- Cleaning: the data that is collected contains noise, improper values, null values, outliers and so on. We need to clean the data and transform it into meaningful observations.
- Presentation: the data should be represented in a suitable and concise manner. This is one of the most crucial steps, as it helps us understand the insights and serves as the foundation for further analysis.
- Analysis: analysis of data includes condensation, summarization and drawing conclusions through measures of central tendency, dispersion, skewness, kurtosis, correlation, regression and other methods.
- Interpretation: this step involves drawing conclusions from the collected data, since the figures do not speak for themselves.
Statistics used in machine learning is broadly divided into two categories, based on the type of analysis performed on the data: descriptive statistics and inferential statistics.
a) Descriptive statistics
- Concerned with describing and summarizing the target population.
- Works on a small dataset.
- The end results are often shown as pictorial representations.
- The tools used are mean, median and mode (measures of central tendency) and range, standard deviation, variance and so on (measures of variability).
b) Inferential statistics
- Methods for making decisions or predictions about a population based on sample information.
- Works on a large dataset.
- Compares, tests and predicts future outcomes.
- The end results are shown as probability scores.
- The specialty of inferential statistics is that it draws conclusions about the population beyond the data available.
- Hypothesis tests, sampling distributions, Analysis of Variance (ANOVA) and so on are the tools used in inferential statistics.
Statistics plays a crucial role in machine learning algorithms. The role of a Data Analyst in industry is to draw conclusions from data, and for this he or she requires and depends on statistics.

Probability
The word probability denotes the chance of a certain event happening, that is, the likelihood of the occurrence of that event based on past experience. In machine learning, it is used for predicting the likelihood of future events. The probability of an event is calculated as
P(Event) = Favorable Outcomes / Total Number of Possible Outcomes
In probability, an event is a set of outcomes of an experiment; P(E) represents the probability of the event E occurring. The probability of any event lies between 0 and 1. A situation in which the event E might or might not occur is called a trial.
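To make the formula concrete, here is a small sketch that checks the exact probability against a quick simulation; the fair six-sided die is a made-up example:

import random

# Probability of rolling an even number on a fair six-sided die.
favorable = 3          # {2, 4, 6}
total = 6
print("exact P(even):", favorable / total)   # 0.5

# Estimate the same probability by simulating many trials.
trials = 100_000
hits = sum(1 for _ in range(trials) if random.randint(1, 6) % 2 == 0)
print("simulated P(even):", hits / trials)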
Some of the basic concepts required in probability are as follows:
Joint probability: P(A ∩ B) = P(A) · P(B). This form of the rule holds only when the events A and B are independent of each other.
Conditional probability: the probability of event A happening when it is known that another event B has already happened, denoted by P(A|B), i.e. P(A|B) = P(A ∩ B) / P(B).
Bayes theorem: this refers to the application of probability theory to estimating unknown probabilities and making decisions on the basis of new sample information. It is useful for solving business problems in the presence of additional information. The reason behind the popularity of this theorem is its usefulness in revising a set of old probabilities (the prior probability) with additional information to derive a set of new probabilities (the posterior probability). It is usually written as P(A|B) = P(B|A) · P(A) / P(B). From this equation it can be seen that Bayes theorem explains the relationship between the conditional probabilities of events. The theorem works mainly on uncertain samples of data and is helpful in determining the specificity and sensitivity of data, and it plays an important role in drawing the confusion matrix.
A confusion matrix is a table-like structure that measures the performance of the machine learning models or algorithms that we develop. It is helpful for determining the true positive rate, true negative rate, false positive rate, false negative rate, precision, recall, F1-score, accuracy and specificity, and for drawing the ROC curve from the given data.
We then need to focus further on probability distributions, which are classified as discrete and continuous, likelihood estimation functions and so on. In machine learning, the Naive Bayes algorithm works in this probabilistic way, with the assumption that the input features are independent.
Probability is an important area in most business applications, as it helps in predicting future outcomes from the data and deciding on further steps. Data Scientists, Data Analysts and Machine Learning Engineers use probability very often, as their job is to take inputs and predict the possible outcomes.

Calculus
Calculus is the branch of mathematics that studies rates of change of quantities. In machine learning it deals with optimizing the performance of models and algorithms. Without understanding calculus it is difficult to compute probabilities on the data and to derive the possible outcomes from the data we take. Calculus is mainly focused on integrals, limits, derivatives and functions, and is divided into two branches, differential calculus and integral calculus. It is used in the backpropagation algorithm to train deep neural networks.
- Differential calculus splits the given quantity into small pieces to find how it changes.
- Integral calculus joins the small pieces to find how much there is in total.
Calculus is mainly used in optimizing machine learning and deep learning algorithms and helps develop fast and efficient solutions. The concepts of calculus are used in algorithms like Gradient Descent and Stochastic Gradient Descent (SGD) and in optimizers like Adam, RMSprop and Adadelta. Data Scientists mainly use calculus when building deep learning and machine learning models; it underlies the optimization that draws out the intelligent insights hidden in the data.
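As a tiny illustration of how calculus drives optimization, here is a hedged sketch of gradient descent minimizing f(x) = (x - 3)^2 using its derivative f'(x) = 2(x - 3); the function and learning rate are made up purely for illustration:

# Gradient descent on f(x) = (x - 3)^2, whose derivative is f'(x) = 2 * (x - 3).
def gradient(x):
    return 2 * (x - 3)

x = 0.0              # arbitrary starting point
learning_rate = 0.1

for step in range(50):
    x -= learning_rate * gradient(x)   # move against the gradient

print("estimated minimum at x =", round(x, 4))   # approaches 3.0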
Linear Algebra
Linear algebra focuses more on computation. It plays a crucial role in understanding the background theory behind machine learning and is also used for deep learning. It gives us better insight into how the algorithms really work in day-to-day life and enables us to take better decisions. It mostly deals with vectors and matrices:
- A scalar is a single number.
- A vector is an array of numbers arranged in a row or column, accessed with a single index.
- A matrix is a 2D array of numbers, accessed with two indices (rows and columns).
- A tensor is an array of numbers placed in a grid in a particular order, with a variable number of axes.
The NumPy package in the Python ecosystem is used for computing all of these numerical operations on data. NumPy carries out the basic operations, such as addition, subtraction, multiplication and division of vectors and matrices, and represents data as N-dimensional arrays. Machine learning models could not be developed, complex data structures could not be manipulated, and operations on matrices could not be performed without linear algebra; all the results of the models are ultimately expressed through it. Machine learning algorithms like linear regression, logistic regression, SVM and decision trees use linear algebra internally, and with the help of linear algebra we can build our own ML algorithms. Data Scientists and Machine Learning Engineers work with linear algebra when building their own algorithms for data.

How do Python functions correlate to mathematical functions?
So far, we have seen the importance of mathematics in machine learning. But how do mathematical functions correlate to Python functions when building a machine learning algorithm? The answer is quite simple. In Python, we take the data from our dataset and apply many functions to it. The data can come in different forms: characters, strings, integers, float values, double values, Boolean values, special characters, garbage values and so on. But the computer understands only zeroes and ones, so whatever we feed into our machine learning model from the dataset, the computer ultimately treats as binary.
Python libraries like NumPy, SciPy and Pandas provide pre-defined functions that help us apply mathematical operations and get better insights into the data from the dataset we take. They let us work on different types of data for processing and extracting information, and they help us clean the garbage values, noise and null values from the data, finally freeing the dataset of all unwanted content. Once the data is preprocessed with these Python functions, we can apply our algorithms to the dataset to see which model works better for the data, and we can compare the accuracies of different algorithms on our dataset. Mathematical functions also help us visualize the content of the dataset and get a better understanding of the data and the problem we are addressing with a machine learning algorithm. Every algorithm that we use to build a machine learning model has math functions hidden in it, in the form of Python code (a small NumPy illustration follows below).
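For instance, the vector and matrix operations described under Linear Algebra map directly onto NumPy calls; this is a minimal sketch with made-up values:

import numpy as np

v = np.array([1, 2, 3])                 # a vector (1D array)
M = np.array([[1, 0], [0, 2], [3, 1]])  # a 3x2 matrix (2D array)

print(v + v)          # element-wise addition -> [2 4 6]
print(2 * v)          # scalar multiplication -> [2 4 6]
print(v @ M)          # vector-matrix product -> [10  7]
print(M.T)            # transpose of the matrix (2x3)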
The algorithms we develop can be used to solve a variety of problems, from a Boolean problem to a matrix problem like identifying a face in a crowd of people, and much more. The final stage is to find the algorithm that best suits the model, and this is where the mathematical functions available in Python help us: they let us compare algorithms through measures like correlation, F1 score, accuracy, specificity and sensitivity. Mathematical functions also help us find out whether the selected model is overfitting or underfitting the data.
To conclude, we cannot apply mathematical functions directly when building machine learning models; we need a language in which to implement the mathematical strategies in the algorithm. This is why we use Python to implement our math models and draw better insights from the data. Python is well suited to implementations of this type and is considered one of the best languages for solving real-world problems and implementing new techniques and strategies in ML and data science.

Conclusion
For machine learning enthusiasts and aspirants, mathematics is a crucial aspect to focus on, and it is important to build a strong foundation in math. Each and every concept you learn in machine learning, and every small algorithm you write or implement to solve a problem, relates directly or indirectly to mathematics. The math concepts used in machine learning build on the basic math we learn in the 11th and 12th grades: we gain the theoretical knowledge at that stage, and in machine learning we experience the practical use cases of the math we studied earlier. The best way to get familiar with the concepts is to take a machine learning algorithm, find a use case, and solve it while understanding the math behind it. An understanding of math is paramount for coming up with machine learning solutions to real-world problems, and a thorough knowledge of math concepts also enhances our problem-solving skills.

What Is Data Splitting in Learn and Test Data?

Data is the fuel of every machine learning algorithm: statistical inferences are made and predictions are produced from it. Consequently, it is important to collect the data, clean it, and use it with maximum efficacy. Good data sampling can lead to accurate predictions and drive the whole ML project forward, whereas bad data sampling can lead to incorrect predictions.
Before diving into the sampling techniques, let us understand what a population is and how it differs from a sample. The population is the collection of elements that share some common characteristics, and the total number of observations is the size of the population. A sample is a subset of the population; the process of choosing a sample from a given population is known as sampling, and the number of elements in the sample is the sample size.
Data sampling refers to statistical approaches for picking observations from the domain in order to estimate a population parameter, whereas data resampling refers to drawing repeated samples from the main or original source of data. Resampling is a non-parametric procedure of statistical inference: it produces new sample distributions based on the original data and is used to improve accuracy and to measure the uncertainty of a population parameter.
Sampling methods can be divided into two parts:
- the probability sampling procedure, and
- the non-probability sampling procedure.
The distinction between the two is whether the selection of elements depends on randomization; with randomization, each element gets an equal chance of being included in the sample under study.
Probability sampling: a method in which each element of a given population has an equal chance of being selected.
- Simple random sampling: for instance, a classroom has 100 students and each student has an equal chance of being selected as the class representative.
- Systematic sampling: a technique in which the first element is selected at random and the others are selected at a fixed sampling interval. For instance, consider a population of size 20 (1, 2, 3, ..., 19, 20); suppose we start from the element numbered 3 and want a sample of size 5. The next selection is made at an interval of 20/5 = 4, so 3 + 4 = 7, giving 3, 7, 11 and so on. (A small sketch of simple random and systematic sampling follows after this list.)
- Stratified sampling: the total group is subdivided into smaller groups, known as strata, before sampling. Assume that we need to estimate the average number of votes across three different cities to elect a representative: city x has 1 million citizens, city y has 2 million and city z has 3 million. We could randomly choose a sample of size 60 from the entire population, but such a random sample would not be balanced with respect to the different cities, so there could be an estimation error. To overcome this, we may instead choose random samples of 10, 20 and 30 from cities x, y and z respectively, thereby minimizing the total estimation error.
- Reservoir sampling: a randomized algorithm used to select k out of n samples, where n is generally very large or unknown. For instance, reservoir sampling can be used to select k fish out of the unknown number of fish in a lake.
- Cluster sampling: samples are taken as subgroups (clusters) of the population, and these subgroups are selected at random.
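As referenced above, here is a minimal sketch of simple random and systematic sampling over the population 1..20 used in the examples; the seed is arbitrary:

import random

population = list(range(1, 21))   # the population 1..20 from the examples above
random.seed(42)                   # arbitrary seed so the run is repeatable

# Simple random sampling: every element has an equal chance of selection.
simple_random = random.sample(population, k=5)

# Systematic sampling: start at element 3 and take every 4th element (interval 20/5 = 4).
systematic = population[2::4]     # index 2 holds the value 3

print("simple random sample:", simple_random)
print("systematic sample:   ", systematic)   # [3, 7, 11, 15, 19]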
Non-probability sampling – in a non-probability sampling method, each instance of the population does not have an equal chance of being selected. There is a risk of ending up with a non-representative sample that does not give a comprehensive picture of the population.
- Convenience sampling – includes people or samples that are easy to reach. Though it is the easiest way to collect a sample, it runs a high risk of not being representative of the population. For instance, in a population of size 20, the surveyor may want persons 4, 7, 11 and 18 to participate, which can create selection bias.
- Quota sampling – the samples are chosen on the basis of traits or characteristics that match the population. For instance, in a population of size 20, one could take a quota made of the multiples of 4: (4, 8, 12, 16, 20).
- Judgement sampling – also known as selective sampling; individuals are invited to participate based on the researcher's judgement.
- Snowball sampling – an individual element/person nominates further elements/people known to them, for example A nominates P, P nominates G, G nominates M (A > P > G > M). It is applicable when the sampling frame is difficult to identify.

Non-probability sampling techniques may lead to selection bias and population misrepresentation.

We often come across imbalanced datasets. Resampling is a technique used to deal with class imbalance: it includes removing samples from the majority class (undersampling) and adding more instances of the minority class (oversampling). There is a dedicated Python library for imbalanced datasets called imblearn, which provides multiple methods for undersampling and oversampling.

Tomek links for undersampling – Tomek links are pairs of examples from opposite classes that are each other's closest instances. The majority-class element of each Tomek link is removed, which intuitively gives the ML classifier a cleaner decision boundary.

SMOTE for oversampling – the Synthetic Minority Oversampling Technique works by generating new examples of the minority class. It is a statistical technique for increasing the number of minority instances so that the dataset becomes more balanced:
- Pick a minority-class instance as the input vector.
- Find its k nearest neighbors (k_neighbors is an argument of SMOTE()).
- Pick one of these neighbors and place a synthetic point anywhere on the line joining the point under consideration and its chosen neighbor.
- Repeat the above steps until the classes are balanced.

Other sampling methods worth reading about are NearMiss and cluster centroids for undersampling, and ADASYN and borderline-SMOTE for oversampling.
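To see these resampling methods in action, here is a hedged sketch using the imblearn library; the synthetic dataset from make_classification and the 90:10 class weights are chosen purely for illustration.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

# A toy imbalanced dataset: ~90% majority class, ~10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:", Counter(y))

# Oversample the minority class with SMOTE (k_neighbors controls the neighbourhood size).
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("after SMOTE:", Counter(y_sm))

# Undersample by removing the majority member of each Tomek link.
X_tl, y_tl = TomekLinks().fit_resample(X, y)
print("after TomekLinks:", Counter(y_tl))
```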
Train-test split: Python is bundled with a powerful ML ecosystem. The train_test_split() function from the scikit-learn library is one of the main Python utilities for splitting a dataset, randomly, into training and test subsets. The parameter train_size takes a fraction between zero and one to specify the size of the training set; the remaining samples of the original dataset are used for testing. The records selected for the training and test sets are sampled at random. (Some tutorials define their own split_train_test() helper, which behaves in much the same way.)

train set – the subset of the dataset used to train the model.
test set – the subset of the dataset used to test the trained model.

The train-test split is used to measure the performance of ML algorithms and is appropriate when the dataset is large. It can be applied to any supervised machine learning problem. The procedure involves taking the dataset as a whole and subdividing it into two subsets: the training dataset is used to fit the model, the test dataset serves as input to the fitted model, the model's predictions on the test data are compared with the expected values, and the ultimate objective is to evaluate how the model performs on new or unseen data.

The test data should adhere to the following conditions:
- Be large enough to yield statistically significant results.
- Be representative of the whole dataset; one must not pick a test set whose traits/characteristics differ from those of the training set.
- Never train on test data – don't be fooled by good results and high accuracy; it may be that the model was accidentally trained on the test data.

train_test_split() comes with additional features: the random_state parameter seeds the random number generator, which fixes which samples go to the training set and which go to the test set, and the function can take multiple datasets with a matching number of rows and split them on the same indices.

train_test_split() returns four variables (a complete call is sketched at the end of this article):
- train_X – the X features of the training set
- train_y – the values of the response variable for the training set
- test_X – the X features of the test set
- test_y – the values of the response variable for the test set

There is no exact rule to split the data 80:20 or 70:30; it depends on the data and the target variable. Many data scientists use between 60% and 80% of the data for training and the rest for testing. To find the number of records in each split, we can use Python's len() function: len(X_train), len(X_test). The model is built using the training set and evaluated using the test set: X_train and y_train contain the independent features and the response variable values of the training dataset, while X_test and y_test contain those of the test dataset.

Conclusion: Sampling is the process of accumulating observations in order to estimate a population parameter. We learnt about the two families of sampling – probability and non-probability sampling procedures – and that resampling is a repeated process of drawing samples from the main data source. Finally, we learnt about training, testing and splitting the data, which are used to measure the performance of a model, understand data discrepancies and develop a better understanding of the machine learning model.
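Putting the pieces of this article together, here is a minimal sketch of the split described above; the iris dataset and the 80:20 ratio are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X holds the features, y the response variable.
X, y = load_iris(return_X_y=True)

# 80% of the rows go to training, 20% to testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

print(len(X_train), len(X_test))   # 120 and 30 for the 150-row iris dataset
```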

Data Preparation for Machine Learning Projects

The data we collect for machine learning must be pre-processed before it can be used to fit a model. Data preparation is essentially the task of modifying raw data into a form that can be used for modelling, mostly through data addition, deletion or other transformation techniques.

We need to pre-process the data before feeding it into any algorithm mainly for the following reasons:
- Messy data – real-world data is messy, with missing values, redundant values, out-of-range values, errors and noise.
- Machine learning algorithms need numeric data.
- More often than not, algorithms have requirements on the input data; for example, some algorithms assume a certain probability distribution of the data, others might perform worse if the predictor variables are highly correlated, and so on.

Data preparation tasks mostly depend on the dataset we are working with and, to some extent, on the choice of model. They become more evident after initial analysis of the data and EDA: looking at the summary statistics we know whether predictors need to be scaled, looking at the correlation matrix we can find out whether there are highly correlated predictors, and looking at plots such as boxplots we can find out whether outliers need to be dealt with, and so on. Even though every dataset is different, we can define a few common steps which can guide us in preparing the data to feed into our learning algorithms. Some common tasks that contribute to data pre-processing are:
- Data Cleaning
- Feature Selection
- Data Transformation
- Feature Engineering
- Dimensionality Reduction

Note: Throughout this article, we will refer to Python libraries and syntax.

Data Cleaning: This can be summed up as the process of correcting errors in the data. Errors could be in the form of missing values, redundant rows or columns, variables with zero or near-zero variance and so on. Data cleaning thus involves a few or all of the sub-tasks below.

Redundant samples or duplicate rows should be identified and dropped from the dataset. In Python, Pandas functions such as duplicated() can be used to identify such samples and drop_duplicates() can be used to drop them.

Redundant features: If the dataset has features which are highly correlated, it may lead to multicollinearity (unstable regression coefficient estimates). Such columns can be identified using the correlation matrix, and one of each pair of highly correlated features should be dropped. Similarly, zero or near-zero variance features, which have (almost) the same value for all samples, do not contribute to the variance in the data; such columns should also be identified and dropped.

Outlier detection: Outliers are extreme values which fall far away from the other observations. Outliers can skew the descriptive statistics of the data, mislead data interpretation and negatively impact model performance, so it is important that they are detected and dealt with. Outliers can be detected through data visualization techniques like box plots and scatter plots, where extreme points show up as isolated markers beyond the whiskers or the main cloud. Outliers can also be detected by computing z-scores or the inter-quartile range. When using z-scores, a data point which is more than 3 standard deviations away from the mean is normally considered an outlier, although this may vary with the size of the dataset.
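To make the duplicate-removal and z-score checks concrete, here is a small hedged sketch; the salary column and the injected outlier value are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# ~200 roughly normal salaries with one injected extreme value.
salaries = rng.normal(loc=50_000, scale=5_000, size=200)
salaries[10] = 500_000
df = pd.DataFrame({"salary": salaries})

# Duplicate rows can be inspected with duplicated() and removed with drop_duplicates().
df = df.drop_duplicates()

# z-score rule: flag points more than 3 standard deviations from the mean.
z = (df["salary"] - df["salary"].mean()) / df["salary"].std()
print(df[z.abs() > 3])   # the injected 500,000 row should be the only one flagged
```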
When using the inter-quartile range instead, a point which lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier, where Q1 is the first quartile, Q3 is the third quartile and IQR = Q3 - Q1.

If there are only a few outliers, you may choose to drop the samples that contain them; if there are too many, they can be modelled separately. We may also choose to cap or floor the outlier values at the 95th or 5th percentile, although an appropriate replacement value is best chosen by analyzing the deciles of the data.

Missing values: Data with missing values cannot be used for modelling, so any missing values should be identified and cleaned. If the data in a predictor or sample is sparse, we may choose to drop the entire column/row; otherwise we may impute the missing value with the mean or median. Missing values in categorical variables can be replaced with the most frequent class.

Points to remember: Use the z-score for outlier detection if the data follows a Gaussian distribution; otherwise use the inter-quartile range.

Feature Selection: Datasets sometimes have hundreds of input variables, not all of which are good predictors of the target; some may simply contribute noise. Feature selection techniques are used to find the input variables that most efficiently predict the target variable, thereby reducing the number of input variables. They can be classified as supervised or unsupervised selection techniques. As the name suggests, unsupervised selection techniques do not consider the target variable while eliminating input variables; examples include using correlation to eliminate highly correlated predictors or eliminating low-variance predictors. Supervised feature selection techniques do consider the target variable when selecting the features to eliminate, and can be further divided into three groups: intrinsic, filter and wrapper techniques.

Intrinsic – the feature selection process is embedded in the model building process itself, e.g. tree-based algorithms which pick the best predictor for each split. Similarly, regularization techniques like the lasso shrink the coefficients of the predictors, possibly all the way to zero for some predictors, which are thereby excluded from the model. Multivariate adaptive regression splines (MARS) models also fall under this category. A major advantage of such methods is that, since feature selection is part of the model building process, they are relatively fast. However, model dependence can also be a disadvantage; e.g. some tree-based algorithms are greedy and may select predictors that lead to a sub-optimal fit.

Filter – filter-based selection techniques use a statistical measure to score each predictor separately against the target variable and choose the predictors with the highest scores. It is mostly univariate analysis, i.e. each predictor is evaluated in isolation, and it does not consider the correlation of the independent variables amongst themselves. Based on the types of the input and output variables (numerical or categorical), an appropriate statistical measure can be used to evaluate predictors for feature selection, for example Pearson's correlation coefficient, Spearman's correlation coefficient, ANOVA or chi-square.
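As a small illustration of the filter approach, here is a hedged sketch using scikit-learn's SelectKBest with the ANOVA F-test; the iris dataset and k=2 are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the target with the ANOVA F-test and keep the 2 best.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # per-feature univariate scores
print(X_selected.shape)        # (150, 2) - only the two highest-scoring features remain
```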
Wrapper – wrapper feature selection builds models iteratively using various subsets of predictors and evaluates each model until it finds the subset of features that best predicts the target. These methods are agnostic to the type of variables but are computationally more expensive. RFE is a commonly used wrapper-based method: Recursive Feature Elimination is a greedy backward elimination technique which starts with the complete set of predictors and systematically eliminates the less useful ones until it reaches the specified number of predictors that best predict the target variable. Two important hyperparameters of the RFE implementation in scikit-learn are the number of predictors to keep (n_features_to_select) and the algorithm of choice (estimator).

Points to remember: Feature selection techniques reduce the number of features by excluding or eliminating existing features from the dataset, whereas dimensionality reduction techniques create a projection of the data into a lower-dimensional feature space which does not have a one-to-one mapping with the existing features. Both, however, share the goal of reducing the number of independent variables.

Data Transformations: We may need to transform data to change its type, scale or distribution.

Type: We need to analyze the input variables at the very beginning to understand whether the predictors are represented with the appropriate data type, and do the required conversions before progressing with EDA and modelling. For example, Boolean values are sometimes encoded as true and false, and we may transform them to take the values 0 and 1. Similarly, we may come across integer variables which are better treated as categorical: when working on a dataset to predict car prices, it would be more appropriate to treat the variable 'Number of doors', which takes the values {2, 4}, as a categorical variable. Categorical variables should be converted to numeric before they can be used for modelling. There are many categorical encoding techniques, such as N-1 dummy encoding, one-hot encoding, label encoding and frequency encoding. Ordinal encoding can be used when we want to specify and preserve the order of an ordinal variable.

Scale: Predictor variables may have different units (km, $, years etc.) and hence different scales. For example, a dataset might contain input variables like age and salary; the scale of salary will always be much higher than that of age, so it may contribute unequally to the model and create a bias. Hence we transform the predictors to bring them to a common scale. Normalization and standardization are the most widely used scaling techniques.

Normalization scales the data so that all values lie in the range 0 to 1 (the scikit-learn implementation even allows you to specify a preferred range).

Standardization centres the data around the mean and then scales it by the standard deviation: the mean of the variable is subtracted from each value and the difference is divided by the standard deviation, so the resulting data has zero mean and unit standard deviation. Standardization assumes that the data follows a Gaussian distribution. The scikit-learn library can be used for normalization (MinMaxScaler()) and standardization (StandardScaler()).
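A small sketch of the two scaling techniques with scikit-learn; the age and salary values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two predictors on very different scales: age (years) and salary ($).
X = np.array([[25, 40_000],
              [32, 52_000],
              [41, 61_000],
              [29, 48_000]], dtype=float)

# Normalization: rescale each column into the [0, 1] range.
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: zero mean and unit standard deviation per column.
X_std = StandardScaler().fit_transform(X)

print(X_norm.round(2))
print(X_std.round(2))
```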
Distribution: Many algorithms assume a Gaussian distribution for the underlying data. If the data is not Gaussian, or is only Gaussian-like, we can transform it to reduce the skewness. The Box-Cox or Yeo-Johnson transforms can be used to perform power transformations on the data. The Box-Cox transform applies a different transformation depending on the value of lambda: for lambda = -1 it is the inverse (reciprocal) transformation, for lambda = 0 the log transformation, for lambda = 0.5 the square-root transformation and for lambda = -0.5 the reciprocal square-root transformation. The PowerTransformer() class in scikit-learn can be used for these power transformations.

Points to remember: Data transformations should be fitted on the training dataset, so that the statistics required for the transformation are estimated from the training set only and then applied to the validation set. Decision trees and other tree-based ensembles, like random forests and boosting algorithms, are not affected by the scale of the input variables, so scaling may not be required for them. Linear regression and neural networks, which use a weighted sum of the input variables, and K-nearest neighbors or SVMs, which compute distances or dot products between predictors, are affected by the scale of the predictors, so input variables should be scaled for these models. Between normalization and standardization, standardize when the data follows a Gaussian distribution, otherwise normalize.

Feature Engineering is the part of data pre-processing where we derive new features from one or more existing features. For example, when working on a taxi fare prediction problem, we may derive a new feature, distance travelled in the ride, from the latitude and longitude coordinates of the start and end points of the ride. Or, when predicting sales or footfall for a retail business, we may need to add a new feature to capture the impact of holidays, weekends and festivals on the target variable. We engineer such predictors and feed them into the model so that it can identify the underlying patterns more effectively.

Polynomial terms: We may add new features by raising existing input variables to a higher-degree polynomial. Polynomial terms help the model learn non-linear patterns; when polynomial terms of existing features are added to a linear regression model, it is called polynomial regression. Usually we stick to a small degree of 2 or 3.

Interaction terms: We may add new features that represent the interaction between existing features by taking the product of two features. For example, if we are helping a business allocate its marketing budget between mediums like radio, TV and newspaper, we need to model how effective each medium is; we may add an interaction term for radio and newspaper campaigns to understand the effectiveness of marketing when both campaigns run at the same time. Similarly, when predicting crop yield, we may engineer an interaction term for fertilizer and water to capture how the yield varies when both are provided together.
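Here is a hedged sketch of generating polynomial/interaction terms and applying a power transform with scikit-learn; the radio and newspaper columns are invented, and get_feature_names_out assumes a reasonably recent scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, PowerTransformer

# Two toy predictors, e.g. radio and newspaper spend.
X = np.array([[10.0, 3.0],
              [20.0, 5.0],
              [30.0, 8.0]])

# Degree-2 polynomial expansion adds squared terms and the radio*newspaper interaction.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out(["radio", "newspaper"]))  # degree-1 terms, squares, interaction

# A Yeo-Johnson power transform reduces skewness in the original columns.
X_pt = PowerTransformer(method="yeo-johnson").fit_transform(X)
```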
Points to remember: When using polynomial terms in the model, it is good practice to restrict the degree of the polynomial to 3 or at most 4. This is firstly to control the number of input variables, and secondly because a larger polynomial degree produces very large feature values, which can inflate the weights (parameters) and make the model unstable with respect to small changes in the input. Domain knowledge, or the advice of a subject matter expert, comes in handy when identifying effective interaction terms.

Dimensionality Reduction: Sometimes data has hundreds or even thousands of features. High-dimensional data is more complicated to work with: there are many more parameters to train and the model structure becomes very complicated. In higher dimensions the volume of the space is huge and the data points become sparse, which can negatively impact the performance of a machine learning algorithm; this is sometimes referred to as the curse of dimensionality. Dimensionality reduction techniques are used to reduce the number of predictor variables in the dataset. Some of these techniques are:

PCA, or Principal Component Analysis, uses linear algebra and eigenvalue decomposition to achieve dimensionality reduction. For the given data points, PCA finds an orthogonal set of directions that capture maximum variance; by rotating the reference frame, it identifies the directions (those corresponding to the smallest eigenvalues) which can be neglected.

Manifold learning is a family of non-linear dimensionality reduction techniques which use the geometric properties of the data to create low-dimensional projections of high-dimensional data, preserving its structure and relationships, and to visualize high-dimensional data that is otherwise hard to inspect. Self-Organizing Maps (SOM, also called Kohonen maps) and t-SNE are examples of manifold learning techniques. t-distributed stochastic neighbor embedding (t-SNE) computes the probability that pairs of data points in the high-dimensional space are related and maps them into a low-dimensional space such that the data has a similar distribution.

Autoencoders are deep learning neural networks that learn a low-dimensional representation of a given dataset in an unsupervised manner. The hidden layer is limited to fewer neurons, so the network learns to map the high-dimensional input vector into a low-dimensional vector while still preserving the underlying structure and relationships in the data. Autoencoders have two parts: an encoder, which maps the high-dimensional vector to a low-dimensional space, and a decoder, which maps the data back from the low to the high dimension. The output of the encoder, with reduced dimension, can be fed into any other model for supervised learning.

Points to remember: Dimensionality reduction is mostly performed after data cleaning and data scaling. It is imperative that the dimensionality reduction applied to the training dataset is also applied to the validation set and to any new data on which the model will predict.

Conclusion: Data preparation is an important and integral step of machine learning projects. There are multiple techniques for the various data cleaning tasks, but there are no universally best or worst ones: every machine learning problem is unique, and so is the underlying data. We need to apply different techniques and see what works best for the data and the problem at hand.
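To close, a brief sketch of dimensionality reduction in practice: PCA applied to scaled data. The iris dataset and the choice of two components are arbitrary and only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Scale first (PCA is sensitive to scale), then project onto 2 principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance captured by each component
```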