# Introduction to Principal Component Analysis (PCA) in Python


Python is no longer an unfamiliar word for professionals in the IT or web design world. It is one of the most widely used programming languages because of its versatility and ease of use. It supports object-oriented as well as functional and aspect-oriented programming, and Python extensions add a whole new dimension to the functionality it supports. The main reasons for its popularity are its easy-to-read syntax and its value for simplicity. Python can also be used as a glue language to connect components of existing programs and provide a sense of modularity.

## Introducing Principal Component Analysis with Python

### 1. Principal Component Analysis definition

Principal Component Analysis is a method that is used to reduce the dimensionality of large amounts of data. It transforms many variables into a smaller set without sacrificing the information contained in the original set, thus reducing the dimensionality of the data.

PCA in Python is often used in machine learning, as it is easier for machine learning software to analyse and process smaller sets of data and variables. But this comes at a cost: condensing a larger set of variables into a smaller one sacrifices some accuracy for simplicity. The aim is to preserve as much information as possible while reducing the number of variables involved.

The first step in Principal Component Analysis in Python is standardisation, that is, standardising the ranges of the initial variables so that they contribute equally to the analysis. This prevents variables with larger ranges from dominating those with smaller ranges.
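The standardisation step can be sketched in a few lines of NumPy. The sample matrix below is hypothetical, chosen only to illustrate the transformation:

```python
import numpy as np

# A small sample matrix: rows are observations, columns are variables
# (hypothetical values, e.g. height in cm and weight in kg).
X = np.array([[170.0, 65.0],
              [160.0, 55.0],
              [180.0, 75.0],
              [175.0, 70.0]])

# Standardise each column: subtract its mean and divide by its
# standard deviation, so every variable contributes on the same scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Each standardised column now has mean ~0 and standard deviation ~1.
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```

After this step, a variable measured in thousands no longer drowns out one measured in fractions.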

The next step involves matrix computation: checking whether there is any relationship between variables, which reveals whether they contain redundant information. To identify this, the covariance matrix is computed.
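A minimal sketch of this step, using synthetic data so that the redundancy is known in advance: two of the three variables below are deliberately correlated, and the covariance matrix exposes this through a large off-diagonal entry.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated variables and one independent one (synthetic data).
a = rng.normal(size=200)
b = 2 * a + rng.normal(scale=0.1, size=200)   # strongly correlated with a
c = rng.normal(size=200)                      # unrelated to both

X = np.column_stack([a, b, c])
# np.cov expects variables in rows, so transpose the data matrix.
cov = np.cov(X.T)

# Large off-diagonal entries reveal redundant (correlated) variables;
# cov[0, 1] is large, cov[0, 2] and cov[1, 2] are near zero.
print(cov.round(2))
```

Variables whose covariance is far from zero carry overlapping information, which is exactly what PCA will later consolidate.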

The next step is determining the principal components of the data. Principal components are new variables formed as mixtures of the initial variables. Unlike the initial variables, the principal components are uncorrelated. They follow a descending order: the procedure tries to put as much information as possible into the first component, the bulk of the remainder into the second, and so on. This makes it possible to discard components with low information and effectively reduces the number of variables. The trade-off is that the principal components lose the interpretable meaning of the initial variables.

Further steps include computing the eigenvalues of the covariance matrix and discarding the components with smaller eigenvalues, as these carry less significance. The remaining eigenvectors form a matrix that can be called the feature vector; it effectively reduces the dimensions, since only the eigenvectors we keep define the new space. The last step involves reorienting the data from the original axes by recasting it along the axes formed by the principal components.
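The eigendecomposition, sorting, and projection steps above can be sketched with NumPy (a minimal illustration on synthetic data, not a production implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:, 1] += 2 * X[:, 0]        # introduce correlation so PCA has work to do

# Standardise, then compute the covariance matrix.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std.T)

# Eigendecomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort by descending eigenvalue and keep the top k eigenvectors:
# this is the "feature vector" described in the text.
order = np.argsort(eigvals)[::-1]
k = 2
feature_vector = eigvecs[:, order[:k]]

# Recast the data along the axes formed by the principal components.
X_pca = X_std @ feature_vector
print(X_pca.shape)  # (100, 2)
```

The projected data has the same number of rows but only k columns, and those columns are uncorrelated with each other.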

### 2. Objectives of PCA

The objectives of Principal Component Analysis are the following:

• Find and Reduce the Dimensionality of a Data Set

As shown above, Principal Component Analysis is a helpful procedure for reducing the dimensionality of a data set by lowering the number of variables to keep track of.

• Identify New Variables

Sometimes this process can help one identify new underlying pieces of information and find new variables for the data sets which were previously missed.

• Remove needless Variables

The process reduces the number of needless variables by eliminating those with very little significance or those that strongly correlate with other variables.


### 3. Uses of PCA

The uses of Principal Component Analysis are wide and encompass many disciplines, for instance, statistics and geography, with applications in image compression techniques and more. It is a major component of data compression technology, whether the data is in video form, picture form, data sets or much else.

It also helps to improve the performance of algorithms, since more features increase their workload; with Principal Component Analysis, the workload is reduced to a great degree. It also helps to find correlated values, since finding them manually across thousands of records is almost impossible.

Overfitting is a phenomenon that often occurs when a model is trained on data with too many variables. Principal Component Analysis reduces the risk of overfitting, as the number of variables is now reduced.

It is very difficult to carry out the visualisation of data when the number of dimensions being dealt with is too high. PCA alleviates this issue by reducing the number of dimensions, so visualisation is much more efficient, easier on the eyes and concise. We can potentially even use a 2D plot to represent the data after Principal Component Analysis.

### 4. Applications of PCA

As discussed above, PCA has a wide range of uses in image compression, facial recognition algorithms, geography, the finance sector, machine learning, meteorology and more. It is also used in the medical sector to interpret and process medical data while testing medicines, or for the analysis of spike-triggered covariance. The scope of PCA applications is really broad in the present day and age.

For example, in neuroscience, spike-triggered covariance analysis helps to identify the properties of a stimulus that cause a neuron to fire. It also helps to identify individual neurons using the action potentials they emit. Since PCA is a dimension reduction technique, it helps to find correlations in the activity of large ensembles of neurons. This comes into special use during drug trials that deal with neuronal actions.

### 5. Principal Axis Method

In the principal axis method, the assumption is that the common variance in communalities is less than one. The method is implemented by replacing the main diagonal of the correlation matrix, which consists of ones in the PCA methodology, with the initial communality estimates. Principal components are then extracted from this modified correlation matrix.

### 6. PCA for Data Visualization

Tools like Plotly allow us to visualise data with many dimensions by first applying dimensionality reduction and then a projection algorithm. In this specific example, a tool like scikit-learn can be used to load a data set, and the dimensionality reduction method can then be applied to it. Scikit-learn is a machine learning library: it provides tools for training machine learning algorithms along with models for evaluation and testing. It works easily with NumPy and lets us use Principal Component Analysis in Python together with the pandas library.

The PCA technique ranks the various data points based on relevance, combines correlated variables and helps to visualise them. Visualising only the Principal components in the representation helps make it more effective. For example, in a dataset containing 12 features, 3 represent more than 99% of the variance and thus can be represented in an effective manner.

The number of features can drastically affect a model's performance. Hence, reducing the number of features helps a lot to boost machine learning algorithms without a measurable decrease in the accuracy of results.

### 7. PCA as dimensionality reduction

The process of reducing the number of input variables in models, for instance, various forms of predictive models, is called dimensionality reduction. The fewer input variables one has, the simpler the predictive model is. Simple often means better and can encapsulate the same things as a more complex model would. Complex models tend to have a lot of irrelevant representations. Dimensionality reduction leads to sleek and concise predictive models.

Principal Component Analysis is the most common technique used for this purpose. Its origin is in the field of linear algebra and is a crucial method in data projection. It can automatically perform dimensionality reduction and give out principal factors, which can be translated as a new input and make much more concise predictions instead of the previous high dimensionality input.

In this process, new features are constructed; in essence, the original features no longer exist. The new features are derived from the same overall data but do not map directly back to it, yet they can still be used to train machine learning models just as effectively.

### 8. PCA for visualisation: Hand-written digits

Handwritten digit recognition is a machine learning system's ability to identify digits written by hand, as on postal mail, formal examinations and more. It is important in the field of exams, where OMR sheets are often used: the system must recognise not only the OMR answers but also the student's handwritten information. In Python, a handwritten digit recognition system can be developed using the MNIST dataset. When handled with conventional PCA strategies of machine learning, such datasets can yield effective results in practical scenarios. It is really difficult to establish a reliable algorithm that can effectively identify handwritten digits in environments like the postal service, banks, handwritten data entry and so on, and PCA provides an effective and reliable approach for this recognition.
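A minimal sketch of using PCA to visualise handwritten digits, using scikit-learn's bundled 8x8 digits dataset (a smaller cousin of MNIST) rather than MNIST itself:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Each 8x8 handwritten-digit image gives 64 features per sample.
digits = load_digits()

# Project the 64-dimensional data down to 2 dimensions.
pca = PCA(n_components=2)
projected = pca.fit_transform(digits.data)

print(digits.data.shape)   # (1797, 64)
print(projected.shape)     # (1797, 2)
# projected[:, 0] and projected[:, 1] can now be scatter-plotted,
# coloured by digits.target, to visualise the ten digit classes.
```

Even with only two of the sixty-four dimensions retained, the digit classes form visibly separated clusters in the scatter plot.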

### 9. Choosing the number of components

One of the most important parts of Principal Component analysis is estimating the number of components needed to describe the data. It can be found by having a look at the cumulative explained variance ratio and taking it as a function of the number of components.

One common rule is Kaiser's stopping rule, where one chooses all components with an eigenvalue greater than one. This means that only variables with a measurable effect get chosen.

We can also plot a graph of the component number against the eigenvalues (a scree plot). The trick is to stop including components at the point where the curve flattens into a nearly straight line.
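The cumulative-explained-variance approach can be sketched as follows; the 90% threshold below is an arbitrary illustrative choice, not a fixed rule:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data
pca = PCA().fit(X)

# Cumulative explained variance as a function of component count.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that captures at least 90% of the
# variance (searchsorted finds the first index crossing the threshold).
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_components)
```

scikit-learn also accepts a fraction directly, so `PCA(n_components=0.90)` performs this selection automatically during fitting.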

### 10. PCA as Noise Filtering

Principal Component Analysis has found a utility in the field of physics, where it is used to filter noise from experimental electron energy loss spectroscopy (EELS) spectrum images. In general, it is a method to remove noise from data: as the number of dimensions is reduced, the noise is reduced too, and one only sees the variables which have the maximum effect on the situation. PCA is often applied after conventional denoising methods fail to remove some remnant noise in the data. The eigenvalues of the various components are compared, the components with low eigenvalues are discarded as noise, and the components with larger eigenvalues are used to reconstruct the data.
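The discard-and-reconstruct idea can be sketched on synthetic data where the clean signal is known to be low-dimensional, so the improvement can be measured directly (a minimal illustration, not an EELS workflow):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic "clean" signal: 500 samples living on a 3-dimensional
# subspace embedded in 20 dimensions, plus isotropic noise.
basis = rng.normal(size=(3, 20))
clean = rng.normal(size=(500, 3)) @ basis
noisy = clean + rng.normal(scale=0.5, size=clean.shape)

# Keep 3 components: the large-eigenvalue directions hold the signal,
# the remaining 17 directions hold mostly noise and are discarded.
pca = PCA(n_components=3).fit(noisy)
denoised = pca.inverse_transform(pca.transform(noisy))

# The reconstruction is much closer to the clean data than the
# noisy input was.
mse_noisy = np.mean((noisy - clean) ** 2)
mse_denoised = np.mean((denoised - clean) ** 2)
print(mse_denoised < mse_noisy)  # True
```

`inverse_transform` maps the reduced representation back to the original 20 dimensions, which is what "reconstructing the data from the large eigenvalues" means in practice.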

The very concept of principal component analysis lends itself to reducing noise in data, removing irrelevant variables and then reconstructing data which is simpler for the machine learning algorithms without missing the essence of the information input.

### 11. PCA to Speed-up Machine Learning Algorithms

The performance of a machine learning algorithm, as discussed above, degrades as the number of input features grows. Principal component analysis, by its very nature, allows one to drastically reduce the number of input features, remove excess noise and reduce the dimensionality of the data set. This, in turn, means there is much less strain on a machine learning algorithm, and it can produce near-identical results with heightened efficiency.

### 12. Apply Logistic Regression to the Transformed Data

Logistic regression can be used after a principal component analysis. The PCA performs the dimensionality reduction, while the logistic regression is the actual brain that makes the predictions. It is derived from the logistic function, which has its roots in biology.
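A minimal sketch of this PCA-then-logistic-regression pipeline on the scikit-learn digits dataset; the choice of 30 components is an illustrative assumption, not a recommendation:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA compresses 64 features down to 30; logistic regression then
# makes the actual predictions on the transformed data.
model = make_pipeline(StandardScaler(),
                      PCA(n_components=30),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))
```

Wrapping the steps in a `Pipeline` ensures the PCA is fitted only on the training split, so no information leaks from the test set.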

### 13. Measuring Model Performance

After preparing the data for a machine learning model using PCA, the effectiveness or performance of the model does not change drastically. This can be tested with several metrics, such as counts of true positives, true negatives, false positives and false negatives. The effectiveness is assessed by arranging these counts in a confusion matrix for the machine learning model.
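A sketch of computing that confusion matrix for a PCA-plus-logistic-regression model on the digits data (30 components is again an illustrative choice):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit PCA on the training data only, then train the classifier
# on the transformed features.
pca = PCA(n_components=30).fit(X_train)
clf = LogisticRegression(max_iter=5000).fit(pca.transform(X_train), y_train)
y_pred = clf.predict(pca.transform(X_test))

# Rows are true digits, columns are predicted digits; the diagonal
# holds the correct predictions for each class.
cm = confusion_matrix(y_test, y_pred)
print(cm.shape)
print(round(accuracy_score(y_test, y_pred), 3))
```

Off-diagonal cells show exactly which digits get confused with which, which a single accuracy number cannot reveal.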

### 14. Timing of Fitting Logistic Regression after PCA

Principal component regression in Python is the technique that produces predictions from the machine learning program after data prepared by the PCA process is fed to the software as input. The fitting proceeds more easily, and a reliable prediction is returned as the end product of logistic regression on top of PCA.

### 15. Implementation of PCA with Python

scikit-learn can be used with Python to implement a working PCA algorithm, enabling Principal Component Analysis in Python as explained above. It is a working form of linear dimensionality reduction that uses singular value decomposition of a data set to project it into a lower-dimensional space. The input data is taken, and the components with low eigenvalues can be discarded using scikit-learn so that only the ones that matter, those with high eigenvalues, are kept.

Steps involved in Principal Component Analysis:

1. Standardization of dataset.
2. Calculation of covariance matrix.
3. Compute the eigenvalues and eigenvectors of the covariance matrix.
4. Sort eigenvalues and their corresponding eigenvectors.
5. Select the top k eigenvalues and form a matrix of their eigenvectors.
6. Transform the original matrix.
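The six steps above can be collected into one small NumPy function (a minimal sketch of the procedure, not a replacement for scikit-learn's optimised `PCA`):

```python
import numpy as np

def pca(X, k):
    """Minimal PCA following the six steps above."""
    # 1. Standardise the dataset.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Calculate the covariance matrix (variables in rows for np.cov).
    cov = np.cov(X_std.T)
    # 3. Compute eigenvalues and eigenvectors of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort eigenvalues (and their eigenvectors) in descending order.
    order = np.argsort(eigvals)[::-1]
    # 5. Keep the top k eigenvectors as the projection matrix.
    W = eigvecs[:, order[:k]]
    # 6. Transform the original (standardised) matrix.
    return X_std @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
print(pca(X, 2).shape)  # (50, 2)
```

For real workloads, `sklearn.decomposition.PCA` does the same job via singular value decomposition and handles centring for you.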

### Conclusion

In conclusion, PCA is a method with great potential in science, art, physics, chemistry, graphic image processing, the social sciences and much more, as it is effectively a means to compress data without compromising on the value it carries. Only the variables that do not significantly affect the result are removed, and the correlated variables are consolidated.

### Abhresh Sugandhi

Author

Abhresh specialises as a corporate trainer. He has a decade of experience in technical training, blending virtual webinars with instructor-led sessions, and has created courses, tutorials, and articles for organizations. He is also the founder of Nikasio.com, which offers multiple services in technical training, project consulting, content development, etc.