How to perform PCA using scikit-learn in Python

by Abdul Rawoof A R Updated: May 8, 2023

Solution Kit

Principal Component Analysis (PCA) helps identify the main axes of variance within a dataset. It lets us explore the data to understand the key variables and to spot outliers.

PCA is a standard tool in the data analysis tool kit. We use PCA to reduce the number of dimensions in a dataset, to find patterns in high-dimensional data, and to visualize that data. It is particularly useful when multicollinearity exists between the features.
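The point about multicollinearity can be illustrated with a small sketch. The data below is synthetic and hypothetical (the names x1, x2, x3 and X_demo are not from the original solution): two features are nearly identical, and after PCA the resulting component scores are uncorrelated with each other.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: the second feature is almost a copy of the first
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
X_demo = np.column_stack([x1, x2, x3])

# Correlation between the first two original features (close to 1)
corr_before = np.corrcoef(X_demo, rowvar=False)[0, 1]

# After PCA, the component scores are uncorrelated with each other
scores = PCA().fit_transform(X_demo)
corr_after = np.corrcoef(scores, rowvar=False)

print(round(corr_before, 3))
print(np.round(corr_after, 3))
```

The off-diagonal entries of the second matrix should be zero up to rounding, which is why PCA is often used before models that are sensitive to correlated inputs.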

Scikit-learn groups these methods into decomposition algorithms, such as principal component analysis, and blind source separation algorithms, such as independent component analysis. PCA is an unsupervised machine learning method, and other unsupervised learning techniques build on some of its ideas. It helps in exploratory data analysis and reduces the dimension of a dataset by projecting a $d$-dimensional feature space onto a $k$-dimensional subspace, where $k$ is less than $d$.

Types of PCA:

Kernel PCA.
Sparse PCA.
Incremental PCA.
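Each of these variants has a counterpart in sklearn.decomposition. The following is a minimal sketch on hypothetical random data; the parameter values (the RBF kernel, alpha=1.0, batch_size=20) are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.decomposition import KernelPCA, SparsePCA, IncrementalPCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 6))  # hypothetical data: 100 samples, 6 features

# Kernel PCA: nonlinear reduction through a kernel (RBF chosen for illustration)
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X_demo)

# Sparse PCA: components with sparse loadings (alpha controls the sparsity penalty)
X_spca = SparsePCA(n_components=2, alpha=1.0, random_state=0).fit_transform(X_demo)

# Incremental PCA: fits in mini-batches, useful when data does not fit in memory
X_ipca = IncrementalPCA(n_components=2, batch_size=20).fit_transform(X_demo)

print(X_kpca.shape, X_spca.shape, X_ipca.shape)
```

All three share the fit/transform interface of PCA, so they can usually be swapped in with few changes to the surrounding code.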

Applied properly, PCA is one of the most powerful tools in the data analysis tool kit. In clear contrast to a random forest, which uses class membership information to compute node impurities, PCA ignores class labels and looks only at variance, which measures the spread of values along a feature axis. To reduce the dimensionality, choose the first $k$ eigenvectors of the covariance matrix, corresponding to the $k$ largest eigenvalues, where $d$ is the dimension of the original feature space and $k$ is the dimension of the new subspace, such that $k < d$.
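The eigenvector selection described above can be sketched with NumPy alone. This is an illustrative version on hypothetical random data (X_demo, k, and W are names chosen here), not the code path scikit-learn itself uses.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))  # hypothetical data with d = 4 features
k = 2                              # target dimension, k < d

# Center the data and compute the d x d covariance matrix
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # indices sorted by descending eigenvalue
W = eigvecs[:, order[:k]]          # d x k projection matrix

# Project onto the k-dimensional subspace
X_proj = X_centered @ W
print(X_proj.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is the sense in which the largest eigenvalues mark the main axes of variance.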

Scikit-learn's PCA uses the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. 2009, depending on the shape of the input data and the number of components to extract. With the full solver, it runs the exact SVD by calling the LAPACK solver and selects the components by postprocessing.

We will follow the classic ML pipeline: import the libraries and the dataset, perform exploratory data analysis and preprocessing, then train models, make predictions, and evaluate accuracy. Identifying and extracting trends from large datasets is often difficult, which is why decomposition methods, blind source separation, and cluster analysis play an important role in this process.
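The choice between the exact and the randomized SVD is exposed through the svd_solver parameter of PCA. The sketch below uses hypothetical low-rank data (five latent factors plus small noise) so that both solvers should agree closely; on data without such structure the randomized result is only an approximation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 5 latent factors mixed into 30 features, plus small noise
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 30))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 30))

# Exact SVD through LAPACK, components selected by postprocessing
pca_full = PCA(n_components=5, svd_solver="full").fit(X_demo)

# Randomized truncated SVD; random_state makes the result repeatable
pca_rand = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X_demo)

# Default: scikit-learn picks a solver from the data shape and n_components
pca_auto = PCA(n_components=5).fit(X_demo)

print(pca_full.explained_variance_ratio_)
print(pca_rand.explained_variance_ratio_)
```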

Python's scikit-learn library helps with data analysis, and the Python ecosystem is widely regarded as a gold standard for machine learning work.

NumPy offers a multidimensional array object and variants such as masked arrays and matrices.

Limitations:

After the principal components are computed, it is difficult to determine which features in the original dataset are most significant. PCA can also capture only the most basic invariances unless the training data explicitly provides that information.

Advantages:

PCA is easy to compute and is based on linear algebra.
It speeds up the other machine-learning algorithms.
It counteracts the issues of high-dimensional data in machine learning.
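As a rough illustration of PCA feeding another algorithm, the sketch below reduces the four iris features to two components before a classifier. The choice of LogisticRegression, the 70/30 split, and n_components=2 are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scale, reduce 4 features to 2 components, then classify
clf = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Test accuracy with 2 components: {accuracy:.2f}")
```

Putting the scaler and PCA inside the pipeline means both are fitted on the training split only, which avoids leaking information from the test data.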

Disadvantages:

Principal components are less interpretable than the original features.
There is a trade-off between information loss and dimensionality reduction.
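One way to see this trade-off is to project the data onto k components, map it back with inverse_transform, and measure what was lost. The following is a minimal sketch on the standardized iris data; using mean squared reconstruction error as the measure of information loss is an assumption of this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

errors = []
for k in range(1, X.shape[1] + 1):
    pca = PCA(n_components=k).fit(X)
    X_back = pca.inverse_transform(pca.transform(X))  # map back to 4 features
    errors.append(np.mean((X - X_back) ** 2))         # mean squared reconstruction error

for k, err in enumerate(errors, start=1):
    print(f"{k} component(s): reconstruction MSE = {err:.4f}")
```

The error shrinks as components are added and is essentially zero once all four are kept, at which point no dimensionality reduction has taken place.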

Here is an example of performing PCA using scikit-learn in Python:

Fig: Preview of the output that you will get on running this code from your IDE.

Code

In this solution, we are using scikit-learn, NumPy, pandas, and Matplotlib.

PCA on sklearn - how to interpret pca.components_

Python | Lines of Code: 73 | License: Strong Copyleft (CC BY-SA 4.0)

Dependent Libraries :

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# In general it is a good idea to scale the data before PCA
scaler = StandardScaler()
X = scaler.fit_transform(X)

# PCA is unsupervised, so the labels y are not needed for fitting
pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score, coeff, labels=None):
    # score: projected samples, shape (n_samples, 2)
    # coeff: feature loadings, shape (n_features, 2)
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]

    plt.scatter(xs, ys, c=y)  # scores plotted without rescaling, colored by class
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        name = "Var" + str(i + 1) if labels is None else labels[i]
        plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, name, color='g', ha='center', va='center')

    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

# Call the function with the first two components.
# components_ has shape (n_components, n_features), so transpose it to (n_features, 2).
myplot(x_new[:, 0:2], pca.components_[0:2, :].T)
plt.show()

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10,5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs = model.components_.shape[0]

# get the index of the most important feature on EACH component,
# i.e. the feature with the largest absolute loading
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a','b','c','d','e']
# get the names of those features
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# map each component label to its most important feature
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(dic.items())

     0  1
 0  PC0  e
 1  PC1  d

Instructions

Follow the steps carefully to get the output easily.

Install PyCharm Community Edition on your computer.
Open a terminal and install the required libraries with the following commands.
Install scikit-learn - pip install scikit-learn.
Install NumPy - pip install numpy.
Install pandas - pip install pandas.
Install Matplotlib (needed for the plotting snippet) - pip install matplotlib.
Create a new Python file (e.g., test.py).
Copy the snippet using the 'copy' button and paste it into that file.
Then add a print statement at the end of the code (like 'print(df)').
Run the file using the run button.

I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.

I found this code snippet by searching for 'pca on sklearn' in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.

The solution is created in PyCharm 2022.3.3.
The solution is tested on Python 3.9.7.
Scikit-learn version 1.2.2.
NumPy version v1.24.2.
Pandas version 2.0.0.

Using this solution, we can perform PCA using scikit-learn in Python with simple steps. The process also provides an easy-to-use, hassle-free way to create a hands-on working version of the code.

Dependent Libraries

scikit-learn by scikit-learn
Python | Version: 1.2.2 | License: Permissive (BSD-3-Clause)
scikit-learn: machine learning in Python

pandas by pandas-dev
Python | Version: v2.0.2 | License: Permissive (BSD-3-Clause)
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

numpy by numpy
Python | Version: v1.25.0rc1 | License: Permissive (BSD-3-Clause)
The fundamental package for scientific computing with Python.

You can also search for any dependent libraries on kandi like 'scikit-learn', 'pandas' and 'NumPy'.

FAQ:

1. What is Kernel Principal Component Analysis? How does it differ from traditional PCA?

Kernel PCA generalizes linear PCA to nonlinear dimensionality reduction through the use of kernels. Traditional PCA performs only linear dimensionality reduction.

2. How is principal component analysis an unsupervised machine learning technique?

Principal component analysis is unsupervised because it uses only the features and ignores class labels: it looks for the directions of greatest variance in the data. This makes it well suited to exploratory data analysis.

3. How do the first two principal components help to interpret a dataset better?

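The first two principal components often capture most of the variance. For example, if the first component explains 72.22% of the variance and the second explains 23.9%, the two together capture about 96.1% (72.22 + 23.9) of the information in the feature set, so a two-dimensional plot of them gives a faithful view of the data.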
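The share of variance captured by each component can be read from the explained_variance_ratio_ attribute of a fitted PCA. The sketch below uses the standardized iris data as an assumption; the exact percentages depend on the dataset and preprocessing, so they need not match the figures quoted in the answer.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

ratios = pca.explained_variance_ratio_  # share of variance per component
cumulative = np.cumsum(ratios)          # running total across components

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.2%} of variance, {c:.2%} cumulative")
```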

4. What are Probabilistic PCA and its advantages over traditional PCA methods?

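Probabilistic PCA is the model of Tipping and Bishop (1999), which treats PCA as a probabilistic model that includes an estimate of the noise covariance. Because the model defines a likelihood, it lets us score how well data fit it, while remaining easy to compute and based on linear algebra.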
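Scikit-learn's PCA exposes a few quantities tied to this probabilistic model. The following is a minimal sketch on the standardized iris data, with n_components=2 chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(X)

# Estimated noise variance under the probabilistic PCA model
noise_var = pca.noise_variance_
print(noise_var)

# Data covariance implied by the model (signal part plus isotropic noise)
model_cov = pca.get_covariance()
print(model_cov.shape)

# Average log-likelihood of the samples under the model
avg_loglik = pca.score(X)
print(avg_loglik)
```

The log-likelihood from score can be compared across different values of n_components, which is one practical use of the probabilistic view.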

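5. How are matrix decompositions applied to principal component analysis?

PCA is computed through a matrix decomposition, the singular value decomposition (SVD) of the centered data. For large inputs, an approximate randomized decomposition can be used instead, as described by Halko et al. in "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions."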
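The link between PCA and the SVD can be checked directly. The sketch below, on hypothetical random data, computes the first two components from NumPy's SVD of the centered data and compares them with scikit-learn's result; the variable names are chosen for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))  # hypothetical data

# SVD of the centered data: Xc = U @ diag(S) @ Vt
Xc = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:2]                                  # principal axes (rows)
scores = Xc @ components.T                           # same as U[:, :2] * S[:2]
explained_variance = S[:2] ** 2 / (Xc.shape[0] - 1)  # variance along each axis

pca = PCA(n_components=2, svd_solver="full").fit(X_demo)
print(np.allclose(explained_variance, pca.explained_variance_))
```

Individual components may differ in sign between the two computations, because the sign of a singular vector is arbitrary; the variances along them are the same.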

Support

For any support on kandi solution kits, please use the chat
For further learning resources, visit the Open Weaver Community learning page.

Open Weaver – Develop Applications Faster with Open Source
