How to perform PCA using scikit-learn in Python


by Abdul Rawoof A R | Updated: May 8, 2023

Solution Kit

Principal Component Analysis (PCA) helps identify the main axes of variance within a dataset. It lets us explore data to understand the key variables and to spot outliers.


PCA is a standard tool in the data analysis toolkit. We use it to reduce the number of dimensions in a dataset, to find patterns in high-dimensional data, and to visualize that data. It is particularly useful when multicollinearity exists between the features.
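
As a small illustration of the multicollinearity point, the sketch below builds two nearly identical features and checks that the PCA scores are uncorrelated. The data is synthetic and the names `x1`, `x2`, and `scores` are assumptions made for this example only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: x2 is almost a copy of x1, so the two features are collinear
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
X = np.column_stack([x1, x2])

corr_before = np.corrcoef(X, rowvar=False)[0, 1]

# Project onto the principal components; the resulting scores are uncorrelated
scores = PCA(n_components=2).fit_transform(X)
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]

print("correlation between original features:", round(corr_before, 3))
print("correlation between PCA scores:", round(corr_after, 3))
```

The first value should be close to 1 and the second close to 0, which is why PCA is often used before models that are sensitive to correlated inputs.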


PCA is one of several decomposition algorithms; others include blind source separation algorithms such as independent component analysis. It is an unsupervised machine learning method that helps in exploratory data analysis, and other unsupervised learning techniques build on some of its ideas. It reduces the dimension of a dataset by projecting a $d$-dimensional feature space onto a $k$-dimensional subspace, where $k$ is less than $d$.

Types of PCA:

  1. Kernel PCA.
  2. Sparse PCA.
  3. Incremental PCA.
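
The variants listed above are all available in `sklearn.decomposition`. The following is a minimal sketch, assuming random synthetic data and arbitrary parameter choices (`kernel="rbf"`, `batch_size=20`), that only shows they share the same `fit_transform` interface.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA, SparsePCA, IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))  # hypothetical data: 100 samples, 6 features

models = {
    "PCA": PCA(n_components=2),
    "Kernel PCA": KernelPCA(n_components=2, kernel="rbf"),
    "Sparse PCA": SparsePCA(n_components=2, random_state=0),
    "Incremental PCA": IncrementalPCA(n_components=2, batch_size=20),
}

shapes = {}
for name, model in models.items():
    X_reduced = model.fit_transform(X)  # every variant maps 6 features to 2
    shapes[name] = X_reduced.shape
    print(name, "->", X_reduced.shape)
```

Which variant fits best depends on the data: Kernel PCA for nonlinear structure, Sparse PCA for components with many zero loadings, Incremental PCA for data processed in batches.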


Applied properly, PCA is one of the most powerful tools in the data analysis toolkit. It stands in clear contrast to a random forest, which uses the class membership information to compute the node impurities; PCA does not use class labels at all. Variance measures the spread of values along a feature axis. To reduce the dimension, we choose the first $k$ eigenvectors, corresponding to the $k$ largest eigenvalues, where $d$ is the dimension of the feature space and $k < d$.
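
The eigenvector selection described above can be sketched directly with NumPy. This is an illustrative from-scratch version under assumed synthetic data, not how scikit-learn computes PCA internally; the names `W` and `X_projected` are introduced here for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # hypothetical data with d = 4 features
k = 2                         # target dimension, k < d

# Center the data and compute the d x d covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # indices of eigenvalues, largest first
W = eigvecs[:, order[:k]]          # d x k matrix of the top-k eigenvectors

# Project the d-dimensional data onto the k-dimensional subspace
X_projected = X_centered @ W
print(X_projected.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is what makes the eigenvalues a natural ranking of the components.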


scikit-learn's PCA uses either the LAPACK implementation of the full SVD or a randomized truncated SVD by the method of Halko et al. (2009), depending on the shape of the input data and the number of components to extract. The full solver runs an exact SVD by calling the LAPACK solver and selects the components by postprocessing.

We will follow the classic ML pipeline: import libraries and datasets, perform exploratory data analysis and preprocessing, then train models, make predictions, and evaluate accuracy. Identifying and extracting trends from large datasets is often difficult, which is why decomposition methods, blind source separation, and cluster analysis play an important role in this process.
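
The solver can also be chosen explicitly through the `svd_solver` parameter of `PCA`. The sketch below, assuming the Iris data and a fixed `random_state` for repeatability, fits both the exact and the randomized solver and prints their explained variance ratios for comparison.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Exact full SVD through LAPACK; components are selected afterwards
pca_full = PCA(n_components=2, svd_solver="full").fit(X)

# Randomized truncated SVD (Halko et al.); random_state fixed for repeatability
pca_rand = PCA(n_components=2, svd_solver="randomized", random_state=0).fit(X)

print("full:      ", pca_full.explained_variance_ratio_)
print("randomized:", pca_rand.explained_variance_ratio_)
```

On a dataset this small the two should agree closely; the randomized solver is mainly of interest for large matrices where only a few components are needed.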


Python's scikit-learn library helps with data analysis, and the Python ecosystem has become a common standard for machine learning algorithms.

NumPy offers a multidimensional array object and variants such as masked arrays and matrices.

Limitations:

After we compute the principal components, it is difficult to determine which characteristics in the dataset are significant, because each component is a combination of all the original features. PCA can also capture only the most basic invariances, and only if they are clearly present in the training data.

Advantages:

  • PCA is easy to compute and is based on linear algebra.
  • It can speed up other machine learning algorithms by reducing the number of input features.
  • It counteracts the issues of high-dimensional data in machine learning.
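
To illustrate feeding a reduced feature set into another algorithm, here is a minimal sketch that chains scaling, PCA, and a classifier. The choice of `LogisticRegression`, the 70/30 split, and `n_components=2` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Scale, reduce 4 features to 2 components, then classify on the reduced data
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print("test accuracy with 2 components:", accuracy)
```

Putting PCA inside the pipeline means it is fitted on the training split only, which avoids leaking information from the test data.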

Disadvantages:

  • Principal components have lower interpretability than the original features.
  • There is a trade-off between information loss and dimensionality reduction.
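
The trade-off between information loss and dimensionality reduction can be quantified with `explained_variance_ratio_`. The sketch below assumes the standardized Iris data and an arbitrary 95% threshold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

# Running total of the variance kept when using the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, total in enumerate(cumulative, start=1):
    print(f"{k} component(s): {total:.3f} of the variance retained")

# Smallest number of components that retains at least 95% (threshold is arbitrary)
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print("components needed for 95%:", n_needed)
```

Passing a float such as `PCA(n_components=0.95)` asks scikit-learn to make the same selection automatically.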


Here is an example of performing PCA using scikit-learn in Python:

Fig: Preview of the output that you will get on running this code from your IDE.

Code

In this solution, we are using scikit-learn, NumPy, pandas, and Matplotlib.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data
y = iris.target

# In general it is a good idea to scale the data before PCA
scaler = StandardScaler()
X = scaler.fit_transform(X)

# PCA is unsupervised, so the labels y are not needed for fitting
pca = PCA()
x_new = pca.fit_transform(X)

def myplot(score, coeff, labels=None):
    """Draw a biplot: samples as points, feature loadings as arrows."""
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]  # number of original features (one row per feature)

    plt.scatter(xs, ys, c=y)  # color points by class; scores are not rescaled
    for i in range(n):
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        name = "Var" + str(i + 1) if labels is None else labels[i]
        plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, name, color='g', ha='center', va='center')

    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

# Call the function with the first two PCs. components_ has shape
# (n_components, n_features), so transpose it to get one row per feature.
myplot(x_new[:, 0:2], np.transpose(pca.components_[0:2, :]))
plt.show()

from sklearn.decomposition import PCA
import pandas as pd
import numpy as np
np.random.seed(0)

# 10 samples with 5 features
train_features = np.random.rand(10,5)

model = PCA(n_components=2).fit(train_features)
X_pc = model.transform(train_features)

# number of components
n_pcs = model.components_.shape[0]

# get the index of the most important feature on EACH component,
# i.e. the feature with the largest absolute loading
most_important = [np.abs(model.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = ['a', 'b', 'c', 'd', 'e']
# get the names
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

# map each component label to its most important feature name
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

# build the dataframe
df = pd.DataFrame(dic.items())

     0  1
 0  PC0  e
 1  PC1  d
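
The table above keeps only the single most important feature per component. As a possible extension, the sketch below puts all loadings into a labeled DataFrame so that every feature's contribution can be inspected; the `loadings` name and the row labels are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

np.random.seed(0)

# Same setup as the snippet above: 10 samples with 5 features
train_features = np.random.rand(10, 5)
feature_names = ['a', 'b', 'c', 'd', 'e']

model = PCA(n_components=2).fit(train_features)

# One row per component, one column per original feature
loadings = pd.DataFrame(
    model.components_,
    columns=feature_names,
    index=['PC0', 'PC1'],
)
print(loadings.round(3))
```

Each row is a unit-length vector, so the entries with the largest absolute values are the features that dominate that component.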

Instructions

Follow the steps carefully to get the output easily.

  1. Install PyCharm Community Edition on your computer.
  2. Open the terminal and install the required libraries with the following commands.
  3. Install scikit-learn - pip install scikit-learn.
  4. Install NumPy - pip install numpy.
  5. Install pandas - pip install pandas.
  6. Install Matplotlib (imported by the plotting snippet) - pip install matplotlib.
  7. Create a new Python file (e.g. test.py).
  8. Copy the snippet using the 'copy' button and paste it into that file.
  9. Then add a print statement at the end of the code (like 'print(df)').
  10. Run the file using the run button.


I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.


I found this code snippet by searching for 'pca on sklearn' in kandi. You can try any such use case!

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.

  1. The solution is created in PyCharm 2022.3.3.
  2. The solution is tested on Python 3.9.7.
  3. Scikit-learn version 1.2.2.
  4. NumPy version 1.24.2.
  5. Pandas version 2.0.0.


Using this solution, we are able to perform PCA using scikit-learn in Python with simple steps. This process also facilitates an easy-to-use, hassle-free method to create a hands-on working version of code which would help us perform PCA using scikit-learn in Python.

Dependent Libraries

scikit-learn by scikit-learn

Python | Stars: 54584 | Version: 1.2.2
License: Permissive (BSD-3-Clause)

scikit-learn: machine learning in Python

pandas by pandas-dev

Python | Stars: 38689 | Version: v2.0.2
License: Permissive (BSD-3-Clause)

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

numpy by numpy

Python | Stars: 23755 | Version: v1.25.0rc1
License: Permissive (BSD-3-Clause)

The fundamental package for scientific computing with Python.

You can also search for any dependent libraries on kandi like 'scikit-learn', 'pandas' and 'NumPy'.

FAQ:

1. What is Kernel Principal Component Analysis? How does it differ from traditional PCA?

Kernel PCA generalizes linear PCA to nonlinear dimensionality reduction by using a kernel function. Traditional PCA performs only linear dimensionality reduction.
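
A minimal sketch of the difference, assuming a synthetic concentric-circles dataset and an RBF kernel with `gamma=10` (both choices are illustrative, not prescribed by this article):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: a structure no linear projection can untangle
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear PCA only rotates the data, so the circles stay nested
X_lin = PCA(n_components=2).fit_transform(X)

# Kernel PCA performs PCA in the feature space induced by the RBF kernel
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

print(X_lin.shape, X_rbf.shape)
```

Plotting `X_lin` and `X_rbf` colored by `y` is a quick way to compare how each method arranges the two classes.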


2. How is principal component analysis an unsupervised machine learning technique?

Principal component analysis is unsupervised because it uses only the feature values and never the class labels. This makes it suitable for exploratory data analysis.
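
The point can be checked in scikit-learn, where `PCA.fit` accepts a `y` argument only for API consistency and ignores it. A short sketch, assuming the Iris data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Fit once without labels and once with labels
pca_without = PCA(n_components=2).fit(X)
pca_with = PCA(n_components=2).fit(X, y)

# The labels have no effect on the fitted components
same = np.allclose(pca_without.components_, pca_with.components_)
print("identical components:", same)
```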


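3. How do the first two principal components help to interpret a dataset better?

The first two principal components usually capture most of the variance, so they give a compact view of the dataset. For example, if the first component explains 72.22% of the variance and the second explains 23.9%, then the first two principal components together capture (72.22 + 23.9) 96.12% of the information in the feature set.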


4. What are Probabilistic PCA and its advantages over traditional PCA methods?

Probabilistic PCA is the probabilistic model of PCA from Tipping and Bishop (1999). In scikit-learn, the estimated noise covariance of a fitted PCA follows this model. Like traditional PCA, it is easy to compute and based on linear algebra.
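
In scikit-learn, a fitted `PCA` exposes this through the `noise_variance_` attribute and the `score` method, which returns the average log-likelihood of the samples. A minimal sketch, assuming the Iris data and two components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)

# Estimated noise variance under the probabilistic PCA model
noise = pca.noise_variance_

# Average log-likelihood of the samples under that model
avg_loglik = pca.score(X)

print("noise variance:", noise)
print("average log-likelihood:", avg_loglik)
```

Comparing `score` across different values of `n_components` is one way to use the probabilistic view for model selection.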


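5. Explain matrix decompositions as applied to principal component analysis?

PCA is computed through a matrix decomposition, typically the singular value decomposition (SVD) of the centered data. For large inputs, an approximate randomized decomposition can be used instead, as described by Halko et al. (2009) in "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions."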
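
The link between PCA and the SVD can be sketched with NumPy and compared against scikit-learn. The example assumes the Iris data and the exact `svd_solver="full"` setting; absolute values are compared because the sign of each component is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
X_centered = X - X.mean(axis=0)

# Thin SVD of the centered data: X_centered = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Principal component scores; equivalent to X_centered @ Vt.T
scores_svd = U * S

pca = PCA(svd_solver="full").fit(X)
scores_pca = pca.transform(X)

print(np.allclose(S, pca.singular_values_))
print(np.allclose(np.abs(scores_svd), np.abs(scores_pca)))
```

The rows of `Vt` play the role of `pca.components_`, and the squared singular values divided by `n_samples - 1` give the explained variances.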

Support

1. For any support on kandi solution kits, please use the chat
2. For further learning resources, visit the Open Weaver Community learning page.