
K-Means | K-Means Clustering using MapReduce

by himank | Java | Version: Current | License: No License


kandi X-RAY | K-Means Summary

K-Means is a Java library. K-Means has no bugs, it has no vulnerabilities, and it has low support. However, its build file is not available. You can download it from GitHub.
K-Means Clustering using MapReduce
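
The repository implements K-Means clustering as Hadoop MapReduce jobs over data stored in HDFS. As a rough illustration of how that split usually looks, the sketch below assigns each point to its nearest centroid in the map phase and recomputes centroids in the reduce phase. The class names, configuration keys, and comma-separated input format are illustrative assumptions, not details taken from this repository.

// Illustrative sketch of a single k-means iteration as a Hadoop MapReduce job.
// Class names and configuration keys are hypothetical, not taken from the repository.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

    // Map phase: assign each input point (a comma-separated line) to its nearest centroid.
    public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final List<double[]> centroids = new ArrayList<>();

        @Override
        protected void setup(Context context) {
            // Assumption: the driver passes the current centroids through the job
            // configuration as "centroid.0" ... "centroid.(k-1)".
            int k = context.getConfiguration().getInt("kmeans.k", 0);
            for (int i = 0; i < k; i++) {
                String[] parts = context.getConfiguration().get("centroid." + i).split(",");
                double[] c = new double[parts.length];
                for (int d = 0; d < parts.length; d++) c[d] = Double.parseDouble(parts[d]);
                centroids.add(c);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            double[] p = new double[parts.length];
            for (int d = 0; d < parts.length; d++) p[d] = Double.parseDouble(parts[d]);

            // Find the nearest centroid by squared Euclidean distance.
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < centroids.size(); i++) {
                double dist = 0.0;
                for (int d = 0; d < p.length; d++) {
                    double diff = p[d] - centroids.get(i)[d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = i; }
            }
            // Emit (cluster id, point) so the reducer can average the points per cluster.
            context.write(new IntWritable(best), value);
        }
    }

    // Reduce phase: recompute each centroid as the mean of the points assigned to it.
    public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable clusterId, Iterable<Text> points, Context context)
                throws IOException, InterruptedException {
            double[] sum = null;
            long count = 0;
            for (Text point : points) {
                String[] parts = point.toString().split(",");
                if (sum == null) sum = new double[parts.length];
                for (int d = 0; d < parts.length; d++) sum[d] += Double.parseDouble(parts[d]);
                count++;
            }
            StringBuilder newCentroid = new StringBuilder();
            for (int d = 0; d < sum.length; d++) {
                if (d > 0) newCentroid.append(',');
                newCentroid.append(sum[d] / count);
            }
            context.write(clusterId, new Text(newCentroid.toString()));
        }
    }
}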

Support

  • K-Means has a low active ecosystem.
  • It has 63 star(s) with 62 fork(s). There are 5 watchers for this library.
  • It had no major release in the last 12 months.
  • There are 3 open issues and 2 have been closed. On average, issues are closed in 344 days. There is 1 open pull request and 0 closed requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of K-Means is current.

Quality

  • K-Means has 0 bugs and 0 code smells.

Security

  • K-Means has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • K-Means code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • K-Means does not have a standard license declared.
  • Check the repository for any license declaration and review the terms closely.
  • Without a declared license, all rights are reserved by default, and you cannot use the library in your applications without the author's permission.

Reuse

  • K-Means releases are not available. You will need to build from source code and install.
  • K-Means has no build file. You will need to create the build yourself to build the component from source.
Top functions reviewed by kandi - BETA

kandi has reviewed K-Means and discovered the below as its top functions. This is intended to give you an instant insight into the functionality K-Means implements, and help you decide if it suits your requirements.

  • Runs the hdfs algorithm.
    • Entry point (see the illustrative driver sketch below).
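
A minimal sketch of what such an entry point could look like, wiring the illustrative mapper and reducer above into a Hadoop Job and iterating a fixed number of times. The paths, iteration count, and centroid-seeding step are assumptions, not taken from the repository.

// Hypothetical driver / entry point; not the repository's actual main class.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("kmeans.k", 3);
        // Assumption: initial centroids would be seeded here, e.g.
        // conf.set("centroid.0", "x1,x2,..."), conf.set("centroid.1", ...), etc.

        for (int iter = 0; iter < 10; iter++) {
            Job job = Job.getInstance(conf, "kmeans-iteration-" + iter);
            job.setJarByClass(KMeansDriver.class);
            job.setMapperClass(KMeansIteration.AssignMapper.class);
            job.setReducerClass(KMeansIteration.RecomputeReducer.class);
            job.setMapOutputKeyClass(IntWritable.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1] + "/iter-" + iter));
            if (!job.waitForCompletion(true)) System.exit(1);

            // A full implementation would read the updated centroids back from HDFS
            // here and push them into conf before the next iteration.
        }
    }
}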


      K-Means Key Features

      K-Means Clustering using MapReduce

      Eigenvectors and samples to calculate the points on PCA scale

      import numpy as np

      data = np.array([[10, 9, 1, 7, 5, 3],
                       [10, 8,10, 8, 8,10],
                       [ 4, 6, 8, 9, 8, 9],
                       [ 9, 9, 6, 6, 8, 7],
                       [ 4, 9,10, 9, 5, 5],
                       [ 7, 6, 9, 9,10,10],
                       [10, 9, 6, 6, 8, 7],
                       [ 5, 4, 2,10, 9, 1],
                       [ 4,10, 4, 9, 2, 6],
                       [ 8, 7, 7, 7, 7, 9],
                       [ 5, 5, 6, 6, 9, 1],
                       [ 9,10, 1, 9, 5, 6],
                       [ 7, 6, 6, 6,10, 4],
                       [10, 9, 9, 7, 9, 4]])

      # Center each sample (row) around its own mean
      centered_data = data - np.mean(data, axis=1)[:, None]

      covar = np.cov(centered_data)

      eigen_vals, eigen_vecs = np.linalg.eig(covar)
      eigen_vals, eigen_vecs = np.real(eigen_vals), np.real(eigen_vecs)

      # eigen_vals =
      # array([ 3.34998559e+01,  2.26499704e+01,  1.54115835e+01,  9.13166675e+00,
      #         1.27359015e+00, -3.10462438e-15, -1.04740277e-15, -1.04740277e-15,
      #        -2.21443036e-16,  9.33811755e-18,  9.33811755e-18,  6.52780501e-16,
      #         6.52780501e-16,  5.26538300e-16])
      #
      # eigen_vals[:2] = array([33.49985588, 22.64997038])

      projected_data = eigen_vecs[:, :2].T.dot(centered_data)

      # First two eigenvectors (as displayed):
      # PC1 = [0.59123632, -0.10134531, -0.20795954,  0.1596049 , -0.07354629, 0.19588723,  0.19151677,  0.33847213,  0.22330841, -0.03466414, 0.1001646 ,  0.52913917,  0.09633029,  0.16141852]
      # PC2 = [-0.07551251, -0.07531288,  0.0188486 , -0.01280896, -0.07309957, 0.12354371, -0.01170589,  0.49672196, -0.43813664, -0.09948337, 0.49590461, -0.25700432,  0.38198034,  0.2467548 ]

      var_dim_contribution = eigen_vals / np.sum(eigen_vals)

      var_cumulative_contribution = np.cumsum(eigen_vals / np.sum(eigen_vals))

      # Cross-check against scikit-learn
      from sklearn.decomposition import PCA
      pca = PCA(n_components=2)
      pca.fit(centered_data.T)
      print(pca.explained_variance_ratio_)  # [0.40870097 0.27633148]
      print(pca.explained_variance_)        # [33.49985588 22.64997038]

      print(pca.components_)
      # [[-0.59123632  0.10134531  0.20795954 -0.1596049   0.07354629  0.19588723 0.19151677 -0.33847213 -0.22330841  0.03466414 -0.1001646  -0.52913917 -0.09633029 -0.16141852]
      #  [-0.07551251 -0.07531288  0.0188486  -0.01280896 -0.07309957  0.12354371 -0.01170589  0.49672196 -0.43813664 -0.09948337  0.49590461 -0.25700432 0.38198034  0.2467548 ]]
      

      How do I color clusters after k-means and TSNE in either seaborn or matplotlib?

      df["y"] = model.labels_
      df["comp-1"] = transformed_centroids[:-2, 0]
      df["comp-2"] = transformed_centroids[:-2, 1]
      
      df["y"] = model.labels_
      df["comp-1"] = transformed_centroids[:true_k, 0]
      df["comp-2"] = transformed_centroids[:true_k, 1]
      

      Visualise in R with ggplot, a k-means clustered developmental gene expression dataset

      library(pRolocdata)
      library(dplyr)
      library(tidyverse)
      library(magrittr)
      library(ggplot2)
      
      mulvey2015 %>%
        Biobase::assayData() %>%
        magrittr::extract2("exprs") %>%
        data.frame(check.names = FALSE) %>%
        tibble::rownames_to_column("prot_id") %>%
        mutate(.,
               cl = kmeans(select(., -prot_id),
                           centers = 12,
                           nstart = 10) %>%
                 magrittr::extract2("cluster") %>%
                 as.factor()) %>%
        pivot_longer(cols = !c(prot_id, cl),
                     names_to = "Timepoint",
                     values_to = "Expression") %>%
        ggplot(aes(x = Timepoint, y = Expression, color = cl)) +
        geom_line(aes(group = prot_id)) +
        facet_wrap(~ cl, ncol = 4)
      
      
      dfsa_mul <- data.frame(scale(assay(mul)))
      dfsa_mul2 <- rownames_to_column(dfsa_mul, "protID")
      
      dfsa_mul2$clus <- ksa_mul$cluster
      dfsa_mul2 %>% 
        pivot_longer(cols = -c("protID", "clus"),
                     names_to = "samples",
                     values_to = "expression") %>% 
      ggplot(aes(x = samples, y = expression, colour = factor(clus))) +
        geom_line(aes(group = protID)) +
        facet_wrap(~ factor(clus))
      

      Splitting image by whitespace

      # Writes a PNG image:
      def writeImage(imagePath, inputImage):
          imagePath = imagePath + ".png"
          cv2.imwrite(imagePath, inputImage, [cv2.IMWRITE_PNG_COMPRESSION, 0])
          print("Wrote Image: " + imagePath)
      
      
      def findBiggestBlob(inputImage):
          # Store a copy of the input image:
          biggestBlob = inputImage.copy()
          # Set initial values for the
          # largest contour:
          largestArea = 0
          largestContourIndex = 0
      
          # Find the contours on the binary image:
          contours, hierarchy = cv2.findContours(inputImage, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
      
          # Get the largest contour in the contours list:
          for i, cc in enumerate(contours):
              # Find the area of the contour:
              area = cv2.contourArea(cc)
              # Store the index of the largest contour:
              if area > largestArea:
                  largestArea = area
                  largestContourIndex = i
      
          # Once we get the biggest blob, paint it black:
          tempMat = inputImage.copy()
          cv2.drawContours(tempMat, contours, largestContourIndex, (0, 0, 0), -1, 8, hierarchy)
          # Erase smaller blobs:
          biggestBlob = biggestBlob - tempMat
      
          return biggestBlob
      
      # Imports
      import cv2
      import numpy as np
      
      # Read image
      imagePath = "D://opencvImages//"
      inputImage = cv2.imread(imagePath + "L85Bu.jpg")
      
      # Get image dimensions
      originalImageHeight, originalImageWidth = inputImage.shape[:2]
      
      # Resize at a fixed scale:
      resizePercent = 30
      resizedWidth = int(originalImageWidth * resizePercent / 100)
      resizedHeight = int(originalImageHeight * resizePercent / 100)
      
      # resize image
      inputImage = cv2.resize(inputImage, (resizedWidth, resizedHeight), interpolation=cv2.INTER_LINEAR)
      writeImage(imagePath+"objectInput", inputImage)
      
      # Deep BGR copy:
      colorCopy = inputImage.copy()
      
      # Convert BGR to grayscale:
      grayscaleImage = cv2.cvtColor(inputImage, cv2.COLOR_BGR2GRAY)
      
      # Threshold via Otsu:
      _, binaryImage = cv2.threshold(grayscaleImage, 250, 255, cv2.THRESH_BINARY_INV)
      
      # Image counter to write pngs to disk:
      imageCounter = 0
      
      # Segmentation flag to stop the processing loop:
      segmentObjects = True
      
      while (segmentObjects):
      
          # Get biggest object on the mask:
          currentBiggest = findBiggestBlob(binaryImage)
      
          # Use a little bit of morphology to "widen" the mask:
          kernelSize = 3
          opIterations = 2
          morphKernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernelSize, kernelSize))
          # Perform Dilate:
          binaryMask = cv2.morphologyEx(currentBiggest, cv2.MORPH_DILATE, morphKernel, None, None, opIterations,cv2.BORDER_REFLECT101)
      
          # Mask the original BGR (resized) image:
          blobMask = cv2.bitwise_and(colorCopy, colorCopy, mask=binaryMask)
      
          # Flood-fill at the top left corner:
          fillPosition = (0, 0)
          # Use white color:
          fillColor = (255, 255, 255)
          colorTolerance = (0,0,0)
          cv2.floodFill(blobMask, None, fillPosition, fillColor, colorTolerance, colorTolerance)
      
          # Write file to disk:
          writeImage(imagePath+"object-"+str(imageCounter), blobMask)
          imageCounter+=1
      
          # Subtract the current biggest blob from the
          # original binary mask:
          binaryImage = binaryImage - currentBiggest
      
          # Check for stop condition - all pixels
          # in the binary mask should be black:
          whitePixels = cv2.countNonZero(binaryImage)
      
          # Compare against a threshold - 1% of
          # the resized image area:
          whitePixelThreshold = 0.01 * (resizedWidth * resizedHeight)
          if (whitePixels < whitePixelThreshold):
              segmentObjects = False
      

      finding clusters in a picture opencv/python

      #!/usr/bin/env python3
      
      import cv2
      import numpy as np
      
      # Load image and make greyscale version too
      im = cv2.imread('dl9Vx.png')
      grey = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
      
      # Threshold (in case we do morphology) and invert
      _, thresh = cv2.threshold(grey, 254, 255, cv2.THRESH_BINARY_INV)
      cv2.imwrite('DEBUG-thresh.png', thresh)
      
      # Prepare to do some K-means
      # https://docs.opencv.org/4.x/d1/d5c/tutorial_py_kmeans_opencv.html
      criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
      # Find x,y coordinates of all non-white pixels in original image
      Y, X = np.where(thresh==255)
      Z = np.column_stack((X,Y)).astype(np.float32)
      
      nClusters = 3
      ret,label,center=cv2.kmeans(Z,nClusters,None,criteria,10,cv2.KMEANS_RANDOM_CENTERS)
      
      # Mark and display cluster centres 
      for x,y in center:
          print(f'Cluster centre: [{int(x)},{int(y)}]')
          cv2.drawMarker(im, (int(x), int(y)), [0,0,255])
      
      cv2.imwrite('result.png', im)
      
      # Cluster centre: [103,65]
      # Cluster centre: [50,93]
      # Cluster centre: [60,29]
      
      kernel = np.ones((3,3),np.uint8)
      thresh = cv2.dilate(thresh, kernel, iterations=1)
      
      Y, X = np.where(im!=255)
      

      K-Means centroids not visible in 3D clustering plot

      ax.scatter(x[:,0], x[:,1], x[:,2] ,c=kmeans.labels_, cmap='viridis',
             edgecolor='k', s=10, alpha = 0.1)
      

      How to plot the cluster centers?

      import matplotlib.pyplot as plt
      import seaborn as sns  # for the iris dataset
      import numpy as np
      from scipy.spatial.distance import cdist
      
      def kmeans(x, k, no_of_iterations=100):
          idx = np.random.choice(len(x), k, replace=False)
          # Randomly choosing Centroids
          centroids = x[idx, :]
          # finding the distance between centroids and all the data points
          distances = cdist(x, centroids, 'euclidean')
          points = np.array([np.argmin(i) for i in distances])
      
          for _ in range(no_of_iterations):
              centroids = []
              for idx in range(k):
                  # Updating Centroids by taking mean of Cluster it belongs to
                  temp_cent = x[points == idx].mean(axis=0)
                  centroids.append(temp_cent)
              centroids = np.vstack(centroids)  # Updated Centroids
              distances = cdist(x, centroids, 'euclidean')
              points = np.array([np.argmin(i) for i in distances])
          return points, centroids
      
      iris = sns.load_dataset('iris')
      x = iris[['sepal_length', 'sepal_width']].to_numpy()
      
      k = 3
      points, centroids = kmeans(x, k)
      
      colors = plt.cm.Set2.colors
      for val, color in zip(range(k), colors):
          plt.scatter(centroids[val, 0], centroids[val, 1], facecolor='none', edgecolor=color, lw=3,
                      s=100, label=f'centroid {val}')
      for val, color in zip(range(k), colors):
          plt.scatter(x[points == val, 0], x[points == val, 1], color=color, label=f'set {val}')
      plt.legend(ncol=2)
      plt.show()
      
      from sklearn.datasets import load_iris
      
      iris_data = load_iris()
      x = iris_data.data[:, :2]
      
      color_givens = ['magenta', 'gold', 'cyan']
      for val, (name, color) in enumerate(zip(iris_data.target_names, color_givens)):
          plt.scatter(x[iris_data.target == val, 0], x[iris_data.target == val, 1],
                      color=color, s=150, alpha=0.6, label=f'given {name}')
      
      k = 3
      points, centroids = kmeans(x, k)
      colors_kmeans = plt.cm.Set1.colors
      for val, color in zip(range(k), colors_kmeans):
          plt.scatter(centroids[val, 0], centroids[val, 1], facecolor='none', edgecolor=color, lw=3,
                      s=150, label=f'centroid {val}')
      for val, color in zip(range(k), colors_kmeans):
          plt.scatter(x[points == val, 0], x[points == val, 1], color=color, label=f'set {val}')
      plt.xlabel(iris_data.feature_names[0])
      plt.ylabel(iris_data.feature_names[1])
      plt.legend(ncol=3)
      plt.show()
      

      Coordinates output to csv file python

      copy iconCopydownload iconDownload
      import csv
      
      kmc = [
          [4963621.73063468,52320928.30284858],
          [4981357.33667335,52293627.08917835],
          [4974134.37538941,52313274.21495327],
          [4945992.84398977,52304446.43606138],
          [4986701.53977273,52317701.43831169],
          [4993362.9143898,52296985.49271403],
          [4949408.06109325,52320541.97963558],
          [4966756.82872596,52301871.5655048],
          [4980845.77591313,52324669.94175716],
          [4970904.14472671,52292401.47190146],
      ]
      
      with open('output.csv', 'w', newline='') as f_output:
          csv_output = csv.writer(f_output)
          csv_output.writerow(['lat', 'long'])
          csv_output.writerows(kmc)
      
      lat,long
      4963621.73063468,52320928.30284858
      4981357.33667335,52293627.08917835
      4974134.37538941,52313274.21495327
      4945992.84398977,52304446.43606138
      4986701.53977273,52317701.43831169
      4993362.9143898,52296985.49271403
      4949408.06109325,52320541.97963558
      4966756.82872596,52301871.5655048
      4980845.77591313,52324669.94175716
      4970904.14472671,52292401.47190146
      
      import os
      
      os.chdir(os.path.dirname(os.path.abspath(__file__)))
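If the coordinates come straight from a fitted model, the list does not have to be typed out by hand. A hedged sketch (assuming pandas and scikit-learn are available; the points below are random stand-ins) that writes KMeans centres to the same kind of CSV:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

points = np.random.rand(200, 2) * [50000, 35000] + [4945000, 52290000]   # stand-in coordinates
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(points)

pd.DataFrame(kmeans.cluster_centers_, columns=['lat', 'long']) \
    .to_csv('output.csv', index=False)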
      

How to plot scatter plot with original variables after scaling with K-means

      copy iconCopydownload iconDownload
      from sklearn.preprocessing import StandardScaler
      from sklearn.cluster import KMeans
      from matplotlib import pyplot as plt
      import seaborn as sns
      import pandas as pd
      import numpy as np
      
      X1 = pd.DataFrame({'Wn': np.random.rand(30) * 12, 'LL': np.random.rand(30) * 6})
      
      scaler = StandardScaler()
      X1_scaled = pd.DataFrame(scaler.fit_transform(X1), columns=X1.columns)
      
      kmeans = KMeans(init="random",
                      n_clusters=3,
                      n_init=10,
                      max_iter=300,
                      random_state=123)
      X1['label'] = kmeans.fit_predict(X1_scaled[['Wn', 'LL']])
      
      # get centroids
      centroids = scaler.inverse_transform(kmeans.cluster_centers_)
      cen_x = [i[0] for i in centroids]
      cen_y = [i[1] for i in centroids]
      
      ax = sns.scatterplot(x='Wn', y='LL', hue='label',
                           data=X1, palette='colorblind',
                           legend='full')
      sns.scatterplot(x=cen_x, y=cen_y, s=80, color='black', ax=ax)
      
      plt.tight_layout()
      plt.show()
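As a design note: the model is fitted in the scaled space, and inverse_transform only maps the centroids back into the original units for display. If you would rather inspect the clusters in the space the model actually saw, a small variant reusing the variables from the snippet above:

# Variant: plot in the scaled space, using the scaled centroids directly.
ax2 = sns.scatterplot(x=X1_scaled['Wn'], y=X1_scaled['LL'],
                      hue=X1['label'], palette='colorblind', legend='full')
sns.scatterplot(x=kmeans.cluster_centers_[:, 0], y=kmeans.cluster_centers_[:, 1],
                s=80, color='black', ax=ax2)
plt.xlabel('Wn (scaled)')
plt.ylabel('LL (scaled)')
plt.tight_layout()
plt.show()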
      

      PCA after k-means clustering of multidimensional data

      copy iconCopydownload iconDownload
data_reduced = PCA(n_components=2).fit_transform(data[['dist1', 'dist2', ..., 'dist10']])

sklearn.cluster.AgglomerativeClustering()
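Putting those two pieces together, a minimal end-to-end sketch (the dist1 … dist10 columns here are random stand-ins for the real features, and scikit-learn is assumed):

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

cols = [f'dist{i}' for i in range(1, 11)]                    # dist1 ... dist10
data = pd.DataFrame(np.random.rand(200, 10), columns=cols)   # stand-in data

data_reduced = PCA(n_components=2).fit_transform(data[cols])
labels = AgglomerativeClustering(n_clusters=3).fit_predict(data_reduced)
print(labels[:10])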
      

      Community Discussions

      Trending Discussions on K-Means
      • Eigenvectors and samples to calculate the points on PCA scale
      • How do I color clusters after k-means and TSNE in either seaborn or matplotlib?
      • Visualise in R with ggplot, a k-means clustered developmental gene expression dataset
      • Splitting image by whitespace
      • finding clusters in a picture opencv/python
      • K-Means centroids not visible in 3D clustering plot
      • How to plot the cluster centers?
      • Coordinates output to csv file python
• How to plot scatter plot with original variables after scaling with K-means
      • PCA after k-means clustering of multidimensional data

      QUESTION

      Eigenvectors and samples to calculate the points on PCA scale

      Asked 2022-Apr-10 at 19:04

      I want to get the new points on the new scale for PC1 and PC2. I calculated the Eigenvalues, Eigenvectors and Contribution.

Now I want to calculate the points on the new scale (scores) so I can apply the K-Means clustering algorithm to them.

Whenever I try to calculate it with z_new = np.dot(v, vectors) (where v = np.cov(x)), I get wrong scores: [[14. -2. -2. -1. -0. 0. 0. -0. -0. 0. 0. -0. 0. 0.] for PC1 and [-3. -1. -2. -1. -0. -0. 0. 0. 0. -0. -0. 0. -0. -0.] for PC2. The correct scores (calculated with scikit-learn's PCA()) should be PC1: [ 4 4 -6 3 1 -5] and PC2: [ 0 -3 1 -1 5 -4].

      Here is my code:

      dataset = pd.read_csv("cands_dataset.csv")
      x = dataset.iloc[:, 1:].values
      
      m = x.mean(axis=1);
      for i in range(len(x)):
          x[i] = x[i] - m[i]
      
      z = x / np.std(x)
      v = np.cov(x)
      values, vectors = np.linalg.eig(v)
      d = np.diag(values)
      p = vectors
      z_new = np.dot(v, p) # <--- Here is where I get the new scores
      z_new = np.round(z_new,0).real
      print(z_new)
      

      The result I get:

      [[14. -2. -2. -1. -0.  0.  0. -0. -0.  0.  0. -0.  0.  0.]
       [-3. -1. -2. -1. -0. -0.  0.  0.  0. -0. -0.  0. -0. -0.]
       [-4. -0.  3.  3.  0.  0.  0.  0.  0. -0. -0.  0. -0. -0.]
       [ 2. -1. -2. -1.  0. -0.  0. -0. -0.  0.  0. -0.  0.  0.]
       [-2. -1.  8. -3. -0. -0. -0.  0.  0. -0. -0. -0.  0.  0.]
       [-3.  2. -1.  2. -0.  0.  0.  0.  0. -0. -0.  0. -0. -0.]
       [ 3. -1. -3. -1.  0. -0.  0. -0. -0.  0.  0. -0.  0.  0.]
       [11.  6.  4.  4. -0.  0. -0. -0. -0.  0.  0. -0.  0.  0.]
       [ 5. -8.  6. -1.  0.  0. -0.  0.  0.  0.  0. -0.  0.  0.]
       [-1. -1. -1.  1.  0. -0.  0.  0.  0.  0.  0.  0. -0. -0.]
       [ 5.  7.  1. -1.  0. -0. -0. -0. -0.  0.  0. -0. -0. -0.]
       [12. -6. -1.  2.  0.  0.  0. -0. -0.  0.  0. -0.  0.  0.]
       [ 3.  6.  0.  0.  0. -0. -0. -0. -0.  0.  0.  0. -0. -0.]
       [ 5.  5. -0. -4. -0. -0. -0. -0. -0.  0.  0. -0.  0.  0.]]
      

Dataset (requested by a comment):

[dataset image omitted; the same values appear in the data array in the answer below]

      ANSWER

      Answered 2022-Apr-10 at 19:04

      The way I look at this, you have 6 samples with 14 dimensions. The PCA procedure is as follows:

      1. Remove the mean

      Starting with the following data:

      data  = np.array([[10, 9, 1, 7, 5, 3],
                        [10, 8,10, 8, 8,10],
                        [ 4, 6, 8, 9, 8, 9],
                        [ 9, 9, 6, 6, 8, 7],
                        [ 4, 9,10, 9, 5, 5],
                        [ 7, 6, 9, 9,10,10],
                        [10, 9, 6, 6, 8, 7],
                        [ 5, 4, 2,10, 9, 1],
                        [ 4,10, 4, 9, 2, 6],
                        [ 8, 7, 7, 7, 7, 9],
                        [ 5, 5, 6, 6, 9, 1],
                        [ 9,10, 1, 9, 5, 6],
                        [ 7, 6, 6, 6,10, 4],
                        [10, 9, 9, 7, 9, 4]])
      

      We can remove the mean via:

      centered_data = data - np.mean(data, axis=1)[:, None]
      
      2. Create a covariance matrix

      Can be done as follows:

      covar = np.cov(centered_data)
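Note (not part of the original answer): with a (14, 6) array, np.cov treats each row as a variable by default, which is exactly the orientation this answer relies on:

print(centered_data.shape)   # (14, 6): 14 dimensions (rows), 6 samples (columns)
print(covar.shape)           # (14, 14): np.cov treats each row as a variable by default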
      
      3. Getting the Principal Components

      This can be done using eigenvalue decomposition of the covariance matrix

      eigen_vals, eigen_vecs = np.linalg.eig(covar)
      eigen_vals, eigen_vecs = np.real(eigen_vals), np.real(eigen_vecs)
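A side note (not part of the original answer): np.linalg.eig does not guarantee any particular ordering of the eigenvalues, so before picking the "first two" components it is safer to sort them, keeping the eigenvector columns aligned:

order = np.argsort(eigen_vals)[::-1]     # indices of eigenvalues, largest first
eigen_vals = eigen_vals[order]
eigen_vecs = eigen_vecs[:, order]        # reorder the matching eigenvector columns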
      
4. Dimensionality reduction

Now we can do dimensionality reduction by choosing the two PCs with the highest eigenvalues (variance). In your example you wanted 2 dimensions, so we take the two major PCs:

      eigen_vals = 
      array([ 3.34998559e+01,  2.26499704e+01,  1.54115835e+01,  9.13166675e+00,
              1.27359015e+00, -3.10462438e-15, -1.04740277e-15, -1.04740277e-15,
             -2.21443036e-16,  9.33811755e-18,  9.33811755e-18,  6.52780501e-16,
              6.52780501e-16,  5.26538300e-16])
      

We can see that the first two eigenvalues are the largest:

      eigen_vals[:2] = array([33.49985588, 22.64997038])
      

Therefore, we can project the data onto the first two PCs as follows:

      projected_data = eigen_vecs[:, :2].T.dot(centered_data)
      

This can now be scattered, and we can see that the 14 dimensions are reduced to 2:

      PC1 = [0.59123632, -0.10134531, -0.20795954,  0.1596049 , -0.07354629, 0.19588723,  0.19151677,  0.33847213,  0.22330841, -0.03466414, 0.1001646 ,  0.52913917,  0.09633029,  0.16141852]
      PC2 = [-0.07551251, -0.07531288,  0.0188486 , -0.01280896, -0.07309957, 0.12354371, -0.01170589,  0.49672196, -0.43813664, -0.09948337, 0.49590461, -0.25700432,  0.38198034,  0.2467548 ]
      

[figure omitted: scatter of the six samples projected onto PC1 and PC2]

      General analysis

Since we did PCA, we now have orthogonal variance in each dimension (diagonal covariance matrix). To better understand how much dimensionality reduction is possible, we can see how the total variance of the data is distributed across the different dimensions. This can be done via the eigenvalues:

      var_dim_contribution = eigen_vals / np.sum(eigen_vals)
      

      Plotting this results in:

[figure omitted: explained variance ratio per principal component]

We can see here that using the 2 major PCs we can describe ~67% of the variance. Adding a third dimension boosts us towards ~90% of the variance, which is a good reduction from 14 dimensions. This can be seen better in the cumulative variance plot.

      var_cumulative_contribution = np.cumsum(eigen_vals / np.sum(eigen_vals))
      

[figure omitted: cumulative explained variance ratio]
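Since the original figures are not reproduced here, a quick sketch to regenerate both variance plots (reusing eigen_vals from above; matplotlib assumed):

import matplotlib.pyplot as plt

ratios = eigen_vals / np.sum(eigen_vals)
components = range(1, len(ratios) + 1)
plt.bar(components, ratios, label='per component')
plt.step(components, np.cumsum(ratios), where='mid', label='cumulative')
plt.xlabel('principal component')
plt.ylabel('explained variance ratio')
plt.legend()
plt.show()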

      Comparison with sklearn

      When comparing with sklearn.decomposition.PCA we get the following:

      from sklearn.decomposition import PCA
      pca = PCA(n_components=2)
      pca.fit(centered_data.T)
      print(pca.explained_variance_ratio_)  # [0.40870097 0.27633148]
      print(pca.explained_variance_)        # [33.49985588 22.64997038]
      

We see that we get the same explained variance ratios and variance values as the ones from the manual computation. In addition, the resulting PCs are:

      print(pca.components_)
      [[-0.59123632  0.10134531  0.20795954 -0.1596049   0.07354629  0.19588723 0.19151677 -0.33847213 -0.22330841  0.03466414 -0.1001646  -0.52913917 -0.09633029 -0.16141852]
       [-0.07551251 -0.07531288  0.0188486  -0.01280896 -0.07309957  0.12354371 -0.01170589  0.49672196 -0.43813664 -0.09948337  0.49590461 -0.25700432 0.38198034  0.2467548 ]]
      

And we see that we get the same results as scikit-learn (individual components may differ by a sign flip, which is expected, since eigenvectors are only defined up to sign).
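For completeness (not part of the original answer): the question's z_new = np.dot(v, p) multiplies the covariance matrix by the eigenvectors instead of projecting the centered data onto them, which is why the scores came out wrong. Since the stated goal was to run K-Means on the scores, a minimal sketch of that final step, reusing projected_data from above (the choice of 2 clusters is arbitrary, and scikit-learn is assumed):

from sklearn.cluster import KMeans

scores = projected_data.T                        # shape (6, 2): one row per sample in PC space
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)
print(kmeans.labels_)                            # cluster assignment for each of the 6 samples
print(kmeans.cluster_centers_)                   # centroids in PC1/PC2 coordinates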

      Source https://stackoverflow.com/questions/71816829

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

      Vulnerabilities

      No vulnerabilities reported

      Install K-Means

      You can download it from GitHub.
You can use K-Means like any standard Java library. Please include the jar files in your classpath. You can also use any IDE to run and debug the K-Means component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.

      Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask them on the community page Stack Overflow.

      • © 2022 Open Weaver Inc.