# Chapter 68 Principal Component Analysis

A big computer, a complex algorithm and a long time does not equal science.
— Robert Gentleman

The PCA lecture was taught by Professor Luigi Palla.

## 68.1 Problems that arise when data are correlated

Edgeworth (1891) made one of the earliest attempts to summarise a set of correlated variables measured on a sample of men, namely height (H), forearm length (F) and leg length (L), with equations of the following form:

\begin{aligned} Y_1 & = 0.16H + 0.51F + 0.39L \\ Y_2 & = -0.17H + 0.69F + 0.09L \\ Y_3 & = -0.15H + 0.25F + 0.52L \end{aligned}

\begin{aligned} Corr(X_1,X_2) & = \frac{Cov(X_1,X_2)}{SD(X_1)SD(X_2)} \\ & =\frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)Var(X_2)}}\\ & = Cov(X_1,X_2) \\ & = 0.3 \end{aligned}

The regression coefficient $$\hat\beta$$ (see Section 27.2 for the concept) of a linear regression model with $$X_2$$ (weight) as the outcome variable and $$X_1$$ (height) as the single explanatory variable is:

\begin{aligned} \hat\beta & = \frac{SS_{x_1x_2}}{SS_{x_1x_1}} \\ & = \frac{Cov(X_1,X_2)}{SD(X_1)^2} \\ & = 0.3 \end{aligned}
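Because both variables here are standardized (standard deviations of 1), the regression slope, the covariance and the correlation all coincide. A minimal numpy sketch illustrates this, using simulated standardized data with true correlation 0.3 rather than the actual height/weight measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Simulate a pair of variables with true correlation 0.3, then standardize both.
x1 = rng.standard_normal(n)
x2 = 0.3 * x1 + np.sqrt(1 - 0.3**2) * rng.standard_normal(n)
x1 = (x1 - x1.mean()) / x1.std()
x2 = (x2 - x2.mean()) / x2.std()

# For standardized variables the OLS slope equals the correlation coefficient:
beta = np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)
corr = np.corrcoef(x1, x2)[0, 1]
print(round(beta, 2), round(corr, 2))  # both close to 0.3
```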

$(OP_j)^2 = (P_jP_j^\prime)^2 + (OP_j^\prime)^2$

$$\sum_j (OP_j)^2 = \sum_j(P_jP_j^\prime)^2 + \sum_j(OP_j^\prime)^2 \tag{68.1}$$

## 68.2 Maximizing the variance is equivalent to maximizing the lengths of the projections of the data points onto the new axis

$$\sum_j (OP_j)^2/n = \sum_j(P_jP_j^\prime)^2/n + \sum_j(OP_j^\prime)^2/n \tag{68.2}$$

\begin{aligned} y_1 & = x_1\cos\theta + x_2\sin\theta \\ y_2 & = -x_1\sin\theta + x_2\cos\theta \end{aligned} \tag{68.3}

\begin{aligned} (OP_j)^2 & = x_1^2 + x_2^2 = y_1^2 + y_2^2 \\ & = r^2 \\ \because x_1 & = r\times\cos(\alpha +\theta) \\ x_2 & = r\times\sin(\alpha + \theta) \\ y_1 & = r\times\cos(\alpha) \\ y_2 & = r\times\sin(\alpha) \\ \therefore x_1 & = r[\cos\alpha\cos\theta - \sin\alpha\sin\theta] \\ & = y_1\cos\theta - y_2\sin\theta \\ x_2 & = r[\sin\alpha\cos\theta + \cos\alpha\sin\theta] \\ & = y_2\cos\theta+y_1\sin\theta \\ \Rightarrow x_1\cos\theta & = y_1\cos^2\theta -y_2 \sin\theta\cos\theta \\ x_2\sin\theta & = y_2\cos\theta\sin\theta + y_1\sin^2\theta \\ \textbf{Sum the}& \textbf{ above two equations} \\ \Rightarrow y_1 & = \frac{x_1\cos\theta + x_2 \sin\theta}{(\cos^2\theta + \sin^2\theta)} \\ y_1 & = x_1\cos\theta + x_2 \sin\theta \\ \textbf{Similarly}& \\ \Rightarrow x_1\sin\theta & = y_1\cos\theta\sin\theta -y_2 \sin^2\theta \\ x_2\cos\theta & = y_2\cos^2\theta + y_1\sin\theta\cos\theta \\ \textbf{Subtract the first}& \textbf{ from the second equation} \\ \Rightarrow y_2 & = \frac{-x_1\sin\theta + x_2\cos\theta}{(\cos^2\theta + \sin^2\theta)} \\ y_2 & = -x_1\sin\theta + x_2\cos\theta \end{aligned}

$$y_1, y_2$$ are the variables on the new, rotated axes. In this simple example, rotating the original data $$x_1, x_2$$ produces the new data $$y_1, y_2$$; the two are related by nothing more than a linear transformation. In general, to move an original data matrix (of dimension $$n\times p$$) onto a new set of axes, we simply multiply it by a square projection matrix $$\mathbf{P}$$ of dimension $$p\times p$$, where $$p$$ is the number of variables.

$\left[ \begin{array}{cc} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{array} \right] \tag{68.4}$
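The rotation in Equations (68.3) and (68.4) can be checked numerically. The sketch below is an illustrative Python/numpy translation with an arbitrary angle and simulated data: it rotates a small data matrix and verifies that every point keeps its distance to the origin, the fact underlying Equation (68.2):

```python
import numpy as np

theta = np.pi / 6  # an arbitrary rotation angle
# The projection (rotation) matrix of Eq. (68.4)
P = np.array([[np.cos(theta),  np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 2))   # a small n x p data matrix with p = 2

# Each row of Y is (y1, y2): y1 = x1*cos + x2*sin, y2 = -x1*sin + x2*cos
Y = X @ P.T

# Rotation preserves each point's squared distance to the origin, (OP_j)^2
print(np.allclose((X**2).sum(axis=1), (Y**2).sum(axis=1)))  # True
```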

1. The $$p$$ mutually independent variables are each linear transformations of the original variables $$x_1, x_2, \dots, x_p$$: \begin{aligned} y_1 & = a_{11}x_1 + a_{12}x_2 + \cdots + a_{1p}x_p \\ y_2 & = a_{21}x_1 + a_{22}x_2 + \cdots + a_{2p}x_p \\ \vdots & \\ y_p & = a_{p1}x_1 + a_{p2}x_2 + \cdots + a_{pp}x_p \\ \end{aligned}

2. The $$p$$ mutually independent variables are obtained by maximizing their contributions to the total variance of the data.
3. These $$p$$ mutually independent variables are called the principal components of the data.
4. The principal components are uncorrelated with one another and are ordered by the amount of the total system variability that they explain:

$\text{Cov}(y_j, y_k) = 0 \text{ for any } j \neq k, \; j, k \in \{1, \dots, p\} \\ \text{Var}(y_1) \geqslant \text{Var}(y_2) \geqslant \text{Var}(y_3) \geqslant \dots \geqslant \text{Var}(y_p)$

## 68.3 Mathematical derivation

• $$\textbf{S}$$ is the variance-covariance matrix of the data.
• $$\textbf{P}$$ is the orthogonal projection matrix; each of its columns holds the coordinates of one of the new, rotated variables, that is, of one principal component. These columns are also called the eigenvectors.
• $$\bf{\Lambda}$$ is a diagonal matrix whose diagonal entries are the variances of the principal components, also called the eigenvalues. Eigenvalues are often referred to as inertias; they run from largest at the top left of the diagonal to smallest at the bottom right. Each eigenvalue is the variance of the corresponding eigenvector, that is, the inertia of the total variance of the data projected onto that principal component, which can be understood as how much of the total variance of the data that component explains (its explained variance).
Theorem 68.1 (Spectral decomposition) By the spectral theorem, if the matrix $$\textbf{S}$$ is symmetric, it can always be decomposed as: $\textbf{S} = \textbf{P}\bf{\Lambda}\textbf{P}^t$
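Theorem 68.1 can be verified numerically. The sketch below (illustrative Python/numpy with simulated data) factorizes a symmetric covariance matrix with `numpy.linalg.eigh`, reorders the eigenvalues from largest to smallest as PCA requires, and multiplies the pieces back together to recover $$\textbf{S}$$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
S = np.cov(X, rowvar=False)        # symmetric variance-covariance matrix

# Spectral decomposition S = P Lambda P^t (eigh is for symmetric matrices)
eigenvalues, P = np.linalg.eigh(S)
# eigh returns eigenvalues in ascending order; PCA orders them largest first
order = np.argsort(eigenvalues)[::-1]
Lam = np.diag(eigenvalues[order])
P = P[:, order]

print(np.allclose(S, P @ Lam @ P.T))  # True: S is recovered exactly
```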

### 68.3.1 Beyond symmetric matrices: singular value decomposition (SVD)

$\mathbf{X}_{n\times p} = \mathbf{U}_{n\times n}\bf{\Sigma}_{n \times p} \mathbf{W}_{p\times p}^t$

• $$\mathbf{U}_{n\times n}$$ is the matrix of left singular vectors;
• $$\Sigma_{n \times p}$$ is the matrix of singular values;
• $$\mathbf{W}_{p\times p}$$ is the matrix of right singular vectors.

\begin{aligned} \mathbf{X}^t \mathbf{X} & = \mathbf{W}\bf{\Sigma}^t\mathbf{U}^t\times\mathbf{U}\bf{\Sigma}\mathbf{W}^t \\ & = \mathbf{W}\bf{\Sigma}^2\mathbf{W}^t \\ \Rightarrow \bf{\Sigma}^2 & = \bf{\Lambda} \end{aligned}
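The relation $$\bf{\Sigma}^2 = \bf{\Lambda}$$ can also be checked numerically. In the illustrative Python/numpy sketch below (simulated centred data), the squared singular values of $$\mathbf{X}$$ match the eigenvalues of $$\mathbf{X}^t\mathbf{X}$$:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((6, 4))
X = X - X.mean(axis=0)             # centre the columns

# SVD: X = U Sigma W^t
U, sigma, Wt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of X^t X equal the squared singular values: Sigma^2 = Lambda
lam = np.linalg.eigvalsh(X.T @ X)[::-1]   # flip to descending order
print(np.allclose(lam, sigma**2))         # True
```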

## 68.4 A worked example of principal component analysis

| | Odour.intensity | Odour.typicality | Pulpiness | Intensity.of.taste | Acidity | Bitterness | Sweetness |
|---|---:|---:|---:|---:|---:|---:|---:|
| Pampryl amb. | 2.82 | 2.53 | 1.66 | 3.46 | 3.15 | 2.97 | 2.60 |
| Tropicana amb. | 2.76 | 2.82 | 1.91 | 3.23 | 2.55 | 2.08 | 3.32 |
| Fruvita fr. | 2.83 | 2.88 | 4.00 | 3.45 | 2.42 | 1.76 | 3.38 |
| Joker amb. | 2.76 | 2.59 | 1.66 | 3.37 | 3.05 | 2.56 | 2.80 |
| Tropicana fr. | 3.20 | 3.02 | 3.69 | 3.12 | 2.33 | 1.97 | 3.34 |
| Pampryl fr. | 3.07 | 2.73 | 3.34 | 3.54 | 3.31 | 2.63 | 2.90 |

insheet using "http://factominer.free.fr/bookV2/orange.csv" , delimiter(";") clear
pca odour* pulp* intens* acid* bitter* sweetness, cor


Principal components/correlation                 Number of obs    =          6
                                                 Number of comp.  =          5
                                                 Trace            =          7
    Rotation: (unrotated = principal)            Rho              =     1.0000

--------------------------------------------------------------------------
Component |   Eigenvalue   Difference         Proportion   Cumulative
-------------+------------------------------------------------------------
Comp1 |      4.74369       3.4104             0.6777       0.6777
Comp2 |      1.33329      .513448             0.1905       0.8681
Comp3 |      .819842      .735818             0.1171       0.9853
Comp4 |     .0840232     .0648702             0.0120       0.9973
Comp5 |      .019153      .019153             0.0027       1.0000
Comp6 |            0            0             0.0000       1.0000
Comp7 |            0            .             0.0000       1.0000
--------------------------------------------------------------------------

Principal components (eigenvectors)

------------------------------------------------------------------------------
Variable |    Comp1     Comp2     Comp3     Comp4     Comp5 | Unexplained
-------------+--------------------------------------------------+-------------
odourinten~y |   0.2110    0.6534   -0.5174    0.0286    0.0310 |           0
odourtypic~y |   0.4524    0.1162   -0.0646    0.2668    0.2952 |           0
pulpiness |   0.3313    0.5340    0.3290   -0.3327   -0.2250 |           0
intensityo~e |  -0.2984    0.3714    0.6910    0.0189    0.3456 |           0
acidity |  -0.4191    0.3017   -0.0237    0.7065   -0.4106 |           0
bitterness |  -0.4292    0.1628   -0.3152   -0.0974    0.6712 |           0
sweetness |   0.4384   -0.1374    0.2061    0.5553    0.3503 |           0
------------------------------------------------------------------------------

# library(FactoMineR)
org.pca <- PCA(orange[, 1:7], ncp = 7, graph = FALSE)

# library(factoextra)
eig.val <- get_eigenvalue(org.pca)
eig.val # eigenvalues (variances of the principal components)
##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.743692688      67.76703840                   67.767038
## Dim.2 1.333289855      19.04699793                   86.814036
## Dim.3 0.819841150      11.71201643                   98.526053
## Dim.4 0.084023297       1.20033282                   99.726386
## Dim.5 0.019153009       0.27361442                  100.000000
# eigen vectors:
org.pca$svd$V
##             [,1]        [,2]         [,3]         [,4]         [,5]
## [1,]  0.21100074  0.65340689 -0.517409852  0.028573070  0.030958154
## [2,]  0.45241413  0.11618305 -0.064606287  0.266760192  0.295222955
## [3,]  0.33132165  0.53403262  0.329025446 -0.332685134 -0.225026986
## [4,] -0.29836065  0.37144476  0.690990232  0.018942515  0.345597119
## [5,] -0.41905731  0.30166462 -0.023688451  0.706533003 -0.410644925
## [6,] -0.42917948  0.16282112 -0.315220908 -0.097425116  0.671196644
## [7,]  0.43840960 -0.13742859  0.206064224  0.555251136  0.350251763

\begin{aligned} y_1 & = 0.2110x_1 + 0.4524x_2 + 0.3313x_3 - 0.2984x_4 - 0.4191x_5 - 0.4292x_6 + 0.4384x_7 \\ y_2 & = 0.6534x_1 + 0.1162x_2 + 0.5340x_3 + 0.3714x_4 + 0.3017x_5 + 0.1628x_6 - 0.1374x_7 \\ y_3 & =-0.5174x_1 - 0.0646x_2 + 0.3290x_3 + 0.6910x_4 - 0.0237x_5 - 0.3152x_6 + 0.2061x_7 \\ y_4 & = 0.0286x_1 + 0.2668x_2 - 0.3327x_3 + 0.0189x_4 + 0.7065x_5 - 0.0974x_6 + 0.5553x_7 \\ y_5 & = 0.0310x_1 + 0.2952x_2 - 0.2250x_3 + 0.3456x_4 - 0.4106x_5 + 0.6712x_6 + 0.3503x_7 \\ \end{aligned}

fviz_pca_ind(org.pca, pointsize = "cos2", pointshape = 21,
             fill = "#E7B800", repel = TRUE, labelsize = 2)

summary(org.pca)
##
## Call:
## PCA(X = orange[, 1:7], ncp = 7, graph = FALSE)
##
##
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5
## Variance               4.744   1.333   0.820   0.084   0.019
## % of var.             67.767  19.047  11.712   1.200   0.274
## Cumulative % of var.  67.767  86.814  98.526  99.726 100.000
##
## Individuals
##                        Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
## Pampryl amb.       |  3.029 | -2.984 31.288  0.970 | -0.082  0.085  0.001 | -0.333  2.254  0.012 |
## Tropicana amb.     |  1.976 |  0.886  2.761  0.201 | -1.715 36.771  0.753 | -0.087  0.154  0.002 |
## Fruvita fr.        |  2.595 |  1.937 13.182  0.557 |  0.040  0.020  0.000 |  1.710 59.450  0.434 |
## Joker amb.         |  2.094 | -1.896 12.631  0.820 | -0.834  8.686  0.158 | -0.154  0.481  0.005 |
## Tropicana fr.      |  3.512 |  3.186 35.660  0.823 |  0.589  4.335  0.028 | -1.345 36.774  0.147 |
## Pampryl fr.        |  2.338 | -1.129  4.479  0.233 |  2.002 50.102  0.733 |  0.209  0.887  0.008 |
##
## Variables
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
## Odour.intensity    |  0.460  4.452  0.211 |  0.754 42.694  0.569 | -0.468 26.771  0.219 |
## Odour.typicality   |  0.985 20.468  0.971 |  0.134  1.350  0.018 | -0.058  0.417  0.003 |
## Pulpiness          |  0.722 10.977  0.521 |  0.617 28.519  0.380 |  0.298 10.826  0.089 |
## Intensity.of.taste | -0.650  8.902  0.422 |  0.429 13.797  0.184 |  0.626 47.747  0.391 |
## Acidity            | -0.913 17.561  0.833 |  0.348  9.100  0.121 | -0.021  0.056  0.000 |
## Bitterness         | -0.935 18.420  0.874 |  0.188  2.651  0.035 | -0.285  9.936  0.081 |
## Sweetness          |  0.955 19.220  0.912 | -0.159  1.889  0.025 |  0.187  4.246  0.035 |

...{omitted}...
Individuals
Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2
Pampryl amb.       |  3.029 | -2.984 31.288  0.970 | -0.082  0.085  0.001
Tropicana amb.     |  1.976 |  0.886  2.761  0.201 | -1.715 36.771  0.753
Fruvita fr.        |  2.595 |  1.937 13.182  0.557 |  0.040  0.020  0.000
Joker amb.         |  2.094 | -1.896 12.631  0.820 | -0.834  8.686  0.158
Tropicana fr.      |  3.512 |  3.186 35.660  0.823 |  0.589  4.335  0.028
Pampryl fr.        |  2.338 | -1.129  4.479  0.233 |  2.002 50.102  0.733
...{omitted}...

• Dist is the distance of each individual (row of the data) from the origin (the position of the average centre of gravity). In these data the two juices furthest from the origin are Pampryl amb. (far left) and Tropicana fr. (far right).
• Dim.1 is the coordinate of the individual on the axis of the first principal component.
• ctr is the percentage that the individual contributed to the extraction of the first principal component.
• cos2 is the inertia of the individual projected onto the principal component divided by the total inertia of the individual, also called the quality of representation of the individual on the corresponding principal component (the quality of representation of an individual $$i$$ on the principal component $$s$$ is measured by the distance between the point within the space $$u_s$$ and the projection on the component).

$\text{quality of representation}_s(i) = \frac{\text{Projected inertia of }i \text{ on } u_s}{\text{Total inertia of }i} = \cos^2\theta_i^s$
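As a worked check, taking the Pampryl amb. row of the individuals table above (Dist 3.029, Dim.1 coordinate -2.984), the cos2 value printed in the output can be recovered by hand:

```python
# cos2 of an individual on a component = (coordinate on the component)^2
# divided by the squared distance of the individual from the origin
dist, dim1 = 3.029, -2.984   # Pampryl amb. row of the PCA output above
cos2 = dim1**2 / dist**2     # close to the 0.970 printed in the table
print(cos2)
```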

The second half of the PCA report analyses the relationships among the variables in the data.

Variables
Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2
Odour.intensity    |  0.460  4.452  0.211 |  0.754 42.694  0.569 | -0.468 26.771  0.219 |
Odour.typicality   |  0.985 20.468  0.971 |  0.134  1.350  0.018 | -0.058  0.417  0.003 |
Pulpiness          |  0.722 10.977  0.521 |  0.617 28.519  0.380 |  0.298 10.826  0.089 |
Intensity.of.taste | -0.650  8.902  0.422 |  0.429 13.797  0.184 |  0.626 47.747  0.391 |
Acidity            | -0.913 17.561  0.833 |  0.348  9.100  0.121 | -0.021  0.056  0.000 |
Bitterness         | -0.935 18.420  0.874 |  0.188  2.651  0.035 | -0.285  9.936  0.081 |
Sweetness          |  0.955 19.220  0.912 | -0.159  1.889  0.025 |  0.187  4.246  0.035 |

fviz_pca_var(org.pca, repel = TRUE, labelsize = 2) 
• On the first principal component axis (Dim.1), the positively correlated variables Odour.intensity, Odour.typicality, Pulpiness and Sweetness fall in the right half of the plot, while the negatively correlated variables Intensity.of.taste, Acidity and Bitterness fall in the left half.
• Similarly, on the second principal component axis (Dim.2), only the negatively correlated Sweetness falls in the lower half.
• The longer the arrow from the origin to a variable (the larger its cos2), the better the quality of representation of the variable on that component.

fviz_pca_biplot(org.pca, repel = TRUE, pointsize = "cos2", pointshape = 21,
                labelsize = 2)

## 68.5 Adding supplementary variables and individuals (supplementary elements) to PCA plots

| | Glucose | Fructose | Saccharose | Sweetening.power | pH | Citric.acid | Vitamin.C | Way.of.preserving | Origin |
|---|---:|---:|---:|---:|---:|---:|---:|---|---|
| Pampryl amb. | 25.32 | 27.36 | 36.45 | 89.95 | 3.59 | 0.84 | 43.44 | Ambient | Other |
| Tropicana amb. | 17.33 | 20.00 | 44.15 | 82.55 | 3.89 | 0.67 | 32.70 | Ambient | Florida |
| Fruvita fr. | 23.65 | 25.65 | 52.12 | 102.22 | 3.85 | 0.69 | 37.00 | Fresh | Florida |
| Joker amb. | 32.42 | 34.54 | 22.92 | 90.71 | 3.60 | 0.95 | 36.60 | Ambient | Other |
| Tropicana fr. | 22.70 | 25.32 | 45.80 | 94.87 | 3.82 | 0.71 | 39.50 | Fresh | Florida |
| Pampryl fr. | 27.16 | 29.48 | 38.94 | 96.51 | 3.68 | 0.74 | 27.00 | Fresh | Other |

org.pca <- PCA(orange, quanti.sup = 8:14, quali.sup = 15:16,
               graph = FALSE)
org.pca$quanti.sup
## $coord
##                         Dim.1       Dim.2         Dim.3        Dim.4       Dim.5
## Glucose          -0.572454497  0.31123036  0.0263849025 -0.208332016 -0.72892600
## Fructose         -0.561054870  0.31451133 -0.0084203081 -0.181973281 -0.74371694
## Saccharose        0.750440168  0.14492075  0.3246761207 -0.075192796  0.55205886
## Sweetening.power  0.300767457  0.67471255  0.4895557731 -0.389880490 -0.25026037
## pH                0.879663611 -0.23629707  0.1935892274  0.245926101  0.26907097
## Citric.acid      -0.739370266 -0.12160048 -0.1957416737 -0.278669842 -0.56795532
## Vitamin.C        -0.044575912 -0.31698263 -0.2545161911 -0.905066399  0.11666756
##
##
## $cor
##                         Dim.1       Dim.2         Dim.3        Dim.4       Dim.5
## Glucose          -0.572454497  0.31123036  0.0263849025 -0.208332016 -0.72892600
## Fructose         -0.561054870  0.31451133 -0.0084203081 -0.181973281 -0.74371694
## Saccharose        0.750440168  0.14492075  0.3246761207 -0.075192796  0.55205886
## Sweetening.power  0.300767457  0.67471255  0.4895557731 -0.389880490 -0.25026037
## pH                0.879663611 -0.23629707  0.1935892274  0.245926101  0.26907097
## Citric.acid      -0.739370266 -0.12160048 -0.1957416737 -0.278669842 -0.56795532
## Vitamin.C        -0.044575912 -0.31698263 -0.2545161911 -0.905066399  0.11666756
##
## $cos2
##                         Dim.1       Dim.2          Dim.3        Dim.4       Dim.5
## Glucose          0.3277041510 0.096864337 0.000696163079 0.0434022288 0.531333120
## Fructose         0.3147825674 0.098917374 0.000070901589 0.0331142749 0.553114882
## Saccharose       0.5631604458 0.021002025 0.105414583332 0.0056539566 0.304768989
## Sweetening.power 0.0904610632 0.455237031 0.239664854983 0.1520067964 0.062630255
## pH               0.7738080690 0.055836307 0.037476788962 0.0604796473 0.072399188
## Citric.acid      0.5466683910 0.014786677 0.038314802810 0.0776568810 0.322573248
## Vitamin.C        0.0019870119 0.100477991 0.064778491512 0.8191451868 0.013611319

fviz_pca_var(org.pca, repel = TRUE) 

### 68.5.1 Visualising the relationship between categorical supplementary variables and the individuals

p <- fviz_pca_ind(org.pca, habillage = 15,
                  palette = "jco", repel = TRUE)
p

p <- fviz_pca_ind(org.pca, habillage = 16,
                  palette = "jco", repel = TRUE)
p

## 68.6 Cluster analysis/PCA practical

1. How to use cluster analysis and principal component analysis to explore the relationships within a set of multivariate data;
2. Understand how to choose an appropriate distance measure and clustering method;
3. Draw and interpret the dendrogram produced by a hierarchical clustering algorithm;
4. Use principal component analysis to transform the coordinates of the data, compute the variance-covariance matrix of the variables, and know how to decide how many principal components to retain;
5. Identify possible latent strata/groups in the data by plotting them on a small number of principal component axes.

### 68.6.1 The data and some simple background

1. Read the data into R and look at simple summary statistics and the distributions of these 4 biomarkers. Are they measured in the same units?
plant <- read_dta("backupfiles/plant.dta")
plant <- plant[, 1:4]
head(plant)
## # A tibble: 6 x 4
##       bm1     bm2     bm3     bm4
##     <dbl>   <dbl>   <dbl>   <dbl>
## 1  17.4    78.6   101.    109.
## 2  87      30.1    79.1     6.60
## 3   0.100   0.600   0.900   0.200
## 4 106      10      44.6    57.6
## 5 141.    122     115.    123.
## 6   0.5     0.800   0.200   0.5
summ(plant)
##
## No. of observations = 50
##
##   Var. name obs. mean   median  s.d.   min.   max.
## 1 bm1       50   56.6   47.55   48.05  0      143
## 2 bm2       50   53.21  52.7    45.13  0      143.6
## 3 bm3       50   61.43  55.25   51.47  0.2    147.9
## 4 bm4       50   57.43  56.75   45.45  0.1    146.1
psych::describe(plant)
##     vars  n  mean    sd median trimmed   mad min   max range skew kurtosis   se
## bm1    1 50 56.60 48.05  47.55   53.36 66.05 0.0 143.0 143.0 0.27    -1.38 6.80
## bm2    2 50 53.21 45.13  52.70   49.62 59.45 0.0 143.6 143.6 0.43    -1.07 6.38
## bm3    3 50 61.43 51.47  55.25   58.86 69.76 0.2 147.9 147.7 0.27    -1.47 7.28
## bm4    4 50 57.43 45.45  56.75   54.71 52.41 0.1 146.1 146.0 0.32    -1.12 6.43

1. Can each of these biomarkers on its own provide information about some aspect of these plants? Think about how we could answer this question (hint: compute the correlation coefficients between the markers).

cor(plant)
##            bm1        bm2        bm3        bm4
## bm1 1.00000000 0.49826220 0.59414820 0.26769269
## bm2 0.49826220 1.00000000 0.50574946 0.33347350
## bm3 0.59414820 0.50574946 1.00000000 0.32094816
## bm4 0.26769269 0.33347350 0.32094816 1.00000000

1. Describe the dimension of the correlation coefficient matrix we computed in the previous step.

1. Think again about your answer to question 1, then choose an appropriate measure of the distance between individual specimens. Try classifying the specimens with a simple cluster analysis command.

# prepare hierarchical cluster
hc <-  hclust(dist(plant), "ave")

plot(hc, cex = 0.8, hang = -1,
     main = "", ylab = "L2 dissimilarity measure",
     xlab = "No. of specimen")

1. What does the plot look like if we measure the distance between specimens with the squared Euclidean distance instead of the simple Euclidean distance?
hc <- hclust(dist(plant)^2)

plot(hc, cex = 0.8, hang = -1,
     main = "", ylab = "L2squared dissimilarity measure",
     xlab = "No. of specimen", sub = "")
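Squaring is a monotone transformation of the (non-negative) distances, so the rank order of all pairwise distances is preserved; linkage methods that depend only on that ranking (such as single and complete linkage) merge the specimens in the same order, and only the dendrogram heights change. A small numpy sketch (simulated data standing in for the plant biomarkers) verifies the rank preservation:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 4))    # 6 simulated specimens, 4 markers

# Pairwise Euclidean (L2) distances between the specimens
D = np.sqrt(((X[:, None, :] - X[None, :, :])**2).sum(axis=2))
iu = np.triu_indices(6, k=1)       # upper triangle: each pair once
d, d2 = D[iu], D[iu]**2

# Squaring preserves the ordering of the pairwise distances
print(np.array_equal(np.argsort(d), np.argsort(d2)))  # True
```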

plot(cluster::agnes(plant, metric = "manhattan", stand = F), which.plots = 2, hang = -1,
     xlab = "No. of specimen", main = "", ylab = "L1 dissimilarity measure", sub = "", cex = 0.8)

1. Next, still using the Euclidean distance as the measure, and unlike the attempts above, here we try complete linkage and single linkage