07. PCA (Principal Component Analysis)

JinungKim 2020. 5. 5. 13:53

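0. Overview

1. Why use principal component analysis

- When there are many variables, PCA condenses them into a few components that retain most of the variance, which makes the model simpler. PCA itself does not look at the dependent variable, so the retained components are the high-variance directions, not necessarily the ones most predictive of the target.

2. Procedure

- Load the data

- Linear transformation

- Standardization

- Comparative analysis (raw data vs. dimension-reduced data)

1. Generating random variables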

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

import numpy as np

''' Simulation '''
# Generate random numbers
rnd = np.random.RandomState(5)
X_ = rnd.normal(size = (300, 2))

plt.scatter(X_[:, 0], X_[:, 1], c = X_[:, 0], 
            linewidths = 0, s = 60, cmap = 'viridis')

print(np.var(X_, axis = 0))

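2. Linear transformation

Principal component analysis linearly transforms the data into a new coordinate system chosen so that the axis along which the projected data has the largest variance becomes the first principal component, the axis with the second-largest variance (orthogonal to the first) becomes the second principal component, and so on.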
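The definition above can be checked numerically. The following is a minimal sketch, not the post's own code: it assumes synthetic 2-D data (the seed and mixing matrix are arbitrary) and finds the principal axes as eigenvectors of the sample covariance matrix, then confirms that the variance along each new axis equals the corresponding eigenvalue.

```python
import numpy as np

# Synthetic correlated 2-D data (seed and mixing matrix are arbitrary choices).
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])

# Center the data and compute the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix are the principal axes.
# eigh returns eigenvalues in ascending order, so sort them in descending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the new axes: column 0 is the first principal component.
scores = Xc @ eigvecs
score_var = scores.var(axis=0, ddof=1)

print("eigenvalues:       ", eigvals)
print("variance of scores:", score_var)
```

The two printed vectors should agree, with the first entry being the largest: that is the sense in which the first principal component is the direction of maximum variance.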

''' Linear transformation '''

# Y = XB + a
# X: 300x2 matrix
# B: 2x2 slope (coefficient) matrix
# a: intercept vector
B = rnd.normal(size = (2, 2))
a = rnd.normal(size = 2)
X_blob = np.dot(X_, B) + a

print(np.var(X_blob, axis = 0))

plt.scatter(X_blob[:, 0], X_blob[:, 1], c = X_blob[:, 0], 
            linewidths = 0, s = 60, cmap = 'viridis')
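Why the variances change after the transform can be seen from the identity Cov(XB + a) = Bᵀ Cov(X) B: the matrix B reshapes the covariance, while the shift a only moves the mean. The sketch below checks this on data generated the same way as in the post; the variable names are chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(5)
X = rng.normal(size=(300, 2))   # roughly uncorrelated, unit-variance data
B = rng.normal(size=(2, 2))     # 2x2 coefficient matrix
a = rng.normal(size=2)          # intercept vector
Y = X @ B + a

cov_X = np.cov(X, rowvar=False)
cov_Y = np.cov(Y, rowvar=False)

# Covariance predicted by the identity; the shift a does not appear in it.
cov_Y_from_identity = B.T @ cov_X @ B

print(cov_Y)
print(cov_Y_from_identity)
```

The off-diagonal entries of `cov_Y` are generally non-zero, which is why the transformed scatter plot looks like a tilted ellipse.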

3. PCA

''' Principal component analysis '''

pca = PCA()
pca.fit(X_blob)
X_pca = pca.transform(X_blob)

print(np.var(X_pca, axis = 0))

print(pca.mean_)


''' Visualize the PCA result '''

S = X_pca.std(axis = 0)

fig, axes = plt.subplots(1, 2, figsize = (10, 10))
axes = axes.ravel()

axes[0].set_title("Original data")
axes[0].scatter(X_blob[:, 0], X_blob[:, 1], c = X_pca[:, 0], 
                linewidths = 0, s = 60, cmap = 'viridis')
axes[0].set_xlabel("feature 1")
axes[0].set_ylabel("feature 2")
axes[0].arrow(pca.mean_[0], pca.mean_[1], S[0] * pca.components_[0, 0],
              S[0] * pca.components_[0, 1], width = .1, head_width = .3,
              color = 'k')
axes[0].arrow(pca.mean_[0], pca.mean_[1], S[1] * pca.components_[1, 0],
              S[1] * pca.components_[1, 1], width = .1, head_width = .3,
              color = 'k')

axes[0].text(-1.5, -.5, "Component 2", size = 14)
axes[0].text(-4, -4, "Component 1", size = 14)
axes[0].set_aspect('equal')

# Right panel: the same data in principal component coordinates
axes[1].set_title("Transformed data")
axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c = X_pca[:, 0], 
                linewidths = 0, s = 60, cmap = 'viridis')
axes[1].set_xlabel("First principal component")
axes[1].set_ylabel("Second principal component")
axes[1].set_aspect('equal')
axes[1].set_ylim(-8, 8)
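The variances printed with `np.var(X_pca, axis = 0)` can be related to what the fitted `PCA` object stores. As a sketch under the assumption that the data is generated as in the post: `explained_variance_` uses the n - 1 denominator, so it matches `np.var(..., ddof=1)`, and the total variance is unchanged by the rotation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(5)
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 2)) + rng.normal(size=2)

pca = PCA().fit(X)
X_pca = pca.transform(X)

var_ddof0 = np.var(X_pca, axis=0)           # denominator n
var_ddof1 = np.var(X_pca, axis=0, ddof=1)   # denominator n - 1, as PCA stores it

print("np.var (ddof=0):         ", var_ddof0)
print("explained_variance_:     ", pca.explained_variance_)
print("explained_variance_ratio_:", pca.explained_variance_ratio_)

# PCA only rotates (and centers) the data, so the total variance is preserved.
total_before = np.var(X, axis=0, ddof=1).sum()
total_after = pca.explained_variance_.sum()
```

`explained_variance_ratio_` is often the more convenient number when deciding how many components to keep.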

4. Visualizing the first principal component

''' Visualize the first principal component '''

fig, axes = plt.subplots(1, 2, figsize = (10, 10))

pca = PCA(n_components=1)
pca.fit(X_blob)
X_inverse = pca.inverse_transform(pca.transform(X_blob))

axes[0].set_title("Transformed data w/ second component dropped")
axes[0].scatter(X_pca[:, 0], np.zeros(X_pca.shape[0]), c = X_pca[:, 0],
                linewidths = 0, s = 60, cmap = 'viridis')
axes[0].set_xlabel("First principal component")
axes[0].set_aspect('equal')
axes[0].set_ylim(-8, 8)

axes[1].set_title("Back-rotation using only first component")
axes[1].scatter(X_inverse[:, 0], X_inverse[:, 1], c = X_pca[:, 0],
                linewidths = 0, s = 60, cmap = 'viridis')
axes[1].set_xlabel("feature 1")
axes[1].set_ylabel("feature 2")
axes[1].set_aspect('equal')
axes[1].set_xlim(-8, 4)
axes[1].set_ylim(-8, 4)
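How much information is lost by dropping the second component can be measured directly. This is a sketch, assuming data generated as in the post: the mean squared distance between each point and its back-rotated version should equal the variance of the dropped component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(5)
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 2)) + rng.normal(size=2)

# Full PCA, to know how much variance the second component carries.
full = PCA().fit(X)
dropped_var = np.var(full.transform(X)[:, 1])   # ddof=0

# Keep one component, then map back to the original 2-D space.
pca1 = PCA(n_components=1).fit(X)
X_back = pca1.inverse_transform(pca1.transform(X))

# Mean squared distance between each point and its reconstruction.
mse = np.mean(np.sum((X - X_back) ** 2, axis=1))

print("variance of dropped component:", dropped_var)
print("mean squared reconstruction error:", mse)
```

The reconstructed points all lie on a single line through the data mean, which is why the back-rotated scatter plot collapses onto the first principal axis.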

Examples

Reference: 파이썬 라이브러리를 활용한 머신러닝 (the Korean edition of Introduction to Machine Learning with Python)

http://www.yes24.com/Product/Goods/70969329?scode=032&OzSrank=1


1. cancer data example - Visualize

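- malignant: a cancerous tumor (target 0)

- benign: a non-cancerous tumor (target 1)

- The goal is to find variables that can separate the two target classes, so logistic regression is used as the classifier.

- Standardization: the variables are on different scales, so standardize them to put every variable on the same scale.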
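Why standardization matters for PCA in particular can be illustrated with a short comparison. The sketch below is an illustration, not part of the original analysis: it fits a two-component PCA on the raw and on the standardized cancer data and compares how much variance the first component claims in each case.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_raw = cancer.data
X_scaled = StandardScaler().fit_transform(X_raw)

# Per-feature standard deviations differ by orders of magnitude in the raw data.
raw_std = X_raw.std(axis=0)
print("smallest / largest raw std:", raw_std.min(), raw_std.max())

ratio_raw = PCA(n_components=2).fit(X_raw).explained_variance_ratio_
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("explained variance ratio, raw:   ", ratio_raw)
print("explained variance ratio, scaled:", ratio_scaled)
```

On the raw data the first component is driven mostly by the few features with the largest numeric range, so its large share of variance says more about units than about structure.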

##############################################################################
## PCA application example: cancer data
##############################################################################

''' scikit-learn data '''
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

## Explore the data ##

print(cancer)

print(cancer.data)

print(cancer.target)

print(cancer.data.shape)



malignant = cancer.data[cancer.target == 0] # malignant tumors
benign = cancer.data[cancer.target == 1] # benign tumors

## Inspect feature characteristics: visualization ##

from matplotlib.colors import ListedColormap

cm3 = ListedColormap(['#0000aa', '#ff2020', '#50ff50'])

fig, axes = plt.subplots(15, 2, figsize = (10, 20))

ax = axes.ravel()

for i in range(30):
    _, bins = np.histogram(cancer.data[:, i], bins = 50)
    ax[i].hist(malignant[:, i], bins = bins, color = cm3(0), alpha = .5)
    ax[i].hist(benign[:, i], bins = bins, color = cm3(2), alpha = .5)
    ax[i].set_title(cancer.feature_names[i])
    ax[i].set_yticks(())
ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["malignant", "benign"], loc="best")
fig.tight_layout()


## Standardization ##
"""
Each feature is on a different scale.
So rescale every feature to mean 0 and standard deviation 1 so that the scales match.
"""
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)


## Visualize the standardized data ##

malignant = X_scaled[cancer.target == 0] # malignant tumors
benign = X_scaled[cancer.target == 1] # benign tumors

fig, axes = plt.subplots(15, 2, figsize=(10, 20))

ax = axes.ravel()

for i in range(30):
    _, bins = np.histogram(X_scaled[:, i], bins=50)
    ax[i].hist(malignant[:, i], bins = bins, color = cm3(0), alpha = .5)
    ax[i].hist(benign[:, i], bins = bins, color = cm3(2), alpha = .5)
    ax[i].set_title(cancer.feature_names[i])
    ax[i].set_yticks(())
ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["malignant", "benign"], loc="best")
fig.tight_layout()

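In the histograms above, the two target classes are drawn in different colors for each feature. One feature in which the classes are separated relatively well is worst concave points.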

## PCA: 2 principal components (2 dimensions)

pca = PCA(n_components = 2)

pca.fit(X_scaled)

X_pca = pca.transform(X_scaled)

print("Original shape: {}".format(str(X_scaled.shape)))
print("Reduced shape: {}".format(str(X_pca.shape)))
"""
Principal component 1: weighted average
Principal component 2: linear contrast
"""

def discrete_scatter(x1, x2, y = None, markers = None, s = 10, ax = None,
                     labels = None, padding = .2, alpha = 1, c = None, markeredgewidth = None):
    import matplotlib as mpl
    from matplotlib.colors import colorConverter
    
    if ax is None:
        ax = plt.gca()

    if y is None:
        y = np.zeros(len(x1))

    unique_y = np.unique(y)

    if markers is None:
        markers = ['o', '^', 'v', 'D', 's', '*', 'p', 'h', 'H', '8', '<', '>'] * 10

    if len(markers) == 1:
        markers = markers * len(unique_y)

    if labels is None:
        labels = unique_y

    # lines in the matplotlib sense, not actual lines
    lines = []

    current_cycler = mpl.rcParams['axes.prop_cycle']

    for i, (yy, cycle) in enumerate(zip(unique_y, current_cycler())):
        mask = y == yy
        # if c is none, use color cycle
        if c is None:
            color = cycle['color']
        elif len(c) > 1:
            color = c[i]
        else:
            color = c
        # use light edge for dark markers
        if np.mean(colorConverter.to_rgb(color)) < .4:
            markeredgecolor = "grey"
        else:
            markeredgecolor = "black"

        lines.append(ax.plot(x1[mask], x2[mask], markers[i], markersize=s,
                             label=labels[i], alpha=alpha, c=color,
                             markeredgewidth=markeredgewidth,
                             markeredgecolor=markeredgecolor)[0])

    if padding != 0:
        pad1 = x1.std() * padding
        pad2 = x2.std() * padding
        xlim = ax.get_xlim()
        ylim = ax.get_ylim()
        ax.set_xlim(min(x1.min() - pad1, xlim[0]), max(x1.max() + pad1, xlim[1]))
        ax.set_ylim(min(x2.min() - pad2, ylim[0]), max(x2.max() + pad2, ylim[1]))

    return lines

When the data is plotted against principal components 1 and 2, the malignant and benign groups look as if they could be separated by a straight line.
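A plot of this kind could be produced along the following lines. This is a sketch that uses plain `ax.scatter` instead of the `discrete_scatter` helper, so that it runs on its own; the markers, transparency, and figure size are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

fig, ax = plt.subplots(figsize=(8, 8))
# One scatter call per class so that each class gets its own legend entry.
for label, marker in zip([0, 1], ["o", "^"]):
    mask = cancer.target == label
    ax.scatter(X_pca[mask, 0], X_pca[mask, 1], marker=marker,
               alpha=0.7, label=cancer.target_names[label])
ax.set_xlabel("First principal component")
ax.set_ylabel("Second principal component")
ax.legend(loc="best")
ax.set_aspect("equal")
```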

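How to read the heatmap:

- All loadings of principal component 1 are positive, so it acts as a weighted average of the features.

- Principal component 2 has both negative and positive loadings, so it is a linear contrast.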
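The sign pattern described in these two bullets can be read off `pca.components_` directly. The sketch below prints the loadings for the standardized cancer data and checks the two patterns; note that the overall sign of a component is arbitrary, so the check for component 1 is "all the same sign" rather than strictly "all positive".

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)
pca = PCA(n_components=2).fit(X_scaled)

# Each row of components_ holds 30 loadings, one per original feature.
pc1, pc2 = pca.components_

# Weighted average: every loading has the same sign.
pc1_same_sign = bool(np.all(pc1 > 0) or np.all(pc1 < 0))
# Linear contrast: some loadings are positive, others negative.
pc2_mixed_sign = bool(np.any(pc2 > 0) and np.any(pc2 < 0))

for name, w1, w2 in zip(cancer.feature_names, pc1, pc2):
    print(f"{name:25s} {w1:+.3f} {w2:+.3f}")
print("PC1 same sign:", pc1_same_sign, "| PC2 mixed sign:", pc2_mixed_sign)
```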

2. logistic regression

## Classification with a logistic regression model

from sklearn.linear_model import LogisticRegression

## Raw data
logistic = LogisticRegression(random_state = 0).fit(cancer.data, cancer.target)
logistic.predict(cancer.data)

print(logistic.predict_proba(cancer.data))

print(logistic.score(cancer.data, cancer.target))


## 2 principal components
logistic = LogisticRegression(random_state = 0).fit(X_pca, cancer.target)
logistic.predict(X_pca)

print(logistic.predict_proba(X_pca))

print(logistic.score(X_pca, cancer.target))

print(logistic.coef_)
print(logistic.intercept_)

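0.94 => accuracy when all 30 features are used

0.95 => accuracy of logistic regression on the 2 principal components obtained by PCA dimensionality reduction (an increase). Note that the raw-data model was fit on unstandardized features, so part of the difference may come from scaling rather than from PCA itself.

Weights for principal components 1 and 2 (logistic.coef_)

Intercept (logistic.intercept_)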
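The two weights on the principal components can be translated back into one weight per standardized feature, which is sometimes easier to interpret. The following sketch assumes the same preprocessing as above; `w_features` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)
pca = PCA(n_components=2).fit(X_scaled)
X_pca = pca.transform(X_scaled)

logistic = LogisticRegression(random_state=0).fit(X_pca, cancer.target)

# coef_ has shape (1, 2): one weight per principal component.
# Multiplying by components_ (2, 30) gives one weight per standardized feature.
w_features = logistic.coef_ @ pca.components_

# The same decision values, computed directly from the standardized features.
z_pca = logistic.decision_function(X_pca)
z_features = (X_scaled - pca.mean_) @ w_features.ravel() + logistic.intercept_[0]

print(w_features.shape)
print(np.abs(z_pca - z_features).max())
```

Because the model only sees two components, the 30 feature weights are constrained to lie in the plane spanned by the two principal axes.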

1. iris data-set example

##############################################################################
## iris data
##############################################################################

from sklearn.datasets import load_iris
iris = load_iris()

setosa = iris.data[iris.target == 0]
versicolor = iris.data[iris.target == 1]
virginica = iris.data[iris.target == 2]

## Feature characteristics: distribution by species
from matplotlib.colors import ListedColormap

cm3 = ListedColormap(['#0000aa', '#ff2020', '#50ff50'])

fig, axes = plt.subplots(2, 2, figsize = (10, 10))

ax = axes.ravel()

for i in range(4):
    _, bins = np.histogram(iris.data[:, i], bins = 10)
    ax[i].hist(setosa[:, i], bins = bins, color = cm3(0), alpha = .5)
    ax[i].hist(versicolor[:, i], bins = bins, color = cm3(1), alpha = .5)
    ax[i].hist(virginica[:, i], bins = bins, color = cm3(2), alpha = .5)
    ax[i].set_title(iris.feature_names[i])
    ax[i].set_yticks(())
ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["setosa", "versicolor", "virginica"], loc="best")
fig.tight_layout()

Scale of the iris dataset before standardization

2. Standardization

## Standardization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(iris.data)
iris_scaled = scaler.transform(iris.data)

## Visualize the standardized data
setosa = iris_scaled[iris.target == 0]
versicolor = iris_scaled[iris.target == 1]
virginica = iris_scaled[iris.target == 2]

fig, axes = plt.subplots(2, 2, figsize = (10, 10))

ax = axes.ravel()

for i in range(4):
    _, bins = np.histogram(iris_scaled[:, i], bins = 10)
    ax[i].hist(setosa[:, i], bins = bins, color = cm3(0), alpha = .5)
    ax[i].hist(versicolor[:, i], bins = bins, color = cm3(1), alpha = .5)
    ax[i].hist(virginica[:, i], bins = bins, color = cm3(2), alpha = .5)
    ax[i].set_title(iris.feature_names[i])
    ax[i].set_yticks(())
ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["setosa", "versicolor", "virginica"], loc="best")
fig.tight_layout()

## Principal component analysis
pca = PCA(n_components = 2)

pca.fit(iris_scaled)

iris_pca = pca.transform(iris_scaled)

print("Original shape: {}".format(str(iris_scaled.shape)))
print("Reduced shape: {}".format(str(iris_pca.shape)))


plt.figure(figsize = (8, 8))
discrete_scatter(iris_pca[:, 0], iris_pca[:, 1], iris.target)
plt.legend(iris.target_names, loc = "best")
plt.gca().set_aspect("equal")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")


print("PCA component shape: {}".format(pca.components_.shape))
print("PCA components:\n{}".format(pca.components_))

plt.matshow(pca.components_, cmap='viridis')
plt.yticks([0, 1], ["First component", "Second component"])
plt.colorbar()
plt.xticks(range(len(iris.feature_names)),
           iris.feature_names, rotation = 60, ha = 'left')
plt.xlabel("Feature")
plt.ylabel("Principal components")

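The histograms confirm that the features are now on the same scale.

Reducing the data to two principal components and drawing a scatter plot gives the pattern shown above.

Setosa looks easy to separate from the other two species.

The heatmap is shown above. Interpreting the heatmap needs further study.

- It is a linear contrast between sepal width and the remaining variables.

<Reason>

- In principal component 1, the variables other than sepal width have roughly uniform loadings. (Sepal width matters little in component 1; linear contrast.)

- Sepal width has a large influence on principal component 2.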
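The reading of the heatmap above can be cross-checked against the numbers in `pca.components_`. This is a sketch under the same preprocessing (standardize, then two-component PCA); the checks use absolute values and relative signs, because the overall sign of a component is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
iris_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(iris_scaled)
pc1, pc2 = pca.components_

for name, w1, w2 in zip(iris.feature_names, pc1, pc2):
    print(f"{name:20s} {w1:+.3f} {w2:+.3f}")

# Feature with the largest absolute loading on the second component.
top_pc2 = iris.feature_names[int(np.argmax(np.abs(pc2)))]

# In component 1, does sepal width point the opposite way from the other three?
sepal_width = list(iris.feature_names).index("sepal width (cm)")
others = [i for i in range(4) if i != sepal_width]
opposite_sign = bool(np.all(np.sign(pc1[others]) == -np.sign(pc1[sepal_width])))

print("largest |loading| on PC2:", top_pc2)
print("sepal width opposes the others on PC1:", opposite_sign)
```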

3. logistic regression

## Classification with a logistic regression model

from sklearn.linear_model import LogisticRegression

## Raw data
logistic = LogisticRegression(random_state = 0).fit(iris.data, iris.target)
logistic.predict(iris.data)

print(logistic.predict_proba(iris.data))

print(logistic.score(iris.data, iris.target))


## 2 principal components
logistic = LogisticRegression(random_state = 0).fit(iris_pca, iris.target)
logistic.predict(iris_pca)

print(logistic.predict_proba(iris_pca))

print(logistic.score(iris_pca, iris.target))

print(logistic.coef_)
print(logistic.intercept_)

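- Accuracy of logistic regression on the raw data

- Accuracy of logistic regression on the two principal components

However, the models were fit and scored on the same full dataset, so the reported accuracies may be inflated by overfitting. The data should therefore be split into a training set and a test set.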
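One way to carry out such a split is sketched below. The 70/30 split, the stratification, and the use of a pipeline are choices made here for illustration, not something the post prescribes; the pipeline ensures that the scaler and the PCA are fit on the training set only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Hold out 30% of the samples, keeping the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0)

# Standardize -> reduce to 2 components -> classify, all fit on training data only.
model = make_pipeline(StandardScaler(), PCA(n_components=2),
                      LogisticRegression(random_state=0))
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

The test accuracy is the number to report; a large gap between the two scores would point to overfitting.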
