without haste but without rest

10. OLS, SGD

JinungKim 2020. 5. 26. 12:36

0. Overview

(1) OLS (Ordinary Least Squares)

- Regression analysis based on the commonly used least squares method.

(2) SGD (Stochastic Gradient Descent)

- Stochastic gradient descent.

- Used to find the point that minimizes the cost function.

- Training can be sped up by using mini-batches.
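
To make the contrast between the two approaches concrete, here is a minimal sketch on synthetic data. The data, the learning rate, and the batch size are assumptions chosen for illustration and are not taken from the housing example: OLS solves the least-squares problem in one step, while mini-batch SGD approaches the same weights through many small gradient steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 2*x1 - 3*x2 + 1 + noise
n = 1000
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=n)

# Add a column of ones so the intercept is learned as an extra weight
Xb = np.hstack([X, np.ones((n, 1))])

# OLS: solve the least-squares problem directly
w_ols, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Mini-batch SGD: repeatedly step against the gradient of the mean squared error
w_sgd = np.zeros(3)
learning_rate = 0.05  # hypothetical value chosen for this toy problem
batch_size = 32       # hypothetical value
for epoch in range(50):
    order = rng.permutation(n)  # visit the samples in a new random order each epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        residual = Xb[idx] @ w_sgd - y[idx]
        grad = 2.0 * Xb[idx].T @ residual / len(idx)
        w_sgd -= learning_rate * grad

print("OLS:", w_ols)
print("SGD:", w_sgd)
```

Both weight vectors should land close to the coefficients used to generate the data; the SGD result keeps a small amount of jitter because each step sees only one batch.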

1. Loading the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## California house price

url = "https://raw.githubusercontent.com/johnwheeler/handson-ml/master/datasets/housing/housing.csv"
housing = pd.read_csv(url)

"""
Column descriptions (one row per block)

1. longitude: A measure of how far west a house is; values are negative here, so a more negative value is farther west
2. latitude: A measure of how far north a house is; a higher value is farther north
3. housing_median_age: Median age of a house within a block; a lower number is a newer building
4. total_rooms: Total number of rooms within a block
5. total_bedrooms: Total number of bedrooms within a block
6. population: Total number of people residing within a block
7. households: Total number of households, a group of people residing within a home unit, for a block
8. median_income: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
9. median_house_value: Median house value for households within a block (measured in US Dollars)
10. ocean_proximity: Location of the house w.r.t ocean/sea
"""

2. Checking and imputing missing values

1-1 Checking for missing values

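## check NA
# summary statistics; a lower count than the other columns hints at missing values
print(housing.describe().transpose())

# True if any value in the frame is missing
print(housing.isnull().values.any())

# total number of missing values, then the number per column
print(housing.isnull().values.sum())
print(housing.isnull().sum())

# column dtypes and non-null counts (info() prints by itself)
housing.info()

# rows where total_bedrooms is missing, then one of them for a closer look
print(housing[housing['total_bedrooms'].isnull()])
print(housing.loc[290])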

We can see that the total_bedrooms column contains missing values.

1-2 Imputation

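## imputation
# fill missing total_bedrooms with the column mean; assigning the filled column back
# avoids chained indexing, which may modify a temporary copy instead of the frame
housing['total_bedrooms'] = housing['total_bedrooms'].fillna(housing['total_bedrooms'].mean())
print(housing.loc[290])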

The missing values were imputed with the mean of total_bedrooms.
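
As a small check of what mean imputation does, the sketch below uses a toy frame. The values are hypothetical and not taken from the housing data. Since the mean skips missing entries, the gap is filled with the average of the observed values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry
df = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0]})

# Series.mean() skips NaN by default, so the mean is (100 + 300) / 2
mean_value = df["total_bedrooms"].mean()

# Assign the filled column back instead of using chained indexing
df["total_bedrooms"] = df["total_bedrooms"].fillna(mean_value)

print(df)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when many values are missing.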

1-3 Checking the data with histograms

## check features
sns.set(style = "white", palette = "muted", color_codes = True)

# Plot simple histograms with bin size determined automatically
# (histplot replaces the deprecated distplot(..., kde = False))
sns.histplot(housing['median_house_value'], color = "b")
plt.show()

sns.histplot(housing['total_bedrooms'], color = "b")
plt.show()

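median_house_value shows an unusual distribution at values of 500,000 and above.

total_bedrooms has a skewed shape.

Of course, normality cannot be judged from a histogram alone, but both plots above look far from a normal distribution.

A Q-Q plot is used to check normality.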

https://jinyes-tistory.tistory.com/145?category=855831
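
As a rough sketch of how a Q-Q check can be run in code, the example below uses scipy.stats.probplot on synthetic samples. The samples are assumptions for illustration, not the housing columns. The returned r value measures how closely the sample quantiles follow a straight line against normal quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples (assumptions for illustration)
normal_sample = rng.normal(size=2000)
skewed_sample = rng.exponential(size=2000)  # right-skewed

# probplot returns the Q-Q points and a least-squares line fitted through them;
# r is the correlation between the ordered sample and the theoretical normal quantiles
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print("r (normal sample):", r_normal)
print("r (skewed sample):", r_skewed)
```

Passing plot=plt (with matplotlib.pyplot imported as plt) to probplot draws the Q-Q plot itself; points that bend away from the line indicate a departure from normality.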

1-4 Data transformation

## transformation
housing['avg_rooms'] = housing['total_rooms'] / housing['households']
housing['avg_bedrooms'] = housing['total_bedrooms'] / housing['households']
housing['pop_household'] = housing['population'] / housing['households']
print(housing.head())

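Because the distributions differ widely, the block-level totals are transformed using the households column, which gives the number of households in each block. Dividing by it turns the totals into per-household averages.

(I did not understand why or how the distributions differ.)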

1-5 Checking correlation coefficients

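## check relationship
# numeric_only = True skips the string column ocean_proximity,
# which recent pandas versions no longer drop automatically
print(housing.corr(numeric_only = True))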

1-6 joint plot

# recent seaborn versions require x and y to be passed as keyword arguments
sns.jointplot(x = 'avg_bedrooms', y = 'median_house_value', data = housing)

1-7 Converting categorical data - dummy variables

## one-hot
# create one indicator column per category, initialized to 0
housing['NEAR BAY'] = 0
housing['INLAND'] = 0
housing['<1H OCEAN'] = 0
housing['ISLAND'] = 0
housing['NEAR OCEAN'] = 0
print(housing.head())


# set the indicator to 1 in the rows that belong to each category
housing.loc[housing['ocean_proximity']=='NEAR BAY','NEAR BAY'] = 1
housing.loc[housing['ocean_proximity']=='INLAND','INLAND'] = 1
housing.loc[housing['ocean_proximity']=='<1H OCEAN','<1H OCEAN'] = 1
housing.loc[housing['ocean_proximity']=='ISLAND','ISLAND'] = 1
housing.loc[housing['ocean_proximity']=='NEAR OCEAN','NEAR OCEAN'] = 1
print(housing.head())
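
The same kind of indicator columns can also be produced with pd.get_dummies. The sketch below uses a toy frame with hypothetical rows rather than the full housing data, so it is an illustration of the alternative and not the code used in this post.

```python
import pandas as pd

# Toy frame (hypothetical rows) using category labels from ocean_proximity
df = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"]})

# One 0/1 column per category present in the data, named after the category
dummies = pd.get_dummies(df["ocean_proximity"], dtype=int)
df = pd.concat([df, dummies], axis=1)

print(df)
```

Each row has exactly one indicator set to 1, so the full set of dummy columns always sums to 1. In a linear model with an intercept this makes the columns perfectly collinear; passing drop_first=True to get_dummies drops one category to avoid that.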

2 OLS

1-1 fitting

## OLS
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# drop the columns that were transformed, the raw category column, and the target
train_x = housing.drop(['total_rooms','total_bedrooms','households',
                        'ocean_proximity','median_house_value'], axis = 1)

train_y = housing['median_house_value']

X, test_x, Y, test_y = train_test_split(train_x, train_y, test_size = 0.2)

print(X.describe().transpose())

mlr = LinearRegression()
mlr.fit(np.array(X), Y)

print(mlr.coef_)
print(mlr.score(np.array(X), Y))
print(mlr.score(np.array(test_x), test_y)) # score() returns the coefficient of determination (R^2)
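
To show what the coefficient of determination measures, here is a minimal sketch that computes it by hand and compares it with score(). The data are synthetic and are an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model.score(X, y))
```

A value near 1 means the model explains most of the variance in the target; a value near 0 means it does little better than always predicting the mean.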

1-2 Checking predicted values

import math

def roundup(x):

    return int(math.ceil(x / 100.0)) * 100 

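# the model was fit on a NumPy array, so predict on one as well
pred = list(map(roundup, mlr.predict(np.array(test_x))))

# compare the first ten rounded predictions with the actual values
print(pred[:10])
print(test_y[:10])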

3 SGD

1-1 fitting

## SGD
from sklearn.linear_model import SGDRegressor

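"""
Option notes

eta0 - the learning rate (step size); gradients computed on unscaled data are large, so a small value is used to compensate
max_iter - maximum number of passes over the training data (epochs)
tol - stopping tolerance; training stops once the loss has failed to improve by at least 1E-3 for n_iter_no_change consecutive epochs
"""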

sgd = SGDRegressor(learning_rate = 'constant', eta0 = 1E-8, max_iter = 20000, tol = 1E-3)
sgd.fit(np.array(X), Y)

print(sgd.coef_)
print(sgd.score(np.array(X), Y))

The score is quite low. Let's standardize the data and try again.

1-2 Applying SGD after standardization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit the scaler on the training data only, then apply the same
# mean and standard deviation to the test data
X = scaler.fit_transform(X)
test_x = scaler.transform(test_x)

sgd = SGDRegressor(learning_rate = 'constant', eta0 = 1E-8, max_iter = 20000, tol = 1E-3)
sgd.fit(np.array(X), Y)

print(sgd.coef_)
print(sgd.score(np.array(X), Y))
print(sgd.score(np.array(test_x), test_y))

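After standardization, the SGD score rose to a level similar to that of OLS. At this point sklearn prints a message recommending that the number of training iterations be increased.

Comparing the scores on the test data and the training data, the model does not appear to be overfitting.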
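
The standardize-then-fit steps can also be bundled into a single pipeline. The sketch below uses synthetic features on very different scales and default SGDRegressor settings with a fixed seed. Both are assumptions for illustration and are not the settings used above.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (an assumption for illustration)
n = 2000
X = np.column_stack([
    rng.normal(loc=1000.0, scale=300.0, size=n),
    rng.normal(loc=5.0, scale=2.0, size=n),
])
y = 0.01 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training split only and reuses
# those statistics when transforming the test split
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(train_score, test_score)
```

Keeping the scaler inside the pipeline means the test split never influences the scaling statistics, and similar train and test scores suggest the model is not overfitting.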
