without haste but without rest
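05. One-Hot Encoding
0. Overview
1. One-hot encoding
- Converts categorical data into numeric values.
- Each category becomes its own attribute (column): 1 if the sample belongs to that category, 0 otherwise.
2. Label encoding
- Converts categorical data into numeric values.
- Assigns a unique integer to each category.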
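The two encodings summarized above can be compared side by side with a small sketch in plain Python. The sample values below are made up for illustration and are not taken from long_jump.csv; the alphabetical category order is simply one reproducible choice.

```python
# Hypothetical sample values (assumption, not from the original data set).
sizes = ["small", "large", "medium", "small"]

# Fix the category order once so both encodings are reproducible.
categories = sorted(set(sizes))  # ['large', 'medium', 'small']

# One-hot encoding: one 0/1 slot per category, with a single 1 per sample.
one_hot = [[int(value == cat) for cat in categories] for value in sizes]

# Label encoding: each value is replaced by the position of its category.
code_of = {cat: code for code, cat in enumerate(categories)}
labels = [code_of[value] for value in sizes]

print(categories)  # ['large', 'medium', 'small']
print(one_hot)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(labels)      # [2, 0, 1, 2]
```

One-hot encoding adds one column per category, while label encoding keeps a single column whose integers carry no meaningful order.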
1. One-Hot Encoding
- The process of converting categorical data into numeric values.
(1) Custom coding
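###################################################################################
## One-hot Encoding
###################################################################################
import pandas as pd
# Load the data and index it by person, as the later sections do
jump = pd.read_csv("./long_jump.csv")
jump.set_index('Person', inplace = True)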
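# Count the rows in each 'Jersey Size' category
print(jump['Jersey Size'].value_counts())
print(jump['Jersey Size'].value_counts().index)
# The index of value_counts() holds the distinct categories
js_cat = jump['Jersey Size'].value_counts().index
print(len(js_cat))
print(js_cat)
print(js_cat.values)
print(js_cat.sort_values())
print(js_cat.sort_values(ascending = False))
# With the sizes small, medium and large, descending alphabetical order
# gives ['small', 'medium', 'large']
js_cat = js_cat.sort_values(ascending = False)
print(js_cat)
# Use .iloc for positional access; the index may hold labels rather than 0, 1, ...
print(jump['Jersey Size'].iloc[0])
print(jump['Jersey Size'].iloc[1])
# Target encoding:
# 'small' ==> [1, 0, 0]
# 'medium' ==> [0, 1, 0]
# 'large' ==> [0, 0, 1]
# Comparing the categories with a single value yields a boolean mask
print(js_cat == 'small')
print(js_cat == 'medium')
print(js_cat == 'large')
# Multiplying by 1 turns the boolean mask into 0/1 integers
print((js_cat == 'small') * 1)
print((js_cat == jump['Jersey Size'].iloc[0]) * 1)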
# Print the boolean mask for every row
for js in jump['Jersey Size']:
    print(js_cat == js)
# Print the same mask as 0/1 integers
for js in jump['Jersey Size']:
    print((js_cat == js) * 1)
# Collect one 0/1 row per sample
js_raw = list()
for js in jump['Jersey Size']:
    js_raw.append((js_cat == js) * 1)
print(js_raw)
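# Build a DataFrame from the collected rows and name the columns after the categories
js_one_hot = pd.DataFrame(data = js_raw)
print(js_one_hot)
js_one_hot.rename(columns = {0 : 'Jersey Size_small', 1 : 'Jersey Size_medium', 2 : 'Jersey Size_large'}, inplace = True)
print(js_one_hot)
# Reuse the original row index so that concat lines the rows up correctly
print(jump.index)
js_one_hot.index = jump.index
print(js_one_hot)
# Append the encoded columns, then drop the original categorical column
jump_js_one_hot = pd.concat([jump, js_one_hot], axis = 1)
print(jump_js_one_hot.transpose())
jump_js_one_hot.drop('Jersey Size', axis = 1, inplace = True)
print(jump_js_one_hot.transpose())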
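As a shorter alternative to the manual steps above, pandas provides get_dummies, which encodes a column and drops the original in one call. This is a sketch using a different technique from the post's loop, not the author's method. The small DataFrame and its Distance column are made up for illustration and only stand in for the real jump data.

```python
import pandas as pd

# Hypothetical stand-in for the `jump` DataFrame (values are made up).
demo = pd.DataFrame(
    {
        "Person": ["A", "B", "C"],
        "Distance": [5.1, 4.8, 5.5],
        "Jersey Size": ["small", "large", "medium"],
    }
).set_index("Person")

# get_dummies replaces the listed column with one 0/1 column per category.
# New columns are named "<column>_<category>", categories in sorted order.
encoded = pd.get_dummies(demo, columns=["Jersey Size"], dtype=int)

print(encoded.transpose())
```

Note that get_dummies orders the new columns alphabetically (large, medium, small), whereas the custom code above uses small, medium, large.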
(2) scikit-learn's OneHotEncoder module
###################################################################################
## Scikit-learn: OneHotEncoder
###################################################################################
import pandas as pd
jump = pd.read_csv("./long_jump.csv")
jump.set_index('Person', inplace = True)
# filter in categorical columns for demonstration
cats = ['Jersey Size']
print(jump[cats])
# The data above is categorical.
# import module and instantiate encoder object
from sklearn.preprocessing import OneHotEncoder
# sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
encoder = OneHotEncoder(sparse_output=False)
# fit and transform in one call and print categories
out_encoder = encoder.fit_transform(jump[cats])
print(out_encoder)
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
new_cols = encoder.get_feature_names_out(cats).tolist()
print(new_cols)
# create temporary dataframe "jump_enc" for concatenation with original data
jump_enc = pd.DataFrame(data = out_encoder, columns = new_cols)
jump_enc.index = jump.index
print(jump_enc.transpose())
# drop original columns and concat new encoded columns
jump.drop(cats, axis = 1, inplace = True)
jump = pd.concat([jump, jump_enc], axis = 1)
print(jump.transpose())
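Two properties of the fitted encoder are worth checking after a run like the one above: the category order it learned, and what it does with a value it never saw. The sketch below uses made-up sizes rather than long_jump.csv. It calls .toarray() on the default sparse result so that it does not depend on the sparse / sparse_output argument name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training values (assumption, not from the original data set).
train = pd.DataFrame({"Jersey Size": ["small", "large", "medium", "small"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
dense = encoder.fit_transform(train).toarray()

# Categories are stored in sorted order: large, medium, small.
print(encoder.categories_[0])
print(dense)

# "x-large" was not present during fit, so every column is 0.
unseen = pd.DataFrame({"Jersey Size": ["x-large"]})
unseen_encoded = encoder.transform(unseen).toarray()
print(unseen_encoded)

# inverse_transform maps the 0/1 rows back to the original labels.
restored = encoder.inverse_transform(dense).ravel()
print(list(restored))
```

Because the learned order is alphabetical, the encoded columns come out as large, medium, small, not in the small, medium, large order used by the custom code.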
(3) Custom encoding
###################################################################################
## custom coding
###################################################################################
import pandas as pd
jump = pd.read_csv("./long_jump.csv")
jump.set_index('Person', inplace = True)
# Sort descending so the categories come out as ['small', 'medium', 'large'],
# matching the column names assigned below
js_cat = jump['Jersey Size'].value_counts().index.sort_values(ascending = False)
# Collect one 0/1 row per sample
js_raw = list()
for js in jump['Jersey Size']:
    js_raw.append((js_cat == js) * 1)
js_one_hot = pd.DataFrame(data = js_raw)
js_one_hot.rename(columns = {0 : 'Jersey Size_small', 1 : 'Jersey Size_medium', 2 : 'Jersey Size_large'}, inplace = True)
js_one_hot.index = jump.index
# Append the encoded columns, then drop the original categorical column
jump_js_one_hot = pd.concat([jump, js_one_hot], axis = 1)
jump_js_one_hot.drop('Jersey Size', axis = 1, inplace = True)
print(jump_js_one_hot.transpose())
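The per-row loop above can also be written as a single NumPy broadcast comparison. This is a sketch of an equivalent formulation, not part of the original post. The sample DataFrame is made up, and the names `categories`, `values`, and `matrix` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'Jersey Size' column (values are made up).
demo = pd.DataFrame(
    {"Jersey Size": ["small", "large", "medium", "small"]},
    index=pd.Index(["A", "B", "C", "D"], name="Person"),
)

# Explicit category order, matching the column names used in the post.
categories = np.array(["small", "medium", "large"], dtype=object)
values = demo["Jersey Size"].to_numpy(dtype=object)

# Shape (n_rows, 1) compared with shape (1, n_categories) broadcasts to an
# (n_rows, n_categories) boolean matrix; astype(int) turns it into 0/1.
matrix = (values[:, None] == categories[None, :]).astype(int)

one_hot = pd.DataFrame(
    matrix,
    index=demo.index,
    columns=[f"Jersey Size_{cat}" for cat in categories],
)
print(one_hot)
```

Listing the categories explicitly avoids relying on sort order to line up with the hard-coded column names.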
(4) Label encoder
###################################################################################
## label encoding
###################################################################################
# import modules and instantiate enc object
from sklearn import preprocessing
enc = preprocessing.LabelEncoder()
# fit with integer labels and transform
out_enc = enc.fit_transform([1, 2, 5, 2, 4, 2, 5])
print(enc.classes_) # [1 2 4 5] (the sorted unique labels)
print(out_enc) # [0 1 3 1 2 1 3] (position of each label in classes_)
# fit with string labels and transform
out_enc = enc.fit_transform(["blue", "red", "blue", "green", "red", "red"])
print(out_enc) # [0 2 0 1 2 2]
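To see which integer was assigned to which label, and to map the codes back, the fitted encoder exposes classes_ and inverse_transform. A minimal sketch reusing the color list from the code above; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["blue", "red", "blue", "green", "red", "red"]

enc = LabelEncoder()
codes = enc.fit_transform(colors)

# classes_ holds the sorted unique labels; a label's code is its position here.
print(enc.classes_)  # ['blue' 'green' 'red']
print(codes)         # [0 2 0 1 2 2]

# inverse_transform recovers the original labels from the integer codes.
restored = enc.inverse_transform(codes)
print(list(restored))
```

The codes follow alphabetical order only, so green = 1 < red = 2 does not express any real ranking. That is one reason one-hot encoding is often preferred for input features.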