without haste but without rest

05. One-Hot Encoding

JinungKim 2020. 5. 1. 21:47

0. Overview

 

1. One-hot encoding

- Converts categorical data into numeric values.

- Each category becomes its own attribute: 1 if the row belongs to that category, 0 otherwise.

 

 

2. Label encoding

- Converts categorical data into numeric values.

- Assigns a unique integer ID to each category.
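
The difference between the two encodings can be sketched in plain Python. This is only an illustration: the `sizes` list is made-up sample data, and sorting the categories alphabetically is an assumption chosen to give them a fixed order.

```python
# Made-up sample values (not taken from long_jump.csv)
sizes = ['small', 'large', 'medium', 'small']

# Fix an order for the categories (assumption: alphabetical)
categories = sorted(set(sizes))  # ['large', 'medium', 'small']

# Label encoding: each category gets one integer ID
label_ids = {cat: i for i, cat in enumerate(categories)}
label_encoded = [label_ids[s] for s in sizes]

# One-hot encoding: each category becomes its own 0/1 attribute
one_hot_encoded = [[1 if s == cat else 0 for cat in categories] for s in sizes]

print(label_encoded)    # [2, 0, 1, 2]
print(one_hot_encoded)  # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Label encoding yields one column of integers, while one-hot encoding yields one column per category with exactly one 1 in each row.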


1. One-Hot Encoding

- Converts categorical data into numeric values.

 


(1) custom coding

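###################################################################################
## One-hot Encoding
###################################################################################

import pandas as pd

# load the data (same file and index column as in the later sections)
jump = pd.read_csv("./long_jump.csv")
jump.set_index('Person', inplace = True)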

print(jump['Jersey Size'].value_counts())

print(jump['Jersey Size'].value_counts().index)

js_cat = jump['Jersey Size'].value_counts().index

print(len(js_cat))
print(js_cat)

print(js_cat.values)


print(js_cat.sort_values())
print(js_cat.sort_values(ascending = False))


js_cat = js_cat.sort_values(ascending = False)
print(js_cat)


# .iloc selects by position regardless of the index labels
print(jump['Jersey Size'].iloc[0])
print(jump['Jersey Size'].iloc[1])

# 'small' ==> [1, 0, 0]
# 'medium' ==> [0, 1, 0]
# 'large' ==> [0, 0, 1]

print(js_cat == 'small')
print(js_cat == 'medium')
print(js_cat == 'large')

print((js_cat == 'small') * 1)

print((js_cat == jump['Jersey Size'].iloc[0]) * 1)


# compare each row's value against the category list
for js in jump['Jersey Size']:
    print(js_cat == js)

# multiplying by 1 converts True/False to 1/0
for js in jump['Jersey Size']:
    print((js_cat == js) * 1)

# collect one 0/1 vector per row
js_raw = list()

for js in jump['Jersey Size']:
    js_raw.append((js_cat == js) * 1)

print(js_raw)

js_one_hot = pd.DataFrame(data = js_raw)

print(js_one_hot)

js_one_hot.rename(columns = {0 : 'Jersey Size_small', 1 : 'Jersey Size_medium', 2 : 'Jersey Size_large'}, inplace = True)
print(js_one_hot)

print(jump.index)
js_one_hot.index = jump.index
print(js_one_hot)

jump_js_one_hot = pd.concat([jump, js_one_hot], axis = 1)
print(jump_js_one_hot.transpose())

jump_js_one_hot.drop('Jersey Size', axis = 1, inplace = True)
print(jump_js_one_hot.transpose())
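
The row-by-row loop above can also be written as a single broadcasted comparison. The following is a minimal sketch rather than the original code: the `sizes` Series is a made-up stand-in for `jump['Jersey Size']`, and the column names are built from `js_cat` instead of being typed by hand.

```python
import pandas as pd

# Made-up stand-in for the 'Jersey Size' column of long_jump.csv
sizes = pd.Series(['small', 'medium', 'large', 'medium', 'small'],
                  index=['A', 'B', 'C', 'D', 'E'], name='Jersey Size')

# Same category order as the loop version: descending alphabetical
js_cat = sizes.value_counts().index.sort_values(ascending=False)

# Compare every row against every category in one operation:
# shape (n_rows, 1) == shape (1, n_categories) -> (n_rows, n_categories)
one_hot = (sizes.to_numpy()[:, None] == js_cat.to_numpy()[None, :]).astype(int)

# Derive the column names from the categories so names and order cannot drift apart
js_one_hot = pd.DataFrame(one_hot, index=sizes.index,
                          columns=['Jersey Size_' + c for c in js_cat])
print(js_one_hot)
```

Because the column names come from `js_cat`, this version stays correct even if the category order changes.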

 


(2) scikit-learn's OneHotEncoder module

###################################################################################
## Scikit-learn: OneHotEncoder
###################################################################################
import pandas as pd

jump = pd.read_csv("./long_jump.csv")
jump.set_index('Person', inplace = True)
# filter in categorical columns for demonstration
cats = ['Jersey Size']
print(jump[cats])
# the column above holds categorical data

# import module and instantiate encoder object
from sklearn.preprocessing import OneHotEncoder
# scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False
encoder = OneHotEncoder(sparse_output=False)

# fit and transform in one call and print categories
out_encoder = encoder.fit_transform(jump[cats])

print(out_encoder)

# scikit-learn >= 1.0 provides get_feature_names_out; get_feature_names was removed in 1.2
new_cols = encoder.get_feature_names_out(cats).tolist()

print(new_cols)


# create temporary dataframe "jump_enc" for concatenation with original data
jump_enc = pd.DataFrame(data = out_encoder, columns = new_cols)
jump_enc.index = jump.index

print(jump_enc.transpose())


# drop original columns and concat new encoded columns
jump.drop(cats, axis = 1, inplace = True)
jump = pd.concat([jump, jump_enc], axis = 1)
print(jump.transpose())
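
One detail worth checking is the column order: OneHotEncoder sorts the categories, so the columns come out as large, medium, small rather than the small, medium, large order used in the custom coding. A minimal sketch on a made-up DataFrame (not long_jump.csv) shows this; it uses the default sparse output plus `.toarray()` so it does not depend on the `sparse` / `sparse_output` parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up sample data standing in for jump[['Jersey Size']]
toy = pd.DataFrame({'Jersey Size': ['small', 'medium', 'large', 'medium']})

encoder = OneHotEncoder()
dense = encoder.fit_transform(toy).toarray()  # sparse matrix -> dense array

# categories_ holds the sorted categories found for each input column
print(encoder.categories_[0])  # ['large' 'medium' 'small']
print(dense)

# inverse_transform maps the 0/1 rows back to the original labels
restored = encoder.inverse_transform(dense).ravel().tolist()
print(restored)  # ['small', 'medium', 'large', 'medium']
```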

 

 

 


(3) custom encoding

###################################################################################
## custom coding
###################################################################################

import pandas as pd

jump = pd.read_csv("./long_jump.csv")
jump.set_index('Person', inplace = True)

# sort descending so the categories come out as ['small', 'medium', 'large'],
# matching the column names assigned below (value_counts() alone orders by frequency)
js_cat = jump['Jersey Size'].value_counts().index.sort_values(ascending = False)

js_raw = list()
for js in jump['Jersey Size']:
    js_raw.append((js_cat == js) * 1)

js_one_hot = pd.DataFrame(data = js_raw)
js_one_hot.rename(columns = {0 : 'Jersey Size_small', 1 : 'Jersey Size_medium', 2 : 'Jersey Size_large'}, inplace = True)
js_one_hot.index = jump.index

jump_js_one_hot = pd.concat([jump, js_one_hot], axis = 1)
jump_js_one_hot.drop('Jersey Size', axis = 1, inplace = True)

print(jump_js_one_hot.transpose())

 

 


(4) label encoder

###################################################################################
## label encoding
###################################################################################

# import modules and instantiate enc object
from sklearn import preprocessing
enc = preprocessing.LabelEncoder()

# fit with integer labels and transform
out_enc = enc.fit_transform([1, 2, 5, 2, 4, 2, 5])
print(out_enc) # [0 1 3 1 2 1 3]; the unique classes are [1, 2, 4, 5]


# fit with string labels and transform
out_enc = enc.fit_transform(["blue", "red", "blue", "green", "red", "red"])
print(out_enc)
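
To see which integer was assigned to which label, the fitted encoder can be inspected. A minimal sketch, reusing the string labels from above: `classes_` lists the unique labels in sorted order, and the position of a label in that list is its code.

```python
from sklearn import preprocessing

enc = preprocessing.LabelEncoder()
out_enc = enc.fit_transform(["blue", "red", "blue", "green", "red", "red"])

# classes_ holds the sorted unique labels; the position is the assigned code
print(enc.classes_)  # ['blue' 'green' 'red']
print(out_enc)       # [0 2 0 1 2 2]

# inverse_transform maps codes back to the original labels
decoded = enc.inverse_transform([0, 2]).tolist()
print(decoded)       # ['blue', 'red']
```

Note that the codes follow alphabetical order, not any meaningful ranking of the labels.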
