Chapter 1 – The Machine Learning Landscape¶

1.0 OUTLINE¶

What is machine learning?
Why use machine learning?
Types of machine learning systems
Main challenges of machine learning
Testing and validation

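1.1 What is machine learning?¶

Definition: the science of programming computers so that they learn from data¶

T. Mitchell (1997): A computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.

Example¶

Spam filter: a machine learning program that learns to tell spam apart from examples of spam and regular emails
T: deciding whether a new email is spam, E: the training data, P: the error rate (lower is better)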
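
The performance measure P from the spam-filter example can be sketched in a few lines. This is only an illustration: the function name `error_rate` and the label lists are hypothetical and do not come from the notebook.

```python
def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

# Hypothetical labels: 1 = spam, 0 = regular mail
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]   # one regular mail wrongly flagged as spam

p = error_rate(y_true, y_pred)
print(p)  # 0.2
```

A program "learns" in Mitchell's sense if this number goes down as it sees more training data.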

Terminology¶

Training set: the examples the system uses to learn
Training instance: each individual training example

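1.2 Why use machine learning?¶

Traditional programming requires adding new rules by hand again and again, whereas a machine learning system learns to recognize the patterns automatically.
Some problems, such as speech recognition, have no workable solution with the traditional approach.
Machine learning can discover patterns in large amounts of data.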

1.3 Types of machine learning systems¶

Supervised vs. unsupervised, semi-supervised, and reinforcement learning¶

Batch vs. online learning¶

Instance-based vs. model-based learning¶

1.3.1 Supervised learning¶

The training data includes labels.
e.g., classification, regression
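
As a minimal sketch of supervised learning, the snippet below fits a classifier on labeled examples. The toy data and the choice of `LogisticRegression` are assumptions made for illustration; the notebook itself does not use this model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, label 1 when the feature is large
X_train = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])   # the labels supplied with the data

clf = LogisticRegression()
clf.fit(X_train, y_train)                # learn from (feature, label) pairs

pred = clf.predict([[0.5], [9.5]])       # classify two unseen instances
print(pred)
```

Replacing the class labels with numeric targets and the classifier with a regressor turns the same pattern into a regression task.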

1.3.2 Unsupervised learning¶

The training data has no labels.
e.g., clustering, visualization, dimensionality reduction, anomaly detection, association rule learning
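
Clustering is one of the unsupervised tasks listed above. The following sketch groups unlabeled points with scikit-learn's `KMeans`; the points and the choice of two clusters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled points forming two well-separated groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster index assigned to each point
print(labels)
```

No labels are passed to `fit_predict`; the algorithm discovers the grouping on its own, and the cluster numbers themselves are arbitrary.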

1.3.3 Semi-supervised learning¶

The system learns from training data that is only partially labeled.
e.g., deep belief networks (DBNs)
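
A deep belief network is too large for a short example, so the sketch below swaps in a different semi-supervised technique, scikit-learn's `LabelPropagation`, to show learning from partially labeled data. The data values and the `gamma` setting are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Hypothetical data: two groups, only one labeled instance in each
X = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
y = np.array([0, -1, -1, 1, -1, -1])   # -1 marks an unlabeled instance

model = LabelPropagation(kernel="rbf", gamma=20)
model.fit(X, y)

inferred = model.transduction_         # labels inferred for every instance
print(inferred)
```

The two known labels spread to their unlabeled neighbors, so every instance ends up with a label.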

1.3.4 Reinforcement learning¶

The learning system, called the agent, observes the environment, performs actions, and receives rewards or penalties in return.
It learns by itself the best strategy, called the policy, to collect the most reward.
e.g., walking robots, DeepMind's AlphaGo
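
The loop of acting, receiving a reward, and updating a strategy can be sketched with a very small problem. The example below is an epsilon-greedy agent on a three-action bandit, which is a simplification chosen here for brevity and is far simpler than a walking robot or AlphaGo; the reward values and `epsilon` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
true_means = np.array([1.0, 2.0, 3.0])  # hypothetical mean reward of each action
q = np.zeros(3)       # the agent's running estimate of each action's reward
counts = np.zeros(3)  # how often each action has been tried
epsilon = 0.1         # probability of exploring a random action

for step in range(500):
    if rng.random() < epsilon:
        action = int(rng.integers(3))   # explore
    else:
        action = int(np.argmax(q))      # exploit the current estimate
    reward = true_means[action] + rng.normal(0.0, 0.1)  # environment's response
    counts[action] += 1
    q[action] += (reward - q[action]) / counts[action]  # incremental average

best_action = int(np.argmax(q))  # the learned policy: always pick this action
print(best_action)
```

The agent is never told which action is best; it finds out from the rewards alone.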

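1.3.5 Batch learning¶

The system is trained using all the available data.
Because this consumes a lot of time and computing resources, it is done offline, hence the name offline learning.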

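1.3.6 Online learning¶

The system is trained by feeding it data sequentially, either one instance at a time or in small groups called mini-batches.
It suits systems that receive data continuously and must adapt quickly to change, and systems with limited computing resources.
It also applies to training on huge datasets that cannot fit in one machine's main memory.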
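
Feeding data in mini-batches can be sketched with `SGDRegressor.partial_fit`, which updates a model incrementally. The synthetic data stream, batch size, and learning-rate settings below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor(learning_rate="constant", eta0=0.1, random_state=42)

# Simulate a stream: 250 mini-batches of 20 instances from y = 3x + 1 plus noise
for _ in range(250):
    X_batch = rng.uniform(0.0, 1.0, size=(20, 1))
    y_batch = 3.0 * X_batch.ravel() + 1.0 + rng.normal(scale=0.05, size=20)
    model.partial_fit(X_batch, y_batch)  # update using this batch only

slope, intercept = model.coef_[0], model.intercept_[0]
print(slope, intercept)
```

Each batch can be discarded after the call, which is why this style of training also works when the full dataset does not fit in memory.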

1.3.7 Instance-based learning¶

The system learns the examples by heart, then generalizes to new data by measuring their similarity to the stored examples.
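
A k-nearest-neighbors regressor is a common instance-based learner. The sketch below uses made-up GDP and satisfaction values rather than the notebook's dataset, and the choice of three neighbors is an assumption.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical (GDP per capita, life satisfaction) training instances
X_train = np.array([[10_000.0], [20_000.0], [30_000.0], [40_000.0], [50_000.0]])
y_train = np.array([5.0, 5.5, 6.0, 6.5, 7.0])

knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(X_train, y_train)        # "training" just stores the instances

# The three closest instances are 20,000, 30,000 and 10,000,
# so the prediction is the mean of 5.5, 6.0 and 5.0
pred = knn.predict([[22_000.0]])
print(pred)  # [5.5]
```

Similarity here is plain Euclidean distance on the single feature.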

1.3.8 Model-based learning¶

Selecting the best-performing model¶

How good is the model? Measured by a utility function or fitness function.
How bad is the model? Measured by a cost function.
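
One widely used cost function for a linear model is the mean squared error. The sketch below compares two candidate parameter pairs on a tiny hypothetical dataset; the function name `mse_cost` and the numbers are illustrative only.

```python
import numpy as np

def mse_cost(theta0, theta1, x, y):
    """Mean squared error of the line theta0 + theta1 * x on the data (x, y)."""
    predictions = theta0 + theta1 * x
    return float(np.mean((predictions - y) ** 2))

# Hypothetical data lying exactly on y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

cost_good = mse_cost(1.0, 2.0, x, y)  # the true parameters
cost_bad = mse_cost(0.0, 1.0, x, y)   # a poor guess
print(cost_good, cost_bad)  # 0.0 13.5
```

Training a model-based learner amounts to searching for the parameters that make this cost as small as possible.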

Setup¶

# Common imports
import numpy as np
import os
import pandas as pd
import sklearn.linear_model

# Seed the pseudo-random number generator for reproducible output
np.random.seed(42)

# Matplotlib settings
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import font_manager, rc

# Malgun Gothic renders Korean glyphs in plot text (Windows font path; adjust on other systems)
font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()

rc('font', family=font_name)

plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

plt.rcParams['axes.unicode_minus'] = False

# Folder where figures are saved
PROJECT_ROOT_DIR = "C:/Users/Admin/Desktop/ML/"
# PROJECT_ROOT_DIR = "C:/Users/sally/Dropbox/2019-Fall-Semester/ML"
CHAPTER_ID = "fundamentals"

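def save_fig(fig_id, tight_layout=True):
    """Save the current figure as a 300-dpi PNG under images/<CHAPTER_ID>/."""
    path = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id + ".png")
    os.makedirs(os.path.dirname(path), exist_ok=True)  # create the folder if it is missing
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)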

# Ignore the warning described in SciPy issue #5998 (https://github.com/scipy/scipy/issues/5998).
# import warnings
# warnings.filterwarnings(action="ignore", module="scipy", message="^internal gelsd")

Q : Does money make people happier?¶

datapath = os.path.join("datasets", "lifesat", "")
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')

oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]

oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
oecd_bli.head()

oecd_bli["Life satisfaction"].head()

Country
Australia    7.3
Austria      6.9
Belgium      6.9
Brazil       7.0
Canada       7.3
Name: Life satisfaction, dtype: float64

gdp_per_capita = pd.read_csv(datapath+"gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")

gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)

gdp_per_capita.set_index("Country", inplace=True)
gdp_per_capita.head()

gdp_per_capita["GDP per capita"].head()

Country
Afghanistan              599.994
Albania                 3995.383
Algeria                 4318.135
Angola                  4100.315
Antigua and Barbuda    14414.302
Name: GDP per capita, dtype: float64

# inner join
full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita, left_index=True, right_index=True)

full_country_stats.sort_values(by="GDP per capita", inplace=True)

full_country_stats[["GDP per capita", 'Life satisfaction']].head()

remove_indices = [0, 1, 6, 8, 33, 34, 35]
keep_indices = list(set(range(36)) - set(remove_indices))

sample_data = full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]

missing_data = full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[remove_indices]

ax = sample_data.plot(kind='scatter', x="GDP per capita", y='Life satisfaction', figsize=(7,5))
ax.set(xlabel='GDP per capita', ylabel='Life satisfaction')
plt.axis([0, 60000, 0, 10])

[0, 60000, 0, 10]

# country -> (label x position, label y position, label text)
position_text = {
    "Hungary": (5000, 1, 'Hungary'),
    "Korea": (18000, 1.7, 'South Korea'),
    "France": (29000, 2.4, 'France'),
    "Australia": (40000, 3.0, 'Australia'),
    "United States": (52000, 3.8, 'U.S.'),
}

for country, pos_text in position_text.items():
    pos_data_x, pos_data_y = sample_data.loc[country]
    plt.annotate(pos_text[2], xy=(pos_data_x, pos_data_y), xytext=pos_text[:2],
            arrowprops=dict(facecolor='black', width=0.5, shrink=0.1, headwidth=5))
    plt.plot(pos_data_x, pos_data_y, "ro")
save_fig('money_happy_scatterplot')
plt.show()

Do you see a trend?

A simple linear model¶

$\text{life\_satisfaction} = \theta_0 + \theta_1 \times \text{GDP\_per\_capita}$¶

import numpy as np

ax = sample_data.plot(kind='scatter', x="GDP per capita", y='Life satisfaction', figsize=(7,5))
ax.set(xlabel='GDP per capita', ylabel='Life satisfaction')
plt.axis([0, 60000, 0, 10])
X=np.linspace(0, 60000, 1000)
plt.plot(X, 2*X/100000, "r")
plt.text(40000, 2.7, r"$\theta_0 = 0$", fontsize=14, color="r")
plt.text(40000, 1.8, r"$\theta_1 = 2 \times 10^{-5}$", fontsize=14, color="r")

plt.plot(X, 8 - 5*X/100000, "g")
plt.text(5000, 9.1, r"$\theta_0 = 8$", fontsize=14, color="g")
plt.text(5000, 8.2, r"$\theta_1 = -5 \times 10^{-5}$", fontsize=14, color="g")

plt.plot(X, 4 + 5*X/100000, "b")
plt.text(5000, 3.5, r"$\theta_0 = 4$", fontsize=14, color="b")
plt.text(5000, 2.6, r"$\theta_1 = 5 \times 10^{-5}$", fontsize=14, color="b")
save_fig('tweaking_model_params_plot')
plt.show()

from sklearn import linear_model
lin1 = linear_model.LinearRegression()

Xsample = np.c_[sample_data["GDP per capita"]]
Xsample

array([[ 9054.914],
       [ 9437.372],
       [12239.894],
       [12495.334],
       [15991.736],
       [17288.083],
       [18064.288],
       [19121.592],
       [20732.482],
       [25864.721],
       [27195.197],
       [29866.581],
       [32485.545],
       [35343.336],
       [37044.891],
       [37675.006],
       [40106.632],
       [40996.511],
       [41973.988],
       [43331.961],
       [43603.115],
       [43724.031],
       [43770.688],
       [49866.266],
       [50854.583],
       [50961.865],
       [51350.744],
       [52114.165],
       [55805.204]])

ysample = np.c_[sample_data["Life satisfaction"]]
ysample

array([[6. ],
       [5.6],
       [4.9],
       [5.8],
       [6.1],
       [5.6],
       [4.8],
       [5.1],
       [5.7],
       [6.5],
       [5.8],
       [6. ],
       [5.9],
       [7.4],
       [7.3],
       [6.5],
       [6.9],
       [7. ],
       [7.4],
       [7.3],
       [7.3],
       [6.9],
       [6.8],
       [7.2],
       [7.5],
       [7.3],
       [7. ],
       [7.5],
       [7.2]])

lin1.fit(Xsample, ysample)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

t0, t1 = lin1.intercept_[0], lin1.coef_[0][0]
t0, t1

(4.853052800266436, 4.911544589158483e-05)

ax = sample_data.plot(kind='scatter', x="GDP per capita", y='Life satisfaction', figsize=(7,5))
ax.set(xlabel='GDP per capita', ylabel='Life satisfaction')
plt.axis([0, 60000, 0, 10])

X=np.linspace(0, 60000, 1000)

plt.plot(X, t0 + t1*X, "b")
plt.text(5000, 3.1, r"$\theta_0 = 4.85$", fontsize=14, color="b")
plt.text(5000, 2.2, r"$\theta_1 = 4.91 \times 10^{-5}$", fontsize=14, color="b")
save_fig('best_fit_model_plot')
plt.show()

cyprus_gdp_per_capita = gdp_per_capita.loc["Cyprus"]["GDP per capita"]
print(cyprus_gdp_per_capita)

22587.49

cyprus_predicted_life_satisfaction = lin1.predict([[cyprus_gdp_per_capita]])
cyprus_predicted_life_satisfaction

array([[5.96244744]])
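
The prediction can be checked by hand from the linear model, using only the intercept, slope, and GDP value printed in the outputs above. The variable names in this sketch are illustrative.

```python
theta0 = 4.853052800266436       # intercept reported by the fitted model
theta1 = 4.911544589158483e-05   # slope reported by the fitted model
cyprus_gdp = 22587.49            # Cyprus GDP per capita printed above

# life_satisfaction = theta0 + theta1 * GDP_per_capita
prediction = theta0 + theta1 * cyprus_gdp
print(round(prediction, 2))  # 5.96
```

This is the same arithmetic that `lin1.predict` performs for a single feature.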

ax = sample_data.plot(kind='scatter', x="GDP per capita", y='Life satisfaction', figsize=(7,5), s=1)
ax.set(xlabel='GDP per capita', ylabel='Life satisfaction')

X=np.linspace(0, 60000, 1000)

plt.plot(X, t0 + t1*X, "b")
plt.axis([0, 60000, 0, 10])
plt.text(5000, 7.5, r"$\theta_0 = 4.85$", fontsize=14, color="b")
plt.text(5000, 6.6, r"$\theta_1 = 4.91 \times 10^{-5}$", fontsize=14, color="b")

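# predict() returns a 2-D array, so extract the scalar before mixing it with plain numbers
cyprus_pred_value = cyprus_predicted_life_satisfaction[0][0]
plt.plot([cyprus_gdp_per_capita, cyprus_gdp_per_capita], [0, cyprus_pred_value], "r--")
plt.text(25000, 5.0, r"Prediction = 5.96", fontsize=14, color="b")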

plt.plot(cyprus_gdp_per_capita, cyprus_predicted_life_satisfaction, "ro")

save_fig('cyprus_prediction_plot')
plt.show()

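1.4 Main challenges of machine learning¶

Bad data¶

Insufficient quantity of training data
Nonrepresentative training data
Poor-quality data
Irrelevant features

Bad algorithm¶

Overfitting the training data
Underfitting the training data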

1.4.1 Insufficient quantity of training data¶

Machine learning needs a lot of data.
"For complex problems, data matters more than algorithms."

1.4.2 Nonrepresentative training data¶

Training on nonrepresentative data produces a model that predicts poorly for very poor or very rich countries. This is sampling bias.

# country -> (label x position, label y position, label text)
position_text2 = {
    "Brazil": (1000, 9.0, 'Brazil'),
    "Mexico": (11000, 9.0, 'Mexico'),
    "Chile": (25000, 9.0, 'Chile'),
    "Czech Republic": (35000, 9.0, 'Czech Republic'),
    "Norway": (60000, 3, 'Norway'),
    "Switzerland": (72000, 3.0, 'Switzerland'),
    "Luxembourg": (90000, 3.0, 'Luxembourg'),
}

ax = sample_data.plot(kind='scatter', x="GDP per capita", y='Life satisfaction', figsize=(8,3))
ax.set(xlabel='GDP per capita', ylabel='Life satisfaction')
plt.axis([0, 110000, 0, 10])

for country, pos_text in position_text2.items():
    pos_data_x, pos_data_y = missing_data.loc[country]
    plt.annotate(pos_text[2], xy=(pos_data_x, pos_data_y), xytext=pos_text[:2],
            arrowprops=dict(facecolor='black', width=0.5, shrink=0.1, headwidth=5))
    plt.plot(pos_data_x, pos_data_y, "rs")
    
X=np.linspace(0, 110000, 1000)

plt.plot(X, t0 + t1*X, "b:")

lin_reg_full = linear_model.LinearRegression()

Xfull = np.c_[full_country_stats["GDP per capita"]]
yfull = np.c_[full_country_stats["Life satisfaction"]]

lin_reg_full.fit(Xfull, yfull)

t0full, t1full = lin_reg_full.intercept_[0], lin_reg_full.coef_[0][0]

X = np.linspace(0, 110000, 1000)

plt.plot(X, t0full + t1full * X, "k")

save_fig('representative_training_data_scatterplot')
plt.show()

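1.4.3 Poor-quality data¶

When the training data is full of errors, outliers, and noise, the system has a harder time finding the underlying patterns, so it is worth spending a lot of time cleaning the data.

1.4.4 Irrelevant features¶

Feature selection: choosing the most useful features to train on
Feature extraction: combining existing features to produce a more useful one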
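
Both ideas can be sketched with scikit-learn. The example below assumes `SelectKBest` as the selection method and `PCA` as the extraction method; the notebook does not name specific techniques, and the synthetic data is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                         # three candidate features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)   # only feature 0 matters

# Feature selection: keep the single feature most related to the target
selector = SelectKBest(score_func=f_regression, k=1)
X_selected = selector.fit_transform(X, y)
chosen = selector.get_support(indices=True)           # index of the kept column

# Feature extraction: combine all three features into one new feature
pca = PCA(n_components=1)
X_extracted = pca.fit_transform(X)

print(chosen, X_selected.shape, X_extracted.shape)
```

Selection keeps an original column unchanged, while extraction builds a new column from a combination of the originals.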

1.4.5 Overfitting the training data¶

The model fits the training data too well but does not generalize.

ax = full_country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction', figsize=(8,3))
ax.set(xlabel='GDP per capita', ylabel='Life satisfaction')
plt.axis([0, 110000, 0, 10])

from sklearn import preprocessing
from sklearn import pipeline

poly = preprocessing.PolynomialFeatures(degree=60, include_bias=False)

scaler = preprocessing.StandardScaler()

lin_reg2 = linear_model.LinearRegression()

pipeline_reg = pipeline.Pipeline([('poly', poly), ('scal', scaler), ('lin', lin_reg2)])

pipeline_reg.fit(Xfull, yfull)

curve = pipeline_reg.predict(X[:, np.newaxis])

plt.plot(X, curve)

save_fig('overfitting_model_plot')
plt.show()

C:\ProgramData\Anaconda3\lib\site-packages\numpy\lib\nanfunctions.py:1508: RuntimeWarning: overflow encountered in multiply
  sqr = np.multiply(arr, arr, out=arr)
C:\ProgramData\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py:86: RuntimeWarning: overflow encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

Overfitting can be reduced by simplifying the model or by constraining it (regularization).
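
The effect of a constraint can be seen on a tiny hypothetical dataset. The sketch below compares an unconstrained linear regression with `Ridge`; the data and the value `alpha=10.0` are assumptions chosen so the shrinkage is easy to see.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical data that roughly follows y = x
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

plain = LinearRegression().fit(x, y)
regularized = Ridge(alpha=10.0).fit(x, y)   # larger alpha means a stronger constraint

slope_plain = plain.coef_[0]
slope_ridge = regularized.coef_[0]
print(slope_plain, slope_ridge)
```

The regularized slope is pulled toward zero, giving a flatter and simpler line that reacts less to individual training points.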

plt.figure(figsize=(8,3))

plt.xlabel("GDP per capita"); plt.ylabel('Life satisfaction')

plt.plot(list(sample_data["GDP per capita"]), list(sample_data["Life satisfaction"]), "bo")

plt.plot(list(missing_data["GDP per capita"]), list(missing_data["Life satisfaction"]), "rs")

X = np.linspace(0, 110000, 1000)

plt.plot(X, t0full + t1full * X, "r--", label="Linear model on all data")

plt.plot(X, t0 + t1*X, "b:", label="Linear model on partial data")

ridge = linear_model.Ridge(alpha=10**9.5)

Xsample = np.c_[sample_data["GDP per capita"]]

ysample = np.c_[sample_data["Life satisfaction"]]

ridge.fit(Xsample, ysample)

t0ridge, t1ridge = ridge.intercept_[0], ridge.coef_[0][0]

plt.plot(X, t0ridge + t1ridge * X, "b", label="Regularized linear model on partial data")

plt.legend(loc="lower right")

plt.axis([0, 110000, 0, 10])

save_fig('ridge_model_plot')
plt.show()

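1.4.6 Underfitting the training data¶

Underfitting occurs when the model is too simple to learn the underlying structure of the data. Ways to fix it:
Select a model with more parameters
Feed better features to the learning algorithm
Reduce the constraints on the model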
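
The first two fixes can be illustrated together: adding a squared feature gives the linear model one more parameter and a more useful input. The quadratic target below is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = x.ravel() ** 2   # hypothetical target with a quadratic structure

# A straight line is too simple for this data and underfits
linear = LinearRegression().fit(x, y)
mse_linear = mean_squared_error(y, linear.predict(x))

# Adding x^2 as a feature lets the same algorithm capture the structure
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
quadratic = LinearRegression().fit(x_poly, y)
mse_quadratic = mean_squared_error(y, quadratic.predict(x_poly))

print(mse_linear, mse_quadratic)
```

The training error of the straight line stays large no matter how much data is added, which is the signature of underfitting.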

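1.5 Testing and validation¶

Split the data into a training set (80%) and a test set (20%)¶

Train the model on the training set and evaluate it on the test set.
Generalization error: the error rate on new samples

Hold out a validation set to avoid overfitting¶

Train several models on the training set, then select the one that performs best on the validation set.
Run the final model on the test set only once.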
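
The whole procedure can be sketched as follows. The synthetic data, the `Ridge` candidates, and the 60/20/20 split that results from holding out a validation set are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=1.0, size=200)  # hypothetical data

# Set aside 20% as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Hold out a validation set from the remainder (25% of 80% = 20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

# Train several candidate models and score each on the validation set
val_errors = {}
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_errors[alpha] = mean_squared_error(y_val, model.predict(X_val))

best_alpha = min(val_errors, key=val_errors.get)

# Touch the test set exactly once, with the selected model
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(best_alpha, test_mse)
```

Because the test set plays no part in choosing `best_alpha`, `test_mse` remains an honest estimate of the generalization error.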

Output of oecd_bli.head():
Indicator	Air pollution	Assault rate	Consultation on rule-making	Dwellings without basic facilities	Educational attainment	Employees working very long hours	Employment rate	Homicide rate	Household net adjusted disposable income	Household net financial wealth	...	Long-term unemployment rate	Personal earnings	Quality of support network	Rooms per person	Self-reported health	Student skills	Time devoted to leisure and personal care	Voter turnout	Water quality	Years in education
Country
Australia	13.0	2.1	10.5	1.1	76.0	14.02	72.0	0.8	31588.0	47657.0	...	1.08	50449.0	92.0	2.3	85.0	512.0	14.41	93.0	91.0	19.4
Austria	27.0	3.4	7.1	1.0	83.0	7.61	72.0	0.4	31173.0	49887.0	...	1.19	45199.0	89.0	1.6	69.0	500.0	14.46	75.0	94.0	17.0
Belgium	21.0	6.6	4.5	2.0	72.0	4.57	62.0	1.1	28307.0	83876.0	...	3.88	48082.0	94.0	2.2	74.0	509.0	15.71	89.0	87.0	18.9
Brazil	18.0	7.9	4.0	6.7	45.0	10.41	67.0	25.5	11664.0	6844.0	...	1.97	17177.0	90.0	1.6	69.0	402.0	14.97	79.0	72.0	16.3
Canada	15.0	1.3	10.5	0.2	89.0	3.94	72.0	1.5	29365.0	67913.0	...	0.90	46911.0	92.0	2.5	89.0	522.0	14.25	61.0	91.0	17.2

Output of gdp_per_capita.head():
	Subject Descriptor	Units	Scale	Country/Series-specific Notes	GDP per capita	Estimates Start After
Country
Afghanistan	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for: Gross domestic product, curren...	599.994	2013.0
Albania	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for: Gross domestic product, curren...	3995.383	2010.0
Algeria	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for: Gross domestic product, curren...	4318.135	2014.0
Angola	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for: Gross domestic product, curren...	4100.315	2014.0
Antigua and Barbuda	Gross domestic product per capita, current prices	U.S. dollars	Units	See notes for: Gross domestic product, curren...	14414.302	2011.0

Output of full_country_stats[["GDP per capita", 'Life satisfaction']].head():
	GDP per capita	Life satisfaction
Country
Brazil	8669.998	7.0
Mexico	9009.280	6.7
Russia	9054.914	6.0
Turkey	9437.372	5.6
Hungary	12239.894	4.9