Chapter 6 – Decision Trees

Outline

. Decision tree training and visualization

. Making predictions

. Estimating class probabilities

. The CART training algorithm

. Computational complexity

. Gini impurity or entropy?

. Regularization hyperparameters

. Regression

. Instability

6.0 Setup

In [1]:
# Common imports
import numpy as np
import os
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import font_manager, rc
from sklearn.datasets import load_iris
from sklearn.datasets import make_moons

from sklearn.tree import export_graphviz
from matplotlib.colors import ListedColormap
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor

# Seed the pseudo-random number generator for consistent output
np.random.seed(42)

# Matplotlib configuration (Korean font for plot labels)
font_name = font_manager.FontProperties(fname = "c:/Windows/Fonts/malgun.ttf").get_name()
rc('font',family = font_name)

plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['axes.unicode_minus'] = False

# Folder where figures are saved
PROJECT_ROOT_DIR = "C:/Users/Admin/Desktop/ML/"
# PROJECT_ROOT_DIR = "C:/Users/sally/Desktop/ML/"
# PROJECT_ROOT_DIR = "C:/Users/User/Desktop/ML/"
# PROJECT_ROOT_DIR = "C:/Users/sally/Dropbox/2019-Fall-Semester/ML"

CHAPTER_ID = "decision_trees"

IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

def save_fig(fig_id,tight_layout = True):
    path = os.path.join(IMAGES_PATH,fig_id + ".png")
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path,format = 'png',dpi = 300)

6.1 Decision Tree Training and Visualization

. sklearn.tree.DecisionTreeClassifier : decision tree classifier

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [2]:
iris = load_iris()
X = iris.data[:,2:] # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth = 3,random_state = 42)
tree_clf.fit(X, y)
Out[2]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')
In [3]:
export_graphviz(
        tree_clf,
        out_file = os.path.join(IMAGES_PATH,"iris_tree.dot"),
        feature_names = ["petal length (cm)","petal width (cm)"],
        class_names = iris.target_names,
        rounded = True,
        filled = True
    )

. Converting the .dot file to a .png file

.. > dot -Tpng "C:/Users/Admin/Desktop/ML/images/decision_trees/iris_tree.dot" -o "C:/Users/Admin/Desktop/ML/images/decision_trees/iris_tree.png"

.. > conda install python-graphviz
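
The .dot file can also be rendered directly from Python. A minimal sketch, assuming the python-graphviz package installed above is available; it reuses the IMAGES_PATH defined in the setup cell:

import graphviz

# Load the .dot file exported above and render it as a PNG next to it.
with open(os.path.join(IMAGES_PATH, "iris_tree.dot")) as f:
    graph = graphviz.Source(f.read())
graph.render(filename="iris_tree", directory=IMAGES_PATH, format="png", cleanup=True)
graph  # in a Jupyter cell, this also displays the tree inline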

iris_tree.png

.. max_depth = 3

주석 2019-11-05 182101.png

.. max_depth = 4

주석 2019-11-05 182101-2.png

6.2 Making Predictions

. Root node: the topmost node, at depth 0

. Child nodes: the two nodes split off from a single parent node

. Leaf node: a node that has no child nodes

. Node attributes

.. samples : how many training samples the node applies to

.. value : how many training samples of each class the node contains

.. gini : the node's impurity (computed in the sketch after this list)

... $G_i=1-\sum_{k=1}^np_{i,k}^2$

... $p_{i,k}$ : the ratio of class $k$ instances among the training samples in node $i$

. Scikit-Learn uses the CART algorithm, which produces only binary trees

.. The ID3 algorithm (Iterative Dichotomiser 3) can produce multi-way splits

... https://en.wikipedia.org/wiki/ID3_algorithm

. Model interpretation

.. White-box models: decision trees - intuitive and easy to interpret

.. Black-box models: random forests, neural networks - good predictive power, but hard to interpret
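
A minimal sketch of the Gini computation referenced in the list above; the class counts passed in are illustrative examples, not values read from a particular tree:

def gini_impurity(class_counts):
    # G = 1 - sum_k p_k^2, where p_k is the ratio of class k in the node
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([50, 0, 0]))   # a pure node -> 0.0
print(gini_impurity([0, 49, 5]))   # a mostly-versicolor node -> about 0.168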

In [4]:
def plot_decision_boundary(clf, X, y, axes=[0, 7.5, 0, 3], iris=True, legend=False, plot_training=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#a0faa0','#9898ff'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if not iris:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    if plot_training:
        plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
        plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
        plt.plot(X[:, 0][y==2], X[:, 1][y==2], "g^", label="Iris-Virginica")
        plt.axis(axes)
    if iris:
        plt.xlabel("꽃잎 길이", fontsize=14)
        plt.ylabel("꽃잎 너비", fontsize=14)
    else:
        plt.xlabel(r"$x_1$", fontsize=18)
        plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
    if legend:
        plt.legend(loc="lower right", fontsize=14)

plt.figure(figsize=(11, 5))
plot_decision_boundary(tree_clf, X, y,legend=True)
plt.plot([2.45, 2.45], [0, 3], "k-", linewidth=2)
plt.plot([2.45, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.plot([4.95, 4.95], [0, 1.75], "k:", linewidth=2)
plt.plot([4.85, 4.85], [1.75, 3], "k:", linewidth=2)
plt.text(1.40, 1.0, "깊이=0", fontsize=15)
plt.text(3.2, 1.80, "깊이=1", fontsize=13)
plt.text(4.55, 0.5, "깊이=2", fontsize=11)
plt.text(4.45, 2.5, "깊이=2", fontsize=11)

save_fig("decision_tree_decision_boundaries_plot")
plt.show()

6.3 Estimating Class Probabilities

. The probability of class $k$ is estimated as the ratio of class $k$ training samples in the leaf node

. The prediction is the class with the highest estimated probability

In [5]:
tree_clf.predict_proba([[5, 1.5]])
Out[5]:
array([[0.        , 0.33333333, 0.66666667]])
In [6]:
tree_clf.predict([[5, 1.5]])
Out[6]:
array([2])
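
As a cross-check, these probabilities can be reproduced from the raw class counts stored in the leaf the sample falls into. A small sketch using the tree internals (in the scikit-learn versions used here, tree_.value stores per-class training-sample counts; very recent versions store class fractions instead):

leaf_id = tree_clf.apply([[5, 1.5]])[0]         # index of the leaf node for this sample
leaf_counts = tree_clf.tree_.value[leaf_id][0]  # training samples of each class in that leaf
print(leaf_counts, leaf_counts / leaf_counts.sum())  # ratios should match predict_proba above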

6.4 The CART Training Algorithm

. Idea: split the training set into two subsets using a single feature $k$ and a threshold $t_k$; search for the pair $(k,t_k)$ that yields the purest subsets

.. A greedy algorithm

. Cost function: classification (a brute-force split-search sketch follows this list)

.. $J(k,t_k)=\frac{m_{\rm left}}{m}G_{\rm left}+\frac{m_{\rm right}}{m}G_{\rm right}$

.. $G_{\rm left}$ ($G_{\rm right}$) : impurity of the left (right) subset

.. $m_{\rm left}$ ($m_{\rm right}$) : number of samples in the left (right) subset

. Stopping rules: reaching max_depth (the maximum depth), or when no split can further reduce impurity

. Finding the optimal tree is an NP-complete problem (both in NP and NP-hard)

.. NP : the set of problems whose solutions can be verified in polynomial time

.. NP-hard : a problem to which any NP problem can be reduced in polynomial time
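
As a rough illustration of the cost function above (not Scikit-Learn's actual implementation), the following brute-force sketch searches every feature and candidate threshold for the split minimizing $J(k,t_k)$ on the iris petal data:

def subset_gini(y_subset):
    _, counts = np.unique(y_subset, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_cart_split(X, y):
    # Greedy search over (feature k, threshold t_k) minimizing the weighted impurity J
    m, n_features = X.shape
    best_k, best_t, best_J = None, None, np.inf
    for k in range(n_features):
        values = np.unique(X[:, k])
        for t in (values[:-1] + values[1:]) / 2:      # midpoints between sorted values
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            J = len(left) / m * subset_gini(left) + len(right) / m * subset_gini(right)
            if J < best_J:
                best_k, best_t, best_J = k, t, J
    return best_k, best_t, best_J

iris_demo = load_iris()
print(best_cart_split(iris_demo.data[:, 2:], iris_demo.target))
# expected to recover the tree's first split (petal length at about 2.45)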

6.5 Computational Complexity

. Prediction complexity (number of nodes traversed to make one prediction): $O(\log_2(m))$

.. Independent of the number of features: works well with large training sets

. Training complexity: $O(n\times m\log(m))$
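
A concrete check of the prediction cost, using decision_path to list the nodes a single sample traverses; the count is roughly the tree depth plus one and does not depend on the number of features:

node_path = tree_clf.decision_path([[5, 1.5]])   # sparse indicator of the visited nodes
print("nodes traversed:", node_path.indices)
print("tree depth:", tree_clf.tree_.max_depth, "  log2(m) =", np.log2(150))
# for a roughly balanced, unregularized tree the depth is on the order of log2(m);
# here the tree was capped at max_depth = 3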

6.6 Gini Impurity or Entropy?

. Entropy: originally a measure of molecular disorder; here it measures a node's impurity

.. Reducing entropy corresponds to an information gain

. $H_i=-\sum_{k=1,p_{i,k}\ne0}^np_{i,k}\log_2(p_{i,k})$

. In practice the two criteria make little difference
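
Switching criteria is a single hyperparameter. A quick sketch training the same tree with both criteria on the iris petal data; the resulting trees are usually very similar:

X_petal, y_petal = iris.data[:, 2:], iris.target

tree_gini = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=42)
tree_entropy = DecisionTreeClassifier(max_depth=3, criterion="entropy", random_state=42)
tree_gini.fit(X_petal, y_petal)
tree_entropy.fit(X_petal, y_petal)

print(tree_gini.score(X_petal, y_petal), tree_entropy.score(X_petal, y_petal))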

6.7 Regularization Hyperparameters

. Nonparametric model: a model whose number of parameters is not determined before training

. Risk of overfitting

. Regularization hyperparameters: increasing the min_* hyperparameters or decreasing the max_* hyperparameters increases regularization

.. min_samples_split : minimum number of samples required to split an internal node

.. min_samples_leaf : minimum number of samples required to be at a leaf node

.. min_weight_fraction_leaf : minimum weighted fraction of the total sample weights required to be at a leaf node

.. max_leaf_nodes : maximum number of leaf nodes

.. max_features : maximum number of features evaluated when splitting each node
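
One common way to choose these values is a cross-validated grid search. A minimal sketch (the parameter grid below is an arbitrary choice, and the moons data is regenerated locally so the snippet stands alone):

from sklearn.model_selection import GridSearchCV

Xm_demo, ym_demo = make_moons(n_samples=100, noise=0.25, random_state=53)

param_grid = {"max_leaf_nodes": [4, 8, 16, None],
              "min_samples_leaf": [1, 2, 4, 8]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(Xm_demo, ym_demo)
print(grid_search.best_params_, grid_search.best_score_)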

In [7]:
Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)
In [8]:
deep_tree_clf1 = DecisionTreeClassifier(random_state=42)
deep_tree_clf1.fit(Xm, ym)
Out[8]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')
In [9]:
deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf2.fit(Xm, ym)
Out[9]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=4, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')
In [10]:
plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(deep_tree_clf1, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("규제 없음", fontsize=16)
plt.subplot(122)
plot_decision_boundary(deep_tree_clf2, Xm, ym, axes=[-1.5, 2.5, -1, 1.5], iris=False)
plt.title("min_samples_leaf = {}".format(deep_tree_clf2.min_samples_leaf), fontsize=14)

save_fig("min_samples_leaf_plot")
plt.show()
In [11]:
export_graphviz(
        deep_tree_clf2,
        out_file=os.path.join(IMAGES_PATH,"moon_tree.dot"),
        feature_names=["X1", "X2"],
        rounded=True,
        filled=True
    )

moon_tree.png

. Pruning: train the decision tree without restrictions, then remove unnecessary nodes

. A node whose children are all leaf nodes is pruned when the purity improvement it provides is not statistically significant

.. Chi-squared test: if the $p$-value is greater than 0.05, prune the child nodes (a rough sketch follows)
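
Scikit-Learn does not implement this chi-squared pruning itself (recent versions offer cost-complexity pruning through the ccp_alpha hyperparameter instead). As a rough sketch of the test described above, the following reports, for every node whose two children are leaves, whether the split would be pruned; it assumes tree_.value holds raw class counts, as in the scikit-learn versions used here:

from scipy.stats import chi2_contingency

def report_pruning_candidates(tree_model, alpha=0.05):
    t = tree_model.tree_
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                       # this node is a leaf
            continue
        if t.children_left[left] != -1 or t.children_left[right] != -1:
            continue                         # at least one child is not a leaf
        # contingency table: rows = the two children, columns = class counts
        table = np.vstack([t.value[left][0], t.value[right][0]])
        table = table[:, table.sum(axis=0) > 0]   # drop classes absent from both children
        if table.shape[1] < 2:
            p_value = 1.0                    # only one class present: the split adds nothing
        else:
            _, p_value, _, _ = chi2_contingency(table)
        verdict = "prune" if p_value > alpha else "keep"
        print("node %d: p-value = %.3f -> %s" % (node, p_value, verdict))

report_pruning_candidates(tree_clf)          # the iris tree trained earlier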

6.8 Regression

. sklearn.tree.DecisionTreeRegressor : decision tree regressor

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html

. Cost function: regression

. $J(k,t_k)=\frac{m_{\rm left}}{m}{\rm MSE}_{\rm left}+\frac{m_{\rm right}}{m}{\rm MSE}_{\rm right}$

.. ${\rm MSE}_{\rm node}=\frac{1}{m_{\rm node}}\sum_{i\in{\rm node}}({\hat y}_{\rm node}-y^{(i)})^2$

.. ${\hat y}_{\rm node}=\frac{1}{m_{\rm node}}\sum_{i\in{\rm node}}y^{(i)}$
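
A short sketch evaluating this regression cost for one candidate split (the toy data, feature index, and threshold below are arbitrary; Scikit-Learn searches over all of them as in the classification case):

def regression_split_cost(X, y, k, t):
    # J(k, t_k) = (m_left/m) * MSE_left + (m_right/m) * MSE_right
    left, right = y[X[:, k] <= t], y[X[:, k] > t]
    mse = lambda y_node: np.mean((y_node - y_node.mean()) ** 2) if len(y_node) else 0.0
    m = len(y)
    return len(left) / m * mse(left) + len(right) / m * mse(right)

rng = np.random.RandomState(0)
X_demo = rng.rand(50, 1)                                   # toy quadratic data, as in the next cell
y_demo = 4 * (X_demo[:, 0] - 0.5) ** 2 + rng.randn(50) / 10
print(regression_split_cost(X_demo, y_demo, k=0, t=0.5))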

In [12]:
# A quadratic dataset plus noise
np.random.seed(42)
m = 200
X = np.random.rand(m, 1)
y = 4 * (X - 0.5) ** 2
y = y + np.random.randn(m, 1) / 10
In [13]:
tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)
Out[13]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')
In [14]:
tree_reg1 = DecisionTreeRegressor(random_state=42, max_depth=2)
tree_reg1.fit(X, y)
Out[14]:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')
In [15]:
tree_reg2 = DecisionTreeRegressor(random_state=42, max_depth=3)
tree_reg2.fit(X, y)
Out[15]:
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')
In [16]:
def plot_regression_predictions(tree_reg, X, y, axes=[0, 1, -0.2, 1], ylabel="$y$"):
    x1 = np.linspace(axes[0], axes[1], 500).reshape(-1, 1)
    y_pred = tree_reg.predict(x1)
    plt.axis(axes)
    plt.xlabel("$x_1$", fontsize=18)
    if ylabel:
        plt.ylabel(ylabel, fontsize=18, rotation=0)
    plt.plot(X, y, "b.")
    plt.plot(x1, y_pred, "r.-", linewidth=2, label=r"$\hat{y}$")

plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_regression_predictions(tree_reg1, X, y)
for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
    plt.plot([split, split], [-0.2, 1], style, linewidth=2)
plt.text(0.21, 0.65, "Depth=0", fontsize=15)
plt.text(0.01, 0.2, "Depth=1", fontsize=13)
plt.text(0.65, 0.8, "Depth=1", fontsize=13)
plt.legend(loc="upper center", fontsize=18)
plt.title("max_depth=2", fontsize=14)

plt.subplot(122)
plot_regression_predictions(tree_reg2, X, y, ylabel=None)
for split, style in ((0.1973, "k-"), (0.0917, "k--"), (0.7718, "k--")):
    plt.plot([split, split], [-0.2, 1], style, linewidth=2)
for split in (0.0458, 0.1298, 0.2873, 0.9040):
    plt.plot([split, split], [-0.2, 1], "k:", linewidth=1)
plt.text(0.3, 0.5, "Depth=2", fontsize=13)
plt.title("max_depth=3", fontsize=14)

plt.show()
In [17]:
export_graphviz(
        tree_reg1,
        out_file=os.path.join(IMAGES_PATH,"regression_tree.dot"),
        feature_names=["x1"],
        rounded=True,
        filled=True
    )

regression_tree.png

In [18]:
tree_reg1 = DecisionTreeRegressor(random_state=42)
tree_reg1.fit(X, y)
Out[18]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')
In [19]:
tree_reg2 = DecisionTreeRegressor(random_state=42, min_samples_leaf=10)
tree_reg2.fit(X, y)
Out[19]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=10,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')
In [20]:
x1 = np.linspace(0, 1, 500).reshape(-1, 1)
y_pred1 = tree_reg1.predict(x1)
y_pred2 = tree_reg2.predict(x1)

plt.figure(figsize=(11, 4))

plt.subplot(121)
plt.plot(X, y, "b.")
plt.plot(x1, y_pred1, "r.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", fontsize=18, rotation=0)
plt.legend(loc="upper center", fontsize=18)
plt.title("규제 없음", fontsize=14)

plt.subplot(122)
plt.plot(X, y, "b.")
plt.plot(x1, y_pred2, "r.-", linewidth=2, label=r"$\hat{y}$")
plt.axis([0, 1, -0.2, 1.1])
plt.xlabel("$x_1$", fontsize=18)
plt.title("min_samples_leaf={}".format(tree_reg2.min_samples_leaf), fontsize=14)

save_fig("tree_regression_regularization_plot")
plt.show()

6.9 Instability

. Strengths: easy to understand and interpret, easy to use, versatile, and powerful

. Weaknesses:

.. Sensitive to rotation of the training set $\Rightarrow$ PCA can first re-orient the data along its principal axes (see the sketch below)

.. Sensitive to small variations in the training data $\Rightarrow$ random forests average predictions over many trees
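
A self-contained sketch of the PCA idea mentioned above: on an elongated two-class dataset rotated by 45 degrees, a tree needs a staircase of splits, while projecting onto the principal axes first typically makes it much shallower again (exact depths depend on the random data):

from sklearn.decomposition import PCA

rng = np.random.RandomState(6)
X_long = np.c_[rng.uniform(-1, 1, 200), rng.uniform(-0.2, 0.2, 200)]  # elongated cloud
y_long = (X_long[:, 0] > 0).astype(int)          # class depends on the long-axis coordinate

angle = np.pi / 4
rot = np.array([[np.cos(angle), -np.sin(angle)],
                [np.sin(angle),  np.cos(angle)]])
X_rot = X_long.dot(rot)                          # rotate the data by 45 degrees

tree_rot = DecisionTreeClassifier(random_state=42).fit(X_rot, y_long)
X_pca = PCA(n_components=2).fit_transform(X_rot) # re-align the data with its principal axes
tree_pca = DecisionTreeClassifier(random_state=42).fit(X_pca, y_long)

print("depth on rotated data:", tree_rot.tree_.max_depth,
      "  depth after PCA:", tree_pca.tree_.max_depth)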

In [21]:
np.random.seed(6)
Xs = np.random.rand(100, 2) - 0.5
ys = (Xs[:, 0] > 0).astype(np.float32) * 2

angle = np.pi / 4
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
Xsr = Xs.dot(rotation_matrix)
In [22]:
tree_clf_s = DecisionTreeClassifier(random_state=42)
tree_clf_s.fit(Xs, ys)
Out[22]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')
In [23]:
tree_clf_sr = DecisionTreeClassifier(random_state=42)
tree_clf_sr.fit(Xsr, ys)
Out[23]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')
In [24]:
plt.figure(figsize=(11, 4))
plt.subplot(121)
plot_decision_boundary(tree_clf_s, Xs, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)
plt.subplot(122)
plot_decision_boundary(tree_clf_sr, Xsr, ys, axes=[-0.7, 0.7, -0.7, 0.7], iris=False)

save_fig("sensitivity_to_rotation_plot")
plt.show()
In [25]:
X = iris.data[:, 2:] # petal length and width
y = iris.target

X[(X[:, 1]==X[:, 1][y==1].max()) & (y==1)] # the widest Iris-Versicolor

not_widest_versicolor = (X[:, 1]!=1.8) | (y==2)
X_tweaked = X[not_widest_versicolor]
y_tweaked = y[not_widest_versicolor]

tree_clf_tweaked = DecisionTreeClassifier(max_depth=2, random_state=40)
tree_clf_tweaked.fit(X_tweaked, y_tweaked)
Out[25]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=40,
            splitter='best')
In [26]:
plt.figure(figsize=(8, 4))
plot_decision_boundary(tree_clf_tweaked, X_tweaked, y_tweaked, legend=False)
plt.plot([0, 7.5], [0.8, 0.8], "k-", linewidth=2)
plt.plot([0, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.text(1.0, 0.9, "깊이=0", fontsize=15)
plt.text(1.0, 1.80, "깊이=1", fontsize=13)

save_fig("decision_tree_instability_plot")
plt.show()

The End of Chapter 6