2. End-to-End Machine Learning Project

. Welcome to Machine Learning Housing Corporation! Your task is to build a model of housing prices in this region using California census data.

2.0 Setup

In [1]:
# Common imports
from warnings import simplefilter
import numpy as np
import os
import pandas as pd
import sklearn.linear_model
import matplotlib.pyplot as plt
from matplotlib import font_manager, rc
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
# from sklearn.preprocessing import Imputer
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

simplefilter(action='ignore', category=FutureWarning)

# Seed the pseudo-random number generator so the output is reproducible
np.random.seed(42)

%matplotlib inline

font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()

rc('font', family=font_name)

plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['axes.unicode_minus'] = False

# Folder where the figures are saved
PROJECT_ROOT_DIR = "C:/Users/Admin/Desktop/ML/"
# PROJECT_ROOT_DIR = "C:/Users/User/Desktop/ML/"
# PROJECT_ROOT_DIR = "C:/Users/sally/Dropbox/2019-Fall-Semester/ML"

CHAPTER_ID = "end_to_end_project"

IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)  # create the folder if needed so save_fig() does not fail

def save_fig(fig_id, tight_layout=True):
    path = os.path.join(IMAGES_PATH, fig_id + ".png")
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)

2.3.2 Download the Data

In [2]:
housing_path = os.path.join("datasets", "housing")
In [3]:
housing = pd.read_csv(os.path.join(housing_path, "housing.csv"))

2.3.3 Take a Quick Look at the Data

In [4]:
housing.head()
Out[4]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY

. total_bedrooms is missing for 207 districts (20640 rows, 20433 non-null values)

In [5]:
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
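
The missing-value count can also be read directly instead of subtracting numbers from the info() output. A minimal sketch on a toy DataFrame (the values are made up for illustration and are not the housing data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing table (hypothetical values, not the real data).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Missing entries per column: each True counts as 1 when summed.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# The same number follows from the info() arithmetic: rows minus non-null count.
missing_from_counts = len(toy) - toy["total_bedrooms"].count()
print(missing_from_counts)
```

Applied to the real table, the same arithmetic gives 20640 - 20433 = 207.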

. Build a frequency table for the ocean_proximity attribute

. pandas.Series.value_counts https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html

In [6]:
housing["ocean_proximity"].value_counts()
Out[6]:
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

. Summary statistics of the numerical attributes

In [7]:
housing.describe()
Out[7]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
In [8]:
housing.hist(bins=50, figsize=(20,15))

save_fig("attribute_histogram_plots")

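. housing_median_age and median_house_value appear to be capped at a maximum value

. The attributes have very different scales

. Many attributes have a long right tail (they are skewed to the right)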
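
Capping and right skew can also be checked numerically rather than only by eye. The sketch below uses synthetic log-normal data (an assumption made purely for illustration, not the housing data): a spike of rows sitting exactly at the maximum suggests capping, and a clearly positive skewness indicates a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed "prices" (log-normal), then capped at 500,000.
raw = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=5000))
capped = raw.clip(upper=500_000)

# Share of rows sitting exactly at the maximum: a spike here suggests capping.
share_at_max = (capped == capped.max()).mean()

# Positive skewness indicates a long right tail; a log transform reduces it.
skew_raw = raw.skew()
skew_log = np.log(raw).skew()

print(share_at_max, skew_raw, skew_log)
```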

2.3.4 Create a Test Set

. Simple random sampling (SRS)

. sklearn.model_selection.train_test_split

Split arrays or matrices into random train and test subsets
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [9]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
In [10]:
test_set.head()
Out[10]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
20046 -119.01 36.06 25.0 1505.0 NaN 1392.0 359.0 1.6812 47700.0 INLAND
3024 -119.46 35.14 30.0 2943.0 NaN 1565.0 584.0 2.5313 45800.0 INLAND
15663 -122.44 37.80 52.0 3830.0 NaN 1310.0 963.0 3.4801 500001.0 NEAR BAY
20484 -118.72 34.28 17.0 3051.0 NaN 1705.0 495.0 5.7376 218600.0 <1H OCEAN
9814 -121.93 36.62 34.0 2351.0 NaN 1063.0 428.0 3.7250 278000.0 NEAR OCEAN

. Median income is a very important predictor of the median house value

. Use stratified random sampling instead of SRS, so that the test set preserves the income distribution

In [11]:
# Divide by 1.5 to limit the number of income categories.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)

. 0 < median_income ≤ 1.5 gives 1,

. 1.5 < median_income ≤ 3 gives 2,

. 3 < median_income ≤ 4.5 gives 3,

. 4.5 < median_income ≤ 6 gives 4,

. 6 < median_income ≤ 7.5 gives 5,

. 7.5 < median_income ≤ 9 gives 6,

. ...
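
The binning rule can be checked on a few hand-picked incomes. This is a small sketch, not part of the original notebook: it applies the same np.ceil(x / 1.5) rule, merges everything above category 5 into 5, and compares the result with pd.cut using explicit right-closed bin edges (the pd.cut variant is an assumed alternative).

```python
import numpy as np
import pandas as pd

median_income = pd.Series([0.4999, 1.5, 1.5603, 3.0, 4.5625, 8.3252, 15.0001])

# ceil(x / 1.5) assigns (0, 1.5] -> 1, (1.5, 3.0] -> 2, (3.0, 4.5] -> 3, ...
income_cat = np.ceil(median_income / 1.5)

# Merge every category above 5 into category 5.
income_cat_capped = income_cat.clip(upper=5.0)

# Equivalent binning with explicit, right-closed edges.
income_cat_cut = pd.cut(
    median_income,
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
).astype(int)

print(pd.DataFrame({
    "median_income": median_income,
    "ceil": income_cat,
    "capped": income_cat_capped,
    "cut": income_cat_cut,
}))
```

Note that a value exactly on a boundary, such as 3.0, falls in the lower category.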

In [12]:
housing[["median_income", "income_cat"]]
Out[12]:
median_income income_cat
0 8.3252 6.0
1 8.3014 6.0
2 7.2574 5.0
3 5.6431 4.0
4 3.8462 3.0
5 4.0368 3.0
6 3.6591 3.0
7 3.1200 3.0
8 2.0804 2.0
9 3.6912 3.0
10 3.2031 3.0
11 3.2705 3.0
12 3.0750 3.0
13 2.6736 2.0
14 1.9167 2.0
15 2.1250 2.0
16 2.7750 2.0
17 2.1202 2.0
18 1.9911 2.0
19 2.6033 2.0
20 1.3578 1.0
21 1.7135 2.0
22 1.7250 2.0
23 2.1806 2.0
24 2.6000 2.0
25 2.4038 2.0
26 2.4597 2.0
27 1.8080 2.0
28 1.6424 2.0
29 1.6875 2.0
... ... ...
20610 1.3631 1.0
20611 1.2857 1.0
20612 1.4934 1.0
20613 1.4958 1.0
20614 2.4695 2.0
20615 2.3598 2.0
20616 2.0469 2.0
20617 3.3021 3.0
20618 2.2500 2.0
20619 2.7303 2.0
20620 4.5625 4.0
20621 2.3661 2.0
20622 2.4167 2.0
20623 2.8235 2.0
20624 3.0739 3.0
20625 4.1250 3.0
20626 2.1667 2.0
20627 3.0000 2.0
20628 2.5952 2.0
20629 2.0943 2.0
20630 3.5673 3.0
20631 3.5179 3.0
20632 3.1250 3.0
20633 2.5495 2.0
20634 3.7125 3.0
20635 1.5603 2.0
20636 2.5568 2.0
20637 1.7000 2.0
20638 1.8672 2.0
20639 2.3886 2.0

20640 rows × 2 columns

In [13]:
# Label categories of 5 and above as 5.
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
In [14]:
housing[["median_income", "income_cat"]]
Out[14]:
median_income income_cat
0 8.3252 5.0
1 8.3014 5.0
2 7.2574 5.0
3 5.6431 4.0
4 3.8462 3.0
5 4.0368 3.0
6 3.6591 3.0
7 3.1200 3.0
8 2.0804 2.0
9 3.6912 3.0
10 3.2031 3.0
11 3.2705 3.0
12 3.0750 3.0
13 2.6736 2.0
14 1.9167 2.0
15 2.1250 2.0
16 2.7750 2.0
17 2.1202 2.0
18 1.9911 2.0
19 2.6033 2.0
20 1.3578 1.0
21 1.7135 2.0
22 1.7250 2.0
23 2.1806 2.0
24 2.6000 2.0
25 2.4038 2.0
26 2.4597 2.0
27 1.8080 2.0
28 1.6424 2.0
29 1.6875 2.0
... ... ...
20610 1.3631 1.0
20611 1.2857 1.0
20612 1.4934 1.0
20613 1.4958 1.0
20614 2.4695 2.0
20615 2.3598 2.0
20616 2.0469 2.0
20617 3.3021 3.0
20618 2.2500 2.0
20619 2.7303 2.0
20620 4.5625 4.0
20621 2.3661 2.0
20622 2.4167 2.0
20623 2.8235 2.0
20624 3.0739 3.0
20625 4.1250 3.0
20626 2.1667 2.0
20627 3.0000 2.0
20628 2.5952 2.0
20629 2.0943 2.0
20630 3.5673 3.0
20631 3.5179 3.0
20632 3.1250 3.0
20633 2.5495 2.0
20634 3.7125 3.0
20635 1.5603 2.0
20636 2.5568 2.0
20637 1.7000 2.0
20638 1.8672 2.0
20639 2.3886 2.0

20640 rows × 2 columns

In [15]:
housing["income_cat"].value_counts()
Out[15]:
3.0    7236
2.0    6581
4.0    3639
5.0    2362
1.0     822
Name: income_cat, dtype: int64
In [16]:
housing["income_cat"].hist()

save_fig('income_category_hist')
In [17]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
In [18]:
for train_index, test_index in split.split(housing, housing["income_cat"]):
    print(train_index)
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
[17606 18632 14650 ... 13908 11159 15775]
In [19]:
strat_train_set.head()
Out[19]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity income_cat
17606 -121.89 37.29 38.0 1568.0 351.0 710.0 339.0 2.7042 286600.0 <1H OCEAN 2.0
18632 -121.93 37.05 14.0 679.0 108.0 306.0 113.0 6.4214 340600.0 <1H OCEAN 5.0
14650 -117.20 32.77 31.0 1952.0 471.0 936.0 462.0 2.8621 196900.0 NEAR OCEAN 2.0
3230 -119.61 36.31 25.0 1847.0 371.0 1460.0 353.0 1.8839 46300.0 INLAND 2.0
3555 -118.59 34.23 17.0 6592.0 1525.0 4459.0 1463.0 3.0347 254500.0 <1H OCEAN 3.0
In [20]:
strat_test_set["income_cat"].value_counts(normalize=True)
Out[20]:
3.0    0.350533
2.0    0.318798
4.0    0.176357
5.0    0.114583
1.0    0.039729
Name: income_cat, dtype: float64
In [21]:
def income_cat_proportions(data):
    return data["income_cat"].value_counts(normalize=True)
In [22]:
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
In [23]:
compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
In [24]:
compare_props
Out[24]:
      Overall  Stratified    Random
1.0  0.039826    0.039729  0.040213
2.0  0.318847    0.318798  0.324370
3.0  0.350581    0.350533  0.358527
4.0  0.176308    0.176357  0.167393
5.0  0.114438    0.114583  0.109496
In [25]:
compare_props["Rand. %error"] = 100 * (compare_props["Random"] - compare_props["Overall"]) / compare_props["Overall"]
In [26]:
compare_props["Strat. %error"] = 100 * (compare_props["Stratified"] - compare_props["Overall"]) / compare_props["Overall"]
In [27]:
compare_props
Out[27]:
      Overall  Stratified    Random  Rand. %error  Strat. %error
1.0  0.039826    0.039729  0.040213      0.973236      -0.243309
2.0  0.318847    0.318798  0.324370      1.732260      -0.015195
3.0  0.350581    0.350533  0.358527      2.266446      -0.013820
4.0  0.176308    0.176357  0.167393     -5.056334       0.027480
5.0  0.114438    0.114583  0.109496     -4.318374       0.127011
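
The same comparison can be reproduced on synthetic data to see that the effect does not depend on the housing table. In the sketch below the category proportions are assumptions chosen for illustration; the stratified test set keeps each category within a fraction of a percentage point of its overall share, while the purely random test set is free to drift.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

rng = np.random.default_rng(42)

# Synthetic data: one imbalanced category column (hypothetical proportions).
data = pd.DataFrame({
    "cat": rng.choice([1, 2, 3, 4, 5], size=10_000,
                      p=[0.04, 0.32, 0.35, 0.18, 0.11]),
})

def proportions(frame):
    """Share of each category, sorted by category label."""
    return frame["cat"].value_counts(normalize=True).sort_index()

# Stratified split: every category keeps (almost exactly) its overall share.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(data, data["cat"]))
strat_test = data.iloc[test_idx]

# Purely random split for comparison.
_, random_test = train_test_split(data, test_size=0.2, random_state=42)

overall = proportions(data)
strat_error = 100 * (proportions(strat_test) - overall) / overall
random_error = 100 * (proportions(random_test) - overall) / overall
print(pd.DataFrame({"Strat. %error": strat_error, "Rand. %error": random_error}))
```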

. pandas.DataFrame.drop

Drop specified labels from rows or columns.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

In [28]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
In [29]:
strat_train_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 10 columns):
longitude             16512 non-null float64
latitude              16512 non-null float64
housing_median_age    16512 non-null float64
total_rooms           16512 non-null float64
total_bedrooms        16354 non-null float64
population            16512 non-null float64
households            16512 non-null float64
median_income         16512 non-null float64
median_house_value    16512 non-null float64
ocean_proximity       16512 non-null object
dtypes: float64(9), object(1)
memory usage: 1.4+ MB

2.4 Discover and Visualize the Data to Gain Insights

In [30]:
housing = strat_train_set.copy()
In [31]:
housing.plot(kind="scatter", x="longitude", y="latitude")

save_fig("bad_visualization_plot")

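. The size of each circle (the marker area set by s) is proportional to the district's population, the color shows the median house value (the jet colormap, running from blue for low prices to red for high prices), and the transparency (alpha) is set to 0.4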

In [32]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100,label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True, sharex=False)

save_fig("better_visualization_plot")
In [33]:
corr_matrix = housing.select_dtypes(include=[np.number]).corr()  # numeric columns only; ocean_proximity is text
corr_matrix
Out[33]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
longitude 1.000000 -0.924478 -0.105848 0.048871 0.076598 0.108030 0.063070 -0.019583 -0.047432
latitude -0.924478 1.000000 0.005766 -0.039184 -0.072419 -0.115222 -0.077647 -0.075205 -0.142724
housing_median_age -0.105848 0.005766 1.000000 -0.364509 -0.325047 -0.298710 -0.306428 -0.111360 0.114110
total_rooms 0.048871 -0.039184 -0.364509 1.000000 0.929379 0.855109 0.918392 0.200087 0.135097
total_bedrooms 0.076598 -0.072419 -0.325047 0.929379 1.000000 0.876320 0.980170 -0.009740 0.047689
population 0.108030 -0.115222 -0.298710 0.855109 0.876320 1.000000 0.904637 0.002380 -0.026920
households 0.063070 -0.077647 -0.306428 0.918392 0.980170 0.904637 1.000000 0.010781 0.064506
median_income -0.019583 -0.075205 -0.111360 0.200087 -0.009740 0.002380 0.010781 1.000000 0.687160
median_house_value -0.047432 -0.142724 0.114110 0.135097 0.047689 -0.026920 0.064506 0.687160 1.000000
In [34]:
corr_matrix["median_house_value"].sort_values(ascending=False)
Out[34]:
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64

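. The median house value tends to rise as the median income rises (correlation of about 0.69)

. The median house value tends to fall slightly at higher latitudes, i.e. farther north (correlation of about -0.14)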
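
The coefficients above are Pearson correlations, which only measure linear association. A short sketch on synthetic data (the numbers are assumptions for illustration, not the housing data) computes the coefficient by hand, checks it against pandas, and shows a perfectly dependent but nonlinear pair whose correlation is close to zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-ins: "value" depends linearly on "income" plus noise.
income = rng.uniform(0.5, 15.0, size=2000)
value = 40_000 * income + rng.normal(0.0, 80_000, size=2000)
frame = pd.DataFrame({"income": income, "value": value})

# Pearson's r: covariance divided by the product of the standard deviations.
dx = income - income.mean()
dy = value - value.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same coefficient.
r_pandas = frame["income"].corr(frame["value"])

# r only captures linear association: y = x**2 on a symmetric range gives r near 0.
x = np.linspace(-1.0, 1.0, 201)
r_nonlinear = np.corrcoef(x, x ** 2)[0, 1]

print(r_manual, r_pandas, r_nonlinear)
```

A coefficient near zero therefore does not mean that two attributes are unrelated.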

In [35]:
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]

scatter_matrix(housing[attributes], figsize=(12, 8))

save_fig("scatter_matrix_plot")
In [36]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)

save_fig("income_vs_house_value_scatterplot")

. The correlation is strong (about 0.69)

. The price cap shows up as a horizontal line at $500,000

2.4.3 Experimenting with Attribute Combinations

In [37]:
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
In [38]:
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
In [39]:
housing["population_per_household"] = housing["population"]/housing["households"]
In [40]:
corr_matrix = housing.select_dtypes(include=[np.number]).corr()  # numeric columns only; ocean_proximity is text
corr_matrix["median_house_value"].sort_values(ascending=False)
Out[40]:
median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64

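. bedrooms_per_room is more strongly correlated with the median house value (-0.26) than the total number of rooms (0.14) or the total number of bedrooms (0.05)

. Districts with a lower bedrooms_per_room ratio tend to have a higher median house value

. rooms_per_household (0.15) is also slightly more strongly correlated than the total number of rooms (0.14)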

In [41]:
housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)
# plt.axis([0, 5, 0, 520000])
plt.show()
In [42]:
housing.describe()
Out[42]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value rooms_per_household bedrooms_per_room population_per_household
count 16512.000000 16512.000000 16512.000000 16512.000000 16354.000000 16512.000000 16512.000000 16512.000000 16512.000000 16512.000000 16354.000000 16512.000000
mean -119.575834 35.639577 28.653101 2622.728319 534.973890 1419.790819 497.060380 3.875589 206990.920724 5.440341 0.212878 3.096437
std 2.001860 2.138058 12.574726 2138.458419 412.699041 1115.686241 375.720845 1.904950 115703.014830 2.611712 0.057379 11.584826
min -124.350000 32.540000 1.000000 6.000000 2.000000 3.000000 2.000000 0.499900 14999.000000 1.130435 0.100000 0.692308
25% -121.800000 33.940000 18.000000 1443.000000 295.000000 784.000000 279.000000 2.566775 119800.000000 4.442040 0.175304 2.431287
50% -118.510000 34.260000 29.000000 2119.500000 433.000000 1164.000000 408.000000 3.540900 179500.000000 5.232284 0.203031 2.817653
75% -118.010000 37.720000 37.000000 3141.000000 644.000000 1719.250000 602.000000 4.744475 263900.000000 6.056361 0.239831 3.281420
max -114.310000 41.950000 52.000000 39320.000000 6210.000000 35682.000000 5358.000000 15.000100 500001.000000 141.909091 1.000000 1243.333333

2.5 Prepare the Data for Machine Learning Algorithms

In [43]:
housing = strat_train_set.drop("median_house_value", axis=1) # drop the labels from the training set
In [44]:
housing_labels = strat_train_set["median_house_value"].copy()

2.5.1 Data Cleaning

In [45]:
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows
Out[45]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity
4629 -118.30 34.07 18.0 3759.0 NaN 3296.0 1462.0 2.2708 <1H OCEAN
6068 -117.86 34.01 16.0 4632.0 NaN 3038.0 727.0 5.1762 <1H OCEAN
17923 -121.97 37.35 30.0 1955.0 NaN 999.0 386.0 4.6328 <1H OCEAN
13656 -117.30 34.05 6.0 2155.0 NaN 1039.0 391.0 1.6675 INLAND
19252 -122.79 38.48 7.0 6837.0 NaN 3468.0 1405.0 3.1662 <1H OCEAN
In [46]:
sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # option 1: drop the affected districts
# sample_incomplete_rows.dropna()    # option 1: drop the affected districts
Out[46]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity
In [47]:
sample_incomplete_rows.drop("total_bedrooms", axis=1)       # option 2: drop the whole attribute
Out[47]:
longitude latitude housing_median_age total_rooms population households median_income ocean_proximity
4629 -118.30 34.07 18.0 3759.0 3296.0 1462.0 2.2708 <1H OCEAN
6068 -117.86 34.01 16.0 4632.0 3038.0 727.0 5.1762 <1H OCEAN
17923 -121.97 37.35 30.0 1955.0 999.0 386.0 4.6328 <1H OCEAN
13656 -117.30 34.05 6.0 2155.0 1039.0 391.0 1.6675 INLAND
19252 -122.79 38.48 7.0 6837.0 3468.0 1405.0 3.1662 <1H OCEAN
In [48]:
median = housing["total_bedrooms"].median()
In [49]:
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median) # option 3: fill with the median
sample_incomplete_rows
Out[49]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity
4629 -118.30 34.07 18.0 3759.0 433.0 3296.0 1462.0 2.2708 <1H OCEAN
6068 -117.86 34.01 16.0 4632.0 433.0 3038.0 727.0 5.1762 <1H OCEAN
17923 -121.97 37.35 30.0 1955.0 433.0 999.0 386.0 4.6328 <1H OCEAN
13656 -117.30 34.05 6.0 2155.0 433.0 1039.0 391.0 1.6675 INLAND
19252 -122.79 38.48 7.0 6837.0 433.0 3468.0 1405.0 3.1662 <1H OCEAN
In [50]:
imputer = SimpleImputer(strategy="median")

. The median can only be computed on numerical attributes, so drop the text attribute first

In [51]:
housing_num = housing.drop('ocean_proximity', axis = 1)
In [52]:
imputer.fit(housing_num)
Out[52]:
SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)
In [53]:
imputer.statistics_
Out[53]:
array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

. Check that the imputer's median for each attribute matches the one computed manually

. pandas.DataFrame.median

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html

. pandas.Series.values

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.values.html

In [54]:
housing_num.median().values
Out[54]:
array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

. Transform the training set

In [55]:
X = imputer.transform(housing_num) # NumPy array holding the transformed features
X
Out[55]:
array([[-121.89  ,   37.29  ,   38.    , ...,  710.    ,  339.    ,
           2.7042],
       [-121.93  ,   37.05  ,   14.    , ...,  306.    ,  113.    ,
           6.4214],
       [-117.2   ,   32.77  ,   31.    , ...,  936.    ,  462.    ,
           2.8621],
       ...,
       [-116.4   ,   34.09  ,    9.    , ..., 2098.    ,  765.    ,
           3.2723],
       [-118.01  ,   33.82  ,   31.    , ..., 1356.    ,  356.    ,
           4.0625],
       [-122.45  ,   37.77  ,   52.    , ..., 1269.    ,  639.    ,
           3.575 ]])
In [56]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index = list(housing.index.values))
In [57]:
housing_tr.loc[sample_incomplete_rows.index.values]
Out[57]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income
4629 -118.30 34.07 18.0 3759.0 433.0 3296.0 1462.0 2.2708
6068 -117.86 34.01 16.0 4632.0 433.0 3038.0 727.0 5.1762
17923 -121.97 37.35 30.0 1955.0 433.0 999.0 386.0 4.6328
13656 -117.30 34.05 6.0 2155.0 433.0 1039.0 391.0 1.6675
19252 -122.79 38.48 7.0 6837.0 433.0 3468.0 1405.0 3.1662
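
The fitted imputer stores the training medians, so the same values can later fill gaps in new data such as the test set. A minimal sketch with made-up numbers (X_train and X_test here are hypothetical single-column arrays, not the housing data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test columns for a single numeric feature.
X_train = np.array([[100.0], [200.0], [np.nan], [400.0], [900.0]])
X_test = np.array([[np.nan], [50.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                        # learns the median of the observed training values

X_test_filled = imputer.transform(X_test)   # reuses the training median for the gap
print(imputer.statistics_, X_test_filled)
```

Fitting only on the training data keeps information from the test set out of the preprocessing step.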

2.5.2 Handling Text and Categorical Attributes

. Preprocess the categorical input attribute ocean_proximity

In [58]:
housing_cat = housing['ocean_proximity']
housing_cat.head(10)
Out[58]:
17606     <1H OCEAN
18632     <1H OCEAN
14650    NEAR OCEAN
3230         INLAND
3555      <1H OCEAN
19480        INLAND
8879      <1H OCEAN
13685        INLAND
4937      <1H OCEAN
4861      <1H OCEAN
Name: ocean_proximity, dtype: object

. The pandas factorize() method converts a string categorical attribute into integer category codes, which are easier for machine learning algorithms to handle

In [59]:
housing_cat_encoded, housing_categories = housing_cat.factorize()
In [60]:
housing_cat_encoded[:10]
Out[60]:
array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)
In [61]:
housing_categories
Out[61]:
Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')

. Use OneHotEncoder to convert the categorical values into one-hot vectors

. sklearn.preprocessing.OneHotEncoder

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [62]:
encoder = OneHotEncoder(categories = 'auto')
In [63]:
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1)) # -1 lets NumPy infer the number of rows; the column dimension is 1
housing_cat_1hot
Out[63]:
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

. OneHotEncoder returns a sparse matrix by default. It can be converted to a dense array if needed

In [64]:
housing_cat_1hot.toarray()
Out[64]:
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
In [65]:
housing_cat_reshaped = housing_cat.values.reshape(-1, 1) # text categories
housing_cat_reshaped
Out[65]:
array([['<1H OCEAN'],
       ['<1H OCEAN'],
       ['NEAR OCEAN'],
       ...,
       ['INLAND'],
       ['<1H OCEAN'],
       ['NEAR BAY']], dtype=object)
In [66]:
housing_cat_1hot = encoder.fit_transform(housing_cat_reshaped)
housing_cat_1hot
Out[66]:
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>
In [67]:
housing_cat_1hot.toarray()
Out[67]:
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
In [68]:
encoder.categories_
Out[68]:
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
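
The two one-hot arrays above put the same categories in different columns because the two encodings order categories differently. A small sketch on four made-up rows illustrates this: factorize() numbers categories in order of first appearance, while OneHotEncoder fitted on the text values sorts them.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cats = pd.Series(["<1H OCEAN", "NEAR OCEAN", "INLAND", "<1H OCEAN"])

# factorize() numbers categories in order of first appearance.
codes, uniques = cats.factorize()

# OneHotEncoder sorts the categories it finds, so columns follow sorted order.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(cats.values.reshape(-1, 1)).toarray()

print(codes, list(uniques))
print(encoder.categories_[0])
print(one_hot)
```

Here NEAR OCEAN gets code 1 from factorize() but lands in the last column of the one-hot matrix, so the column order should always be read from categories_.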

2.5.3 Custom Transformers

In [69]:
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
In [70]:
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
In [71]:
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room = False)
In [72]:
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs
Out[72]:
array([[-121.89, 37.29, 38.0, ..., '<1H OCEAN', 4.625368731563422,
        2.094395280235988],
       [-121.93, 37.05, 14.0, ..., '<1H OCEAN', 6.008849557522124,
        2.7079646017699117],
       [-117.2, 32.77, 31.0, ..., 'NEAR OCEAN', 4.225108225108225,
        2.0259740259740258],
       ...,
       [-116.4, 34.09, 9.0, ..., 'INLAND', 6.34640522875817,
        2.742483660130719],
       [-118.01, 33.82, 31.0, ..., '<1H OCEAN', 5.50561797752809,
        3.808988764044944],
       [-122.45, 37.77, 52.0, ..., 'NEAR BAY', 4.843505477308295,
        1.9859154929577465]], dtype=object)
In [73]:
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs, 
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()
Out[73]:
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity rooms_per_household population_per_household
0 -121.89 37.29 38 1568 351 710 339 2.7042 <1H OCEAN 4.62537 2.0944
1 -121.93 37.05 14 679 108 306 113 6.4214 <1H OCEAN 6.00885 2.70796
2 -117.2 32.77 31 1952 471 936 462 2.8621 NEAR OCEAN 4.22511 2.02597
3 -119.61 36.31 25 1847 371 1460 353 1.8839 INLAND 5.23229 4.13598
4 -118.59 34.23 17 6592 1525 4459 1463 3.0347 <1H OCEAN 4.50581 3.04785

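2.5.4 Feature Scaling

. Min-max scaling: shift and rescale the values so that they fall in the range 0 to 1

.. Provided by the MinMaxScaler transformer; if a range other than 0 to 1 is wanted, change it with the feature_range parameter

. Standardization: subtract the mean and divide by the standard deviation; provided by the StandardScaler transformer; the result has no fixed upper or lower bound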
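
Both scalers can be written out by hand in a line each. The sketch below uses a made-up single column and compares the manual formulas with the scikit-learn transformers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [10.0]])

# Min-max scaling: (x - min) / (max - min), mapped to [0, 1].
minmax_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
minmax_sklearn = MinMaxScaler().fit_transform(X)

# A different target range via feature_range.
minmax_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# Standardization: (x - mean) / std, using the population standard deviation.
standard_manual = (X - X.mean(axis=0)) / X.std(axis=0)
standard_sklearn = StandardScaler().fit_transform(X)

print(minmax_sklearn.ravel())
print(standard_sklearn.ravel())
```

In this toy column the outlier 10.0 squeezes the other min-max values toward 0, whereas standardization is less affected by it.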

2.5.5 Transformation Pipelines

. Build a pipeline for preprocessing the numerical attributes

In [74]:
num_pipeline = Pipeline([
#         ('imputer', Imputer(strategy="median")),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])
In [75]:
housing_num_tr = num_pipeline.fit_transform(housing_num)
In [76]:
housing_num_tr
Out[76]:
array([[-1.15604281,  0.77194962,  0.74333089, ..., -0.31205452,
        -0.08649871,  0.15531753],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.21768338,
        -0.03353391, -0.83628902],
       [ 1.18684903, -1.34218285,  0.18664186, ..., -0.46531516,
        -0.09240499,  0.4222004 ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.3469342 ,
        -0.03055414, -0.52177644],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.02499488,
         0.06150916, -0.30340741],
       [-1.43579109,  0.99645926,  1.85670895, ..., -0.22852947,
        -0.09586294,  0.10180567]])

. Write a transformer that selects a subset of the columns of a pandas DataFrame

In [77]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

. Combine everything into one large pipeline that preprocesses both the numerical and the categorical attributes

In [78]:
df = pd.DataFrame({'age':    [ 3,  29],
                   'height': [94, 170],
                   'weight': [31, 115]})
In [79]:
type(df.values)
Out[79]:
numpy.ndarray
In [80]:
num_attribs = list(housing_num)
num_attribs
Out[80]:
['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income']
In [81]:
cat_attribs = ["ocean_proximity"]
In [82]:
num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
#         ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])
In [83]:
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_encoder', encoder),
#         ('cat_encoder', OneHotEncoder(categories = 'auto')),
#         ('cat_encoder', CategoricalEncoder(encoding="onehot-dense")),
    ])
In [84]:
full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline)
    ])
In [85]:
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
Out[85]:
<16512x16 sparse matrix of type '<class 'numpy.float64'>'
	with 198144 stored elements in Compressed Sparse Row format>
In [86]:
housing_prepared.shape
Out[86]:
(16512, 16)
In [87]:
housing_prepared.toarray()
Out[87]:
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])
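
As an alternative to FeatureUnion with a hand-written DataFrameSelector, scikit-learn 0.20 and later also provide sklearn.compose.ColumnTransformer, which routes named columns to different transformers. The sketch below is not what this notebook uses; it is a minimal illustration on a toy DataFrame with made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with the same kinds of columns as the housing data (made-up values).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "<1H OCEAN"],
})

num_cols = ["total_rooms", "total_bedrooms"]
cat_cols = ["ocean_proximity"]

num_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Each transformer receives only the columns listed for it; outputs are concatenated.
preprocess = ColumnTransformer([
    ("num", num_steps, num_cols),
    ("cat", OneHotEncoder(), cat_cols),
])

toy_prepared = preprocess.fit_transform(toy)
print(toy_prepared.shape)   # 2 scaled numeric columns + 3 one-hot columns
```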

2.6 Select and Train a Model

2.6.1 Training and Evaluating on the Training Set

In [88]:
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels) # train the linear regression model
Out[88]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
In [89]:
# Apply the full pipeline to a few training samples
some_data = housing.iloc[:5]
In [90]:
some_labels = housing_labels.iloc[:5]
In [91]:
some_data_prepared = full_pipeline.transform(some_data)
print("예측:", lin_reg.predict(some_data_prepared))
예측: [210644.60466242 317768.80713423 210956.43317372  59218.98851123
 189747.55854047]

. Compare with the actual values

In [92]:
print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
In [93]:
some_data_prepared
Out[93]:
<5x16 sparse matrix of type '<class 'numpy.float64'>'
	with 60 stored elements in Compressed Sparse Row format>

. Apply the mean_squared_error function to the whole training set

In [94]:
housing_predictions = lin_reg.predict(housing_prepared)
In [95]:
lin_mse = mean_squared_error(housing_labels, housing_predictions)
In [96]:
lin_rmse = np.sqrt(lin_mse)
lin_rmse
Out[96]:
68628.19819848923
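
The RMSE is the square root of the mean of the squared prediction errors. As a worked illustration, the sketch below computes it by hand for only the five predictions and labels printed earlier (rounded to two decimals), so the result differs from the full training-set figure above.

```python
import numpy as np

# First five predictions and labels, rounded from the outputs shown earlier.
predictions = np.array([210644.60, 317768.81, 210956.43, 59218.99, 189747.56])
labels = np.array([286600.0, 340600.0, 196900.0, 46300.0, 254500.0])

errors = predictions - labels
mse = np.mean(errors ** 2)      # mean of the squared errors
rmse = np.sqrt(mse)             # back in the units of the label (dollars)
print(rmse)
```

Because the errors are squared before averaging, the two largest misses (about 76,000 and 65,000 dollars) dominate the result.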

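. Is the model underfitting?

.. Select a more powerful model

.. Feed the algorithm better features

.. Reduce the regularization (constraints) on the model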

In [97]:
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
Out[97]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')
In [98]:
housing_predictions = tree_reg.predict(housing_prepared)
In [99]:
tree_mse = mean_squared_error(housing_labels, housing_predictions)
In [100]:
tree_rmse = np.sqrt(tree_mse)
tree_rmse
Out[100]:
0.0

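. An RMSE of 0.0 on the training set: is the model badly overfitting?

2.6.2 Better Evaluation Using Cross-Validation

. K-fold cross-validation

.. Randomly split the training set into 10 subsets (folds)

.. Each round evaluates on a different fold and trains on the other 9 folds

.. Train and evaluate 10 times in total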
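
The fold mechanics can be seen on a tiny example. The sketch below splits 25 toy samples into 10 folds with KFold and checks that every sample is used for validation exactly once. To my understanding, passing an integer cv to cross_val_score with a regressor uses this same unshuffled KFold, so any randomness in the folds comes from the training set having been shuffled beforehand.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25).reshape(-1, 1)   # 25 toy samples

kfold = KFold(n_splits=10)         # no shuffling by default
validation_counts = np.zeros(len(X), dtype=int)
fold_sizes = []

for train_idx, val_idx in kfold.split(X):
    validation_counts[val_idx] += 1        # each fold validates on a different subset
    fold_sizes.append(len(val_idx))
    assert len(train_idx) + len(val_idx) == len(X)

print(fold_sizes)
print(validation_counts)
```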

In [101]:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
In [102]:
lin_rmse_scores = np.sqrt(-lin_scores)
In [103]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(lin_rmse_scores)
Scores: [66782.73844104 66960.1176327  70347.95243196 74739.57053176
 68031.13387783 71193.8418342  64969.63057129 68281.61137872
 71552.91568652 67665.10086244]
Mean: 69052.4613248459
Standard deviation: 2731.674035171386
In [104]:
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
display_scores(tree_rmse_scores)
Scores: [70194.33680785 66855.16363941 72432.58244769 70758.73896782
 71115.88230639 75585.14172901 70262.86139133 70273.6325285
 75366.87952553 71231.65726027]
Mean: 71407.68766037929
Standard deviation: 2439.4345041191004

. Ensemble learning: random forest

In [105]:
forest_reg = RandomForestRegressor(random_state=42, n_estimators=10)
forest_reg.fit(housing_prepared, housing_labels)
Out[105]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False)
In [106]:
housing_predictions = forest_reg.predict(housing_prepared)
In [107]:
forest_mse = mean_squared_error(housing_labels, housing_predictions)
In [108]:
forest_rmse = np.sqrt(forest_mse)
forest_rmse
Out[108]:
21933.31414779769

. An overfitting model? Compare the training RMSE above with the cross-validation scores below.

In [109]:
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
In [110]:
forest_rmse_scores = np.sqrt(-forest_scores)
In [111]:
display_scores(forest_rmse_scores)
Scores: [51646.44545909 48940.60114882 53050.86323649 54408.98730149
 50922.14870785 56482.50703987 51864.52025526 49760.85037653
 55434.21627933 53326.10093303]
Mean: 52583.72407377466
Standard deviation: 2298.353351147122

2.7 Fine-Tuning the Model

2.7.1 Grid Search

In [112]:
param_grid = [
    # Try 12 (= 3 × 4) combinations of hyperparameters.
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # Then try 6 (= 2 × 3) combinations with bootstrap set to False.
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
In [113]:
forest_reg = RandomForestRegressor(random_state=42)
# Training across five folds gives a total of (12+6)*5=90 rounds of training.
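
The count of 90 can be checked without training anything. The sketch below expands the same grid with ParameterGrid from sklearn.model_selection; the variable names n_combinations, n_folds, and n_fits are chosen for this example only.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# ParameterGrid enumerates every combination in each dict of the list.
n_combinations = len(ParameterGrid(param_grid))
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, "combinations,", n_fits, "cross-validation fits")
```

With refit=True (the setting shown in the GridSearchCV output), one additional fit on the whole training set follows the cross-validation fits.
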
In [114]:
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', 
                           return_train_score=True, n_jobs=-1)
grid_search.fit(housing_prepared, housing_labels)
Out[114]:
GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}, {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)

. Best hyperparameter combination:

In [115]:
grid_search.best_params_
Out[115]:
{'max_features': 8, 'n_estimators': 30}
In [116]:
grid_search.best_estimator_
Out[116]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=None, oob_score=False, random_state=42,
           verbose=0, warm_start=False)

. Check the score of each hyperparameter combination tested during the grid search

In [117]:
cvres = grid_search.cv_results_
In [118]:
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
63669.05791727153 {'max_features': 2, 'n_estimators': 3}
55627.16171305252 {'max_features': 2, 'n_estimators': 10}
53384.57867637289 {'max_features': 2, 'n_estimators': 30}
60965.99185930139 {'max_features': 4, 'n_estimators': 3}
52740.98248528835 {'max_features': 4, 'n_estimators': 10}
50377.344409590376 {'max_features': 4, 'n_estimators': 30}
58663.84733372485 {'max_features': 6, 'n_estimators': 3}
52006.15355973719 {'max_features': 6, 'n_estimators': 10}
50146.465964159885 {'max_features': 6, 'n_estimators': 30}
57869.25504027614 {'max_features': 8, 'n_estimators': 3}
51711.09443660957 {'max_features': 8, 'n_estimators': 10}
49682.25345942335 {'max_features': 8, 'n_estimators': 30}
62895.088889905004 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.14484390074 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.399594730654 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52725.01091081235 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.612956065226 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.51445842374 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
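
A printed list like the one above can be hard to scan. As one possible alternative, cv_results_ can be loaded into a pandas DataFrame and sorted by rank. The sketch runs a deliberately tiny search on synthetic data (small_grid, search, and table are hypothetical names), so its numbers are unrelated to the housing results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic (hypothetical) data, kept small so the search runs quickly.
rng = np.random.RandomState(42)
X = rng.normal(size=(120, 4))
y = X[:, 0] * 3.0 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

small_grid = {'n_estimators': [3, 10], 'max_features': [2, 4]}
search = GridSearchCV(RandomForestRegressor(random_state=42), small_grid,
                      cv=3, scoring='neg_mean_squared_error')
search.fit(X, y)

# cv_results_ is a dict of arrays, so it converts directly to a DataFrame.
results = pd.DataFrame(search.cv_results_)
results['rmse'] = np.sqrt(-results['mean_test_score'])
table = results[['params', 'rmse', 'rank_test_score']].sort_values('rank_test_score')
print(table)
```

Rank 1 corresponds to the highest mean_test_score, which for neg_mean_squared_error means the lowest RMSE.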

2.7.4 Analyzing the Best Model and Its Errors

In [119]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
Out[119]:
array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
       1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
       5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
       1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])
In [120]:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
In [121]:
# cat_one_hot_attribs = list(cat_encoder.categories_[0])
cat_one_hot_attribs = list(encoder.categories_[0])
cat_one_hot_attribs
Out[121]:
['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
In [122]:
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
In [123]:
sorted(zip(feature_importances, attributes), reverse=True)
Out[123]:
[(0.3661589806181342, 'median_income'),
 (0.1647809935615905, 'INLAND'),
 (0.10879295677551573, 'pop_per_hhold'),
 (0.07334423551601242, 'longitude'),
 (0.0629090704826203, 'latitude'),
 (0.05641917918195401, 'rooms_per_hhold'),
 (0.05335107734767581, 'bedrooms_per_room'),
 (0.041143798478729635, 'housing_median_age'),
 (0.014874280890402767, 'population'),
 (0.014672685420543237, 'total_rooms'),
 (0.014257599323407807, 'households'),
 (0.014106483453584102, 'total_bedrooms'),
 (0.010311488326303787, '<1H OCEAN'),
 (0.002856474637320158, 'NEAR OCEAN'),
 (0.00196041559947807, 'NEAR BAY'),
 (6.028038672736599e-05, 'ISLAND')]
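
With importances like these, one option is to keep only the most useful columns. The sketch below picks the indices of the k largest importances and uses them to slice a feature matrix. The helper indices_of_top_k and the placeholder matrix X_demo are hypothetical; the importance values are copied from Out[119].

```python
import numpy as np

# Importances copied from Out[119], in the column order of housing_prepared.
feature_importances = np.array([
    7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
    1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
    5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
    1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])

def indices_of_top_k(importances, k):
    """Return the column indices of the k largest importances, in ascending order."""
    return np.sort(np.argpartition(np.asarray(importances), -k)[-k:])

top_k = indices_of_top_k(feature_importances, 5)

# Placeholder matrix with 16 columns, standing in for the prepared features.
X_demo = np.arange(32).reshape(2, 16)
X_reduced = X_demo[:, top_k]

print(top_k)
print(X_reduced.shape)
```

The selected indices correspond to longitude, latitude, median_income, pop_per_hhold, and INLAND, matching the top five entries of the sorted list above.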

2.7.5 Evaluating the System on the Test Set

In [124]:
final_model = grid_search.best_estimator_
In [125]:
X_test = strat_test_set.drop("median_house_value", axis=1)
In [126]:
y_test = strat_test_set["median_house_value"].copy()
In [127]:
X_test_prepared = full_pipeline.transform(X_test)
In [128]:
final_predictions = final_model.predict(X_test_prepared)
In [129]:
final_mse = mean_squared_error(y_test, final_predictions)
In [130]:
final_rmse = np.sqrt(final_mse)

final_rmse
Out[130]:
47730.22690385927
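
A single RMSE value does not say how precise the estimate is. One common approach is to compute a 95% confidence interval from the per-sample squared errors with scipy.stats.t.interval. The sketch uses synthetic labels and predictions (y_true and y_pred are hypothetical stand-ins for y_test and final_predictions), so the printed interval is illustrative only.

```python
import numpy as np
from scipy import stats

# Synthetic (hypothetical) labels and predictions.
rng = np.random.RandomState(42)
y_true = rng.uniform(100_000, 500_000, size=1000)
y_pred = y_true + rng.normal(scale=50_000, size=1000)

squared_errors = (y_pred - y_true) ** 2
point_rmse = np.sqrt(squared_errors.mean())

# t-interval for the mean squared error, then a square root to get RMSE bounds.
confidence = 0.95
low, high = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                     loc=squared_errors.mean(),
                                     scale=stats.sem(squared_errors)))

print("RMSE:", point_rmse)
print("95% interval:", low, high)
```

Applied to the real test set, the same computation would use (final_predictions - y_test) ** 2 as the squared errors.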

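2.8 Launching, Monitoring, and Maintaining the System

. Write monitoring code: check the system's live performance at regular intervals and trigger an alert when it drops

. Evaluate the system's performance

. Evaluate the quality of the system's input data

. Retrain the model regularly on fresh data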
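
The performance check in the first item could look roughly like the sketch below. It recomputes the RMSE on a batch of recently labeled examples and flags a drop relative to a baseline. The function check_performance, the 10% tolerance, and the sample batches are assumptions for illustration; a real system would also need a source of fresh labels and an alerting channel.

```python
import numpy as np

def check_performance(y_true, y_pred, baseline_rmse, tolerance=0.10):
    """Return (rmse, alert); alert is True when the RMSE exceeds
    baseline_rmse by more than the given relative tolerance."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    alert = rmse > baseline_rmse * (1.0 + tolerance)
    return rmse, alert

# Baseline: the test-set RMSE from section 2.7.5, rounded.
baseline = 47730.23

# Two hypothetical batches of recent labels and predictions.
healthy = check_performance([200_000, 300_000], [240_000, 260_000], baseline)
degraded = check_performance([200_000, 300_000], [300_000, 200_000], baseline)

print("healthy batch: ", healthy)
print("degraded batch:", degraded)
```

Running such a check on a schedule, and retraining when the alert fires repeatedly, is one way to combine the monitoring and retraining items above.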

Congratulations! You now know quite a lot about machine learning.

The End of Chapter 2