2. End-to-End Machine Learning Project

. Welcome to the Machine Learning Housing Company! Your task is to use California census data to build a model of housing prices in the state.

2.0 Setup

# Common imports
from warnings import simplefilter
import numpy as np
import os
import pandas as pd
import sklearn.linear_model
import matplotlib.pyplot as plt
from matplotlib import font_manager, rc
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
# from sklearn.preprocessing import Imputer
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

simplefilter(action='ignore', category=FutureWarning)

# Seed the pseudo-random number generator so that the output is reproducible
np.random.seed(42)

%matplotlib inline

font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()

rc('font', family=font_name)

plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
plt.rcParams['axes.unicode_minus'] = False

# Folder where the figures are saved
PROJECT_ROOT_DIR = "C:/Users/Admin/Desktop/ML/"
# PROJECT_ROOT_DIR = "C:/Users/User/Desktop/ML/"
# PROJECT_ROOT_DIR = "C:/Users/sally/Dropbox/2019-Fall-Semester/ML"

CHAPTER_ID = "end_to_end_project"

IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)

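def save_fig(fig_id, tight_layout=True):
    # Create the target folder if it does not exist yet, then save the current figure
    os.makedirs(IMAGES_PATH, exist_ok=True)
    path = os.path.join(IMAGES_PATH, fig_id + ".png")
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format='png', dpi=300)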

2.3.2 Downloading the Data

housing_path = os.path.join("datasets","housing","")

housing = pd.read_csv(housing_path + "housing.csv")

2.3.3 Taking a Quick Look at the Data

housing.head()

. total_bedrooms is missing in 207 districts (20,433 non-null values out of 20,640)

housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

. Build a frequency table for the ocean_proximity attribute

. pandas.Series.value_counts https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html

housing["ocean_proximity"].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

. Summary statistics for the numerical attributes

housing.describe()

housing.hist(bins=50, figsize=(20,15))

save_fig("attribute_histogram_plots")

. housing_median_age and median_house_value appear to have been capped at a maximum value

. The attributes have very different scales

. Many attributes have a long tail to the right

2.3.4 Creating a Test Set

. Simple random sampling (SRS)

. sklearn.model_selection.train_test_split

Split arrays or matrices into random train and test subsets
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

test_set.head()
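
Simple random sampling can be sketched without any library: shuffle the row positions and set aside the first 20%. This is only an illustration of the idea, not how train_test_split is implemented; the function name simple_random_split is an assumption of this sketch, and its shuffle will not reproduce scikit-learn's row selection.

```python
import random


def simple_random_split(n_rows, test_ratio=0.2, seed=42):
    """Return (train_positions, test_positions) for a simple random split."""
    positions = list(range(n_rows))
    rng = random.Random(seed)  # local generator, so the split is reproducible
    rng.shuffle(positions)
    test_size = int(n_rows * test_ratio)
    return positions[test_size:], positions[:test_size]


# 20,640 rows, as reported by housing.info()
train_idx, test_idx = simple_random_split(20640)
print(len(train_idx), len(test_idx))
```

Fixing the seed plays the same role as random_state=42 in the call above: rerunning the cell yields the same split.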

. Median income is a very important predictor of the median house price

. Use stratified random sampling instead of SRS

# Divide by 1.5 to limit the number of income categories
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)

. median_income ≤ 1.5 maps to 1,

. 1.5 < median_income ≤ 3 maps to 2,

. 3 < median_income ≤ 4.5 maps to 3,

. 4.5 < median_income ≤ 6 maps to 4,

. 6 < median_income ≤ 7.5 maps to 5,

. 7.5 < median_income ≤ 9 maps to 6,

. ...
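
The binning rule can be checked by hand with a small sketch. The function name income_category and its width and cap parameters are assumptions made for illustration; the cap mirrors the step applied right after this in the notebook, where every category of 5 and above is relabeled as 5. The sample incomes are taken from the median_income column of this dataset.

```python
import math


def income_category(median_income, width=1.5, cap=5):
    """Bin a median income with ceil(income / width), then cap the category."""
    category = math.ceil(median_income / width)
    return min(category, cap)


for income in (1.3578, 3.0, 4.125, 4.5625, 8.3252):
    print(income, "->", income_category(income))
```

Note that an income of exactly 3.0 lands in category 2, because the upper edge of each bin is inclusive.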

housing[["median_income", "income_cat"]]

. pandas.DataFrame.where

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html

# Relabel every category of 5 and above as 5 (assign the result back rather than relying on inplace=True)
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

housing[["median_income", "income_cat"]]

housing["income_cat"].value_counts()

3.0    7236
2.0    6581
4.0    3639
5.0    2362
1.0     822
Name: income_cat, dtype: int64

housing["income_cat"].hist()

save_fig('income_category_hist')

. Stratified sampling and the split method

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html

. pandas.DataFrame.loc

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.loc.html

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    print(train_index)
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

[17606 18632 14650 ... 13908 11159 15775]

strat_train_set.head()

strat_test_set["income_cat"].value_counts(normalize=True)

3.0    0.350533
2.0    0.318798
4.0    0.176357
5.0    0.114583
1.0    0.039729
Name: income_cat, dtype: float64
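
The idea behind stratified sampling can be sketched in plain Python: group the rows by stratum, then draw the same fraction from every group. This is a simplified illustration under stated assumptions, not the StratifiedShuffleSplit algorithm; the function name stratified_split is hypothetical, and the toy strata are rebuilt from the category counts printed by value_counts() earlier.

```python
import random
from collections import defaultdict


def stratified_split(strata, test_ratio=0.2, seed=42):
    """Split row positions so each stratum sends about test_ratio of its rows to the test set."""
    rows_by_stratum = defaultdict(list)
    for position, stratum in enumerate(strata):
        rows_by_stratum[stratum].append(position)

    rng = random.Random(seed)
    train, test = [], []
    for stratum in sorted(rows_by_stratum):
        rows = rows_by_stratum[stratum]
        rng.shuffle(rows)
        n_test = round(len(rows) * test_ratio)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


# Category counts as reported by housing["income_cat"].value_counts()
counts = {1: 822, 2: 6581, 3: 7236, 4: 3639, 5: 2362}
strata = [cat for cat, n in counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(strata)
test_share = {cat: sum(strata[i] == cat for i in test_idx) / len(test_idx)
              for cat in counts}
print(test_share)
```

Because each category is sampled separately, the category shares in the test set stay close to the shares in the full dataset.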

def income_cat_proportions(data):
    return data["income_cat"].value_counts(normalize=True)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()

compare_props

compare_props["Rand. %error"] = 100 * (compare_props["Random"] - compare_props["Overall"]) / compare_props["Overall"]

compare_props["Strat. %error"] = 100 * (compare_props["Stratified"] - compare_props["Overall"]) / compare_props["Overall"]

compare_props
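
The two error columns apply the same relative-error formula to each sampling method. The sketch below recomputes it from the category proportions shown in the compare_props table of this notebook; the helper name percent_error is an assumption for illustration.

```python
def percent_error(sample_share, overall_share):
    """Relative deviation of a sample proportion from the overall proportion, in percent."""
    return 100 * (sample_share - overall_share) / overall_share


# Proportions copied from the compare_props table (Overall, Stratified, Random)
overall = {1: 0.039826, 2: 0.318847, 3: 0.350581, 4: 0.176308, 5: 0.114438}
stratified = {1: 0.039729, 2: 0.318798, 3: 0.350533, 4: 0.176357, 5: 0.114583}
random_split = {1: 0.040213, 2: 0.324370, 3: 0.358527, 4: 0.167393, 5: 0.109496}

for cat in overall:
    rand_err = percent_error(random_split[cat], overall[cat])
    strat_err = percent_error(stratified[cat], overall[cat])
    print(cat, round(rand_err, 2), round(strat_err, 2))
```

For every income category, the stratified test set deviates less from the overall proportion than the purely random one.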

. pandas.DataFrame.drop

Drop specified labels from rows or columns.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

strat_train_set.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 10 columns):
longitude             16512 non-null float64
latitude              16512 non-null float64
housing_median_age    16512 non-null float64
total_rooms           16512 non-null float64
total_bedrooms        16354 non-null float64
population            16512 non-null float64
households            16512 non-null float64
median_income         16512 non-null float64
median_house_value    16512 non-null float64
ocean_proximity       16512 non-null object
dtypes: float64(9), object(1)
memory usage: 1.4+ MB

2.4 Exploring and Visualizing the Data to Gain Insights

housing = strat_train_set.copy()

. pandas.DataFrame.plot

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html

housing.plot(kind="scatter", x="longitude", y="latitude")

save_fig("bad_visualization_plot")

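. The size of each circle reflects the district's population (option s), the color represents the median house price (the jet colormap, ranging from blue for low prices to red for high prices), and the transparency (alpha) is set to 0.4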

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100,label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True, sharex=False)

save_fig("better_visualization_plot")

corr_matrix = housing.corr()
corr_matrix

corr_matrix["median_house_value"].sort_values(ascending=False)

median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64

. The median house price tends to rise as the median income rises

. The median house price tends to fall slightly as latitude increases (going north)
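
DataFrame.corr() computes the Pearson correlation coefficient by default. A minimal sketch of that formula follows; the function name pearson_r and the toy income and price lists are hypothetical values chosen to show the two extremes, not data from the housing set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values: one perfectly increasing and one perfectly decreasing relationship
incomes = [2.0, 3.0, 4.0, 5.0]
prices_up = [100.0, 150.0, 200.0, 250.0]
prices_down = [250.0, 200.0, 150.0, 100.0]

print(pearson_r(incomes, prices_up))
print(pearson_r(incomes, prices_down))
```

The coefficient only measures linear relationships, so a value near zero does not rule out a nonlinear dependence.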

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]

scatter_matrix(housing[attributes], figsize=(12, 8))

save_fig("scatter_matrix_plot")

housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)

save_fig("income_vs_house_value_scatterplot")

. The correlation is strong (about 0.69)

. The price cap shows up as a horizontal line at $500,000

2.4.3 Experimenting with Attribute Combinations

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]

housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]

housing["population_per_household"] = housing["population"]/housing["households"]

corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64

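. bedrooms_per_room is more strongly correlated with the median house value (about -0.26) than the total number of rooms or bedrooms

. Districts with a lower bedrooms_per_room ratio tend to have higher median house prices

. rooms_per_household is also more strongly correlated than the total number of rooms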

housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)
# plt.axis([0, 5, 0, 520000])
plt.show()

housing.describe()

2.5 Preparing the Data for Machine Learning Algorithms

housing = strat_train_set.drop("median_house_value", axis=1) # drop the labels from the training set

housing_labels = strat_train_set["median_house_value"].copy()

2.5.1 Data Cleaning

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows

sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # Option 1: drop the affected districts
# sample_incomplete_rows.dropna()    # Option 1 (variant): drop rows with any missing value

sample_incomplete_rows.drop("total_bedrooms", axis=1)       # Option 2: drop the whole attribute

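median = housing["total_bedrooms"].median()

# Option 3: fill with the median (assign the result back rather than relying on inplace=True)
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median)
sample_incomplete_rows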

. sklearn.impute.SimpleImputer

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

imputer = SimpleImputer(strategy="median")

. The median can only be computed on numerical attributes, so drop the text attribute

housing_num = housing.drop('ocean_proximity', axis = 1)

imputer.fit(housing_num)

SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)

imputer.statistics_

array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

. Check that the median of each attribute matches the one computed manually

. pandas.DataFrame.median

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html

. pandas.Series.values

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.values.html

housing_num.median().values

array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])
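
The fit and transform split used by the imputer can be sketched for a single column: fit learns the median from the training data, and transform reuses that stored value on any data it is given. The class name MedianImputerSketch and the toy columns are assumptions for illustration; missing values are represented by None here rather than NaN.

```python
from statistics import median


class MedianImputerSketch:
    """Single-column median imputer: learn the median in fit, reuse it in transform."""

    def fit(self, column):
        self.statistic_ = median(v for v in column if v is not None)
        return self

    def transform(self, column):
        return [self.statistic_ if v is None else v for v in column]


# Toy columns with None marking the missing entries
train_bedrooms = [351.0, 108.0, None, 471.0, 371.0]
new_bedrooms = [None, 200.0]

imputer_sketch = MedianImputerSketch().fit(train_bedrooms)
print(imputer_sketch.statistic_)
print(imputer_sketch.transform(train_bedrooms))
print(imputer_sketch.transform(new_bedrooms))
```

Keeping the learned median on the object is what allows the same value to be applied later to the test set and to new data.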

. Transform the training set

X = imputer.transform(housing_num) # NumPy array containing the transformed features
X

array([[-121.89  ,   37.29  ,   38.    , ...,  710.    ,  339.    ,
           2.7042],
       [-121.93  ,   37.05  ,   14.    , ...,  306.    ,  113.    ,
           6.4214],
       [-117.2   ,   32.77  ,   31.    , ...,  936.    ,  462.    ,
           2.8621],
       ...,
       [-116.4   ,   34.09  ,    9.    , ..., 2098.    ,  765.    ,
           3.2723],
       [-118.01  ,   33.82  ,   31.    , ..., 1356.    ,  356.    ,
           4.0625],
       [-122.45  ,   37.77  ,   52.    , ..., 1269.    ,  639.    ,
           3.575 ]])

housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index = list(housing.index.values))

. pandas.DataFrame.loc

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.loc.html

housing_tr.loc[sample_incomplete_rows.index.values]

2.5.2 Handling Text and Categorical Attributes

. Preprocess the categorical input attribute ocean_proximity

housing_cat = housing['ocean_proximity']
housing_cat.head(10)

17606     <1H OCEAN
18632     <1H OCEAN
14650    NEAR OCEAN
3230         INLAND
3555      <1H OCEAN
19480        INLAND
8879      <1H OCEAN
13685        INLAND
4937      <1H OCEAN
4861      <1H OCEAN
Name: ocean_proximity, dtype: object

. The pandas factorize() method maps a string categorical attribute to integer codes that machine learning algorithms can handle more easily

housing_cat_encoded, housing_categories = housing_cat.factorize()

housing_cat_encoded[:10]

array([0, 0, 1, 2, 0, 2, 0, 2, 0, 0], dtype=int64)

housing_categories

Index(['<1H OCEAN', 'NEAR OCEAN', 'INLAND', 'NEAR BAY', 'ISLAND'], dtype='object')
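
What factorize() does here can be reproduced with a few lines of plain Python: each new value receives the next integer code, in order of first appearance. The function below is an illustrative sketch, not the pandas implementation, applied to the first ten ocean_proximity values printed above.

```python
def factorize(values):
    """Assign integer codes to values in order of first appearance."""
    codes, categories, code_of = [], [], {}
    for value in values:
        if value not in code_of:
            code_of[value] = len(categories)
            categories.append(value)
        codes.append(code_of[value])
    return codes, categories


first_ten = ["<1H OCEAN", "<1H OCEAN", "NEAR OCEAN", "INLAND", "<1H OCEAN",
             "INLAND", "<1H OCEAN", "INLAND", "<1H OCEAN", "<1H OCEAN"]

codes, categories = factorize(first_ten)
print(codes)
print(categories)
```

The codes are arbitrary labels: code 2 is not "twice" code 1, which is one reason to one-hot encode them afterwards.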

. Use OneHotEncoder to convert the categorical values into one-hot vectors

. sklearn.preprocessing.OneHotEncoder

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

encoder = OneHotEncoder(categories = 'auto')

housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1)) # reshape(-1, 1): the row count is inferred, the column count is 1
housing_cat_1hot

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

. OneHotEncoder returns a sparse matrix by default; it can be converted to a dense array if needed

housing_cat_1hot.toarray()

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       ...,
       [0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

housing_cat_reshaped = housing_cat.values.reshape(-1, 1) # text categories
housing_cat_reshaped

array([['<1H OCEAN'],
       ['<1H OCEAN'],
       ['NEAR OCEAN'],
       ...,
       ['INLAND'],
       ['<1H OCEAN'],
       ['NEAR BAY']], dtype=object)

housing_cat_1hot = encoder.fit_transform(housing_cat_reshaped)
housing_cat_1hot

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

housing_cat_1hot.toarray()

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
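
A plain-Python sketch of one-hot encoding with sorted categories helps explain why the two dense outputs above differ. When the encoder is fitted on the factorize() codes, the columns follow the order of first appearance; when it is fitted on the raw strings, the columns follow the sorted category names listed in encoder.categories_. The function name one_hot and the short value list are assumptions for illustration.

```python
def one_hot(values):
    """One-hot encode values, with one column per category in sorted order."""
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        row[column_of[value]] = 1.0
        rows.append(row)
    return rows, categories


values = ["<1H OCEAN", "<1H OCEAN", "NEAR OCEAN", "INLAND", "NEAR BAY", "ISLAND"]
rows, categories = one_hot(values)
print(categories)
for value, row in zip(values, rows):
    print(value, row)
```

With sorted columns, NEAR OCEAN occupies the last column, which matches the third row of the second dense array above.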

2.5.3 Custom Transformers

. numpy.c_

https://docs.scipy.org/doc/numpy/reference/generated/numpy.c_.html

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room = False)

housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs

array([[-121.89, 37.29, 38.0, ..., '<1H OCEAN', 4.625368731563422,
        2.094395280235988],
       [-121.93, 37.05, 14.0, ..., '<1H OCEAN', 6.008849557522124,
        2.7079646017699117],
       [-117.2, 32.77, 31.0, ..., 'NEAR OCEAN', 4.225108225108225,
        2.0259740259740258],
       ...,
       [-116.4, 34.09, 9.0, ..., 'INLAND', 6.34640522875817,
        2.742483660130719],
       [-118.01, 33.82, 31.0, ..., '<1H OCEAN', 5.50561797752809,
        3.808988764044944],
       [-122.45, 37.77, 52.0, ..., 'NEAR BAY', 4.843505477308295,
        1.9859154929577465]], dtype=object)

housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs, 
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()
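
The ratios added by CombinedAttributesAdder can be verified by hand for a single district. The sketch below uses the raw values of training row 17606 (total_rooms 1568, total_bedrooms 351, population 710, households 339); the function name combined_attributes is an assumption for illustration.

```python
def combined_attributes(total_rooms, total_bedrooms, population, households):
    """Compute the three ratio features for a single district."""
    return {
        "rooms_per_household": total_rooms / households,
        "population_per_household": population / households,
        "bedrooms_per_room": total_bedrooms / total_rooms,
    }


# Raw values of training row 17606
extra = combined_attributes(1568.0, 351.0, 710.0, 339.0)
for name, value in extra.items():
    print(name, value)
```

The first two ratios agree with the last two columns of the first row in the housing_extra_attribs array shown above.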

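2.5.4 Feature Scaling

. Min-max scaling: shift and rescale the values so that they fall in the range 0 to 1

.. scikit-learn provides the MinMaxScaler transformer; use the feature_range parameter if a range other than 0 to 1 is wanted

. Standardization: subtract the mean and divide by the standard deviation; scikit-learn provides the StandardScaler transformer, and the result has no fixed upper or lower bound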
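
Both scaling methods reduce to one-line formulas, sketched here in plain Python. The function names are assumptions for illustration, and the sample ages are simply the minimum, quartiles, and maximum of housing_median_age from describe(). The standardization uses the population standard deviation, which is also what StandardScaler uses.

```python
from statistics import mean, pstdev


def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so the minimum maps to low and the maximum to high."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


ages = [1.0, 18.0, 29.0, 37.0, 52.0]
print(min_max_scale(ages))
print(min_max_scale(ages, feature_range=(-1.0, 1.0)))
print(standardize(ages))
```

Min-max scaling is sensitive to outliers, since a single extreme value compresses all the others; standardization is much less affected.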

2.5.5 Transformation Pipelines

. Build a pipeline to preprocess the numerical attributes

num_pipeline = Pipeline([
#         ('imputer', Imputer(strategy="median")),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

housing_num_tr

array([[-1.15604281,  0.77194962,  0.74333089, ..., -0.31205452,
        -0.08649871,  0.15531753],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.21768338,
        -0.03353391, -0.83628902],
       [ 1.18684903, -1.34218285,  0.18664186, ..., -0.46531516,
        -0.09240499,  0.4222004 ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.3469342 ,
        -0.03055414, -0.52177644],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.02499488,
         0.06150916, -0.30340741],
       [-1.43579109,  0.99645926,  1.85670895, ..., -0.22852947,
        -0.09586294,  0.10180567]])

. Write a transformer that selects a subset of the columns of a pandas DataFrame

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

. Combine them all into one big pipeline that preprocesses both the numerical and the categorical attributes

df = pd.DataFrame({'age':    [ 3,  29],
                   'height': [94, 170],
                   'weight': [31, 115]})

type(df.values)

numpy.ndarray

num_attribs = list(housing_num)
num_attribs

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income']

cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
#         ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('cat_encoder', encoder),
#         ('cat_encoder', OneHotEncoder(categories = 'auto')),
#         ('cat_encoder', CategoricalEncoder(encoding="onehot-dense")),
    ])

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline)
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared

<16512x16 sparse matrix of type '<class 'numpy.float64'>'
	with 198144 stored elements in Compressed Sparse Row format>

housing_prepared.shape

(16512, 16)

housing_prepared.toarray()

array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])

2.6 Selecting and Training a Model

2.6.1 Training and Evaluating on the Training Set

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels) # train a linear regression model

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

. pandas.DataFrame.iloc

https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.DataFrame.iloc.html

# Apply the full pipeline to a few training instances
some_data = housing.iloc[:5]

some_labels = housing_labels.iloc[:5]

some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))

Predictions: [210644.60466242 317768.80713423 210956.43317372  59218.98851123
 189747.55854047]

. Compare against the actual values

print("Labels:", list(some_labels))

Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

some_data_prepared

<5x16 sparse matrix of type '<class 'numpy.float64'>'
	with 60 stored elements in Compressed Sparse Row format>

. Apply the mean_squared_error function to the whole training set

housing_predictions = lin_reg.predict(housing_prepared)

lin_mse = mean_squared_error(housing_labels, housing_predictions)

lin_rmse = np.sqrt(lin_mse)
lin_rmse

68628.19819848923
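
The RMSE is the square root of the mean squared difference between predictions and labels. The sketch below spells this out and applies it to the five predictions and labels printed above; the function name rmse is an assumption for illustration. The result applies to these five instances only, so it is not expected to match the RMSE on the whole training set.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# The five labels and linear-regression predictions printed above
labels = [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
predictions = [210644.60466242, 317768.80713423, 210956.43317372,
               59218.98851123, 189747.55854047]

print(rmse(labels, predictions))
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.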

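. Is the model underfitting?

.. Select a more powerful model

.. Feed the training algorithm better features

.. Reduce the constraints (regularization) on the model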

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=42, splitter='best')

housing_predictions = tree_reg.predict(housing_prepared)

tree_mse = mean_squared_error(housing_labels, housing_predictions)

tree_rmse = np.sqrt(tree_mse)
tree_rmse

0.0

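. Is the model overfitting? A training RMSE of 0.0 more likely means that the tree has fit the training set too closely than that it is perfect

2.6.2 Better Evaluation Using Cross-Validation

. K-fold cross-validation

.. Split the training set into 10 distinct subsets called folds

.. Each round evaluates on a different fold and trains on the other 9 folds

.. Train and evaluate 10 times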
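
The fold bookkeeping can be sketched as a small generator. It assigns contiguous blocks of rows to the folds without shuffling, which is an assumption of this sketch; the name k_fold_indices is hypothetical. Each row lands in the validation fold exactly once and in the training part the other nine times.

```python
def k_fold_indices(n_rows, n_folds=10):
    """Yield (train_positions, validation_positions) for each fold, in row order."""
    base, extra = divmod(n_rows, n_folds)
    # The first `extra` folds receive one additional row each
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_rows))
        yield train, validation
        start = stop


# 16,512 rows, the size of the training set
folds = list(k_fold_indices(16512, n_folds=10))
print([len(validation) for _, validation in folds])
```

Training and scoring a model once per fold yields ten scores, which is what makes the mean and standard deviation reported below possible.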

lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)

lin_rmse_scores = np.sqrt(-lin_scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(lin_rmse_scores)

Scores: [66782.73844104 66960.1176327  70347.95243196 74739.57053176
 68031.13387783 71193.8418342  64969.63057129 68281.61137872
 71552.91568652 67665.10086244]
Mean: 69052.4613248459
Standard deviation: 2731.674035171386

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
display_scores(tree_rmse_scores)

Scores: [70194.33680785 66855.16363941 72432.58244769 70758.73896782
 71115.88230639 75585.14172901 70262.86139133 70273.6325285
 75366.87952553 71231.65726027]
Mean: 71407.68766037929
Standard deviation: 2439.4345041191004

. Ensemble learning: random forest

forest_reg = RandomForestRegressor(random_state = 42,n_estimators=10)
forest_reg.fit(housing_prepared, housing_labels)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

housing_predictions = forest_reg.predict(housing_prepared)

forest_mse = mean_squared_error(housing_labels, housing_predictions)

forest_rmse = np.sqrt(forest_mse)
forest_rmse

21933.31414779769

. Is the model overfitting? Compare this training RMSE with the cross-validation scores that follow

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)

forest_rmse_scores = np.sqrt(-forest_scores)

display_scores(forest_rmse_scores)

Scores: [51646.44545909 48940.60114882 53050.86323649 54408.98730149
 50922.14870785 56482.50703987 51864.52025526 49760.85037653
 55434.21627933 53326.10093303]
Mean: 52583.72407377466
Standard deviation: 2298.353351147122

2.7 Fine-Tuning the Model

2.7.1 Grid Search

param_grid = [
    # try 12 (=3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (=2×3) combinations with bootstrap set to False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# with five folds, this amounts to (12+6)*5=90 rounds of training

grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', 
                           return_train_score=True, n_jobs=-1)
grid_search.fit(housing_prepared, housing_labels)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid=[{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]}, {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)
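
The count of 18 combinations and 90 training rounds can be checked by expanding the grid by hand. The helper expand_grid is a hypothetical sketch written for this check, not a scikit-learn function: it takes the Cartesian product of the value lists within each dictionary and concatenates the results.

```python
from itertools import product

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]


def expand_grid(grid):
    """List every hyperparameter combination described by a list of sub-grids."""
    combinations = []
    for sub_grid in grid:
        names = sorted(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            combinations.append(dict(zip(names, values)))
    return combinations


combinations = expand_grid(param_grid)
cv_folds = 5
print(len(combinations), "combinations")
print(len(combinations) * cv_folds, "cross-validation fits")
```

Since refit=True, the grid search additionally retrains the best combination once on the whole training set after the 90 cross-validation fits.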

. Best combination of parameters:

grid_search.best_params_

{'max_features': 8, 'n_estimators': 30}

grid_search.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=None, oob_score=False, random_state=42,
           verbose=0, warm_start=False)

. Check the score of each hyperparameter combination tested during the grid search

cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

63669.05791727153 {'max_features': 2, 'n_estimators': 3}
55627.16171305252 {'max_features': 2, 'n_estimators': 10}
53384.57867637289 {'max_features': 2, 'n_estimators': 30}
60965.99185930139 {'max_features': 4, 'n_estimators': 3}
52740.98248528835 {'max_features': 4, 'n_estimators': 10}
50377.344409590376 {'max_features': 4, 'n_estimators': 30}
58663.84733372485 {'max_features': 6, 'n_estimators': 3}
52006.15355973719 {'max_features': 6, 'n_estimators': 10}
50146.465964159885 {'max_features': 6, 'n_estimators': 30}
57869.25504027614 {'max_features': 8, 'n_estimators': 3}
51711.09443660957 {'max_features': 8, 'n_estimators': 10}
49682.25345942335 {'max_features': 8, 'n_estimators': 30}
62895.088889905004 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.14484390074 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.399594730654 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52725.01091081235 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.612956065226 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.51445842374 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

2.7.4 Analyzing the Best Model and Its Errors

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
       1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
       5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
       1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])

extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]

# cat_one_hot_attribs = list(cat_encoder.categories_[0])
cat_one_hot_attribs = list(encoder.categories_[0])
cat_one_hot_attribs

['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']

attributes = num_attribs + extra_attribs + cat_one_hot_attribs

sorted(zip(feature_importances, attributes), reverse=True)

[(0.3661589806181342, 'median_income'),
 (0.1647809935615905, 'INLAND'),
 (0.10879295677551573, 'pop_per_hhold'),
 (0.07334423551601242, 'longitude'),
 (0.0629090704826203, 'latitude'),
 (0.05641917918195401, 'rooms_per_hhold'),
 (0.05335107734767581, 'bedrooms_per_room'),
 (0.041143798478729635, 'housing_median_age'),
 (0.014874280890402767, 'population'),
 (0.014672685420543237, 'total_rooms'),
 (0.014257599323407807, 'households'),
 (0.014106483453584102, 'total_bedrooms'),
 (0.010311488326303787, '<1H OCEAN'),
 (0.002856474637320158, 'NEAR OCEAN'),
 (0.00196041559947807, 'NEAR BAY'),
 (6.028038672736599e-05, 'ISLAND')]

2.7.5 Evaluating the System on the Test Set

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)

y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)

final_rmse = np.sqrt(final_mse)

final_rmse

47730.22690385927

2.8 Launching, Monitoring, and Maintaining the System

Output of housing.head() (section 2.3.3):

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
0	-122.23	37.88	41.0	880.0	129.0	322.0	126.0	8.3252	452600.0	NEAR BAY
1	-122.22	37.86	21.0	7099.0	1106.0	2401.0	1138.0	8.3014	358500.0	NEAR BAY
2	-122.24	37.85	52.0	1467.0	190.0	496.0	177.0	7.2574	352100.0	NEAR BAY
3	-122.25	37.85	52.0	1274.0	235.0	558.0	219.0	5.6431	341300.0	NEAR BAY
4	-122.25	37.85	52.0	1627.0	280.0	565.0	259.0	3.8462	342200.0	NEAR BAY

Output of housing.describe() on the full dataset (section 2.3.3):

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
count	20640.000000	20640.000000	20640.000000	20640.000000	20433.000000	20640.000000	20640.000000	20640.000000	20640.000000
mean	-119.569704	35.631861	28.639486	2635.763081	537.870553	1425.476744	499.539680	3.870671	206855.816909
std	2.003532	2.135952	12.585558	2181.615252	421.385070	1132.462122	382.329753	1.899822	115395.615874
min	-124.350000	32.540000	1.000000	2.000000	1.000000	3.000000	1.000000	0.499900	14999.000000
25%	-121.800000	33.930000	18.000000	1447.750000	296.000000	787.000000	280.000000	2.563400	119600.000000
50%	-118.490000	34.260000	29.000000	2127.000000	435.000000	1166.000000	409.000000	3.534800	179700.000000
75%	-118.010000	37.710000	37.000000	3148.000000	647.000000	1725.000000	605.000000	4.743250	264725.000000
max	-114.310000	41.950000	52.000000	39320.000000	6445.000000	35682.000000	6082.000000	15.000100	500001.000000

Output of test_set.head() (section 2.3.4):

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity
20046	-119.01	36.06	25.0	1505.0	NaN	1392.0	359.0	1.6812	47700.0	INLAND
3024	-119.46	35.14	30.0	2943.0	NaN	1565.0	584.0	2.5313	45800.0	INLAND
15663	-122.44	37.80	52.0	3830.0	NaN	1310.0	963.0	3.4801	500001.0	NEAR BAY
20484	-118.72	34.28	17.0	3051.0	NaN	1705.0	495.0	5.7376	218600.0	<1H OCEAN
9814	-121.93	36.62	34.0	2351.0	NaN	1063.0	428.0	3.7250	278000.0	NEAR OCEAN

Output of housing[["median_income", "income_cat"]] before capping the categories at 5 (section 2.3.4):

	median_income	income_cat
0	8.3252	6.0
1	8.3014	6.0
2	7.2574	5.0
3	5.6431	4.0
4	3.8462	3.0
5	4.0368	3.0
6	3.6591	3.0
7	3.1200	3.0
8	2.0804	2.0
9	3.6912	3.0
10	3.2031	3.0
11	3.2705	3.0
12	3.0750	3.0
13	2.6736	2.0
14	1.9167	2.0
15	2.1250	2.0
16	2.7750	2.0
17	2.1202	2.0
18	1.9911	2.0
19	2.6033	2.0
20	1.3578	1.0
21	1.7135	2.0
22	1.7250	2.0
23	2.1806	2.0
24	2.6000	2.0
25	2.4038	2.0
26	2.4597	2.0
27	1.8080	2.0
28	1.6424	2.0
29	1.6875	2.0
...	...	...
20610	1.3631	1.0
20611	1.2857	1.0
20612	1.4934	1.0
20613	1.4958	1.0
20614	2.4695	2.0
20615	2.3598	2.0
20616	2.0469	2.0
20617	3.3021	3.0
20618	2.2500	2.0
20619	2.7303	2.0
20620	4.5625	4.0
20621	2.3661	2.0
20622	2.4167	2.0
20623	2.8235	2.0
20624	3.0739	3.0
20625	4.1250	3.0
20626	2.1667	2.0
20627	3.0000	2.0
20628	2.5952	2.0
20629	2.0943	2.0
20630	3.5673	3.0
20631	3.5179	3.0
20632	3.1250	3.0
20633	2.5495	2.0
20634	3.7125	3.0
20635	1.5603	2.0
20636	2.5568	2.0
20637	1.7000	2.0
20638	1.8672	2.0
20639	2.3886	2.0

Output of housing[["median_income", "income_cat"]] after capping the categories at 5 (section 2.3.4):

	median_income	income_cat
0	8.3252	5.0
1	8.3014	5.0
2	7.2574	5.0
3	5.6431	4.0
4	3.8462	3.0
5	4.0368	3.0
6	3.6591	3.0
7	3.1200	3.0
8	2.0804	2.0
9	3.6912	3.0
10	3.2031	3.0
11	3.2705	3.0
12	3.0750	3.0
13	2.6736	2.0
14	1.9167	2.0
15	2.1250	2.0
16	2.7750	2.0
17	2.1202	2.0
18	1.9911	2.0
19	2.6033	2.0
20	1.3578	1.0
21	1.7135	2.0
22	1.7250	2.0
23	2.1806	2.0
24	2.6000	2.0
25	2.4038	2.0
26	2.4597	2.0
27	1.8080	2.0
28	1.6424	2.0
29	1.6875	2.0
...	...	...
20610	1.3631	1.0
20611	1.2857	1.0
20612	1.4934	1.0
20613	1.4958	1.0
20614	2.4695	2.0
20615	2.3598	2.0
20616	2.0469	2.0
20617	3.3021	3.0
20618	2.2500	2.0
20619	2.7303	2.0
20620	4.5625	4.0
20621	2.3661	2.0
20622	2.4167	2.0
20623	2.8235	2.0
20624	3.0739	3.0
20625	4.1250	3.0
20626	2.1667	2.0
20627	3.0000	2.0
20628	2.5952	2.0
20629	2.0943	2.0
20630	3.5673	3.0
20631	3.5179	3.0
20632	3.1250	3.0
20633	2.5495	2.0
20634	3.7125	3.0
20635	1.5603	2.0
20636	2.5568	2.0
20637	1.7000	2.0
20638	1.8672	2.0
20639	2.3886	2.0

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	ocean_proximity	income_cat
17606	-121.89	37.29	38.0	1568.0	351.0	710.0	339.0	2.7042	286600.0	<1H OCEAN	2.0
18632	-121.93	37.05	14.0	679.0	108.0	306.0	113.0	6.4214	340600.0	<1H OCEAN	5.0
14650	-117.20	32.77	31.0	1952.0	471.0	936.0	462.0	2.8621	196900.0	NEAR OCEAN	2.0
3230	-119.61	36.31	25.0	1847.0	371.0	1460.0	353.0	1.8839	46300.0	INLAND	2.0
3555	-118.59	34.23	17.0	6592.0	1525.0	4459.0	1463.0	3.0347	254500.0	<1H OCEAN	3.0

	Overall	Stratified	Random
1.0	0.039826	0.039729	0.040213
2.0	0.318847	0.318798	0.324370
3.0	0.350581	0.350533	0.358527
4.0	0.176308	0.176357	0.167393
5.0	0.114438	0.114583	0.109496

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
longitude	1.000000	-0.924478	-0.105848	0.048871	0.076598	0.108030	0.063070	-0.019583	-0.047432
latitude	-0.924478	1.000000	0.005766	-0.039184	-0.072419	-0.115222	-0.077647	-0.075205	-0.142724
housing_median_age	-0.105848	0.005766	1.000000	-0.364509	-0.325047	-0.298710	-0.306428	-0.111360	0.114110
total_rooms	0.048871	-0.039184	-0.364509	1.000000	0.929379	0.855109	0.918392	0.200087	0.135097
total_bedrooms	0.076598	-0.072419	-0.325047	0.929379	1.000000	0.876320	0.980170	-0.009740	0.047689
population	0.108030	-0.115222	-0.298710	0.855109	0.876320	1.000000	0.904637	0.002380	-0.026920
households	0.063070	-0.077647	-0.306428	0.918392	0.980170	0.904637	1.000000	0.010781	0.064506
median_income	-0.019583	-0.075205	-0.111360	0.200087	-0.009740	0.002380	0.010781	1.000000	0.687160
median_house_value	-0.047432	-0.142724	0.114110	0.135097	0.047689	-0.026920	0.064506	0.687160	1.000000

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value	rooms_per_household	bedrooms_per_room	population_per_household
count	16512.000000	16512.000000	16512.000000	16512.000000	16354.000000	16512.000000	16512.000000	16512.000000	16512.000000	16512.000000	16354.000000	16512.000000
mean	-119.575834	35.639577	28.653101	2622.728319	534.973890	1419.790819	497.060380	3.875589	206990.920724	5.440341	0.212878	3.096437
std	2.001860	2.138058	12.574726	2138.458419	412.699041	1115.686241	375.720845	1.904950	115703.014830	2.611712	0.057379	11.584826
min	-124.350000	32.540000	1.000000	6.000000	2.000000	3.000000	2.000000	0.499900	14999.000000	1.130435	0.100000	0.692308
25%	-121.800000	33.940000	18.000000	1443.000000	295.000000	784.000000	279.000000	2.566775	119800.000000	4.442040	0.175304	2.431287
50%	-118.510000	34.260000	29.000000	2119.500000	433.000000	1164.000000	408.000000	3.540900	179500.000000	5.232284	0.203031	2.817653
75%	-118.010000	37.720000	37.000000	3141.000000	644.000000	1719.250000	602.000000	4.744475	263900.000000	6.056361	0.239831	3.281420
max	-114.310000	41.950000	52.000000	39320.000000	6210.000000	35682.000000	5358.000000	15.000100	500001.000000	141.909091	1.000000	1243.333333

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	ocean_proximity
4629	-118.30	34.07	18.0	3759.0	NaN	3296.0	1462.0	2.2708	<1H OCEAN
6068	-117.86	34.01	16.0	4632.0	NaN	3038.0	727.0	5.1762	<1H OCEAN
17923	-121.97	37.35	30.0	1955.0	NaN	999.0	386.0	4.6328	<1H OCEAN
13656	-117.30	34.05	6.0	2155.0	NaN	1039.0	391.0	1.6675	INLAND
19252	-122.79	38.48	7.0	6837.0	NaN	3468.0	1405.0	3.1662	<1H OCEAN

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	ocean_proximity	rooms_per_household	population_per_household
0	-121.89	37.29	38	1568	351	710	339	2.7042	<1H OCEAN	4.62537	2.0944
1	-121.93	37.05	14	679	108	306	113	6.4214	<1H OCEAN	6.00885	2.70796
2	-117.2	32.77	31	1952	471	936	462	2.8621	NEAR OCEAN	4.22511	2.02597
3	-119.61	36.31	25	1847	371	1460	353	1.8839	INLAND	5.23229	4.13598
4	-118.59	34.23	17	6592	1525	4459	1463	3.0347	<1H OCEAN	4.50581	3.04785

2. An end-to-end machine learning project

. Welcome to Machine Learning Housing Corporation! Your task is to build a model of housing prices in California using the California census data.

2.0 Setup

2.3.2 Downloading the data

2.3.3 Taking a quick look at the data

. total_bedrooms is missing for 207 districts

. Build a frequency table for the ocean_proximity feature

. pandas.Series.value_counts https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
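A minimal sketch of the frequency table and the missing-value count, assuming a toy DataFrame named `housing` whose values are made up for illustration (the real notebook loads the full dataset):

```python
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
housing = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "<1H OCEAN"],
    "total_bedrooms": [129.0, None, 190.0, 235.0, None, 280.0],
})

# Frequency table of the categorical feature, most frequent category first.
counts = housing["ocean_proximity"].value_counts()
print(counts)

# Number of districts (rows) where total_bedrooms is missing.
n_missing = housing["total_bedrooms"].isna().sum()
print(n_missing)
```

On the real data the same `isna().sum()` call is one way to arrive at the count of districts with a missing total_bedrooms value.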

. Summary statistics of the numerical features

. The maximum values of housing_median_age and median_house_value appear to have been capped

. The features have very different scales

. Many features have a long right tail

2.3.4 Creating a test set

. Simple random sampling (SRS)

. sklearn.model_selection.train_test_split
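Simple random sampling with `train_test_split` could look like the sketch below. The 20% test size and the seed 42 are assumptions chosen for illustration, and the DataFrame is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (100 districts).
housing = pd.DataFrame({"median_income": np.linspace(0.5, 15.0, 100)})

# Hold out 20% of the rows at random; random_state makes the split repeatable.
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(len(train_set), "train +", len(test_set), "test")
```

Fixing `random_state` keeps the same rows in the test set across runs, so the model never gets to see them during development.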

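. Median income is a very important predictor of the median house price

. Use stratified random sampling rather than SRS

. median_income ≤ 1.5 gives category 1,

. median_income ≤ 3 gives category 2,

. median_income ≤ 4.5 gives category 3,

. median_income ≤ 6 gives category 4,

. median_income ≤ 7.5 gives category 5,

. median_income ≤ 9 gives category 6,

. ...

. pandas.DataFrame.where (used to merge every category above 5 into category 5)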
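The income categories can be sketched as a ceiling of `median_income / 1.5`, followed by a `where` call that merges everything above 5 into category 5. The incomes below are a few values picked for illustration, and the non-inplace `Series.where` form is a choice made for this sketch:

```python
import numpy as np
import pandas as pd

housing = pd.DataFrame(
    {"median_income": [8.3252, 7.2574, 5.6431, 3.8462, 3.0, 2.0804, 1.3578]}
)

# Dividing by 1.5 and rounding up yields categories 1, 2, 3, ...
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)

# Keep values below 5 as they are; replace everything else with 5.0.
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
print(housing)
```

Merging the sparse high-income categories keeps every stratum large enough to be sampled reliably.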

. Stratified sampling and the split method

. pandas.DataFrame.loc

. pandas.DataFrame.drop
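A sketch of the stratified split, assuming a synthetic `income_cat` column with made-up category counts. `StratifiedShuffleSplit.split` yields row positions, which match the labels used by `loc` here only because the toy frame has a default integer index:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic data: 250 districts spread unevenly over five income categories.
housing = pd.DataFrame({
    "income_cat": np.repeat([1.0, 2.0, 3.0, 4.0, 5.0], [10, 80, 90, 40, 30]),
})
housing["median_house_value"] = np.arange(len(housing), dtype=float)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# Compare category proportions in the full data and in the stratified test set.
overall = housing["income_cat"].value_counts(normalize=True).sort_index()
stratified = strat_test_set["income_cat"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"Overall": overall, "Stratified": stratified}))

# income_cat was only needed for sampling, so drop it afterwards.
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)
```

The printed comparison plays the same role as the Overall / Stratified / Random table in the outputs: the stratified proportions should track the overall ones closely.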

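2.4 Exploring and visualizing the data to gain insights

. pandas.DataFrame.plot

. The radius of each circle is proportional to the district's population, the color represents the median house price (the jet colormap, ranging from blue for low prices to red for high prices), and the transparency (alpha) is set to 0.4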
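A hedged sketch of such a scatter plot on random data. The divisor 100 applied to the population and the figure size are assumptions, and the Agg backend is selected only so the sketch runs without a display. Note that matplotlib's `s` argument sets the marker area rather than the radius:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# Random stand-in for the housing data.
housing = pd.DataFrame({
    "longitude": rng.uniform(-124.3, -114.3, 50),
    "latitude": rng.uniform(32.5, 42.0, 50),
    "population": rng.randint(100, 5000, 50),
    "median_house_value": rng.uniform(15000, 500000, 50),
})

ax = housing.plot(
    kind="scatter", x="longitude", y="latitude",
    alpha=0.4,                        # transparency
    s=housing["population"] / 100,    # marker size grows with population
    c="median_house_value",           # color encodes the median house price
    cmap="jet", colorbar=True, figsize=(10, 7),
)
```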

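. The median house price rises as the median income rises

. The median house price tends to fall at higher latitudes (further north)

. The correlation between median income and median house price is very strong

. The price cap shows up as a horizontal line at $500,000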
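The correlations behind these observations come from `DataFrame.corr`. A minimal sketch on synthetic data follows; the linear relation, the noise level and the cap are assumptions made to mimic the shape of the real data:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
median_income = rng.uniform(0.5, 15.0, 500)

housing = pd.DataFrame({
    "median_income": median_income,
    # Synthetic target: rises with income, plus noise, capped at 500,001.
    "median_house_value": np.minimum(
        40000 * median_income + rng.normal(0, 50000, 500), 500001.0
    ),
    "latitude": rng.uniform(32.5, 42.0, 500),
})

# Pearson correlation between every pair of numerical columns.
corr_matrix = housing.corr()
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```

Sorting one column of the matrix is a quick way to rank the features by their linear relationship with the target.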

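2.4.3 Experimenting with feature combinations

. bedrooms_per_room is more strongly correlated with the median house price than the total number of rooms or bedrooms

. Districts with a lower bedrooms_per_room tend to have a higher median house price

. rooms_per_household is also more strongly correlated than the total number of rooms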
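The combined features are plain column ratios. The sketch below recomputes them for the first three training rows shown in the outputs above:

```python
import pandas as pd

# First three rows of the training set, copied from the output tables.
housing = pd.DataFrame({
    "total_rooms": [1568.0, 679.0, 1952.0],
    "total_bedrooms": [351.0, 108.0, 471.0],
    "population": [710.0, 306.0, 936.0],
    "households": [339.0, 113.0, 462.0],
})

housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
print(housing[["rooms_per_household", "bedrooms_per_room", "population_per_household"]])
```

The first row gives a rooms_per_household of about 4.62537 and a population_per_household of about 2.0944, in line with the prepared-data table.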

2.5 Preparing the data for machine learning algorithms

2.5.1 Data cleaning

. pandas.DataFrame.any

. pandas.DataFrame.dropna

. pandas.DataFrame.fillna
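These three methods map onto the usual ways of handling the missing total_bedrooms values. A sketch on a toy frame follows; the values and the names `option1` to `option3` are illustrative only:

```python
import numpy as np
import pandas as pd

housing = pd.DataFrame({
    "total_rooms": [1505.0, 2943.0, 3830.0, 1568.0],
    "total_bedrooms": [np.nan, 584.0, np.nan, 351.0],
})

# Rows that contain at least one missing value.
incomplete_rows = housing[housing.isnull().any(axis=1)]

# Option 1: drop the districts with a missing value.
option1 = housing.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole feature.
option2 = housing.drop("total_bedrooms", axis=1)

# Option 3: fill the gaps with the median of the feature.
median = housing["total_bedrooms"].median()
option3 = housing.copy()
option3["total_bedrooms"] = option3["total_bedrooms"].fillna(median)
```

With option 3 the median has to be kept, since the same value is needed later to fill gaps in the test set and in new data.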

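. sklearn.impute.SimpleImputer (formerly sklearn.preprocessing.Imputer)

. Drop the text feature, because the median can only be computed on numerical features

. Check that each feature's median matches the manually computed value

. pandas.DataFrame.median

. pandas.Series.values

. Transform the training set

. pandas.DataFrame.loc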
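The imputation steps can be sketched with `SimpleImputer`, the class imported in the setup cell. The toy values below are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

housing = pd.DataFrame({
    "total_rooms": [1505.0, 2943.0, 3830.0, 1568.0],
    "total_bedrooms": [np.nan, 584.0, np.nan, 351.0],
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "<1H OCEAN"],
})

# The median is only defined for numerical features, so set the text column aside.
housing_num = housing.drop("ocean_proximity", axis=1)

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)

# The learned medians should match the ones computed by hand.
print(imputer.statistics_)
print(housing_num.median().values)

# transform() returns a NumPy array; wrap it back into a DataFrame.
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
```

Passing the original index when rebuilding the DataFrame keeps `housing_tr.loc[...]` lookups aligned with the source rows.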

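2.5.2 Handling text and categorical features

. Preprocess the categorical input feature ocean_proximity

. The pandas factorize() method converts a string categorical feature into an integer-coded categorical feature that machine learning algorithms can handle more easily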
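A small sketch of `factorize()` on the ocean_proximity values of the first five training rows:

```python
import pandas as pd

housing_cat = pd.Series(
    ["<1H OCEAN", "<1H OCEAN", "NEAR OCEAN", "INLAND", "<1H OCEAN"],
    name="ocean_proximity",
)

# factorize() returns the integer codes and the categories in order of first appearance.
housing_cat_encoded, housing_categories = housing_cat.factorize()
print(housing_cat_encoded)
print(housing_categories)
```

The integer codes carry no meaningful order, which is one reason to follow up with one-hot encoding.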

. Use OneHotEncoder to convert the categorical values into one-hot vectors

. sklearn.preprocessing.OneHotEncoder

. OneHotEncoder returns a sparse matrix by default; it can be converted to a dense array if needed
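A sketch of one-hot encoding the same five values, assuming a scikit-learn version whose `OneHotEncoder` accepts string categories directly:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2D input: one row per sample, one column per feature.
housing_cat = np.array(
    [["<1H OCEAN"], ["<1H OCEAN"], ["NEAR OCEAN"], ["INLAND"], ["<1H OCEAN"]]
)

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat)  # sparse matrix by default

# Convert to a dense NumPy array when needed.
dense = housing_cat_1hot.toarray()
print(encoder.categories_)
print(dense)
```

The sparse result stores only the positions of the non-zero entries, which matters once a feature has many categories.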

2.5.3 Custom transformers

. numpy.c_
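A sketch of a custom transformer that appends the ratio features with `np.c_`. The class name `CombinedAttributesAdder` and the column positions are assumptions that fit the toy array below, not necessarily the notebook's own definitions:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Column positions in the toy array below (assumed for this sketch).
rooms_ix, bedrooms_ix, population_ix, household_ix = 0, 1, 2, 3

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            # np.c_ stacks the new 1D arrays as extra columns of X.
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]

X = np.array([[1568.0, 351.0, 710.0, 339.0],
              [679.0, 108.0, 306.0, 113.0]])
extra = CombinedAttributesAdder(add_bedrooms_per_room=False).fit_transform(X)
full = CombinedAttributesAdder().fit_transform(X)
```

Exposing `add_bedrooms_per_room` as a constructor argument makes it a hyperparameter that a grid search can switch on and off.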

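2.5.4 Feature scaling

. Min-max scaling: shift and rescale the values so that they fall in the range 0 to 1

.. Provided by the MinMaxScaler transformer; the range can be changed with the feature_range parameter if 0 to 1 is not wanted

. Standardization: provided by the StandardScaler transformer; the resulting values have no upper or lower bound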

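2.5.5 Transformation pipelines

. Build a pipeline to preprocess the numerical features

. Write a transformer that selects a subset of the columns of a pandas DataFrame

. Combine them all into a single large pipeline that preprocesses both the numerical and the categorical features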
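One way to sketch these three steps uses `Pipeline` and `FeatureUnion`, both imported in the setup cell. The class name `DataFrameSelector`, the step names and the toy data are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select some DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values

housing = pd.DataFrame({
    "total_rooms": [1568.0, 679.0, 1952.0, 1847.0],
    "total_bedrooms": [351.0, np.nan, 471.0, 371.0],
    "ocean_proximity": ["<1H OCEAN", "<1H OCEAN", "NEAR OCEAN", "INLAND"],
})
num_attribs = ["total_rooms", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

# Numerical features: select, impute the median, standardize.
num_pipeline = Pipeline([
    ("selector", DataFrameSelector(num_attribs)),
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# Categorical features: select, one-hot encode.
cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(cat_attribs)),
    ("cat_encoder", OneHotEncoder()),
])

# Run both pipelines and concatenate their outputs column-wise.
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)
if hasattr(housing_prepared, "toarray"):  # result is sparse when any part is sparse
    housing_prepared = housing_prepared.toarray()
print(housing_prepared.shape)
```

Recent scikit-learn versions also offer `ColumnTransformer` for the same purpose, which removes the need for a selector class.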

2.6 Selecting and training a model

2.6.1 Training and evaluating on the training set

. pandas.DataFrame.iloc

. Compare with the actual values

. Apply the mean_squared_error function to the whole training set
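A sketch of training a linear model and measuring its RMSE on the training set. The data is synthetic: the linear relation and the noise level are assumptions, not the housing data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = rng.uniform(0.5, 15.0, size=(200, 1))  # stand-in for the prepared features
y = 40000.0 * X[:, 0] + 50000.0 + rng.normal(0.0, 20000.0, size=200)

lin_reg = LinearRegression()
lin_reg.fit(X, y)

# Compare a few predictions with the actual values.
print("Predictions:", lin_reg.predict(X[:5]))
print("Labels:", y[:5])

# Root mean squared error over the whole training set.
predictions = lin_reg.predict(X)
lin_mse = mean_squared_error(y, predictions)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)
```

Taking the square root puts the error back in the units of the target, which makes it easier to judge against typical house prices.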

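. An underfitting model?

.. Select a more powerful model

.. Feed in better features

.. Reduce the constraints (regularization) on the model

. An overfitting model?

2.6.2 Evaluation using cross-validation

. K-fold cross-validation

.. Randomly split the training set into 10 subsets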
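A sketch of 10-fold cross-validation for a decision tree on synthetic data. An explicit `KFold` with `shuffle=True` is used so that the folds are random; passing `cv=10` alone to `cross_val_score` gives consecutive, unshuffled folds for a regressor:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(0.5, 15.0, size=(200, 1))
y = 40000.0 * X[:, 0] + 50000.0 + rng.normal(0.0, 20000.0, size=200)

tree_reg = DecisionTreeRegressor(random_state=42)

# Train on 9 folds, evaluate on the held-out fold, repeat 10 times.
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(tree_reg, X, y, scoring="neg_mean_squared_error", cv=cv)

# Scores are negated MSE values (higher is better), so flip the sign first.
tree_rmse_scores = np.sqrt(-scores)
print("Mean:", tree_rmse_scores.mean(), "Std:", tree_rmse_scores.std())

# For comparison: the error on the training set itself.
tree_reg.fit(X, y)
train_rmse = np.sqrt(np.mean((tree_reg.predict(X) - y) ** 2))
print("Training RMSE:", train_rmse)
```

An unconstrained tree reaches a training RMSE of essentially zero here while its cross-validated RMSE stays well above that, which is the typical signature of overfitting.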

. pandas.Series.value_counts https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html