Introduction to Machine Learning, Partway Through Chapter 2 - 学生の備忘録なブログ

import sys
print("Python version: {}".format(sys.version))

import pandas as pd
print("pandas version: {}".format(pd.__version__))

import matplotlib
print("matplotlib version: {}".format(matplotlib.__version__))

import numpy as np
print("NumPy version: {}".format(np.__version__))

import scipy as sp
print("SciPy version: {}".format(sp.__version__))

import IPython
print("IPython version: {}".format(IPython.__version__))

import sklearn
print("scikit-learn version: {}".format(sklearn.__version__))

import mglearn
import matplotlib.pyplot as plt

Python version: 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:04:09) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
pandas version: 0.20.3
matplotlib version: 2.0.2
NumPy version: 1.13.1
SciPy version: 0.19.1
IPython version: 6.1.0
scikit-learn version: 0.19.0

%matplotlib inline
from preamble import *

---------------------------------------------------------------------------

ModuleNotFoundError                       Traceback (most recent call last)

<ipython-input-10-e31a7faecc4a> in <module>()
      1 get_ipython().magic('matplotlib inline')
----> 2 from preamble import *


ModuleNotFoundError: No module named 'preamble'

Supervised learning is used when we want to predict a specific output for a given input and have examples of input/output pairs available.

2.1 Classification and Regression

Supervised machine learning problems fall into two broad types:

* Classification
  * Binary classification
  * Multiclass classification
* Regression

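To tell classification and regression problems apart, ask whether the output has some kind of continuity. If it does, the problem is a regression problem. Consider predicting annual income: $40,000 and $40,001 are practically the same amount, so a small error hardly matters. In contrast, when recognizing the language of a website (a classification problem), there is no matter of degree. Languages have no continuity: there is no language halfway between English and French.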
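The continuity test can be tried on concrete target values. The sketch below is only an illustration: the three target lists are made up, and it relies on scikit-learn's `type_of_target` utility to report how each target would be interpreted.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, invented for this illustration.
incomes = [40000.5, 40001.25, 52310.75]   # continuous amounts -> regression
languages = ["en", "fr", "en", "de"]      # unordered labels -> classification
spam_flags = [0, 1, 1, 0]                 # exactly two labels -> binary classification

income_kind = type_of_target(incomes)
language_kind = type_of_target(languages)
spam_kind = type_of_target(spam_flags)

print(income_kind, language_kind, spam_kind)
```

Note that the utility only looks at the values, so float targets that happen to be whole numbers would be reported as class labels. Deciding whether a problem is regression or classification is still up to whoever sets up the task.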

Generalization, Overfitting, and Underfitting

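If we allow ourselves to build very complex models, we can make the predictions on the training data as accurate as we like. A model that follows the training data this closely tends to generalize poorly to new data (overfitting), while a model that is too simple to capture the structure of the data underfits.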
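A small experiment can make this concrete. The following is a minimal sketch, not taken from the book: it fits polynomials of increasing degree to eight noisy points (synthetic data generated here purely for illustration) and prints the error on the training points and on fresh points from the same process.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical training set: 8 noisy samples of a smooth curve.
x_train = np.linspace(-1, 1, 8)
y_train = np.sin(3 * x_train) + rng.normal(scale=0.3, size=x_train.shape)

# Fresh points from the same process, used only for evaluation.
x_test = np.linspace(-0.95, 0.95, 50)
y_test = np.sin(3 * x_test) + rng.normal(scale=0.3, size=x_test.shape)


def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


results = {}
for degree in (1, 3, 7):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print("degree {}: train MSE {:.4f}, test MSE {:.4f}".format(degree, *results[degree]))
```

A degree-7 polynomial has eight coefficients, so it can pass through all eight training points and its training error drops to essentially zero. The test error is the number that says whether the model generalizes.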

[f:id:forhighlow:20171006222712p:plain]

Supervised Machine Learning Algorithms

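Next, we look at the most common machine learning algorithms: how they learn from data and how they make predictions. We discuss the role that model complexity plays in each model and outline how each algorithm builds a model. We also cover the strengths and weaknesses of each algorithm, the kinds of data it suits, and its most important parameters and options. Many algorithms have both a classification and a regression variant, so we describe both.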

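It is not necessary to read the description of every algorithm in detail, but understanding the models gives a better sense of how the individual machine learning algorithms work. This chapter can also be used as a reference guide.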

Sample Datasets

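As an example of a synthetic two-class classification dataset, we look at the forge dataset.

The first feature is plotted on the x-axis and the second feature on the y-axis.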

# generate dataset
X, y = mglearn.datasets.make_forge()
# plot dataset
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
plt.legend(["Class 0", "Class 1"], loc=4)
plt.xlabel("First feature")
plt.ylabel("Second feature")
print("X.shape: {}".format(X.shape))

X.shape: (26, 2)

[f:id:forhighlow:20171006222715p:plain]

As X.shape shows, this dataset consists of 26 data points with 2 features each.

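To illustrate regression algorithms, we use the synthetic wave dataset. It has a single input feature and a continuous target variable (or response) that we want to model. The plot below shows the feature on the x-axis and the regression target (the output) on the y-axis.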

X, y = mglearn.datasets.make_wave(n_samples=40)
plt.plot(X, y, 'o')
plt.ylim(-3, 3)
plt.xlabel("Feature")
plt.ylabel("Target")

<matplotlib.text.Text at 0x114641160>

[f:id:forhighlow:20171006222719p:plain]

One of the datasets we will use for learning records measurements of breast cancer tumors.

from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print("cancer.keys(): \n{}".format(cancer.keys()))

cancer.keys(): 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

print("shape of cancer data: {}".format(cancer.data.shape))

shape of cancer data: (569, 30)

print("Sample counts per class:\n{}".format(
      {n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))}))

Sample counts per class:
{'malignant': 212, 'benign': 357}

212 of the samples are malignant and 357 are benign.

The name of each feature, which indicates what it measures, is stored in the feature_names attribute.

print("Feature names:\n{}".format(cancer.feature_names))  # names of the features

Feature names:
['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']

# DESCR contains a more detailed description of the dataset.
print(cancer.DESCR)

Breast Cancer Wisconsin (Diagnostic) Database
=============================================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

References
----------
   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
     San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
     prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
     July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
     163-171.

print("Data:\n{}".format(cancer.data))

Data:
[[  1.79900000e+01   1.03800000e+01   1.22800000e+02 ...,   2.65400000e-01
    4.60100000e-01   1.18900000e-01]
 [  2.05700000e+01   1.77700000e+01   1.32900000e+02 ...,   1.86000000e-01
    2.75000000e-01   8.90200000e-02]
 [  1.96900000e+01   2.12500000e+01   1.30000000e+02 ...,   2.43000000e-01
    3.61300000e-01   8.75800000e-02]
 ..., 
 [  1.66000000e+01   2.80800000e+01   1.08300000e+02 ...,   1.41800000e-01
    2.21800000e-01   7.82000000e-02]
 [  2.06000000e+01   2.93300000e+01   1.40100000e+02 ...,   2.65000000e-01
    4.08700000e-01   1.24000000e-01]
 [  7.76000000e+00   2.45400000e+01   4.79200000e+01 ...,   0.00000000e+00
    2.87100000e-01   7.03900000e-02]]

Looking at the raw values in detail does not tell us much.

Boston Housing Prices

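As a real-world regression dataset, we use the boston_housing dataset. The task is to predict the median value of homes in Boston suburbs from information such as the crime rate, proximity to the Charles River, and highway accessibility.

It contains 506 data points, each described by 13 features.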

from sklearn.datasets import load_boston
boston = load_boston()
print("Data shape: {}".format(boston.data.shape))

Data shape: (506, 13)

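We will extend this dataset: rather than using only the 13 measurements as features, we also consider all products between features (called interactions). In other words, we look not only at the crime rate and highway accessibility as separate features but also at their product. Including derived features like these is called feature engineering.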

X, y = mglearn.datasets.load_extended_boston()
print("X.shape: {}".format(X.shape))

X.shape: (506, 104)

The 104 features are the 13 original features plus the 91 combinations of two features chosen from those 13 with repetition allowed (13 + 91 = 104).
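The count can be checked by building the products by hand. This is a minimal sketch under stated assumptions: `add_pairwise_products` is a hypothetical helper written for this note, the input is random data standing in for the 13 measurements, and it is not necessarily how mglearn constructs its extended dataset.

```python
from itertools import combinations_with_replacement

import numpy as np


def add_pairwise_products(X):
    """Append the product of every feature pair (squares included) to X."""
    n_features = X.shape[1]
    pairs = list(combinations_with_replacement(range(n_features), 2))
    products = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    return np.hstack([X, products]), pairs


# Hypothetical stand-in for the 13 original measurements (random values).
rng = np.random.RandomState(0)
X_small = rng.rand(5, 13)

X_extended, pairs = add_pairwise_products(X_small)
print(len(pairs))        # number of pairs chosen with repetition
print(X_extended.shape)  # original columns plus one column per pair
```

Choosing with repetition means a feature can be paired with itself, so the squared features are part of the 91 added columns.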

From here on, we go through the characteristics of various machine learning algorithms.

k-Nearest Neighbors

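The k-NN algorithm is often said to be the simplest machine learning algorithm. To make a prediction for a new data point, it finds the closest points in the training dataset, that is, its nearest neighbors.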

Classification with k-Nearest Neighbors

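In its simplest version, the k-NN algorithm considers only a single nearest neighbor: the training data point closest to the point we want to make a prediction for. The prediction is then simply the known output for that training point.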

mglearn.plots.plot_knn_classification(n_neighbors=1)

[f:id:forhighlow:20171006222723p:plain]

Here, we added three new data points, shown as stars. For each of them, we marked the closest point in the training set.

With the one-nearest-neighbor algorithm, the label of that closest point becomes the predicted label (shown by the color of the star).
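The one-nearest-neighbor rule is short enough to write out directly. The following is a minimal NumPy sketch, assuming Euclidean distance; `predict_1nn` and the toy data are hypothetical and are not the forge dataset or scikit-learn's implementation.

```python
import numpy as np


def predict_1nn(X_train, y_train, X_new):
    """Return, for each row of X_new, the label of its closest training point."""
    # Pairwise Euclidean distances, shape (n_new, n_train).
    diffs = X_new[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    nearest = distances.argmin(axis=1)
    return y_train[nearest]


# Hypothetical toy data: two points per class.
X_train = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 8.0], [9.0, 7.5]])
y_train = np.array([0, 0, 1, 1])
X_new = np.array([[1.5, 1.2], [8.5, 8.2], [4.0, 4.0]])

predictions = predict_1nn(X_train, y_train, X_new)
print(predictions)
```

Nothing is learned ahead of time here. All the work happens at prediction time, when the distances to the stored training points are computed.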

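When more than one neighbor is considered, the label is decided by voting. That is, for each test point, we count how many of its neighbors belong to class 0 and how many belong to class 1, and assign the class that occurs most often. In other words, we take the majority class among the k nearest neighbors.

In short, a point takes on the label held by most of the points near it. As the proverb goes, mix with vermilion and you turn red.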

mglearn.plots.plot_knn_classification(n_neighbors=3)

[Figure: predictions of the three-nearest-neighbors model on the forge dataset]

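Again, the predictions are shown by the color of the stars. The prediction for the new data point at the top left differs from the prediction made using only one neighbor.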
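How the prediction for one point can change with the number of neighbors is easy to reproduce on a toy example. This is a minimal sketch of majority voting, assuming Euclidean distance; `predict_knn` and the data are hypothetical and are not the forge dataset.

```python
from collections import Counter

import numpy as np


def predict_knn(X_train, y_train, x_new, k=3):
    """Predict the label of one point by majority vote among its k nearest neighbors."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: one class-1 point sits close to the class-0 cluster.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [2.2, 2.2], [8.0, 8.0], [9.0, 8.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_new = np.array([2.1, 2.0])

label_k1 = predict_knn(X_train, y_train, x_new, k=1)
label_k3 = predict_knn(X_train, y_train, x_new, k=3)
print(label_k1, label_k3)
```

With k=1 the single closest point belongs to class 1, but with k=3 two of the three closest points belong to class 0, so the vote flips. With two classes, an odd k avoids ties; this sketch does not handle ties in any principled way.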

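Now let's see how to apply the k-nearest neighbors algorithm using scikit-learn. First, we split the data into a training set and a test set so that we can evaluate generalization performance.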

from sklearn.model_selection import train_test_split
X, y = mglearn.datasets.make_forge()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
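To see what this split produces, the sketch below runs `train_test_split` with its default settings on a stand-in array of the same shape as the forge data (26 points, 2 features). The arrays `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in with the same shape as the forge data.
X_demo = np.arange(52).reshape(26, 2)
y_demo = np.array([0, 1] * 13)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# Passing the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, random_state=0)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

By default a quarter of the data, rounded up, is held out for testing, which for 26 points gives 19 training points and 7 test points.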

Next, we import the class and instantiate it. At this point we can set parameters such as the number of neighbors; here we set it to 3.

from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)

Next, we fit the classifier using the training set. For KNeighborsClassifier this means storing the dataset, so that neighbors can be computed during prediction.

clf.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

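To make predictions on the test data, we call the predict method. For each data point in the test set, it computes the nearest neighbors in the training set and finds the most common class among these.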

clf.predict(X_test)

array([1, 0, 1, 0, 1, 0, 0])

clf.score(X_test, y_test)

0.8571428571428571

The accuracy is about 86%: the model predicted the class correctly for 6 of the 7 samples in the test set.
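For a classifier, the score is the fraction of test samples whose predicted label matches the true label. The sketch below computes that fraction by hand. The predictions are copied from the output above, but the true labels are hypothetical, invented so that exactly one prediction is wrong; they are not the actual y_test.

```python
import numpy as np

# Predictions copied from the clf.predict(X_test) output above.
y_pred = np.array([1, 0, 1, 0, 1, 0, 0])
# Hypothetical true labels (not the real y_test): one mismatch, at index 5.
y_true = np.array([1, 0, 1, 0, 1, 1, 0])

accuracy = np.mean(y_pred == y_true)
print("{:.4f}".format(accuracy))
```

With only 7 test samples, a single misclassification moves the score by about 14 percentage points, so a score from a test set this small is a rough estimate.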