This article is based on a tutorial from the machine learning course at the University of Bremen. It summarizes some recommendations on how to use machine learning to solve a new problem, including:
- Ways of visualizing your data
- Choosing a machine learning method that is suitable for the problem at hand
- Identifying and dealing with overfitting and underfitting
- Dealing with large datasets (note: not very small ones)
- Pros and cons of different loss functions
The material follows Andrew Ng's "Advice for Applying Machine Learning". The purpose of this notebook is to illustrate those ideas in an interactive way. Some of the recommendations are debatable: they are suggestions, not strict rules.
In [1]:
```python
import time
import numpy as np

np.random.seed(0)
```
In [2]:
```python
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```
In [3]:
```python
# Modified from http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html
from sklearn.learning_curve import learning_curve


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and train learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features)
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see the
        sklearn.cross_validation module for the list of possible objects.
    """
    plt.figure()
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=1, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Shaded bands show one standard deviation around the mean score.
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.grid("on")
    if ylim:
        plt.ylim(ylim)
    plt.title(title)
```
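To make the helper's interface concrete, here is a minimal usage sketch. It assumes the feature matrix `X` and labels `y` generated in the next section, and `LogisticRegression` is purely an illustrative choice of estimator, not one prescribed by the tutorial:

```python
# Hypothetical usage of plot_learning_curve: X and y are assumed to be the
# data generated below; LogisticRegression is just an example estimator.
from sklearn.linear_model import LogisticRegression

plot_learning_curve(LogisticRegression(),
                    "Learning curve (logistic regression)",
                    X, y, ylim=(0.0, 1.01), cv=5)
```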
Dataset
We generate some simple toy data using sklearn's make_classification function:
In [4]:
```python
from sklearn.datasets import make_classification
X, y = make_classification(1000, n_features=20, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)

from pandas import DataFrame
# Wrap features and labels in one DataFrame (columns 0..19 plus "class").
df = DataFrame(np.hstack((X, y[:, None])),
               columns=list(range(20)) + ["class"])
```
Notice that we have generated a dataset for binary classification consisting of 1000 data points, each with 20 features. We have used pandas' DataFrame class to combine the features and the class labels into a single data structure. Let's look at the first 5 data points:
In [5]:
```python
df[:5]
```
Out[5]:
```
          0         1         2         3         4         5         6         7         8         9  ...        11        12        13        14        15        16        17        18        19  class
0 -1.063780  0.676409  1.069356 -0.217580  0.460215 -0.399167 -0.079188  1.209385 -0.785315 -0.172186  ... -0.993119  0.306935  0.064058 -1.054233 -0.527496 -0.074183 -0.355628  1.057214 -0.902592      0
1  0.070848 -1.695281  2.449449 -0.530494 -0.932962  2.865204  2.435729 -1.618500  1.300717  0.348402  ...  0.225324  0.605563 -0.192101 -0.068027  0.971681 -1.792048  0.017083 -0.375669 -0.623236      1
2  0.940284 -0.492146  0.677956 -0.227754  1.401753  1.231653 -0.777464  0.015616  1.331713  1.084773  ... -0.050120  0.948386 -0.173428 -0.477672  0.760896  1.001158 -0.069464  1.359046 -1.189590      1
3 -0.299517  0.759890  0.182803 -1.550233  0.338218  0.363241 -2.100525 -0.438068 -0.166393 -0.340835  ...  1.178724  2.831480  0.142414 -0.202819  2.405715  0.313305  0.404356 -0.287546 -2.847803      1
4 -2.630627  0.231034  0.042463  0.478851  1.546742  1.637956 -1.532072 -0.734445  0.465855  0.473836  ... -1.061194 -0.888880  1.238409 -0.572829 -1.275339  1.003007 -0.477128  0.098536  0.527804      0

5 rows × 21 columns
```
Even in this low-dimensional example, it is hard to get any clue about the problem by looking directly at the raw feature values. There are many ways to obtain a more accessible view of the data; a few of them are discussed in the following sections.
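As a first, plot-free step, per-column summary statistics are already easier to digest than raw rows. A minimal sketch using pandas (this cell is an illustrative addition):

```python
# Per-feature summary statistics (count, mean, std, quartiles) as a quick
# sanity check before any plotting.
df.describe()
```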
Visualization
When you are faced with a new problem, the first step is almost always visualization, i.e., looking at your data.
Seaborn is a great package for statistical data visualization. We use some of its functions to explore the data.
The first step is to generate scatter plots and histograms using pairplot. The two colors correspond to the two classes; to keep things manageable we use only a subset of the features and only the first 50 data points.
In [6]:
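A minimal sketch of such a pairplot call: the particular feature subset passed to `vars` is an illustrative assumption (any handful of the 20 columns would do), not necessarily the tutorial's original choice.

```python
# Scatter-plot matrix of the first 50 points, colored by class. The feature
# subset in `vars` is an assumption; note that older seaborn versions use
# `size` where newer ones call this parameter `height`.
_ = sns.pairplot(df[:50], vars=[8, 11, 12, 14, 19], hue="class", size=1.5)
```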