This article is based on a tutorial from the machine learning course at the University of Bremen. It summarizes some recommendations on how to use machine learning to solve a new problem, including:
- Ways of visualizing your data
- Choosing a machine learning method that is suitable for the problem at hand
- Identifying and dealing with overfitting and underfitting
- Dealing with large datasets (note: not very small ones)
- Pros and cons of different loss functions
The material follows Andrew Ng's "Advice for Applying Machine Learning". The purpose of this notebook is to illustrate those ideas in an interactive way. Some of the recommendations are debatable: they are suggestions, not strict rules.
In [1]:
```python
import time
import numpy as np

np.random.seed(0)
```
In [2]:
```python
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```
In [3]:
```python
# Modified from http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html
from sklearn.learning_curve import learning_curve


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and train learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features)
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y-values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see the
        sklearn.cross_validation module for the list of possible objects.
    """
    plt.figure()
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=1, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Shaded bands show one standard deviation around the mean score.
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.grid("on")
    if ylim:
        plt.ylim(ylim)
    plt.title(title)
```
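To make the helper's interface concrete, here is a minimal usage sketch. It assumes the feature matrix `X` and labels `y` generated in the next section, and `LogisticRegression` is purely an illustrative choice of estimator, not one prescribed by the tutorial:

```python
# Hypothetical usage of plot_learning_curve: X and y are assumed to be the
# data generated below; LogisticRegression is just an example estimator.
from sklearn.linear_model import LogisticRegression

plot_learning_curve(LogisticRegression(),
                    "Learning curve (logistic regression)",
                    X, y, ylim=(0.0, 1.01), cv=5)
```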
Dataset
We generate some simple toy data using sklearn's make_classification function:
In [4]:
```python
from sklearn.datasets import make_classification
X, y = make_classification(1000, n_features=20, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=0)

from pandas import DataFrame
# Wrap features and labels in one DataFrame (columns 0..19 plus "class").
df = DataFrame(np.hstack((X, y[:, None])),
               columns=list(range(20)) + ["class"])
```
Notice that we have generated a dataset for binary classification consisting of 1000 data points, each with 20 features. We have used pandas' DataFrame class to combine the features and the class labels into a single data structure. Let's look at the first 5 data points:
In [5]:
```python
df[:5]
```
Out[5]:
```
          0         1         2         3         4         5         6         7         8         9  ...        11        12        13        14        15        16        17        18        19  class
0 -1.063780  0.676409  1.069356 -0.217580  0.460215 -0.399167 -0.079188  1.209385 -0.785315 -0.172186  ... -0.993119  0.306935  0.064058 -1.054233 -0.527496 -0.074183 -0.355628  1.057214 -0.902592      0
1  0.070848 -1.695281  2.449449 -0.530494 -0.932962  2.865204  2.435729 -1.618500  1.300717  0.348402  ...  0.225324  0.605563 -0.192101 -0.068027  0.971681 -1.792048  0.017083 -0.375669 -0.623236      1
2  0.940284 -0.492146  0.677956 -0.227754  1.401753  1.231653 -0.777464  0.015616  1.331713  1.084773  ... -0.050120  0.948386 -0.173428 -0.477672  0.760896  1.001158 -0.069464  1.359046 -1.189590      1
3 -0.299517  0.759890  0.182803 -1.550233  0.338218  0.363241 -2.100525 -0.438068 -0.166393 -0.340835  ...  1.178724  2.831480  0.142414 -0.202819  2.405715  0.313305  0.404356 -0.287546 -2.847803      1
4 -2.630627  0.231034  0.042463  0.478851  1.546742  1.637956 -1.532072 -0.734445  0.465855  0.473836  ... -1.061194 -0.888880  1.238409 -0.572829 -1.275339  1.003007 -0.477128  0.098536  0.527804      0

5 rows × 21 columns
```
Even in this low-dimensional example, it is hard to get any clue about the problem by looking directly at the raw feature values. There are many ways to obtain a more accessible view of the data; a few of them are discussed in the following sections.
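As a first, plot-free step, per-column summary statistics are already easier to digest than raw rows. A minimal sketch using pandas (this cell is an illustrative addition):

```python
# Per-feature summary statistics (count, mean, std, quartiles) as a quick
# sanity check before any plotting.
df.describe()
```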
Visualization
When you are faced with a new problem, the first step is almost always visualization, i.e., looking at your data.
Seaborn is a great package for statistical data visualization. We use some of its functions to explore the data.
The first step is to generate scatter plots and histograms using pairplot. The two colors correspond to the two classes; to keep things manageable we use only a subset of the features and only the first 50 data points.
In [6]:
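A minimal sketch of such a pairplot call: the particular feature subset passed to `vars` is an illustrative assumption (any handful of the 20 columns would do), not necessarily the tutorial's original choice.

```python
# Scatter-plot matrix of the first 50 points, colored by class. The feature
# subset in `vars` is an assumption; note that older seaborn versions use
# `size` where newer ones call this parameter `height`.
_ = sns.pairplot(df[:50], vars=[8, 11, 12, 14, 19], hue="class", size=1.5)
```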