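Cross-validation is a model evaluation method used to estimate how well a machine learning model generalizes to unseen data. The basic idea is to split the dataset into a training set and a validation set, fit the model on the training set, and measure its performance on the validation set. Repeating this over several different splits and averaging the results gives a more stable estimate than a single split.
Common cross-validation methods include: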
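- K-fold cross-validation: split the dataset into K subsets (folds) of roughly equal size. In each round, use K-1 folds as the training set and the remaining fold as the validation set. Run K rounds so that each fold serves as the validation set exactly once, then average the performance.
- Leave-one-out cross-validation: with N samples, use N-1 samples as the training set and the remaining single sample as the validation set. Run N rounds and average the performance. This is K-fold cross-validation with K = N.
- Repeated random sub-sampling: randomly select a fixed proportion of the samples as the training set and use the rest as the validation set. Repeat this many times and average the performance. Unlike K-fold, a given sample may land in several validation sets or in none.
- Time-series cross-validation: for time-ordered data, train in each round on data from earlier time points and validate on the time points that follow, so the model is never trained on data from the future.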
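To make the K-fold partitioning concrete, here is a minimal sketch in plain NumPy. The helper name `k_fold_indices` is made up for illustration and is not part of any library. It assumes unshuffled data and `k >= 2`, and shows that the folds are only roughly equal in size when the sample count is not divisible by K.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled K-fold.

    Hypothetical helper for illustration; assumes k >= 2.
    """
    indices = np.arange(n_samples)
    # array_split tolerates n_samples not divisible by k:
    # the first n_samples % k folds get one extra sample.
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        # Training indices are all folds except fold i.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, val_idx

splits = list(k_fold_indices(10, 3))
fold_sizes = [len(val_idx) for _, val_idx in splits]
print(fold_sizes)  # [4, 3, 3]
```

Leave-one-out corresponds to calling the same helper with `k` equal to `n_samples`.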
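Code examples:
K-fold cross-validation:
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Assumes x and y are NumPy arrays and model is a scikit-learn classifier
# defined elsewhere.
# StratifiedKFold is the K-fold variant that preserves class proportions in
# each fold; use KFold for plain K-fold or for regression targets.
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = []
for train_index, val_index in kf.split(x, y):
    x_train, x_val = x[train_index], x[val_index]
    y_train, y_val = y[train_index], y[val_index]
    model.fit(x_train, y_train)
    scores.append(model.score(x_val, y_val))
mean_score = np.mean(scores)
print(mean_score)
```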
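The same fit-and-score loop can also be written more compactly with scikit-learn's `cross_val_score`. The sketch below is self-contained only because it picks a dataset and model for illustration: the Iris data and `LogisticRegression` are assumptions, not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Illustrative data and model (assumptions; substitute your own).
x, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Same splitter settings as in the manual loop.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# cross_val_score fits a fresh copy of the model on each training split
# and returns one score per fold (accuracy for classifiers by default).
scores = cross_val_score(model, x, y, cv=cv)
print(scores.mean())
```

Passing the splitter object through `cv` keeps the shuffling and seed explicit, so the folds match those of a hand-written loop with the same settings.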
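Leave-one-out cross-validation:
```python
import numpy as np

# Assumes x and y are NumPy arrays and model is a scikit-learn estimator
# defined elsewhere.
scores = []
for i in range(len(x)):
    # Train on every sample except i; validate on sample i alone.
    x_train = np.delete(x, i, axis=0)
    y_train = np.delete(y, i, axis=0)
    x_val, y_val = x[i:i + 1], y[i:i + 1]  # slicing keeps x_val 2-D
    model.fit(x_train, y_train)
    # For a classifier, each score is 0 or 1 (accuracy on one sample).
    scores.append(model.score(x_val, y_val))
mean_score = np.mean(scores)
print(mean_score)
```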
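scikit-learn also ships a `LeaveOneOut` splitter that yields the same kind of index pairs as the other splitters, which avoids hand-written index arithmetic. The short sketch below uses a toy array (an assumption for illustration) only to show the shape of the splits.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Toy data for illustration: 5 samples, 2 features.
x = np.arange(10).reshape(5, 2)

loo = LeaveOneOut()
splits = list(loo.split(x))
for train_index, val_index in splits:
    # Each round holds out exactly one sample for validation.
    print(train_index, val_index)
```

There are as many rounds as samples, so this scheme becomes expensive on large datasets.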
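Repeated random sub-sampling:
```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Assumes x and y are NumPy arrays and model is a scikit-learn estimator
# defined elsewhere.
# 100 independent random splits, each holding out 20% for validation.
rs = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
scores = []
for train_index, val_index in rs.split(x, y):
    x_train, x_val = x[train_index], x[val_index]
    y_train, y_val = y[train_index], y[val_index]
    model.fit(x_train, y_train)
    scores.append(model.score(x_val, y_val))
mean_score = np.mean(scores)
print(mean_score)
```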
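Time-series cross-validation:
The time-series scheme can be sketched with scikit-learn's `TimeSeriesSplit`, which produces expanding training windows followed by a later validation window. The toy array below is an assumption for illustration, and its rows are taken to be in chronological order.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data for illustration: 6 time-ordered samples, 2 features.
x = np.arange(12).reshape(6, 2)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(x))
for train_index, val_index in splits:
    # Training indices always precede validation indices in time.
    print(train_index, val_index)
```

The index pairs can be plugged into the same fit-and-score loop used for the other splitters. Shuffling is deliberately absent here, since it would leak future observations into the training set.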