Training set, validation set, and test set

In supervised machine learning, the data is usually divided into two or three subsets: a training set, a validation set, and a test set.

http://blog.sina.com.cn/s/blog_4d2f6cf201000cjx.html

In general, the samples should be divided into three independent parts: a training set, a validation set, and a test set. The training set is used to fit the model; the validation set is used to choose the network architecture or the hyperparameters that control model complexity; and the test set is used to assess the performance of the finally selected model. A typical split assigns 50% of the samples to the training set and 25% each to the other two, all drawn at random.
When samples are scarce, such a split is no longer appropriate. A common alternative is to hold out a small portion as the test set and apply K-fold cross-validation to the remaining N samples: shuffle the samples, divide them evenly into K folds, train on K-1 folds and validate on the remaining fold in turn, compute the sum of squared prediction errors each round, and average the K results as the criterion for selecting the best model structure. In the special case K = N, this is the leave-one-out method.
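
As a minimal sketch of this procedure (assuming scikit-learn; Ridge is just a placeholder estimator and the data is synthetic):

import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = np.random.randn(200, 10), np.random.randn(200)  # stand-in data

# Hold out a small test set; cross-validate on the remaining N samples.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)    # K = 5 folds
fold_errors = []
for train_idx, val_idx in kf.split(X_rest):
    model = Ridge(alpha=1.0).fit(X_rest[train_idx], y_rest[train_idx])
    pred = model.predict(X_rest[val_idx])
    fold_errors.append(mean_squared_error(y_rest[val_idx], pred))

cv_error = np.mean(fold_errors)  # the averaged K-fold error guides model selection

With n_splits equal to the number of remaining samples, KFold reduces to leave-one-out.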

http://www.cppblog.com/guijie/archive/2008/07/29/57407.html

These three terms appear constantly in machine learning papers, but many people are not entirely clear about their meanings, and the latter two in particular are often used interchangeably. Ripley, B.D. (1996) gives definitions of all three in his classic monograph Pattern Recognition and Neural Networks:
Training set: A set of examples used for learning, which is to fit the parameters [i.e., weights] of the classifier. 
Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network. 
Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.
Clearly, the training set is used to train the model, i.e., to fit its parameters, such as the weights of an ANN; the validation set is used for model selection, i.e., for the final tuning and choice of the model, such as the architecture of an ANN; and the test set is used purely to assess the generalization ability of the trained model. Of course, the test set cannot guarantee that the model is correct; it only suggests that similar data will yield similar results under this model. In practice, data sets are often divided into only two parts, a training set and a test set, and most papers do not involve a validation set.
Ripley also addresses why separate test and validation sets are needed:
1. The error rate estimate of the final model on validation data will be biased (smaller than the true error rate) since the validation set is used to select the final model.
2. After assessing the final model with the test set, YOU MUST NOT tune the model any further.

http://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set

Step 1) Training: Each type of algorithm has its own parameter options (the number of layers in a Neural Network, the number of trees in a Random Forest, etc.). For each of your algorithms, you must pick one option. That's why you have a validation set.

Step 2) Validating: You now have a collection of algorithms. You must pick one algorithm. That's why you have a test set. Most people pick the algorithm that performs best on the validation set (and that's ok). But if you do not measure your top-performing algorithm's error rate on the test set, and just go with its error rate on the validation set, then you have blindly mistaken the "best possible scenario" for the "most likely scenario." That's a recipe for disaster.

Step 3) Testing: I suppose that if your algorithms did not have any parameters, then you would not need a third step. In that case, your validation step would be your test step. Perhaps Matlab does not ask you for parameters, or you have chosen not to use them, and that is the source of your confusion.
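
Taken together, the three steps might look like the following sketch (scikit-learn names; the candidate models and their parameter options are made up for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Step 1) Training: fit each algorithm under each of its parameter options.
candidates = [MLPClassifier(hidden_layer_sizes=(n,), max_iter=500, random_state=0).fit(X_train, y_train)
              for n in (10, 50)]
candidates += [RandomForestClassifier(n_estimators=n, random_state=0).fit(X_train, y_train)
               for n in (50, 200)]

# Step 2) Validating: pick the candidate that performs best on the validation set.
best = max(candidates, key=lambda m: m.score(X_val, y_val))

# Step 3) Testing: measure its accuracy on the untouched test set; no further tuning after this.
print("test accuracy:", best.score(X_test, y_test))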

My idea is that those options in the neural network toolbox are there to avoid overfitting. Without them, the weights are fitted to the training data only and do not capture the global trend. By having a validation set, training can adapt to what the validation error is doing: as long as decreases in the training error are accompanied by decreases in the validation error, training continues; once the validation error increases while the training error keeps decreasing, that is the overfitting phenomenon showing itself.

http://blog.sciencenet.cn/blog-397960-666113.html

http://stackoverflow.com/questions/2976452/whats-is-the-difference-between-train-validation-and-test-set-in-neural-networ

for each epoch:
    for each training data instance:
        propagate the error through the network
        adjust the weights
        calculate the accuracy over the training data
    for each validation data instance:
        calculate the accuracy over the validation data
    if the threshold validation accuracy is met:
        exit training
    else:
        continue training

Once you've finished training, you run the network against your test set and verify that the accuracy is sufficient.
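
A minimal runnable rendering of that loop in Python (train_one_epoch and accuracy are hypothetical stand-ins for whatever your framework provides; nothing here comes from a specific library):

def fit(model, train_data, val_data, test_data, target_val_acc=0.95, max_epochs=100):
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)      # backprop: propagate error, adjust weights
        train_acc = accuracy(model, train_data)
        val_acc = accuracy(model, val_data)     # measured only, never used to adjust weights
        if val_acc >= target_val_acc:           # threshold validation accuracy met
            break                               # exit training
    return accuracy(model, test_data)           # final check against the test set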

Training Set: this data set is used to adjust the weights on the neural network.

Validation Set: this data set is used to minimize overfitting. You're not adjusting the weights of the network with this data set; you're just verifying that any increase in accuracy over the training data set actually yields an increase in accuracy over a data set the network has not been shown before, or at least has not trained on (i.e., the validation data set). If the accuracy over the training data set increases, but the accuracy over the validation data set stays the same or decreases, then you're overfitting your neural network and you should stop training.

Testing Set: this data set is used only for testing the final solution in order to confirm the actual predictive power of the network.

The validation set is used in the process of training; the test set is not. The test set allows you

1) to see whether the training set was enough, and
2) to check whether the validation set did its job of preventing overfitting.

If you use the test set in the process of training, it becomes just another validation set, and it won't show what happens when new data is fed into the network.

The error surface will be different for different sets of data drawn from your data set (batch learning). Therefore, if you find a very good local minimum for your training set data, that may not be a very good point at all; it may be a very bad point on the surface generated by some other set of data for the same problem. You therefore need a model that not only finds a good weight configuration for the training set but can also predict new data (not in the training set) with low error. In other words, the network should generalize from the examples, so that it learns the data and does not simply memorize the training set by overfitting it.

The validation data set is a set of data for the function you want to learn which you do not use directly to train the network. You train the network with the training data set. If you are using a gradient-based algorithm to train the network, then the error surface and the gradient at any point depend entirely on the training data set; thus the training data set is directly used to adjust the weights. To make sure you do not overfit the network, you feed the validation data set to the network and check that its error stays within some range. Because the validation set is not used directly to adjust the weights of the network, a good error on the validation set, and also on the test set, indicates that the network predicts well for the training examples and can be expected to perform well on new examples that were not used in the training process.

Early stopping is one way to stop training. There are different variations, but the main outline is this: the errors on both the training set and the validation set are monitored; the training error decreases at each iteration (backpropagation and its relatives), and at first the validation error decreases too. Training is stopped the moment the validation error starts to rise. The weight configuration at this point defines a model that predicts the training data well, as well as data the network has not seen. But because the validation data is used to select the weight configuration, it indirectly affects it. This is where the test set comes in: this set of data is never used in the training process. Once a model has been selected based on the validation set, the test set is applied to the network model and its error is measured. This error is representative of the error we can expect from absolutely new data for the same problem.
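
A sketch of that early-stopping rule in Python (stop once the validation error starts to rise, keep the best weights seen so far; train_one_epoch, error, and model.weights are hypothetical stand-ins, and the patience parameter is a common refinement that tolerates brief upticks):

import copy

def train_with_early_stopping(model, train_data, val_data, patience=5, max_epochs=500):
    best_val_err, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)       # gradient steps use the training data only
        val_err = error(model, val_data)         # monitored, never used in the gradient
        if val_err < best_val_err:
            best_val_err = val_err
            best_weights = copy.deepcopy(model.weights)
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:           # validation error has started to rise
                break
    model.weights = best_weights                 # restore the best configuration found
    return model

After this selection, the test set error is measured once on the returned model and reported as the expected error on new data.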
