Does learning rate decay in a CNN cause overfitting?

I have been quite busy lately, so it has been a while since my last update.
Today I am finally getting around to this post on parameters. I have been doing all sorts of tuning myself for quite some time now, and in this post I will write down my own views together with the relevant background knowledge, as a reference and as study material.
The single most important skill when learning to use deep learning is tuning, and that skill only comes from long-term accumulation, because every dataset and every network architecture needs different parameters to reach its best performance. The process of searching for those best parameters is what we call tuning.
1. What do we mean by "parameters"?
"Parameter" here is a fairly broad term: it covers not only numeric settings but also adjustments to the network structure and to the functions used. The main kinds of parameters are listed below (a code sketch follows the list):
1. Data (pre)processing parameters:
enrich data (augmenting/enlarging the dataset),
feature normalization and scaling,
batch normalization (BN),
2. Training-related parameters:
momentum term,
BGD, SGD, or mini-batch gradient descent (see my earlier post for how these differ),
number of epochs,
learning rate,
objective function (the loss function),
weight initialization,
regularization methods,
3. Network-structure parameters:
number of layers,
number of nodes,
number of filters,
classifier (the choice of classifier),
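To make the list above concrete, here is a minimal sketch (mine, not from the original post) of where these hyperparameters typically show up in a PyTorch training setup; all specific values and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Network-structure parameters: number of layers, number of filters, number of nodes, classifier.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),  # number of filters, filter size
    nn.BatchNorm2d(32),                          # batch normalization
    nn.ReLU(),                                   # non-linearity
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10),                 # number of nodes; a 10-class classifier head (assumes 32x32 inputs)
)

# Training-related parameters: learning rate, momentum, regularization (L2 via weight decay).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.005)
loss_fn = nn.CrossEntropyLoss()                  # objective (loss) function

num_epochs = 50                                  # number of epochs
batch_size = 64                                  # mini-batch gradient descent
```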
2. Why we tune, and the problems that come up
Tuning has two direct goals: 1. when the network fails to train at all, we obviously have to adjust the parameters; 2. the network trains, but we want to raise its overall accuracy.
A summary of the problems that can show up during training (all figures in the original post come from my own experiments):
1. The network cannot be trained effectively at all; it is simply wrong and nothing converges.
2. Partial convergence.
3. Everything converges, but the result is poor.
Analyzing the problems
The three cases above cover the problem types in deep learning fairly completely: no convergence at all, partial convergence, and full convergence with large error.
Let us analyze each problem first; how to tune for each one comes afterwards:
1. No convergence at all:
When this happens there are two possible causes: 1. the input data is wrong, so the network has nothing to learn from; 2. the network itself is wrong (badly designed), so it cannot learn.
2. Partial convergence:
This can happen for many different reasons, but they boil down to two: 1. underfitting; 2. overfitting.
Underfitting and overfitting are the two most common phenomena in deep learning. At their core, both come from the complexity of the classifier relative to the data.
A classifier that is too simple for complex data will underfit; for example, a linear two-class classifier cannot solve the XOR problem.
A classifier that is too complex for simple data will overfit.
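To make the XOR example concrete, here is a small sketch (my own, with illustrative architecture and training settings): a single linear layer cannot separate XOR, while a tiny MLP with one ReLU hidden layer can.

```python
import torch
import torch.nn as nn

# XOR data: not linearly separable.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

def train_and_score(model, steps=2000):
    opt = torch.optim.Adam(model.parameters(), lr=0.05)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return ((model(X) > 0).float() == y).float().mean().item()

linear = nn.Linear(2, 1)                                           # too simple: underfits XOR
mlp = nn.Sequential(nn.Linear(2, 4), nn.ReLU(), nn.Linear(4, 1))   # enough non-linearity

print("linear accuracy:", train_and_score(linear))  # at best 0.75, never 1.0
print("mlp accuracy:   ", train_and_score(mlp))     # typically reaches 1.0
```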
To put it more plainly:
1. Underfitting means the classifier is too simple to separate the data, and because it cannot separate the data it cannot learn the right knowledge.
2. Overfitting means the classifier is so complex that it can learn every piece of information in the training data, including the wrong information (noise).
In a practical setting you can think of it like this:
1. We have two datasets, A and B: ten photos each of a person's face under different conditions. We train on A and test on B; we want the network to learn facial features from A, so that it knows what a face is and can then judge whether B is a face.
2. With a classifier that underfits, the network learns essentially nothing from A: we want it to recognize A as a face, but it assigns A to a wrong label altogether, for example calling it a car.
3. With a classifier that overfits, the network not only learns that A is a face, it decides that only A is a face, so when it judges B it concludes B is not a face (it has learned overly specific details and can no longer recognize other data).
(The original post illustrates these two problems with a figure and summarizes their training results in a table.)
3. Full convergence:
This is a good start; all that remains is fine-tuning a few parameters.
Tuning for each problem
In many other people's write-ups on tuning I have seen the suggestion to combine brute-force tuning with fine-tuning. I do not really agree with that approach, because brute-force tuning means blindly adjusting many parameters at once, with no system, hoping to land on something that more or less converges and then fine-tuning from there. That is gambling on luck: if you are lucky, you may stumble onto a workable depth and the important parameters; if you are not, you will spend even more time searching, and even when you do end up with a good result you cannot be sure it is the best one.
So once more, my advice is not to rush. To get good results you have to analyze the outcome step by step and solve one problem at a time.
Now let's talk about how to tune.
The tunable parameters were listed in section 1, and some of them now have widely accepted defaults: the activation function is almost always ReLU, momentum is usually chosen between 0.9 and 0.95, weight decay is commonly 0.005, filter sizes are odd (3x3, 5x5), and dropout is by now standard equipment. These are the values used across recent papers and are generally agreed to give good results, so there is little need to tune them much (a small sketch of these defaults follows). The parameters that do need attention and adjustment come next.
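As a rough illustration (a sketch under my own assumptions, not a prescription), those defaults look like this in code:

```python
import torch.nn as nn
import torch.optim as optim

conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)   # odd filter size (3x3 or 5x5)
act = nn.ReLU()                                      # ReLU as the default activation
drop = nn.Dropout(p=0.5)                             # dropout as standard equipment

# momentum in the 0.9-0.95 range, weight decay 0.005
optimizer = optim.SGD(conv.parameters(), lr=0.01, momentum=0.9, weight_decay=0.005)
```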
1. No convergence at all:
Check whether your data actually contains learnable information, and whether its values are normalized (values that are too large or too small can break learning).
If the data itself is wrong, you have to go back and obtain correct data. If the problem is abnormal value ranges, z-score standardization fixes it (see my earlier post; a sketch also follows below).
If the problem is the network, adjust the network: its depth, its degree of non-linearity, the kind of classifier, and so on.
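A minimal sketch of z-score standardization (my own; the statistics should come from the training split only, and the array names are made up):

```python
import numpy as np

def zscore(train, test):
    # Compute mean and standard deviation on the training data only,
    # then apply the same transform to both splits.
    mean = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8          # avoid division by zero
    return (train - mean) / std, (test - mean) / std

X_train = np.random.rand(100, 20) * 1000.0  # hypothetical raw features with overly large values
X_test = np.random.rand(30, 20) * 1000.0
X_train_n, X_test_n = zscore(X_train, X_test)
```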
2. Partial convergence:
For underfitting (a sketch follows this list):
increase the complexity (depth) of the network,
lower the learning rate,
improve the dataset,
increase the non-linearity of the network (ReLU),
use batch normalization.
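As an example of "increase complexity, non-linearity, and batch normalization" (a sketch with made-up layer sizes, assuming 32x32 RGB inputs):

```python
import torch.nn as nn

# A shallow, mostly linear block that may underfit complex data.
shallow = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)

# Deeper, with ReLU non-linearities and batch normalization.
deeper = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 10),
)
```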
For overfitting (a code sketch of three of these follows the list):
enrich / augment the data,
increase the sparsity of the network,
reduce the complexity (depth) of the network,
L1 regularization,
L2 regularization,
add dropout,
early stopping,
moderately lower the learning rate,
moderately reduce the number of epochs.
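Three of those remedies in code form (a sketch with a made-up model and random stand-in data; early stopping here simply keeps the weights with the best validation loss and stops after a fixed patience):

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                                         # dropout
    nn.Linear(64, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.005)  # L2 regularization
loss_fn = nn.CrossEntropyLoss()

# Stand-in data so the sketch runs end to end.
train_X, train_y = torch.randn(1000, 128), torch.randint(0, 10, (1000,))
val_X, val_y = torch.randn(200, 128), torch.randint(0, 10, (200,))

best_loss, best_state, patience, bad_epochs = float("inf"), None, 5, 0
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(train_X), train_y).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(val_X), val_y).item()
    if val_loss < best_loss:
        best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                             # early stopping
            break

model.load_state_dict(best_state)
```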
3. Full convergence:
The method here is to keep everything else fixed and adjust one parameter at a time. The parameters worth adjusting are (a learning-rate schedule sketch follows the list):
learning rate,
mini-batch size,
filter size,
number of filters (see my previous two posts about filters).
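Since the title asks about learning rate decay: a common way to adjust the learning rate at this stage is a step decay schedule. A minimal sketch (my own; the step size and decay factor are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Step decay: multiply the learning rate by gamma every step_size epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... one epoch of training goes here ...
    scheduler.step()                          # lr: 0.1 -> 0.01 -> 0.001
```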
The bigger picture and current directions
We all know there are many successful networks today, for example VGGNet and GoogLeNet, both of which were landmark successes. From them we can see that network design, or network tuning, really moves in only two directions: 1. deeper networks; 2. more complex structures.
The original post compares what makes several successful networks special:
1. AlexNet (the landmark design that opened the deep learning era);
2. VGGNet (small filters give better non-linearity and fewer weights, and make a deeper architecture possible);
3. GoogLeNet (a new structure, inspired by NiN, network in network, that is at once deeper and more non-linear);
4. more recent architectures;
5. a comparison of the above.
Whatever you do to the network or its parameters, the comparison shows that the goal is always the same: make the network deeper and more non-linear.
That is everything I wanted to share about analyzing results and adjusting parameters. Finally, to everyone who is grinding through tuning right now: tuning is a tiring and tedious process, and every failure is only there to lead to a better result. Don't get discouraged and don't give up. Analyze every run; tuning has to be systematic, never random, because trying to shortcut it for quick wins is a very bad habit.
We only see the achievements of works like VGGNet and GoogLeNet, not the effort behind them. We need both a creative spirit and a solid theoretical foundation; those are the real rules of success.
I will post more practical notes and experience from time to time. Thank you all for your continued support.
11 Clever Methods of Overfitting and how to avoid them
Overfitting is the bane of Data Science in the age of Big Data.
John Langford reviews "clever" methods of overfitting, including traditional, parameter tweak, brittle measures, bad statistics, human-loop overfitting, and gives suggestions and directions for avoiding overfitting.
By John Langford (Microsoft, Hunch.net)
(Gregory Piatetsky: I recently came across this classic post from 2005 by John Langford, which addresses one of the most critical issues in Data Science. The problem of overfitting is a major bane of big data, and the issues described below are perhaps even more relevant than before. I have made several of these mistakes myself in the past. John agreed to repost it in KDnuggets, so enjoy and please comment if you find new methods.)
“Overfitting” is traditionally defined as training some flexible representation so that it memorizes the data but fails to predict well in the future. For this post, I will define overfitting more generally as over-representing the performance of systems. There are two styles of general overfitting: over-representing performance on particular datasets and (implicitly) over-representing performance of a method on future datasets.
We should all be aware of these methods, avoid them where possible, and take them into account otherwise. I have used “reproblem” and “old datasets”, and may have participated in “overfitting by review”—some of these are very difficult to avoid.
1. Traditional overfitting: Train a complex predictor on too-few examples.
Remedy:
Hold out pristine examples for testing.
Use a simpler predictor.
Get more training examples.
Integrate over many predictors.
Reject papers which do this.
2. Parameter tweak overfitting: Use a learning algorithm with many parameters. Choose the parameters based on the test set performance.
For example, choosing the features so as to optimize test set performance can achieve this.
Remedy: same as above
3. Brittle measure: Use a measure of performance which is especially brittle to overfitting.
Examples: “entropy”, “mutual information”, and leave-one-out cross-validation are all surprisingly brittle. This is particularly severe when used in conjunction with another approach.
Remedy: Prefer less brittle measures of performance.
4. Bad statistics: Misuse statistics to overstate confidences.
One common example is pretending that cross validation performance is drawn from an i.i.d. gaussian, then using standard confidence intervals. Cross validation errors are not independent. Another standard method is to make known-false assumptions about some system and then derive excessive confidence.
Remedy: Don’t do this. Reject papers which do this.
5. Choice of measure: Choose the best of Accuracy, error rate, (A)ROC, F1, percent improvement on the previous best, percent improvement of error rate, etc., for your method. For bonus points, use ambiguous graphs.
This is fairly common and tempting.
Remedy: Use canonical performance measures. For example, the performance measure directly motivated by the problem.
6. Incomplete Prediction: Instead of (say) making a multiclass prediction, make a set of binary predictions, then compute the optimal multiclass prediction.
Sometimes it’s tempting to leave a gap filled in by a human when you don’t otherwise succeed.
Remedy: Reject papers which do this.
7. Human-loop overfitting: Use a human as part of a learning algorithm and don’t take into account overfitting by the entire human/computer interaction.
This is subtle and comes in many forms. One example is a human using a clustering algorithm (on training and test examples) to guide learning algorithm choice.
Remedy: Make sure test examples are not available to the human.
8. Data set selection: Choose to report results on some subset of datasets where your algorithm performs well.
The reason why we test on natural datasets is because we believe there is some structure captured by the past problems that helps on future problems. Data set selection subverts this and is very difficult to detect.
Remedy: Use comparisons on standard datasets. Select datasets without using the test set. Good Contest performance can’t be faked this way.
9. Reprobleming: Alter the problem so that your performance improves.
For example, take a time series dataset and use cross validation. Or, ignore asymmetric false positive/false negative costs. This can be completely unintentional, for example when someone uses an ill-specified UCI dataset.
Remedy: Discount papers which do this. Make sure problem specifications are clear.
10. Old datasets: Create an algorithm for the purpose of improving performance on old datasets.
After a dataset has been released, algorithms can be made to perform well on the dataset using a process of feedback design, indicating better performance than we might expect in the future. Some conferences have canonical datasets that have been used for a decade…
Remedy: Prefer simplicity in algorithm design. Weight newer datasets higher in consideration. Making test examples not publicly available for datasets slows the feedback design process but does not eliminate it.
11. Overfitting by review: 10 people submit a paper to a conference. The one with the best result is accepted.
This is a systemic problem which is very difficult to detect or eliminate. We want to prefer presentation of good results, but doing so can result in overfitting.
Remedy: Be more pessimistic of confidence statements by papers at high rejection rate conferences.
Some people have advocated allowing the publishing of methods with poor performance. (I have doubts this would work.)
I have personally observed all of these methods in action, and there are doubtless others.
Selected comments on John's post:
Negative results:
Aleks Jakulin: How about an index of negative results in machine learning? There’s a Journal of Negative Results in other domains. A section on negative results in machine learning conferences? This kind of information is very useful in preventing people from taking pathways that lead nowhere: if one wants to classify an algorithm into good/bad, one certainly benefits from unexpectedly bad examples too, not just unexpectedly good examples.
I visited the workshop on negative results at NIPS 2002. My impression was that it did not work well.
The difficulty with negative results in machine learning is that they are too easy. For example, there are a plethora of ways to say that “learning is impossible (in the worst case)”. On the applied side, it’s still common for learning algorithms to not work on simple-seeming problems. In this situation, positive results (this works) are generally more valuable than negative results (this doesn’t work).
Brittle measures
What do you mean by “brittle”? Why is mutual information brittle?
What I mean by brittle: Suppose you have a box which takes some feature values as input and predicts some probability of label 1 as output. You are not allowed to open this box or determine how it works other than by this process of giving it inputs and observing outputs.
Let x be an input.
Let y be an output.
Assume (x,y) are drawn from a fixed but unknown distribution D.
Let p(x) be a prediction.
For classification error I(|y - p(x)| > 0.5) you can prove a theorem of the rough form:
for all D, with high probability over the draw of m examples independently from D,
expected classification error rate of the box with respect to D is bounded by a function of the observations.
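One concrete instance of such a statement (my illustration via Hoeffding's inequality for a fixed predictor p and a 0/1 loss, not a quote from the post): with probability at least 1 - delta over the m i.i.d. draws from D,

```latex
\mathbb{E}_{(x,y)\sim D}\!\left[\, I\big(|y - p(x)| > 0.5\big) \right]
\;\le\;
\frac{1}{m}\sum_{i=1}^{m} I\big(|y_i - p(x_i)| > 0.5\big)
\;+\;
\sqrt{\frac{\ln(1/\delta)}{2m}}
```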
What I mean by “brittle” is that no statement of this sort can be made for any unbounded loss (including log-loss which is integral to mutual information and entropy). You can of course open up the box and analyze its structure or make extra assumptions about D to get a similar but inherently more limited analysis.
The situation with leave-one-out cross validation is not so bad, but it’s still pretty bad. In particular, there exists a very simple learning algorithm/problem pair with the property that the leave-one-out estimate has the variance and deviations of a single coin flip. Yoshua Bengio and Yves Grandvalet in fact proved that there is no unbiased estimator of the variance of K-fold cross-validation. The paper that I pointed to above shows that for K-fold cross validation on m examples, all moments of the deviations might only be as good as on a test set of size m/K.
I’m not sure what a ‘valid summary’ is, but leave-one-out cross validation can not provide results I trust, because I know how to break it.
I have personally observed people using leave-one-out cross validation with feature selection to quickly achieve a severe overfit.
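That last failure mode is easy to reproduce; a sketch (mine, not from the post) using scikit-learn: on purely random data, selecting features with the full dataset and then running leave-one-out cross-validation reports accuracy far above the 50% chance level.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(50, 1000)                 # 50 samples, 1000 pure-noise features
y = rng.randint(0, 2, size=50)          # random binary labels: true accuracy is 50%

# Wrong: select the 20 "best" features using ALL the data (labels included) ...
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)

# ... then evaluate with leave-one-out cross-validation on the selected features.
scores = cross_val_score(LogisticRegression(), X_sel, y, cv=LeaveOneOut())
print("LOO accuracy on pure noise:", scores.mean())   # typically well above 0.5
```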