TensorFlow 2.0 Review Notes 2
Vanishing/Exploding Gradients
Problem: Gradients often get smaller and smaller as the algorithm progresses down to the lower layers.
Cause: the variance of each layer's outputs ends up larger than the variance of its inputs; the more layers there are, the more the variance grows, until the activation functions in the top layers saturate.
- For example, when the inputs grow large in scale, the sigmoid activation outputs values close to 0 or 1, where its derivative is nearly 0, so gradient descent makes almost no progress.
- One solution: Glorot and He initialization: the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs (ideally); initialize the weights with $\text{mean} = 0$ and the variance given below.
- $\text{fan}_{\text{in}}$ and $\text{fan}_{\text{out}}$ denote the input and output sizes of a hidden layer's weight matrix $W$:
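For reference, the variances used are the standard values from the Glorot and He papers:

$$\text{Glorot: } \sigma^2 = \frac{1}{\text{fan}_{\text{avg}}}, \quad \text{fan}_{\text{avg}} = \frac{\text{fan}_{\text{in}} + \text{fan}_{\text{out}}}{2}; \qquad \text{He: } \sigma^2 = \frac{2}{\text{fan}_{\text{in}}}$$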
- API: in tf.keras, the initializer is selected through a layer's `kernel_initializer` argument, as sketched below.
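A minimal sketch of selecting these initializers with the tf.keras API (the 100-unit layer size is a placeholder); `he_normal`, `glorot_uniform` (the Keras default), and `keras.initializers.VarianceScaling` are the actual Keras names:

```python
import tensorflow as tf
from tensorflow import keras

# He initialization, usually paired with ReLU-family activations
dense_he = keras.layers.Dense(100, activation="relu",
                              kernel_initializer="he_normal")

# Glorot initialization ("glorot_uniform") is the Keras default.
# A fan_avg-based variant can be built explicitly with VarianceScaling:
he_avg_init = keras.initializers.VarianceScaling(scale=2., mode="fan_avg",
                                                 distribution="uniform")
dense_custom = keras.layers.Dense(100, activation="relu",
                                  kernel_initializer=he_avg_init)
```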
Problems with Activation Functions
- ReLU: Problem: dying ReLUs: during training, some neurons effectively "die," meaning they stop outputting anything other than 0.
- Solutions:
- leaky ReLU:
- Advantage: avoids dying neurons. [Empirically: setting α = 0.2 (a huge leak) seemed to result in better performance than α = 0.01 (a small leak).]
- randomized leaky ReLU (RReLU):
- Advantage: acts like a regularizer, reducing overfitting on the training set.
- parametric leaky ReLU (PReLU):
- Advantage: performs well on large datasets;
- Disadvantage: tends to overfit the training set on small datasets.
- exponential linear unit (ELU):
- Advantage: outperforms ReLU and its variants.
- Disadvantage: slower to compute (because of the exponential), so training is slower.
- Scaled ELU (SELU):
- Advantages:
- 1. Performs well on deep networks;
- 2. Self-normalizes: the output of each layer will tend to preserve a mean of 0 and standard deviation of 1 during training;
- 3. Can be used in CNNs. Disadvantage: self-normalization is not guaranteed for RNNs or networks with skip connections.
- Requirement 1: The input features must be standardized (mean 0 and standard deviation 1).
- Requirement 2: Every hidden layer's weights must be initialized with LeCun normal initialization.
Rule of thumb: SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic.
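A minimal tf.keras sketch of wiring these activations up (the layer sizes and the 28×28 input shape are placeholders): LeakyReLU and PReLU are applied as separate layers after a Dense layer with no activation, while ELU and SELU can be passed directly as the `activation` argument, SELU together with LeCun normal initialization:

```python
import tensorflow as tf
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),

    # Leaky ReLU / PReLU are separate layers placed after a Dense layer with no activation
    keras.layers.Dense(300, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.PReLU(),

    # ELU can be passed directly as the activation function
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),

    # SELU requires LeCun normal initialization (and standardized inputs)
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),

    keras.layers.Dense(10, activation="softmax"),
])
```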
Batch Normalization
Problem: the strategies above help prevent unstable gradients at the beginning of training, but gradients can still become unstable later during training.
Solution: Batch Normalization (BN): add an operation in the model just before or after the activation function of each hidden layer.
How it works: evaluate the mean and standard deviation of the input over the current mini-batch:
- For each input feature, compute the mean and standard deviation over the current mini-batch, normalize the input with them, then apply a learned, parameterized scale and shift.
- Batch Normalization acts like a regularizer, reducing the need for other regularization techniques
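A minimal sketch of adding `keras.layers.BatchNormalization` in tf.keras (layer sizes and input shape are placeholders), showing BN placed either before the activation (Dense with no activation, then BN, then the activation) or after it:

```python
import tensorflow as tf
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),          # BN can also normalize the raw inputs

    # BN placed *before* the activation: Dense without activation, then BN, then activation;
    # the bias is redundant because BN already includes a learned shift
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),

    # BN placed *after* the activation
    keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),

    keras.layers.Dense(10, activation="softmax"),
])
```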