Initialization and normalization tricks for deep nets are practical methods that make training stable, fast, and reliable. Initialization sets the starting scale and orientation of the weights, while normalization keeps activations and gradients in a healthy range during optimization. Together they prevent vanishing or exploding signals, reduce sensitivity to learning rates, and improve generalization. This guide explains the theory alongside actionable tips that you can use in everyday models, from convolutional networks to transformers and recurrent networks. It is written for all levels and centers on clean mental models and defaults that work. The ten tricks below give the guide its structure.
#1 Xavier Glorot initialization
Use Xavier or Glorot initialization when activations are roughly symmetric around zero and saturating or linear. It balances variance by matching fan in and fan out so that forward and backward signal magnitudes stay comparable. For fully connected layers and tanh networks it is a simple default. For convolutional kernels compute fan terms from kernel size times input or output channels. Choose uniform or normal variants with the same variance, and keep biases at zero. This prevents early layer collapse, reduces learning rate sensitivity, and supports deeper stacks before adding normalization. It also makes gradient clipping rarely necessary in early training.
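A minimal PyTorch sketch of this default; the helper name init_xavier and the toy tanh model are illustrative, not prescribed by any particular paper:

```python
import torch.nn as nn

def init_xavier(module: nn.Module) -> None:
    """Xavier/Glorot initialization for linear and convolutional layers."""
    if isinstance(module, (nn.Linear, nn.Conv2d)):
        nn.init.xavier_uniform_(module.weight)  # variance set from fan in and fan out
        if module.bias is not None:
            nn.init.zeros_(module.bias)         # keep biases at zero

model = nn.Sequential(nn.Linear(256, 256), nn.Tanh(), nn.Linear(256, 10))
model.apply(init_xavier)                        # applies the rule to every submodule
```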
#2 He initialization for rectifiers
Prefer He or Kaiming initialization for rectified activations such as ReLU, GELU, and leaky ReLU. It scales variance by fan in and accounts for the drop in activation mass after rectification, keeping the mean near zero and the variance stable. Use the correct fan mode in your framework; fan in with a normal distribution is the usual choice for deep convolutional stacks. For leaky units, follow the variant that includes the negative slope. Keep biases small or zero. This choice reduces dying units, supports larger learning rates, and shortens warmup schedules for image and language models. It also minimizes early layer output skew, which protects batch statistics.
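A sketch in PyTorch, assuming a plain ReLU stack; pass a negative slope for leaky units. The helper name and layer sizes are illustrative:

```python
import torch.nn as nn

def init_he(module: nn.Module, negative_slope: float = 0.0) -> None:
    """Kaiming (He) fan in initialization for rectifier networks."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(
            module.weight,
            a=negative_slope,  # slope of the negative part; 0.0 for plain ReLU
            mode="fan_in",
            nonlinearity="leaky_relu" if negative_slope > 0 else "relu",
        )
        if module.bias is not None:
            nn.init.zeros_(module.bias)

net = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
net.apply(init_he)
```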
#3 Orthogonal initialization
Orthogonal initialization creates weight matrices with orthonormal columns, preserving the norm of signals across layers at initialization. It is valuable for recurrent networks and deep linear or convolutional blocks where gradient norms drift over time. Use it when you stack many layers without residual connections or when you observe unstable singular values in weight matrices. Combine it with a gain factor that matches the activation function, such as the ReLU gain from a framework helper. Keep biases at zero. This method can improve conditioning, delay the onset of exploding or vanishing gradients, and stabilize long horizon credit assignment. It pairs well with small learning rates during the first epochs.
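For example, in PyTorch the gain helper and the orthogonal initializer cover both the dense and the recurrent case; the layer sizes below are arbitrary:

```python
import torch.nn as nn

# Deep stack without residual connections: match the gain to the activation.
dense = nn.Linear(512, 512)
nn.init.orthogonal_(dense.weight, gain=nn.init.calculate_gain("relu"))
nn.init.zeros_(dense.bias)

# Recurrent weights benefit most from norm preservation across time steps.
lstm = nn.LSTM(input_size=128, hidden_size=256)
for name, param in lstm.named_parameters():
    if "weight_hh" in name:      # hidden-to-hidden matrices
        nn.init.orthogonal_(param)
    elif "bias" in name:
        nn.init.zeros_(param)
```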
#4 LSUV data driven scaling
Layer Sequential Unit Variance (LSUV) is a data driven refinement that tunes scales after a standard initialization, usually an orthonormal one. You pass a mini batch through the network without training and rescale each layer so that its pre activation outputs have unit variance. This aligns activation statistics early, reducing the burden on normalization layers and improving convergence on very deep nets. It is especially helpful when input distributions are unusual or when you mix activations. Apply it once before training, cache the scales, and keep biases at zero. It improves gradient flow and makes learning rates easier to choose.
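A simplified sketch of the idea, assuming the model is an nn.Sequential and skipping the orthonormal pre-initialization step of the full method; the tolerance and iteration cap are illustrative:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def lsuv_init(model: nn.Sequential, batch: torch.Tensor,
              tol: float = 0.02, max_iters: int = 10) -> None:
    """Rescale each parametric layer until its outputs have roughly unit variance."""
    x = batch
    for layer in model:
        if hasattr(layer, "weight") and layer.weight is not None:
            for _ in range(max_iters):
                std = layer(x).std()
                if abs(std - 1.0) < tol:
                    break
                layer.weight.div_(std)  # shrink or grow the layer toward unit output variance
        x = layer(x)                    # propagate the cached mini batch to the next layer

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256), nn.ReLU())
lsuv_init(model, torch.randn(64, 128))
```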
#5 Bias initialization with priors
Bias initialization using prior probabilities can reduce early loss spikes and speed convergence in classification and detection. For a sigmoid or logistic output, set the bias so that the initial output equals the base rate of positives, using the logit of the prior. For softmax, start with small biases that reflect class imbalance or a temperature scaled prior. Hidden layer biases should usually be zero to avoid unintended shifts. This simple step stabilizes gradients, mitigates label imbalance shock, and reduces the need for long warmup. It also makes early metrics interpretable and less noisy. It is quick to compute.
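For a sigmoid head, the computation is one line; the 1 percent prior below is a placeholder for your dataset's measured base rate, and the head dimensions are illustrative:

```python
import math
import torch.nn as nn

prior = 0.01                                     # assumed positive rate; measure it on your data
head = nn.Linear(256, 1)                         # sigmoid / logistic output head
nn.init.normal_(head.weight, std=0.01)           # small weights so the prior dominates at first
nn.init.constant_(head.bias, math.log(prior / (1 - prior)))  # logit of the prior
```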
#6 Batch normalization tips
Batch normalization standardizes layer inputs using mini batch statistics and learned scale and shift parameters. It reduces internal covariate shift and allows higher learning rates, which enables deeper networks and faster training. Choose the momentum of the running statistics based on how many update steps you take per epoch, and keep epsilon small but not tiny to avoid numerical issues. During evaluation, freeze the running averages and switch layers to evaluation mode to avoid data leakage. Place batch normalization after the linear transformation and before the activation in most residual blocks. Consider zero initializing the final gamma in residual branches so each block starts as an identity. This improves early stability and training predictability.
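A generic residual block sketch showing the conv, batch norm, activation ordering and the zero initialized final gamma; the block layout is an example, not a specific published architecture:

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        nn.init.zeros_(self.bn2.weight)  # final gamma = 0: the branch starts as an identity

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))  # linear -> norm -> activation
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)                 # residual add, then activation
```

Remember to call model.eval() at inference time so the frozen running averages are used instead of batch statistics.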
#7 Layer normalization practice
Layer normalization normalizes across features for each example, which makes it independent of the mini batch size. It is effective in transformers, recurrent networks, and any setting with very small or variable batches. Place it as pre normalization before attention and feed forward blocks to ease optimization and enable deeper stacks. Tune the epsilon carefully to avoid degradation in half precision. When replacing batch normalization, consider learning rate and warmup adjustments because gradient noise changes. You can also zero initialize residual branch scales to start close to identity. The result is robust training across hardware and schedules.
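A pre normalization transformer block sketch; the dimensions, epsilon default, and the four times MLP expansion are illustrative choices, not requirements:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, dim: int, heads: int, eps: float = 1e-5):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, eps=eps)  # eps matters in half precision
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, eps=eps)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)                                   # normalize before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around attention
        return x + self.mlp(self.norm2(x))                  # residual around feed forward
```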
#8 Group and instance normalization
Group normalization bridges batch and layer normalization by computing statistics over groups of channels, which works well for vision models and small batches. Choose the number of groups to balance flexibility and numerical stability, such as dividing channels into eight or sixteen groups. For style transfer or generative tasks, instance normalization can remove instance specific contrast and color trends. Both methods eliminate cross example coupling, so they are stable under distributed training with uneven batch splits. They also allow deterministic behavior, which simplifies debugging. Combine with weight decay and careful learning rate schedules for best regularization and speed.
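In PyTorch both are drop in modules; eight groups for 64 channels is just one reasonable split, and the small stack below is illustrative:

```python
import torch.nn as nn

channels = 64
block = nn.Sequential(
    nn.Conv2d(3, channels, 3, padding=1, bias=False),
    nn.GroupNorm(num_groups=8, num_channels=channels),  # statistics over channel groups, per example
    nn.ReLU(inplace=True),
)

# Instance normalization for style transfer or generative pipelines.
inorm = nn.InstanceNorm2d(channels, affine=True)        # per example, per channel statistics
```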
#9 Weight normalization and centering
Weight normalization reparameterizes weights into a direction and a magnitude, which decouples scale from orientation and simplifies optimization. It can accelerate convergence when batch statistics are noisy or unavailable. Combine it with mean centering of activations to keep pre activation means near zero. This combination works well in reinforcement learning agents and sequence models where batch normalization is a poor fit. It also improves interpretability of learning rate effects because rescaling is explicit. Initialize magnitude parameters to match the desired variance and keep biases at zero. Monitor gradient norms to ensure the reparameterization is helping rather than hurting.
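A sketch using the classic PyTorch reparameterization helper; the layer size is arbitrary and the activation mean centering discussed above is left out here:

```python
import torch.nn as nn
from torch.nn.utils import weight_norm

# Reparameterize w = g * v / ||v||, separating magnitude g from direction v.
layer = weight_norm(nn.Linear(512, 512), name="weight")
nn.init.zeros_(layer.bias)

# Magnitude and direction are exposed as separate parameters you can inspect or rescale.
print(layer.weight_g.shape, layer.weight_v.shape)
```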
#10 Fixup for norm free ResNets
Fixup is an initialization strategy for residual networks that removes normalization while keeping training stable. It scales certain layers by depth dependent constants, sets final residual branch scales to zero, and includes learned scalar multipliers so that each block starts as an identity mapping. With proper learning rate and data augmentation, it matches batch normalization on many tasks and simplifies deployment because there are no running statistics. Use it when normalization costs are high or when sequence length or batch size is very small. It also reduces numerical coupling across devices, which helps distributed training.
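A simplified Fixup style block, assuming two convolutions per residual branch, so the first branch weight is rescaled by the number of residual blocks raised to the power -1/(2m-2) = -0.5; the full recipe also covers downsampling paths and the classifier head:

```python
import torch
import torch.nn as nn

class FixupBasicBlock(nn.Module):
    def __init__(self, channels: int, num_blocks: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        # Scalar biases and a scalar multiplier stand in for affine normalization parameters.
        self.bias1a = nn.Parameter(torch.zeros(1))
        self.bias1b = nn.Parameter(torch.zeros(1))
        self.bias2a = nn.Parameter(torch.zeros(1))
        self.bias2b = nn.Parameter(torch.zeros(1))
        self.scale = nn.Parameter(torch.ones(1))
        self.relu = nn.ReLU(inplace=True)
        # Depth dependent rescaling of the first conv, zero init of the last one.
        nn.init.kaiming_normal_(self.conv1.weight, mode="fan_in", nonlinearity="relu")
        with torch.no_grad():
            self.conv1.weight.mul_(num_blocks ** -0.5)
        nn.init.zeros_(self.conv2.weight)   # the branch contributes nothing at step zero

    def forward(self, x):
        out = self.relu(self.conv1(x + self.bias1a) + self.bias1b)
        out = self.conv2(out + self.bias2a) * self.scale + self.bias2b
        return self.relu(out + x)           # each block starts as an identity mapping
```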