Neural Networks

Last updated: December 2024

Disclaimer: These are my personal notes compiled for my own reference and learning. They may contain errors, incomplete information, or personal interpretations. While I strive for accuracy, these notes are not peer-reviewed and should not be considered authoritative sources. Please consult official textbooks, research papers, or other reliable sources for academic or professional purposes.

1. Introduction to Neural Networks

Neural networks are computational models inspired by biological neural networks. They consist of interconnected nodes (neurons) that process information.

2. Perceptron

The perceptron is the simplest neural network: a single neuron that computes a weighted sum of its inputs and passes it through an activation function:

$$y = f\left(\sum_{i=1}^n w_i x_i + b\right) = f(\mathbf{w}^T\mathbf{x} + b)$$

where $f$ is the activation function, $\mathbf{w}$ are weights, $\mathbf{x}$ are inputs, and $b$ is the bias.
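
A minimal NumPy sketch of this computation (the input, weights, and step activation below are illustrative values, not from any dataset):

import numpy as np

def perceptron(x, w, b, f=lambda z: np.where(z > 0, 1.0, 0.0)):
    """Single neuron: weighted sum of inputs plus bias, passed through f."""
    return f(np.dot(w, x) + b)

# Illustrative 3-input neuron with a step activation
x = np.array([1.0, 0.5, -0.2])
w = np.array([0.4, -0.6, 0.9])
b = 0.1
print(perceptron(x, w, b))  # 1.0 or 0.0 depending on the sign of w.x + b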

3. Activation Functions

3.1 Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Output range: $(0, 1)$. Smooth but suffers from vanishing gradient problem.

3.2 Hyperbolic Tangent (tanh)

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Output range: $(-1, 1)$. Zero-centered but still has vanishing gradient issues.

3.3 ReLU (Rectified Linear Unit)

$$\text{ReLU}(z) = \max(0, z)$$

The most widely used activation. Simple and cheap to compute, but neurons whose inputs stay negative output zero everywhere and stop learning (the "dying ReLU" problem).

3.4 Leaky ReLU

$$\text{LeakyReLU}(z) = \max(\alpha z, z)$$

where $\alpha$ is a small positive constant (e.g., 0.01).
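
The four activations above written as NumPy functions (a minimal sketch; $\alpha = 0.01$ is just the common default):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))         # output in (0, 1)

def tanh(z):
    return np.tanh(z)                        # output in (-1, 1), zero-centered

def relu(z):
    return np.maximum(0.0, z)                # zero for all negative inputs

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)     # small slope alpha below zero

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), leaky_relu(z), sep="\n")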

4. Multi-Layer Perceptron (MLP)

A neural network with one or more hidden layers between the input and output layers. Each layer applies an affine transformation followed by a nonlinear activation; stacking layers lets the network represent non-linear functions that a single perceptron cannot.

5. Forward Propagation

For a network with $L$ layers:

$$\mathbf{a}^{[0]} = \mathbf{x}$$
$$\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$$
$$\mathbf{a}^{[l]} = f^{[l]}(\mathbf{z}^{[l]})$$

for $l = 1, 2, \ldots, L$.
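
A vectorized sketch of these three equations (the 3-4-2 layer sizes, random weights, and ReLU at every layer are assumptions for illustration):

import numpy as np

def forward(x, weights, biases, f=lambda z: np.maximum(0, z)):
    """Apply a[l] = f(W[l] a[l-1] + b[l]) for l = 1, ..., L."""
    a = x                                    # a[0] = x
    for W, b in zip(weights, biases):
        z = W @ a + b                        # z[l] = W[l] a[l-1] + b[l]
        a = f(z)                             # a[l] = f(z[l])
    return a

# Illustrative 3-4-2 network with random parameters
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros((4, 1)), np.zeros((2, 1))]
x = rng.standard_normal((3, 1))
print(forward(x, weights, biases).shape)     # (2, 1)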

6. Loss Functions

6.1 Mean Squared Error (Regression)

$$L(\mathbf{y}, \hat{\mathbf{y}}) = \frac{1}{2m}\sum_{i=1}^m (y_i - \hat{y}_i)^2$$

6.2 Cross-Entropy (Classification)

$$L(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{m}\sum_{i=1}^m \sum_{j=1}^c y_{ij} \log(\hat{y}_{ij})$$
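
Both losses written out in NumPy (the epsilon guarding $\log(0)$ is an added numerical-stability assumption):

import numpy as np

def mse(y, y_hat):
    """Mean squared error: 1/(2m) times the sum of squared residuals."""
    m = y.shape[0]
    return np.sum((y - y_hat) ** 2) / (2 * m)

def cross_entropy(Y, Y_hat, eps=1e-12):
    """Multi-class cross-entropy; Y and Y_hat have shape (m, c), rows one-hot."""
    m = Y.shape[0]
    return -np.sum(Y * np.log(np.clip(Y_hat, eps, 1.0))) / m

print(mse(np.array([1.0, 2.0]), np.array([1.1, 1.8])))             # 0.0125
Y = np.array([[1, 0, 0], [0, 1, 0]])                               # one-hot labels
Y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])               # predicted probabilities
print(cross_entropy(Y, Y_hat))                                     # ~0.29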

7. Backpropagation

Algorithm to compute gradients using chain rule:

$$\frac{\partial L}{\partial \mathbf{W}^{[l]}} = \frac{\partial L}{\partial \mathbf{z}^{[l]}} \cdot \frac{\partial \mathbf{z}^{[l]}}{\partial \mathbf{W}^{[l]}} = \delta^{[l]} (\mathbf{a}^{[l-1]})^T$$
$$\frac{\partial L}{\partial \mathbf{b}^{[l]}} = \delta^{[l]}$$

where $\delta^{[l]} = \frac{\partial L}{\partial \mathbf{z}^{[l]}}$ is the error at layer $l$.

8. Gradient Descent

8.1 Batch Gradient Descent

$$\mathbf{W} := \mathbf{W} - \alpha \frac{\partial L}{\partial \mathbf{W}}$$

8.2 Stochastic Gradient Descent (SGD)

Update parameters using one sample at a time.

8.3 Mini-batch Gradient Descent

Update parameters using small batches of samples; this is the standard compromise between batch and stochastic updates (see the sketch below).
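
All three variants share the same update rule and differ only in how many examples feed each step. A sketch of the mini-batch loop (compute_gradients is a hypothetical stand-in for backpropagation; batch_size = m recovers batch gradient descent and batch_size = 1 recovers SGD):

import numpy as np

def minibatch_gd(W, X, y, compute_gradients, alpha=0.01, batch_size=32, epochs=10):
    """Generic mini-batch loop; examples are the columns of X."""
    m = X.shape[1]
    for _ in range(epochs):
        perm = np.random.permutation(m)           # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            dW = compute_gradients(W, X[:, idx], y[:, idx])
            W = W - alpha * dW                    # W := W - alpha * dL/dW
    return W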

9. Regularization Techniques

9.1 L2 Regularization (Weight Decay)

$$L_{\text{reg}} = L + \frac{\lambda}{2}\sum_{l=1}^L \|\mathbf{W}^{[l]}\|_F^2$$
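
In code, the penalty adds $\lambda \mathbf{W}^{[l]}$ to each weight gradient, which is why it is also called weight decay. A minimal sketch (function names are illustrative):

import numpy as np

def l2_penalty(weights, lam):
    """Extra loss term: (lambda / 2) * sum of squared Frobenius norms."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

def l2_gradient(W, lam):
    """Penalty's contribution to dL/dW, added to the data-term gradient."""
    return lam * W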

9.2 Dropout

Randomly set each neuron's output to zero during training with probability $p$; activations are rescaled so their expected value is unchanged (see the sketch below).
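
A sketch of inverted dropout, which does the rescaling during training so nothing changes at test time (the inverted-dropout variant is an implementation choice, not stated above):

import numpy as np

def dropout_forward(a, p, training=True):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return a                                   # no dropout at test time
    mask = (np.random.rand(*a.shape) >= p) / (1.0 - p)
    return a * mask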

9.3 Batch Normalization

Normalize the inputs to each layer using mini-batch statistics; a learnable scale $\gamma$ and shift $\beta$ are then applied:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$
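
A training-time sketch of the normalization above, applied per feature over a mini-batch and followed by the learnable scale and shift:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x has shape (features, batch); normalize each feature over the batch."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance per feature
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.randn(3, 8)                                  # 3 features, batch of 8
gamma, beta = np.ones((3, 1)), np.zeros((3, 1))
print(batchnorm_forward(x, gamma, beta).mean(axis=1))      # approximately zero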

10. Convolutional Neural Networks (CNNs)

Specialized for processing grid-like data (images).

10.1 Convolution Operation

$$(f * g)(t) = \sum_{m=-\infty}^{\infty} f(m)g(t-m)$$
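
For images the operation is applied in two dimensions, and most implementations compute cross-correlation (the kernel is not flipped) while still calling it convolution. A naive sketch with stride 1 and no padding:

import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: slide the kernel over the image."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, -1.0]])             # simple horizontal-difference filter
print(conv2d(image, kernel))                 # shape (5, 4), constant -1 for this input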

10.2 Pooling

Reduce the spatial dimensions of feature maps by summarizing local regions; max pooling, for example, keeps the maximum value in each window (see the sketch below).
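
A minimal max-pooling sketch with non-overlapping 2×2 windows (the window size is the common default, assumed here):

import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; H and W are assumed divisible by size."""
    H, W = x.shape
    blocks = x.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))                         # 2x2 output: max of each 2x2 block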

11. Recurrent Neural Networks (RNNs)

Networks with memory for sequential data:

$$\mathbf{h}_t = f(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$$
$$\mathbf{y}_t = \mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y$$
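
The recurrence above as a loop over time steps (the layer sizes and the choice of tanh for $f$ are assumptions for illustration):

import numpy as np

def rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence of column-vector inputs xs."""
    h = np.zeros((W_hh.shape[0], 1))                   # initial hidden state h_0 = 0
    ys = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)         # hidden-state update
        ys.append(W_hy @ h + b_y)                      # output at this time step
    return ys, h

rng = np.random.default_rng(2)
W_hh, W_xh, W_hy = rng.standard_normal((4, 4)), rng.standard_normal((4, 3)), rng.standard_normal((2, 4))
b_h, b_y = np.zeros((4, 1)), np.zeros((2, 1))
xs = [rng.standard_normal((3, 1)) for _ in range(5)]   # input sequence of length 5
ys, h_T = rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y)
print(len(ys), ys[0].shape)                            # 5 (2, 1)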

11.1 LSTM (Long Short-Term Memory)

Addresses vanishing gradient problem in RNNs with gating mechanisms.
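
A sketch of a single LSTM step, with the four gates packed into one weight matrix (the packing and variable names are implementation choices, not from these notes):

import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step; W has shape (4*hidden, hidden + input), b (4*hidden, 1)."""
    n = h_prev.shape[0]
    z = W @ np.vstack([h_prev, x]) + b
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    f = sig(z[:n])               # forget gate: how much of the old cell state to keep
    i = sig(z[n:2 * n])          # input gate: how much new content to write
    o = sig(z[2 * n:3 * n])      # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * n:])       # candidate cell content
    c = f * c_prev + i * g       # additive update helps gradients flow through time
    h = o * np.tanh(c)           # new hidden state
    return h, c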

11.2 GRU (Gated Recurrent Unit)

Simplified version of LSTM with fewer parameters.

12. Deep Learning Architectures

12.1 Autoencoders

Learn compressed representations of data.

12.2 Generative Adversarial Networks (GANs)

Two networks competing: generator and discriminator.

12.3 Transformers

Attention-based models for sequence processing.

13. Training Tips

14. Code Example

# Python implementation of a simple neural network
import numpy as np

class NeuralNetwork:
    def __init__(self, layers):
        """
        layers: list of layer sizes [input_size, hidden1, hidden2, ..., output_size]
        """
        self.layers = layers
        self.num_layers = len(layers)
        
        # Initialize weights and biases
        self.weights = []
        self.biases = []
        
        for i in range(1, self.num_layers):
            # He initialization: scale by sqrt(2 / fan_in), suited to ReLU hidden layers
            w = np.random.randn(layers[i], layers[i-1]) * np.sqrt(2.0 / layers[i-1])
            b = np.zeros((layers[i], 1))
            self.weights.append(w)
            self.biases.append(b)
    
    def sigmoid(self, z):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def sigmoid_derivative(self, z):
        """Derivative of sigmoid function"""
        s = self.sigmoid(z)
        return s * (1 - s)
    
    def relu(self, z):
        """ReLU activation function"""
        return np.maximum(0, z)
    
    def relu_derivative(self, z):
        """Derivative of ReLU function"""
        return (z > 0).astype(float)
    
    def forward_propagation(self, X):
        """Forward pass through the network"""
        activations = [X]
        zs = []
        
        for i in range(self.num_layers - 1):
            z = np.dot(self.weights[i], activations[-1]) + self.biases[i]
            zs.append(z)
            
            if i < self.num_layers - 2:  # Hidden layers
                a = self.relu(z)
            else:  # Output layer
                a = self.sigmoid(z)
            
            activations.append(a)
        
        return activations, zs
    
    def backward_propagation(self, X, y, activations, zs):
        """Backward pass to compute gradients"""
        m = X.shape[1]  # Number of examples
        
        # Initialize gradients
        dW = [np.zeros(w.shape) for w in self.weights]
        db = [np.zeros(b.shape) for b in self.biases]
        
        # Output layer error
        delta = activations[-1] - y
        
        # Backpropagate the error
        for i in range(self.num_layers - 2, -1, -1):
            dW[i] = (1/m) * np.dot(delta, activations[i].T)
            db[i] = (1/m) * np.sum(delta, axis=1, keepdims=True)
            
            if i > 0:  # Not the first layer
                delta = np.dot(self.weights[i].T, delta) * self.relu_derivative(zs[i-1])
        
        return dW, db
    
    def update_parameters(self, dW, db, learning_rate):
        """Update weights and biases"""
        for i in range(len(self.weights)):
            self.weights[i] -= learning_rate * dW[i]
            self.biases[i] -= learning_rate * db[i]
    
    def compute_cost(self, y_pred, y_true):
        """Compute binary cross-entropy cost"""
        m = y_true.shape[1]
        # Clip predictions away from 0 and 1 to avoid log(0)
        eps = 1e-12
        y_pred = np.clip(y_pred, eps, 1 - eps)
        cost = -(1/m) * np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return cost
    
    def train(self, X, y, epochs, learning_rate):
        """Train the neural network"""
        costs = []
        
        for epoch in range(epochs):
            # Forward propagation
            activations, zs = self.forward_propagation(X)
            
            # Compute cost
            cost = self.compute_cost(activations[-1], y)
            costs.append(cost)
            
            # Backward propagation
            dW, db = self.backward_propagation(X, y, activations, zs)
            
            # Update parameters
            self.update_parameters(dW, db, learning_rate)
            
            if epoch % 100 == 0:
                print(f"Cost after epoch {epoch}: {cost}")
        
        return costs
    
    def predict(self, X):
        """Make predictions"""
        activations, _ = self.forward_propagation(X)
        return activations[-1] > 0.5

# Example usage
if __name__ == "__main__":
    # Generate sample data (XOR problem)
    X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]])
    y = np.array([[0, 1, 1, 0]])
    
    # Create and train network
    nn = NeuralNetwork([2, 4, 1])
    costs = nn.train(X, y, epochs=1000, learning_rate=1.0)
    
    # Make predictions
    predictions = nn.predict(X)
    print("Predictions:", predictions.astype(int))
    print("Actual:", y.astype(int))

15. References