Source: ‘Deep Learning’ course, by Andrew Ng in Coursera & deeplearning.ai
Shallow Networks are successful at solving many problems
But they are not free from limitations:
Limited ability to model complex patterns.
Struggle to capture non-linear relationships
Limited expressiveness for tasks like image recognition and natural language processing.
Prone to overfitting with small datasets.
Inability to learn hierarchical features.
Why Deep Learning Now?
“Deep Neural networks” are NNs with several hidden layers.
The real shift from shallow networks to DNNs did not come (only) from noticing that networks with more layers performed better, in spite of their huge number of parameters.
It came mainly from realizing that:
While some tasks, such as digit recognition, could be solved decently well using a “brute force” approach,
Other, more complex tasks, such as distinguishing a human face in an image, are hard to solve with that “brute force” approach,
But they can be solved using Deep Neural Networks.
‘Source: Generative Deep Learning. David Foster (Fig. 2.1)’
Task: Distinguish human from non-human in an image
Source: ‘Neural Networks and Deep Learning’ course, by Michael Nielsen
The task can be attacked in the same way as digit identification,
But the cost of training would be much higher.
Alternatively: try to solve the problem hierarchically.
Source: ‘Deep Learning’ course, by Andrew Ng in Coursera & deeplearning.ai
Each layer has a more complex task, but it receives better information.
If we can solve the sub-problems using ANNs,
We may be able to combine those NNs into a bigger network to solve the problem, here face-detection.
The success of Deep Neural Networks (DNNs) is largely due to their ability to automatically learn and adjust the complex hierarchy of representations, without requiring handcrafted feature extraction or manual tuning of weights and biases.
This shift from manual feature engineering to automatic learning was enabled by new or improved techniques such as:
Stochastic Gradient Descent (SGD) and Backpropagation: Allow learning optimal parameters through iterative updates.
Better Weight Initialization (e.g., Xavier/He initialization): Improve convergence and training stability.
Regularization techniques (Dropout, Batch Normalization, L2 penalties): Prevent overfitting and stabilize learning.
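The three ingredients listed above can be seen working together in a very small example. The following is a minimal sketch, assuming NumPy and a toy regression task (predicting the product of two inputs). The function names `he_init`, `forward` and `train`, the network size and all hyperparameters are illustrative choices, not something taken from the course material.

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    """He initialization: zero-mean Gaussian with variance 2 / fan_in."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def forward(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1             # hidden pre-activations
    H = np.maximum(Z1, 0.0)      # ReLU activation
    return Z1, H, H @ W2 + b2    # linear output layer

def train(X, y, hidden=16, lr=0.02, l2=1e-4, epochs=100, batch=32, seed=0):
    """Mini-batch SGD with manual backpropagation and an L2 penalty."""
    rng = np.random.default_rng(seed)
    W1, b1 = he_init(X.shape[1], hidden, rng), np.zeros(hidden)
    W2, b2 = he_init(hidden, 1, rng), np.zeros(1)
    history = []                 # mean squared error on the full data set
    for _ in range(epochs):
        history.append(float(np.mean((forward(X, W1, b1, W2, b2)[2] - y) ** 2)))
        order = rng.permutation(len(X))          # reshuffle every epoch
        for start in range(0, len(X), batch):
            idx = order[start:start + batch]
            Xb, yb = X[idx], y[idx]
            Z1, H, y_hat = forward(Xb, W1, b1, W2, b2)
            # Backpropagation: apply the chain rule from the output backwards.
            d_out = 2.0 * (y_hat - yb) / len(Xb)   # d(MSE)/d(y_hat)
            gW2 = H.T @ d_out + l2 * W2            # L2 penalty adds l2 * W
            gb2 = d_out.sum(axis=0)
            dZ1 = (d_out @ W2.T) * (Z1 > 0)        # gradient through ReLU
            gW1 = Xb.T @ dZ1 + l2 * W1
            gb1 = dZ1.sum(axis=0)
            # SGD update: a small step against the gradient.
            W1 -= lr * gW1
            b1 -= lr * gb1
            W2 -= lr * gW2
            b2 -= lr * gb2
    history.append(float(np.mean((forward(X, W1, b1, W2, b2)[2] - y) ** 2)))
    return (W1, b1, W2, b2), history

# Toy data (an assumption for illustration): y = x0 * x1, not linearly solvable.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(256, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)
params, history = train(X, y)
print(f"MSE before training: {history[0]:.4f}, after: {history[-1]:.4f}")
```

No weight is set by hand here: the parameters start from random values and are adjusted only by the gradient updates.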
Source: ‘Deep Learning’ course, by Andrew Ng in Coursera & deeplearning.ai
The hierarchical approach to solving complex problems, especially with unstructured data, has led to the development of diverse deep learning architectures, each designed to tackle specific challenges.
CNNs are specialized for processing grid-like data, such as images.
They use convolutional layers to detect spatial hierarchies in data.
Primary use: Computer vision (image classification, object detection, segmentation, etc.).
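To make the idea of a convolutional layer concrete, here is a minimal sketch of the operation it applies, assuming NumPy, a single-channel image, no padding and stride 1. The helper name `conv2d_valid` and the tiny edge-detection kernel are hypothetical and chosen only for illustration. In a real CNN the kernel values are learned rather than written by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` and sum the element-wise products.

    As in most deep learning libraries, the kernel is not flipped,
    so strictly speaking this is a cross-correlation.
    """
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 5x5 image that is dark on the left and bright on the right.
image = np.zeros((5, 5))
image[:, 2:] = 1.0

# A 1x2 kernel that responds to a dark-to-bright horizontal change.
kernel = np.array([[-1.0, 1.0]])

edges = conv2d_valid(image, kernel)
print(edges)   # non-zero only in the column where the brightness changes
```

The same small kernel is reused at every position, which is why convolutional layers need far fewer parameters than fully connected layers on image data.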
CNN Architecture
Source: Wikipedia
Models: LeNet-5, AlexNet, VGG, ResNet, EfficientNet.
Applications:
Medical image analysis (tumor detection, X-ray analysis).
Facial recognition.
Autonomous vehicles (object detection & tracking).
Real-time video processing.
RNNs contain a hidden state, which allows them to retain information from previous time steps.
RNNs are able to handle sequential data by incorporating information from previous inputs.
RNNs are effective in capturing short-term dependencies in sequences.
Long-term dependencies aren’t handled well due to the vanishing gradient problem.
A Recurrent Neural Network
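The role of the hidden state can be sketched in a few lines. This is a minimal forward pass of a vanilla RNN, assuming NumPy and a tanh activation. The names `rnn_forward`, `W_xh`, `W_hh` and `b_h`, and the sizes used, are assumptions made for this example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    h = np.zeros(W_hh.shape[0])       # initial hidden state
    states = []
    for x in xs:                      # one time step at a time
        # The new state mixes the current input with the previous state.
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, n_in, n_hidden = 5, 3, 4           # sequence length and layer sizes
xs = rng.standard_normal((T, n_in))
W_xh = 0.5 * rng.standard_normal((n_in, n_hidden))
W_hh = 0.5 * rng.standard_normal((n_hidden, n_hidden))
b_h = np.zeros(n_hidden)

states = rnn_forward(xs, W_xh, W_hh, b_h)
print(states.shape)                   # one hidden state per time step
```

Because the state is multiplied by `W_hh` and squashed by tanh at every step, the influence of early inputs (and the gradient flowing back to them) tends to shrink over long sequences.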
Source: From RNNs to Transformers
Models: Vanilla RNNs, Elman Networks, Jordan Networks.
Applications:
Speech recognition.
Handwriting recognition.
Predictive maintenance in industrial applications.
Time series forecasting (financial data, weather prediction).
LSTM Networks improve on traditional RNNs by mitigating the vanishing gradient problem.
They introduce memory cells that selectively retain important information over long sequences.
Primary use: Long-term dependency learning in sequential data.
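How a memory cell "selectively retains" information can be illustrated with a single LSTM step. The sketch below assumes NumPy and the common formulation with forget, input and output gates. The function name `lstm_step`, the packing of all gate weights into one matrix `W`, and the gate ordering are conventions chosen for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (n_in + n_hidden, 4 * n_hidden); its columns hold, in order,
    the forget gate, input gate, candidate values and output gate.
    """
    z = np.concatenate([x, h_prev]) @ W + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates take values in (0, 1)
    g = np.tanh(g)                                 # candidate cell content
    c = f * c_prev + i * g     # keep part of the old memory, add new content
    h = o * np.tanh(c)         # expose part of the memory as the hidden state
    return h, c

n_in, n_hidden = 2, 3
x = np.array([0.3, -0.7])
h_prev = np.zeros(n_hidden)
c_prev = np.array([0.5, -1.0, 2.0])

# Illustrative extreme case: forget gate fully open, input gate closed.
W = np.zeros((n_in + n_hidden, 4 * n_hidden))
b = np.concatenate([np.full(n_hidden, 20.0),    # forget gate close to 1
                    np.full(n_hidden, -20.0),   # input gate close to 0
                    np.zeros(n_hidden),         # candidate values
                    np.zeros(n_hidden)])        # output gate
h, c = lstm_step(x, h_prev, c_prev, W, b)
print(c)   # practically identical to c_prev: the memory is carried over
```

The cell state is updated additively rather than being squashed through a non-linearity at every step, which is what lets information survive over many time steps.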
Source: From RNNs to Transformers
Models: Vanilla LSTMs, Bidirectional LSTMs, Attention-based LSTMs.
Applications:
Transformers revolutionized deep learning by replacing sequential processing with self-attention mechanisms.
Self-attention allows the model to weigh the importance of different input tokens when making predictions.
It can capture long-range dependencies without the need for sequential processing.
Primary use: Natural Language Processing (NLP) & Large-scale sequence modeling.
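The weighting of input tokens can be written down compactly. The following is a minimal sketch of single-head scaled dot-product self-attention, assuming NumPy. The names `self_attention`, `W_q`, `W_k` and `W_v`, and the dimensions used, are assumptions for illustration. Positional encodings, masking and multiple heads are left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # similarity of every token pair
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # weighted mix of the values

rng = np.random.default_rng(0)
n_tokens, d_model, d_k, d_v = 4, 8, 8, 6
X = rng.standard_normal((n_tokens, d_model))   # one embedding per token
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape, output.shape)
```

Every token attends to every other token in one matrix product, so distant positions are connected directly instead of through a chain of recurrent steps.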
Autoencoders learn to compress and reconstruct data by reducing input dimensionality and mapping it to a latent vector that represents essential features.
VAEs introduce probabilistic modeling, generating latent representations that allow for controlled data synthesis.
Unlike traditional autoencoders, VAEs do not map data to a fixed latent vector but to a probability distribution.
This enables smoother interpolations and diverse outputs.
Primary use: Dimensionality reduction, anomaly detection, and generative modeling.
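The difference between a fixed latent vector and a latent distribution can be sketched as follows. This example assumes NumPy and a diagonal Gaussian latent distribution described by a mean `mu` and a log-variance `log_var`, which is the usual VAE setup. The function names are hypothetical and the encoder outputs are written by hand instead of coming from a trained network.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma^2) as mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)

# Hand-written encoder outputs for two inputs with a 2-dimensional latent space.
mu = np.array([[0.0, 0.0],
               [1.0, 0.0]])
log_var = np.zeros((2, 2))            # unit variance in every dimension

z = reparameterize(mu, log_var, rng)  # a different sample on every call
kl = kl_to_standard_normal(mu, log_var)
print(z.shape, kl)

# Interpolating between the two latent means gives intermediate codes
# that a decoder could turn into intermediate outputs.
midpoint = 0.5 * (mu[0] + mu[1])
```

The KL term pulls the latent distributions towards a standard normal, which is what keeps the latent space smooth enough for interpolation and sampling.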
Source: AE & VE
Models: Denoising Autoencoders, Variational Autoencoders (VAEs).
Applications:
Anomaly detection (fraud detection, industrial defect detection).
Image denoising and inpainting.
Feature extraction and dimensionality reduction.
Generating synthetic data.
GANs consist of two networks: a generator and a discriminator.
The generator tries to create realistic data, while the discriminator evaluates authenticity.
Primary use: Data generation (images, videos, text, and music).
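The opposing goals of the two networks show up directly in their loss functions. The sketch below assumes NumPy and the commonly used binary cross-entropy formulation with the non-saturating generator loss. The slides do not specify a loss, so this is one possible choice. The function names are hypothetical and the discriminator outputs are made-up probabilities rather than results from a trained model.

```python
import numpy as np

def bce(p, target):
    """Binary cross-entropy between predicted probabilities and 0/1 targets."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)           # avoid log(0)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

def discriminator_loss(d_real, d_fake):
    """The discriminator wants real samples scored 1 and generated ones scored 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    """The generator wants its samples to be scored 1 by the discriminator."""
    return bce(d_fake, np.ones_like(d_fake))

# Made-up discriminator outputs: it currently tells real and fake apart well.
d_real = np.array([0.9, 0.8])    # scores for real samples
d_fake = np.array([0.1, 0.2])    # scores for generated samples

loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
print(f"discriminator loss: {loss_d:.3f}, generator loss: {loss_g:.3f}")

# An undecided discriminator (all scores 0.5) has loss 2 * ln(2).
loss_d_undecided = discriminator_loss(np.full(2, 0.5), np.full(2, 0.5))
```

When the discriminator does well the generator loss is high, and vice versa, so training alternates between the two networks.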
Models: DCGAN, CycleGAN, StyleGAN, BigGAN.
Applications:
Deepfake generation.
Super-resolution (enhancing image quality).
Data augmentation for training AI models.
Synthetic image and text generation.
Deep learning is filled with the word “tensor”.
What are tensors anyway?
See Wikipedia for a nice article on tensors.
Working with tensors has many benefits:
Vectors: rank-1 tensors.
Matrices: rank-2 tensors.
Arrays in layers.
Typical use: Sequence data
Examples
Layers of groups of arrays
Typical use: Image data
Typical use: Video data
Tensor shape (4, 240, 256, 144, 3)
Each DNN model has a given architecture which usually requires 2D/3D tensors.
If data is not in the expected form it can be reshaped.
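Tensor ranks and reshaping can be tried out with NumPy arrays. This is a minimal sketch: the sizes are deliberately small and the axis order (samples first, channels last) is an assumption, since different libraries use different conventions.

```python
import numpy as np

vector = np.arange(3)                      # rank 1: shape (3,)
matrix = np.zeros((2, 3))                  # rank 2: shape (2, 3)
sequences = np.zeros((8, 10, 5))           # rank 3: (samples, timesteps, features)
images = np.zeros((8, 32, 32, 3))          # rank 4: (samples, height, width, channels)
videos = np.zeros((2, 6, 32, 32, 3))       # rank 5: (samples, frames, height, width, channels)

for name, t in [("vector", vector), ("matrix", matrix), ("sequences", sequences),
                ("images", images), ("videos", videos)]:
    print(name, "rank:", t.ndim, "shape:", t.shape)

# Reshaping changes the layout but keeps every value.
# Flatten each image into one long vector, e.g. for a fully connected layer:
flat_images = images.reshape(len(images), -1)      # shape (8, 3072)

# Treat every video frame as an independent image:
frames = videos.reshape(-1, 32, 32, 3)             # shape (12, 32, 32, 3)
```

The `-1` tells NumPy to infer that dimension from the total number of elements, which must stay the same before and after reshaping.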
See ‘Deep Learning with R’ for more.