Neural Networks, Deep Learning and Artificial Intelligence
In the post-pandemic world, the lightning-fast rise of AI, with its mix of realities and promises, is impacting society.
Since ChatGPT entered the scene, everybody has an experience, an opinion, or a fear about the topic.
Most tasks performed by AI can be described as Classification or Prediction, used in applications such as:
AI relies on machine learning algorithms to make predictions based on large amounts of data.
AI has far-reaching implications beyond its predictive capabilities, including ethical, social, and technological ones.
In many contexts, talking about AI means talking about Deep Learning (DL).
DL is a successful AI model which has powered many applications such as self-driving cars, voice assistants, and medical diagnosis systems.
DL originates in the field of Artificial Neural Networks (ANNs).
But DL extends the basic principles of ANNs by:
We can identify several milestones worth taking into account:
The Perceptron and the first Artificial Neural Network, where the basic building block was introduced.
The Multilayer Perceptron and back-propagation, where more complex architectures were suggested to improve the capabilities.
Deep Neural Networks, with many hidden layers and auto-tuning capabilities.
Why Deep Learning Now?
Success stories such as
the development of self-driving cars,
the use of AI in medical diagnosis, and
online shopping personalized recommendations
have also contributed to the widespread adoption of AI.
AI also comes with fears from multiple sources, from science fiction to religion:
Mass unemployment
Loss of privacy
AI bias
AI fakes
Or, simply, AI takeover
Where/How does it all fit?
Artificial intelligence: Ability of a computer to perform tasks commonly associated with intelligent beings.
Machine learning: Study of algorithms that learn from examples and experience, instead of relying on hard-coded rules, and make predictions on new data.
Deep learning: Subfield of ML focusing on learning data representations as successive layers of increasingly meaningful representations.
An illustration of the performance comparison between deep learning (DL) and other machine learning (ML) algorithms, where DL modeling from large amounts of data can increase the performance.
Near-human-level image classification
Near-human-level speech transcription
Near-human-level handwriting transcription
Dramatically improved machine translation
Dramatically improved text-to-speech conversion
Digital assistants such as Google Assistant and Amazon Alexa
Near-human-level autonomous driving
Improved ad targeting, as used by Google, Baidu, or Bing
Improved search results on the web
Ability to answer natural language questions
Superhuman Go playing
According to F. Chollet, the developer of Keras,
This first attempt to emulate neurons succeeded but with limitations:
What about non-Boolean (say, real) inputs?
What if all inputs are not equal?
What if we want to assign more importance to some inputs?
What about functions which are not linearly separable, such as the XOR function?
To overcome these limitations, Rosenblatt proposed the perceptron model, or artificial neuron, in 1958.
It generalizes the McCulloch-Pitts neuron in that weights and thresholds can be learnt over time.
The Perceptron represents an improvement over the McCulloch-Pitts neuron:
However, there are still limitations:
The Perceptron can be generalized by Artificial Neurons, which can use more general functions, called Activation Functions, to produce their output.
It must be noted, however, that a single artificial neuron, even with a different activation function, still cannot model non-linearly separable problems like XOR.
With all these ideas in mind we can now define an Artificial Neuron as a computational unit that:
takes as input \(x=(x_0,x_1,x_2,x_3),\ (x_0 = +1 \equiv bias)\),
outputs \(h_{\theta}(x) = f(\theta^\intercal x) = f(\sum_i \theta_ix_i)\),
where \(f:\mathbb{R}\mapsto \mathbb{R}\) is called the activation function.
The goal of the activation function is to provide the neuron with the capability of producing the required outputs.
Flexible enough to produce
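Usually chosen from a (small) set of possibilities, such as the sigmoid, tanh, and ReLU functions.
Sigmoid (logistic) function: \[ f(z)=\frac{1}{1+e^{-z}} \]
Outputs real values in \((0,1)\).
Natural interpretation as a probability.
Hyperbolic tangent, also called the tanh function:
\[ f(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \]
Outputs are zero-centered and bounded in \((-1,1)\).
It is a scaled and shifted sigmoid: \(\tanh(z)=2\sigma(2z)-1\), where \(\sigma\) denotes the sigmoid.
Stronger gradient than the sigmoid, but it still has the vanishing gradient problem.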
Its derivative is \(f'(z)=1-(f(z))^2\).
ReLU (rectified linear unit): \(f(z)=\max\{0,z\}\).
Close to linear: a piecewise linear function with two linear pieces.
Outputs are in \([0,\infty)\), thus not bounded.
Half rectified: activation threshold at 0.
No vanishing gradient problem for positive inputs, where the derivative equals 1.
Softmax is an activation function used in the output layer of classification models, especially for multi-class problems.
It converts raw scores (logits) into probabilities, ensuring that \(\sum_{i=1}^{N} P(y_i) = 1\) where \(P(y_i)\) is the predicted probability for class \(i\).
Given an input vector \(z\), Softmax transforms it as: \[ \sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}} \]
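The activation functions above can be checked numerically with a minimal sketch in plain Python. The function names are illustrative and do not come from any library. Subtracting the maximum inside `softmax` is a common numerical-stability device added here as an assumption; it leaves the result of the formula unchanged.

```python
import math

def sigmoid(z):
    """Logistic function; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent; output lies in (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    """Rectified linear unit: max{0, z}."""
    return max(0.0, z)

def softmax(zs):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(zs)  # shift by the maximum (assumption: for numerical stability)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-class problem
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest logit receives the largest probability, and the identity \(\tanh(z)=2\sigma(2z)-1\) can be verified by comparing `tanh(z)` with `2 * sigmoid(2 * z) - 1`.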
An AN takes a vector of input values \(x_{1}, \ldots, x_{d}\) and combines it with some weights that are local to the neuron \(\left(w_{0}, w_{1}, \ldots, w_{d}\right)\) to compute a net input \(w_{0}+\sum_{i=1}^{d} w_{i} \cdot x_{i}\).
To compute its output, it then passes the net input through a possibly non-linear univariate activation function \(g(\cdot)\), usually chosen from a set of options such as the Sigmoid, Tanh or ReLU functions.
To deal with the bias, we create an extra input variable \(x_{0}\) with value always equal to 1, and so the function computed by a single artificial neuron (parameterized by its weights \(\mathbf{w}\)) is:
\[ y(\mathbf{x})=g\left(w_{0}+\sum_{i=1}^{d} w_{i} x_{i}\right)=g\left(\sum_{i=0}^{d} w_{i} x_{i}\right)=g\left(\mathbf{w}^{\mathbf{T}} \mathbf{x}\right) \]
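As a small illustration of \(y(\mathbf{x})=g(\mathbf{w}^{\mathbf{T}}\mathbf{x})\), the sketch below computes the output of one neuron. The function name `neuron_output` and the numeric weights are made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(w, x, g):
    """Compute g(w^T x), where w[0] is the bias weight.

    x holds x_1..x_d; the constant input x_0 = 1 is prepended here.
    """
    x_ext = [1.0] + list(x)
    net = sum(wi * xi for wi, xi in zip(w, x_ext))  # net input
    return g(net)

# Hypothetical weights (w_0, w_1, w_2) and inputs (x_1, x_2)
w = [-1.0, 2.0, 0.5]
x = [0.5, 2.0]
y = neuron_output(w, x, sigmoid)  # net input is -1 + 1 + 1 = 1
print(y)
```

Passing a different function as `g` (for example the identity) changes the activation without touching the weighted sum.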
The Perceptron Rule updates the weights only when a sample is misclassified.
Convergence is guaranteed only if the data are linearly separable.
Weight Update Formula \[ w_j \leftarrow w_j + \eta(y - \hat{y})x_j \] where:
\(x_j\) = input feature; \(\quad \eta\) = learning rate;
\(y\) = true label; \(\quad \hat{y}\)=predicted class (\(\pm1\));
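The update \(w_j \leftarrow w_j + \eta(y-\hat{y})x_j\) can be sketched as follows. The AND dataset, the learning rate of 0.5 and the function names are illustrative assumptions; labels are coded as \(\pm 1\) as in the formula.

```python
def predict(w, x):
    """Predicted class in {-1, +1}; w[0] is the bias weight (x_0 = 1)."""
    net = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1 if net >= 0 else -1

def train_perceptron(data, eta=0.5, max_epochs=100):
    """Apply the perceptron rule until an epoch has no misclassifications."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, y in data:
            y_hat = predict(w, x)
            if y_hat != y:                # weights change only on mistakes
                errors += 1
                step = eta * (y - y_hat)
                w[0] += step              # bias input x_0 = 1
                for j, xj in enumerate(x):
                    w[j + 1] += step * xj
        if errors == 0:
            break
    return w

# Logical AND is linearly separable, so the rule converges
and_data = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
w = train_perceptron(and_data)
print(w, [predict(w, x) for x, _ in and_data])
```

Replacing the labels with those of XOR would make the loop run until `max_epochs` without reaching zero errors, since that problem is not linearly separable.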
Key Features
The Delta Rule uses gradient descent to minimize the Sum of Squared Errors (SSE).
It applies to models with differentiable activation functions.
Weight Update Formula \[ w_j \leftarrow w_j + \eta(y - h(x))x_j \] where:
\(x_j\) = input feature, \(\eta\) = learning rate;
\(y\) = true label; \(\quad h(x) =\) predicted output.
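A minimal sketch of the gradient-descent update above, assuming a linear unit \(h(x)=w^\intercal x\) (identity activation) and updates applied one sample at a time. The toy data, which follow \(y = 1 + 2x\) exactly, and all names are illustrative.

```python
def h(w, x):
    """Linear unit: h(x) = w_0 + sum_j w_j x_j."""
    return w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))

def sse(w, data):
    """Sum of squared errors over the dataset."""
    return sum((y - h(w, x)) ** 2 for x, y in data)

def train_delta_rule(data, eta=0.05, epochs=500):
    """Apply w_j <- w_j + eta * (y - h(x)) * x_j sample by sample."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        for x, y in data:
            error = y - h(w, x)
            w[0] += eta * error           # bias input x_0 = 1
            for j, xj in enumerate(x):
                w[j + 1] += eta * error * xj
    return w

# Noise-free toy data generated from y = 1 + 2x (assumption for the example)
data = [([x], 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
w = train_delta_rule(data)
print(w, sse(w, data))
```

Unlike the perceptron rule, the weights change on every sample, by an amount proportional to the size of the error.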
Key Features
Continuing with the brain analogy, one can combine (artificial) neurons to create better learners.
A simple artificial neural network is usually created by combining two types of modifications to the basic perceptron (AN).
Stacking several neurons instead of just one.
Adding an additional layer of neurons, which is called a hidden layer.
This yields a system where the output of a neuron can be the input of another in many different ways.
In this figure, we have used circles to also denote the inputs to the network.
Circles labeled +1 are bias units, and correspond to the intercept term.
The leftmost layer of the network is called the input layer.
The rightmost layer of the network is called the output layer.
The middle layer of nodes is called the hidden layer, because its values are not observed in the training set.
Bias nodes are not counted when stating the number of neurons in a layer.
With all this in mind, our example neural network has three layers with:
An ANN is a predictive model (a learner) whose properties and behaviour can be well characterized.
It operates through a process known as forward propagation, which encompasses the information flow from the input layer to the output layer.
Forward propagation is performed by composing a series of linear and non-linear (activation) functions.
These are characterized (parametrized) by their weights and biases, that need to be learnt.
The training process aims at finding the best possible parameter values for the learning task defined by the functions. This is done by
This is usually done using some iterative optimization procedure such as gradient descent.
The process that encompasses the computations required to go from the input values to the final output is known as forward propagation.
For the ANN with 3 input values and 3 neurons in the hidden layer we have:
\[\begin{eqnarray} a_1^{(2)}&=&f(\theta_{10}^{(1)}+\theta_{11}^{(1)}x_1+\theta_{12}^{(1)}x_2+\theta_{13}^{(1)}x_3)\\ a_2^{(2)}&=&f(\theta_{20}^{(1)}+\theta_{21}^{(1)}x_1+\theta_{22}^{(1)}x_2+\theta_{23}^{(1)}x_3)\\ a_3^{(2)}&=&f(\theta_{30}^{(1)}+\theta_{31}^{(1)}x_1+\theta_{32}^{(1)}x_2+\theta_{33}^{(1)}x_3) \end{eqnarray}\]
\[ h_{\Theta}(x)=a_1^{(3)}=f(\theta_{10}^{(2)}+\theta_{11}^{(2)}a_1^{(2)}+\theta_{12}^{(2)}a_2^{(2)}+\theta_{13}^{(2)}a_3^{(2)}) \]
Let \(z_i^{(l)}\) denote the total weighted sum of inputs to unit \(i\) in layer \(l\):
\[ z_i^{(2)}=\theta_{i0}^{(1)}+\theta_{i1}^{(1)}x_1+\theta_{i2}^{(1)}x_2+\theta_{i3}^{(1)}x_3, \] the output becomes: \(a_i^{(l)}=f(z_i^{(l)})\).
Extending the activation function \(f(\cdot)\) to apply elementwise to vectors:
\[ f([z_1,z_2,z_3]) = [f(z_1), f(z_2),f(z_3)], \] we can write the previous equations more compactly as:
\[\begin{eqnarray} z^{(2)}&=&\Theta^{(1)}x\nonumber\\ a^{(2)}&=&f(z^{(2)})\nonumber\\ z^{(3)}&=&\Theta^{(2)}a^{(2)}\nonumber\\ h_{\Theta}(x)&=&a^{(3)}=f(z^{(3)})\nonumber \end{eqnarray}\]
More generally, recalling that we also use \(a^{(1)}=x\) to denote the values from the input layer, given layer \(l\)’s activations \(a^{(l)}\), we can compute layer \(l+1\)’s activations \(a^{(l+1)}\) as:
\[\begin{equation} z^{(l+1)}=\Theta^{(l)}a^{(l)} \label{eqforZs} \end{equation}\]
\[\begin{equation} a^{(l+1)}=f(z^{(l+1)}) \label{eqforAs} \end{equation}\]
This can be used to provide a matrix representation for the weighted sum of inputs of all neurons:
\[ z^{(l+1)}= \begin{bmatrix} z_1^{(l+1)}\\ z_2^{(l+1)}\\ \vdots\\ z_{s_{l+1}}^{(l+1)} \end{bmatrix}= \begin{bmatrix} \theta_{10}^{(l)}& \theta_{11}^{(l)}&\theta_{12}^{(l)}&\cdots&\theta_{1s_{l}}^{(l)}\\ \theta_{20}^{(l)}& \theta_{21}^{(l)}&\theta_{22}^{(l)}&\cdots&\theta_{2s_{l}}^{(l)}\\ \vdots & \vdots& \vdots & \ddots & \vdots\\ \theta_{s_{l+1}0}^{(l)}& \theta_{s_{l+1}1}^{(l)}&\theta_{s_{l+1}2}^{(l)}&\cdots&\theta_{s_{l+1}s_{l}}^{(l)} \end{bmatrix} \cdot\begin{bmatrix} 1\\ a_1^{(l)}\\ a_2^{(l)}\\ \vdots\\ a_{s_l}^{(l)} \end{bmatrix} \]
The activation is then:
\[ a^{(l+1)}= \begin{bmatrix} a_1^{(l+1)}\\ a_2^{(l+1)}\\ \vdots\\ a_{s_{l+1}}^{(l+1)} \end{bmatrix}=f(z^{(l+1)})=\begin{bmatrix} f(z_1^{(l+1)})\\ f(z_2^{(l+1)})\\ \vdots\\ f(z_{s_{l+1}}^{(l+1)}) \end{bmatrix} \]
The way input data is transformed, through a series of weightings and transformations, until it reaches the output layer is called forward propagation.
By organizing parameters in matrices and using matrix-vector operations, optimized linear algebra routines can be used to perform the required calculations efficiently.
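The recursion \(z^{(l+1)}=\Theta^{(l)}a^{(l)}\), \(a^{(l+1)}=f(z^{(l+1)})\) can be sketched in a few lines. This version uses plain Python lists so that it has no dependencies; in practice a linear algebra library would perform the matrix-vector products. The weight values and the name `forward` are assumptions made for the example, with a sigmoid activation and the 3-input, 3-hidden-unit, 1-output layout of the example network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x, f=sigmoid):
    """Return the activations of every layer for input x.

    thetas[l] is one weight matrix, stored as a list of rows
    [theta_i0, theta_i1, ..., theta_is] with the bias weight first.
    """
    a = list(x)                      # a^(1) = x
    activations = [a]
    for theta in thetas:
        a_ext = [1.0] + a            # prepend the bias unit
        z = [sum(t * v for t, v in zip(row, a_ext)) for row in theta]
        a = [f(zi) for zi in z]      # apply f elementwise
        activations.append(a)
    return activations

# Hypothetical weights: Theta^(1) is 3x4, Theta^(2) is 1x4
theta1 = [[0.1, 0.2, -0.3, 0.4],
          [0.0, -0.5, 0.5, 0.1],
          [0.3, 0.1, 0.1, -0.2]]
theta2 = [[-0.2, 0.6, -0.4, 0.3]]
acts = forward([theta1, theta2], [1.0, 2.0, 3.0])
h = acts[-1][0]                      # network output h_Theta(x)
print(h)
```

Adding more matrices to the `thetas` list yields a deeper network without changing the loop.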
We have so far focused on the single-hidden-layer neural network of the example.
One can, however, build neural networks with many distinct architectures (meaning patterns of connectivity between neurons), including ones with multiple hidden layers.
An ANN is a predictive model whose properties and behaviour can be mathematically characterized.
In practice this means:
Training the network is done by
Training an ANN is usually done using some iterative optimization procedure such as Gradient Descent.
This requires evaluating derivatives at a huge number of points.
Depending on the activation function it may be advisable to use one or another form of loss function.
A typical choice may be quadratic (or square) error loss: \[ l(h_\theta(x),y)=\left (y-\frac{1}{1+e^{-\theta^\intercal x}}\right )^2 \]
With a sigmoid activation function, the squared error loss is not convex in the parameters, so MSE is not an appropriate choice.
Quadratic loss may be used with ReLU activation.
With a sigmoid output, the cross-entropy loss is a common alternative:
\[ l(h_\theta(x),y)=\left\{\begin{array}{ll} -\log h_\theta(x) & \textrm{if }y=1\\ -\log(1-h_\theta(x))& \textrm{if }y=0 \end{array}\right. \]
which can be written as a single expression:
\[ l(h_\theta(x),y)=-y\log h_\theta(x) - (1-y)\log(1-h_\theta(x)) \]
Averaging over the \(n\) training examples gives the cost function:
\[ J(\theta)=-\frac{1}{n}\left[\sum_{i=1}^n \left(y^{(i)}\log h_\theta(x^{(i)})+ (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right)\right] \]
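The cost \(J(\theta)\) above can be evaluated directly from predicted probabilities and 0/1 labels, as in the sketch below. The clipping constant `eps` is an assumption added to avoid \(\log 0\); it is not part of the formula. The predictions and labels are invented for the example.

```python
import math

def cross_entropy_cost(h_values, y_values, eps=1e-12):
    """Average cross-entropy between predictions in (0, 1) and 0/1 labels."""
    total = 0.0
    for h, y in zip(h_values, y_values):
        h = min(max(h, eps), 1.0 - eps)   # clip (assumption: avoids log(0))
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / len(y_values)

# Hypothetical predictions h_theta(x) and true labels
preds = [0.9, 0.2, 0.6]
labels = [1, 0, 1]
J = cross_entropy_cost(preds, labels)
print(J)
```

A confident but wrong prediction is penalized heavily: `cross_entropy_cost([0.01], [1])` is much larger than `cross_entropy_cost([0.99], [1])`.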
Training a network corresponds to finding the parameters, that is, the weights and the biases, that minimize the cost function.
Although weights and biases are, respectively, matrices and vectors, it is convenient to store them in vectorized form as a single vector, denoted here by \(\theta\).
We may suppose \(\theta\in\mathbb{R}^p\), and write the cost function as \(J(\theta)\) to emphasize its dependence on the parameters, that is: \[ \begin{eqnarray*} J: \mathbb{R}^p & \rightarrow \mathbb{R}\\ \theta & \rightarrow J(\theta) \end{eqnarray*} \]
Gradient descent is a classical optimization method for minimizing a convex function \(J(\theta)\).
It proceeds iteratively, computing a sequence of vectors \(\theta^1, \theta^2, ..., \theta^n\) in \(\mathbb{R}^p\) with the aim of converging to a vector that minimizes the cost function.
Suppose that our current vector is \(\theta\). How should we choose a perturbation, \(\Delta\theta\), so that the next vector, \(\theta+\Delta\theta\), represents an improvement, that is: \(J(\theta +\Delta\theta) < J(\theta)\)?
Linearize the cost function using a Taylor approximation.
If \(\Delta\theta\) is small, then ignoring terms of order \(||\Delta\theta||^2\) or higher: \[ J(\theta+\Delta\theta)\approx J(\theta)+\sum_{i=1}^p\frac{\partial J(\theta)}{\partial\theta_i}\Delta\theta_i \] or, equivalently: \[\begin{equation}\label{g2} J(\theta+\Delta\theta)\approx J(\theta)+\nabla J(\theta)^\intercal\Delta\theta \end{equation}\] where \(\nabla J(\theta)\in\mathbb{R}^p\) denotes the gradient, i.e. the vector of partial derivatives: \[\begin{equation}\label{g1} \nabla J(\theta)=\left(\frac{\partial J(\theta)}{\partial\theta_1},...,\frac{\partial J(\theta)}{\partial\theta_p}\right)^\intercal \end{equation}\]
Goal: choose a perturbation, \(\Delta\theta\), s.t.: \(J(\theta +\Delta\theta) < J(\theta)\)
The Taylor approximation above suggests that choosing \(\Delta\theta\) to make \(\nabla J(\theta)^\intercal\Delta\theta\) negative will make the value of \(J(\theta+\Delta\theta)\) smaller.
Indeed, it can be shown that, for a fixed step length, the most negative value is attained when \(\Delta\theta\) points in the direction of \(-\nabla J(\theta)\), which leads to the gradient descent formula: \[ \theta \leftarrow \theta-\eta\nabla J(\theta), \] where \(\eta\), the learning rate, is the size of the step taken at each iteration, which should be small for the Taylor approximation to remain valid.
The Cauchy-Schwarz inequality states that for any \(f,g\in\mathbb{R}^p\), we have: \[ |f^\intercal g|\leq ||f||\cdot ||g||. \]
Moreover, the two sides are equal if and only if \(f\) and \(g\) are linearly dependent (meaning they are parallel).
By Cauchy-Schwarz, the largest possible absolute value of \(\nabla J(\theta)^\intercal\Delta\theta\) is the upper bound \(||\nabla J(\theta)||\cdot ||\Delta\theta||\), so the most negative value, \(-||\nabla J(\theta)||\cdot ||\Delta\theta||\), is attained when \(\Delta\theta\) is a positive multiple of \(-\nabla J(\theta)\).
This explains why we choose \(\Delta\theta\) proportional to \(-\nabla J(\theta)\).
In summary, given a cost function \(J(\theta)\) to be optimized, the gradient descent optimization proceeds as follows:
https://assets.yihui.org/figures/animation/example/grad-desc
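The iteration \(\theta \leftarrow \theta-\eta\nabla J(\theta)\) can be sketched on a simple convex function. The quadratic \(J(\theta)=(\theta_1-3)^2+2(\theta_2+1)^2\), its starting point and the learning rate are assumptions chosen for illustration; its minimum is at \((3,-1)\).

```python
def J(theta):
    """Illustrative convex cost with minimum at (3, -1)."""
    return (theta[0] - 3.0) ** 2 + 2.0 * (theta[1] + 1.0) ** 2

def grad_J(theta):
    """Gradient of J: the vector of partial derivatives."""
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]

def gradient_descent(grad, theta0, eta=0.1, n_iter=200):
    """Repeat theta <- theta - eta * grad(theta) for n_iter steps."""
    theta = list(theta0)
    for _ in range(n_iter):
        g = grad(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta

theta0 = [0.0, 0.0]
theta = gradient_descent(grad_J, theta0)
print(theta, J(theta))
```

With a learning rate that is too large (for instance `eta=0.6` on this function) the second coordinate overshoots and the iterates diverge, which illustrates why the step must stay small.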
Input: \(W\) (weight vectors), \(D\) (training dataset).
Repeat until error is below threshold:
Compute network output for each training instance.
For each neuron \(j\) in the output layer:
Compute \(\delta_j = \sigma^{\prime}(z_j) (c_j - y_j)\)
Update weights: \(\Delta w_i^j = \eta \delta_j x_i^j\)
For each neuron \(k\) in hidden layers:
Compute \(\delta_k = \sigma^{\prime}(z_k) \sum_{j \in S_k} \delta_j w_k^j\)
Update weights: \(\Delta w_i^k = \eta \delta_k x_i^k\)
Output: Updated weight vectors \(W\).
In multi-layer networks, the error signal must be propagated backward through multiple layers.
The weight update in hidden layers depends not only on the local error but also on how errors propagate from the output layer.
This is achieved using the chain rule of calculus, which decomposes derivatives into simpler components.
That is, the gradient descent update rule for the weight is:
\[ \Delta w_i^j = -\eta \frac{\partial J}{\partial w_{i}^{j}} = \eta \delta_j x_i^j \] where the derivative will be computed applying the chain rule.
\[ \frac{\partial J}{\partial w_{i}^{j}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{i}^{j}} \]
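The backpropagation steps and the chain-rule expression above can be sketched for a network with one hidden layer. The sketch assumes sigmoid activations, so that \(\sigma'(z)=y(1-y)\), and the cost \(J=\frac{1}{2}\sum_j (c_j-y_j)^2\) for a single training example. The weights, input, target and function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, W2, x):
    """Forward pass; every weight row stores the bias weight first."""
    x_ext = [1.0] + list(x)
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x_ext))) for row in W1]
    h_ext = [1.0] + hidden
    out = [sigmoid(sum(w * v for w, v in zip(row, h_ext))) for row in W2]
    return x_ext, h_ext, out

def loss(W1, W2, x, c):
    out = forward(W1, W2, x)[2]
    return 0.5 * sum((cj - y) ** 2 for y, cj in zip(out, c))

def backprop(W1, W2, x, c):
    """Return delta_j * x_i^j for both layers, which equals -dJ/dw."""
    x_ext, h_ext, out = forward(W1, W2, x)
    # Output layer: delta_j = sigma'(z_j) * (c_j - y_j)
    d_out = [y * (1 - y) * (cj - y) for y, cj in zip(out, c)]
    # Hidden layer: delta_k = sigma'(z_k) * sum_j delta_j * w_k^j
    d_hid = []
    for k in range(1, len(h_ext)):        # index 0 is the bias unit
        back = sum(d_out[j] * W2[j][k] for j in range(len(W2)))
        d_hid.append(h_ext[k] * (1 - h_ext[k]) * back)
    dW2 = [[dj * v for v in h_ext] for dj in d_out]
    dW1 = [[dk * v for v in x_ext] for dk in d_hid]
    return dW1, dW2

# Hypothetical network: 2 inputs, 2 hidden units, 1 output
W1 = [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]]
W2 = [[0.2, -0.6, 0.7]]
x, c, eta = [1.0, 0.0], [1.0], 0.5

dW1, dW2 = backprop(W1, W2, x, c)
W1_new = [[w + eta * d for w, d in zip(wr, dr)] for wr, dr in zip(W1, dW1)]
W2_new = [[w + eta * d for w, d in zip(wr, dr)] for wr, dr in zip(W2, dW2)]
print(loss(W1, W2, x, c), loss(W1_new, W2_new, x, c))
```

A useful sanity check for any hand-written backpropagation is to compare these values with a finite-difference estimate of \(-\partial J/\partial w\) for a few weights.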
Modern deep learning frameworks do not compute gradients manually.
Instead, they use automatic differentiation and computational graphs to simplify and speed up backpropagation.
A computational graph represents the sequence of operations in a neural network as a directed graph.
Automatic differentiation (AD) relies on the computational graph to apply the chain rule and compute gradients automatically in the backward pass.
Frameworks like TensorFlow, PyTorch, and JAX use reverse-mode differentiation, which is particularly efficient for functions with many parameters (like neural networks).
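To make the idea of a computational graph concrete, here is a toy reverse-mode differentiation sketch for scalar values. It is not the implementation used by any of the frameworks named above; the `Node` class and its methods are hypothetical and support only addition, multiplication and the sigmoid.

```python
import math

class Node:
    """A scalar in a computational graph (illustrative, not a library API)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents   # pairs (parent node, local derivative)

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.value))
        return Node(s, ((self, s * (1.0 - s)),))

    def backward(self):
        """Backward pass: apply the chain rule in reverse topological order."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad

# y = sigmoid(w * x + b), a single neuron written as a graph
w, x, b = Node(2.0), Node(3.0), Node(-5.0)
y = (w * x + b).sigmoid()
y.backward()
print(y.value, w.grad, x.grad, b.grad)
```

One backward pass yields the derivative of the output with respect to every input, which is why the reverse mode suits functions with many parameters and a single scalar loss.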
The learning process, as derived so far, can be improved in different ways.
This can be partially solved by applying distinct approaches.
Network performance is affected by many hyperparameters.
Traditionally, one hidden layer was considered to be enough.
Later research showed that adding more layers increases efficiency.
However, there is also a risk of overfitting.
It has been shown that using the whole training set only once may not be enough for training an ANN.
One pass over the whole training set is known as an epoch.
The number of epochs, \(N_E\), defines how many times we iterate over the whole training set.
\(N_E\) can be fixed, determined by cross-validation, or left open, stopping the training when performance no longer improves.
A complementary strategy to increasing the number of epochs is decreasing the number of instances in each iteration.
That is, the training set is broken into a number of batches that are processed separately.
Batch learning allows weights to be updated more frequently per epoch.
The advantage of batch learning is related to the gradient descent approach used.
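The relation between epochs, batches and weight updates can be sketched with mini-batch gradient descent on a toy linear model. The model \(y \approx w_0 + w_1 x\) with mean squared error, the data, the seed and the hyperparameter values are all assumptions made for the example.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield the shuffled training set in chunks of batch_size."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

def train(data, n_epochs=300, batch_size=2, eta=0.05, seed=0):
    """Mini-batch gradient descent; one weight update per batch."""
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    n_updates = 0
    for _ in range(n_epochs):                 # one epoch = one full pass
        for batch in minibatches(data, batch_size, rng):
            errors = [(y - (w0 + w1 * x), x) for x, y in batch]
            g0 = sum(-2.0 * e for e, _ in errors) / len(batch)
            g1 = sum(-2.0 * e * x for e, x in errors) / len(batch)
            w0 -= eta * g0
            w1 -= eta * g1
            n_updates += 1
    return w0, w1, n_updates

# Noise-free toy data from y = 1 + 2x (assumption for the example)
data = [(x, 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
w0, w1, n_updates = train(data)
print(w0, w1, n_updates)
```

With 4 training instances and a batch size of 2, each epoch performs 2 updates instead of 1, so 300 epochs give 600 updates.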
Training speed can be improved by adjusting key factors that influence convergence.
Weight Initialization: Properly initializing weights helps prevent vanishing or exploding gradients, leading to faster convergence.
Adjusting Learning Rate: A well-tuned learning rate accelerates training while avoiding instability or slow convergence.
Using Efficient Cost Functions: Choosing an appropriate loss function (e.g., cross-entropy for classification) speeds up gradient updates.
Overfitting occurs when a model learns noise instead of general patterns. Common strategies to prevent it include:
L2 Regularization: Penalizes large weights to reduce model complexity and improve generalization.
Early Stopping: Stops training when validation loss starts increasing, preventing unnecessary overfitting.
Dropout: Randomly disables neurons during training to make the model more robust.
Data Augmentation: Expands the training set by applying transformations (e.g., rotations, scaling) to improve generalization.
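Three of the strategies above can each be sketched in a few lines. The helper names are hypothetical, the penalty is taken to be \(\frac{\lambda}{2}\lVert w\rVert^2\), dropout is written in its "inverted" form (survivors are rescaled by \(1/(1-p)\)), and early stopping uses a `patience` parameter; all of these are common conventions assumed here rather than taken from the text.

```python
import random

def l2_update(w, grad, eta, lam):
    """Gradient step on J(w) + (lam / 2) * ||w||^2.

    The penalty adds lam * w to the gradient, shrinking the weights.
    """
    return [wi - eta * (gi + lam * wi) for wi, gi in zip(w, grad)]

def dropout(activations, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

def early_stopping_epoch(val_losses, patience=2):
    """Return the best epoch, stopping after `patience` epochs without improvement."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                             # validation loss stopped improving
    return best_epoch

# Hypothetical validation losses recorded after each epoch
val_losses = [1.0, 0.8, 0.7, 0.75, 0.9, 0.6]
best_epoch = early_stopping_epoch(val_losses)
shrunk = l2_update([1.0, -2.0], [0.0, 0.0], eta=0.1, lam=0.5)
masked = dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=random.Random(0))
print(best_epoch, shrunk, masked)
```

In this example training stops before the last value of `val_losses` is ever seen, which shows the trade-off controlled by `patience`. Dropout is applied during training only; at prediction time all units are kept.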
Techniques | Performance Improvement | Learning Speed | Overfitting | Description |
---|---|---|---|---|
Network Architecture | X | X | X | Adjust layers, neurons and connections |
Epochs, Iterations, and Batch Size | X | Controls updates per epoch to improve efficiency. | ||
Softmax | X | Turns outputs into probabilities | ||
Training Algorithms | X | X | GD Improvements | |
Learning Rate | X | X | Step size in gradient updates. | |
Cross-Entropy Loss | X | Optimized for classification | ||
L2 Regularization | X | X | Penalizes large weights to prevent overfitting. | |
Early Stopping | X | Stops training when validation loss worsens. | ||
Dropout | X | X | Randomly disables neurons to enhance generalization. | |
Data Augmentation | X | Expands training data by applying transformations. |
Lab Lab-C3.1-IrisANN-Modular contains a detailed example and a Python notebook on building a NN from scratch.
Lab Lab-C3.1-Dividend Prediction shows how to use R to build and use a neural network.