Neural Networks, Deep Learning and
Artificial Intelligence
In the post-pandemic world, a lightning-fast rise of AI, with its mix of realities and promises, is impacting society.
Since the emergence, and rapid growth, of Large Language Models and Generative AI, everybody has an experience, an opinion, or a fear about the topic.
Many tasks performed by AI can be described as Predictive as seen in Recommendation systems, Image recognition or Natural language processing.
However, AI is broader than just prediction systems.
Generative AI is also able to generate new content.
Both predictive and generative capabilities of AI have far-reaching implications beyond technology, including ethical and social aspects.
In many contexts, talking about AI means talking about Deep Learning (DL).
DL is the dominant paradigm in modern (2026) AI and is behind applications such as self-driving cars, voice assistants, and medical diagnosis systems.
DL originates in the field of Artificial Neural Networks (ANNs).
DL extends the basic principles of ANNs by:
Why Deep Learning Now?
Performance comparison between Deep Learning and other ML algorithms
DL models trained on large amounts of data can increase performance.
The neuron can be divided into two parts:
The neuron computes: \[ y = f(g(x_1,\dots,x_n)). \]
McCulloch’s neuron has important limitations:
To build more flexible models, the perceptron was introduced.
To overcome the limitations of McCulloch’s neuron, Rosenblatt proposed the perceptron model, or artificial neuron, in 1958.
It generalizes the previous model in that weights and thresholds can be learnt over time.
The perceptron represents an improvement:
However, it still has important limitations:
The Perceptron can be generalized by Artificial Neurons, which use functions, called Activation Functions (AFs), to produce their output.
AFs are built in a way that they allow neurons to produce continuous and non-linear outputs.
It must be noted, however, that a single AN, even with a different AF, still cannot model non-linearly separable problems: the non-linearity is in the output, not in the input.
With all these ideas in mind we can now define an Artificial Neuron as a computational unit that:
takes as input \(x=(x_0,x_1,x_2,x_3),\ (x_0 = +1 \equiv bias)\),
outputs \(h_{\theta}(x) = f(\theta^\intercal x) = f(\sum_i \theta_ix_i)\),
where \(f:\mathbb{R}\mapsto \mathbb{R}\) is called the activation function.
The goal of the activation function is to provide the neuron with the capability of producing the required outputs.
Flexible enough to produce the required output range.
Usually chosen from a (small) set of possibilities.
Sigmoid function: \[ f(z)=\frac{1}{1+e^{-z}} \]
Outputs real values \(\in (0,1)\).
Natural interpretation as a probability.


Hyperbolic tangent, also called tanh, function:
\[ f(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \]
Outputs are zero-centered and bounded in \((-1,1)\).
A scaled and shifted sigmoid.
Stronger gradients than the sigmoid, but still suffers from the vanishing gradient problem.
Its derivative is \(f'(z)=1-(f(z))^2\).

Rectified Linear Unit (ReLU): \(f(z)=\max\{0,z\}\).
Close to linear: a piece-wise linear function with two linear pieces.
Outputs are in \([0,\infty)\), thus not bounded.
Half-rectified: activation threshold at 0.
No vanishing gradient problem for positive inputs.
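The three activation functions above can be sketched in a few lines of NumPy (an illustrative sketch; the function names are mine, not from the labs):

```python
import numpy as np

def sigmoid(z):
    # Sigmoid: maps any real input to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Tanh: zero-centered, bounded in (-1, 1)
    return np.tanh(z)

def relu(z):
    # ReLU: identity for positive inputs, zero otherwise
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))
print(tanh(z))
print(relu(z))  # [0. 0. 2.]
```

Note how only ReLU leaves positive inputs unchanged, which is why its gradient does not vanish there.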

Softmax is an activation function used in the output layer of classification models, especially for multi-class problems.
It converts raw scores (logits) into probabilities, ensuring that \(\sum_{i=1}^{N} P(y_i) = 1\) where \(P(y_i)\) is the predicted probability for class \(i\).
Given an input vector \(z\), Softmax transforms it as: \[ \sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}} \]
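The formula translates directly into NumPy; the max-subtraction below is a standard numerical-stability trick (it leaves the result unchanged because the common factor cancels):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability:
    # e^(z_i - c) / sum_j e^(z_j - c) = e^(z_i) / sum_j e^(z_j)
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())  # probabilities summing to 1
```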
An AN takes a vector of input values \(x_{1}, \ldots, x_{d}\) and combines it with some weights that are local to the neuron \(\left(w_{0}, w_{1}, . ., w_{d}\right)\) to compute a net input \(w_{0}+\sum_{i=1}^{d} w_{i} \cdot x_{i}\).
To compute its output, it then passes the net input through a possibly non-linear univariate activation function \(g(\cdot)\), usually chosen from a set of options such as Sigmoid, Tanh or ReLU functions.
To deal with the bias, we create an extra input variable \(x_{0}\) with value always equal to 1, and so the function computed by a single artificial neuron (parameterized by its weights \(\mathbf{w}\)) is:
\[ y(\mathbf{x})=g\left(w_{0}+\sum_{i=1}^{d} w_{i} x_{i}\right)=g\left(\sum_{i=0}^{d} w_{i} x_{i}\right)=g\left(\mathbf{w}^{\mathbf{T}} \mathbf{x}\right) \]
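A minimal sketch of this single-neuron computation (the weight values are illustrative only):

```python
import numpy as np

def neuron(x, w, g):
    # Prepend x0 = 1 so that w[0] acts as the bias term,
    # then compute g(w^T x) as in the formula above
    x_aug = np.concatenate(([1.0], x))
    return g(x_aug @ w)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w = np.array([0.5, -1.0, 2.0])  # illustrative weights: bias, w1, w2
y = neuron(np.array([1.0, 1.0]), w, sigmoid)
print(y)  # sigmoid(0.5 - 1.0 + 2.0) = sigmoid(1.5)
```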
Continuing with the brain analogy, one can combine (artificial) neurons to create better learners.
A simple artificial neural network is created by two types of modifications to the basic Artificial Neuron.
Stacking several neurons instead of just one.
Adding an additional layer of neurons, which is called a hidden layer.
This yields a system where the output of a neuron can be the input of another in many different ways.
In this figure, we have used circles to also denote the inputs to the network.
Circles labeled +1 are bias units, and correspond to the intercept term.
The leftmost layer of the network is called the input layer.
The rightmost layer of the network is called the output layer.
The middle layer of nodes is called the hidden layer, because its values are not observed in the training set.
Bias nodes are not counted when stating the size of a layer.
With all this in mind our example neural network has three layers with:
An ANN is a predictive model (a learner) whose properties and behaviour can be well characterized.
It operates through a process known as forward propagation, which encompasses the information flow from the input layer to the output layer.
Forward propagation is performed by composing a series of linear and non-linear (activation) functions.
These are characterized (parametrized) by their weights and biases, that need to be learned from data.
The process that encompasses the computations required to go from the input values to the final output is known as forward propagation.
For the ANN with 3 input values and 3 neurons in the hidden layer we have:
\[\begin{eqnarray} a_1^{(2)}&=&f(\theta_{10}^{(1)}+\theta_{11}^{(1)}x_1+\theta_{12}^{(1)}x_2+\theta_{13}^{(1)}x_3)\\ a_2^{(2)}&=&f(\theta_{20}^{(1)}+\theta_{21}^{(1)}x_1+\theta_{22}^{(1)}x_2+\theta_{23}^{(1)}x_3)\\ a_3^{(2)}&=&f(\theta_{30}^{(1)}+\theta_{31}^{(1)}x_1+\theta_{32}^{(1)}x_2+\theta_{33}^{(1)}x_3) \end{eqnarray}\]
Forward propagation can be written compactly as:
\[ z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)} \]
\[ a^{(l+1)} = f(z^{(l+1)}) \]
where:
\(W^{(l)}\) contains the weights,
\(b^{(l)}\) contains the bias terms,
\(f(\cdot)\) is applied elementwise.
This form is used in most implementations.
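The compact layer-by-layer recursion can be sketched with matrix-vector products; the layer sizes and random parameters below are illustrative only:

```python
import numpy as np

def forward(x, weights, biases, f=np.tanh):
    # Apply z = W a + b and a = f(z) for each layer in turn;
    # f is applied elementwise
    a = x
    for W, b in zip(weights, biases):
        a = f(W @ a + b)
    return a

rng = np.random.default_rng(0)
# Illustrative 3-3-1 network with random parameters
weights = [rng.standard_normal((3, 3)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
out = forward(np.array([0.5, -0.2, 0.1]), weights, biases)
print(out)  # a single tanh-bounded output value
```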
An alternative compact notation incorporates the bias into the weight matrix.
Standard form (explicit bias)
\[ z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)} \]
\[ a^{(l+1)} = f(z^{(l+1)}) \]
Augmented form (bias included)
\[ z^{(l+1)} = \Theta^{(l)} \tilde{a}^{(l)} \]
\[ \tilde{a}^{(l)} = \begin{bmatrix} 1 \\ a^{(l)} \end{bmatrix} \]
\[ a^{(l+1)} = f(z^{(l+1)}) \]
In short, a neural network defines a parametric function:
\[ \hat{y} = f(x; \theta) \]
where \(f(\cdot)\) is obtained by composing a sequence of transformations:
\[ f(x) = f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(1)}(x) \]
Each layer defines a transformation of the form:
\[ f^{(l)}(a) = f\big(W^{(l)} a + b^{(l)}\big) \]
so that the activations propagate as:
\[ a^{(l+1)} = f^{(l)}(a^{(l)}) \]
with: \(a^{(1)} = x\) (input layer) and \(a^{(L)} = \hat{y}\) (output layer).
The way input data is transformed, through a series of weightings and transformations, until the output layer is called forward propagation.
By organizing parameters in matrices and using matrix-vector operations, fast linear algebra routines can be used to perform the required calculations efficiently.
We have so far focused on the single-hidden-layer neural network of the example.
One can, however, build neural networks with many distinct architectures (meaning patterns of connectivity between neurons), including ones with multiple hidden layers.
An ANN is a predictive model whose properties and behaviour can be mathematically characterized.
The ANN acts by composing a series of linear and non-linear (activation) functions.
These transformations are characterized by their weights and biases, which need to be learned from data.
Training the network consists in adjusting these parameters so that predictions match, as closely as possible, the observed outputs.
To learn the parameters, we need to measure how good the predictions are.
For a given observation \((x, y)\), we use a loss function \(\ell(y, \hat{y})\) to compare:
Given a dataset \(\{(x_i,y_i)\}_{i=1}^n\), we define the average loss:
\[ J(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \hat{y}_i) \]
The choice of loss function depends on the type of problem.
For regression problems a common choice is the squared error: \[ \ell(y, \hat{y}) = (y - \hat{y})^2 \]
For classification problems, we often use loss functions based on probabilities. In particular:
The loss function should reflect how we measure prediction quality.
\[ \ell(h_\theta(x),y)=\begin{cases} -\log h_\theta(x) & \textrm{if }y=1\\ -\log(1-h_\theta(x)) & \textrm{if }y=0 \end{cases} \]
\[ \ell(h_\theta(x),y)=-y\log h_\theta(x) - (1-y)\log(1-h_\theta(x)) \]
\[ J(\theta)=-\frac{1}{n}\left[\sum_{i=1}^n (y^{(i)}\log h_\theta(x^{(i)})+ (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] \]
\[ \hat{y} \approx P(y = 1 \mid x) \]
\[ P(y \mid x) = \hat{y}^y (1 - \hat{y})^{1-y} \]
This probabilistic view is not required, but provides a useful way to motivate the choice of loss function.
Given a dataset \(\{(x_i, y_i)\}_{i=1}^n\), we can measure how well the model fits the data (how likely is the model given the data) through the likelihood: \[ L(\theta) = \prod_{i=1}^n \hat{y}_i^{y_i} (1 - \hat{y}_i)^{1-y_i} \]
Maximizing this likelihood is equivalent to minimizing \(- \log L(\theta)\), which leads to the cross-entropy loss: \[ \ell(y, \hat{y}) = -y \log(\hat{y}) - (1 - y)\log(1 - \hat{y}) \] Which should provide a better intuition for this loss function.
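The cross-entropy loss above can be sketched directly; the clipping below is a common practical safeguard (an assumption of mine, not stated in the text) to avoid evaluating \(\log 0\):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # Clip predictions away from exactly 0 and 1 to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.1, 0.8])
J = cross_entropy(y, y_hat).mean()  # average loss over the dataset
print(J)
```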
\[ J(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, \hat{y}_i) \]
which measures how well the network fits the data.
\[ \theta = \{W^{(l)}, b^{(l)}\}_{l=1}^{L} \]
The key question is: How should we adjust \(\theta\) to reduce \(J(\theta)\)?
Training a network consists in finding the parameters (weights and biases) that minimize the cost function \(J(\theta)\).
The cost function, \(J(\theta)\), depends on all model parameters.
To reduce it, we need to understand how it changes when we modify \(\theta\).
The gradient of \(J\) is a vector of partial derivatives defined as:
\[ \nabla J(\theta) = \left( \frac{\partial J}{\partial \theta_1}, \dots, \frac{\partial J}{\partial \theta_p} \right) \]
It indicates how \(J(\theta)\) changes with each parameter.
The gradient vector points in the direction of steepest increase so:
To reduce the cost, we move in the direction of steepest decrease, given by \[-\nabla J(\theta)\]
To minimize a cost function \(J(\theta)\), we proceed as follows:
https://assets.yihui.org/figures/animation/example/grad-desc
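The gradient descent loop can be sketched in a few lines; the quadratic cost below is an illustrative example, not from the text:

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, n_steps=100):
    # Repeatedly step in the direction of -grad(theta),
    # scaled by the learning rate lr
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimise J(theta) = (theta - 3)^2, whose gradient is 2 (theta - 3)
theta = gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0])
print(theta)  # converges towards the minimizer 3
```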
To apply gradient descent, we need to compute \(\nabla J(\theta)\).
The cost depends on the parameters through multiple layers: \[ x \rightarrow a^{(2)} \rightarrow a^{(3)} \rightarrow \cdots \rightarrow \hat{y} \]
Therefore, we must compute derivatives through a composition of functions
Backpropagation, an algorithm introduced in the 1970s in an MSc thesis, applies the chain rule to compute these derivatives efficiently.
This enabled neural networks to solve previously intractable problems.
\[ z^{(l)} = W^{(l-1)} a^{(l-1)} + b^{(l-1)} \]
\[ a^{(l)} = f(z^{(l)}) \]
\[ \delta^{(l)} = \frac{\partial J}{\partial z^{(l)}} \]
Recall: \(\delta^{(L)} = \frac{\partial J}{\partial z^{(L)}}\)
The cost depends on \(z^{(L)}\) through the activation \(a^{(L)}\)
Applying the chain rule:
\[ \delta^{(L)} = \frac{\partial J}{\partial a^{(L)}} \odot f'(z^{(L)}) \]
\[ z^{(l)} \rightarrow a^{(l)} \rightarrow z^{(l+1)} \rightarrow J \]
\[ \delta^{(l)} = (W^{(l)})^T \delta^{(l+1)} \odot f'(z^{(l)}) \]
The term \((W^{(l)})^T \delta^{(l+1)}\) propagates the effect of the cost backwards
The term \(f'(z^{(l)})\) accounts for the activation function
\[ x \rightarrow a^{(2)} \rightarrow a^{(3)} \rightarrow \cdots \rightarrow \hat{y} \]
\[ J \rightarrow \delta^{(L)} \rightarrow \delta^{(L-1)} \rightarrow \cdots \rightarrow \delta^{(1)} \]
Forward: how the network produces predictions
Backward: how each parameter affects the cost
After propagating the derivatives backwards, we can compute the gradients with respect to the model parameters
Using the chain rule:
\[ \frac{\partial J}{\partial W^{(l)}} = \delta^{(l+1)} (a^{(l)})^T \]
\[ \frac{\partial J}{\partial b^{(l)}} = \delta^{(l+1)} \]
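The four backpropagation equations can be sketched end to end on a tiny network. This is an illustrative 2-3-1 network with sigmoid activations and squared-error loss \(J = \tfrac{1}{2}(\hat{y}-y)^2\) (my choice for a simple \(\partial J/\partial a^{(L)}\); the weights are random, not from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((3, 2)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)
x, y = np.array([0.5, -0.3]), np.array([1.0])

# Forward pass, keeping z and a at every layer
z2 = W1 @ x + b1;  a2 = sigmoid(z2)
z3 = W2 @ a2 + b2; a3 = sigmoid(z3)

# Backward pass: delta at the output, then propagate
# with (W^T delta) * f'(z); for the sigmoid f'(z) = a (1 - a)
delta3 = (a3 - y) * a3 * (1 - a3)          # dJ/dz3
delta2 = (W2.T @ delta3) * a2 * (1 - a2)   # dJ/dz2

# Gradients: dJ/dW^(l) = delta^(l+1) (a^(l))^T, dJ/db^(l) = delta^(l+1)
dW2 = np.outer(delta3, a2); db2 = delta3
dW1 = np.outer(delta2, x);  db1 = delta2
print(dW1)
```

A finite-difference check on any single weight is a good way to verify such a hand-written backward pass.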
sigmoid activation
cross-entropy loss
Output layer: \(a^{(L)} = \hat{y} = \sigma(z^{(L)})\)
Loss: \(\ell(y,\hat{y}) = -y \log(\hat{y}) - (1-y)\log(1-\hat{y})\)
\[ \delta^{(L)} = \frac{\partial J}{\partial z^{(L)}} = \frac{\partial J}{\partial a^{(L)}} \cdot \frac{\partial a^{(L)}}{\partial z^{(L)}} \]
\[ \frac{\partial J}{\partial a^{(L)}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} \]
\[ \frac{\partial a^{(L)}}{\partial z^{(L)}} = \hat{y}(1-\hat{y}) \]
\[ \delta^{(L)} = \hat{y} - y \]
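This simplification can be verified numerically with a central finite difference (an illustrative check, with an arbitrary value of \(z\)):

```python
import numpy as np

# Check that, with a sigmoid output and cross-entropy loss,
# dJ/dz at the output simplifies to y_hat - y
def loss(z, y):
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

z, y, eps = 0.7, 1.0, 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)
analytic = 1.0 / (1.0 + np.exp(-z)) - y   # y_hat - y
print(numeric, analytic)  # both approximately -0.3318
```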
Modern deep learning frameworks do not compute gradients manually.
Instead, they use automatic differentiation and computational graphs to simplify and speed up backpropagation.
A computational graph represents the sequence of operations in a neural network as a directed graph.
Automatic differentiation (AD) relies on the computational graph to apply the chain rule and compute gradients automatically in the backward pass.
Frameworks like TensorFlow, PyTorch, and JAX use reverse-mode differentiation, which is particularly efficient for functions with many parameters (like neural networks).
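The core idea of reverse-mode AD fits in a toy scalar class (a pedagogical sketch of my own, not the API of any real framework): each operation records its parents and local derivatives, and `backward` replays the graph in reverse topological order applying the chain rule.

```python
class Value:
    """Minimal scalar reverse-mode autodiff node (a toy sketch)."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data, self.grad = data, 0.0
        self._parents, self._grad_fns = parents, grad_fns

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Build a topological order, then apply the chain rule
        # from the output back to the leaves
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, fn in zip(v._parents, v._grad_fns):
                p.grad += fn(v.grad)

x, w = Value(3.0), Value(2.0)
y = x * w + x          # y = w*x + x, so dy/dx = w + 1, dy/dw = x
y.backward()
print(x.grad, w.grad)  # 3.0 3.0
```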
The learning process, as it has been derived, may be improved in different ways.
Its limitations can be partially solved by applying distinct approaches.
Network performance is affected by many hyperparameters
Traditionally it was considered that one hidden layer may be enough.
Later research showed that adding more layers increases efficiency,
although there is also a risk of overfitting.
It has been shown that using the whole training set only once may not be enough for training an ANN.
One iteration of the training set is known as an epoch.
The number of epochs \(N_E\), defines how many times we iterate along the whole training set.
\(N_E\) can be fixed, determined by cross-validation or left open and stop the training when it does not improve anymore.
A complementary strategy to increasing the number of epochs is decreasing the number of instances in each iteration.
That is, the training set is broken into a number of batches that are processed separately.
Batch learning allows weights to be updated more frequently per epoch.
The advantage of batch learning is related to the gradient descent approach used.
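Splitting an epoch into mini-batches can be sketched as follows (an illustrative helper of my own, with a per-epoch shuffle):

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    # Shuffle the indices once per epoch, then yield consecutive slices;
    # the last batch may be smaller than batch_size
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

rng = np.random.default_rng(0)
X, y = np.arange(10).reshape(10, 1), np.arange(10)
for Xb, yb in minibatches(X, y, batch_size=4, rng=rng):
    print(Xb.shape)  # (4, 1), (4, 1), (2, 1)
```

Each yielded batch would trigger one gradient update, so ten instances give three updates per epoch instead of one.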
Training speed can be improved by adjusting key factors that influence convergence.
Weight Initialization: Properly initializing weights helps prevent vanishing or exploding gradients, leading to faster convergence.
Adjusting Learning Rate: A well-tuned learning rate accelerates training while avoiding instability or slow convergence.
Using Efficient Cost Functions: Choosing an appropriate loss function (e.g., cross-entropy for classification) speeds up gradient updates.
Overfitting occurs when a model learns noise instead of general patterns. Common strategies to prevent it include:
L2 Regularization: Penalizes large weights to reduce model complexity and improve generalization.
Early Stopping: Stops training when validation loss starts increasing, preventing unnecessary overfitting.
Dropout: Randomly disables neurons during training to make the model more robust.
Data Augmentation: Expands the training set by applying transformations (e.g., rotations, scaling) to improve generalization.
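As one concrete example, dropout can be sketched with a random mask; the rescaling by \(1/(1-p)\) is the common "inverted dropout" variant (an assumption of mine, not stated in the text), which keeps the expected activation unchanged so no scaling is needed at test time:

```python
import numpy as np

def dropout(a, p_drop, rng, training=True):
    # Inverted dropout: zero each activation with probability p_drop
    # and rescale the survivors by 1 / (1 - p_drop)
    if not training:
        return a
    mask = rng.random(a.shape) >= p_drop
    return a * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
a = np.ones(8)
print(dropout(a, p_drop=0.5, rng=rng))  # surviving entries scaled to 2.0
print(dropout(a, p_drop=0.5, rng=rng, training=False))  # unchanged
```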
| Techniques | Performance Improvement | Learning Speed | Overfitting | Description |
|---|---|---|---|---|
| Network Architecture | X | X | X | Adjust layers, neurons and connections. |
| Epochs, Iterations, and Batch Size | | X | | Controls updates per epoch to improve efficiency. |
| Softmax | X | | | Turns outputs into probabilities. |
| Training Algorithms | X | X | | Gradient descent improvements. |
| Learning Rate | X | X | | Step size in gradient updates. |
| Cross-Entropy Loss | | X | | Optimized for classification. |
| L2 Regularization | X | | X | Penalizes large weights to prevent overfitting. |
| Early Stopping | | | X | Stops training when validation loss worsens. |
| Dropout | X | | X | Randomly disables neurons to enhance generalization. |
| Data Augmentation | | | X | Expands training data by applying transformations. |
Lab Lab-C3.1-IrisANN-Modular contains a detailed example and a Python notebook on building a NN from scratch.
Lab Lab-C3.1-Dividend Prediction shows how to use R to build and use a neural network.