Neural Networks, Deep Learning
and Artificial Intelligence
In the post-pandemic world, the lightning rise of AI, with its mix of realities and promises, is impacting society.
Since ChatGPT entered the scene, everybody has an experience, an opinion, or a fear about the topic.
Most tasks performed by AI can be described as Classification or Prediction, used in a wide range of applications.
AI relies on machine learning algorithms to make predictions based on large amounts of data.
AI has far-reaching implications beyond its predictive capabilities, including ethical, social and technological ones.
In many contexts, talking about AI means talking about Deep Learning (DL).
DL is a successful AI approach which has powered many applications such as self-driving cars, voice assistants, and medical diagnosis systems.
DL originates in the field of Artificial Neural Networks (ANNs).
But DL extends the basic principles of ANNs in several ways.
We can see several milestones worth accounting for:
The Perceptron and the first Artificial Neural Networks, where the basic building block was introduced.
The Multilayer Perceptron and back-propagation, where more complex architectures were suggested to improve capabilities.
Deep Neural Networks, with many hidden layers and auto-tunability capabilities.
Success stories such as
the development of self-driving cars,
the use of AI in medical diagnosis, and
online shopping personalized recommendations
have also contributed to the widespread adoption of AI.
AI also comes with fears, fed by multiple sources from science fiction to religion:
Mass unemployment
Loss of privacy
AI bias
AI fakes
Or, simply, AI takeover
Where/How does it all fit?
Artificial intelligence: Ability of a computer to perform tasks commonly associated with intelligent beings.
Machine learning: the study of algorithms that learn from examples and experience, instead of relying on hard-coded rules, and make predictions on new data.
Deep learning: a subfield of ML focusing on learning data representations as successive layers of increasingly meaningful representations.
Near-human-level image classification
Near-human-level speech transcription
Near-human-level handwriting transcription
Dramatically improved machine translation
Dramatically improved text-to-speech conversion
Digital assistants such as Google Assistant and Amazon Alexa
Near-human-level autonomous driving
Improved ad targeting, as used by Google, Baidu, or Bing
Improved search results on the web
Ability to answer natural language questions
Superhuman Go playing
According to F. Chollet, the developer of Keras, these are some of the breakthroughs achieved by deep learning.
See the source of the original figure for an illustration of how the McCulloch-Pitts neuron can emulate logical operations such as AND, OR or NOT, but not XOR (a small sketch follows below).
To overcome these limitations, Rosenblatt proposed the perceptron model, or artificial neuron, in 1958.
It generalizes the McCulloch-Pitts neuron in that weights and thresholds can be learnt over time.
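As an illustrative sketch (not from the original figure), a threshold unit with hand-picked, fixed weights can emulate AND and OR; no single such unit can emulate XOR:

# A McCulloch-Pitts-style threshold unit: fires (1) when the weighted
# sum of its inputs reaches the threshold, and stays at 0 otherwise.
threshold_unit <- function(x, w, threshold) {
  as.integer(sum(w * x) >= threshold)
}

inputs <- list(c(0, 0), c(0, 1), c(1, 0), c(1, 1))

# AND: both inputs must be active (weights 1, 1; threshold 2)
sapply(inputs, threshold_unit, w = c(1, 1), threshold = 2)  # 0 0 0 1

# OR: one active input is enough (weights 1, 1; threshold 1)
sapply(inputs, threshold_unit, w = c(1, 1), threshold = 1)  # 0 1 1 1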
With all these ideas in mind we can now define an Artificial Neuron as a computational unit that:
takes as input \(x=(x_0,x_1,x_2,x_3),\ (x_0 = +1 \equiv bias)\),
outputs \(h_{\theta}(x) = f(\theta^\intercal x) = f(\sum_i \theta_ix_i)\),
where \(f:\mathbb{R}\mapsto \mathbb{R}\) is called the activation function.
The goal of the activation function is to provide the neuron with the capability of producing the required outputs.
Flexible enough to produce the required range of values.
Usually chosen from a (small) set of possibilities.
Sigmoid, or logistic, function:
\[ f(z)=\frac{1}{1+e^{-z}} \]
Outputs real values \(\in (0,1)\).
Natural interpretation as a probability.
Hyperbolic tangent, also called tanh, function:
\[ f(z)=\frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} \]
Outputs are zero-centered and bounded in \((-1,1)\).
It is a scaled and shifted Sigmoid.
It has a stronger gradient than the Sigmoid, but still suffers from the vanishing gradient problem.
Its derivative is \(f'(z)=1-(f(z))^2\).
ReLU, the rectified linear unit: \(f(z)=\max\{0,z\}\).
Close to linear: a piece-wise linear function with two linear pieces.
Outputs are in \([0,\infty)\), thus not bounded.
Half rectified: activation threshold at 0.
No vanishing gradient problem.
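The three activation functions above can be written and compared in a few lines of R (an illustrative sketch, not part of the original material):

# Classical activation functions
sigmoid <- function(z) 1 / (1 + exp(-z))   # outputs in (0, 1)
tanh_f  <- function(z) tanh(z)             # outputs in (-1, 1), zero-centered
relu    <- function(z) pmax(0, z)          # outputs in [0, Inf), not bounded

# Compare their shapes and output ranges on a common grid
z <- seq(-5, 5, length.out = 200)
plot(z, sigmoid(z), type = "l", ylim = c(-1, 5), ylab = "f(z)")
lines(z, tanh_f(z), lty = 2)
lines(z, relu(z), lty = 3)
legend("topleft", legend = c("sigmoid", "tanh", "ReLU"), lty = 1:3)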
An artificial neuron takes a vector of input values \(x_{1}, \ldots, x_{d}\) and combines it with some weights that are local to the neuron \(\left(w_{0}, w_{1}, \ldots, w_{d}\right)\) to compute a net input \(w_{0}+\sum_{i=1}^{d} w_{i} \cdot x_{i}\).
To compute its output, it then passes the net input through a possibly non-linear univariate activation function \(g(\cdot)\), usually chosen from a set of options such as the Sigmoid, Tanh or ReLU functions.
To deal with the bias, we create an extra input variable \(x_{0}\) with value always equal to 1, and so the function computed by a single artificial neuron (parameterized by its weights \(\mathbf{w}\)) is:
\[ y(\mathbf{x})=g\left(w_{0}+\sum_{i=1}^{d} w_{i} x_{i}\right)=g\left(\sum_{i=0}^{d} w_{i} x_{i}\right)=g\left(\mathbf{w}^{\mathbf{T}} \mathbf{x}\right) \]
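A minimal numeric sketch of this computation, with arbitrary illustrative weights and a sigmoid activation:

sigmoid <- function(z) 1 / (1 + exp(-z))

x <- c(1, 0.5, -1.2, 3.0)    # x_0 = 1 plays the role of the bias input
w <- c(0.1, 0.8, -0.3, 0.2)  # weights w_0, ..., w_3 (fixed here; learnt in practice)

y <- sigmoid(sum(w * x))     # y(x) = g(w^T x)
y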
Following the brain analogy, one can combine (artificial) neurons to create better learners.
A simple artificial neural network is usually created by combining two types of modifications of the basic perceptron (artificial neuron).
This yields a system where the output of a neuron can be the input of another in many different ways.
In this figure, we have used circles to also denote the inputs to the network.
Circles labeled +1 are bias units, and correspond to the intercept term.
The leftmost layer of the network is called the input layer.
The rightmost layer of the network is called the output layer.
The middle layer of nodes is called the hidden layer, because its values are not observed in the training set.
Bias nodes are not counted when stating the size of a layer.
With all this in mind, our example neural network has three layers: an input layer, one hidden layer, and an output layer.
An ANN is a predictive model (a learner) whose properties and behaviour can be well characterized.
It operates through a process known as forward propagation, which encompasses the information flow from the input layer to the output layer.
Forward propagation is performed by composing a series of linear and non-linear (activation) functions.
These are characterized (parametrized) by their weights and biases, which need to be learnt.
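A sketch of forward propagation through a small network with one hidden layer of 4 neurons; the weights and biases are random here, only to show the composition of linear and non-linear functions:

sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(1)
x  <- c(0.2, -0.5, 1.0)               # input vector (3 features)
W1 <- matrix(rnorm(3 * 4), nrow = 4)  # hidden layer: 4 neurons, 3 inputs each
b1 <- rnorm(4)                        # hidden-layer biases
W2 <- matrix(rnorm(4), nrow = 1)      # output layer: 1 neuron, 4 inputs
b2 <- rnorm(1)                        # output bias

h <- sigmoid(W1 %*% x + b1)           # hidden-layer activations (4 x 1)
y <- sigmoid(W2 %*% h + b2)           # network output (1 x 1)
y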
In order for the ANN to perform well, the training process aims at finding the best possible parameter values for the given learning task.
This is usually done using an iterative optimization procedure such as gradient descent.
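For intuition, here is a toy gradient-descent loop for a single sigmoid neuron with squared-error loss (an illustrative sketch only, not the procedure used internally by any particular package):

sigmoid <- function(z) 1 / (1 + exp(-z))

# Toy data: a bias column plus one feature, and a binary target
X  <- cbind(1, c(0.5, 1.5, -0.3, -1.2))
y  <- c(1, 1, 0, 0)
w  <- c(0, 0)   # initial weights
lr <- 0.5       # learning rate

for (it in 1:1000) {
  p    <- sigmoid(X %*% w)                    # forward pass
  grad <- t(X) %*% ((p - y) * p * (1 - p))    # gradient of the squared error
  w    <- w - lr * as.vector(grad)            # gradient-descent update
}
w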
We have so far focused on the single-hidden-layer neural network of the example.
One can, however, build neural networks with many distinct architectures (meaning patterns of connectivity between neurons), including ones with multiple hidden layers.
We use the neuralnet package to build a simple neural network to predict if a type of stock pays dividends or not.
We use the dividendinfo.csv dataset from https://github.com/MGCodesandStats/datasets.
mydata <- read.csv("https://raw.githubusercontent.com/MGCodesandStats/datasets/master/dividendinfo.csv")
str(mydata)
'data.frame': 200 obs. of 6 variables:
$ dividend : int 0 1 1 0 1 1 1 0 1 1 ...
$ fcfps : num 2.75 4.96 2.78 0.43 2.94 3.9 1.09 2.32 2.5 4.46 ...
$ earnings_growth: num -19.25 0.83 1.09 12.97 2.44 ...
$ de : num 1.11 1.09 0.19 1.7 1.83 0.46 2.32 3.34 3.15 3.33 ...
$ mcap : int 545 630 562 388 684 621 656 351 658 330 ...
$ current_ratio : num 0.924 1.469 1.976 1.942 2.487 ...
Finally, we break our data into a training and a test set:
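The splitting code is not shown above; a minimal sketch follows, where the 80/20 ratio and set.seed(123) are assumptions, and the names trainset and testset match the variables used in the later code:

set.seed(123)
train_idx <- sample(seq_len(nrow(mydata)), size = 0.8 * nrow(mydata))
trainset  <- mydata[train_idx, ]   # 80% of the rows for training
testset   <- mydata[-train_idx, ]  # remaining 20% for testing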
We train a simple NN with two hidden layers, with 4 and 2 neurons respectively.
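A sketch of what the training call could look like, assuming the trainset object from the split above; hidden = c(4, 2) encodes the two hidden layers, while linear.output = FALSE (a logistic output for the 0/1 target) is an assumption:

library(neuralnet)

# Two hidden layers with 4 and 2 neurons respectively
nn <- neuralnet(dividend ~ fcfps + earnings_growth + de + mcap + current_ratio,
                data = trainset,
                hidden = c(4, 2),
                linear.output = FALSE)
plot(nn)  # draws the network together with its estimated weights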
The output of the procedure is a neural network with estimated weights
# Keep only the predictor columns of the test set
temp_test <- subset(testset, select = c("fcfps", "earnings_growth",
                                        "de", "mcap", "current_ratio"))

# Run the test observations through the trained network
nn.results <- compute(nn, temp_test)

# Compare the actual labels with the predicted probabilities
results <- data.frame(actual = testset$dividend,
                      prediction = nn.results$net.result)
head(results)
actual prediction
9 1 0.9919213885
19 1 0.9769206123
22 0 0.0002187144
26 0 0.6093330933
27 1 0.7454164893
29 1 0.9515431416
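As an added sketch (not part of the original output), the predicted probabilities can be rounded to 0/1 and tabulated against the actual labels to obtain a confusion matrix and an accuracy figure:

predicted <- round(results$prediction)                 # probabilities -> 0/1 labels
table(actual = results$actual, predicted = predicted)  # confusion matrix
mean(predicted == results$actual)                      # test-set accuracy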
“Deep Neural networks” are NNs with several hidden layers.
The real shift, from Shallow to Deep NNs, is not (only) the number of layers.
The difference comes from realizing that the successive layers can learn increasingly meaningful representations of the data by themselves, instead of relying on hand-crafted features.
This is often associated with working with structured vs. unstructured data.
Task: Distinguish human from non-human in an image
This can be attacked in the same way as the digit recognition problem (with outputs "yes" and "no"), although the cost of training the network would be much higher.
An alternative approach may be to try to solve the problem hierarchically.
In order for these networks to succeed, it is important not to have to hand-craft the complicated structure of weights and biases required for such a hierarchy of layers and functions.
In 2006 techniques enabling learning in Deep Neural Nets were developed.
These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas.
It turns out that, equipped with such techniques, deep neural networks perform much better on many problems than shallow neural networks.