Neural Network Architecture: all you need to know as an MLE [2023 edition]
Explore how to develop a high-performance neural network and best practices. This guide helps MLEs expand their insight and play key roles in the growing AI/ML field.
Deep learning technologies are fueling the emergence of next-generation AI tools. These tools are shaping, and will continue to shape, how businesses in various industries operate. According to Statista, the deep learning software market will hit the $930 million mark by 2025, a substantial 15-fold growth compared to 2015.
As intriguing as AI software is, the real brilliance lies in the data scientists and machine learning engineers who design, train and improve its neural network architecture. Today’s intelligent chatbots, search engines, fraud detectors, and recommendation engines are powered by different neural network models.
In this article, we’ll explore what a neural network is, why it’s important, the common neural network architectures, and ways to improve their implementations.
What is a Neural Network?
A neural network is a computational model used in deep learning that processes data in ways loosely inspired by the human brain. Like the brain, a neural network consists of neurons, or nodes, that are interconnected and stacked in several layers. Each node processes its input and passes the result to the next layer, and the result is refined layer by layer until the final layer produces the output.
Unlike conventional machine learning approaches, neural networks enable deep learning, allowing computer systems to learn and improve like humans. A deep neural network can handle and process larger volumes of data with greater precision. It now plays a vital role in various AI technologies, including visual recognition, machine translation, speech recognition, and conversational AI.
What is a Neuron?
The neuron is the fundamental unit of the neural network. Each neuron, or a node, has the following components.
The neuron’s input accepts raw information from the outside world. For example, a neural network’s node for NLP software takes in a string of text as input.
Weight is a value that indicates the importance of each input. For example, question words like 'when', 'who', or 'how' are assigned a higher weight when evaluating whether a phrase is a genuine question.
Bias is a constant added to the weighted input, which lets the model shift the neuron’s output up or down. It gives the machine learning algorithm extra freedom to fit the model to the training data.
The activation function, or transfer function, turns the linear result into a non-linear one, which lets the network model relationships that are not straight lines. There are different types of activation functions, such as binary step, sigmoid, tanh, or a simple clipping of a linear graph at a specific value (as ReLU does at zero).
The relationship between the input, weight, bias, and output of a neuron is expressed as:
Output = Activation Function (Weight * Input + Bias)
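The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming a sigmoid activation and made-up input values; the function names are hypothetical and not taken from any library.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Output = activation(sum of weight * input, plus bias)."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum + bias)


# Illustrative values chosen so the weighted sum plus bias is exactly zero.
inputs = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 0.0
print(neuron_output(inputs, weights, bias))  # sigmoid(0.0) = 0.5
```

With several inputs, each input is multiplied by its own weight and the products are summed before the bias is added.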
How does a neural network work?
A basic neural network has three types of layers: the input layer, the hidden layer, and the output layer.
Nodes at the input layer receive and process information from external sources. The input layer might have one or several nodes, depending on the training data.
The hidden layer is sandwiched between the input and output layers. It’s invisible to external systems and is what separates deep neural networks from basic machine learning models. The hidden layer receives the result from the input layer and continues the processing, where each hidden layer propagates the result to the next.
For example, the hidden layers might parse, classify or analyze specific text prepared by the input layer. There are no strict rules about the number of middle layers. A neural network model that runs complex processing will have more hidden layers, which consume more computing power and time.
The output layer consists of neurons that deliver the final outcome to the external system. There might be one or several neurons at the output layer. For example, a regression model has a single output neuron, but a classification model might have an output for each class.
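As a rough sketch of how data flows through the three layers, the snippet below wires two input features into a two-neuron hidden layer and a single output neuron. The weights are hand-picked for illustration (they make the network compute the absolute difference of its inputs); a real network would learn them from data.

```python
def relu(z):
    return max(0.0, z)


def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(features):
    # Hidden layer: 2 inputs -> 2 neurons (illustrative, hand-picked weights).
    hidden = dense_layer(features, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], relu)
    # Output layer: 2 hidden values -> 1 neuron with no activation (regression).
    return dense_layer(hidden, [[1.0, 1.0]], [0.0], lambda z: z)


print(forward([3.0, 1.0]))  # [2.0]
print(forward([1.0, 3.0]))  # [2.0]
```

A classification model would instead end with one output neuron per class.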
Why Is Understanding Neural Networks Important for MLEs?
Understanding neural networks, specifically different architectures, allows machine learning engineers to develop systems capable of intelligently processing unstructured data. Moreover, you learn the strengths and weaknesses of each deep learning model and apply the appropriate ones for different problems.
For example, GoogLeNet is highly efficient at processing video frames at scale, while machine learning engineers use AlexNet to train models on complex objects. When computing power is limited, LeNet-5 might be a good fit, as it was created in an era when CPUs were not as powerful as they are today. These famous machine learning models are built on various neural network architectures.
Let’s explore the common ones.
Most Common Types of Neural Network Architectures
1. Perceptron
The perceptron is a simple neural network that uses a single neuron to turn raw input into output. Introduced in the 1950s, the perceptron is a fundamental building block for more advanced classification models. It is a type of single-layer neural network that applies supervised learning of binary classifiers.
While the perceptron has a simple architecture, it trains quickly on both small and large datasets, and it classifies consistently when the two classes are linearly separable. Machine learning engineers use this model to obtain quick baseline predictions.
A perceptron is a linear binary classifier. Therefore, this architecture is best suited for segregating raw data into two distinct values. For example,
you can use the perceptron to determine if the pixel data from an image is either dark or bright.
Likewise, the model is useful in NLP for classifying text as positive or negative.
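To make the idea of a linear binary classifier concrete, here is a minimal sketch of the classic perceptron learning rule, trained on the logical AND function as a stand-in for any two-class, linearly separable dataset. The learning rate and epoch count are illustrative assumptions.

```python
def predict(weights, bias, x):
    """Step activation: 1 if the weighted sum is positive, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_perceptron(samples, labels, lr=1.0, epochs=10):
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            # Nudge the decision boundary only when the prediction is wrong.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]  # logical AND
weights, bias = train_perceptron(samples, labels)
print([predict(weights, bias, x) for x in samples])  # [0, 0, 0, 1]
```

The same rule cannot learn XOR, because no single straight line separates its two classes. That limitation is what motivates the multi-layer models described later.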
A perceptron is limited in its capability to handle real-world data. It is more useful when expanded as a multi-layer perceptron. However, the model is commonly found in learning materials or machine learning libraries. For example,
Coursera uses perceptron as an example in several machine learning courses.
2. Single-layer neural network
A single-layer neural network has one layer of interlinking input and output neurons. The perceptron, described above, is a type of single-layer network. However, the opposite might not be true, as some single-layer networks are not necessarily perceptrons.
While bearing a similar architecture, the single-layer neural network offers a more versatile output. Besides linear classification, it provides MLEs with activation functions like rectified linear unit (ReLU) or sigmoid.
These are probable use cases of a single-layer neural network.
Binary classification, as a perceptron does.
Regression and classification tasks, such as sales forecasting, real estate prices, and sentiment analysis.
Like the perceptron, a single-layer neural network has limited capabilities in processing complex real-world data. Hence, its application is limited to basic learning models. Instead, MLEs turn the single-layer model into a multilayer architecture for practical usage.
3. Multilayer perceptron
Multilayer perceptron (MLP), a feed-forward neural network, is a deep learning model consisting of multiple hidden layers. Unlike a single-layer perceptron, the MLP can process diverse input data and produce complex non-linear outputs. This makes multilayer perceptrons an excellent neural network model for real-life applications.
To learn from the training data, the MLP applies backpropagation. Backpropagation is a process that measures the error at the output layer (for example, the mean squared error), then propagates it backward, layer by layer, to calculate how much each weight contributed to that error. This lets the weights in all layers be updated over multiple iterations to reduce the loss function. In other words, the MLP becomes more accurate by learning from its own mistakes.
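The sketch below shows backpropagation on a tiny MLP with two inputs, two hidden neurons, and one output, trained on XOR with a mean squared error loss. The starting weights, learning rate, and step count are arbitrary assumptions for illustration; frameworks such as PyTorch or TensorFlow compute these gradients automatically.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    w1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output


def mse(params, data):
    return sum((forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)


def train_step(params, data, lr):
    """One full-batch gradient descent step using backpropagation."""
    w1, b1, w2, b2 = params
    g_w1 = [[0.0] * len(row) for row in w1]
    g_b1, g_w2, g_b2 = [0.0] * len(b1), [0.0] * len(w2), 0.0
    for x, y in data:
        hidden, out = forward(params, x)
        # Error signal at the output neuron (derivative of MSE through sigmoid).
        delta_out = 2 * (out - y) * out * (1 - out) / len(data)
        g_b2 += delta_out
        for j, h in enumerate(hidden):
            g_w2[j] += delta_out * h
            # Propagate the error signal back to hidden neuron j.
            delta_h = delta_out * w2[j] * h * (1 - h)
            g_b1[j] += delta_h
            for i, xi in enumerate(x):
                g_w1[j][i] += delta_h * xi
    return (
        [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(w1, g_w1)],
        [b - lr * g for b, g in zip(b1, g_b1)],
        [w - lr * g for w, g in zip(w2, g_w2)],
        b2 - lr * g_b2,
    )


data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR
params = ([[0.5, -0.4], [0.3, 0.8]], [0.1, -0.1], [0.7, -0.6], 0.05)

trained = params
for _ in range(500):
    trained = train_step(trained, data, lr=0.5)

print("loss before:", mse(params, data))
print("loss after: ", mse(trained, data))
```

The loss after training should be lower than before, although a network this small may need many more steps, or different starting weights, to solve XOR fully.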
The MLP’s additional hidden layers allow it to be trained on complex datasets. This widens its usage to several applications, including:
Medical AI software that diagnoses a patient’s disease based on the presenting symptoms.
Speech recognition software that transcribes verbal audio into written text.
Classifications in sentiment analysis, where emails or feedback are segregated into neutral, positive, or negative.
Stock price prediction. MLEs run regression analysis with multilayer perceptrons to predict probable outcomes.
A recommender system, where the underlying hidden layer processes various parameters before suggesting a viable output.
The multilayer perceptron is prevalent in many well-known software products and services.
Facebook uses the MLP architecture to personalize news feeds.
Google released a multilayer perceptron model in 2021, which aids computer vision tasks.
Knime is an open-source data analytics and visualization platform. It uses multilayer perceptron as part of its AI modeling approach.
Machine learning engineers can train MLP models with PyTorch, an open-source machine learning framework.
You’ll also find MLP models in TensorFlow, a free machine-learning platform.
4. Convolutional Neural Network (CNN)
A convolutional neural network is a feed-forward neural network specifically designed to identify, process, and classify images. Like a multilayer perceptron, the CNN has multiple hidden layers. However, each filter in a convolutional layer is meant to detect one specific feature wherever it appears in the image. Thus, all neurons that apply the same filter share a common bias and set of weights.
The CNN model uses three types of hidden layers to process images: convolutional, ReLU activation, and pooling.
The convolutional layer extracts and processes specific features from the image. For example, early layers filter basic shapes or edges, while deeper layers in an advanced CNN perform complex recognition of animals, vehicles, and other objects.
At each layer, the convolutional filter passes the feature to the ReLU activation layer, which brings non-linearity to the result.
The pooling layer downsamples the feature maps, which reduces the parameters and computation required downstream and helps the subsequent layers learn efficiently.
At the final layer, the CNN classifies the output according to the previous results.
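The three hidden-layer types above can be sketched on a toy image. The snippet assumes a 5x5 grayscale image and a hypothetical 2x2 filter that responds to dark-to-bright vertical edges; real CNNs learn their filters and use optimized library routines rather than nested loops.

```python
def conv2d(image, kernel):
    """Stride-1 convolution without padding (cross-correlation, as in CNNs)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# 5x5 image with a vertical edge: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]  # hypothetical edge-detecting filter

features = relu(conv2d(image, kernel))
pooled = max_pool(features)
print(features)  # four rows of [0, 2, 0, 0]: the filter fires only at the edge
print(pooled)    # [[2, 0], [2, 0]]
```

Note that the same four filter weights are reused at every image position, which is why a convolutional layer needs far fewer parameters than a fully connected one.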
Popular variants of convolutional neural networks
Throughout the years, data scientists have developed various deep learning models from the CNN. These are some renowned ones.
LeNet was one of the first convolutional neural networks. It was used for image classification, notably handwritten digit recognition, and led to future works of evolving CNN variants.
AlexNet was trained with millions of images and is a highly accurate CNN. It introduces dropout to prevent overfitting and enables overlapping pooling to retain output accuracy.
OverFeat was a modification of AlexNet, capable of predicting bounding boxes at specific spatial locations.
VGG was introduced to improve upon AlexNet’s capability. It does so by adding blocks of convolutional filters in the hidden layers.
Network-in-network. While not strictly CNN by definition, the network-in-network (NiN) model combines 1x1 convolutional layers and MLP to keep feature parameters at a minimum.
GoogLeNet and Inception. Inception is a deep neural network architecture that stacks modules of parallel convolutional filters of different sizes. GoogLeNet is the original network built on Inception, designed to improve the computational efficiency of deep learning.
SqueezeNet attempts to further reduce the number of parameters. It uses ‘fire modules’, which squeeze the input with 1x1 convolution filters before expanding it with a mix of 1x1 and 3x3 filters. This allows MLEs to use SqueezeNet on systems with lower computing power without giving up much accuracy.
ResNet is a deep CNN that uses ‘skip connections’ to counter vanishing gradients. A skip connection adds a block’s input directly to its output, which gives the gradient a shortcut back through the network during training.
Xception is an extension of the Inception model and uses depthwise separable convolutions in the order of pointwise followed by depthwise. It also lacks an intermediate ReLU non-linearity, which results in improved accuracy.
MobileNets are lightweight neural networks used in mobile devices and embedded controllers. They use depthwise separable convolutions, like Xception, to reduce computational loads.
The convolutional neural network’s capability to process immense image data accurately enables its usage in these areas.
CNN allows medical experts to identify cancerous cells or other anomalies in medical imaging systems.
It provides software with object-recognition features. For example, it can identify vehicles and people on the street in self-driving cars.
CNN enables speech detection and analysis, particularly in challenging or noisy environments.
Biometric devices use CNN to analyze and verify facial features for identification and security clearance.
These are examples of how organizations apply convolutional neural networks in their solutions.
Google explains how a convolutional neural network works in image classification.
Tesla applies CNN in its self-driving vehicles and feeds the model with data from cameras and sensors.
Facebook uses the convolutional neural network to improve its image recognition capability.
OpenCV, which provides open-source libraries for computer vision programming, uses CNN for object recognition.
5. Recurrent Neural Network (RNN)
The recurrent neural network (RNN) is a machine learning model capable of remembering its past decisions. Unlike a feed-forward network, the RNN has a hidden state in each neuron, which holds data or decisions from the previous step. In machine learning, engineers use RNNs to process sequential data, such as speech, text, or video.
When processing data, the neuron evaluates the input and the retained state when activating the output. Besides holding the past decision, the RNN’s hidden state is also assigned a weight. This allows the network to consider the importance of both parameters when making subsequent decisions.
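A single recurrent unit can be sketched as follows. The snippet assumes scalar inputs, a tanh activation, and hand-picked weights for the input (`w_x`) and the retained state (`w_h`); these names are illustrative, not part of any framework.

```python
import math


def rnn_step(x, h_prev, w_x, w_h, b):
    """h_t = tanh(w_x * x_t + w_h * h_(t-1) + b) for one scalar unit."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(sequence, w_x=1.0, w_h=0.5, b=0.0):
    h = 0.0  # the hidden state starts empty
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states


# The same values in a different order leave a different final state.
print(run_rnn([1.0, 0.0, 0.0]))
print(run_rnn([0.0, 0.0, 1.0]))
```

In the first run the early input keeps echoing through the hidden state even though later inputs are zero, which is the memory effect described above. It also fades with every step, which hints at why plain RNNs struggle with long sequences.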
The RNN’s ability to relate elements of sequential data makes it useful in these applications.
Natural language processing. RNN allows accurate prediction in NLP disciplines like language translation and sentiment analysis. For example, its memory state enables processing phrases like ‘not good’ and categorizes them accurately as negative.
RNN is also used in speech recognition to understand the verbal context of the audio and make clever predictions.
RNN’s sequential processing is helpful in stock price prediction and weather forecasting.
Video processing software uses RNN to categorize specific actions, such as punching, running, or jumping, by analyzing the frames sequentially.
RNN is the driving force behind many AI technologies and applications, including:
Google published a research paper on a novel RNN, which introduced the possibility of improving local search.
Hugging Face, a US-based company offering open-source AI tools, lists RNN models contributed by community members.
RNN models are found in machine learning frameworks like TensorFlow or PyTorch, which can be used to build, for example, a news headline generator.
6. Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) is a variant of the recurrent neural network capable of retaining and using past information for a longer period. Unlike the basic RNN, the LSTM can selectively keep or discard the information it remembers at each step. This allows the LSTM to address long-standing problems the RNN faces: exploding or vanishing gradients, which affect the output's accuracy.
To overcome that, the LSTM uses a memory cell, where sequential data flows through it. The cell might add new information or discard certain data that no longer serves its purpose. To do that, it activates the following gates that control information flows.
The input gate decides which information is added to the memory cell. It also multiplies the data with the associated weight.
The forget gate chooses which data to remove from the cell based on the input at the current time step and the previous hidden state.
After removing or adding new states to the memory cell, the output gate produces new information for the subsequent cell.
LSTM's approach with memory cells allows it to process data differently from the basic RNN model. Instead of changing the existing data entirely, LSTM makes minor adjustments by performing multiplication or addition on the data that passes through. This provides finer control and accuracy, particularly for time series and complex sequential data.
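The gates and the multiply-and-add update can be sketched for a single LSTM unit. This follows the commonly used LSTM formulation with scalar values; the parameter dictionary and its gate names are assumptions made for readability, and the extreme bias values are chosen only to force the gates fully open or shut.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM. `p` maps gate -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to admit
    g = math.tanh(pre("candidate"))   # the new information itself
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    c = f * c_prev + i * g            # multiply and add, not an overwrite
    h = o * math.tanh(c)
    return h, c


remember = {
    "forget": (0.0, 0.0, 50.0),     # gate held open: keep the old cell state
    "input": (0.0, 0.0, -50.0),     # gate held shut: admit nothing new
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 50.0),
}
h, c = lstm_step(x=5.0, h_prev=0.0, c_prev=0.8, p=remember)
print(c)  # stays very close to 0.8 despite the new input
```

Swapping the forget and input biases makes the cell drop its old state and store the new candidate instead.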
LSTM is useful in fields where RNNs are used, but the former is more robust when processing diverse data. For example,
LSTM helps predict sales and revenue based on previous results. It can store important records and forget redundant ones to produce accurate predictions.
The model proves useful in handwriting recognition. It allows the software to detect and predict the context of certain words when it can't fully recognize the entire text, for example, differentiating the number '0' and the letter 'o'.
It allows predictive analysis and real-time anomaly detection in industrial facilities. The cascading memory cells can detect outliers when comparing historical sensor data. Such data might indicate faults in the system.
These are interesting examples of LSTM.
Google used the LSTM to improve its voice recognition system in 2012. It allows the module to interpret the spoken words better as it can remember previous phrases.
LSTM is a part of OpenAI's neural network in enabling better dexterity for robotic arms.
Uber applies the LSTM architecture to forecast major events and scale its growth strategies accordingly.
7. Autoencoder (AE)
The autoencoder (AE) is a neural network that compresses data into a compact representation and reconstructs it to a near-perfect result. It is designed as a feed-forward neural network with gradually converging layers, which expand outward after passing the center point, or ‘bottleneck’.
We call these stages the encoder, code, and decoder.
The encoder extracts and compresses the feature data through a series of convolutional and pooling nodes.
The code, or bottleneck, stores the basic representation of the raw data as discrete values. Despite its limited size, the code contains enough information to reconstruct the output data with minimum or no losses.
Lastly, the decoder recreates the information to match the original input data.
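As a toy illustration of the encoder, code, and decoder stages, the sketch below compresses 2-D points into a single number and expands them back. It assumes a linear autoencoder whose weights have already been "learned" for data lying along the direction (1, 2); the weights are hand-set here, whereas a real autoencoder would learn them by minimizing reconstruction error.

```python
import math

# Unit vector along (1, 2): stands in for the learned encoder/decoder weights.
U = (1 / math.sqrt(5), 2 / math.sqrt(5))


def encode(point):
    """Encoder: project a 2-D point onto a 1-D code (the bottleneck)."""
    return point[0] * U[0] + point[1] * U[1]


def decode(code):
    """Decoder: expand the 1-D code back into a 2-D point."""
    return (code * U[0], code * U[1])


def reconstruction_error(point):
    rx, ry = decode(encode(point))
    return math.hypot(point[0] - rx, point[1] - ry)


print(reconstruction_error((3.0, 6.0)))   # near zero: fits the learned pattern
print(reconstruction_error((3.0, -1.5)))  # large: does not fit the pattern
```

A point that matches the pattern the model was built for survives the bottleneck almost unchanged, while a point that does not is reconstructed poorly.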
There are different types of autoencoder architecture that MLEs can build on, including:
Undercomplete autoencoders. These aim to extract a compact representation of key features from the input data. Therefore, their hidden layers have fewer neurons than the input layer.
Sparse autoencoders. These autoencoders activate only some of the neurons in the hidden layer when data passes through. Doing so allows the neural network to discard redundant knowledge when extracting feature data.
Denoising autoencoders. By introducing noise in the training samples, denoising autoencoders learn how to separate noise in real-world data.
Contractive autoencoders. This model limits changes to the encoded output if the input information changes only slightly.
Due to their versatility, autoencoders are a good fit for these applications.
Autoencoders facilitate feature extraction, as they are trained to retain only the minimal but necessary input data.
Data scientists use AE for denoising images. With sufficient training, the model can accurately reconstruct a corrupted image. This also applies when repainting parts of an image with colors.
AE is also helpful for detecting anomalies. If it fails to reconstruct the input data based on previous training, chances are, the data is corrupted or abnormal.
Here, you’ll find companies using autoencoders in various services.
Amazon offers a deep autoencoder on its marketplace to secure cloud workloads from fraud.
Facebook’s object recognition neural network includes autoencoders. Besides, the social media giant also uses autoencoders to trick facial recognition systems.
MATLAB, a numeric computing and programming platform for engineers, provides autoencoders as modules for data compression and feature extraction.
8. Variational Autoencoder (VAE)
The variational autoencoder (VAE) is a generative deep neural network. Like the autoencoder (AE), the VAE consists of an encoder, bottleneck, and decoder. However, the variational autoencoder is designed to overcome the limitation of the autoencoder in performing generative tasks, specifically the reconstruction losses that the autoencoder suffers when it tries to produce new information from the extracted features.
Instead of storing the extracted data in the bottleneck as discrete values, the variational autoencoder encodes each feature as a probability distribution, typically described by a mean and a variance. In other words, the VAE supports probabilistic distribution in the latent space or bottleneck. This, in turn, allows the VAE to generate new samples with greater accuracy and reduced reconstruction losses.
For example, a VAE is trained with several samples of human faces. It extracts certain hair colors, including black, brown, and blonde. Then, the VAE is tasked to reproduce random human images. Instead of being limited to the 3 colors, the model can create new colors from the distribution curve of the respective color. On the other hand, an autoencoder can only choose from the available colors.
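Two pieces of a VAE lend themselves to a short sketch: drawing a latent value from the learned distribution, and the penalty that keeps that distribution close to a standard normal. The snippet assumes the common Gaussian formulation with a mean and log-variance per latent dimension; the function names and sample values are illustrative.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * rng.gauss(0.0, 1.0)


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, sigma^2) to N(0, 1) for one latent dimension."""
    return -0.5 * (1 + log_var - mu ** 2 - math.exp(log_var))


rng = random.Random(0)  # fixed seed so the run is repeatable
# Suppose the encoder mapped a "hair color" feature to mean 2.0, std dev 0.5.
samples = [sample_latent(2.0, math.log(0.25), rng) for _ in range(5)]
print(samples)                          # varied values clustered around 2.0
print(kl_to_standard_normal(0.0, 0.0))  # no penalty at exactly N(0, 1)
print(kl_to_standard_normal(1.0, 0.0))  # 0.5: penalty grows as the mean drifts
```

Because every draw is slightly different, the decoder sees a continuous range of values rather than a handful of fixed ones, which is what lets it blend between the examples it was trained on.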
As a deep generative machine learning model, the VAE proves helpful in these areas:
Image generation. AI software trains the VAE model with images. Then, the software reconstructs new and realistic pictures from the distributed data points.
Music generation. Similarly, VAE enables software to compose new melodies from the learned distribution.
Data scientists use VAE for analyzing and detecting pathological symptoms in imaging data via unsupervised machine learning.
These are interesting applications of variational autoencoders by tech companies.
SenseTime, which focuses on facial recognition technologies, was researching the use of variational autoencoders to combat face forgery.
NVIDIA provides a VAE model for recommendation systems for TensorFlow users.
9. Generative Adversarial Network (GAN)
The generative adversarial network (GAN) is a deep learning neural network that uses two separate, competing machine learning models for generative purposes. A GAN consists of a generator and a discriminator, and here’s how they work.
The generator model produces a fake sample from random data points in the latent space. Its goal is to trick the discriminator into passing the sample as a genuine output. Meanwhile, the discriminator analyzes the output and determines if it is real or fake.
During training, the generator is penalized when it fails to trick the discriminator, and the training algorithm updates the generator’s parameters to reduce that penalty. Likewise, the discriminator is trained continuously to increase its accuracy in detecting fake samples.
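The opposing penalties can be written as two small loss functions. This is a sketch under common assumptions: the discriminator outputs a probability between 0 and 1 that a sample is real, it is scored with binary cross-entropy, and the generator uses the widely used non-saturating loss. The probability values passed in are made up for illustration.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Low when real samples score near 1 and fake samples score near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Low when the discriminator scores the generator's fake near 1."""
    return -math.log(d_fake)


# A discriminator that is not fooled: the generator pays a high penalty.
print(discriminator_loss(d_real=0.9, d_fake=0.1), generator_loss(d_fake=0.1))
# A discriminator that is fooled: the penalties swap.
print(discriminator_loss(d_real=0.9, d_fake=0.9), generator_loss(d_fake=0.9))
```

Each model's improvement raises the other's loss, which is what drives both to get better over training.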
To get a better idea of GAN, consider this example. A graphic designer turns a photo captured in the daytime into the night with GAN-powered software. The generator replaces the sunny sky with a fake but realistic dark starry version that tricks the discriminator. The result is a believable professional photo taken with night shot vision.
GAN’s ability to produce highly-realistic samples makes it a valuable deep-learning model in these areas.
Style transfer. Graphic design software uses GAN to analyze the style in an original photo and reproduce other images in a similar style.
Image or video upscaler. GAN allows the software to increase the media’s resolution size while maintaining sharpness. It does so by generating probable new images that pass the discriminator’s check.
MLEs may also use the generative adversarial network to produce more realistic datasets in fields that lack training samples, such as the medical space.
Here is how companies and software use GANs.
OpenAI uses the generative adversarial network for several purposes, including image generation and training its VAE models.
GAN Lab is a visualization tool that allows users to experience and understand generative adversarial networks in the browser.
10. Transformer Neural Network
The transformer is a deep-learning model introduced in 2017. Today, it powers many advanced AI applications, particularly in the NLP field. Considered a major breakthrough in machine learning, the transformer enables the model to overcome limitations faced by recurrent neural networks (RNN), namely slow training and vanishing gradients.
As the RNN processes data sequentially, handling a large volume of data takes a substantial amount of time. Also, the many time steps the data passes through contribute to gradient loss. For example, the RNN model cannot accurately predict the contextual meaning of ‘bank’ in ‘I reached the bank after crossing the…’. According to the proposed paper, it can only determine the meaning after reading the remaining word, ‘road’ or ‘river’.
The transformer overcomes the speed limitations and lack of accuracy in previous NLP machine learning models. It introduces a self-attention mechanism and a variation of the feed-forward hidden layers. This way, the transformer can process all tokens simultaneously and assign an attention score to each of them.
To do that, the encoder no longer passes only its final state to the decoder, as was the practice in RNN-based models. Instead, the encoder passes the outputs of all its intermediate states to the decoder. This lets the decoder focus on relevant information and make more accurate decisions.
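The attention score mentioned above can be sketched with scaled dot-product attention. As a simplifying assumption, the toy token vectors are used directly as queries, keys, and values; a real transformer first passes them through learned projection matrices and runs several attention heads in parallel.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # one attention score per token
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors; tokens 0 and 2 are identical.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = attention(tokens, tokens, tokens)
print(weights[0])  # token 0 attends most to itself and to token 2
```

Every token's scores are computed independently of the others, so all positions can be processed at the same time instead of one step after another.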
Popular variants of transformer neural networks
These transformer variants are the core of many advanced AI solutions.
BERT is a deep learning architecture that Google developed for handling NLP tasks. BERT uses the masked language model, a training approach in which the model predicts a covered word in a sentence.
GPT is based on the transformer model, which enables it to understand and process long, complex sentences. Subsequent versions, GPT-2 and GPT-3, adopt the autoregressive decoder model, allowing them to perform a diverse range of NLP tasks.
Momentum Contrast, or MoCo, is a deep learning method trained with a contrastive loss. It maintains a large dictionary of negative samples to enable improved accuracy in computer vision tasks.
SimCLR is also a contrastive, unsupervised machine learning method like MoCo. However, its negative sample representation is limited by the dataset batch size, while MoCo uses a dynamic dictionary.
Because of its speed, accuracy, and robustness, the transformer model is used in these fields.
Language translation software uses the transformer model to translate text accurately between different languages.
Conversational AI technologies, such as chatbots, use transformers to understand questions and respond in a shorter duration.
Documentation summarization. Enterprise software uses the transformer to identify key points in an extensive text and produce an accurate summary.
These are how companies and software apply transformer neural networks in practical applications.
Google recently developed OptFormer, a transformer-powered algorithm capable of hyperparameter tuning at great accuracy.
The transformer neural network has helped Facebook reach greater heights in its features, including speech recognition, machine translation, and symbolic mathematics.
OpenAI’s ChatGPT uses the decoder part of the transformer model in autoregressive mode to predict word sequences accurately.
Best Practices for Designing Neural Network Architectures
We showed you various neural network architectures that might fit your needs. However, each model has its strengths and weaknesses. So, you’ll need to consider several factors when building, training, and testing a specific deep learning model. These are our recommendations.
Use the appropriate number of layers for the complexity of the problem
Start with a simple model instead of using a complex, multilayer neural network from the start. Fit your model to the problem you’re trying to solve, not the other way around. Furthermore, a basic machine learning model is easier to train, test and evaluate.
For example, most applications need fewer than five hidden layers, for which a simple neural network like an MLP will suffice. However, you’ll need a deeper learning model to process images and videos. In such cases, models like ResNet and VGG might be helpful because pre-trained versions are available.
Use the appropriate number of units in each layer
Then, determine the number of neurons for each layer.
The input layer has the same number of neurons as the features in the training data.
Depending on its purpose, the output layer might have one or several neurons.
The question lies in the neuron counts within the hidden layers. While there are no strict rules, some MLEs keep the neuron count of each hidden layer between the sizes of the input and output layers. Again, you can solve most problems by having the same number of neurons across the hidden layers.
However, you should gradually decrease the neuron counts for feature extraction, or increase them for generative tasks. The key is experimenting with different neuron counts in different applications for the best results.
Use the appropriate activation function
ReLU, sigmoid, and tanh are commonly used in the hidden and output layers, but experimenting with different combinations might lead to better performance. For example, PReLU learns the slope of its negative side during training, which suits large training datasets, while ELU does not saturate for positive inputs and can shorten training duration.
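For reference, the activation functions named above can be written as one-liners. This is a plain-Python sketch using the standard definitions; in PReLU the slope `alpha` is a learned parameter, so the fixed value used below is only a placeholder.

```python
import math


def relu(z):
    return max(0.0, z)


def prelu(z, alpha):
    """Like ReLU, but negative inputs keep a (learned) slope `alpha`."""
    return z if z > 0 else alpha * z


def elu(z, alpha=1.0):
    """Smoothly approaches -alpha for very negative inputs."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


for name, fn in [("relu", relu), ("prelu", lambda z: prelu(z, 0.25)),
                 ("elu", elu), ("sigmoid", sigmoid), ("tanh", math.tanh)]:
    print(name, [round(fn(z), 3) for z in (-2.0, 0.0, 2.0)])
```

Comparing the rows shows where each function differs: ReLU discards negative inputs entirely, PReLU and ELU let some negative signal through, and sigmoid and tanh squash everything into a fixed range.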
Use regularization techniques to prevent overfitting
Sometimes, your model performs accurately with training data but not in real-world scenarios. This is a symptom of overfitting, where the model has fitted the noise and quirks of the training data instead of patterns that generalize. These are practical measures that regularize such behavior.
Dropout randomly turns off a fraction of the neurons in the model during training. This prevents the neural network from being overdependent on specific nodes to make predictions.
Early stopping is where the MLE stops training the model before overfitting happens, typically when the loss on a validation set stops improving. It might take several attempts before stopping at the point before the model starts memorizing data noise.
Max-norm regularization constrains the weights in the neural network. This is helpful when the model consists of large weights that cause overfitting.
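The three techniques above can each be sketched in a few lines. These are simplified, framework-free versions: the dropout shown is the common "inverted" variant, early stopping uses a hypothetical `patience` setting, and the validation losses are made-up numbers.

```python
import math
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def early_stopping_epoch(val_losses, patience):
    """Return the best epoch, giving up after `patience` epochs without improvement."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


def max_norm(weights, limit):
    """Rescale a weight vector so its length never exceeds `limit`."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= limit:
        return list(weights)
    return [w * limit / norm for w in weights]


rng = random.Random(0)
print(dropout([1.0] * 8, rate=0.5, rng=rng))   # each value is either 0.0 or 2.0
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.5], patience=2))  # 2
print(max_norm([3.0, 4.0], limit=1.0))         # rescaled to length 1
```

Note that early stopping here never sees the final loss of 0.5, because training halts once two epochs pass without improvement. Choosing the patience value is the trade-off the text refers to when it says several attempts may be needed.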
Neural Network Architectures: Final Thoughts
For MLEs, it’s intriguing to explore new neural network architectures that promise better performance. They are, after all, crucial technologies behind today’s remarkable progress in the machine-learning industry. Still, it’s essential to understand how different neural network models work and how to choose the right one for your needs.
We hope we’ve helped you to do so with this extensive guide to commonly used deep learning models.