Make Deep Networks Shallow Again
Type
forthcoming
Date Issued
2023-09-18
Author(s)
Abstract
Deep neural networks have a strong record of success and are thus viewed as the
architecture of choice for complex applications. For a long time, their main
shortcoming was the vanishing gradient, which prevented numerical
optimization algorithms from converging acceptably. A breakthrough has been
achieved by the concept of residual connections -- an identity mapping parallel
to a conventional layer. This concept is applicable to stacks of layers of the
same dimension and substantially alleviates the vanishing gradient problem. A
stack of residual connection layers can be expressed as an expansion of terms
similar to the Taylor expansion. This expansion suggests the possibility of
truncating the higher-order terms and obtaining an architecture consisting of a
single broad layer composed of all initially stacked layers in parallel. In
other words, a sequential deep architecture is substituted by a parallel
shallow one. Prompted by this theory, we investigated the performance
capabilities of the parallel architecture in comparison to the sequential one.
The computer vision datasets MNIST and CIFAR10 were used to train both
architectures for a total of 6912 combinations of varying numbers of
convolutional layers, numbers of filters, kernel sizes, and other
meta-parameters. Our findings demonstrate a surprising equivalence between the deep
(sequential) and shallow (parallel) architectures. Both layouts produced
similar results in terms of training and validation set loss. This discovery
implies that a wide, shallow architecture can potentially replace a deep
network without sacrificing performance. Such substitution has the potential to
simplify network architectures, improve optimization efficiency, and accelerate
the training process.
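
The expansion mentioned in the abstract can be made explicit. Writing each residual block as the identity plus a learned transform F_i (notation assumed here for illustration, not taken from the proceedings paper), a stack of L such blocks computes

y = (\mathrm{id} + F_L) \circ \cdots \circ (\mathrm{id} + F_1)(x)
  = x + \sum_{i} F_i(x) + \sum_{i<j} F_j(F_i(x)) + \dots

Dropping every term that composes two or more blocks leaves the first-order truncation

y \approx x + \sum_{i=1}^{L} F_i(x),

which is a single broad layer that applies all blocks to the same input in parallel and sums the results.

The following minimal sketch contrasts the two layouts. It assumes PyTorch and uses hypothetical same-dimension 3x3 convolutional blocks; it is not the authors' experimental code, only an illustration of the sequential-versus-parallel idea.

import torch
import torch.nn as nn

class ResidualStack(nn.Module):
    # Deep / sequential layout: y = (id + F_L) o ... o (id + F_1)(x)
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        for f in self.blocks:
            x = x + f(x)  # identity mapping parallel to each conventional layer
        return x

class ParallelStack(nn.Module):
    # Shallow / parallel layout: y = x + sum_i F_i(x), the first-order truncation
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        return x + sum(f(x) for f in self.blocks)

# Hypothetical blocks: residual connections require matching input/output dimensions.
blocks = [nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU()) for _ in range(4)]

x = torch.randn(1, 16, 32, 32)
print(ResidualStack(blocks)(x).shape)  # torch.Size([1, 16, 32, 32])
print(ParallelStack(blocks)(x).shape)  # torch.Size([1, 16, 32, 32])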
Language
English (United States)
Keywords
cs.LG
HSG Classification
contribution to scientific community
Event Title
KDIR2023
Event Location
Rome, Italy
Event Date
November 2023
Official URL
Contact Email Address
bernhard.bermeitinger@unisg.ch
Additional Information
To be published in the proceedings of KDIR2023
File(s)
open access
Name
2309.08414.pdf
Size
235.11 KB
Format
Adobe PDF
Checksum (MD5)
33bb5eeea1738356241a985b8162a33a