Deep Learning

Authors: Yann LeCun, Yoshua Bengio, Geoffrey Hinton

Abstract

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

Introduction

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.

Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations.

An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts.

The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

(Note: as the layers get deeper, the receptive field of the units grows as well.)

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition1–4 and speech recognition5–7, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules8, analysing particle accelerator data9,10, reconstructing brain circuits11, and predicting the effects of mutations in non-coding DNA on gene expression and disease12,13. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding14, particularly topic classification, sentiment analysis, question answering15 and language translation16,17.

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

Supervised learning

The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.
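
The score vector and the objective function described above can be sketched in a few lines. This is only an illustration: the function names, the four scores and the choice of half the squared distance as the error measure are assumptions made for the example, not something the article prescribes.

```python
def squared_error(scores, target):
    """Half the squared distance between the output scores and the desired scores."""
    return 0.5 * sum((s - t) ** 2 for s, t in zip(scores, target))


def predicted_category(scores, categories):
    """Return the category whose score is highest."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return categories[best]


categories = ["house", "car", "person", "pet"]
scores = [0.1, 0.7, 0.1, 0.1]  # hypothetical output of an untrained machine
target = [0.0, 0.0, 0.0, 1.0]  # desired pattern: the image shows a pet

error = squared_error(scores, target)
guess = predicted_category(scores, categories)
```

Before training, the highest score here falls on the wrong category, so the error is large; training adjusts the weights so that this number shrinks.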

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.

The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.
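
A minimal sketch of this idea, assuming a toy two-weight error function (`bowl`) and a fixed learning rate, both made up for illustration. The gradient is estimated literally as the text describes it, by increasing each weight by a tiny amount and observing the change in error; real systems obtain the gradient analytically instead.

```python
def numerical_gradient(error, weights, eps=1e-6):
    """For each weight, estimate how much the error changes per unit increase."""
    base = error(weights)
    grad = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps  # increase one weight by a tiny amount
        grad.append((error(bumped) - base) / eps)
    return grad


def gradient_step(error, weights, learning_rate=0.1):
    """Move the weight vector in the direction opposite to the gradient."""
    grad = numerical_gradient(error, weights)
    return [w - learning_rate * g for w, g in zip(weights, grad)]


def bowl(w):
    """Toy error landscape with a single minimum at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


weights = [0.0, 0.0]
for _ in range(100):
    weights = gradient_step(bowl, weights)
```

After the loop the weights sit very close to the minimum of the toy landscape. A real objective is far less regular, which is why the landscape picture is only an analogy.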

[Image: a hilly landscape]

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques18. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.
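
The procedure can be sketched for the simplest possible model, a line `w * x + b` fitted to noiseless data. Everything here (the function name, batch size, learning rate, and the fixed number of passes used instead of monitoring the objective) is an assumption chosen to keep the example short.

```python
import random


def sgd_linear_fit(examples, batch_size=4, learning_rate=0.2, epochs=200, seed=0):
    """Fit t ~ w * x + b by stochastic gradient descent on half the squared error."""
    rng = random.Random(seed)
    data = list(examples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit the small sets in a fresh random order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w = 0.0
            grad_b = 0.0
            for x, t in batch:
                err = (w * x + b) - t  # output minus desired output
                grad_w += err * x
                grad_b += err
            # Average gradient over the few examples: a noisy estimate
            # of the gradient over the whole training set.
            w -= learning_rate * grad_w / len(batch)
            b -= learning_rate * grad_b / len(batch)
    return w, b


# Hypothetical training set drawn from the line t = 2x + 1.
train = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(20)]
w, b = sgd_linear_fit(train)
```

A held-out test set would then be scored with the returned `w` and `b` to measure generalization; that step is omitted here.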

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.
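
As a small illustration, a two-class linear classifier is a weighted sum followed by a threshold test. The weights, threshold and feature values below are invented for the example.

```python
def linear_classify(features, weights, threshold):
    """Return True if the weighted sum of the features exceeds the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, features))
    return weighted_sum > threshold


weights = [0.5, -1.0, 2.0]  # hypothetical, e.g. obtained by training
threshold = 1.0

in_category = linear_classify([1.0, 0.5, 1.0], weights, threshold)   # sum = 2.0
not_in_category = linear_classify([1.0, 1.0, 0.0], weights, threshold)  # sum = -0.5
```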

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane19. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma: one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods20, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

Backpropagation to train multilayer architectures

From the earliest days of pattern recognition22,23, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s24–27.

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.
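
The backward sweep can be sketched for a tiny network with one hidden layer. This is an illustrative sketch under assumptions of my own choosing: tanh units in both layers, no bias terms, half the squared error as the objective, and made-up weights; the names `forward`, `loss` and `backward` are hypothetical.

```python
import math


def forward(x, W1, W2):
    """Forward pass: weighted sums followed by tanh, layer by layer."""
    h = [math.tanh(sum(x[i] * W1[i][j] for i in range(len(x))))
         for j in range(len(W1[0]))]
    out = [math.tanh(sum(h[j] * W2[j][k] for j in range(len(h))))
           for k in range(len(W2[0]))]
    return h, out


def loss(x, t, W1, W2):
    """Half the squared error between the outputs and the targets."""
    _, out = forward(x, W1, W2)
    return 0.5 * sum((y - tk) ** 2 for y, tk in zip(out, t))


def backward(x, t, W1, W2):
    """Gradients of the loss with respect to W1 and W2 via the chain rule."""
    h, out = forward(x, W1, W2)
    # Output layer: dE/dy = y - t, then multiply by tanh'(z) = 1 - y^2.
    dz_out = [(out[k] - t[k]) * (1.0 - out[k] ** 2) for k in range(len(out))]
    # Weight gradient: output of the lower unit times dE/dz of the upper unit.
    gW2 = [[h[j] * dz_out[k] for k in range(len(out))] for j in range(len(h))]
    # Hidden layer: dE/dh is a weighted sum of the derivatives from the layer above.
    dh = [sum(W2[j][k] * dz_out[k] for k in range(len(out))) for j in range(len(h))]
    dz_h = [dh[j] * (1.0 - h[j] ** 2) for j in range(len(h))]
    gW1 = [[x[i] * dz_h[j] for j in range(len(h))] for i in range(len(x))]
    return gW1, gW2


x, t = [0.5, -1.0], [0.3]
W1 = [[0.1, -0.2], [0.4, 0.3]]  # input -> hidden (made-up values)
W2 = [[0.7], [-0.5]]            # hidden -> output (made-up values)
gW1, gW2 = backward(x, t, W1, W2)
```

A useful sanity check for any such implementation is to compare each entry of `gW1` and `gW2` with a finite-difference estimate obtained by nudging the corresponding weight.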

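Figure 1 - Multilayer neural networks and backpropagation. a, A multilayer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid in input space (shown on the left) is also transformed by the hidden units (shown in the middle panel). This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Reproduced with permission from C. Olah (http://colah.github.io/). b, The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed. A small change Δx in x is first transformed into a small change Δy in y by being multiplied by ∂y/∂x (that is, the definition of the partial derivative). Similarly, the change Δy creates a change Δz in z. Substituting one equation into the other gives the chain rule of derivatives: how Δx is turned into Δz through multiplication by the product of ∂y/∂x and ∂z/∂y. It also works when x, y and z are vectors (and the derivatives are Jacobian matrices). c, The equations used for computing the forward pass in a neural net with two hidden layers and one output layer, each constituting a module through which one can backpropagate gradients. At each layer, we first compute the total input z to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function f(.) is applied to z to get the output of the unit. For simplicity, we have omitted bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) f(z) = max(0, z), commonly used in recent years, as well as the more conventional sigmoids, such as the hyperbolic tangent, f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)), and the logistic function, f(z) = 1/(1 + exp(−z)). d, The equations used for computing the backward pass. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of f(z). At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives y_l − t_l if the cost function for unit l is 0.5(y_l − t_l)^2, where t_l is the target value. Once ∂E/∂z_k is known, the error derivative for the weight w_jk on the connection from unit j in the layer below is just y_j ∂E/∂z_k.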

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training28. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).
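
The step from one layer to the next can be written down directly. In this sketch the helper name `layer`, the weight layout (one row of incoming weights per unit) and the numbers are assumptions for illustration; bias terms are left out.

```python
import math


def relu(z):
    """Rectified linear unit: f(z) = max(z, 0)."""
    return max(z, 0.0)


def logistic(z):
    """Logistic sigmoid: f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, f):
    """Each unit takes a weighted sum of the previous layer and applies f."""
    return [f(sum(w * x for w, x in zip(row, inputs))) for row in weights]


inputs = [1.0, 2.0]
weights = [[1.0, -1.0],  # incoming weights of unit 0 (made-up values)
           [0.5, 0.5]]   # incoming weights of unit 1 (made-up values)

hidden_relu = layer(inputs, weights, relu)
hidden_tanh = layer(inputs, weights, math.tanh)
```

With the ReLU, the first unit's negative total input is clipped to zero, whereas tanh maps it to a negative output.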

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder29,30. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation33–35. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited36.

(Note: this refers to Boltzmann machines.)

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program37 and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary38 and was quickly developed to give record-breaking results on a large vocabulary task39. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups6 and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting40, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet)41,42. It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

Convolutional neural networks

ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.
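
To make the weight sharing concrete, here is a sketch of a single feature map computed over a single-channel image. The function name, the 2×2 edge-detecting filter and the toy images are assumptions for illustration. The filter is slid over the image without being flipped, which is the convention in most deep-learning software; flipping it would give the strict discrete convolution.

```python
def feature_map(image, kernel):
    """Slide one shared filter over the image and apply a ReLU at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            # Local weighted sum over a small patch, with the same weights everywhere.
            total = sum(kernel[i][j] * image[r + i][c + j]
                        for i in range(kh) for j in range(kw))
            row.append(max(total, 0.0))
        out.append(row)
    return out


# Hypothetical filter that responds to a dark-to-bright vertical edge.
edge_filter = [[-1.0, 1.0],
               [-1.0, 1.0]]

image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]

shifted = [[0.0, 0.0, 0.0, 1.0],
           [0.0, 0.0, 0.0, 1.0],
           [0.0, 0.0, 0.0, 1.0]]

response = feature_map(image, edge_filter)
response_shifted = feature_map(shifted, edge_filter)
```

Because every position uses the same weights, moving the edge one column to the right simply moves the response one column to the right.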

(Note: this is the translation invariance of convolution.)

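Figure 2 - Inside a convolutional network. The outputs (not the filters) of each layer (horizontally) of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit.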

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.
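
Max pooling over one feature map can be sketched as follows. The patch size and stride of 2, the function name and the toy feature maps are assumptions chosen for the example.

```python
def max_pool(fmap, size=2, stride=2):
    """Take the maximum over local patches, moving by `stride` rows or columns."""
    out = []
    for r in range(0, len(fmap) - size + 1, stride):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, stride):
            row.append(max(fmap[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out


# The same feature detected at two slightly different positions.
detected_here = [[1.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0]]

detected_nearby = [[0.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]]

pooled_here = max_pool(detected_here)
pooled_nearby = max_pool(detected_nearby)
```

Both inputs pool to the same 2×2 map, which is the small-shift invariance the paragraph describes, and the representation shrinks from 16 values to 4.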

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience43, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway44. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey’s inferotemporal cortex45. ConvNets have their roots in the neocognitron46, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words47,48.

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition47 and document reading42. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft49. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands50,51, and for face recognition52.

Image understanding with deep convolutional networks

Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition53, the segmentation of biological images54 particularly for connectomics55, and the detection of faces, text, pedestrians and human bodies in natural images36,50,51,56–58. A major recent practical success of ConvNets is face recognition59. Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars60,61. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding14 and speech recognition7.

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches1. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout62, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks4,58,59,63–65 and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

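Figure 3 - From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolutional neural network (CNN) from a test image, with the RNN trained to ‘translate’ high-level representations of images into captions (top). Reproduced with permission from ref. 102. When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found86 that it exploits this to achieve better ‘translation’ of images into captions.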

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization has reduced training times to a few hours.

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

基于ConvNet的视觉系统的性能引起了大多数主要科技公司的关注，包括谷歌、Facebook、Microsoft、IBM、Yahoo!、Twitter 和 Adobe 以及越来越多的初创企业启动研发项目并部署基于 ConvNet 的图像理解产品和服务。

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays66,67. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

卷积网络很容易适应芯片或现场可编程门阵列中的高效硬件实现66,67。 NVIDIA、Mobileye、英特尔、高通和三星等多家公司正在开发 ConvNet 芯片，以支持智能手机、相机、机器人和自动驾驶汽车中的实时视觉应用。

Distributed representations and language processing

Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations21. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure40. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features)68,69. Second, composing layers of representation in a deep net brings the potential for another exponential advantage70 (exponential in the depth).

深度学习理论表明，与不使用分布式表示的经典学习算法相比，深度网络具有两种不同的指数优势21。这两个优势都源于组合的力量，并取决于底层数据生成分布具有适当的组件结构40。首先，学习分布式表示能够泛化到训练期间未曾见过的、已学习特征值的新组合（例如，n 个二进制特征可能有 2^n 种组合）68,69。其次，在深层网络中组合表示层带来了另一个指数优势的潜力70（相对于深度呈指数级）。
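The first advantage can be made concrete with a small counting sketch. The helper names `distributed_codes` and `one_hot_codes` below are hypothetical and only illustrate the 2^n figure quoted above: n binary features that may be active together can express 2^n distinct patterns, whereas n mutually exclusive units can express only n.

```python
from itertools import product


def distributed_codes(n):
    """All patterns expressible with n binary features that can co-occur."""
    return list(product([0, 1], repeat=n))


def one_hot_codes(n):
    """All patterns expressible with n mutually exclusive units."""
    return [tuple(1 if i == j else 0 for j in range(n)) for i in range(n)]


n = 4
print(len(distributed_codes(n)))  # 2**4 = 16 patterns
print(len(one_hot_codes(n)))      # 4 patterns
```

This only counts representable patterns; whether a network generalizes to unseen combinations still depends on the data having the componential structure the text describes.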

The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words71. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated27 in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable71. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. 
Vector representations of words learned from text are now very widely used in natural language applications14,17,72–76.

多层神经网络的隐藏层学习以一种易于预测目标输出的方式表示网络的输入。训练多层神经网络根据前面若干单词构成的局部上下文来预测序列中的下一个单词，可以很好地证明这一点71。上下文中的每个单词都以 one-of-N（独热）向量的形式呈现给网络，即一个分量的值为 1，其余分量的值为 0。在第一层中，每个单词创建不同的激活模式，即词向量（图 4）。在语言模型中，网络的其他层学习将输入词向量转换为预测下一个词的输出词向量，该输出词向量可用于预测词汇表中任何词作为下一个词出现的概率。网络学习包含许多活动组件的单词向量，每个组件都可以解释为单词的单独特征，正如在学习符号的分布式表示的背景下首次演示的那样27。这些语义特征并未明确存在于输入中。学习过程发现它们是将输入和输出符号之间的结构化关系分解为多个“微规则”的好方法。当单词序列来自大量真实文本并且单个微规则不可靠时，学习单词向量也能发挥很好的作用71。例如，当训练预测新闻报道中的下一个单词时，学习到的星期二和星期三的单词向量非常相似，瑞典和挪威的单词向量也非常相似。这种表示被称为分布式表示，因为它们的元素（特征）不是相互排斥的，并且它们的许多配置对应于观察到的数据中看到的变化。这些词向量由学习到的特征组成，这些特征不是由专家提前确定的，而是由神经网络自动发现的。从文本中学习的单词的向量表示现在在自然语言应用中广泛使用14,17,72–76。
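A minimal sketch of the first layer described above: multiplying a one-of-N vector by a weight matrix simply selects one row of that matrix, and that row is the word vector. The three-word vocabulary and the numbers in `E` are made up for illustration; in a real model they would be learned by backpropagation.

```python
def one_hot(index, size):
    """One-of-N vector: a single component is 1, the rest are 0."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_times_matrix(vec, matrix):
    """Row vector (length N) times matrix (N rows, d columns) -> length d."""
    cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(cols)]


# Hypothetical vocabulary and first-layer weights (illustrative values only).
vocab = ["tuesday", "wednesday", "sweden"]
E = [[0.9, 0.1],
     [0.8, 0.2],
     [-0.5, 0.7]]


def word_vector(word):
    """First-layer activation pattern produced by one input word."""
    return vec_times_matrix(one_hot(vocab.index(word), len(vocab)), E)


print(word_vector("sweden"))  # the third row of E
```

Because the input is one-of-N, implementations usually skip the multiplication and index the row directly; the result is the same.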

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.

表征问题是逻辑启发的认知范式和神经网络启发的认知范式之间争论的核心。在受逻辑启发的范式中，符号实例的唯一属性是它与其他符号实例相同或不同。它没有与其用途相关的内部结构；为了用符号进行推理，它们必须在明智地选择的推理规则中与变量相联系。相比之下，神经网络仅使用大活动向量、大权重矩阵和标量非线性来执行快速“直观”推理，从而支持轻松的常识推理。

Before the introduction of neural language models71, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real valued features, and semantically related words end up close to each other in that vector space (Fig. 4).
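The two limitations of N-gram counting mentioned above can be seen in a toy sketch. The ten-word corpus and the vocabulary size of 100,000 are assumptions chosen for illustration: the number of possible N-grams grows as V^N, and a count-based model has no evidence at all for a sequence it has not seen, even when a closely related one appears in the data.

```python
from collections import Counter


def num_possible_ngrams(vocab_size, n):
    """Number of distinct length-n sequences over a vocabulary of size V."""
    return vocab_size ** n


def bigram_counts(tokens):
    """Frequency of each adjacent word pair in the token list."""
    return Counter(zip(tokens, tokens[1:]))


corpus = "the meeting is on tuesday the meeting is on friday".split()
counts = bigram_counts(corpus)

print(num_possible_ngrams(100_000, 3))  # 10**15 possible trigrams
print(counts[("on", "tuesday")])        # seen once
print(counts[("on", "wednesday")])      # never seen, so the count is 0
```

A neural language model would place "wednesday" near "tuesday" in vector space and could therefore assign the unseen pair a sensible probability.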

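Figure 4 - Visualizing the learned word vectors. On the left is an illustration of word representations learned for modelling language, non-linearly projected to 2D for visualization using the t-SNE algorithm103. On the right is a 2D representation of phrases learned by an English-to-French encoder–decoder recurrent neural network75. One can observe that semantically similar words or sequences of words are mapped to nearby representations. The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation)18,75.

图 4 - 可视化学习到的词向量。左边是为建模语言学习的单词表示的图示，使用 t-SNE 算法非线性投影到 2D 以进行可视化103。右侧是英语到法语编码器-解码器循环神经网络75学习的短语的二维表示。人们可以观察到语义相似的词或单词序列被映射到附近的表示。单词的分布式表示是通过使用反向传播共同学习每个单词的表示和预测目标量的函数来获得的，例如序列中的下一个单词（用于语言建模）或翻译单词的整个序列（用于机器翻译）18,75。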

Recurrent neural networks

When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

当反向传播首次被引入时，它最令人兴奋的用途是训练循环神经网络（RNN）。对于涉及顺序输入的任务，例如语音和语言，通常最好使用 RNN（图 5）。RNN 一次处理输入序列中的一个元素，在其隐藏单元中维护一个“状态向量”，该向量隐式包含有关序列中所有过去元素的历史信息。当我们将隐藏单元在不同离散时间步的输出视为深层多层网络中不同神经元的输出时（图 5，右），我们就清楚如何应用反向传播来训练 RNN。
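The forward computation described above can be sketched as a loop that reuses the same parameters at every time step. The matrix names U, W and V follow the Fig. 5 caption; the tanh non-linearity and the particular update s_t = tanh(U x_t + W s_{t-1}), o_t = V s_t are assumptions made for this sketch, since the text does not fix them.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_forward(xs, U, W, V, s0):
    """Process xs one element at a time, carrying the state vector s forward."""
    s = s0
    outputs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]   # new state summarizes the history so far
        outputs.append(matvec(V, s))      # output at this time step
    return outputs, s


# Illustrative parameters: 1 input unit, 2 hidden units, 1 output unit.
U = [[0.5], [-0.3]]
W = [[0.1, 0.2], [0.0, 0.3]]
V = [[1.0, -1.0]]

outputs, final_state = rnn_forward([[1.0], [0.0], [0.0]], U, W, V, [0.0, 0.0])
print(outputs)
```

Unrolling the loop gives one layer per time step with shared weights, which is the view that makes backpropagation applicable.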

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish77,78.

RNN 是非常强大的动态系统，但事实证明训练它们是有问题的，因为反向传播的梯度在每个时间步要么增长要么收缩，因此在许多时间步中它们通常会爆炸或消失77,78。
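A scalar toy case shows why the gradients explode or vanish. Assume, purely for illustration, a linear recurrence s_t = w * s_{t-1}; the derivative of the final state with respect to the initial state is then w multiplied by itself once per time step.

```python
def gradient_through_time(w, steps):
    """d s_T / d s_0 for the scalar linear recurrence s_t = w * s_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one multiplicative factor per time step
    return grad


print(gradient_through_time(0.9, 100))  # shrinks towards 0 (vanishing)
print(gradient_through_time(1.1, 100))  # grows very large (exploding)
```

In a real RNN the factor at each step is a Jacobian matrix rather than a scalar, but the same repeated multiplication is at work.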

Thanks to advances in their architecture79,80 and ways of training them81,82, RNNs have been found to be very good at predicting the next character in the text83 or the next word in a sequence75, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen17,72,76. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion84,85.

由于其架构79,80 和训练方法81,82 的进步，RNN 被发现非常擅长预测文本中的下一个字符83 或序列中的下一个单词75，但它们也可以用于更复杂的任务。例如，在一次一个单词地阅读英语句子后，可以训练英语“编码器”网络，使其隐藏单元的最终状态向量能够很好地表示句子所表达的思想。然后，该思想向量可以用作联合训练的法语“解码器”网络的初始隐藏状态（或作为额外输入），该网络输出法语翻译的第一个单词的概率分布。如果从该分布中选择特定的第一个单词并将其作为输入提供给解码器网络，则它将输出翻译的第二个单词的概率分布，依此类推，直到选择句号17,72,76。总的来说，这个过程根据取决于英语句子的概率分布生成法语单词序列。这种相当幼稚的机器翻译方式很快就与最先进的技术相媲美，这引发了人们对理解句子是否需要类似于使用推理规则操纵的内部符号表达之类的东西的严重怀疑。它更符合日常推理涉及许多同时类比的观点每个都有助于结论的合理性84,85。

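Figure 5 - A recurrent neural network and the unfolding in time of the computation involved in its forward computation. The artificial neurons (for example, hidden units grouped under node s with values s_t at time t) get inputs from other neurons at previous time steps (this is represented with the black square, representing a delay of one time step, on the left). In this way, a recurrent neural network can map an input sequence with elements x_t into an output sequence with elements o_t, with each o_t depending on all the previous x_t′ (for t′ ≤ t). The same parameters (matrices U, V, W) are used at each time step. Many other architectures are possible, including a variant in which the network can generate a sequence of outputs (for example, words), each of which is used as input for the next time step. The backpropagation algorithm (Fig. 1) can be directly applied to the computational graph of the unfolded network on the right, to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states s_t and all the parameters.

图 5 - 循环神经网络及其前向计算中涉及的计算的时间展开。人工神经元（例如，在节点 s 下分组的隐藏单元，在时间 t 时值为 s_t）从先前时间步的其他神经元获取输入（这用左侧的黑色方块表示，代表一个时间步的延迟）。通过这种方式，循环神经网络可以将具有元素 x_t 的输入序列映射到具有元素 o_t 的输出序列，每个 o_t 取决于所有先前的 x_t′（对于 t′ ≤ t）。每个时间步都使用相同的参数（矩阵 U、V、W）。许多其他架构都是可能的，包括网络可以生成一系列输出（例如单词）的变体，每个输出都用作下一个时间步的输入。反向传播算法（图 1）可以直接应用于右侧展开网络的计算图，以计算总误差（例如，生成正确输出序列的对数概率）关于所有状态 s_t 和所有参数的导数。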

Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently (see examples mentioned in ref. 86).

人们可以学习将图像的含义“翻译”成英语句子，而不是将法语句子的含义翻译成英语句子（图 3）。这里的编码器是一个深度 ConvNet，它在最后一个隐藏层将像素转换为活动向量。解码器是一个类似于机器翻译和神经语言建模中使用的 RNN。最近人们对此类系统的兴趣激增（参见参考文献 86 中提到的示例）。

RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long78.

RNN 一旦按时间展开（图 5），就可以被视为非常深的前馈网络，其中所有层共享相同的权重。尽管它们的主要目的是学习长期依赖关系，但理论和经验证据表明，学习长期存储信息是很困难的78。

To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind was the long short-term memory (LSTM) network, which uses special hidden units whose natural behaviour is to remember inputs for a long time79. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.

为了纠正这个问题，一个想法是用显式记忆来增强网络。这种类型的第一个提议是使用特殊隐藏单元的长短期记忆（LSTM）网络，其自然行为是长时间记住输入79。称为记忆单元的特殊单元的作用类似于累加器或门控泄漏神经元：它在下一个时间步与自身建立连接，权重为 1，因此它复制自己的实值状态并累积外部信号，但这种自连接是由另一个单元乘法门控的，该单元学习决定何时清除内存内容。
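The memory cell described above can be sketched as a gated accumulator. This is a simplification under stated assumptions, not a full LSTM: the gate values are passed in by hand (in an LSTM another unit would learn to produce them), and the name `keep_gates` is hypothetical. A gate of 1 copies the previous state through the weight-one self-connection; a gate of 0 clears the memory.

```python
def memory_cell(inputs, keep_gates, c0=0.0):
    """Gated accumulator: c_t = gate_t * c_{t-1} + x_t."""
    c = c0
    trace = []
    for x, gate in zip(inputs, keep_gates):
        c = gate * c + x  # gated self-connection plus the external signal
        trace.append(c)
    return trace


# The cell accumulates 1.0 and 2.0, holds the sum, then is cleared before 5.0 arrives.
print(memory_cell([1.0, 2.0, 0.0, 5.0], [1.0, 1.0, 1.0, 0.0]))  # [1.0, 3.0, 3.0, 5.0]
```

With the gate held at 1 the derivative of c_t with respect to c_{t-1} is exactly 1, which is why the stored value neither explodes nor vanishes over time.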

LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step87, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation17,72,76.

随后证明 LSTM 网络比传统 RNN 更有效，特别是当它们每个时间步长有多个层时 87，从而实现从声学到转录中的字符序列的整个语音识别系统。 LSTM 网络或相关形式的门控单元目前也用于在机器翻译方面表现出色的编码器和解码器网络17,72,76。

Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to88, and memory networks, in which a regular network is augmented by a kind of associative memory89. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.

在过去的一年里，几位作者提出了使用记忆模块增强 RNN 的不同建议。提案包括神经图灵机，其中网络通过 RNN 可以选择读取或写入的“类磁带”存储器进行增强 88，以及记忆网络，其中常规网络通过一种关联存储器进行增强 89。记忆网络在标准问答基准测试中表现出色。记忆用于记住随后要求网络回答问题的故事。

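Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list88. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and, after reading a story, they can answer questions that require complex inference90. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as ‘where is Frodo now?’89.

除了简单的记忆之外，神经图灵机和记忆网络还被用于通常需要推理和符号操作的任务。神经图灵机可以学习“算法”。除此之外，当输入由未排序的符号序列组成、且每个符号都附有一个指示其在列表中优先级的实值时，它们可以学会输出符号的排序列表88。记忆网络可以被训练来在类似于文本冒险游戏的环境中跟踪世界的状态，并且在阅读故事后，它们可以回答需要复杂推理的问题90。在一个测试示例中，网络显示了 15 句话版本的《指环王》，并正确回答了诸如“佛罗多现在在哪里？”等问题89。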

The future of deep learning

Unsupervised learning91–98 had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

无监督学习91–98在重振人们对深度学习的兴趣方面发挥了催化作用，但此后被纯粹监督学习的成功所掩盖。尽管我们在本综述中没有重点关注它，但我们预计从长远来看，无监督学习将变得更加重要。人类和动物的学习在很大程度上是无监督的：我们通过观察来发现世界的结构，而不是通过被告知每个物体的名称。

Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems99 at classification tasks and produce impressive results in learning to play many different video games100.

人类视觉是一个主动过程，它使用一个小的、高分辨率的中央凹和一个大的、低分辨率的周围区域，以智能的、特定于任务的方式对光学阵列进行顺序采样。我们预计视觉领域未来的大部分进展将来自于经过端到端训练的系统，并将 ConvNet 与 RNN 相结合，使用强化学习来决定看向何处。深度学习和强化学习相结合的系统还处于起步阶段，但它们在分类任务上已经超越了被动视觉系统99，并在学习玩许多不同的视频游戏100方面产生了令人印象深刻的结果。

Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time76,86.

自然语言理解是深度学习有望在未来几年产生巨大影响的另一个领域。我们预计，当使用 RNN 来理解句子或整个文档的系统学习一次有选择地关注某一部分的策略时，它们会变得更好76,86。

Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors101.

最终，人工智能的重大进步将通过将表示学习与复杂推理相结合的系统来实现。尽管深度学习和简单推理长期以来一直用于语音和手写识别，但仍需要新的范例来通过对大型向量的操作来取代基于规则的符号表达式操作101。

Nature: Deep Learning