Random Matrix Theory Models for Deep Learning
邱才明, 5/9/2022

Deep Learning Theory - A Review

Multilayer neural networks and forward propagation
- A neural network connects many single neurons together: the output of one neuron becomes the input of another.
- A multilayer network can be understood as a "nesting" of nonlinear functions; the forward pass computes $f_n(\cdots f_2(f_1(z, w_1, b_1), w_2, b_2)\cdots, w_n, b_n)$.
- Layers can be stacked without limit, so in principle the model class is rich enough to approximate arbitrary functions.

Common activation functions
- Sigmoid: $f(z) = \dfrac{1}{1 + e^{-z}}$
- Tanh: $f(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
- Rectified linear unit (ReLU): $f(z) = \begin{cases} z, & z \ge 0 \\ 0, & z < 0 \end{cases}$

Depth keeps growing
- The number of layers has increased year by year while the error has dropped year by year; today there are networks with 1000 layers.
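As a small illustration of the nested forward pass and the activation functions above, here is a minimal NumPy sketch; the layer sizes and random weights are arbitrary placeholders, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)  # (e^z - e^-z) / (e^z + e^-z)

def relu(z):
    return np.maximum(z, 0.0)

def forward(z, layers, activation=relu):
    """Nested composition f_n(... f_2(f_1(z, W1, b1), W2, b2) ..., Wn, bn)."""
    for W, b in layers:
        z = activation(W @ z + b)
    return z

# Toy example: three layers with arbitrary sizes.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
layers = [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(sizes[0])
print(forward(x, layers, activation=relu))
```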

Why does deep learning perform so well?
- Features are learned rather than hand-crafted.
- More layers capture more invariances.
- More data is available to train deeper networks.
- More computing power (GPUs).
- Better regularization: Dropout.
- New nonlinearities: max pooling, rectified linear units (ReLU).
- Yet the theoretical understanding of deep networks remains shallow.

[1] Razavian, Azizpour, Sullivan, Carlsson. CNN Features off-the-shelf: an Astounding Baseline for Recognition. CVPRW 2014.
[2] Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958.
[3] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448-456.

Insights from neuroscience
- Experimental neuroscience uncovered: the neural architecture of the retina/LGN/V1/V2/V3/etc.; the existence of neurons with weights and activation functions (simple cells); and pooling neurons (complex cells).
- All of these features are somehow present in deep learning systems.

Neuroscience          Deep network
Simple cells          First layer
Complex cells         Pooling layer
Grandmother cells     Last layer

Olshausen and Field's work (Nature, 1996)
- Olshausen and Field demonstrated that receptive fields can be learned from image patches.
- They showed that an optimization process can drive the learning of image representations.

Harmonic analysis
- Olshausen-Field representations bear a strong resemblance to well-defined mathematical objects from harmonic analysis: wavelets, ridgelets, curvelets.
- Harmonic analysis has a long history of developing optimal representations via optimization.
- Research in the 1990s: wavelets etc. are optimal sparsifying transforms for certain classes of images.

Approximation theory
- A class prediction rule can be viewed as a function f(x) of a high-dimensional argument.
- Curse of dimensionality: the traditional theoretical obstacle to high-dimensional approximation.
- Functions of a high-dimensional x can wiggle in too many dimensions to be learned from finite datasets.

Early theoretical results on deep learning

Approximation theory
- Perceptrons and multilayer feedforward networks are universal approximators: Cybenko 89, Hornik 89, Hornik 91, Barron 93.

Optimization theory
- No spurious local optima for linear networks: Baldi & Hornik 89.
- Stuck in local minima: Brady 89.
- Stuck in local minima, but convergence guarantees for linearly separable data: Gori & Tesi 92.
- Manifold of spurious local optima: Frasconi 97.

[1] Cybenko. Approximations by superpositions of sigmoidal functions. Mathematics of Control, Signals, and Systems, 2(4):303-314, 1989.
[2] Hornik, Stinchcombe and White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(3):359-366, 1989.
[3] Hornik. Approximation Capabilities of Multilayer Feedforward Networks. Neural Networks, 4(2):251-257, 1991.
[4] Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993.
[5] Baldi, P., Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 1989.
[6] Brady, Raghavan, Slawny. Back propagation fails to separate where perceptrons succeed. IEEE Trans. Circuits & Systems, 36(5):665-674, 1989.
[7] Gori, Tesi. On the problem of local minima in backpropagation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(1):76-86, 1992.
[8] Frasconi, Gori, Tesi. Successes and failures of backpropagation: A theoretical investigation. Progress in Neural Networks: Architecture, 5:205, 1997.

Recent theoretical results on deep learning

Invariance, stability, and learning theory
- Scattering networks: Bruna 11, Bruna 13, Mallat 13.
- Deformation stability for Lipschitz non-linearities: Wiatowski 15.
- Distance- and margin-preserving embeddings: Giryes 15, Sokolic 16.
- Geometry, generalization bounds and depth efficiency: Montufar 15, Neyshabur 15, Shashua 14 15 16.

[1] Bruna, Mallat. Classification with scattering operators, CVPR 2011. Invariant scattering convolution networks, arXiv 2012. Mallat, Waldspurger. Deep Learning by Scattering, arXiv 2013.
[2] Wiatowski, Bölcskei. A mathematical theory of deep convolutional neural networks for feature extraction. arXiv 2015.
[3] Giryes, Sapiro, A. Bronstein. Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy? arXiv:1504.08291.
[4] Sokolic. Margin Preservation of Deep Neural Networks, 2015.
[5] Montufar. Geometric and Combinatorial Perspectives on Deep Neural Networks, 2015.
[6] Neyshabur. The Geometry of Optimization and Generalization in Neural Networks: A Path-based Approach, 2015.

Optimization theory and algorithms
- Learning low-degree polynomials from random initialization: Andoni 14.
- Characterizing the loss surface and attacking the saddle point problem: Dauphin 14, Choromanska 15, Chaudhuri 15.
- Global optimality in neural network training: Haeffele 15.
- Non-convex optimization: Dauphin 14.
- Training NNs using tensor methods: Janzamin 15.

[7] Andoni, Panigraphy, Valiant, Zhang. Learning Polynomials with Neural Networks. ICML 2014.
[8] Dauphin, Pascanu, Gulcehre, Cho, Ganguli, Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014.
[9] Choromanska, Henaff, Mathieu, Arous, LeCun. The Loss Surfaces of Multilayer Networks. AISTATS 2015.
[10] Chaudhuri and Soatto. The Effect of Gradient Noise on the Energy Landscape of Deep Networks. arXiv 2015.
[11] Haeffele, Vidal. Global Optimality in Tensor Factorization, Deep Learning and Beyond. arXiv 2015.
[12] Janzamin, Sedghi, Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods. arXiv 2015.
[13] Dauphin, Yann N., et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 2014.

RMT of Deep Learning

What is Random Matrix Theory (RMT)?
- Pearson, Fisher, Neyman: classical statistics (1900s-1940s). Correlation for "infinite" vectors (Karl Pearson, 1905); correlation for finite vectors (Fisher, 1924). Low-dimensional problems, with random-variable dimension N = 2-10.
- High-dimensional hypothesis testing today: gene testing with N = 6033 genes and n = 102 subjects; power-grid monitoring with N = 3000-10000 PMUs and n sampled observations.
- A. N. Kolmogorov: asymptotic theory (1970-1974). A new statistical model for high-dimensional covariance matrices in which N and n grow together; the classical central limit theorem no longer applies, estimation errors accumulate, and a non-vanishing bias remains.
- Random matrix theory: E. Wigner (1955), Marchenko-Pastur (1967).
- Setting: a data matrix $X = (X_{ij}) \in \mathbb{R}^{N \times T}$ studied in the asymptotic regime $N, T \to \infty$ with $N/T \to c > 0$.

Product of non-Hermitian random matrices: noise only
- The eigenvalues of a non-Hermitian random matrix follow a ring law; the unit circles are predicted by free probability theory.
- Product of L independent non-Hermitian matrices: $Y = \prod_{i=1}^{L} X_i = X_1 X_2 \cdots X_L$ (figure: eigenvalue scatter plots for L = 1 and L = 5).
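To make the "$N/T \to c$" regime above concrete, here is a small NumPy check (the dimensions are arbitrary choices) that the eigenvalues of a pure-noise sample covariance matrix spread out according to the Marchenko-Pastur law instead of clustering around 1 as classical intuition would suggest:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample covariance of pure noise in the high-dimensional regime N/T -> c.
rng = np.random.default_rng(0)
N, T = 400, 1600                 # c = N/T = 0.25 (arbitrary choice)
c = N / T
X = rng.standard_normal((N, T))
S = X @ X.T / T
eigs = np.linalg.eigvalsh(S)

# Marchenko-Pastur density for unit-variance entries and ratio c <= 1.
lam_minus, lam_plus = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
lam = np.linspace(lam_minus, lam_plus, 400)
mp = np.sqrt((lam_plus - lam) * (lam - lam_minus)) / (2 * np.pi * c * lam)

plt.hist(eigs, bins=60, density=True, alpha=0.5, label="empirical eigenvalues")
plt.plot(lam, mp, label="Marchenko-Pastur density")
plt.legend()
plt.show()
```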

Stieltjes transform, R transform
- The Stieltjes transform G of a probability distribution $\rho$ is $G(z) = \int \frac{\rho(t)}{z - t}\,dt$.
- The distribution can be recovered using the inversion formula $\rho(\lambda) = -\frac{1}{\pi}\lim_{\epsilon \to 0^{+}} \operatorname{Im} G(\lambda + i\epsilon)$.
- Given the Stieltjes transform G, the R transform is defined as the solution of the functional equation $G\!\left(R(z) + \frac{1}{z}\right) = z$.
- The benefit of the R transform: if A and B are freely independent, then $R_{A+B}(z) = R_A(z) + R_B(z)$.
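As a quick sanity check of the inversion formula (the semicircle example and the value of the offset ε are arbitrary choices, not from the slides), one can approximate G from a sampled spectrum and recover the density:

```python
import numpy as np
import matplotlib.pyplot as plt

# Empirical Stieltjes transform of a Wigner-type matrix and density recovery.
rng = np.random.default_rng(1)
n = 1000
A = rng.standard_normal((n, n))
W = (A + A.T) / np.sqrt(2 * n)            # symmetric; semicircle law on [-2, 2]
eigs = np.linalg.eigvalsh(W)

def stieltjes(z, eigs):
    """G(z) = (1/n) sum_i 1/(z - lambda_i), the empirical Stieltjes transform."""
    return np.mean(1.0 / (z - eigs))

lam = np.linspace(-2.5, 2.5, 300)
eps = 0.05                                # small imaginary offset for the inversion formula
density = np.array([-stieltjes(x + 1j * eps, eigs).imag / np.pi for x in lam])

plt.hist(eigs, bins=60, density=True, alpha=0.5, label="eigenvalue histogram")
plt.plot(lam, density, label=r"$-\mathrm{Im}\,G(\lambda+i\epsilon)/\pi$")
plt.plot(lam, np.sqrt(np.clip(4 - lam**2, 0, None)) / (2 * np.pi), "--", label="semicircle density")
plt.legend()
plt.show()
```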

Hessian matrix
- Gradient-descent-type algorithms use first-order (gradient) information.
- Newton-type methods use second-order (curvature) information, under sufficient regularity conditions.
- [Figure: gradient descent (green) and Newton's method (red) for minimizing a function.]

Hessian decomposition
- Hessian decomposition (LeCun 98): G is the sample covariance matrix of the gradients of the model outputs; H is the Hessian of the model outputs.
- Generalized Gauss-Newton decomposition of the Hessian (Sagun 17). With model function $f(x;\theta)$ and loss $\ell(f(x;\theta), y)$, the gradient of the loss for a fixed sample is $\nabla_\theta \ell = \ell'(f)\,\nabla_\theta f$, so the Hessian of the loss can be written as
  $\nabla_\theta^2 L = \frac{1}{n}\sum_{i=1}^{n}\Big[\ell''(f_i)\,\nabla_\theta f_i \nabla_\theta f_i^{\mathsf T} + \ell'(f_i)\,\nabla_\theta^2 f_i\Big],$
  a positive semi-definite (Gauss-Newton) term plus a term carrying the second derivatives of the model.

[1] LeCun, Y., Bottou, L., Orr, G. B., and Müller, K.-R. Efficient backprop. Lecture Notes in Computer Science, pages 9-50, 1998.
[2] Sagun, L., Evci, U., Guney, V. U., Dauphin, Y., and Bottou, L. (2017). Empirical analysis of the Hessian of over-parametrized neural networks.
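To make the decomposition concrete, here is a small numerical sketch (the two-parameter toy model, the data, and the finite-difference step sizes are arbitrary choices, not from the slides) that forms the loss Hessian and its Gauss-Newton part and checks that the latter is positive semi-definite:

```python
import numpy as np

# Gauss-Newton split of the loss Hessian, H = G + H1, for squared-error loss on
# a tiny toy model f(x; theta) = theta[1] * tanh(theta[0] * x).
rng = np.random.default_rng(0)
x = rng.standard_normal(50)
y = rng.standard_normal(50)
theta = rng.standard_normal(2)

def f(theta, x):
    return theta[1] * np.tanh(theta[0] * x)

def loss(theta):
    return 0.5 * np.mean((f(theta, x) - y) ** 2)

def num_grad(fun, t, h=1e-5):
    g = np.zeros_like(t)
    for i in range(len(t)):
        e = np.zeros_like(t); e[i] = h
        g[i] = (fun(t + e) - fun(t - e)) / (2 * h)
    return g

def num_hess(fun, t, h=1e-4):
    n = len(t)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros_like(t); e[i] = h
        H[:, i] = (num_grad(fun, t + e, h) - num_grad(fun, t - e, h)) / (2 * h)
    return 0.5 * (H + H.T)

# Full Hessian of the loss (finite differences).
H = num_hess(loss, theta)

# Gauss-Newton term: for squared error, G = (1/n) sum_i grad f_i grad f_i^T.
grads = np.array([num_grad(lambda t, xi=xi: float(f(t, xi)), theta) for xi in x])
G = grads.T @ grads / len(x)

H1 = H - G   # residual-dependent part coming from the model's second derivatives
print("eigs of G (Gauss-Newton part, >= 0):", np.linalg.eigvalsh(G))
print("eigs of full Hessian H:             ", np.linalg.eigvalsh(H))
print("eigs of residual part H1 = H - G:   ", np.linalg.eigvalsh(H1))
```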

Empirical results
- Increasing the number of neurons increases the bulk of the Hessian spectrum "proportionally", while the outliers are determined by the data (Sagun 17).

Geometry of loss surfaces via RMT
- H = Wishart + Wigner: the Hessian is modeled as $H = H_0 + H_1$.
- $H_0$ is positive semi-definite.
- $H_1$ comes from the second derivatives and contains all of the explicit dependence on the residuals.

Spectral distribution
- Under very weak assumptions, using the R transform, the Stieltjes transform G is the solution of a cubic equation.
- Index: the number (fraction) of negative eigenvalues; critical value $\epsilon_c$ at which the negative eigenvalues disappear.

[1] Pennington, Jeffrey, and Yasaman Bahri. Geometry of Neural Network Loss Surfaces via Random Matrix Theory. International Conference on Machine Learning, 2017.
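A minimal simulation of the Wishart-plus-Wigner model (the matrix sizes and the relative weight of the two terms are arbitrary illustration choices, not the paper's parameterization), showing the spectrum and the index, i.e. the fraction of negative eigenvalues:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy Wishart + Wigner Hessian model: H = H0 + H1, with H0 >= 0 (Wishart-like)
# and H1 symmetric (Wigner-like).
rng = np.random.default_rng(0)
p, n = 500, 1000              # parameter dimension and "sample" dimension
sigma1 = 0.5                  # scale of the Wigner part (stands in for the residual size)

J = rng.standard_normal((p, n)) / np.sqrt(n)
H0 = J @ J.T                  # positive semi-definite (Wishart)
A = rng.standard_normal((p, p))
H1 = sigma1 * (A + A.T) / np.sqrt(2 * p)   # Wigner
H = H0 + H1

eigs = np.linalg.eigvalsh(H)
index = np.mean(eigs < 0)     # fraction of negative eigenvalues (the "index")
print(f"index (fraction of negative eigenvalues): {index:.3f}")

plt.hist(eigs, bins=80, density=True)
plt.axvline(0.0, linestyle="--")
plt.title("Spectrum of H = Wishart + Wigner (toy)")
plt.show()
```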

Singularity of the Hessian in deep learning
- The eigenvalue distribution of the Hessian can intuitively be split into two parts: a bulk and outliers.
- The bulk is concentrated around zero; the outliers are scattered away from zero.
- The bulk reflects the redundancy of the network parameters; the outliers reflect the complexity of the input data.
- In short: the bulk of the eigenvalues depends on the architecture, while the top discrete eigenvalues depend on the data.

[1] Sagun, Levent, Léon Bottou, and Yann LeCun. Singularity of the Hessian in Deep Learning. arXiv preprint arXiv:1611.07476 (2016).

Errors of shallow networks
- Using RMT and exact solutions for linear models, the authors derive the generalization error and the training dynamics of learning:
- Zero eigenvalues: no learning dynamics along those directions, so the initialization directly determines the generalization performance.
- Non-zero but small eigenvalues: learning is very slow, and overfitting is severe.
- To avoid overfitting, having roughly as many parameters as training samples is the worst case, and early stopping is essential.
- For very large (heavily over-parameterized) networks, overtraining does little harm.
- Reducing the initialization scale helps reduce the generalization error, i.e. choose small initial weights.

[1] Advani, Madhu S., and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667 (2017).
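To illustrate the eigenvalue-dependent dynamics described above, here is a small sketch (a student-teacher linear regression with arbitrary dimensions and noise level; not the paper's exact setup) of gradient descent, where each eigendirection of the input covariance learns at a rate set by its eigenvalue, so zero modes keep their initialization and small modes are slow:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 80, 120                        # fewer samples than parameters -> some zero eigenvalues
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

cov = X.T @ X / n
lam, V = np.linalg.eigh(cov)          # eigenvalues in ascending order; the smallest are ~0

w = 0.5 * rng.standard_normal(d)      # nonzero initialization
eta = 0.5 / lam.max()
for _ in range(100):                  # a modest number of gradient steps
    w -= eta * X.T @ (X @ w - y) / n

# Error components in the eigenbasis of the input covariance: zero modes never
# move (the update lies in the row space of X), small modes converge slowly.
err = V.T @ (w - w_star)
print("eigenvalue   |error along eigendirection|")
for i, label in [(0, "zero mode"), (45, "small nonzero mode"), (d - 1, "largest mode")]:
    print(f"{lam[i]:10.4f}   {abs(err[i]):8.4f}   ({label})")
```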

Nonlinear random matrix theory for deep learning

Notation and assumptions
- The Gram matrix is $M = \frac{1}{m} Y^{\mathsf T} Y$ with $Y = f(WX)$, where W is a random weight matrix, X is a random data matrix, and f is a pointwise nonlinear activation function.
- Assumption: the entries of W and X are i.i.d. (Gaussian), and all dimensions grow to infinity at fixed ratios.

Moment method
- The moment generating function, the Stieltjes transform $G(z)$, and the k-th moment $m_k$ of the limiting spectral density are linked by $G(z) = \sum_{k \ge 0} m_k / z^{k+1}$, with $m_k = \lim \frac{1}{m}\,\mathbb{E}\,\mathrm{tr}\, M^{k}$.
- The idea behind the moment method is to compute the k-th moment by expanding out the powers of M inside the trace.

The Stieltjes transform of M
- The Stieltjes transform of the spectral density of M satisfies a closed algebraic (fixed-point) equation.
- Limiting cases: in a limiting case this is precisely the equation satisfied by the Stieltjes transform of the Marchenko-Pastur distribution with the corresponding shape parameter.

How to calculate the moments
- The leading contributions to $\mathbb{E}\,\mathrm{tr}\, M^{k}$ are enumerated term by term, which yields the moments and hence G.
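The following sketch simply builds the Gram matrix of $Y = f(WX)$ for Gaussian W and X and plots its empirical spectrum, with the Marchenko-Pastur density of a same-sized i.i.d. matrix drawn as a reference curve; the dimensions, the ReLU choice, and the crude centering/scaling are arbitrary, and the two curves are not expected to coincide in general:

```python
import numpy as np
import matplotlib.pyplot as plt

# Empirical spectrum of the Gram matrix of Y = f(WX) with Gaussian W and X.
rng = np.random.default_rng(0)
n0, n1, m = 600, 600, 1200                # input dim, width, number of samples (arbitrary)
W = rng.standard_normal((n1, n0)) / np.sqrt(n0)
X = rng.standard_normal((n0, m))

f = lambda z: np.maximum(z, 0.0)          # pointwise ReLU
Y = f(W @ X)
Y = (Y - Y.mean()) / Y.std()              # crude centering/scaling for comparability
M = Y @ Y.T / m                           # n1 x n1; same nonzero spectrum as Y^T Y / m

eigs = np.linalg.eigvalsh(M)

# Marchenko-Pastur density with ratio c = n1/m, as a visual reference only.
c = n1 / m
lo, hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
lam = np.linspace(lo, hi, 400)
mp = np.sqrt((hi - lam) * (lam - lo)) / (2 * np.pi * c * lam)

plt.hist(eigs, bins=80, density=True, alpha=0.5, label="spectrum of M with Y = ReLU(WX)")
plt.plot(lam, mp, label="Marchenko-Pastur reference (i.i.d. case)")
plt.legend()
plt.show()
```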

Dynamical isometry

Background
- Weight initialization in deep networks can have a dramatic impact on learning speed.
- Ensuring that the mean squared singular value of a network's input-output Jacobian is O(1) is essential for avoiding exponentially vanishing or exploding gradients.
- In deep linear networks, ensuring that all singular values of the Jacobian are concentrated near 1 can yield a dramatic additional speed-up in learning; this property is known as dynamical isometry.
- It is unclear how to achieve dynamical isometry in nonlinear deep networks.

Results
- ReLU networks are incapable of dynamical isometry.
- Sigmoidal networks can achieve isometry, but only with orthogonal weight initialization.
- Controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.

[1] Saxe, Andrew M., James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120 (2013).
[2] Pennington, Jeffrey, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in Neural Information Processing Systems, 2017.
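A small numerical check of the linear-network case (depth, width, and scales are arbitrary choices): with i.i.d. Gaussian initialization, the singular values of the end-to-end Jacobian $J = W_L \cdots W_1$ spread out as depth grows, whereas with orthogonal initialization they all stay exactly at 1:

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 200, 30

def jacobian_singular_values(init):
    """End-to-end Jacobian of a deep *linear* network is the product of its weight matrices."""
    J = np.eye(width)
    for _ in range(depth):
        if init == "gaussian":
            W = rng.standard_normal((width, width)) / np.sqrt(width)  # variance 1/width
        else:  # "orthogonal"
            Q, R = np.linalg.qr(rng.standard_normal((width, width)))
            W = Q * np.sign(np.diag(R))        # Haar-orthogonal matrix
        J = W @ J
    return np.linalg.svd(J, compute_uv=False)

for init in ("gaussian", "orthogonal"):
    s = jacobian_singular_values(init)
    print(f"{init:10s}  min={s.min():.3e}  max={s.max():.3e}  mean(s^2)={np.mean(s**2):.3f}")
```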

Deep learning applied to medical data

EEG analysis for psychiatric disorders
- Problem: analyze EEG data from subjects with known status (healthy vs. patient) and derive an effective criterion for discriminating psychiatric disorders.
- Data: 5 minutes of 64-channel EEG per subject (64 x 304760 values); conventional statistical methods have difficulty extracting useful information from it.
- Method 1: an RMT-based LES statistic, which discriminates the subjects well.
- Method 2: deep learning, which is markedly effective.

Raw data
- 40 high-risk individuals (CHR), 40 healthy controls (HC), 40 first-episode patients (FES).
- Data format: 64 x (1000 x 60 x 5) = 64 x 300000; sampling frequency 1000 Hz.

Random matrices vs. deep learning
- Network with 7 layers; within each class, 75% of the subjects are used for training and 25% for testing, with cross-validation.
- Random matrix (LES): HC 0.949, FES 0.895, CHR 0.729, average 85.7%.
- Deep learning: average accuracy 98.1%.

MRI analysis of drug addiction
- Raw data (resting state): 30 methamphetamine users (MA), 29 healthy controls (HC); sampling time 8 min; sampling frequency 0.5 Hz; data size (64 x 64 x 31) x 240.
- Each MRI file is split into 31 images of 64 x 64 pixels; 31 corresponding CNN models are built for classification, and the final decision is a majority vote over the 31 per-slice predictions (a minimal sketch of the voting step follows after this list).
- Training set: 46 subjects (MA: 24, HC: 22); test set: 12 subjects (MA: 6, HC: 6).
- Deep learning results: HC 100%, MA 70.7%, average 85.33% (random-matrix results were reported for comparison).
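A minimal sketch of the per-slice majority vote described above; the per-slice classifiers are stubbed out with a placeholder predict function, and all names and shapes are assumptions for illustration, not the authors' code:

```python
import numpy as np

def predict_slice(model, image):
    """Placeholder for one of the 31 per-slice CNN classifiers.

    Assumed to return 0 (healthy control) or 1 (methamphetamine user)
    for a single 64x64 slice; `model` is just a stand-in callable here.
    """
    return int(model(image) > 0.5)

def classify_subject(models, volume):
    """Majority vote over the 31 slice-level predictions for one subject.

    `volume` is assumed to have shape (31, 64, 64): one 64x64 image per slice.
    """
    votes = [predict_slice(m, img) for m, img in zip(models, volume)]
    return int(np.sum(votes) > len(votes) / 2)   # 1 if most slices vote "MA"

# Toy usage with random stand-ins for the trained CNNs and one subject's scan.
rng = np.random.default_rng(0)
models = [lambda img, b=rng.uniform(): b for _ in range(31)]   # dummy "models"
volume = rng.standard_normal((31, 64, 64))
print("subject classified as:", "MA" if classify_subject(models, volume) else "HC")
```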

Brain CT recognition of hemorrhage and infarction
- Three classes (normal, infarction, hemorrhage), 90 samples in total; inputs are 256 x 256 JPEG images.
- CNN with 7 convolutional layers; 80% of each class used for training, 20% for testing.
- Results: normal 100%, infarction 100%, hemorrhage/contusion 50%, average 83.33%.

Deep learning applied to microwave images
- Target detection and recognition in microwave remote-sensing images.
- Millimeter-wave body scanner: a microwave imaging system for detecting dangerous or suspicious items.
- Transfer learning with small sample sets. Valid data are samples in which the detected object is clearly visible: ceramic knife, 20 groups, 70 valid samples; water bottle, 30 groups, 153 valid samples; gun, 30 groups, 145 valid samples; metal knife, 30 groups, 108 valid samples.
- Deep learning delivers excellent performance.
