Communications on Applied Mathematics and Computation ›› 2024, Vol. 6 ›› Issue (2): 1217-1240. DOI: 10.1007/s42967-023-00322-5
• ORIGINAL PAPERS •
Optimization in Machine Learning: a Distribution-Space Approach
Yongqiang Cai1, Qianxiao Li2, Zuowei Shen2
Received: 2022-08-26
Revised: 2023-09-15
Accepted: 2023-09-16
Online: 2024-03-01
Published: 2024-03-01
Contact: Yongqiang Cai, E-mail: caiyq.math@bnu.edu.cn; Qianxiao Li, E-mail: qianxiao@nus.edu.sg; Zuowei Shen, E-mail: matzuows@nus.edu.sg
Yongqiang Cai, Qianxiao Li, Zuowei Shen. Optimization in Machine Learning: a Distribution-Space Approach [J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1217-1240.
Related articles:
[1] Daniil Bochkov, Frederic Gibou. A Non-parametric Gradient-Based Shape Optimization Approach for Solving Inverse Problems in Directed Self-Assembly of Block Copolymers [J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1472-1489.
[2] Kevin Bui, Yifei Lou, Fredrick Park, Jack Xin. An Efficient Smoothing and Thresholding Image Segmentation Framework with Weighted Anisotropic-Isotropic Total Variation [J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1369-1405.
[3] Samira Kabri, Alexander Auras, Danilo Riccio, Hartmut Bauermeister, Martin Benning, Michael Moeller, Martin Burger. Convergent Data-Driven Regularizations for CT Reconstruction [J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1342-1368.
[4] Björn Engquist, Kui Ren, Yunan Yang. Adaptive State-Dependent Diffusion for Derivative-Free Optimization [J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1241-1269.
[5] Jane Wu, Michael Bao, Xinwei Yao, Ronald Fedkiw. Deep Energies for Estimating Three-Dimensional Facial Pose and Expression [J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 837-861.
[6] Howard Heaton, Samy Wu Fung, Stanley Osher. Global Solutions to Nonconvex Problems by Evolution of Hamilton-Jacobi PDEs [J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 790-810.
[7] Akiva Wernick, Jen-Ping Chen. A Blade Altering Toolbox for Automating Rotor Design Optimization [J]. Communications on Applied Mathematics and Computation, 2024, 6(1): 688-704.
[8] Weitao Chen, Chiu-Yen Kao. Review of Computational Approaches to Optimization Problems in Inhomogeneous Rods and Plates [J]. Communications on Applied Mathematics and Computation, 2024, 6(1): 236-256.
[9] Liqun Qi. Standard Dual Quaternion Optimization and Its Applications in Hand-Eye Calibration and SLAM [J]. Communications on Applied Mathematics and Computation, 2023, 5(4): 1469-1483.
[10] Jiefu Zhang, Leonardo Zepeda-Núñez, Yuan Yao, Lin Lin. Learning the Mapping $x \mapsto \sum_{i=1}^{d} x_i^2$: the Cost of Finding the Needle in a Haystack [J]. Communications on Applied Mathematics and Computation, 2021, 3(2): 313-335.
[11] Zhiping Mao, Zhen Li, George Em Karniadakis. Nonlocal Flocking Dynamics: Learning the Fractional Order of PDEs from Particle Simulations [J]. Communications on Applied Mathematics and Computation, 2019, 1(4): 597-619.