Communications on Applied Mathematics and Computation ›› 2024, Vol. 6 ›› Issue (2): 1217-1240.doi: 10.1007/s42967-023-00322-5
Yongqiang Cai1, Qianxiao Li2, Zuowei Shen2
收稿日期:2022-08-26
修回日期:2023-09-15
接受日期:2023-09-16
出版日期:2024-03-01
发布日期:2024-03-01
通讯作者:
Yongqiang Cai,E-mail:caiyq.math@bnu.edu.cn;Qianxiao Li,E-mail:qianxiao@nus.edu.sg;Zuowei Shen,E-mail:matzuows@nus.edu.sg
E-mail:caiyq.math@bnu.edu.cn;qianxiao@nus.edu.sg;matzuows@nus.edu.sg
基金资助:Yongqiang Cai1, Qianxiao Li2, Zuowei Shen2
Received:2022-08-26
Revised:2023-09-15
Accepted:2023-09-16
Online:2024-03-01
Published:2024-03-01
Contact:
Yongqiang Cai,E-mail:caiyq.math@bnu.edu.cn;Qianxiao Li,E-mail:qianxiao@nus.edu.sg;Zuowei Shen,E-mail:matzuows@nus.edu.sg
E-mail:caiyq.math@bnu.edu.cn;qianxiao@nus.edu.sg;matzuows@nus.edu.sg
Supported by:摘要: We present the viewpoint that optimization problems encountered in machine learning can often be interpreted as minimizing a convex functional over a function space, but with a non-convex constraint set introduced by model parameterization. This observation allows us to repose such problems via a suitable relaxation as convex optimization problems in the space of distributions over the training parameters. We derive some simple relationships between the distribution-space problem and the original problem, e.g., a distribution-space solution is at least as good as a solution in the original space. Moreover, we develop a numerical algorithm based on mixture distributions to perform approximate optimization directly in the distribution space. Consistency of this approximation is established and the numerical efficacy of the proposed algorithm is illustrated in simple examples. In both theory and practice, this formulation provides an alternative approach to large-scale optimization in machine learning.
Yongqiang Cai, Qianxiao Li, Zuowei Shen. Optimization in Machine Learning: a Distribution-Space Approach[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1217-1240.
Yongqiang Cai, Qianxiao Li, Zuowei Shen. Optimization in Machine Learning: a Distribution-Space Approach[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1217-1240.
| [1] Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization. In: International Conference on Machine Learning, pp. 242-252 (2019) [2] Arora, S., Cohen, N., Hazan, E.: On the optimization of deep networks: implicit acceleration by overparameterization. In: International Conference on Machine Learning, pp. 244-253 (2018) [3] Bach, F.: Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Res. 18(1), 629-681 (2017) [4] Bacharoglou, A.: Approximation of probability distributions by convex mixtures of Gaussian measures. Proc. Am. Math. Soc. 138(7), 2619-2628 (2010) [5] Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39(3), 930-945 (1993) [6] Bassily, R., Belkin, M., Ma, S.: On exponential convergence of SGD in non-convex over-parametrized learning. arXiv:1811.02564 (2018) [7] Bengio, Y., Roux, N.L., Vincent, P., Delalleau, O., Marcotte, P.: Convex neural networks. In: Advances in Neural Information Processing Systems 18, pp. 123-130 (2006) [8] Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123-140 (1996) [9] Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Advances in Neural Information Processing Systems 31, pp. 3036-3046 (2018) [10] Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2(4), 303-314 (1989) [11] Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: Advances in Neural Information Processing Systems 27, pp. 2933-2941 (2014) [12] Dean, D.S.: Langevin equation for the density of a system of interacting Langevin processes. J. Phys. A: Math. Gen. 29(24), L613 (1996) [13] Deutsch, L.: Generating neural networks with neural networks. arXiv:1801.01952 (2018) [14] Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. In: International Conference on Learning Representations, vol. 7, pp. 1-19 (2019) [15] Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the l1-ball for learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, pp. 272-279 (2008) [16] E, W.N., Ma, C., Wu, L.: Barron spaces and the compositional function spaces for neural network models. arXiv:1906.08039 (2019) [17] Eschrig, H.: The Fundamentals of Density Functional Theory. Teubner, Stuttgart (1996) [18] Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. Int. Conf. Mach. Learn. 96, 148-156 (1996) [19] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256 (2010) [20] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27, pp. 2672-2680 (2014) [21] Ha, D., Dai, A., Le, Q.V.: Hypernetworks. In: International Conference on Learning Representations, vol. 5, pp. 1-29 (2017) [22] Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2(5), 359-366 (1989) [23] Jain, P., Kar, P.: Non-convex optimization for machine learning. Found. Trends Mach. Learn. 10(3/4), 142-336 (2017) [24] Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (2014) [25] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: Proceedings of the 3rd International Conference on Learning Representations (2014) [26] Krueger, D., Huang, C.-W., Islam, R., Turner, R., Lacoste, A., Courville, A.: Bayesian hypernetworks. arXiv:1710.04759 (2017) [27] LeCun, Y.: The MNIST database of handwritten digits. Technical report (1998) [28] Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6(6), 861-867 (1993) [29] Li, Y., Liang, Y.: Learning overparameterized neural networks via stochastic gradient descent on structured data. In: Advances in Neural Information Processing Systems 31, pp. 8157-8166 (2018) [30] Ma, S., Bassily, R., Belkin, M.: The power of interpolation: understanding the effectiveness of SGD in modern over-parametrized learning. In: International Conference on Machine Learning, pp. 3325-3334 (2018) [31] Mahoney, M., Martin, C.: Traditional and heavy-tailed self regularization in neural network models. In: International Conference on Machine Learning, pp. 4284-4293 (2019) [32] Martin, C.H., Mahoney, M.W.: Implicit self-regularization in deep neural networks: evidence from random matrix theory and implications for learning. J. Mach. Learn. Res. 22(165), 1-73 (2021) [33] Mei, S., Montanari, A., Nguyen, P.-M.: A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. 115(33), 7665-7671 (2018) [34] Nestoridis, V., Schmutzhard, S., Stefanopoulos, V.: Universal series induced by approximate identities and some relevant applications. J. Approx. Theory 163(12), 1783-1797 (2011) [35] Oymak, S., Soltanolkotabi, M.: Overparameterized nonlinear learning: gradient descent takes the shortest path? In: International Conference on Machine Learning, pp. 4951-4960 (2019) [36] Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp. 1177-1184 (2008) [37] Rahimi, A., Recht, B.: Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In: Advances in Neural Information Processing Systems, pp. 1313-1320 (2009) [38] Ratzlaff, N., Fuxin, L.: HyperGAN: a generative model for diverse, performant neural networks. In: International Conference on Machine Learning, pp. 5361-5369 (2019) [39] Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1/2), 1-39 (2010) [40] Rotskoff, G., Vanden-Eijnden, E.: Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. In: Advances in Neural Information Processing Systems, pp. 7146-7155 (2018) [41] Sinha, A., Duchi, J.C.: Learning kernels with random features. In: Advances in Neural Information Processing Systems, pp. 1298-1306 (2016) [42] Sirignano, J., Spiliopoulos, K.: Mean field analysis of neural networks: a law of large numbers. SIAM J. Appl. Math. 80(2), 725-752 (2020) [43] Sonoda, S.: Fast approximation and estimation bounds of kernel quadrature for infinitely wide models. arXiv:1902.00648 (2019) [44] Sorenson, H.W., Alspach, D.L.: Recursive Bayesian estimation using Gaussian sums. Automatica 7(4), 465-479 (1971) [45] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929-1958 (2014) [46] Wan, L., Zeiler, M., Zhang, S., Cun, Y.L., Fergus, R.: Regularization of neural networks using dropconnect. In: Proceedings of the 30th International Conference on Machine Learning 28, pp. 1058-1066 (2013) [47] Wang, W., Carreira-Perpinán, M.A.: Projection onto the probability simplex: an efficient algorithm with a simple proof, and an application. arXiv:1309.1541 (2013) [48] Wei, C., Lee, J.D., Liu, Q., Ma, T.: Regularization matters: generalization and optimization of neural nets v.s. their induced kernel. arXiv:1810.05369 (2018) [49] Wu, L., Zhu, Z., E, W.: Towards understanding generalization of deep learning: perspective of loss landscapes. In: ICML Workshop on Principled Approaches to Deep Learning (2017) [50] Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., Pennington, J.: Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. In: International Conference on Machine Learning, pp. 5393-5402 (2018) [51] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747 (2017) [52] Zeevi, A.J., Meir, R.: Density estimation through convex combinations of densities: approximation and estimation bounds. Neural Networks 10(1), 99-109 (1997) [53] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: International Conference on Learning Representations (2017) |
| [1] | Violeta Migallón, Héctor Penadés, José Penadés. Design and Comparison of Parallel Dynamic Matérn Kernel-Based Regression Models and Machine Learning Approaches: Application to Bias Correction in Numerical Weather Prediction[J]. Communications on Applied Mathematics and Computation, 2026, 8(3): 1115-1156. |
| [2] | Zhao-Zheng Liang, Jun-Lin Tian, Hong-Yi Wan. A Note on Chebyshev Accelerated PMHSS Iteration Method for Block Two-by-Two Linear Systems[J]. Communications on Applied Mathematics and Computation, 2025, 7(4): 1242-1263. |
| [3] | Zhi-Qin John Xu, Yaoyu Zhang, Tao Luo. Overview Frequency Principle/Spectral Bias in Deep Learning[J]. Communications on Applied Mathematics and Computation, 2025, 7(3): 827-864. |
| [4] | Liqun Qi. Motion, Dual Quaternion Optimization and Motion Optimization[J]. Communications on Applied Mathematics and Computation, 2025, 7(1): 228-238. |
| [5] | Sergio Torregrosa, Victor Champaney, Amine Ammar, Vincent Herbert, Francisco Chinesta. Physics-Based Active Learning for Design Space Exploration and Surrogate Construction for Multiparametric Optimization[J]. Communications on Applied Mathematics and Computation, 2024, 6(3): 1899-1923. |
| [6] | Vitaly Gyrya, Mikhail Shashkov, Alexei Skurikhin, Svetlana Tokareva. Machine Learning Approaches for the Solution of the Riemann Problem in Fluid Dynamics: a Case Study[J]. Communications on Applied Mathematics and Computation, 2024, 6(3): 1832-1859. |
| [7] | Vitaliy Gyrya, Evan Lieberman, Mark Kenamond, Mikhail Shashkov. Optimization of Artificial Viscosity in Production Codes Based on Gaussian Regression Surrogate Models[J]. Communications on Applied Mathematics and Computation, 2024, 6(3): 1521-1550. |
| [8] | Daniil Bochkov, Frederic Gibou. A Non-parametric Gradient-Based Shape Optimization Approach for Solving Inverse Problems in Directed Self-Assembly of Block Copolymers[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1472-1489. |
| [9] | Kevin Bui, Yifei Lou, Fredrick Park, Jack Xin. An Efficient Smoothing and Thresholding Image Segmentation Framework with Weighted Anisotropic-Isotropic Total Variation[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1369-1405. |
| [10] | Samira Kabri, Alexander Auras, Danilo Riccio, Hartmut Bauermeister, Martin Benning, Michael Moeller, Martin Burger. Convergent Data-Driven Regularizations for CT Reconstruction[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1342-1368. |
| [11] | Bj?rn Engquist, Kui Ren, Yunan Yang. Adaptive State-Dependent Diffusion for Derivative-Free Optimization[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1241-1269. |
| [12] | Jane Wu, Michael Bao, Xinwei Yao, Ronald Fedkiw. Deep Energies for Estimating Three-Dimensional Facial Pose and Expression[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 837-861. |
| [13] | Howard Heaton, Samy Wu Fung, Stanley Osher. Global Solutions to Nonconvex Problems by Evolution of Hamilton-Jacobi PDEs[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 790-810. |
| [14] | Akiva Wernick, Jen-Ping Chen. A Blade Altering Toolbox for Automating Rotor Design Optimization[J]. Communications on Applied Mathematics and Computation, 2024, 6(1): 688-704. |
| [15] | Weitao Chen, Chiu-Yen Kao. Review of Computational Approaches to Optimization Problems in Inhomogeneous Rods and Plates[J]. Communications on Applied Mathematics and Computation, 2024, 6(1): 236-256. |
| 阅读次数 | ||||||
|
全文 |
|
|||||
|
摘要 |
|
|||||