Optimization in Machine Learning: a Distribution-Space Approach

doi:10.1007/s42967-023-00322-5

Communications on Applied Mathematics and Computation ›› 2024, Vol. 6 ›› Issue (2): 1217-1240.doi: 10.1007/s42967-023-00322-5

• ORIGINAL PAPERS • 上一篇下一篇

Optimization in Machine Learning: a Distribution-Space Approach

Yongqiang Cai¹, Qianxiao Li², Zuowei Shen²

1. School of Mathematical Sciences, Laboratory of Mathematics and Complex Systems, MOE, Beijing Normal University, Beijing 100875, China;
2. Department of Mathematics, National University of Singapore, 21 Lower Kent Ridge Road, Singapore 119077, Singapore

收稿日期:2022-08-26 修回日期:2023-09-15 接受日期:2023-09-16 出版日期:2024-03-01 发布日期:2024-03-01
通讯作者: Yongqiang Cai,E-mail:caiyq.math@bnu.edu.cn;Qianxiao Li,E-mail:qianxiao@nus.edu.sg;Zuowei Shen,E-mail:matzuows@nus.edu.sg E-mail:caiyq.math@bnu.edu.cn;qianxiao@nus.edu.sg;matzuows@nus.edu.sg
基金资助:
Yongqiang Cai is supported by the National Natural Science Foundation of China (Grant No. 12201053). Qianxiao Li is supported by the National Research Foundation, Singapore, under the NRF fellowship (Project No. NRF-NRFF13-2021-0005).

Optimization in Machine Learning: a Distribution-Space Approach

Yongqiang Cai¹, Qianxiao Li², Zuowei Shen²

1. School of Mathematical Sciences, Laboratory of Mathematics and Complex Systems, MOE, Beijing Normal University, Beijing 100875, China;
2. Department of Mathematics, National University of Singapore, 21 Lower Kent Ridge Road, Singapore 119077, Singapore

Received:2022-08-26 Revised:2023-09-15 Accepted:2023-09-16 Online:2024-03-01 Published:2024-03-01
Contact: Yongqiang Cai,E-mail:caiyq.math@bnu.edu.cn;Qianxiao Li,E-mail:qianxiao@nus.edu.sg;Zuowei Shen,E-mail:matzuows@nus.edu.sg E-mail:caiyq.math@bnu.edu.cn;qianxiao@nus.edu.sg;matzuows@nus.edu.sg
Supported by:
Yongqiang Cai is supported by the National Natural Science Foundation of China (Grant No. 12201053). Qianxiao Li is supported by the National Research Foundation, Singapore, under the NRF fellowship (Project No. NRF-NRFF13-2021-0005).

摘要/Abstract

摘要： We present the viewpoint that optimization problems encountered in machine learning can often be interpreted as minimizing a convex functional over a function space, but with a non-convex constraint set introduced by model parameterization. This observation allows us to repose such problems via a suitable relaxation as convex optimization problems in the space of distributions over the training parameters. We derive some simple relationships between the distribution-space problem and the original problem, e.g., a distribution-space solution is at least as good as a solution in the original space. Moreover, we develop a numerical algorithm based on mixture distributions to perform approximate optimization directly in the distribution space. Consistency of this approximation is established and the numerical efficacy of the proposed algorithm is illustrated in simple examples. In both theory and practice, this formulation provides an alternative approach to large-scale optimization in machine learning.

关键词: Machine learning, Convex relaxation, Optimization, Distribution space

Abstract: We present the viewpoint that optimization problems encountered in machine learning can often be interpreted as minimizing a convex functional over a function space, but with a non-convex constraint set introduced by model parameterization. This observation allows us to repose such problems via a suitable relaxation as convex optimization problems in the space of distributions over the training parameters. We derive some simple relationships between the distribution-space problem and the original problem, e.g., a distribution-space solution is at least as good as a solution in the original space. Moreover, we develop a numerical algorithm based on mixture distributions to perform approximate optimization directly in the distribution space. Consistency of this approximation is established and the numerical efficacy of the proposed algorithm is illustrated in simple examples. In both theory and practice, this formulation provides an alternative approach to large-scale optimization in machine learning.

Key words: Machine learning, Convex relaxation, Optimization, Distribution space

Yongqiang Cai, Qianxiao Li, Zuowei Shen. Optimization in Machine Learning: a Distribution-Space Approach[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1217-1240.

参考文献

[1] Allen-Zhu, Z., Li, Y., Song, Z.: A convergence theory for deep learning via over-parameterization. In: International Conference on Machine Learning, pp. 242-252 (2019)
[2] Arora, S., Cohen, N., Hazan, E.: On the optimization of deep networks: implicit acceleration by overparameterization. In: International Conference on Machine Learning, pp. 244-253 (2018)
[3] Bach, F.: Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Res. 18(1), 629-681 (2017)
[4] Bacharoglou, A.: Approximation of probability distributions by convex mixtures of Gaussian measures. Proc. Am. Math. Soc. 138(7), 2619-2628 (2010)
[5] Barron, A.R.: Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inf. Theory 39(3), 930-945 (1993)
[6] Bassily, R., Belkin, M., Ma, S.: On exponential convergence of SGD in non-convex over-parametrized learning. arXiv:1811.02564 (2018)
[7] Bengio, Y., Roux, N.L., Vincent, P., Delalleau, O., Marcotte, P.: Convex neural networks. In: Advances in Neural Information Processing Systems 18, pp. 123-130 (2006)
[8] Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123-140 (1996)
[9] Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Advances in Neural Information Processing Systems 31, pp. 3036-3046 (2018)
[10] Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2(4), 303-314 (1989)
[11] Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: Advances in Neural Information Processing Systems 27, pp. 2933-2941 (2014)
[12] Dean, D.S.: Langevin equation for the density of a system of interacting Langevin processes. J. Phys. A: Math. Gen. 29(24), L613 (1996)
[13] Deutsch, L.: Generating neural networks with neural networks. arXiv:1801.01952 (2018)
[14] Du, S.S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. In: International Conference on Learning Representations, vol. 7, pp. 1-19 (2019)
[15] Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T.: Efficient projections onto the l1-ball for learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, pp. 272-279 (2008)
[16] E, W.N., Ma, C., Wu, L.: Barron spaces and the compositional function spaces for neural network models. arXiv:1906.08039 (2019)
[17] Eschrig, H.: The Fundamentals of Density Functional Theory. Teubner, Stuttgart (1996)
[18] Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. Int. Conf. Mach. Learn. 96, 148-156 (1996)
[19] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256 (2010)
[20] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems 27, pp. 2672-2680 (2014)
[21] Ha, D., Dai, A., Le, Q.V.: Hypernetworks. In: International Conference on Learning Representations, vol. 5, pp. 1-29 (2017)
[22] Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2(5), 359-366 (1989)
[23] Jain, P., Kar, P.: Non-convex optimization for machine learning. Found. Trends Mach. Learn. 10(3/4), 142-336 (2017)
[24] Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (2014)
[25] Kingma, D.P., Welling, M.: Auto-encoding variational bayes. In: Proceedings of the 3rd International Conference on Learning Representations (2014)
[26] Krueger, D., Huang, C.-W., Islam, R., Turner, R., Lacoste, A., Courville, A.: Bayesian hypernetworks. arXiv:1710.04759 (2017)
[27] LeCun, Y.: The MNIST database of handwritten digits. Technical report (1998)
[28] Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6(6), 861-867 (1993)
[29] Li, Y., Liang, Y.: Learning overparameterized neural networks via stochastic gradient descent on structured data. In: Advances in Neural Information Processing Systems 31, pp. 8157-8166 (2018)
[30] Ma, S., Bassily, R., Belkin, M.: The power of interpolation: understanding the effectiveness of SGD in modern over-parametrized learning. In: International Conference on Machine Learning, pp. 3325-3334 (2018)
[31] Mahoney, M., Martin, C.: Traditional and heavy-tailed self regularization in neural network models. In: International Conference on Machine Learning, pp. 4284-4293 (2019)
[32] Martin, C.H., Mahoney, M.W.: Implicit self-regularization in deep neural networks: evidence from random matrix theory and implications for learning. J. Mach. Learn. Res. 22(165), 1-73 (2021)
[33] Mei, S., Montanari, A., Nguyen, P.-M.: A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. 115(33), 7665-7671 (2018)
[34] Nestoridis, V., Schmutzhard, S., Stefanopoulos, V.: Universal series induced by approximate identities and some relevant applications. J. Approx. Theory 163(12), 1783-1797 (2011)
[35] Oymak, S., Soltanolkotabi, M.: Overparameterized nonlinear learning: gradient descent takes the shortest path? In: International Conference on Machine Learning, pp. 4951-4960 (2019)
[36] Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp. 1177-1184 (2008)
[37] Rahimi, A., Recht, B.: Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In: Advances in Neural Information Processing Systems, pp. 1313-1320 (2009)
[38] Ratzlaff, N., Fuxin, L.: HyperGAN: a generative model for diverse, performant neural networks. In: International Conference on Machine Learning, pp. 5361-5369 (2019)
[39] Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33(1/2), 1-39 (2010)
[40] Rotskoff, G., Vanden-Eijnden, E.: Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. In: Advances in Neural Information Processing Systems, pp. 7146-7155 (2018)
[41] Sinha, A., Duchi, J.C.: Learning kernels with random features. In: Advances in Neural Information Processing Systems, pp. 1298-1306 (2016)
[42] Sirignano, J., Spiliopoulos, K.: Mean field analysis of neural networks: a law of large numbers. SIAM J. Appl. Math. 80(2), 725-752 (2020)
[43] Sonoda, S.: Fast approximation and estimation bounds of kernel quadrature for infinitely wide models. arXiv:1902.00648 (2019)
[44] Sorenson, H.W., Alspach, D.L.: Recursive Bayesian estimation using Gaussian sums. Automatica 7(4), 465-479 (1971)
[45] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929-1958 (2014)
[46] Wan, L., Zeiler, M., Zhang, S., Cun, Y.L., Fergus, R.: Regularization of neural networks using dropconnect. In: Proceedings of the 30th International Conference on Machine Learning 28, pp. 1058-1066 (2013)
[47] Wang, W., Carreira-Perpinán, M.A.: Projection onto the probability simplex: an efficient algorithm with a simple proof, and an application. arXiv:1309.1541 (2013)
[48] Wei, C., Lee, J.D., Liu, Q., Ma, T.: Regularization matters: generalization and optimization of neural nets v.s. their induced kernel. arXiv:1810.05369 (2018)
[49] Wu, L., Zhu, Z., E, W.: Towards understanding generalization of deep learning: perspective of loss landscapes. In: ICML Workshop on Principled Approaches to Deep Learning (2017)
[50] Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., Pennington, J.: Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. In: International Conference on Machine Learning, pp. 5393-5402 (2018)
[51] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747 (2017)
[52] Zeevi, A.J., Meir, R.: Density estimation through convex combinations of densities: approximation and estimation bounds. Neural Networks 10(1), 99-109 (1997)
[53] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: International Conference on Learning Representations (2017)

Optimization in Machine Learning: a Distribution-Space Approach

Optimization in Machine Learning: a Distribution-Space Approach

PDF

可视化

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 11

编辑推荐

Metrics

本文评价

[1]	Daniil Bochkov, Frederic Gibou. A Non-parametric Gradient-Based Shape Optimization Approach for Solving Inverse Problems in Directed Self-Assembly of Block Copolymers[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1472-1489.
[2]	Kevin Bui, Yifei Lou, Fredrick Park, Jack Xin. An Efficient Smoothing and Thresholding Image Segmentation Framework with Weighted Anisotropic-Isotropic Total Variation[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1369-1405.
[3]	Samira Kabri, Alexander Auras, Danilo Riccio, Hartmut Bauermeister, Martin Benning, Michael Moeller, Martin Burger. Convergent Data-Driven Regularizations for CT Reconstruction[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1342-1368.
[4]	Bj?rn Engquist, Kui Ren, Yunan Yang. Adaptive State-Dependent Diffusion for Derivative-Free Optimization[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 1241-1269.
[5]	Jane Wu, Michael Bao, Xinwei Yao, Ronald Fedkiw. Deep Energies for Estimating Three-Dimensional Facial Pose and Expression[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 837-861.
[6]	Howard Heaton, Samy Wu Fung, Stanley Osher. Global Solutions to Nonconvex Problems by Evolution of Hamilton-Jacobi PDEs[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 790-810.
[7]	Akiva Wernick, Jen-Ping Chen. A Blade Altering Toolbox for Automating Rotor Design Optimization[J]. Communications on Applied Mathematics and Computation, 2024, 6(1): 688-704.
[8]	Weitao Chen, Chiu-Yen Kao. Review of Computational Approaches to Optimization Problems in Inhomogeneous Rods and Plates[J]. Communications on Applied Mathematics and Computation, 2024, 6(1): 236-256.
[9]	Liqun Qi. Standard Dual Quaternion Optimization and Its Applications in Hand-Eye Calibration and SLAM[J]. Communications on Applied Mathematics and Computation, 2023, 5(4): 1469-1483.
[10]	Jiefu Zhang, Leonardo Zepeda-Núñez, Yuan Yao, Lin Lin. Learning the Mapping x ↦ $\sum\limits_{i = 1}^d {x_i^2}$: the Cost of Finding the Needle in a Haystack[J]. Communications on Applied Mathematics and Computation, 2021, 3(2): 313-335.
[11]	Zhiping Mao, Zhen Li, George Em Karniadakis. Nonlocal Flocking Dynamics: Learning the Fractional Order of PDEs from Particle Simulations[J]. Communications on Applied Mathematics and Computation, 2019, 1(4): 597-619.