Communications on Applied Mathematics and Computation ›› 2024, Vol. 6 ›› Issue (2): 1217-1240. DOI: 10.1007/s42967-023-00322-5

• ORIGINAL PAPERS •

Optimization in Machine Learning: a Distribution-Space Approach

Yongqiang Cai1, Qianxiao Li2, Zuowei Shen2   

  1. School of Mathematical Sciences, Laboratory of Mathematics and Complex Systems, MOE, Beijing Normal University, Beijing 100875, China;
    2. Department of Mathematics, National University of Singapore, 21 Lower Kent Ridge Road, Singapore 119077, Singapore
  • Received: 2022-08-26 Revised: 2023-09-15 Accepted: 2023-09-16 Online: 2024-03-01 Published: 2024-03-01
  • Contact: Yongqiang Cai, E-mail: caiyq.math@bnu.edu.cn; Qianxiao Li, E-mail: qianxiao@nus.edu.sg; Zuowei Shen, E-mail: matzuows@nus.edu.sg
  • Supported by:
    Yongqiang Cai is supported by the National Natural Science Foundation of China (Grant No. 12201053). Qianxiao Li is supported by the National Research Foundation, Singapore, under the NRF fellowship (Project No. NRF-NRFF13-2021-0005).

Abstract: We present the viewpoint that optimization problems encountered in machine learning can often be interpreted as minimizing a convex functional over a function space, but with a non-convex constraint set introduced by model parameterization. This observation allows us to recast such problems, via a suitable relaxation, as convex optimization problems in the space of distributions over the training parameters. We derive some simple relationships between the distribution-space problem and the original problem, e.g., a distribution-space solution is at least as good as a solution in the original space. Moreover, we develop a numerical algorithm based on mixture distributions to perform approximate optimization directly in the distribution space. The consistency of this approximation is established, and the numerical efficacy of the proposed algorithm is illustrated in simple examples. In both theory and practice, this formulation provides an alternative approach to large-scale optimization in machine learning.
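The core idea of the abstract can be sketched in a small toy example (a hypothetical illustration under simplifying assumptions, not the paper's actual algorithm): fitting a model f(x; w) = tanh(wx) is non-convex in the parameter w, but after relaxing to mixtures over a fixed grid of candidate parameters, the same least-squares objective becomes convex in the mixture weights, and a mixture is trivially at least as good as any single parameter (each single parameter is a vertex of the simplex). The model class, target function, grid, and the exponentiated-gradient optimizer below are all illustrative choices.

```python
import numpy as np

# Toy sketch: the loss is non-convex in a single parameter w of
# f(x; w) = tanh(w * x), but convex in a probability vector p over
# a fixed grid of candidate parameters {w_k} (a mixture distribution).
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=200)
# Target realizable by a two-atom mixture, but not by a single tanh(w * x).
y = 0.5 * np.tanh(1.5 * x) + 0.5 * np.tanh(-0.5 * x)

ws = np.linspace(-2.0, 2.0, 9)      # candidate parameters w_k
F = np.tanh(np.outer(x, ws))        # F[i, k] = f(x_i; w_k)

def loss(p):
    # Least-squares loss of the mixture model; convex (quadratic) in p.
    return np.mean((F @ p - y) ** 2)

# Best solution in the original (single-parameter) space:
# evaluate each vertex of the simplex, i.e., each single w_k.
best_single = min(loss(np.eye(len(ws))[k]) for k in range(len(ws)))

# A two-atom mixture reproduces the target exactly, so the
# distribution-space optimum is at least as good as any single w_k.
p_star = np.zeros(len(ws))
p_star[ws == -0.5] = 0.5
p_star[ws == 1.5] = 0.5

# Approximate optimization directly in distribution space:
# exponentiated-gradient (mirror) descent keeps p on the simplex.
p_eg = np.full(len(ws), 1.0 / len(ws))
for _ in range(2000):
    grad = 2.0 * F.T @ (F @ p_eg - y) / len(x)
    p_eg = p_eg * np.exp(-0.2 * grad)
    p_eg /= p_eg.sum()

print(best_single, loss(p_star), loss(p_eg))
```

The comparison makes the abstract's inequality concrete: every single-parameter model is a feasible point (a vertex) of the relaxed problem, so the optimal mixture loss can never exceed the best single-parameter loss, and here the mixture attains zero loss while no single parameter does.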

Key words: Machine learning, Convex relaxation, Optimization, Distribution space