[1] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: Schölkopf, B., Platt, J.C., Hoffman, T. (eds) NIPS'06: Proceedings of the 20th International Conference on Neural Information Processing Systems, pp. 153–160. MIT Press, Cambridge (2006)
[2] Bottou, L.: Online algorithms and stochastic approximations. In: Saad, D. (ed) Online Learning and Neural Networks. Cambridge University Press, Cambridge (1998)
[3] Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Platt, J.C., Koller, D., Singer, Y., Roweis, S.T. (eds) NIPS'07: Proceedings of the 21st International Conference on Neural Information Processing Systems, pp. 161–168. Curran Associates Inc., New York (2007)
[4] Brownlee, J.: Probability for Machine Learning: Discover How to Harness Uncertainty with Python. Machine Learning Mastery (2019)
[5] Chen, Q., Huang, N., Riemenschneider, S., Xu, Y.: A B-spline approach for empirical mode decompositions. Adv. Comput. Math. 24(1), 171–195 (2006)
[6] Chen, Z., Micchelli, C.A., Xu, Y.: A construction of interpolating wavelets on invariant sets. Math. Comput. 68(228), 1569–1587 (1999)
[7] Chen, Z., Micchelli, C.A., Xu, Y.: Multiscale Methods for Fredholm Integral Equations, vol. 28. Cambridge University Press, Cambridge (2015)
[8] Chen, Z., Wu, B., Xu, Y.: Fast multilevel augmentation methods for solving Hammerstein equations. SIAM J. Numer. Anal. 47(3), 2321–2346 (2009)
[9] Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: Bengio, S., Wallach, H.M. (eds) NIPS'18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3040–3050. Curran Associates Inc., New York (2018)
[10] Daubechies, I.: Ten Lectures on Wavelets. SIAM, Philadelphia (1992)
[11] Daubechies, I., DeVore, R., Foucart, S., Hanin, B., Petrova, G.: Nonlinear approximation and (deep) ReLU networks. Constr. Approx. 55(1), 127–172 (2022)
[12] Deutsch, F.: Best Approximation in Inner Product Spaces, vol. 7. Springer, New York (2001)
[13] Du, S., Lee, J., Li, H., Wang, L., Zhai, X.: Gradient descent finds global minima of deep neural networks. In: International Conference on Machine Learning (ICML), pp. 1675–1685. PMLR (2019)
[14] Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20(55), 1–21 (2019)
[15] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)
[16] Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
[17] Häggström, I., Schmidtlein, C.R., Campanella, G., Fuchs, T.J.: DeepPET: a deep encoder-decoder network for directly solving the PET image reconstruction inverse problem. Med. Image Anal. 54, 253–262 (2019)
[18] Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., Sugiyama, M.: Co-teaching: robust training of deep neural networks with extremely noisy labels. Adv. Neural Inf. Process. Syst. 31, 1–11 (2018)
[19] Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, New York (2012)
[20] Huang, N.E., Shen, Z., Long, S.R., Wu, M.C., Shih, H.H., Zheng, Q., Yen, N.C., Tung, C.C., Liu, H.H.: The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc. R. Soc. Lond. A 454(1971), 903–995 (1998)
[21] Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR), San Diego, CA, USA (2015)
[22] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
[23] Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. In: 5th International Conference on Learning Representations (ICLR) (2017)
[24] LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
[25] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
[26] Liu, Q., Wang, R., Xu, Y., Yan, M.: Parameter choices for sparse regularization with the ℓ1 norm. Inverse Probl. 39(2), 025004 (2023)
[27] Mallat, S.G., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41(12), 3397–3415 (1993)
[28] Micchelli, C.A., Xu, Y.: Using the matrix refinement equation for the construction of wavelets on invariant sets. Appl. Comput. Harmon. Anal. 1(4), 391–401 (1994)
[29] Natarajan, N., Dhillon, I.S., Ravikumar, P.K., Tewari, A.: Learning with noisy labels. Adv. Neural Inf. Process. Syst. 26, 1–9 (2013)
[30] Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., Qu, L.: Making deep neural networks robust to label noise: a loss correction approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952 (2017)
[31] Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., Courville, A.: On the spectral bias of neural networks. In: International Conference on Machine Learning (2019)
[32] Raissi, M.: Deep hidden physics models: deep learning of nonlinear partial differential equations. J. Mach. Learn. Res. 19(25), 1–24 (2018)
[33] Rice, L., Wong, E., Kolter, Z.: Overfitting in adversarially robust deep learning. In: Daumé III, H., Singh, A. (eds) Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 119, pp. 8093–8104. PMLR (2020)
[34] van Rooyen, B., Menon, A.K., Williamson, R.C.: Learning with symmetric label noise: the importance of being unhinged. In: NIPS'15: Proceedings of the 29th International Conference on Neural Information Processing Systems, vol. 1, pp. 10–18 (2015)
[35] Shen, D., Wu, G., Suk, H.I.: Deep learning in medical image analysis. Annu. Rev. Biomed. Eng. 19(1), 221–248 (2017)
[36] Shen, Z., Yang, H., Zhang, S.: Deep network with approximation error being reciprocal of width to power of square root of depth. Neural Comput. 33(4), 1005–1036 (2021)
[37] Torlai, G., Mazzola, G., Carrasquilla, J., Troyer, M., Melko, R., Carleo, G.: Neural-network quantum state tomography. Nat. Phys. 14(5), 447–450 (2018)
[38] Wu, W., Feng, G., Li, Z., Xu, Y.: Deterministic convergence of an online gradient method for BP neural networks. IEEE Trans. Neural Netw. 16(3), 533–540 (2005)
[39] Wu, W., Xu, Y.: Deterministic convergence of an online gradient method for neural networks. J. Comput. Appl. Math. 144(1/2), 335–347 (2002)
[40] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747 (2017)
[41] Xu, Y., Liu, B., Liu, J., Riemenschneider, S.: Two-dimensional empirical mode decomposition by finite elements. Proc. R. Soc. A Math. Phys. Eng. Sci. 462(2074), 3081–3096 (2006)
[42] Xu, Y., Zhang, H.: Convergence of deep ReLU networks. Neurocomputing 571, 127174 (2024)
[43] Xu, Y., Zeng, T.: Sparse deep neural network for nonlinear partial differential equations. Numer. Math. Theor. Meth. Appl. 16(1), 58–78 (2023)
[44] Xu, Z.Q.J., Zhang, Y., Luo, T.: Overview frequency principle/spectral bias in deep learning. Commun. Appl. Math. Comput. (2024). https://doi.org/10.1007/s42967-024-00398-7
[45] Xu, Z.Q.J., Zhang, Y., Xiao, Y.: Training behavior of deep neural network in frequency domain. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Lecture Notes in Computer Science, vol. 11953. Springer, Cham (2019)