Deep Energies for Estimating Three-Dimensional Facial Pose and Expression

doi:10.1007/s42967-023-00256-y

Abstract

While much progress has been made in capturing high-quality facial performances using motion capture markers and shape-from-shading, high-end systems typically also rely on rotoscope curves hand-drawn on the image. These curves are subjective and difficult to draw consistently; moreover, ad-hoc procedural methods are required to generate matching rotoscope curves on the synthetic renders embedded in the optimization used to determine three-dimensional (3D) facial pose and expression. We propose an alternative approach whereby these curves and other keypoints are detected automatically on both the image and the synthetic renders using trained neural networks, eliminating both artist subjectivity and the ad-hoc procedures meant to mimic it. More generally, we propose using machine learning networks to implicitly define deep energies which, when minimized using classical optimization techniques, lead to 3D facial pose and expression estimation.
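The idea of an energy defined by detected keypoints and minimized with a classical optimizer can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the landmark coordinates, the three-parameter pose (yaw plus a 2D translation), the orthographic projection, and the choice of Gauss-Newton with a finite-difference Jacobian. In particular, `detect_on_render` is a hypothetical analytic stand-in for "render the face, then run a trained keypoint network on the render"; no neural network is involved here.

```python
import numpy as np

# Hypothetical rigid 3D landmarks (illustrative values only).
LANDMARKS_3D = np.array([
    [-1.0,  1.0, 0.0],
    [ 1.0,  1.0, 0.0],
    [ 0.0,  0.0, 1.0],
    [-0.7, -1.0, 0.2],
    [ 0.7, -1.0, 0.2],
    [ 0.0, -1.8, 0.3],
])


def detect_on_render(theta):
    """Stand-in for rendering with pose theta and detecting 2D keypoints.

    theta = (yaw, tx, ty). A real pipeline would run a trained detector
    on a synthetic render; here the landmarks are simply rotated about
    the y-axis and projected orthographically.
    """
    yaw, tx, ty = theta
    c, s = np.cos(yaw), np.sin(yaw)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    rotated = LANDMARKS_3D @ rot_y.T
    return rotated[:, :2] + np.array([tx, ty])


def residual(theta, target_keypoints):
    """Flattened difference between render-side and image-side keypoints."""
    return (detect_on_render(theta) - target_keypoints).ravel()


def energy(theta, target_keypoints):
    """Half the sum of squared keypoint differences."""
    r = residual(theta, target_keypoints)
    return 0.5 * float(r @ r)


def numerical_jacobian(theta, target_keypoints, eps=1e-6):
    """Forward-difference Jacobian of the residual with respect to theta."""
    r0 = residual(theta, target_keypoints)
    jac = np.zeros((r0.size, theta.size))
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        jac[:, j] = (residual(theta + step, target_keypoints) - r0) / eps
    return jac


def gauss_newton(theta0, target_keypoints, iterations=20):
    """Plain Gauss-Newton iteration (no damping or line search)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        r = residual(theta, target_keypoints)
        jac = numerical_jacobian(theta, target_keypoints)
        # Solve the linearized least-squares problem jac @ delta = -r.
        delta, *_ = np.linalg.lstsq(jac, -r, rcond=None)
        theta = theta + delta
    return theta


# Synthetic "image" keypoints generated from a known pose.
true_theta = np.array([0.3, 0.5, -0.2])
target_keypoints = detect_on_render(true_theta)

estimate = gauss_newton(np.zeros(3), target_keypoints)
```

In this toy setting the target keypoints are generated from the same model, so the energy can be driven to zero. With a real detector the residual would come from network outputs on both the captured image and the render, and the derivatives would more plausibly come from automatic differentiation than from finite differences.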

Key words: Numerical optimization, Neural networks, Motion capture, Face tracking

Jane Wu, Michael Bao, Xinwei Yao, Ronald Fedkiw. Deep Energies for Estimating Three-Dimensional Facial Pose and Expression[J]. Communications on Applied Mathematics and Computation, 2024, 6(2): 837-861.

