Training deep neural networks with 8-bit floating point numbers

Naigang Wang, Jungwook Choi, Daniel Brand, Chia Yu Chen, Kailash Gopalakrishnan

Research output: Contribution to journalConference article

14 Scopus citations


The state-of-the-art hardware platforms for training Deep Neural Networks (DNNs) are moving from traditional single precision (32-bit) computations towards 16 bits of precision - in large part due to the high energy efficiency and smaller bit storage associated with using reduced-precision representations. However, unlike inference, training with numbers represented with less than 16 bits has been challenging due to the need to maintain fidelity of the gradient computations during back-propagation. Here we demonstrate, for the first time, the successful training of DNNs using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of Deep Learning models and datasets. In addition to reducing the data and computation precision to 8 bits, we also successfully reduce the arithmetic precision for additions (used in partial product accumulation and weight updates) from 32 bits to 16 bits through the introduction of a number of key ideas including chunk-based accumulation and floating point stochastic rounding. The use of these novel techniques lays the foundation for a new generation of hardware training platforms with the potential for 2 − 4× improved throughput over today's systems.

Original languageEnglish
Pages (from-to)7675-7684
Number of pages10
JournalAdvances in Neural Information Processing Systems
Publication statusPublished - 2018 Jan 1
Event32nd Conference on Neural Information Processing Systems, NeurIPS 2018 - Montreal, Canada
Duration: 2018 Dec 22018 Dec 8


Cite this

Wang, N., Choi, J., Brand, D., Chen, C. Y., & Gopalakrishnan, K. (2018). Training deep neural networks with 8-bit floating point numbers. Advances in Neural Information Processing Systems, 2018-December, 7675-7684.