Transformer Block Architecture: Attention and Feed-Forward Integration

18 Jun 2025

Abstract and 1 Introduction

2 Related Work

3 Model and 3.1 Associative memories

3.2 Transformer blocks

4 A New Energy Function

4.1 The layered structure

5 Cross-Entropy Loss

6 Empirical Results and 6.1 Empirical evaluation of the radius

6.2 Training GPT-2

6.3 Training Vanilla Transformers

7 Conclusion and Acknowledgments

Appendix A. Deferred Tables

Appendix B. Some Properties of the Energy Functions

Appendix C. Deferred Proofs from Section 5

Appendix D. Transformer Details: Using GPT-2 as an Example

References

3.2 Transformer blocks

Transformers (Vaswani et al., 2017) are built from a stack of homogeneous layers, each consisting of a multi-head attention sub-layer and a feed-forward sub-layer, each followed by an add-norm operation (a skip connection plus layer normalization). As an example of a typical Transformer, the GPT-2 architecture is discussed in Appendix D. The multi-head attention and feed-forward (FF) layers account for most of the parameters in the model.
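For concreteness, the following is a minimal sketch of one such block in PyTorch, using the pre-norm arrangement of GPT-2 (see Appendix D); the hyperparameter defaults and module names are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm transformer block: an attention sub-layer and an FF sub-layer,
    each wrapped in a skip connection with layer normalization."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        # Feed-forward sub-layer: expand to 4*d_model, then project back.
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head attention sub-layer with skip connection
        # (causal masking omitted for brevity).
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Feed-forward sub-layer with skip connection.
        return x + self.ff(self.ln2(x))
```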

Observation 2 The attention layer and the feed-forward layer can be conceptually integrated into a unified transformer layer.
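One way to read this observation (a paraphrase of the idea, not the paper's formal construction): writing $\mathrm{Attn}$ for the multi-head attention sub-layer and $\mathrm{FF}$ for the feed-forward sub-layer, a block acts as the single composed map

$$
T(x) \;=\; \mathrm{LN}\big(\tilde{x} + \mathrm{FF}(\tilde{x})\big), \qquad \tilde{x} \;=\; \mathrm{LN}\big(x + \mathrm{Attn}(x)\big),
$$

so the attention and FF sub-layers, together with their add-norm operations, can be treated as one transformer layer when analyzing the stack.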

The attention layers and the FF layers account for the majority of the model's parameters, so the total number of parameters N is proportional to the square of the embedding dimension; the exact ratio depends on the number of layers and on the hidden dimension of the FF sub-layers (a rough count is sketched below). In the current work, we do not consider other modifications such as lateral connections or skip-layer connections, or compressive modules (Xiong et al., 2023; Fei et al., 2023; Munkhdalai et al., 2024).
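To make the quadratic scaling concrete, here is a back-of-the-envelope count; the 4x FF expansion and the omission of biases and embedding tables are simplifying assumptions, not figures from the paper.

```python
def approx_block_params(d: int, n_layers: int, ff_mult: int = 4) -> int:
    """Approximate parameter count of the attention and FF layers alone."""
    attn = 4 * d * d          # W_Q, W_K, W_V and the output projection: four d x d matrices
    ff = 2 * ff_mult * d * d  # two linear maps: d -> ff_mult*d and ff_mult*d -> d
    return n_layers * (attn + ff)

# Example: a GPT-2-small-sized stack (d = 768, 12 layers) gives about 85M
# block parameters, i.e. N grows with the square of the embedding dimension.
print(approx_block_params(768, 12))  # 84934656
```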

Authors:

(1) Xueyan Niu, Theory Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd.;

(2) Bo Bai (baibo8@huawei.com);

(3) Lei Deng (deng.lei2@huawei.com);

(4) Wei Han (harvey.hanwei@huawei.com).


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.