Transformer Block Architecture: Attention and Feed-Forward Integration

18 Jun 2025

Abstract and 1 Introduction

2 Related Work

3 Model and 3.1 Associative memories

3.2 Transformer blocks

4 A New Energy Function

4.1 The layered structure

5 Cross-Entropy Loss

6 Empirical Results and 6.1 Empirical evaluation of the radius

6.2 Training GPT-2

6.3 Training Vanilla Transformers

7 Conclusion and Acknowledgments

Appendix A. Deferred Tables

Appendix B. Some Properties of the Energy Functions

Appendix C. Deferred Proofs from Section 5

Appendix D. Transformer Details: Using GPT-2 as an Example

References

3.2 Transformer blocks

Transformers (Vaswani et al., 2017) are built from a stack of homogeneous layers, each consisting of a multi-head attention sub-layer and a feed-forward sub-layer, each followed by an add-norm operation (a skip connection plus layer normalization). As an example of a typical Transformer, the GPT-2 architecture is discussed in Appendix D. The multi-head attention and feed-forward (FF) layers account for most of the parameters in the model.
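For concreteness, the following is a minimal sketch of one such block in PyTorch, using the pre-norm arrangement of GPT-2 (see Appendix D); the hyperparameter defaults and module names are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm transformer block: an attention sub-layer and an FF sub-layer,
    each wrapped in a skip connection with layer normalization."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        # Feed-forward sub-layer: expand to 4*d_model, then project back.
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multi-head attention sub-layer with skip connection
        # (causal masking omitted for brevity).
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Feed-forward sub-layer with skip connection.
        return x + self.ff(self.ln2(x))
```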

Observation 2 The attention layer and the feed-forward layer can be conceptually integrated into a unified transformer layer.
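One way to read this observation (a paraphrase of the idea, not the paper's formal construction): writing $\mathrm{Attn}$ for the multi-head attention sub-layer and $\mathrm{FF}$ for the feed-forward sub-layer, a block acts as the single composed map

$$
T(x) \;=\; \mathrm{LN}\big(\tilde{x} + \mathrm{FF}(\tilde{x})\big), \qquad \tilde{x} \;=\; \mathrm{LN}\big(x + \mathrm{Attn}(x)\big),
$$

so the attention and FF sub-layers, together with their add-norm operations, can be treated as one transformer layer when analyzing the stack.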

The attention layers and the FF layers account for the majority of the model's parameters, so the total number of parameters N is proportional to the square of the embedding dimension; the exact ratio depends on the number of layers and on the hidden dimension of the FF sub-layers (a rough count is sketched below). In the current work, we do not consider other modifications such as lateral connections or skip-layer connections, or compressive modules (Xiong et al., 2023; Fei et al., 2023; Munkhdalai et al., 2024).
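To make the quadratic scaling concrete, here is a back-of-the-envelope count; the 4x FF expansion and the omission of biases and embedding tables are simplifying assumptions, not figures from the paper.

```python
def approx_block_params(d: int, n_layers: int, ff_mult: int = 4) -> int:
    """Approximate parameter count of the attention and FF layers alone."""
    attn = 4 * d * d          # W_Q, W_K, W_V and the output projection: four d x d matrices
    ff = 2 * ff_mult * d * d  # two linear maps: d -> ff_mult*d and ff_mult*d -> d
    return n_layers * (attn + ff)

# Example: a GPT-2-small-sized stack (d = 768, 12 layers) gives about 85M
# block parameters, i.e. N grows with the square of the embedding dimension.
print(approx_block_params(768, 12))  # 84934656
```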

Authors:

(1) Xueyan Niu, Theory Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd.;

(2) Bo Bai (baibo8@huawei.com);

(3) Lei Deng (deng.lei2@huawei.com);

(4) Wei Han (harvey.hanwei@huawei.com).


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.