Associative Memories: Transformer Memorization & Performance Dynamics

18 Jun 2025

Table of Links

Abstract and 1 Introduction

3 Model and 3.1 Associative memories

3.2 Transformer blocks

4 A New Energy Function

4.1 The layered structure

5 Cross-Entropy Loss

6 Empirical Results and 6.1 Empirical evaluation of the radius

6.2 Training GPT-2

6.3 Training Vanilla Transformers

7 Conclusion and Acknowledgments

Appendix A. Deferred Tables

Appendix B. Some Properties of the Energy Functions

Appendix C. Deferred Proofs from Section 5

Appendix D. Transformer Details: Using GPT-2 as an Example

3 Model

3.1 Associative memories

Observation 1. The models tend to memorize the patterns of the training data.

Authors:

(1) Xueyan Niu, Theory Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd.;

(2) Bo Bai baibo ([email protected]);

(3) Lei Deng ([email protected]);

(4) Wei Han ([email protected]).

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
