New Technique Optimizes Language Models, Reduces Costs | Turtles AI
A Tokyo startup, Sakana AI, has developed an innovative technique, "universal transformer memory," that improves the efficiency of language models by reducing processing costs and optimizing memory use through specialized neural networks.
Key Points:
- Universal transformer memory optimizes the use of context windows in Transformer models.
- It uses neural attention memory models (NAMMs) to manage relevant information efficiently.
- NAMMs significantly reduce the cache memory required without sacrificing performance.
- The technique adapts to different modalities and applications, maintaining high versatility.
Sakana AI, a promising Japanese startup, has presented an innovative methodology to optimize large language models (LLMs) and other Transformer-based systems, reducing costs and improving performance. At the core of this innovation is the so-called "universal transformer memory," a technique that uses neural attention memory models (NAMMs) to refine how information is processed. The context window, the span of input a model can attend to at once, acts as the working memory of an LLM. Larger context windows allow more data to be taken into account, but they come at a high computational cost. This challenge has driven the development of prompt engineering techniques, which aim to maximize model efficiency by removing redundant details without sacrificing essential information. However, these solutions often require significant resources or manual intervention.
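To put the cost of long context windows in concrete terms, the sketch below estimates how the key-value (KV) cache a decoder-only Transformer must hold in memory grows with context length. The model dimensions (layer count, heads, head size, fp16 storage) are illustrative assumptions, not figures for any specific LLM.

```python
# Rough estimate of KV-cache memory for a hypothetical decoder-only model,
# illustrating why long context windows are computationally expensive.
# All model dimensions are assumptions chosen for illustration.

def kv_cache_bytes(seq_len: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2,   # fp16/bf16 storage
                   batch_size: int = 1) -> int:
    """Keys + values stored for every layer, head, and cached token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token

for ctx in (4_096, 32_768, 128_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> ~{gib:.1f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with the context length, from roughly half a gibibyte at 4K tokens to well over ten gibibytes at 128K, which is exactly the overhead that cache-pruning techniques aim to cut.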
Universal transformer memory addresses these challenges through NAMMs, lightweight models that decide which tokens to keep or discard during inference. Their training, separate from that of the LLM, is done with evolutionary algorithms, which simulate an iterative process of selection and mutation to optimize performance toward a specific goal: reducing the computational burden while maintaining the quality of responses. Because they operate directly on the Transformer's attention layers, NAMMs identify the tokens most relevant to the task at hand and discard the superfluous ones. This not only improves efficiency but also makes NAMMs adaptable to different contexts. For example, a NAMM trained on text-only models can be applied successfully to multimodal or computer vision systems without further adaptation.
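The sketch below is a minimal, hypothetical illustration of that loop: a tiny scorer assigns an importance value to each cached token from its attention statistics, low-scoring tokens are evicted, and the scorer's weights are improved by a simple mutate-and-select evolutionary search. The feature shape, the linear scorer, the placeholder fitness function, and the random-search loop are all assumptions made for illustration; this is not Sakana AI's released implementation, which evolves NAMMs against real LLM workloads.

```python
# Minimal sketch of the idea behind NAMMs: score cached tokens from attention
# statistics, keep only the highest-scoring ones, and evolve the scorer's
# weights with a simple selection-and-mutation loop. Everything here is a
# simplified stand-in for illustration purposes.
import numpy as np

rng = np.random.default_rng(0)

def prune_cache(attn_stats: np.ndarray, weights: np.ndarray, keep_ratio: float):
    """attn_stats: (num_tokens, num_features) attention statistics per cached token.
    Returns indices of the tokens the scorer decides to keep."""
    scores = attn_stats @ weights                 # one importance score per token
    k = max(1, int(len(scores) * keep_ratio))
    return np.argsort(scores)[-k:]                # keep the top-scoring tokens

def fitness(weights: np.ndarray) -> float:
    """Placeholder objective: reward for the kept tokens minus a memory penalty.
    A real setup would measure downstream LLM quality with the pruned cache."""
    attn = rng.random((256, 4))                   # synthetic attention statistics
    kept = prune_cache(attn, weights, keep_ratio=0.25)
    return attn[kept].mean() - 0.01 * len(kept)

# Simple evolutionary loop: mutate the current best candidate, keep improvements.
best_w = rng.normal(size=4)
best_f = fitness(best_w)
for _ in range(200):
    child = best_w + 0.1 * rng.normal(size=4)     # mutation
    f = fitness(child)
    if f > best_f:                                # selection
        best_w, best_f = child, f

print("evolved scorer weights:", np.round(best_w, 3))
```

The key design point this mirrors is that the scorer never needs gradients from the LLM: it only sees attention statistics and a task-level reward, which is what makes evolutionary optimization a natural fit.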
Tests conducted by the researchers have shown that NAMMs can reduce the required cache memory by up to 75% on complex tasks, while improving performance on particularly long sequences. This efficiency translates into concrete benefits even in out-of-distribution settings, such as processing redundant video streams or managing decisions in reinforcement learning systems. An interesting aspect of NAMMs is their ability to adapt their behavior to the needs of the task. In coding tasks, for example, they eliminate comments and whitespace, while in natural language tasks they prune non-essential grammatical redundancies.
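As a loose analogy for that task-dependent behavior on code, the toy filter below strips comment and whitespace-only lines from a snippet and reports how much of the text survives. This hand-written rule only mimics what a trained NAMM is reported to learn on its own at the token level; it is not part of the actual method.

```python
# Toy analogy: for code inputs, comments and blank lines carry little
# task-relevant information and are natural candidates for eviction.
code = '''def add(a, b):
    # return the sum of both arguments
    return a + b   # trailing comment
'''

kept_lines = []
for line in code.splitlines():
    stripped = line.split("#", 1)[0].rstrip()   # drop comments
    if stripped:                                # drop whitespace-only lines
        kept_lines.append(stripped)

pruned = "\n".join(kept_lines)
print(f"kept {len(pruned)} of {len(code)} characters")
print(pruned)
```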
The versatility and modular design of NAMMs make this technology particularly attractive to companies that need to optimize models at scale. By publishing their code, the researchers have taken a step towards wider adoption of the technique.
With future developments in sight, such as applying NAMMs during the training phase of LLMs rather than only at inference, this innovation holds promise for further improving the efficiency of Transformer-based AI systems.