Yao Fu’s Log
💭
Yao: I want to write a punch line saying deep communication is through writing. Can you think of some sentences? Language Model: True depth is found not in speech, but in the quiet dance of pen on paper.
💡
All facts in this blog are based on existing public information, mostly from arXiv, Hugging Face, and GitHub. All opinions are my own.
 
Table of Contents
 

May 2024 | Full Stack Transformer Inference Optimization Season 2: Deploying Long-Context Models

Why are long-context models important? Because they are the foundation for advanced AI applications such as hour-long video understanding, repository-level coding agents, and lifelong AI companions. Our research objective is to foster an AI-based application ecosystem. For this to happen, we have to reduce the deployment cost of long-context transformers. This is the second season of our transformer inference optimization posts, focusing on long-context optimization. We aim to address an ambitious research challenge: how to make deploying production-level transformers at 1M context as cheap as at 4K. We describe a concurrent programming framework for quantitatively analyzing the efficiency challenges of serving multiple long-context requests under a limited amount of GPU high-bandwidth memory (HBM). We give a detailed analysis of how all of the additional computational cost, compared to 4K context, traces back to one single source: the large size of the KV cache. We further analyze how existing efforts address the deployment challenges from the perspectives of concurrency, prefilling, decoding, and context switching, and identify possibilities of combining them to build end-to-end efficient systems.
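The "large size of the KV cache" claim is easy to check with back-of-the-envelope arithmetic. Below is a minimal sketch; the configuration is a hypothetical LLaMA-2-7B-like model (32 layers, 32 KV heads, head dimension 128, fp16) with full multi-head attention, so the exact numbers will differ for other architectures (for example, GQA shrinks `num_kv_heads`).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Memory for the KV cache of a single request.

    The factor of 2 is for keys and values; one (head_dim)-sized vector
    of each is stored per layer, per KV head, per token.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 7B-class config (fp16, full multi-head attention)
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)

gb = 1024 ** 3
print(f"4K context : {kv_cache_bytes(**cfg, seq_len=4096) / gb:.2f} GB")
print(f"1M context : {kv_cache_bytes(**cfg, seq_len=1_000_000) / gb:.2f} GB")
```

Even at fp16, a single 1M-token request needs hundreds of gigabytes of KV cache, far beyond one GPU's HBM, which is why the extra cost relative to 4K all traces back to this one tensor.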
 

Apr 2024 | Llama 3 Opens the Second Chapter of the Game of Scale

The scaling of text data is likely reaching a ceiling, as most of the easy web text (Common Crawl, GitHub, arXiv, etc.) has now been used up. New text data may only incrementally improve model performance because it may not add another order of magnitude. The first chapter of the game of scale, namely scaling up text data, is coming to a conclusion, where frontier models are all about GPT-4 parity. Video data can be orders of magnitude larger than text data. It significantly improves the perception of language models and opens the possibility of large world models. However, it seems that video data cannot improve reasoning. Reinforcement learning has not yet been scaled, and most existing work focuses only on single-step offline optimization. Scaling up exploration and exploitation with online iterative RL from human, environment, and AI feedback could potentially further improve the model's reasoning.
 

Mar 2024 | How Do Language Models put Attention Weights over Long Context?

Yao Fu. University of Edinburgh
We are interested in the problem of lossless KV cache compression: making the KV cache take less memory without sacrificing the language model's capability during inference. We view lossless KV cache compression as the number one challenge for democratizing and deploying long-context (100K - 10M) language models in the real world.
But sorry, we won't discuss any techniques related to KV cache compression in this post 😅. Instead, we look at its prerequisite, i.e., the attention patterns inside the transformer architecture, because only an in-depth understanding of the attention mechanism allows us to find out which part of the KV cache is compressible and which is not.
In this post, we discuss six typical attention patterns over long-context input, across all the transformer layers and heads, aiming to provide an intuitive understanding of what is happening inside the transformer's long-context attention, and potentially to identify what part of the KV cache is compressible.
 

Dec 2023 | Towards 100x Speedup: Full Stack Transformer Inference Optimization

Imagine two companies have equally powerful models. Company A can serve the model to 10 users with 1 GPU, but company B can serve 20 users. Who will win in the long run?
  • Company B, because its serving cost is lower
Imagine a researcher has come up with a super smart decoding method: clever algorithm, solid math, but not compatible with FlashAttention. Can this method be used in production?
  • Probably not, because FlashAttention is essential for large-scale model deployment
 
An in-depth understanding of transformer inference can be extremely beneficial for both research and production. Yet in the real world, large-scale production is usually not so close to cutting-edge research: people who know the algorithms may not know MLSys, and vice versa.
In this post, we discuss full-stack transformer inference optimization, from hardware specs like the A100 memory hierarchy, to MLSys methods like FlashAttention and vLLM, to model architectures like Mixture of Experts, to decoding algorithms like speculative decoding and its variants. Like adding buffs in an RPG, we see how transformer inference is scaled and sped up, step by step.
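As a taste of the decoding-algorithm end of this stack, here is a minimal, greedy-only sketch of speculative decoding over characters. The two "models" are hypothetical stand-in functions; a real implementation verifies all draft tokens with a single batched forward pass of the target model (plus rejection sampling for non-greedy decoding). This toy only illustrates the propose-verify-accept loop.

```python
def target_model(prefix):
    # Stand-in for the large model: next token = next letter (wraps at 'z')
    last = prefix[-1]
    return 'a' if last == 'z' else chr(ord(last) + 1)

def draft_model(prefix):
    # Stand-in for the small model: agrees with the target except after 'd'
    last = prefix[-1]
    if last == 'd':
        return 'x'  # a deliberate disagreement, to exercise rejection
    return 'a' if last == 'z' else chr(ord(last) + 1)

def speculative_decode(prefix, num_new_tokens, k=4):
    out = prefix
    while len(out) - len(prefix) < num_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft, cur = [], out
        for _ in range(k):
            t = draft_model(cur)
            draft.append(t)
            cur += t
        # 2. The target model checks each proposal; in a real system this
        #    is one parallel forward pass over all k draft positions.
        accepted, cur = [], out
        for t in draft:
            if target_model(cur) == t:
                accepted.append(t)
                cur += t
            else:
                break  # first mismatch invalidates the rest of the draft
        # 3. Keep the accepted prefix plus one token from the target
        #    (a correction on mismatch, a bonus token otherwise).
        out = out + ''.join(accepted) + target_model(cur)
    return out[:len(prefix) + num_new_tokens]
```

The output is identical to decoding with `target_model` alone; the speedup comes from generating several tokens per (expensive) target-model pass whenever the draft agrees.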
 

Sep 2023 | An Initial Exploration of Theoretical Support for Language Model Data Engineering. Part 1: Pretraining

Recently, the focus of the research and open-source communities has been gradually shifting from model engineering to data engineering, in recognition of the crucial importance of data quality. However, when we say "we want better data", what does "better data" mean precisely? When we say "we optimize the data composition", what is the objective we are optimizing? We would like to study the theoretical support for language model data engineering. We believe that an in-depth understanding of the problem is as important as developing methods to solve it, and that theoretical analysis will lead us to predictable scaling: predicting the eventual performance on every single task before we actually run the experiments.
 

Jun 2023 | A Stage Review of Instruction Tuning

Following the great success of ChatGPT, the emergence of LLaMA on February 24, 2023 heated up the direction of instruction tuning. On March 18, Alpaca demonstrated the potential of distilling smaller models from mature ones to become decent chatbots, triggering a Cambrian explosion of LLaMA-based models. However, just three months later, people began to recognize the various problems of training LLaMA with ChatGPT's data. This article reviews the development of LLaMA-based models over the past three months and discusses the next challenges of instruction tuning.
 

May 2023 | Towards Complex Reasoning: the Polaris of Large Language Models

Recently, many works on smaller models have achieved inspiring dialog abilities, leading people to wonder whether smaller models can have performance comparable to large models like GPT-3.5. Generally, language models have multi-dimensional abilities, which makes them hard to compare. Finding the correct metric is crucial for developing strong language models. At the current stage, the community is eager to know what the key differentiators are that mark the potential of strong language models. In the GPT-4 release blog, the authors write: "In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold." This means that complex tasks are likely to be the key differentiator between large and small language models. More importantly, complex reasoning opens up opportunities for building a large spectrum of applications upon language models, effectively making language models the next-generation computation platform / operating system. This has the potential to substantially change the way humans interact with computers and reshape the whole computational ecosystem. In this post, we take a close look at methods for building models with strong complex reasoning capabilities.
 

Dec 2022 | How does GPT Obtain its ability? Tracing Emergent Abilities of Language Models to their Sources

Recently, the field has been greatly impressed and inspired by OpenAI's ChatGPT. It is undoubtedly clever, capable, and very fun to talk to. Its multi-faceted abilities are significantly beyond many NLP researchers' and practitioners' expectations based on the impression of the (not-that-strong) original GPT-3. The natural question is how ChatGPT got there, and where these fantastic abilities come from. In this post, we try to dissect the emergent abilities and trace them to their sources, hoping to give a comprehensive roadmap of how the GPT-3.5 model family, along with related large language models, evolved to their current forms. We hope this post can promote the transparency of large language models and serve as a roadmap for the community's ongoing efforts to reproduce GPT-3.5.
 

Nov 2022 | A Closer Look at Large Language Models Emergent Abilities

Recently, there has been great interest in and progress on the impressive abilities of large language models (chain of thought, scratchpad). Collectively referred to as emergent abilities of large language models, these are abilities likely to exist only in large models but not in smaller ones, hence the "emergence" framing. Many of the abilities are quite impressive, like complex reasoning, reasoning with knowledge, and out-of-distribution robustness, as we will look at closely below. These abilities are potentially close to what the NLP community has pursued for decades, and thus represent a potential research paradigm shift from fine-tuning small models to in-context learning with large models. For pioneers, the paradigm shift may be straightforward without the need for justification. Yet, for scientific rigor, we do need very explicit reasons why one should shift to large language models, which are expensive, hard to access, and potentially not as good. In this post, we will scrutinize what these abilities are, what large language models may deliver, and what their potential advantages are in a broader NLP/ML context.
 

Feb 2022 | Why S4 is Good at Long Sequence: Remembering a Sequence with Online Function Approximation

The Structured State Space for Sequence Modeling (S4) model achieves impressive results on the Long Range Arena benchmark, with a substantial margin over previous methods. However, it is written in the language of control theory, ordinary differential equations, function approximation, and matrix decomposition, which is hard to follow for a large portion of researchers and engineers from a computer science background. This post aims to explain the math in an intuitive way, providing an approximate feeling/intuition/understanding of the S4 model: Efficiently Modeling Long Sequences with Structured State Spaces, ICLR 2022.
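For orientation, the control-theory language above all revolves around one standard object, the linear state space model (stated here in its textbook form, with the skip term $D u(t)$ omitted as in the S4 paper):

```latex
% Continuous-time state space model
x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)

% After discretization with step size \Delta, it becomes a linear recurrence
x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = \bar{C}\,x_k
```

The structured (HiPPO) choice of $A$ is what lets the hidden state $x(t)$ memorize the input history via online function approximation, which is the subject of the post.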
 

Same handle on WeChat Official Accounts, Zhihu, and Xueqiu.