Architectures for longer sequences and efficient inference: StripedHyena

Paving the way to efficient architectures: StripedHyena-7B, open-source models offering a glimpse into a world beyond Transformers

One of the focus areas of hessian.AI is to develop new architectures for deep learning and generative AI with partners. We have partnered with to train and realize the StripedHyena line of models. In particular, this release includes StripedHyena-Hessian-7B (SH 7B), a base model:

  • SH 7B is competitive with the best open-source Transformers in short- and long-context evaluations. The same model outperforms LLaMA-2 13B (twice its size) on OpenLLM leaderboard tasks and Mistral 7B on long-context summarization.
  • SH 7B is faster and more memory efficient for long-sequence training, finetuning, and generation. Besides attention, one core computational primitive in the model is a state-space model (SSM) layer, building on pioneering work such as S4 (Gu et al.), which allows efficient training as a convolution and efficient inference as a recurrence. Using our latest research on fast kernels for gated convolutions and on efficient Hyena inference, SH 7B is more than 10%, 20%, and 50% faster in end-to-end training on sequences of length 32k, 64k, and 131k, respectively, compared to an optimized Transformer baseline using FlashAttention v2 and custom kernels. SH 7B caches for autoregressive generation are 50% smaller than those of an equivalently sized Transformer using grouped-query attention.
  • SH 7B is designed using our latest research on scaling laws of efficient architectures. In particular, SH 7B is a hybrid architecture composed of attention and gated convolutions. Via a compute-optimal scaling protocol, we find that StripedHyena hybrids improve on compute-optimal scaling laws for Transformers (Chinchilla), yielding higher quality models than Transformers at each compute budget. With our academic partners, we have been developing theory and synthetic tasks to understand how and why this occurs.
  • SH 7B is optimized using a set of new model grafting techniques, enabling us to change model architecture during training or after a pretraining phase. SH 7B was obtained by fusing components of Mistral and Hyena, and trained on a mix of (book-free) RedPajama and long-context data.
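
The convolution/recurrence duality mentioned for the SSM layer above can be illustrated with a minimal sketch. It assumes a single-channel, diagonal, time-invariant linear SSM with hypothetical parameters `a`, `b`, and `c`; it is not the StripedHyena implementation, whose layers and kernels are more elaborate. The same layer is evaluated once as a causal convolution (the view used for parallel training) and once as a recurrence with a fixed-size state (the view used for autoregressive generation).

```python
# Illustrative sketch only: a diagonal linear SSM
#   x_t = a * x_{t-1} + b * u_t,   y_t = c . x_t
# The names a, b, c and all functions below are assumptions for this example.

def ssm_kernel(a, b, c, length):
    """Impulse response h[t] = sum_i c[i] * a[i]**t * b[i]."""
    return [
        sum(ci * (ai ** t) * bi for ai, bi, ci in zip(a, b, c))
        for t in range(length)
    ]

def ssm_convolution(a, b, c, u):
    """Training-style view: causal convolution of the input with the kernel.

    Written as a direct O(L^2) sum for clarity; fast implementations
    typically use FFT-based convolution instead.
    """
    h = ssm_kernel(a, b, c, len(u))
    return [
        sum(h[s] * u[t - s] for s in range(t + 1))
        for t in range(len(u))
    ]

def ssm_recurrence(a, b, c, u):
    """Inference-style view: one constant-size state update per token."""
    x = [0.0] * len(a)  # state size does not grow with sequence length
    ys = []
    for u_t in u:
        x = [ai * xi + bi * u_t for ai, xi, bi in zip(a, x, b)]
        ys.append(sum(ci * xi for ci, xi in zip(c, x)))
    return ys

if __name__ == "__main__":
    a, b, c = [0.9, 0.5], [1.0, 2.0], [0.5, -1.0]
    u = [1.0, 0.0, 2.0, -1.0]
    y_conv = ssm_convolution(a, b, c, u)
    y_rec = ssm_recurrence(a, b, c, u)
    # Both views compute the same outputs up to floating-point error.
    print(all(abs(p - q) < 1e-9 for p, q in zip(y_conv, y_rec)))
```

In the recurrent view the per-layer state has a fixed size regardless of how many tokens have been generated, whereas an attention layer's key-value cache grows with sequence length.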

We are excited to keep pushing the boundaries of model architectures for fast training and inference. Improving the scaling rate (quality gain per unit of compute, measured in FLOPs) allows us to obtain higher quality base models at each compute budget. With StripedHyena models, we are able to reach the same pretraining performance as a strong Transformer architecture (LLaMA) with fewer FLOPs. With model fusion, we open up new opportunities for iterative model building and architecture optimization.