Research & Application – 3rd wave of AI

The research activities of the Hessian AI Service Centre serve to build a bridge from AI research to application and to strengthen the “AI made in Germany” brand. The focus here is on translating the results of our basic research into services and applications, which in turn give rise to new questions for research.

Research and Application

The service centre works closely with hessian.AI researchers, the public sector and Hessian companies in order to lower the barriers to the application of artificial intelligence with demonstrators and directly applicable models.

A wide range of education, training and service offerings also supports users in applying the research results provided.

Main research areas

As part of the 3rd wave of AI, the Hessian AI Service Centre is concentrating on

  • Large generalisable models
  • Transparency & explainability
  • Contextual adaptation
  • Utilisation of specific (network) structures

with the aim of developing and providing robust, secure and sustainable AI systems for a wide range of users.

Data sets, models and demonstrators

The following models have been developed, trained and made available as demonstrators by hessian.AI as part of the AI Service Centre’s research to date:

LeoLM – First open German Foundation Language Model

LeoLM (Linguistically Enhanced Open Language Model) is a high-quality bilingual (German / English) language model. It is based on the Llama-2 architecture and was trained and fine-tuned on an extensive, high-quality German text corpus.

LeoLM builds on Llama-2, which was pre-trained on 2 trillion primarily English-language tokens, and was additionally trained on 65 billion specifically filtered and deduplicated tokens from web texts of the OSCAR-2301 corpus. The model was then fine-tuned on six German and German-English datasets.

The quality of the model was further improved by using Flash Attention 2 for more efficient training and linear RoPE scaling to double the context length to 8k tokens.
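
The idea behind linear RoPE scaling can be sketched in a few lines (an illustrative hand-rolled example, not LeoLM's actual training code; function and parameter names are our own): position indices are divided by a scale factor, so a doubled context maps onto the rotation-angle range the model already saw during pre-training.

```python
def rope_angle(pos, pair_index, head_dim, base=10000.0, scale=1.0):
    # Rotation angle applied to one (query, key) dimension pair at a given
    # position. With linear scaling, the position index is divided by the
    # scale factor, so an 8k context reuses the angle range of 4k training.
    inv_freq = base ** (-2.0 * pair_index / head_dim)
    return (pos / scale) * inv_freq

# With scale=2, position 8000 receives exactly the angle that
# position 4000 had in the unscaled model.
```

This is why the context length can be doubled with comparatively little additional fine-tuning: the model never sees angles outside its original training range.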

Below you will find a detailed description of the model, the associated repositories and corresponding chat demonstrators.

The project was developed in cooperation between the Hessian AI Service Centre and LAION e.V., a non-profit association. Many thanks for the excellent cooperation and support.

Detailed description: LeoLM: Igniting German-Language LLM Research | LAION
Demonstrator 7b: Gradio
Repository 7b: LeoLM/leo-hessianai-7b · Hugging Face
Demonstrator 13b: LeoLM 13b Chat – a Hugging Face Space by LeoLM
Repository 13b: LeoLM/leo-hessianai-13b · Hugging Face

StripedHyena-7B – Long context LLM

The StripedHyena-7B model is based on the hybrid Hyena architecture, which combines multi-head grouped-query attention with gated convolutions arranged in Hyena blocks, and differs from conventional decoder-only Transformers.

This architecture enables

  • constant-memory decoding in Hyena blocks by representing convolutions as state-space models (modal or canonical form) or as truncated filters,
  • lower latency, faster decoding and higher throughput than Transformers,
  • improved training- and inference-optimal scaling laws compared to optimised Transformer architectures such as Llama-2, and
  • the processing of longer prompts, thanks to training on sequences of up to 32k tokens.

This makes StripedHyena 7B the first alternative model that is competitive with the best open source transformers in short and long context evaluations.
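
The constant-memory decoding idea can be illustrated with a toy diagonal state-space model (a hand-written sketch with made-up numbers, not the StripedHyena code): a long convolution filter h[k] = Σᵢ Cᵢ·Aᵢᵏ·Bᵢ is replaced by a fixed-size recurrent state, so each decoding step costs the same regardless of how long the sequence already is.

```python
def ssm_decode(u, A, B, C):
    # Diagonal (modal-form) state-space model: the state has a fixed size,
    # so memory per decoding step is constant in the sequence length.
    state = [0.0] * len(A)
    ys = []
    for u_t in u:
        state = [A[i] * state[i] + B[i] * u_t for i in range(len(A))]
        ys.append(sum(C[i] * state[i] for i in range(len(A))))
    return ys

def conv_decode(u, A, B, C):
    # The equivalent explicit convolution with filter h[k] = sum_i C_i*A_i^k*B_i;
    # this touches the whole history, which the recurrence above avoids.
    h = [sum(C[i] * A[i] ** k * B[i] for i in range(len(A))) for k in range(len(u))]
    return [sum(h[k] * u[t - k] for k in range(t + 1)) for t in range(len(u))]

# Toy input and toy two-dimensional state (values chosen arbitrarily):
u = [1.0, 0.5, -0.25, 2.0]
A, B, C = [0.9, -0.5], [1.0, 0.3], [0.7, 1.1]
```

Both functions compute the same outputs; only the recurrent form keeps memory constant during decoding.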

The project was developed in cooperation between the Hessian AI Service Centre and Together Computer Inc. Many thanks for the excellent cooperation and support.

Detailed description: Paving the way to efficient architectures: StripedHyena-7B, open source models offering a glimpse into a world beyond Transformers
Repository 7B Foundation model: togethercomputer/StripedHyena-Hessian-7B · Hugging Face
Repository 7B Chat model: togethercomputer/StripedHyena-Nous-7B · Hugging Face

Occiglot-7B-EU5 – Multilingual European LLM

Occiglot-7B-EU5 is a generative language model with 7 billion parameters that supports the five largest EU languages (English, Spanish, French, German and Italian). It is based on Mistral-7B-v0.1 and was further trained on 293 billion tokens of additional multilingual and code data with a block size of 8,192 tokens per sample.
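
One common way such fixed-size training samples are produced, shown here as a generic sketch rather than Occiglot's actual data pipeline, is to concatenate the tokenised documents and cut the resulting stream into blocks of the chosen size:

```python
def pack_into_blocks(documents, block_size=8192, eos_token=0):
    # Concatenate tokenised documents (separated by an end-of-sequence
    # token) and slice the stream into fixed-size training samples.
    stream = []
    for doc in documents:
        stream.extend(doc)
        stream.append(eos_token)
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]

# With a toy block size of 4, a 7-token stream yields one full block;
# the trailing partial block is dropped (one common choice).
```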

The model is a general base model that has been neither instruction-tuned nor optimised for chat or other applications.

The model was trained in cooperation between the German Research Centre for Artificial Intelligence (DFKI) and the Hessian AI Service Centre.

Repository: occiglot/occiglot-7b-eu5 · Hugging Face

Current research activities

Our doctoral students are researching:

Intelligent data documentation and data cleansing [Lukas Helff]

The quality and size of training datasets is crucial for the successful development of modern AI applications. Data documentation and data cleansing play a central role here, especially with the emergence of popular models such as GPT-4, which are gaining traction in various fields. As these AI systems become more autonomous, their social, scientific and industrial impact expands, necessitating high-quality data to avoid bias and stereotypes.

The manual annotation of large data sets is not only error-prone, but also tedious and requires a lot of human resources. Intelligent data documentation and data cleansing are therefore key solutions to these challenges, with the aim of optimising the preparation of high-quality datasets for AI applications.

In this context, this project focuses on the potential of machines to assist in the documentation of potentially inappropriate content by utilising the knowledge stored in Transformer models. This could significantly reduce the amount of human labour involved in data preparation.

The aim is to develop intelligent data documentation and data cleansing that can be offered as services for recognising inappropriate content. Planned steps include training documentation models for images, extending them to text and tabular data, and automatically documenting mixed data. Generative and axiomatic alignment of the documentation models, together with their provision as a service, will ensure practical suitability and marketability.

The results of this research project are used in the AI Service Centre as modules for data documentation, cleansing and quality assurance.

Adaptation of large (vision) language models [Christopher Tauchmann]

This project focuses on adapting large (vision) language models to downstream tasks as well as to more general requirements. To achieve this goal, we are pursuing several lines of research:

A particular interest lies in the use of modular and parameter-efficient transfer learning methods. Such methods update only a fraction of the parameters of a model, add a small number of trainable parameters or reuse existing parameters. Other methods learn smaller models from larger models or combine several modules.

In this vein, various prompting techniques are very promising for adapting models on the fly, i.e. analysing and exploiting in-context learning capabilities (where a model learns from examples in the prompt), or, with instruction tuning as a related technique, for tuning models to specific requirements. Retrieval-augmented generation, in turn, uses external knowledge to extend the capabilities of a model.
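
A minimal sketch of how an in-context (few-shot) prompt is assembled; the "Input/Output" format is one common convention, not tied to any particular model:

```python
def build_few_shot_prompt(examples, query, instruction=None):
    # Each (input, output) pair becomes a demonstration; the model is
    # expected to continue the pattern for the final, unanswered input.
    parts = [instruction] if instruction else []
    for x, y in examples:
        parts.append(f"Input: {x}\nOutput: {y}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    [("gut", "good"), ("schlecht", "bad")],
    "freundlich",
    instruction="Translate German to English.",
)
```

The model's weights stay untouched; adaptation happens entirely through the examples in the prompt.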

In addition, we dive deep into model architectures, i.e. we interpret and edit model-internal representations. This can be done, for example, by tracing the flow of information or analysing individual model components. Such an approach views the modules of the model as part of a stream that can be traced back to the input at any point, with the processing of the stream at certain points leading to measurable results.
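
The "stream" view can be sketched with a toy residual network (a hand-rolled illustration, not a real interpretability toolkit): each block adds its contribution back onto a running stream, so recording the stream after every block lets one trace where information enters and how it is transformed.

```python
def trace_residual_stream(blocks, x):
    # Residual architecture: each block reads the stream and adds its
    # output back, so the stream after block k equals the input plus the
    # sum of all block contributions up to k.
    trace = [x]
    for block in blocks:
        x = x + block(x)
        trace.append(x)
    return trace

# Toy "blocks" acting on a scalar stream (real blocks act on vectors):
trace = trace_residual_stream([lambda s: 0.5 * s, lambda s: -0.25 * s], 2.0)
# trace[0] is the raw input; each later entry exposes one block's effect.
```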

We find it interesting to see which tasks benefit from which of these approaches. Since today's very large models already perform well on a large number of tasks, in the case of language-only models we are particularly interested in their reasoning capabilities on most, if not all, of the more traditional NLP tasks.

AI hardware for Data Processing [Nils Boeschen]

This project investigates how modern AI hardware (e.g. GPUs) can be used to accelerate data processing tasks with intensive hard disk and network access.

GPUs are powerful computing units for many data- and compute-intensive workloads and can outperform CPUs by orders of magnitude on such tasks. For this reason, we are investigating how the speed advantages of GPUs can be exploited for data processing on modern storage and over fast networks.

The results so far show that GPUs can use techniques such as heavyweight decompression and pruning to achieve a significant increase in data load and processing bandwidth that is currently unattainable for CPU systems.
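
Pruning in this sense can be illustrated with a simple zone-map sketch (plain Python standing in for the project's GPU implementation): per-block minimum/maximum statistics let a scan skip blocks that cannot contain matches, which raises the effective processing bandwidth.

```python
def scan_with_pruning(blocks, lo, hi):
    # blocks: list of (block_min, block_max, values). The min/max "zone
    # map" lets the scan skip whole blocks without touching their data.
    result, blocks_read = [], 0
    for bmin, bmax, values in blocks:
        if bmax < lo or bmin > hi:
            continue  # pruned: no value in this block can match
        blocks_read += 1
        result.extend(v for v in values if lo <= v <= hi)
    return result, blocks_read

data = [(0, 9, [1, 5, 9]), (10, 19, [12, 17]), (20, 29, [21, 25])]
# Only the middle block overlaps the range [10, 15]; the others are skipped.
```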

Scalable Data Processing using Tensor Runtimes [Nils Boeschen]

This project investigates how current tensor frameworks such as PyTorch and TensorFlow can be used as platforms for distributed query processing. These frameworks are attractive as universal query-processing engines, as they support a variety of hardware types (CPUs, GPUs, TPUs, etc.), data formats and data operations out of the box.

It has been shown that, in single-node setups, SQL queries can be converted into a series of tensor operations and executed on tensor runtimes with comparatively high performance. However, it is still unclear whether these advantages can also be realised in a distributed setting.
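
The single-node mapping can be sketched as follows (plain Python lists stand in for tensors here; the actual work targets frameworks such as PyTorch): a WHERE clause becomes an elementwise comparison producing a mask, and an aggregate becomes a masked reduction.

```python
def select_sum_where_gt(x, threshold):
    # SELECT SUM(x) FROM t WHERE x > threshold, as tensor-style operations:
    # WHERE x > threshold  ->  elementwise comparison yielding a 0/1 mask
    mask = [1.0 if v > threshold else 0.0 for v in x]
    # SELECT SUM(x)        ->  elementwise multiply followed by a reduction
    return sum(v * m for v, m in zip(x, mask))
```

Every step is a bulk operation over whole arrays, which is exactly the shape of computation tensor runtimes accelerate.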

Since distributed query processing requires efficient key-based network shuffling, the overlapping of network and computational operations, and the handling of skew, the transformation is not trivial. In this line of research, we investigate how to transform distributed queries so that they can be efficiently executed via Tensor frameworks with the same benefits.
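
Key-based shuffling itself is conceptually simple; here is a hedged single-process sketch (the difficulty the project addresses lies in overlapping such shuffles with computation and handling skewed key distributions):

```python
def hash_partition(rows, key_index, n_workers):
    # Route each row to the worker that owns its key's hash bucket, so
    # all rows with equal keys meet on the same node (e.g. for joins).
    partitions = [[] for _ in range(n_workers)]
    for row in rows:
        partitions[hash(row[key_index]) % n_workers].append(row)
    return partitions

rows = [("a", 1), ("b", 2), ("a", 3)]
parts = hash_partition(rows, 0, 4)
# Both "a" rows are guaranteed to land in the same partition.
```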

Code Transformers [Mert Tiftikci]

The Code Transformers project focuses on understanding and improving generative AI models for program code. The advanced code models created can assist developers in their tasks, much like a colleague in pair programming. It is not enough to produce compilable and easily readable code: the code must also solve the problem posed by the developer and adhere to industry standards, even for newly introduced projects and libraries. This can be achieved if AI models understand code both syntactically and semantically.

Current high-performance models have black box structures that learn from huge amounts of data generated by collecting code in the “wild”. Such datasets contain vulnerable code or even malware. It is also worth noting that many of these models learn from these datasets by sequentially predicting the completion of a particular code, ignoring the rich structure it contains.

The Code Transformers project aims to develop efficient code models by first investigating the depth of their understanding and their limitations. It will then design customisable and modular architectures that can use techniques such as multimodal training or neurosymbolic learning. Such models can utilise the rich metadata associated with code, adapt to user preferences and are more trustworthy, as they can provide explanations for their generations.

Use of structure and multimodality in transformer models [Falko Helm]

The research project “Structure and Multimodality for Transformer Networks” is about dealing with documents such as PDF and XML files (e.g. MS Word). 
In the past, language understanding systems discarded everything but the plain text when processing a document. The goal of this project is to work directly with raw documents without any preprocessing. This includes considering the different modalities (text, images, tables, charts) as well as their interplay in the form of layout and explicit links.

To provide some background, let’s dissect the project title word by word. “Multimodality” means that there are non-text elements present in the document. For example, these can be images, tables or charts. The modalities video and audio are not considered because they are quite rare in business documents. Importantly, multimodality means that the modalities are interleaved and their interplay is complex, i.e. we will go beyond simple pairs of image and corresponding text caption.

The term “structure” refers to everything which goes beyond text as a plain sequence of characters. Structure can manifest itself via line breaks in a poem, chapters in a book or columns in a newspaper. Further, “structure” also describes the relations between text and non-text elements of a document. These can be implicit in the spatial arrangement or explicit via references to tables and charts.

A “Transformer” is a special kind of neural network (i.e. a model that can learn from data) and can be used for analysis and prediction. In the language domain, most recent breakthroughs (e.g. ChatGPT) were mainly due to this architecture. Currently, the vanilla Transformer pushes hardware to its limits, as its key component, so-called self-attention, scales poorly with sequence length. Thus, we also want to look into recent attention-free models, e.g. state-space models such as Mamba.

The goal of the project is to provide an easy-to-use code framework for understanding multimodal content.

Support for non-standard DNNs [Tim Noack]

Despite their widespread use, conventional neural networks struggle to model uncertainty and require extensive training data. These limitations can be problematic if, for example, their predictions need to be understandable. Probabilistic circuits (PCs) offer a compelling alternative to neural networks.

Sum-product networks (SPNs), an important member of the PC family, are characterised by efficient inference, the ability to be trained with relatively little data and few hyperparameters, and the ability to model the uncertainty of their predictions. However, executing SPNs on computing platforms such as CPUs, GPUs and FPGAs is a challenge: due to their sparse connections and convergence to a single output node, these networks do not map well onto massively parallel architectures.
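
To make the structure concrete, here is a tiny hand-written SPN over two binary variables (toy weights, purely illustrative): leaves are univariate distributions, product nodes combine disjoint scopes, and the root sum node forms a weighted mixture. Note how evaluation funnels into a single output node, which is exactly what makes massive parallelisation awkward.

```python
def spn_eval(x0, x1):
    # Leaf distributions for two mixture components (toy parameters):
    p0_a = 0.8 if x0 else 0.2
    p1_a = 0.3 if x1 else 0.7
    p0_b = 0.4 if x0 else 0.6
    p1_b = 0.9 if x1 else 0.1
    # Product nodes multiply leaves over disjoint scopes; the root sum
    # node mixes the two components with weights summing to one.
    return 0.6 * (p0_a * p1_a) + 0.4 * (p0_b * p1_b)

# Because the structure is valid (complete and decomposable), the
# probabilities of all assignments sum to exactly 1.
```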

This project addresses these issues by developing an MLIR-based SPNC compiler. The SPNC compiler aims to bridge the gap between advanced hardware architectures and the accessibility of these powerful models to AI practitioners without hardware knowledge.

This integration maximises the benefits of the hardware and minimises the learning curve for AI researchers and developers. By efficiently computing SPNs on specialised hardware, such as FPGAs and IPUs, our project opens up new avenues for AI applications.

Current application projects

ETH Text-to-Music

In cooperation with ETH Zurich, a text-to-music diffusion model with 1 to 3 billion parameters is being trained, which generates pieces of music based on text prompts and parameters such as tempo and keywords.

A high-quality partial dataset from the FMA music dataset and the Jamendo royalty-free music dataset serves as the basis for training the model.

As a result of the joint project, the trained model and the training data will be published under an open source licence.

Cooperation partner: Luca Lanzendörfer, Distributed Computing Group, ETH Zurich

Haris’ MorphPiece Tokenisation

In cooperation with Haris Jabbar, we are validating and evaluating a novel tokenisation scheme for the English language that integrates morph-based segmentation with byte pair encoding to improve NLP models.

The goal is to test a more linguistically oriented tokenisation mechanism that improves the model’s understanding of linguistic nuances and thus the model’s performance. At the same time, it should support the more efficient handling of languages with rich morphology (compared to methods relying only on statistical analyses).
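
As an illustration of the general idea (a hedged sketch of morph-based segmentation with a subword fallback; MorphPiece's actual algorithm may differ), a tokeniser can first try to cover a word with known morphs and only fall back to statistical subword splitting, such as byte pair encoding, when no full cover exists:

```python
def morph_segment(word, morphs):
    # Greedy longest-match segmentation into known morphs; returns None
    # if the word cannot be fully covered, signalling that the caller
    # should fall back to a statistical scheme such as BPE.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in morphs:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return None
    return pieces

# Toy morph lexicon (illustrative entries, not MorphPiece's vocabulary):
morphs = {"un", "happi", "ness", "friend", "ly"}
```

Segmenting along morph boundaries keeps linguistically meaningful units intact, which is what a purely frequency-driven tokeniser cannot guarantee.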