Group photo with the participants of the joint project “DataHub Europe”, which was presented on October 21 as part of the Digital Summit in Frankfurt.

“DataHub Europe”: an AI platform with hessian.AI participation

The Hessian Center for Artificial Intelligence plays a key role in the Europe-wide “DataHub Europe” project. The platform brings together companies such as Schwarz Digits, the IT and digital division of the Schwarz Group, and Deutsche Bahn AG with public institutions and research institutes (including DFKI, TU Darmstadt, and hessian.AI) to develop AI models under the highest standards of data protection and security. The aim is to create trustworthy AI solutions for the European market and, at the same time, to strengthen Europe’s digital sovereignty.

What is “DataHub Europe” about?

“DataHub Europe” is an innovative platform that collects, processes, and makes available high-quality data from various sectors, such as industry and media. This data enables partners to train AI models in a secure infrastructure and to adapt them to specific use cases. Compliance with EU-wide regulations such as the GDPR and the AI Act ensures that the data is used in a legally and ethically sound manner.

Data quality testing by hessian.AI and TU Darmstadt

AI researchers from Darmstadt have been involved in the development of “DataHub Europe” from the very beginning. Dr. Patrick Schramowski and Manuel Brack (hessian.AI / TU Darmstadt / DFKI) play a leading role in its implementation.

Project manager Simon Schampijer and his team members Ashal Ashal and Lev Dadashev from the AI Innovation Lab | hessian.AI, together with the DFKI (German Research Center for Artificial Intelligence), evaluated the quality of the training data provided by the media partners Frankfurter Allgemeine Zeitung and DvH Medien (Handelsblatt). The evaluation covered competence, language comprehension, and breadth of knowledge.
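
The article does not spell out how these three dimensions were measured. As a rough illustration of one common proxy for text-data quality, the sketch below scores samples by their perplexity under a small pretrained reference language model; the model choice (“gpt2”) and the sample texts are placeholders, not the project’s actual evaluation setup.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder reference model; the project's actual evaluation models are not public.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model (lower = more fluent)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())


# Hypothetical samples standing in for curated newspaper text vs. raw web text.
curated = "The central bank raised interest rates by a quarter point on Thursday."
raw_web = "click HERE now!!! best deals best deals best deals free free free"

print(f"curated text: {perplexity(curated):.1f}")
print(f"raw web text: {perplexity(raw_web):.1f}")
```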

The results showed that the overall quality of the data was comparable to the high standard set by Wikipedia data (the quality benchmark for Common Crawl); the newspaper data itself was not part of Common Crawl. To have a greater impact, the data would need to be more diverse (see diversity benchmarks). Another finding was that the OSCAR pipeline (which is also used in Occiglot LLM training) makes a significant contribution to the pre-processing of Common Crawl data, which underlines the importance of data curation.
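
The real OSCAR pipeline is a full toolchain; the following is only a minimal Python sketch of the kind of heuristic document filters that web-corpus curation typically applies to raw Common Crawl text (length, character composition, repetition), with purely illustrative thresholds.

```python
def keep_document(text: str) -> bool:
    """Toy heuristic filter in the spirit of web-corpus curation pipelines.

    Thresholds are illustrative, not the values any real pipeline uses.
    """
    words = text.split()

    # 1. Drop very short fragments (navigation menus, captions, footers).
    if len(words) < 20:
        return False

    # 2. Require mostly alphabetic content (filters markup dumps, tables, spam).
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.8:
        return False

    # 3. Reject heavily repetitive text (boilerplate, SEO keyword stuffing).
    unique_ratio = len({w.lower() for w in words}) / len(words)
    if unique_ratio < 0.3:
        return False

    return True


docs = [
    "Home | Login | Contact",               # too short -> dropped
    "buy cheap buy cheap buy cheap " * 10,  # repetitive -> dropped
    "The committee met on Tuesday to discuss the draft budget for the "
    "coming fiscal year and agreed on a revised timetable for the reform.",
]
print([keep_document(d) for d in docs])     # [False, False, True]
```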

A particular highlight of the project is the use of the AI supercomputer “fortytwo”, which is operated by hessian.AI. The Darmstadt AI experts carried out both the continued training of the models and the data evaluation on the high-performance computer. Innovative data processing tools from the Occiglot initiative were also used. Prof. Dr. Kristian Kersting, co-director of hessian.AI, emphasizes:

“With the DataHub, we can significantly increase the performance of large language models in German-speaking countries and at the same time ensure that these models are developed on a legally compliant basis.”

A catalyst for European AI innovation

DataHub Europe promotes the development of powerful, trustworthy, and data-secure AI applications. Companies such as Deutsche Bahn and Schwarz Digits are already using the platform to develop AI solutions like “AuditGPT”, which increases the efficiency of corporate audits.

With its participation, hessian.AI underlines the importance of cooperation in European AI research and makes a significant contribution to unleashing the full potential of “AI made in Europe”.

Image: BMDV / Sebastian Woithe