Computing infrastructure

The AI Service Center offers a high-performance computing (HPC) infrastructure for training and developing AI models that is unique in Hesse. In addition to scalable computing power of up to 632 A100 GPUs, the computing cluster offers non-mainstream hardware for the research and development of specialized AI solutions. For example, 4 Graphcore Bow-2000 nodes and an Nvidia Developer Toolkit are integrated into the HPC cluster.

We continuously expand our computing cluster to sustain and strengthen Germany’s sovereignty as a hub for artificial intelligence in the future.

That way, even large models can be trained, and efficient proofs of concept as well as larger projects can be realized directly on site as part of our services.

HPC Cluster with 632 A100-80GB-SXM GPUs

HPC Cluster with 79 Apollo 6500 servers, each with

2x AMD EPYC 7313 3.0GHz 16-core
8x NVIDIA HGX A100 80GB GPU with NVLink (SXM card)
32x HPE 64GB Dual Rank x4 DDR4-3200 (= 2,048 GB)

HPC Graphcore Node with

2x AMD EPYC 7713 2.0GHz 64-core
4x Graphcore Bow-2000 nodes
16x HPE 32GB Dual Rank x4 DDR4-3200 (= 512 GB)

Parallel File Storage System with

1,251 TB of usable capacity
192 GB/s read and 152.3 GB/s write
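
The headline figures follow directly from the per-server specifications listed above. The short Python sketch below reproduces that arithmetic; the variable names are illustrative, and the aggregate GPU memory and aggregate host RAM are derived values that are not stated in the list itself.

```python
# Figures copied from the hardware list above.
GPU_SERVERS = 79
GPUS_PER_SERVER = 8
GPU_MEMORY_GB = 80
DIMMS_PER_GPU_SERVER = 32
DIMM_SIZE_GPU_SERVER_GB = 64

DIMMS_GRAPHCORE_NODE = 16
DIMM_SIZE_GRAPHCORE_NODE_GB = 32

# Derived values.
total_gpus = GPU_SERVERS * GPUS_PER_SERVER
ram_per_gpu_server_gb = DIMMS_PER_GPU_SERVER * DIMM_SIZE_GPU_SERVER_GB
total_gpu_memory_gb = total_gpus * GPU_MEMORY_GB
total_host_ram_gb = GPU_SERVERS * ram_per_gpu_server_gb
graphcore_node_ram_gb = DIMMS_GRAPHCORE_NODE * DIMM_SIZE_GRAPHCORE_NODE_GB

print(f"GPUs in the cluster:        {total_gpus}")
print(f"RAM per GPU server:         {ram_per_gpu_server_gb} GB")
print(f"GPU memory (all GPUs):      {total_gpu_memory_gb} GB")
print(f"Host RAM (all GPU servers): {total_host_ram_gb} GB")
print(f"RAM in the Graphcore node:  {graphcore_node_ram_gb} GB")
```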

Machine Learning Development Environment

Our cluster offers a unique interface for training and evaluating AI models. HPE’s Machine Learning Development Environment (MLDE) provides a standardized interface with a web GUI and a command-line interface for easy integration of our cluster into your development processes. MLDE reduces the complexity of training, lets you scale an experiment to up to 632 GPUs, and enables effortless collaboration between geographically distributed teams, all without major adaptations to the model code. For more insights into our development environment, take a look at our onboarding video.

Documentation & knowledge base

Below you will find the most important information about the efficient use and independent troubleshooting of our computing infrastructure, as well as a link to our service portal for more details.

Access to the cluster

Who can apply to use our computing power?

Our services are open to companies and institutions. Unfortunately, private individuals are not eligible. You may use our services either as part of a proof of concept sprint or a cooperative small, medium, or large project.

How do I apply to use the computing power?

The application for the use of computing power can be submitted via this link. Please select the application form “Apply for HPC Cluster”.

What are the requirements for a successful project application?

For a small project, all you need is a fully developed project description with scientific added value and an appropriate project team. Medium and large projects require a prior proof of concept and/or corresponding previous studies/publications.

Is the use of the computing power limited?

As part of the application for the use of computing power, please state the number of GPUs required and the expected project duration. A committee comprising three professors and two technical liaisons will review your application. The maximum number of GPUs that can be allocated depends on hardware availability. The project duration is limited to a maximum of 12 months.
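
Before submitting, it can help to sanity-check the requested GPU count and duration against the limits stated above. The following is a minimal sketch, not an official tool: the function name is hypothetical, a month is assumed to be 30 days, and the result is only an upper bound that assumes all GPUs run around the clock. The number of GPUs actually granted depends on hardware availability and on the committee's decision.

```python
MAX_GPUS = 632              # physical ceiling of the cluster
MAX_DURATION_MONTHS = 12    # maximum project duration stated above
HOURS_PER_MONTH = 30 * 24   # assumption: a month counted as 30 days


def gpu_hours_upper_bound(gpus: int, months: int) -> int:
    """Return the GPU-hours consumed if all GPUs ran non-stop for the project."""
    if not 1 <= gpus <= MAX_GPUS:
        raise ValueError(f"gpus must be between 1 and {MAX_GPUS}")
    if not 1 <= months <= MAX_DURATION_MONTHS:
        raise ValueError(f"months must be between 1 and {MAX_DURATION_MONTHS}")
    return gpus * months * HOURS_PER_MONTH


# Example: one full server (8 GPUs) for three months.
print(gpu_hours_upper_bound(8, 3))
```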

Using the cluster

How can I use the cluster and train models there?

Our onboarding video provides a detailed description of the first steps of using our cluster.

Please note: Once your application has been approved and your user account has been created on our cluster, you will have access to our knowledge database. Access is only available to active cluster users.

Terms of use and fees

Stay tuned for more information…

Machine Learning Development Environment

Do you have a collection of best practices?

Once your application has been approved and your user account has been created on our cluster, you will have access to our onboarding material. Access is only available to active cluster users.

How can I start experiments and JupyterLab sessions on the cluster?

Our Machine Learning Development Environment from Determined.ai provides the basis and interface for cluster utilization. The platform offers numerous functions and training options and lets you scale the training of AI models across many GPUs.

Detailed documentation of the interfaces and functions is available at https://docs.determined.ai/latest/.
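
To give a rough idea of what an experiment definition looks like, the sketch below builds a minimal experiment configuration as a Python dictionary and prints it as JSON. Because JSON is also valid YAML, the output can be saved as a configuration file. The field names (`name`, `entrypoint`, `resources.slots_per_trial`, `searcher`) follow the Determined experiment configuration reference as we understand it; the helper function, the experiment name, the training script `train.py`, and the metric name are hypothetical placeholders. Please check the linked documentation for the version installed on the cluster, since required fields differ between releases.

```python
import json

CLUSTER_GPUS = 632  # upper bound on slots, taken from the cluster description


def build_experiment_config(name, entrypoint, slots_per_trial,
                            metric="validation_loss"):
    """Build a minimal single-trial experiment configuration (a sketch)."""
    if not 1 <= slots_per_trial <= CLUSTER_GPUS:
        raise ValueError(f"slots_per_trial must be between 1 and {CLUSTER_GPUS}")
    return {
        "name": name,
        "entrypoint": entrypoint,
        # Number of GPUs ("slots") used by each trial.
        "resources": {"slots_per_trial": slots_per_trial},
        "searcher": {
            "name": "single",   # one trial, no hyperparameter search
            "metric": metric,   # assumed metric name reported by the training code
            # Assumption: some releases require max_length, others no longer use it.
            "max_length": {"batches": 1000},
        },
    }


config = build_experiment_config("demo-experiment", "python3 train.py", 8)
print(json.dumps(config, indent=2))
```

With the output saved as, for example, `experiment.yaml`, an experiment is typically submitted from the directory containing the model code with `det experiment create experiment.yaml .`; the exact workflow on our cluster is described in the onboarding material.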

What do I do if I have questions, or something doesn’t work?

Please use our service portal if you have any questions, problems, or uncertainties. The service portal serves as a central hub for answering your questions and providing support from our team of experts.

The service portal is available at: https://hessian-ai.atlassian.net/servicedesk/customer/portal/3

Please note: Once your application has been approved and your user account has been created on our cluster, you will have access to our service portal. Access is only available to active cluster users.
