Continued Pretraining of State-of-the-Art LLMs for Sovereign AI and Regulated Industries with iGenius and NVIDIA DGX Cloud

In recent years, large language models (LLMs) have achieved extraordinary progress in areas such as reasoning, code generation, machine translation, and summarization. However, despite their advanced capabilities, foundation models have limitations when it comes to domain-specific expertise, such as finance or healthcare, and to capturing cultural and language nuances beyond English.

These limitations can be overcome with further development using continued pretraining (CPT), instruction fine-tuning, and retrieval-augmented generation (RAG). This requires high-quality, domain-specific datasets, a robust AI platform (software and hardware stack), and advanced AI expertise.

iGenius

iGenius is an Italian technology company specializing in artificial intelligence for enterprises operating in highly regulated sectors, such as financial services and public administration. iGenius operates between Europe and the United States to put AI at the service of people and businesses. It was founded in 2016 with the mission to humanize data and democratize business knowledge.

iGenius, an NVIDIA Inception partner, aimed to develop a state-of-the-art foundational LLM within a tight timeline but faced challenges in accessing large-scale GPU clusters (thousands of GPUs) and securing support for highly scalable training frameworks. During this engagement, iGenius developed Colosseum 355B, an LLM designed and developed for highly regulated environments, which gives businesses confidence in the accuracy of the model's output and in its security, knowing that none of their information or IP is ever compromised.

NVIDIA DGX Cloud enables customers to access large clusters designed for high-performance AI training with enterprise-grade software and NVIDIA AI expertise. As a result, iGenius chose to collaborate with NVIDIA to accelerate their LLM development for Colosseum 355B.

In less than one week, iGenius had access to dedicated large-scale infrastructure tuned for AI workloads with over 3K GPUs, and within two months, iGenius had completed continued pretraining for their largest LLM, Colosseum 355B. The work included the following:

  • Increasing the number of parameters
  • Increasing context length
  • Achieving CPT in FP8
  • Aligning the model’s capabilities for domain-specific expertise

Colosseum 355B’s capabilities 

As a key use case in the context of agentic AI, iGenius develops LLMs to power their business intelligence agent, Crystal, a sovereign AI solution. By building an end-to-end stack, iGenius provides a secure experience without relying on centralized models:

  • Database integration
  • AI-assisted configuration
  • LLM-powered orchestration for tool usage, query execution, and generation
  • Private deployment infrastructure

This approach enables Crystal to function as an isolated AI operating system, using an orchestrator to manage tasks effectively and integrate specialized tools. By using their own foundational LLMs, iGenius ensures greater control over data privacy, customization, and performance, tailoring the AI to meet specific business needs in highly regulated environments.

DGX Cloud environment 

Improving LLM reasoning capabilities requires a robust, distributed hardware and software solution, where accelerated compute, networking, storage, and libraries must seamlessly work together. Any bottleneck in the system can significantly slow or even stop the entire training process.

Building a high-performance AI training infrastructure for Colosseum 355B requires significant technical expertise and demands time for standup, setup, and validation of the system.

NVIDIA DGX SuperPOD removes this risk and complexity by providing a fully optimized solution that is designed, built, and validated by NVIDIA before a ready-to-go system is handed over to the customer.

However, for customers that require immediate access to AI-optimized infrastructure, NVIDIA DGX Cloud makes this type of environment accessible within the environments of key NVIDIA cloud service provider (CSP) partners. Close collaboration with CSP partners such as Microsoft Azure, Google Cloud Platform (GCP), OCI, and AWS enables NVIDIA to build large contiguous blocks of AI-focused infrastructure, fully tested and validated for the NVIDIA AI Enterprise software suite. This close-knit collaboration enables customers to immediately start large-scale training on a large cluster.

Finally, DGX Cloud engagements include access to NVIDIA AI expertise, which accelerates time-to-first-training runs and facilitates the resolution of any software or hardware blockers. During the iGenius project, the teams established several workstreams spanning data preparation, LLM training, and alignment through to model validation and inference optimization.

Within one week of signing up for NVIDIA DGX Cloud, iGenius had private access to a dedicated environment with over 3K NVIDIA H100 GPUs.

iGenius dataset highlights 

In the context of CPT, it is essential to preserve a substantial portion of the original training dataset to mitigate significant distributional shifts in the data, which could lead to training instabilities or exacerbate issues such as catastrophic forgetting.

Given that the training datasets for LLMs are predominantly composed of web documents and open-source repositories such as ArXiv, PubMed Central, GitHub, and similar sources, iGenius opted to construct a CPT dataset that preserves a comparable distribution of coding and multilingual tokens, ensuring consistency with the original dataset’s composition.

The multilingual capabilities of the model extend to over 50 languages, with a particular emphasis on European languages such as Italian, German, French, Spanish, Portuguese, Russian, Romanian, and Polish. The training dataset also includes a robust representation of non-European languages, including Japanese, Chinese, Arabic, Vietnamese, and Korean.

Colosseum 355B incorporates specialized sources from domains such as finance and reasoning, drawing from high-quality domain-specific datasets to enhance its performance in these areas.

In total, the CPT dataset consists of an extensive collection of approximately 2.5T tokens, for a total of roughly 10T tokens when combined with the 8T tokens used for the base model.

In contrast, the dataset used for supervised fine-tuning (SFT) consists of approximately 1M samples curated to align with specific downstream tasks and objectives, such as problem-solving, factual recall, analytical reasoning, and coding questions.

LLM continued pretraining 

Improving an already state-of-the-art large language model like Colosseum 355B is no easy task, particularly when you are working with a model comprising hundreds of billions of parameters!

Improvements such as adding new knowledge, improving reasoning, and even expanding the overall model size require changes to every single parameter of the current model. Taking an already established model and improving it to this degree is referred to as continued pretraining. At this scale, it’s a task that only proficient model builders are likely to embark on.

iGenius embraced NVIDIA NeMo Framework, taking advantage of its latest training and optimization techniques. NeMo Framework exposes both model-specific and training hyperparameters through a simple YAML config file.

The process of efficiently training Colosseum 355B consisted of experimental exploration to find the best training configuration. The Model FLOP/s Utilization (MFU) metric quantifies how efficiently the GPUs are used during training, which affects the overall training time. MFU was a key metric that iGenius focused on improving.
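
As a rough illustration of how MFU can be computed, the following minimal Python sketch compares the model FLOP/s a training job delivers against the aggregate peak FLOP/s of the GPUs. It uses the common 6N-FLOPs-per-token approximation for dense transformers and an assumed dense BF16 peak for an H100 SXM GPU; the values in the example call are hypothetical, not measurements from the iGenius runs.

def estimate_mfu(num_params, tokens_per_step, step_time_s,
                 num_gpus, peak_flops_per_gpu=989e12):
    """Rough Model FLOP/s Utilization: achieved FLOP/s over aggregate peak FLOP/s."""
    flops_per_token = 6 * num_params              # forward + backward, ignoring attention terms
    achieved = flops_per_token * tokens_per_step / step_time_s
    peak = num_gpus * peak_flops_per_gpu          # assumed dense BF16 peak per H100 SXM GPU
    return achieved / peak

# Hypothetical example: 355B parameters, 1260 sequences of 16K tokens per step
print(estimate_mfu(355e9, 1260 * 16384, 60.0, 2880))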

The iGenius team started with a state-of-the-art foundation model with a 4K context length and its default training configuration, and worked on optimizing the training configuration to achieve the best MFU value. The default configuration distributed the model across 12 nodes (96 H100 GPUs) and achieved an MFU of 25% in BF16.

Figure 1. Training phases for the iGenius Colosseum 355B LLM: from the baseline, scale to 2,880 GPUs and identify optimal parameters; after the first phase, increase the sequence length and number of parameters; after the second phase, enable FP8

The first phase of pretraining focused on identifying the optimal training parameters to enhance the MFU. iGenius achieved this through several experiments involving short-duration training runs to explore the impact of each configuration on the training duration.

Some of the key experimentation consisted of reducing the model distribution to its minimal number of nodes, enabling iGenius to maximize computation per GPU. Specifically, the pipeline parallelism was reduced from 12 to 8, leading to a model distribution across 8 nodes (64 GPUs). Some NeMo communication-overlapping configurations were also crucial to accelerating the overall training. For more information, see Communication Overlap in the NVIDIA NeMo Framework User Guide.

The following code shows the key CPT parameters:

Global Batch Size: 2880
Micro Batch Size: 1
Context Parallel Size: 1
Tensor Parallelism: 8
Pipeline Parallelism: 8
Virtual Pipeline Parallelism: 12
Learning rate: [1e-5, 5e-6]
Sequence length: 4096
Checkpoint format: torch_dist
precision: bf16
 
# communication configurations
defer_embedding_wgrad_compute: True
wgrad_deferral_limit: 22 
cross_entropy_loss_fusion: True
enable_vboost: True
overlap_p2p_comm: True
batch_p2p_comm: False
ub_tp_comm_overlap: True
apply_rope_fusion: True
deterministic_mode: False

Using these parameters and model distribution, iGenius achieved an MFU of 40%, a significant improvement from the initial 25%. This marked improvement has a direct financial implication, enabling iGenius to complete more work in less time. This highlights the importance of exploring hyperparameters before initiating large-scale LLM training.

The second phase of pretraining focused on extending the foundational model to a size of 355B parameters by adding several layers and increasing the context length from 4K to 16K.

After several hyperparameter experiments training the extended model, the achieved MFU dropped from 40% to 33% due to the augmented sequence length and extra layers for Colosseum 355B.

As the model size increased, the best model distribution consisted of increasing the context parallel size from 1 to 4 and pipeline parallelism from 8 to 10. For more information, see Context Parallelism in the NVIDIA NeMo Framework User Guide.

This configuration resulted in a data parallel (DP) size of 9 over 360 nodes (2,880 H100 GPUs). The following code shows the key CPT parameters of Colosseum 355B in BF16:

Global Batch Size: 1260
Micro Batch Size: 1
Context Parallel Size: 4
Tensor Parallelism: 8
Pipeline Parallelism: 10
Virtual Pipeline Parallelism: 10
Validation check interval: 100
Learning rate: [1e-5, 5e-6]
Sequence length: 16384
Checkpoint format: torch_dist
precision: bf16
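
As a quick sanity check (a minimal sketch added here for illustration, not part of the NeMo configuration), the parallelism dimensions above multiply out to the cluster size stated in the text:

# tensor, pipeline, and context parallel sizes from the configuration above
tp, pp, cp = 8, 10, 4
dp = 9                                    # data parallel size reported in the text
gpus_per_node = 8

model_parallel_gpus = tp * pp * cp        # GPUs holding one model replica: 320
total_gpus = model_parallel_gpus * dp     # all replicas together: 2880
print(model_parallel_gpus, total_gpus, total_gpus // gpus_per_node)  # 320 2880 360

# The global batch size must be divisible by the data parallel size times the micro batch size
global_batch, micro_batch = 1260, 1
assert global_batch % (dp * micro_batch) == 0   # 140 micro-batches per replica per step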

Figure 2. Colosseum 355B base CPT validation loss on DGX Cloud, decreasing from above 1.455 to below 1.452 over 10K steps

The NVIDIA Hopper architecture, which is the foundation of the NVIDIA H100 GPU, includes hardware acceleration for 8-bit floating-point (FP8) operations. iGenius used FP8 in their third phase of pretraining to accelerate training and reduce the model’s memory footprint.

NeMo Framework integrates FP8 training out-of-the-box with the Transformer Engine library. Enabling FP8 can be done by adding the following parameters to the training configuration file:

transformer_engine: True
fp8: True
fp8_params: True
fp8_e4m3: False
fp8_hybrid: True
fp8_margin:
fp8_interval:
fp8_amax_history_len: 1024 
fp8_amax_compute_algo: max 
fp8_wgrad: True
ub_tp_comm_overlap: False

iGenius successfully continued pretraining in FP8, increasing the MFU from 33% with BF16 to 37% with FP8. In addition, the overall training step was accelerated by 1.15x with FP8. This speedup was obtained just by enabling FP8 and can be increased further by taking advantage of the memory savings from FP8: by tuning the parallelisms and micro batch size, you can better use the memory freed by FP8, resulting in greater speedups.

Figure 3. BF16 and FP8 training results using NVIDIA NeMo Framework; the loss drops for both precisions, with the colosseum355B_fp8 model showing the greatest decrease over the global steps

Changing the model representation from BF16 to FP8 during CPT requires careful consideration to avoid training divergence. To achieve FP8 training stability, iGenius explored various approaches, but the one that worked the best was reducing the learning rate at early signs of instability.

Other techniques considered included keeping specific layers of the model in the original BF16 precision, after running a histogram analysis of the tensors to detect the ones that would overflow or underflow in FP8.
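
The following minimal sketch illustrates that kind of histogram analysis (an illustrative example, not the iGenius tooling): it flags tensors with a meaningful fraction of values outside the FP8 E4M3 representable range, which are candidates to keep in BF16. It ignores the per-tensor scaling that Transformer Engine applies, so it is only a first-pass screen.

import torch

E4M3_MAX = 448.0      # largest finite E4M3 value
E4M3_MIN = 2 ** -9    # smallest positive (subnormal) E4M3 value

def fp8_risk_report(name, tensor, tol=1e-3):
    values = tensor.detach().float().abs()
    values = values[values > 0]
    if values.numel() == 0:
        return
    overflow = (values > E4M3_MAX).float().mean().item()
    underflow = (values < E4M3_MIN).float().mean().item()
    if overflow > tol or underflow > tol:
        print(f"{name}: {overflow:.2%} overflow, {underflow:.2%} underflow -> consider keeping in BF16")

# Usage sketch over a (hypothetical) model's parameters:
# for name, param in model.named_parameters():
#     fp8_risk_report(name, param)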

iGenius employed a variety of benchmarks to comprehensively assess the performance improvements to Colosseum 355B’s base model. Among these, iGenius prioritized the Massive Multitask Language Understanding (MMLU) benchmark due to its broad scope and general applicability across diverse subject areas.

By using MMLU, iGenius aimed to quantify the extent of knowledge retention and integration achieved through CPT, providing a robust measure of the improvements in the model’s alignment with general human knowledge and reasoning capabilities. At the end of training, iGenius was able to achieve 82.04% accuracy with Colosseum 355B in a 5-shot setting.
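
For readers unfamiliar with the 5-shot setting, the sketch below shows the typical way an MMLU prompt is assembled: five solved example questions from a subject precede the question under evaluation. This is an illustrative format only; the exact template and evaluation harness used by iGenius are not described in this post.

def format_mmlu_question(question, choices, answer=None):
    # Render one multiple-choice question, with the answer shown only for the demonstrations
    lines = [question] + [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_5shot_prompt(subject, shots, test_item):
    header = f"The following are multiple choice questions (with answers) about {subject}.\n"
    demos = "\n\n".join(format_mmlu_question(s["question"], s["choices"], s["answer"]) for s in shots[:5])
    query = format_mmlu_question(test_item["question"], test_item["choices"])
    return header + "\n" + demos + "\n\n" + query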

LLM alignment  

When an LLM has been trained, it has a general understanding of the languages in its dataset and of how words, paragraphs, and complex concepts relate to one another. However, the model has not yet learned how to carry out specific tasks, such as summarization or translation, or what a conversation looks like.

This next phase of model development focuses on that learning. There are many techniques available to model builders; iGenius focused on supervised fine-tuning and human preference alignment using Direct Preference Optimization (DPO).

Supervised fine-tuning 

Supervised fine-tuning (SFT) is a foundational step in aligning LLM outputs with user-defined behavior. SFT consists of refining the pretrained model’s parameters using labeled pairs of inputs and desired outputs.

Instruction tuning combines fine-tuning and prompting using instructions formulated in natural language, such as, “Summarize this article,” or “Translate to Italian.” SFT can be applied to this type of one-shot question answering or optimized for chat interactions, enabling models to answer in conversational settings.

iGenius used NVIDIA NeMo-Aligner for Colosseum 355B’s chat instruction fine-tuning. The chat data template uses the following structure:

{
  "system": "",
  "conversations": [
    {"from": "User", "value": "What’s the name of the main index on the Italian Stock Exchange?", "label": null},
    {"from": "Assistant", "value": "The main index on the Italian Stock Exchange is the FTSE MIB.", "label": ""}
  ],
  "mask": "User",
  "type": "VALUE_TO_TEXT"
}

iGenius ran SFT using NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py on 350 nodes. Here are the key settings for the SFT training recipe with respect to the iGenius data:

chat: True
chat_prompt_tokens:
  system_turn_start: <extra_id_0>
  turn_start: <extra_id_1>
  label_start: <extra_id_2>
  end_of_turn: "\x0A"
  end_of_name: "\x0A"
num_workers: 0
shuffle: True

iGenius conducted experiments with learning rates in the range of [1e-7, 5e-7], using a global batch size of 140. Notably, iGenius did not observe any significant changes in performance when employing either constant or annealing learning rate schedules.

To evaluate the effectiveness of SFT and alignment strategies, iGenius used the IFEval benchmark, which assesses a model’s ability to follow instructions and align with user intent. iGenius trained on different data mixtures for about one epoch, mainly relying on this benchmark to select the best checkpoint.

Figure 4. Colosseum 355B chat instruction SFT training loss; the loss curve decreases, indicating model alignment with the target SFT objectives

Human preferences alignment 

Following the SFT stage, iGenius used DPO to further refine the language model with human preferences, teaching it to favor preferred responses over rejected ones. iGenius ran DPO using NeMo-Aligner/examples/nlp/gpt/train_gpt_dpo.py on 350 nodes.

To optimize performance, rejected responses were generated using the best-performing SFT checkpoint. iGenius’ curated dataset excluded examples with minimal differences between the chosen and rejected responses, ensuring that only meaningful preferences were selected. Datasets were formatted with the SFT chat template, structuring both chosen and rejected responses as in the following example:

{
  "prompt": "<extra_id_0>System\n\n<extra_id_1>User\nWhich year was the Magna Carta signed?\n<extra_id_1>Assistant\n",
  "chosen_response": "1215\n<extra_id_1>",
  "rejected_response": "I refuse to answer this question.\n<extra_id_1>"
}
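
For reference, the following is a minimal sketch of the standard DPO objective that such chosen/rejected pairs feed into (the formulation from the original DPO paper; NeMo-Aligner’s actual implementation differs in detail).

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is the summed log-probability of a response under the policy
    or the frozen reference (SFT) model; beta controls how far the policy may drift."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()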

iGenius trained the model for approximately three epochs on roughly 100K samples and, similar to the SFT stage, relied on IFEval, among other benchmarks, to select the optimal checkpoint.

Challenges and best practices for building LLMs  

As training scales, minor issues become critical. Running the Colosseum 355B training job on 3K GPUs can take 15–20 minutes just to load the checkpoint, which has a footprint of roughly 5 TB.
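
A rough back-of-envelope calculation (an assumption for illustration, not a breakdown published by iGenius) shows why a 355B-parameter checkpoint lands in the multi-terabyte range once optimizer state is included, assuming BF16 weights plus an FP32 master copy and two FP32 Adam moments:

params = 355e9
bytes_per_param = 2 + 4 + 4 + 4           # bf16 weights + fp32 master copy + Adam first and second moments
print(params * bytes_per_param / 1e12)    # ~4.97 TB, consistent with the ~5 TB figure above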

A storage system that performs well with multiple small jobs across thousands of GPUs may struggle when a single workload requires all GPUs to read and write checkpoints simultaneously, causing delays and potential timeouts. Temporary network breaks can also result in training job failures.

The main challenge in this category is a flapping network link, one that repeatedly alternates between the up and down states. DGX Cloud manages these types of infrastructure complexities, enabling you to focus on your AI training goals.

Scaling also exposes previously undetected issues, requiring a rigorous approach to experiment tracking and debugging. Here are some best practices and lessons learned when running LLM training at scale:

  • Explore the basics at a reduced scale
  • Monitor effectively and track at scale

Explore the basics at reduced scale

Progressive scaling is key, enabling rapid experimentation, error identification, and saving time and resources.

  • Small debug model: Exploring large-scale training configurations is highly challenging. For instance, Colosseum 355B requires 80 H100 GPUs just to load a single instance. During this project, iGenius used a small 8B debug model (which fits on 2 GPUs) for fast exploration of configurations such as the checkpoint format and FP8 parameters, and to debug some NCCL issues they encountered.
  • End-to-end process testing: The LLM training process executes a loop of training steps, validation, saving, and resuming checkpoints. Lowering the checkpoint interval enables expediting the testing loop.
  • Robust checkpointing: Some checkpoint formats can be challenging to resume when the training distribution changes, while formats like torch_dist support resuming and experimenting with different parallelization layouts.
  • Expansion from minimal to large-scale distribution: Start testing execution with the minimal training distribution before gradually increasing the DP size.
  • Dataset testing: Given the extensive effort and complexity involved in data processing, it is essential to test a dataset at a small scale to identify either potential dataset preparation mistakes or sample corruption early in the training stage.

Monitor effectively and track at scale

With potentially hundreds or even thousands of nodes involved in the training process, it is vital to maintain observability over jobs, infrastructure health, and overall resource utilization so you can tune, adapt, and react accordingly, ensuring maximum utilization of your infrastructure.

  • Performance: Monitor MFU while scaling and adapt the hyperparameters accordingly.
  • Accurate experiment tracking: It is essential to record environment variables, model configurations, and scripts across runs to ensure reproducibility and help identify potential issues or improvements.
  • Infrastructure observability: Monitor system health and identify when resources are underused or pushed to their limits.
  • Predefined tests: Unhealthy nodes are inevitable, so it’s imperative to have predefined tests ready to run on any suspect nodes to pinpoint issues and enable remediation. Any new nodes added to the cluster should also run these tests to confirm overall health. It is prudent to have lightweight versions of these tests that can be run as part of the job’s Prolog and Epilog to increase the chances of job success, as in the sketch after this list.
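
The following is a minimal sketch of what such a lightweight per-node check could look like (an illustrative example only; the actual DGX Cloud and iGenius test suites are not shown in this post). It allocates a small matrix on each visible GPU, runs a matrix multiply, and reports any device that fails or produces non-finite values.

import torch

def quick_gpu_check():
    healthy = True
    for i in range(torch.cuda.device_count()):
        try:
            x = torch.randn(1024, 1024, device=f"cuda:{i}", dtype=torch.bfloat16)
            result = x @ x
            torch.cuda.synchronize(i)
            if not torch.isfinite(result).all():
                print(f"GPU {i}: non-finite values in matmul result")
                healthy = False
        except RuntimeError as err:
            print(f"GPU {i} failed: {err}")
            healthy = False
    return healthy

if __name__ == "__main__":
    # Exit nonzero so a Slurm Prolog/Epilog could drain the node on failure
    raise SystemExit(0 if quick_gpu_check() else 1)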

Summary

By using large-scale CPT and alignment on specific domains, iGenius built Colosseum 355B, a foundational LLM developed using NVIDIA DGX Cloud infrastructure and NVIDIA AI Enterprise software with NVIDIA NeMo Framework.

iGenius reduced computational costs and improved efficiency through CPT in FP8 precision. This approach not only enhanced baseline performance on crucial benchmarks like MMLU but also demonstrates iGenius’ capability to continuously improve foundational LLMs over time at a lower cost, underscoring iGenius’ commitment to providing sustainable solutions for its core use cases and clients.

Looking ahead, iGenius will continue to explore continual learning strategies to keep improving its models and adapt them to various business domains, ensuring sustained performance improvements and cost efficiency.

Colosseum 355B is also now available as an NVIDIA NIM microservice on the NVIDIA API Catalog. NIM microservices are designed to streamline and accelerate the deployment of generative AI models across NVIDIA-accelerated infrastructure anywhere, including cloud, data center, and workstations. NIM uses inference optimization engines, industry-standard APIs, and prebuilt containers to provide high-throughput AI inference that scales with demand.

Explore the Colosseum 355B NIM microservice.

Acknowledgements

Thanks to the following iGenius contributors: Michele Resta, Andrea Valenti, and Danilo Numeroso. Thanks also to NVIDIA contributor Oleg Sudakov.
