In recent years, large language models (LLMs) have achieved extraordinary progress in areas such as reasoning, code generation, machine translation, and summarization. However, despite their advanced capabilities, foundation models have limitations in domain-specific expertise, such as finance or healthcare, and in capturing cultural and language nuances beyond English.
These limitations can be overcome through further development using continued pretraining (CPT), instruction fine-tuning, and retrieval-augmented generation (RAG). This requires high-quality, domain-specific datasets, a robust AI platform (software and hardware stack), and advanced AI expertise.
iGenius
iGenius is an Italian technology company specializing in artificial intelligence for enterprises operating in highly regulated sectors, such as financial services and public administration. iGenius operates between Europe and the United States to put AI at the service of people and businesses. It was founded in 2016 with the mission to humanize data and democratize business knowledge.
iGenius, an NVIDIA Inception partner, aimed to develop a state-of-the-art foundational LLM within a tight timeline but faced challenges in accessing large-scale GPU clusters (thousands of GPUs) and securing support for highly scalable training frameworks. During this engagement, iGenius developed Colosseum 355B, an LLM designed for highly regulated environments that gives businesses confidence in both the accuracy of the model output and its security, knowing that none of their information or IP is ever compromised.
NVIDIA DGX Cloud enables customers to access large clusters designed for high-performance AI training with enterprise-grade software and NVIDIA AI expertise. As a result, iGenius chose to collaborate with NVIDIA to accelerate their LLM development for Colosseum 355B.
In less than one week, iGenius had access to dedicated large-scale infrastructure tuned for AI workloads with over 3K GPUs, and within two months, iGenius had completed continued pretraining for their largest LLM, Colosseum 355B. The work included the following:
- Increasing the number of parameters
- Increasing context length
- Achieving CPT in FP8
- Aligning the model’s capabilities for domain-specific expertise
Colosseum 355B’s capabilities
As a key use case in the context of agentic AI, iGenius develops LLMs to power their business intelligence agent, Crystal, a sovereign AI solution. By building an end-to-end stack, iGenius provides a secure experience without relying on centralized models:
- Database integration
- AI-assisted configuration
- LLM-powered orchestration for tool usage, query execution, and generation
- Private deployment infrastructure
This approach enables Crystal to function as an isolated AI operating system, using an orchestrator to manage tasks effectively and integrate specialized tools. By using their own foundational LLMs, iGenius ensures greater control over data privacy, customization, and performance, tailoring the AI to meet specific business needs in highly regulated environments.
DGX Cloud environment
Improving LLM reasoning capabilities requires a robust, distributed hardware and software solution, where accelerated compute, networking, storage, and libraries must seamlessly work together. Any bottleneck in the system can significantly slow or even stop the entire training process.
Building a high-performance AI training infrastructure for Colosseum 355B requires significant technical expertise and demands time for standup, setup, and validation of the system.
NVIDIA DGX SuperPOD removes this risk and complexity by providing a fully optimized solution that is designed, built, and validated by NVIDIA before a ready-to-go system is handed over to the customer.
However, for customers that require immediate access to AI-optimized infrastructure, NVIDIA DGX Cloud makes this type of environment accessible within the environments of key NVIDIA cloud service provider (CSP) partners. Close collaboration with CSP partners such as Microsoft Azure, Google Cloud Platform (GCP), OCI, and AWS enables NVIDIA to build large contiguous blocks of AI-focused infrastructure, fully tested and validated for the NVIDIA AI Enterprise software suite. This close-knit collaboration enables customers to immediately start large-scale training on a large cluster.
Finally, DGX Cloud engagements include access to NVIDIA AI expertise, which accelerates time to first training run and facilitates the resolution of any software or hardware blockers. During the iGenius project, the teams established several workstreams, from data preparation, LLM training, and alignment to model validation and inference optimization.
Within one week of signing up to NVIDIA DGX Cloud, iGenius had private access to an environment with over 3K NVIDIA H100 GPUs and the following resources:
- A dedicated, high-bandwidth, RDMA-based network to facilitate model training communications for Colosseum 355B
- 500 TB of Lustre-based high-performance storage
- Access to the latest NVIDIA NeMo Framework containers
iGenius dataset highlights
In the context of CPT, it is essential to preserve a substantial portion of the original training dataset to mitigate significant distributional shifts in the data, which could lead to training instabilities or exacerbate issues such as catastrophic forgetting.
Given that the training datasets for LLMs are predominantly composed of web documents and open-source repositories such as ArXiv, PubMed Central, GitHub, and similar sources, iGenius opted to construct a CPT dataset that preserves a comparable distribution of coding and multilingual tokens, ensuring consistency with the original dataset’s composition.
The multilingual capabilities of the model extend to over 50 languages, with a particular emphasis on European languages such as Italian, German, French, Spanish, Portuguese, Russian, Romanian, and Polish. The training dataset also includes a robust representation of non-European languages, including Japanese, Chinese, Arabic, Vietnamese, and Korean.
Colosseum 355B incorporates specialized sources from domains such as finance and reasoning, drawing from high-quality domain-specific datasets to enhance its performance in these areas.
In total, the CPT dataset comprises approximately 2.5T tokens which, combined with the 8T tokens of the base model, brings the overall training corpus to roughly 10T tokens.
In contrast, the dataset used for supervised fine-tuning (SFT) consists of approximately 1M samples curated to align with specific downstream tasks and objectives, such as problem-solving, factual recall, analytical reasoning, and coding questions.
LLM continued pretraining
Improving an already state-of-the-art large language model like Colosseum 355B is no easy task, particularly when you are working with a model comprising hundreds of billions of parameters!
Improvements such as new knowledge, better reasoning, and even expanding the overall model size require changes to every single parameter of the current model. Taking an already established model and improving it to this degree is referred to as continued pretraining. At this scale, it’s a task that only proficient model builders are likely to embark on.
iGenius embraced NVIDIA NeMo Framework, taking advantage of its latest training and optimization techniques. NeMo Framework exposes both model-specific and training hyperparameters through a simple YAML config file.
The process of efficiently training Colosseum 355B consisted of experimental exploration to find the best training configuration. The Model FLOP/s Utilization (MFU) metric quantifies how efficiently the GPUs are used during training, which affects the overall training time. MFU was a key metric that iGenius focused on improving.
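MFU compares the model FLOP/s actually achieved against the aggregate peak FLOP/s of the hardware. As a rough illustration (not iGenius’ exact accounting), it can be estimated with the common approximation of about 6 FLOPs per parameter per token for a combined forward and backward pass; the peak-throughput figure and example numbers below are assumptions for illustration only.

```python
# Minimal sketch of estimating MFU for a dense transformer.
# All numbers here are illustrative assumptions, not measured values.

def estimate_mfu(n_params: float, tokens_per_step: float, step_time_s: float,
                 n_gpus: int, peak_flops_per_gpu: float = 989e12) -> float:
    """Model FLOP/s Utilization = achieved model FLOP/s / aggregate peak FLOP/s.

    Uses the ~6 FLOPs per parameter per token approximation for the
    forward plus backward pass (attention FLOPs are ignored).
    """
    achieved_flops_per_s = 6.0 * n_params * tokens_per_step / step_time_s
    peak_flops_per_s = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s

# Hypothetical example: 100B parameters, 2M tokens per global batch,
# a 10-second step time, 512 H100 GPUs (~989 TFLOP/s dense BF16 each).
print(f"{estimate_mfu(100e9, 2e6, 10.0, 512):.1%}")
```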
The iGenius team started with a state-of-the-art foundation model with a 4K context length and its default training configuration, then worked on optimizing that configuration to achieve the best MFU value. The original configuration distributed the model across 12 nodes (96 H100 GPUs) and achieved an MFU of 25% in BF16.
The first phase of pretraining focused on identifying the optimal training parameters to enhance the MFU. iGenius achieved this through several experiments involving short-duration training runs to explore the impact of each configuration change on the training duration.
Some of the key experimentation consisted of reducing the model distribution to its minimal number of nodes, enabling iGenius to maximize computation per GPU. Specifically, pipeline parallelism was reduced from 12 to 8, leading to a model distribution across 8 nodes (64 GPUs). Several NeMo communication-overlap settings were also crucial for accelerating the overall training. For more information, see Communication Overlap in the NVIDIA NeMo Framework User Guide.
The following code shows the key CPT parameters:
```yaml
Global Batch Size: 2880
Micro Batch Size: 1
Context Parallel Size: 1
Tensor Parallelism: 8
Pipeline Parallelism: 8
Virtual Pipeline Parallelism: 12
Learning rate: [1e-5, 5e-6]
Sequence length: 4096
Checkpoint format: torch_dist
precision: bf16

# communication configurations
defer_embedding_wgrad_compute: True
wgrad_deferral_limit: 22
cross_entropy_loss_fusion: True
enable_vboost: True
overlap_p2p_comm: True
batch_p2p_comm: False
ub_tp_comm_overlap: True
apply_rope_fusion: True
deterministic_mode: False
```
Using these parameters and model distribution, iGenius achieved an MFU of 40%, a significant improvement from the initial 25%. This marked improvement has a direct financial implication, enabling iGenius to complete more work in less time. This highlights the importance of exploring hyperparameters before initiating large-scale LLM training.
The second phase of pretraining focused on extending the foundational model to a size of 355B parameters by adding several layers and increasing the context length from 4K to 16K.
After several hyperparameter experiments training the extended model, the achieved MFU dropped from 40% to 33% due to the increased sequence length and extra layers of Colosseum 355B.
As the model size increased, the best model distribution consisted of increasing the context parallel size from 1 to 4 and pipeline parallelism from 8 to 10. For more information, see Context Parallelism in the NVIDIA NeMo Framework User Guide.
This configuration resulted in a data parallel (DP) size of 9 over 360 nodes (2,880 H100 GPUs). The following code shows the key CPT parameters of Colosseum 355B in BF16:
```yaml
Global Batch Size: 1260
Micro Batch Size: 1
Context Parallel Size: 4
Tensor Parallelism: 8
Pipeline Parallelism: 10
Virtual Pipeline Parallelism: 10
Validation check interval: 100
Learning rate: [1e-5, 5e-6]
Sequence length: 16384
Checkpoint format: torch_dist
precision: bf16
```
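As a sanity check, the parallelism settings above map onto the cluster as follows, assuming the standard relationship global batch size = data parallel size × micro batch size × gradient accumulation steps:

```python
# Back-of-the-envelope check of the model distribution
# (values taken from the configuration above).
tensor_parallel   = 8
pipeline_parallel = 10
context_parallel  = 4
micro_batch_size  = 1
global_batch_size = 1260
total_gpus        = 2880            # 360 nodes x 8 H100 GPUs each

gpus_per_replica = tensor_parallel * pipeline_parallel * context_parallel
data_parallel    = total_gpus // gpus_per_replica
grad_accum_steps = global_batch_size // (data_parallel * micro_batch_size)

print(gpus_per_replica)   # 320 GPUs (40 nodes) per model replica
print(data_parallel)      # DP size of 9
print(grad_accum_steps)   # 140 micro-batches accumulated per optimizer step
```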
The NVIDIA Hopper architecture, which is the foundation of the NVIDIA H100 GPU, includes hardware acceleration for 8-bit floating-point (FP8) operations. iGenius used FP8 in their third phase of pretraining to accelerate training and reduce the model’s memory footprint.
NeMo Framework integrates FP8 training out-of-the-box with the Transformer Engine library. Enabling FP8 can be done by adding the following parameters to the training configuration file:
```yaml
transformer_engine: True
fp8: True
fp8_params: True
fp8_e4m3: False
fp8_hybrid: True
fp8_margin: 0
fp8_interval: 1
fp8_amax_history_len: 1024
fp8_amax_compute_algo: max
fp8_wgrad: True
ub_tp_comm_overlap: False
```
iGenius successfully continued pretraining in FP8, resulting in an increased MFU from 33% with BF16 to 37% with FP8. In addition, the overall training step accelerated 1.15x with FP8. This speedup was obtained by just enabling FP8 and can be increased if you take into account the memory savings from FP8. By tuning the parallelisms and micro batch size, you can better optimize the memory available from FP8, resulting in greater speedups.
Changing the model representation from BF16 to FP8 during CPT requires careful consideration to avoid training divergence. To achieve FP8 training stability, iGenius explored various approaches, but the one that worked the best was reducing the learning rate at early signs of instability.
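The following is a minimal, hypothetical sketch of that idea: track the recent training loss and shrink the learning rate when a spike appears. The window size, spike threshold, and learning-rate floor are illustrative assumptions, and in practice the same intervention can also be applied manually by resuming from a recent checkpoint with a lower learning rate.

```python
# Hypothetical sketch: watch the training loss for spikes and reduce the
# learning rate at early signs of instability. All thresholds are illustrative.
from collections import deque

class LossSpikeGuard:
    def __init__(self, window: int = 50, spike_factor: float = 1.5,
                 lr_decay: float = 0.5, min_lr: float = 5e-6):
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor
        self.lr_decay = lr_decay
        self.min_lr = min_lr

    def next_lr(self, loss: float, lr: float) -> float:
        """Return the (possibly reduced) learning rate for the next step."""
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            if loss > self.spike_factor * baseline:
                lr = max(lr * self.lr_decay, self.min_lr)
        self.history.append(loss)
        return lr

# Usage inside a training loop: lr = guard.next_lr(current_loss, lr)
```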
Another technique considered was keeping specific layers of the model in the original BF16 precision, after running a histogram analysis of the tensors to detect those that would overflow or underflow in FP8.
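As an illustration of that kind of analysis, the hypothetical sketch below flags tensors whose nonzero magnitudes fall outside the representable range of FP8 E4M3 (maximum magnitude 448, smallest positive subnormal 2^-9). It uses a simple min/max scan as a stand-in for a full histogram and is an assumption, not iGenius’ actual tooling.

```python
# Hypothetical sketch: flag tensors at risk of overflow or underflow in FP8.
import torch

E4M3_MAX = 448.0               # largest representable magnitude in E4M3
E4M3_MIN_SUBNORMAL = 2 ** -9   # smallest positive subnormal in E4M3

def fp8_risk_report(named_tensors):
    """Return (name, min_abs, max_abs) for tensors that exceed E4M3 limits."""
    risky = []
    for name, t in named_tensors:
        a = t.detach().abs().float().flatten()
        a = a[a > 0]                      # ignore exact zeros
        if a.numel() == 0:
            continue
        lo, hi = a.min().item(), a.max().item()
        if hi > E4M3_MAX or lo < E4M3_MIN_SUBNORMAL:
            risky.append((name, lo, hi))
    return risky

# Usage: layers reported here are candidates to keep in BF16.
# risky = fp8_risk_report(model.named_parameters())
```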
iGenius employed a variety of benchmarks to comprehensively assess the performance improvements to Colosseum 355B’s base model. Among these, iGenius prioritized the Massive Multitask Language Understanding (MMLU) benchmark due to its broad scope and general applicability across diverse subject areas.
By using MMLU, iGenius aimed to quantify the extent of knowledge retention and integration achieved through CPT, providing a robust measure of the improvements in the model’s alignment with general human knowledge and reasoning capabilities. At the end of training, iGenius was able to achieve 82.04% accuracy with Colosseum 355B in a 5-shot setting.
LLM alignment
When an LLM has been trained, it has a general understanding of the languages in its dataset and how words, paragraphs, and complex concepts are related to one another. However, the model has not yet learned how to carry out specific tasks, such as summarization or translation, or what a conversation looks like.
The next phase of model development focuses on this learning. There are many techniques available to model builders; iGenius focused on supervised fine-tuning and human preference alignment using Direct Preference Optimization (DPO).
Supervised fine-tuning
Supervised fine-tuning (SFT) is a foundational step in aligning LLM outputs with user-defined behavior. SFT refines the pretrained model’s parameters using labeled pairs of inputs and desired outputs.
Instruction tuning combines fine-tuning and prompting with instructions formulated in natural language, such as “Summarize this article” or “Translate to Italian.” SFT can be applied to this type of one-shot question answering or optimized for chat interactions, enabling models to answer in conversational settings.
iGenius used NVIDIA NeMo-Aligner for Colosseum 355B’s chat instruction fine-tuning. The chat data template uses the following structure:
{ "system" : "" , "conversations" : [ { "from" : "User" , "value" : "What’s the name of the main index on the Italian Stock Exchange?" , "label" : null } , { "from" : "Assistant" , "value" : "The main index on the Italian Stock Exchange is the FTSE MIB." , label ": " " } ] , "mask" : "User" , "type" : "VALUE_TO_TEXT" } |
iGenius ran SFT using NeMo-Aligner/examples/nlp/gpt/train_gpt_sft.py on 350 nodes. Here are the key settings for the SFT training recipe with respect to the iGenius data:
```yaml
chat: True
chat_prompt_tokens:
  system_turn_start: <extra_id_0>
  turn_start: <extra_id_1>
  label_start: <extra_id_2>
  end_of_turn: "\x0A"
  end_of_name: "\x0A"
num_workers: 0
shuffle: True
```
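To make the template concrete, here is a hypothetical sketch of how a conversation sample could be rendered into a single training string with the special tokens above. The exact rendering is handled inside NeMo-Aligner, so treat this as an approximation that is consistent with the DPO prompt example shown later in this post.

```python
# Hypothetical rendering of a chat sample using the template tokens above.
SYSTEM_TURN_START = "<extra_id_0>"
TURN_START = "<extra_id_1>"
SEP = "\n"   # end_of_turn and end_of_name are both "\x0A" (newline)

def render_chat(sample: dict) -> str:
    text = f"{SYSTEM_TURN_START}System{SEP}{sample['system']}{SEP}"
    for turn in sample["conversations"]:
        text += f"{TURN_START}{turn['from']}{SEP}{turn['value']}{SEP}"
    return text

sample = {
    "system": "",
    "conversations": [
        {"from": "User", "value": "Which year was the Magna Carta signed?"},
        {"from": "Assistant", "value": "1215"},
    ],
}
print(render_chat(sample))
# <extra_id_0>System
#
# <extra_id_1>User
# Which year was the Magna Carta signed?
# <extra_id_1>Assistant
# 1215
```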
iGenius conducted experiments with learning rates in the range of [1e-7, 5e-7], using a global batch size of 140. Notably, iGenius did not observe any significant changes in performance when employing either constant or annealing learning rate schedules.
To evaluate the effectiveness of SFT and alignment strategies, iGenius used the IFEval benchmark, which assesses a model’s ability to follow instructions and align with user intent. iGenius trained on different data mixtures for about one epoch, mainly relying on this benchmark to select the best checkpoint.
Human preferences alignment
Following the SFT stage, iGenius used DPO to further refine the language model with human preferences, focusing on choosing between preferred or rejected responses. iGenius ran DPO using NeMo-Aligner/examples/nlp/gpt/train_gpt_dpo.py on 350 nodes.
To optimize performance, rejected responses were generated using the best-performing SFT checkpoint. iGenius’ curated dataset excluded examples with minimal differences between the chosen and rejected responses, ensuring that only meaningful preferences were selected. Datasets were formatted with the SFT chat template, structuring both the chosen and rejected responses.
{ "prompt" : "<extra_id_0>System\n\n<extra_id_1>User\nWhich year was the Magna Carta signed?\n<extra_id_1>Assistant\n" , "chosen_response" : "1215\n<extra_id_1>" , "rejected_response" : "I refuse to answer this question.\n<extra_id_1>" } |
iGenius trained the model for approximately three epochs on approximately 100K samples and, similar to the SFT stage, relied on IFEval, among other benchmarks, to select the optimal checkpoint.
Challenges and best practices for building LLMs
As training scales, minor issues become critical. For the Colosseum 355B training job on 3K GPUs, it can take 15–20 minutes just to load the checkpoint, which has a footprint of roughly 5 TB.
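That checkpoint size is in line with what a simple estimate predicts. As a back-of-the-envelope check (assuming a typical mixed-precision Adam layout of BF16 weights plus FP32 master weights and two FP32 optimizer moments; the actual checkpoint layout may differ):

```python
# Rough checkpoint-size estimate for a 355B-parameter model under the
# assumed mixed-precision Adam layout described above.
n_params = 355e9
bytes_per_param = 2 + 4 + 4 + 4      # bf16 weights + fp32 master + Adam m and v
print(f"~{n_params * bytes_per_param / 1e12:.1f} TB")   # ~5.0 TB
```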
A storage system performing well with multiple small jobs across thousands of GPUs may struggle when a single workload requires all GPUs to read and write checkpoints simultaneously, causing delays and potential timeouts. Temporary network interruptions can also result in training job failures.
The main challenge in this category is a flapping network link, one that repeatedly alternates between the up and down states. DGX Cloud manages these types of infrastructure complexities, enabling you to focus on your AI training goals.
Scaling also exposes previously undetected issues, requiring a rigorous approach to experiment tracking and debugging. Here are some best practices and lessons learned when running LLM training at scale:
- Explore the basics at a reduced scale
- Monitor effectively and track at scale
Explore the basics at reduced scale
Progressive scaling is key, enabling rapid experimentation, error identification, and saving time and resources.
- Small debug model: Exploring large-scale training behavior first with a reduced-size debug model enables rapid iteration and early error detection before committing the full model and cluster.
- End-to-end process testing: The LLM training process executes a loop of training steps, validation, saving, and resuming checkpoints. Lowering the checkpoint interval enables expediting the testing loop.
- Robust checkpointing: Some checkpoint formats can be challenging to resume when the training distribution changes, while formats like torch_dist support resuming and experimenting with different parallelization layouts.
- Expansion from minimal to large-scale distribution: Start testing execution with the minimal training distribution before gradually increasing the DP size.
- Dataset testing: Given the extensive effort and complexity involved in data processing, it is essential to test a dataset at a small scale to identify either potential dataset preparation mistakes or sample corruption early in the training stage.
Monitor effectively and track at scale
With potentially hundreds or even thousands of nodes involved in the training process, it is vital to maintain visibility into job progress, infrastructure health, and overall resource utilization so that you can tune, adapt, and react accordingly, ensuring maximum use of your infrastructure.
- Performance: Monitor MFU while scaling and adapt the hyperparameters accordingly.
- Accurate experiment tracking: It is essential to record environment variables, model configurations, and scripts across runs to ensure reproducibility and help identify potential issues or improvements.
- Infrastructure observability: Monitor system health and identify when resources are underused or pushed to their limits.
- Predefined tests: Unhealthy nodes are inevitable, so it’s imperative to have predefined tests ready to run on any suspect nodes to pinpoint issues and enable remediation. Any new nodes added to the cluster should also have these tests conducted to confirm overall health. It is prudent to have lightweight versions of these tests that can be run as part of the job’s Prolog and Epilog to increase the chances of job success.
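As an example, a lightweight per-node check along the following lines (a hypothetical sketch, not a DGX Cloud tool) could run in a job prolog to confirm that every GPU is present and able to execute work:

```python
# Hypothetical lightweight node health check for a job prolog: verify GPU
# count and run a small matmul on each device. Assumes 8 GPUs per node.
import sys
import torch

EXPECTED_GPUS = 8

def node_health_check() -> bool:
    found = torch.cuda.device_count()
    if found != EXPECTED_GPUS:
        print(f"FAIL: expected {EXPECTED_GPUS} GPUs, found {found}")
        return False
    for i in range(found):
        try:
            with torch.cuda.device(i):
                x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
                (x @ x).sum().item()   # force execution on this GPU
        except RuntimeError as err:
            print(f"FAIL: GPU {i}: {err}")
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if node_health_check() else 1)
```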
Summary
By using large-scale CPT and alignment on specific domains, iGenius built Colosseum 355B, a foundational LLM developed using NVIDIA DGX Cloud infrastructure and NVIDIA AI Enterprise software with NVIDIA NeMo Framework.
iGenius reduced computational costs and improved efficiency through CPT in FP8 precision. This approach not only enhanced baseline performance on crucial benchmarks like MMLU but also demonstrated iGenius’ capability to continuously improve foundational LLMs over time at a lower cost, underscoring its commitment to providing sustainable solutions for its core use cases and clients.
Looking ahead, iGenius will continue to explore continual learning strategies to keep improving its models and adapt them to various business domains, ensuring sustained performance improvements and cost efficiency.
Colosseum 355B is also now available as an NVIDIA NIM microservice on the NVIDIA API Catalog. NIM microservices are designed to streamline and accelerate the deployment of generative AI models across NVIDIA-accelerated infrastructure anywhere, including cloud, data center, and workstations. NIM uses inference optimization engines, industry-standard APIs, and prebuilt containers to provide high-throughput AI inference that scales with demand.
Explore the Colosseum 355B NIM microservice.
Acknowledgements
Thanks to the following iGenius contributors: Michele Resta, Andrea Valenti, and Danilo Numeroso. Thanks also to NVIDIA contributor Oleg Sudakov.