Our GPU cluster, IRON, powers our AI training & fine-tuning services.
The IRON Cluster is hosted with our strategic partner, Cogent Communications, in Washington, DC (1050 Connecticut Ave NW).
Understanding the Time Required to Fine-Tune AI Models in The IRON Cluster
As the dataset used to train an AI model grows, the time required for fine-tuning increases.
Conversely, as more GPUs in the cluster are dedicated to training the model, the time required for training or fine-tuning decreases (a rough scaling sketch follows the list below).
The time required to fine-tune an AI model depends on:
- The sample size (in millions of samples) used for fine-tuning the AI model.
- The number of GPUs in the cluster for fine-tuning the AI model.
- Other factors, such as the model's complexity and the performance of each GPU (see the equation in the next section).
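As a rough illustration of these scaling relationships, the sketch below scales a hypothetical baseline run. The baseline figures (1 million samples on 4 GPUs in 10 hours) are assumptions chosen for illustration, not measured IRON Cluster results.

```python
# A minimal sketch, assuming a purely illustrative baseline (1 million samples
# on 4 GPUs taking 10 hours); these are not measured IRON Cluster figures.

def estimate_fine_tune_hours(samples_millions: float,
                             num_gpus: int,
                             baseline_hours: float = 10.0,
                             baseline_samples_millions: float = 1.0,
                             baseline_gpus: int = 4) -> float:
    """Scale a baseline run: time grows linearly with dataset size and
    shrinks roughly linearly as more GPUs are dedicated to the job."""
    dataset_ratio = samples_millions / baseline_samples_millions
    gpu_ratio = num_gpus / baseline_gpus
    return baseline_hours * dataset_ratio / gpu_ratio

print(estimate_fine_tune_hours(5, 4))   # 5x the data, same GPUs    -> ~50 hours
print(estimate_fine_tune_hours(5, 8))   # 5x the data, twice the GPUs -> ~25 hours
```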
Fine-Tuning Time Equation
Fine-Tuning Time (T) = (Dataset Size * Model Complexity) / (GPU Performance Factor)
- Dataset Size is the number of examples in your training data.
- Model Complexity refers to the size and structure of your neural network (e.g., number of parameters).
- GPU Performance Factor is determined by the GPU's TFLOPS, memory bandwidth, and core count.
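The equation can be turned into a small calculator, sketched below in Python. The performance factors assigned to each GPU and the example model complexity are assumed placeholder values, not official specifications or IRON Cluster benchmarks.

```python
# A minimal sketch of the equation above. The per-GPU performance factors and
# the example model complexity are illustrative placeholders, not benchmarked
# or official values.

GPU_PERFORMANCE_FACTOR = {
    # Hypothetical relative throughput scores (higher is faster).
    "RTX 3090": 1.0,
    "RTX 4090": 1.8,
    "H100": 4.0,
}

def fine_tuning_time(dataset_size: int, model_complexity: float, gpu: str) -> float:
    """T = (Dataset Size * Model Complexity) / (GPU Performance Factor)."""
    return (dataset_size * model_complexity) / GPU_PERFORMANCE_FACTOR[gpu]

# Example: 5 million samples and a model with a relative complexity of 2.0.
# The result is in arbitrary time units; only the ratios between GPUs matter.
for gpu in GPU_PERFORMANCE_FACTOR:
    print(gpu, fine_tuning_time(5_000_000, 2.0, gpu))
```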
Understanding Performance per Dollar of The IRON Cluster
When training/fine-tuning AI models, the size of the dataset has a dramatic impact on the cost and time required for training/fine-tuning.
For sample sizes like 1, 5, or 10 million samples, the Nvidia RTX 3090 GPU has superior performance per dollar compared to the Nvidia RTX 4090 or H100 GPUs.
In the graphs below, both the RTX 3090 and the RTX 4090 show superior performance per dollar compared with the H100 for small sample sizes.
However, the Nvidia H100 GPU delivers better performance per dollar than the cheaper GPUs at larger sample sizes, despite its higher cost. The IRON Cluster currently comprises RTX 3090 GPUs.
Cost-Time Efficiency (CTE) refers to the balance between what a GPU costs and how long it takes, given its computational power, to complete a given task (such as fine-tuning a model or rendering).
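To make the CTE trade-off concrete, the sketch below estimates both the dollar cost and the wall-clock time of a fine-tuning job on each GPU type. The hourly prices and throughput figures are assumptions chosen for illustration only, not IRON Cluster pricing or measured performance.

```python
# A minimal sketch of a cost-time efficiency comparison. The hourly prices and
# per-hour throughputs below are placeholder assumptions for illustration;
# they are not IRON Cluster rates or benchmarked results.

GPUS = {
    # gpu: (hypothetical cost per GPU-hour in USD, hypothetical samples per hour)
    "RTX 3090": (0.40, 1_000_000),
    "RTX 4090": (0.70, 1_800_000),
    "H100":     (2.50, 4_000_000),
}

def cost_time_efficiency(samples: int) -> dict[str, tuple[float, float]]:
    """Return (total cost in USD, wall-clock hours) per GPU type for a job of
    `samples` samples; CTE is the balance between these two numbers."""
    results = {}
    for gpu, (price_per_hour, samples_per_hour) in GPUS.items():
        hours = samples / samples_per_hour
        results[gpu] = (hours * price_per_hour, hours)
    return results

# With these placeholder numbers, the RTX 3090 and RTX 4090 are cheaper per
# run, while the H100 finishes the same job in a fraction of the time; which
# one "wins" depends on how cost and time are weighted for the workload.
print(cost_time_efficiency(1_000_000))
print(cost_time_efficiency(10_000_000))
```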