Cerebras introduces gigaGPT: GPT-3 sized models in 565 lines of code.
GigaGPT is Cerebras’ implementation of Andrei Karpathy’s nanoGPT – the simplest and most compact code base to train and fine-tune GPT models. Whereas nanoGPT can train models in the 100M parameter range, gigaGPT trains models well over 100B parameters. We do this without introducing additional code or relying on third party frameworks – the entire repo is just 565 lines of code. Instead gigaGPT utilizes the large memory and compute capacity of Cerebras hardware to enable large scale training on vanilla torch.nn code. With no modifications, gigaGPT supports long context lengths and works with a variety of optimizers.
Why gigaGPT
While the transformer architecture is simple, training a large transformer on a large number of GPUs is hard. Beyond a few billion parameters, vanilla GPT models run out of memory on even the latest GPUs. Training larger models requires breaking up models into smaller pieces, distributing them to multiple GPUs, coordinating the workload among the workers, and assembling the results. This is typically done via LLM scaling frameworks such as Megatron, DeepSpeed, NeoX, Fairscale, and Mosaic Foundry. Though powerful, these frameworks introduce significant complexity.