We are in a golden age of AI, with cutting-edge models disrupting industries and poised to transform life as we know it. Powering these advancements are increasingly powerful AI accelerators, such as NVIDIA H100 GPUs, Google Cloud TPUs, AWS's Trainium and Inferentia chips, and more. With the growing number of options comes the challenge of selecting the most optimal platform for our machine learning (ML) workloads, a crucial decision considering the high costs associated with AI compute. Importantly, a comprehensive assessment of each option requires ensuring that we are maximizing its utilization to fully leverage its capabilities.
In this post, we will review a number of techniques for optimizing an ML workload on AWS's custom-built AI chips using the AWS Neuron SDK. This continues our ongoing series of posts focused on ML model performance analysis and optimization across various platforms and environments (e.g., see here and here). While our primary focus will be on an ML training workload and AWS Inferentia2, the techniques discussed are also applicable to AWS Trainium. (Recall that although AWS Inferentia is primarily designed as an AI inference chip, we have previously demonstrated its effectiveness in training tasks as well.)
Generally speaking, performance optimization is an iterative process that includes a performance analysis step to accurately identify performance bottlenecks and resource under-utilization (e.g., see here). However, since the techniques we will discuss are general purpose (i.e., they are potentially applicable to any model, regardless of its performance profile), we defer the discussion of performance analysis with the Neuron SDK to a future post.
Disclaimers
The code we will share is intended for demonstrative purposes only; we make no claims regarding its accuracy, optimality, or robustness. Please do not view this post as a substitute for the official Neuron SDK documentation. Please do not interpret our mention of any platforms, libraries, or optimization techniques as an endorsement of their use. The best options for you will depend greatly on the specifics of your use case and will require your own in-depth investigation and analysis.
The experiments described below were run on an Amazon EC2 inf2.xlarge instance (containing two Neuron cores and four vCPUs). We used the most recent version of the Deep Learning AMI for Neuron available at the time of this writing, "Deep Learning AMI Neuron (Ubuntu 22.04) 20240927", with AWS Neuron 2.20 and PyTorch 2.1. See the SDK documentation for more details on setup and installation. Keep in mind that the Neuron SDK is under active development and that the APIs we refer to, as well as the runtime measurements we report, may become outdated by the time you read this. Please be sure to stay up to date with the most recent SDK and documentation available.
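Before launching the experiments, it can be useful to verify that the Neuron cores are visible to the PyTorch/XLA runtime. The snippet below is a minimal sanity check, assuming the torch and torch_xla packages bundled with the Neuron AMI; it is not part of the training script:

import torch
import torch_xla.core.xla_model as xm

# List the XLA devices exposed by the Neuron runtime
# (an inf2.xlarge instance should expose two Neuron cores).
print(xm.get_xla_supported_devices())

# Run a trivial computation on the default XLA device.
device = xm.xla_device()
t = torch.ones((2, 2), device=device) * 3
xm.mark_step()  # materialize the lazily recorded XLA graph
print(t.cpu())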
To facilitate our discussion, we introduce the following simple Vision Transformer (ViT)-backed classification model (based on timm version 1.0.10):
from torch.utils.data import Dataset
import time, os
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
from timm.models.vision_transformer import VisionTransformer

# use random data
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=index % 1000, dtype=torch.int64)
        return rand_image, label

def train(batch_size=16, num_workers=0):
    # Initialize XLA process group for torchrun
    import torch_xla.distributed.xla_backend
    torch.distributed.init_process_group('xla')

    # multi-processing: ensure each worker has the same initial weights
    torch.manual_seed(0)
    dataset = FakeDataset()
    model = VisionTransformer()

    # load model to XLA device
    device = xm.xla_device()
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters())
    data_loader = torch.utils.data.DataLoader(dataset,
                                              batch_size=batch_size,
                                              num_workers=num_workers)
    data_loader = pl.MpDeviceLoader(data_loader, device)
    loss_function = torch.nn.CrossEntropyLoss()

    summ = 0
    count = 0
    t0 = time.perf_counter()

    for step, (inputs, targets) in enumerate(data_loader, start=1):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        loss.backward()
        xm.optimizer_step(optimizer)
        batch_time = time.perf_counter() - t0
        if step > 10:  # skip first steps
            summ += batch_time
            count += 1
        t0 = time.perf_counter()
        if step > 500:
            break

    print(f'average step time: {summ/count}')

if __name__ == '__main__':
    train()

# Initialization command:
# torchrun --nproc_per_node=2 train.py
Running our baseline model on the two cores of our AWS Inferentia instance results in a training speed of 251.98 samples per second.
In the next sections, we will iteratively apply a number of potential optimization techniques and assess their impact on step time performance. While we won't go into the full details of each method, we will provide references for further reading (e.g., here). Importantly, the list we present is not all-inclusive; there are many techniques beyond what we will cover. We will organize the techniques into three categories: PyTorch optimizations, OpenXLA optimizations, and Neuron-specific optimizations. However, the order of presentation is not binding. In fact, some of the techniques are interdependent; for example, applying the mixed precision optimization may free up enough device memory to enable increasing the batch size.
PyTorch Performance Optimizations
In previous posts (e.g., here) we have covered the topic of PyTorch model performance analysis and optimization on GPU extensively. Many of the techniques we discussed are relevant to other AI accelerators as well. In this section we will revisit a few of these techniques and apply them to AWS Inferentia.
Multi-process Data Loading
In multi-process data loading, the input data is prepared in one or more dedicated CPU processes rather than in the same process that runs the training step. This allows the data loading to overlap with training, which can increase system utilization and lead to a significant speed-up. The number of processes is controlled by the num_workers parameter of the PyTorch DataLoader. In the following block we run our script with num_workers set to 1:
train(num_workers=1)
This change results in a training speed of 253.56 samples per second, a boost of less than 1%.
Batch Size Optimization
Another important hyperparameter that can influence training speed is the training batch size. Often, we have found that increasing the batch size improves system utilization and results in better performance. However, the effects can vary based on the model and platform. In the case of our toy model on AWS Inferentia, we find that running with a batch size of 8 samples per Neuron core results in a speed of 265.68 samples per second, roughly 5% faster than a batch size of 16 samples per core.
train(batch_size=8, num_workers=1)
PyTorch Automatic Mixed Precision
Another common method for boosting performance is to use lower precision floats such as the 16-bit BFloat16. Importantly, some model components may not be compatible with reduced precision floats. PyTorch's Automatic Mixed Precision (AMP) mode attempts to automatically match the most appropriate floating point type to each model operation. Although the Neuron compiler offers different options for employing mixed precision, it also supports the option of using PyTorch AMP. The code block below includes the modifications required to use PyTorch AMP.
def train(batch_size=16, num_workers=0):
    # Initialize XLA process group for torchrun
    import torch_xla.distributed.xla_backend
    torch.distributed.init_process_group('xla')

    # multi-processing: ensure each worker has the same initial weights
    torch.manual_seed(0)
    dataset = FakeDataset()
    model = VisionTransformer()

    # load model to XLA device
    device = xm.xla_device()
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters())
    data_loader = torch.utils.data.DataLoader(dataset,
                                              batch_size=batch_size,
                                              num_workers=num_workers)
    data_loader = pl.MpDeviceLoader(data_loader, device)
    loss_function = torch.nn.CrossEntropyLoss()

    summ = 0
    count = 0
    t0 = time.perf_counter()

    for step, (inputs, targets) in enumerate(data_loader, start=1):
        optimizer.zero_grad()
        # use PyTorch AMP
        with torch.autocast(dtype=torch.bfloat16, device_type='cuda'):
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
        loss.backward()
        xm.optimizer_step(optimizer)
        batch_time = time.perf_counter() - t0
        if step > 10:  # skip first steps
            summ += batch_time
            count += 1
        t0 = time.perf_counter()
        if step > 500:
            break

    print(f'average step time: {summ/count}')

if __name__ == '__main__':
    # disable Neuron compiler casting
    os.environ["NEURON_CC_FLAGS"] = "--auto-cast=none"
    torch.cuda.is_bf16_supported = lambda: True
    train(batch_size=8, num_workers=1)
The resulting training speed is 196.64 samples per second, about 26% lower than the default mixed precision setting of the Neuron compiler. It is important to note that while this post focuses on performance, in real-world scenarios we would also need to evaluate the effect of the mixed precision policy we choose on model accuracy.
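For instance, one simple way to begin such an evaluation, sketched below under the assumption that occasional host synchronization is acceptable, is to log the training loss at a fixed interval and compare the resulting curves across precision policies. Note that calling loss.item() forces execution of the pending XLA graph, so the logging frequency should be kept low:

# A minimal sketch (not part of the measured experiments): track the loss
# periodically so that runs with different precision policies can be compared.
LOG_INTERVAL = 100  # hypothetical logging interval
loss_history = []

for step, (inputs, targets) in enumerate(data_loader, start=1):
    optimizer.zero_grad()
    with torch.autocast(dtype=torch.bfloat16, device_type='cuda'):
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
    loss.backward()
    xm.optimizer_step(optimizer)
    if step % LOG_INTERVAL == 0:
        # .item() triggers graph execution and a device-to-host copy
        loss_history.append((step, loss.item()))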
OpenXLA Optimizations
As discussed in a previous post, Neuron cores are treated as XLA devices and the torch-neuronx Python package implements the PyTorch/XLA API. Consequently, any optimization opportunities provided by the OpenXLA framework, and specifically those offered by the PyTorch/XLA API, can be leveraged on AWS Inferentia and Trainium. In this section we consider a few of these opportunities.
BFloat16 Precision
OpenXLA supports the option of casting all floats to BFloat16 via the XLA_USE_BF16 environment variable, as shown in the code block below:
if __name__ == '__main__':
    os.environ['XLA_USE_BF16'] = '1'
    train(batch_size=8, num_workers=1)
The resulting training speed is 394.51 samples per second, nearly 50% faster than the speed of the default mixed precision option.
Multi-process Device Loading
The PyTorch/XLA MpDeviceLoader and its internal ParallelLoader, which are responsible for loading the input data onto the accelerator, include a number of parameters for controlling the transfer of data from the host to the device. In the code block below we tune the batches_per_execution setting, which determines the number of batches copied to the device for each execution cycle of the ParallelLoader. By increasing this setting, we aim to reduce the overhead of host-to-device communication:
data_loader = torch.utils.data.DataLoader(dataset,
                                          batch_size=batch_size,
                                          num_workers=num_workers)
data_loader = pl.MpDeviceLoader(data_loader,
                                device, batches_per_execution=10)
As a result of this optimization, the training speed increased to 1,027.39 samples per second, representing an additional 260% speed-up.
Torch Compilation with OpenXLA Backend
In previous posts (e.g., here), we have demonstrated the potential performance gains from using PyTorch's graph compilation offering. Although OpenXLA includes its own graph creation and Just-In-Time (JIT) compilation mechanisms, torch.compile can provide additional acceleration by eliminating the need to trace the model operations at every step. The following code snippet demonstrates the use of the dedicated openxla backend for compiling the model:
model = model.to(device)
model = torch.compile(model, backend='openxla')
Although torch.compile is currently not yet supported by the Neuron SDK, we include its mention in anticipation of its future release.
Neuron SDK Optimizations
In this section we consider some of the optimization opportunities offered by the AWS Neuron SDK and, more specifically, by the Neuron compiler.
Mixed Precision
The Neuron SDK supports a variety of mixed precision settings. In the code block below we program the compiler to cast all floats to BFloat16 via the NEURON_CC_FLAGS environment variable.
if __name__ == '__main__':
    os.environ["NEURON_CC_FLAGS"] = "--auto-cast all --auto-cast-type bf16"
    train(batch_size=8, num_workers=1)
This results (unsurprisingly) in a training speed similar to that of the OpenXLA BFloat16 experiment described above.
FP8
One of the unique features of NeuronCoreV2 is its support for the eight-bit floating point type, fp8_e4m3. The code block below demonstrates how to configure the Neuron compiler to automatically cast all floating-point operations to FP8:
if __name__ == '__main__':
    os.environ["NEURON_CC_FLAGS"] = "--auto-cast all --auto-cast-type fp8_e4m3"
    train(batch_size=8, num_workers=1)
While FP8 can accelerate training in some cases, maintaining stable convergence can be more difficult than when using BFloat16 due to its reduced precision and dynamic range. Please see our previous post for more on the potential benefits and challenges of FP8 training.
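To illustrate the reduced precision and dynamic range, the standalone snippet below compares the numerical properties of the floating point types discussed in this post. It assumes PyTorch 2.1 or later, where the torch.float8_e4m3fn dtype is available; the exact properties of the Neuron compiler's fp8_e4m3 type may differ:

import torch

# Compare the maximum representable value and machine epsilon of the float
# types discussed above (assumes PyTorch >= 2.1 for torch.float8_e4m3fn).
for dtype in (torch.float32, torch.bfloat16, torch.float8_e4m3fn):
    info = torch.finfo(dtype)
    print(f'{str(dtype):<24} max={info.max:<16} eps={info.eps}')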
In the case of our model, using FP8 actually harms runtime performance compared to BFloat16, lowering the training speed to 940.36 samples per second.
Compiler Optimizations
The Neuron compiler includes a variety of controls for optimizing the runtime performance of the compiled graph. Two key settings are model-type and opt-level. The model-type setting applies optimizations tailored to specific model architectures, such as transformers, while the opt-level setting allows for balancing compilation time against runtime performance. In the code block below, we set the model-type setting to transformer and the opt-level setting to the highest performance option. We further specify the target runtime device, inf2, to ensure that the model is optimized for the target device.
if __name__ == '__main__':
    os.environ['XLA_USE_BF16'] = '1'
    os.environ["NEURON_CC_FLAGS"] = ("--model-type transformer "
                                     "--optlevel 3 "
                                     "--target inf2")
    train(batch_size=8, num_workers=1)
The above configuration resulted in a training speed of 1,093.25 samples per second, amounting to a modest 6% improvement.
We summarize the results of our experiments in the table below. Keep in mind that the effect of each of the optimization techniques we discussed will depend greatly on the model and the runtime environment.

Experiment                              Training Speed (samples per second)
Baseline                                251.98
Multi-process data loading              253.56
Batch size optimization                 265.68
PyTorch automatic mixed precision       196.64
OpenXLA BFloat16                        394.51
Multi-process device loading            1,027.39
Neuron compiler BFloat16                similar to OpenXLA BFloat16
FP8                                     940.36
Compiler optimizations                  1,093.25
The techniques we employed resulted in a 435% performance boost compared to our baseline experiment. It is likely that additional acceleration could be achieved by revisiting and fine-tuning some of the techniques we discussed, or by applying other optimization techniques not covered in this post.
Our goal has been to demonstrate some of the available optimization methods and their potential impact on runtime performance. However, in a real-world scenario, we would need to assess the manner in which each of these optimizations impacts our model convergence. In some cases, adjustments to the model configuration may be necessary to ensure optimal performance without sacrificing accuracy. Additionally, using a performance profiler to identify bottlenecks and measure system resource utilization is essential for guiding and informing our optimization activities.
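For example, PyTorch/XLA includes a basic metrics facility that can provide a first indication of where time is being spent (e.g., graph compilations and host-to-device transfers). The snippet below is a minimal sketch of printing this report at the end of a run; the dedicated Neuron profiling tools, whose discussion we defer to a future post, offer far more detailed insight:

import torch_xla.debug.metrics as met

# Print the counters and timers collected by PyTorch/XLA during the run
# (e.g., number of graph compilations, transfer and execution times).
print(met.metrics_report())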
These days, we are fortunate to have a wide variety of systems on which to run our ML workloads. No matter which platform we choose, our goal is to maximize its capabilities. In this post, we focused on AWS Inferentia and reviewed a number of techniques for accelerating ML workloads running on it. Be sure to check out our other posts for more optimization strategies across various AI accelerators.