What You’ll Do Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack stack, from data pipelines to GPU kernels Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability,