PyTorch training

  • Enhancing PyTorch Training with TraceML


    Real-time observability for PyTorch training (TraceML)TraceML has been updated to enhance real-time observability during PyTorch training, particularly for long or remote runs. Key improvements include live monitoring of dataloader fetch times to identify input pipeline stalls, tracking GPU step time drift using non-blocking CUDA events, and monitoring CUDA memory to detect leaks before out-of-memory errors occur. Optional layer-wise timing and memory tracking are available for deeper debugging, and the tool is designed to complement existing profilers. Currently tested on single-GPU setups, with plans for multi-GPU support, TraceML aims to address common issues like step drift and memory creep across various training pipelines. Feedback is sought from users to refine signal detection. This matters because it helps optimize machine learning training processes by identifying and addressing runtime issues early.

    Read Full Article: Enhancing PyTorch Training with TraceML

  • Testing Octaspace Cloud GPU Performance & Pricing


    Testing Octaspace Cloud GPU – quick notes on performance and pricingOctaspace Cloud GPU offers a compelling option for those in need of reliable GPU resources for tasks like PyTorch training and Stable Diffusion fine-tuning. The platform supports RTX 4090 and A100 instances, with a user-friendly setup process that includes easy integration of custom Docker images. Performance on the A100 instance is comparable to that of Lambda, with stable disk I/O and no unexpected slowdowns. Notably, Octaspace is consistently more affordable than competitors like RunPod and Lambda while providing similar performance. However, the platform only accepts cryptocurrency payments and has a limited number of locations. For users without local GPU access, Octaspace presents a cost-effective and reliable alternative. This matters because it provides an affordable and efficient solution for intensive computational tasks, which can be crucial for developers and researchers working with machine learning and AI models.

    Read Full Article: Testing Octaspace Cloud GPU Performance & Pricing