Deep Dives

  • Optimizing GPU Utilization for Cost and Climate Goals


    idle gpus are bleeding money, did the math on our h100 cluster and it's worse than I thought

    A cost analysis of GPU infrastructure revealed significant financial and environmental inefficiency. The cluster, 16x H100 GPUs on AWS at $98.32 per hour, sits idle 40% of the time; the analysis puts the total monthly cost of that idleness at approximately $45,000, of which roughly $28,000 is direct compute waste. Job queue bottlenecks, inefficient resource allocation, and power consumption drive both the cost and the carbon footprint. Implementing dynamic orchestration and better job placement raised utilization from 60% to 85%, saving about $19,000 per month and reducing CO2 emissions. Making costs visible and sharing resources across teams are the essential next steps. This matters because optimizing GPU usage can significantly reduce operational costs and environmental impact, aligning financial and climate goals.
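
    The arithmetic behind these figures is easy to sanity-check. A minimal sketch in Python (the hourly rate and utilization figures come from the analysis above; the 730-hour month is a standard approximation):

        HOURS_PER_MONTH = 730            # average hours in a month

        rate_per_hour = 98.32            # 16x H100 cluster on AWS, per the analysis
        monthly_bill = rate_per_hour * HOURS_PER_MONTH   # ~$71,800

        def monthly_waste(idle_fraction: float) -> float:
            """Dollars per month spent on GPU-hours that do no work."""
            return monthly_bill * idle_fraction

        before = monthly_waste(0.40)     # 60% utilization -> ~$28,700 idle spend
        after = monthly_waste(0.15)      # 85% utilization -> ~$10,800 idle spend
        print(f"monthly savings: ${before - after:,.0f}")   # ~$17,900, close to the reported $19,000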

    Read Full Article: Optimizing GPU Utilization for Cost and Climate Goals

  • Four Ways to Run ONNX AI Models on GPU with CUDA


    Not One, Not Two, Not Even Three, but Four Ways to Run an ONNX AI Model on GPU with CUDA

    Running ONNX AI models on GPUs with CUDA can be achieved through four distinct methods, giving flexibility and performance headroom for machine learning operations: ONNX Runtime with the CUDA execution provider, TensorRT for optimized inference, PyTorch via its ONNX export capabilities, and the NVIDIA Triton Inference Server for scalable deployment. Each approach offers distinct advantages, such as speed, ease of integration, or scalability, catering to different needs in AI model deployment. Understanding these options is crucial for optimizing AI workloads and making efficient use of GPU resources.
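
    Of the four, ONNX Runtime with the CUDA execution provider is the most direct. A minimal sketch, assuming the onnxruntime-gpu package and a model file with a single float32 image input (the filename and shape are placeholders):

        import numpy as np
        import onnxruntime as ort

        # Ask for the CUDA execution provider, falling back to CPU if unavailable.
        session = ort.InferenceSession(
            "model.onnx",
            providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
        )

        input_name = session.get_inputs()[0].name
        x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input

        outputs = session.run(None, {input_name: x})  # None -> return all outputs
        print(outputs[0].shape)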

    Read Full Article: Four Ways to Run ONNX AI Models on GPU with CUDA

  • Pydantic AI Durable Agent Demo


    Pydantic AI has introduced two new demos showcasing durable agent patterns using DBOS: one demonstrating large fan-out parallel workflows called "Deep Research," and the other illustrating long sequential subagent chaining known as "Twenty Questions." These demos highlight the importance of durable execution, allowing agents to survive crashes or interruptions and resume precisely where they left off. The execution of these workflows is fully observable in the DBOS console, with detailed workflow graphs and management tools, and is instrumented with Logfire to trace token usage and cost per step. This matters because it showcases advanced techniques for building resilient AI systems that can handle complex tasks over extended periods.
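
    Stripped of the DBOS durability layer, the "Deep Research" fan-out pattern reduces to launching subagent runs concurrently. A minimal sketch against Pydantic AI's public Agent API (the model name and prompts are placeholders, and nothing here survives a crash; that resilience is what the DBOS integration adds):

        import asyncio
        from pydantic_ai import Agent

        # Placeholder subagent; the real demo wraps each run in a DBOS
        # workflow step so it can resume after an interruption.
        researcher = Agent(
            "openai:gpt-4o",
            system_prompt="Research the topic and answer in two sentences.",
        )

        async def deep_research(topics: list[str]) -> list[str]:
            # Fan out one subagent run per topic and gather the results.
            runs = await asyncio.gather(*(researcher.run(t) for t in topics))
            return [r.output for r in runs]

        print(asyncio.run(deep_research(["GPU utilization", "durable execution"])))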

    Read Full Article: Pydantic AI Durable Agent Demo

  • Boosting GPU Utilization with WoolyAI’s Software Stack


    Co-locating multiple jobs on GPUs with deterministic performance for a 2-3x increase in GPU Utilization

    Traditional GPU job orchestration often leads to underutilization due to the one-job-per-GPU approach, which leaves GPU resources idle when not fully saturated. WoolyAI's software stack addresses this by allowing multiple jobs to run concurrently on a single GPU with deterministic performance, dynamically managing the GPU's streaming multiprocessors (SMs) to ensure full utilization. This approach not only maximizes GPU efficiency but also supports running machine learning jobs on CPU-only infrastructure by executing kernels remotely on a shared GPU pool. Additionally, it allows existing CUDA PyTorch jobs to run seamlessly on AMD hardware without modifications. This matters because it significantly increases GPU utilization and efficiency, potentially reducing costs and improving performance in computational tasks.
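
    WoolyAI's scheduler is proprietary, but the mechanism it builds on, concurrent kernels sharing one GPU's SMs, can be illustrated generically with CUDA streams in PyTorch. This sketch is not WoolyAI's API and gives none of its determinism guarantees; it only shows two workloads overlapping on one device:

        import torch

        assert torch.cuda.is_available()
        a = torch.randn(4096, 4096, device="cuda")
        b = torch.randn(4096, 4096, device="cuda")

        # Two independent "jobs" issued on separate CUDA streams; their
        # kernels may overlap whenever SMs would otherwise sit idle.
        s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
        with torch.cuda.stream(s1):
            out1 = a @ a   # job 1
        with torch.cuda.stream(s2):
            out2 = b @ b   # job 2

        torch.cuda.synchronize()   # wait for both streams to finish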

    Read Full Article: Boosting GPU Utilization with WoolyAI’s Software Stack

  • MayimFlow: Preventing Data Center Water Leaks


    MayimFlow wants to stop data center leaks before they happen

    MayimFlow, a startup founded by John Khazraee, aims to prevent water leaks in data centers before they occur, using IoT sensors and machine learning models to provide early warnings. Data centers, which consume significant amounts of water, face substantial risks from even minor leaks, potentially leading to costly downtime and disruptions. Khazraee, with a background in infrastructure for major tech companies, has assembled a team experienced in data centers and water management to tackle this challenge. The company envisions expanding its leak detection solutions beyond data centers to other sectors like commercial buildings and hospitals, emphasizing the growing importance of water management in various industries. This matters because proactive leak detection can save companies significant resources and prevent disruptions in critical operations.
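
    MayimFlow's models are not public, but the basic shape of sensor-driven early warning is familiar: flag readings that deviate sharply from recent history. A deliberately crude illustration using a rolling z-score (thresholds and readings are invented, and a trained model would replace this heuristic):

        from collections import deque
        from statistics import mean, stdev

        def leak_alerts(readings, window=20, threshold=3.0):
            """Yield indices where a flow reading sits more than
            `threshold` sigmas from the trailing window."""
            history = deque(maxlen=window)
            for i, value in enumerate(readings):
                if len(history) == window:
                    mu, sigma = mean(history), stdev(history)
                    if sigma > 0 and abs(value - mu) > threshold * sigma:
                        yield i
                history.append(value)

        # Steady flow around 5 L/min, then a sudden spike (simulated leak).
        flow = [5.0 + 0.05 * (i % 3) for i in range(40)] + [9.5]
        print(list(leak_alerts(flow)))   # -> [40]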

    Read Full Article: MayimFlow: Preventing Data Center Water Leaks

  • 3D Furniture Models with LLaMA 3.1


    Gen 3D with local llm

    An innovative project has explored the potential of open-source language models like LLaMA 3.1 to generate 3D furniture models, pushing these models beyond text to create complex 3D mesh structures. The project involved fine-tuning LLaMA with a 20k token context length to handle the intricate geometry of furniture, using a specialized dataset of furniture categories such as sofas, cabinets, chairs, and tables. Utilizing GPU infrastructure from verda.com, the model was trained to produce detailed mesh representations, with results available for viewing on llm3d.space. This advancement showcases the potential for language models to contribute to fields like e-commerce, interior design, AR/VR applications, and gaming by bridging natural language understanding with 3D content creation. This matters because it demonstrates the expanding capabilities of AI in generating complex, real-world applications beyond traditional text processing.
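
    The write-up doesn't publish the model's exact output format, but mesh-as-text schemes typically serialize geometry in something OBJ-like, which is why a long context window matters for non-trivial furniture. A sketch of turning such output back into vertices and faces (the sample text is invented, not the project's actual output):

        def parse_obj(text: str):
            """Parse OBJ-style mesh text into vertex and face lists."""
            vertices, faces = [], []
            for line in text.splitlines():
                parts = line.split()
                if not parts:
                    continue
                if parts[0] == "v":    # vertex: x y z
                    vertices.append(tuple(float(p) for p in parts[1:4]))
                elif parts[0] == "f":  # face: 1-based vertex indices
                    faces.append(tuple(int(p.split("/")[0]) - 1 for p in parts[1:]))
            return vertices, faces

        # Invented sample of the kind of text a fine-tuned model might emit.
        sample = "v 0 0 0\nv 1 0 0\nv 1 1 0\nv 0 1 0\nf 1 2 3 4"
        verts, faces = parse_obj(sample)
        print(len(verts), "vertices,", len(faces), "face(s)")  # 4 vertices, 1 face(s)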

    Read Full Article: 3D Furniture Models with LLaMA 3.1

  • Advancements in Local LLMs: Trends and Innovations


    Build a Local Voice Agent Using LangChain, Ollama & OpenAI Whisper

    In 2025, the local LLM landscape has evolved with notable advancements. llama.cpp has become the preferred choice for many users over other LLM runners like Ollama, thanks to its enhanced performance and seamless integration with Llama-family models. Mixture of Experts (MoE) models have gained traction for running large models efficiently on consumer hardware, balancing performance against resource usage. New local LLMs with improved capabilities and vision features are enabling more complex applications, while Retrieval-Augmented Generation (RAG) systems approximate continuous learning by incorporating external knowledge bases. Advancements in high-VRAM hardware are also bringing more sophisticated models within reach of consumer machines. This matters because it highlights the ongoing innovation and accessibility of AI technologies, empowering users to run advanced models on local devices.
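
    As a taste of the llama.cpp route the post favors, here is a minimal sketch using the llama-cpp-python bindings (the GGUF path is a placeholder, and chat templating details vary by model):

        from llama_cpp import Llama

        # Placeholder path; n_gpu_layers=-1 offloads every layer to the GPU.
        llm = Llama(
            model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",
            n_ctx=8192,
            n_gpu_layers=-1,
            verbose=False,
        )

        out = llm.create_chat_completion(
            messages=[{"role": "user", "content": "Name three MoE models."}],
            max_tokens=128,
        )
        print(out["choices"][0]["message"]["content"])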

    Read Full Article: Advancements in Local LLMs: Trends and Innovations

  • AI’s Impact on Future Healthcare


    OpenAI’s leaked 2025 user priority roadmap

    AI is set to transform healthcare by automating tasks such as medical note generation, which will alleviate the administrative load on healthcare workers. It is also expected to enhance billing, coding, and revenue cycle management by minimizing errors and identifying lost revenue opportunities. Specialized AI agents and knowledge bases will offer tailored advice by accessing specific medical records, while AI's role in diagnostics and medical imaging will continue to grow, albeit under human supervision. Additionally, AI trained on domain-specific language models will improve the handling of medical terminology, reducing clinical documentation errors and potentially decreasing medical errors, which are a significant cause of mortality. This matters because AI's integration into healthcare could lead to more efficient, accurate, and safer medical practices, ultimately improving patient outcomes.

    Read Full Article: AI’s Impact on Future Healthcare

  • Pros and Cons of AI


    Advantages and Disadvantages of Artificial Intelligence

    Artificial intelligence is revolutionizing various sectors by automating routine tasks and tackling complex problems, leading to increased efficiency and innovation. However, while AI offers significant benefits, such as improved decision-making and cost savings, it also presents challenges, including ethical concerns, potential job displacement, and the risk of biases in decision-making processes. Balancing the advantages and disadvantages of AI is crucial to harness its full potential while mitigating risks. Understanding the impact of AI is essential as it continues to shape the future of industries and society at large.

    Read Full Article: Pros and Cons of AI

  • Running SOTA Models on Older Workstations


    Surprised you can run SOTA models on 10+ year old (cheap) workstation with usable tps, no need to break the bank.

    Running state-of-the-art models on older, cost-effective workstations is feasible with the right setup. On a Dell T7910 with a physical E5-2673 v4 CPU (40 cores), 128GB RAM, dual RTX 3090 GPUs, and NVMe disks with PCIe passthrough, usable tokens-per-second (tps) speeds are achievable. Models like MiniMax-M2.1-UD-Q5_K_XL, Qwen3-235B-A22B-Thinking-2507-UD-Q4_K_XL, and GLM-4.7-UD-Q3_K_XL run at 7.9, 6.1, and 5.5 tps respectively. This demonstrates that high-end AI workloads can be served without investing in the latest hardware, making advanced AI more accessible.
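
    Those speeds are consistent with a memory-bandwidth ceiling: during decode, each token must stream the active weights once, so tokens per second is bounded by bandwidth divided by bytes read per token. A back-of-the-envelope sketch (the bandwidth and byte figures are illustrative guesses for a quad-channel DDR4 host running a quantized MoE, not measurements from the post):

        def ceiling_tps(bandwidth_gbs: float, gb_per_token: float) -> float:
            """Upper bound on decode tokens/sec when memory-bound."""
            return bandwidth_gbs / gb_per_token

        # Illustrative: ~70 GB/s quad-channel DDR4, ~8 GB of active
        # quantized weights streamed per token for a large MoE.
        print(f"{ceiling_tps(70, 8):.1f} tps ceiling")   # ~8.8 tps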

    Read Full Article: Running SOTA Models on Older Workstations