Performance Optimization

  • Using Amazon Bedrock: A Developer’s Guide


    Practical notes on using Amazon Bedrock (from a dev perspective)

    Python remains the leading programming language for machine learning due to its comprehensive libraries and versatility. For tasks requiring high performance, C++ and Rust are favored, with Rust offering additional safety features. Julia is noted for its performance, though its adoption is slower. Kotlin, Java, and C# are utilized for platform-specific applications, while Go, Swift, and Dart are chosen for their ability to compile to native code. R and SQL are essential for statistical analysis and data management, respectively, and CUDA is employed for GPU programming to enhance machine learning speeds. JavaScript is commonly used for integrating machine learning into web projects. Understanding the strengths of these languages helps developers choose the right tool for their specific machine learning needs.
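
    Since the guide is developer-facing, a minimal sketch of calling a Bedrock-hosted model from Python may help orient readers; it is an illustration only, and the region, model ID, and Anthropic-style request body are assumptions rather than details from the article.

      import json
      import boto3

      # Bedrock runtime client; the region and model ID below are placeholders.
      bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

      # Anthropic-style message body; other model families on Bedrock expect different schemas.
      body = {
          "anthropic_version": "bedrock-2023-05-31",
          "max_tokens": 256,
          "messages": [{"role": "user", "content": "Summarize what Amazon Bedrock provides."}],
      }

      response = bedrock.invoke_model(
          modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
          body=json.dumps(body),
      )
      print(json.loads(response["body"].read())["content"][0]["text"])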

    Read Full Article: Using Amazon Bedrock: A Developer’s Guide

  • Llama.cpp vs Ollama: Code Generation Throughput


    llama.cpp vs Ollama: ~70% higher code generation throughput on Qwen-3 Coder 32B (FP16)

    A notable performance discrepancy has been observed between llama.cpp and Ollama in code generation throughput when running the Qwen-3 Coder 32B model locally. The analysis finds that llama.cpp achieves approximately 70% higher throughput than Ollama, despite both using the same model weights and hardware. Potential reasons for the gap include differences in CUDA kernels, attention implementations, context or batching defaults, scheduler or multi-GPU utilization, and overhead from Ollama's runtime or API layer. This matters because code generation throughput directly affects computational efficiency and resource utilization when deploying models locally.
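
    For readers who want to reproduce this kind of comparison, a rough sketch of measuring tokens per second against both local servers is shown below; the ports, prompt, and Ollama model tag are assumptions (llama.cpp's server on 8080, Ollama on its default 11434), not the benchmark setup from the post.

      import time
      import requests

      PROMPT = "Write a Python function that parses a CSV file."  # placeholder prompt

      def bench_llamacpp(n_predict: int = 256) -> float:
          """Tokens/sec from a llama.cpp server, assumed to listen on localhost:8080."""
          t0 = time.time()
          r = requests.post("http://localhost:8080/completion",
                            json={"prompt": PROMPT, "n_predict": n_predict}).json()
          tokens = r.get("tokens_predicted", n_predict)
          return tokens / (time.time() - t0)

      def bench_ollama(model: str = "qwen3-coder") -> float:  # placeholder model tag
          """Tokens/sec reported by Ollama's /api/generate on localhost:11434."""
          r = requests.post("http://localhost:11434/api/generate",
                            json={"model": model, "prompt": PROMPT, "stream": False}).json()
          # Ollama reports eval_count (generated tokens) and eval_duration (nanoseconds).
          return r["eval_count"] / (r["eval_duration"] / 1e9)

      if __name__ == "__main__":
          print(f"llama.cpp: {bench_llamacpp():.1f} tok/s")
          print(f"ollama:    {bench_ollama():.1f} tok/s")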

    Read Full Article: Llama.cpp vs Ollama: Code Generation Throughput

  • mlship: Easy Model Serving for Popular ML Frameworks


    [P] mlship – One-command model serving for sklearn, PyTorch, TensorFlow, and HuggingFace

    Python is the leading programming language for machine learning due to its extensive libraries, ease of use, and versatility. C++ and Rust are preferred for performance-critical tasks, with C++ being favored for inference and low-level optimizations, while Rust is noted for its safety features. Julia, Kotlin, Java, and C# are also used, each offering unique advantages for specific platforms or performance needs. Other languages like Go, Swift, Dart, R, SQL, and JavaScript serve niche roles in machine learning, from native code compilation to statistical analysis and web interface development. Understanding the strengths of each language can help in selecting the right tool for specific machine learning tasks.

    Read Full Article: mlship: Easy Model Serving for Popular ML Frameworks

  • Exploring Programming Languages for AI


    Self-Hosted AI in Practice: My Journey with Ollama, Production Challenges, and Discovering KitOps

    Python remains the leading programming language for machine learning due to its comprehensive libraries and user-friendly nature. For tasks requiring high performance, languages like C++ and Rust are favored, with C++ being ideal for inference and low-level optimizations, while Rust offers safety features. Julia, although noted for its performance, is not as widely adopted. Other languages such as Kotlin, Java, and C# are used for platform-specific applications, and Go, Swift, and Dart are chosen for their ability to compile to native code. R and SQL are essential for data analysis and management, and CUDA is utilized for GPU programming to enhance machine learning tasks. JavaScript is commonly used for full-stack machine learning projects, particularly those involving web interfaces. Understanding the strengths and applications of these languages is crucial for selecting the right tool for specific machine learning tasks.

    Read Full Article: Exploring Programming Languages for AI

  • WebGPU LLM in Unity for NPC Interactions


    WebGPU llama.cpp running in browser with Unity to drive NPC interactions (demo)

    An experiment with in-browser local inference using WebGPU has been integrated into a Unity game, where a large language model (LLM) serves as the NPCs' "brain" and drives decisions at interactive rates. Significant modifications were made to the WGSL kernels to reduce reliance on fp16 and to support more operations for forward inference; integration with Unity proved unexpectedly challenging due to Emscripten toolchain mismatches. While the WebGPU build offers a 3x-10x performance boost over CPU depending on hardware, it remains about 10x less efficient than running directly on bare-metal hardware via CUDA. Optimizing the WGSL kernels could help bridge this gap, and further exploration is needed to understand the limits of WebGPU performance. This matters because it highlights both the potential and the challenges of using WebGPU for efficient in-browser AI, which could change how interactive web experiences are built.

    Read Full Article: WebGPU LLM in Unity for NPC Interactions

  • Programming Languages for AI/ML


    Cybersecurity Focussed AI/ML

    Python remains the dominant programming language for machine learning and AI due to its extensive libraries, ease of use, and versatility. However, for performance-critical tasks, languages like C++ and Rust are preferred for their optimization capabilities and safety features. Julia, Kotlin, Java, C#, Go, Swift, and Dart are also utilized for specific applications, such as platform-specific ML tasks or when native code performance is needed. Additionally, R and SQL are important for statistical analysis and data management, while CUDA is employed for GPU programming to enhance ML task performance. Understanding the strengths and applications of these languages is crucial for optimizing machine learning and AI projects.

    Read Full Article: Programming Languages for AI/ML

  • Semantic Caching for AI and LLMs


    Semantic Caching Explained: A Complete Guide for AI, LLMs, and RAG Systems

    Semantic caching is a technique used to enhance the efficiency of AI, large language models (LLMs), and retrieval-augmented generation (RAG) systems by storing and reusing previously computed results. Unlike traditional caching, which relies on exact matching of queries, semantic caching leverages the meaning and context of queries, enabling systems to handle similar or related queries more effectively. This approach reduces computational overhead and improves response times, making it particularly valuable in environments where quick access to information is crucial. Understanding semantic caching is essential for optimizing the performance of AI systems and ensuring they can scale to meet increasing demands.
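
    A minimal sketch of the core idea follows: embed each query, and reuse a cached answer whenever cosine similarity to a previously seen query exceeds a threshold. The embedding function and the 0.9 threshold are illustrative assumptions, not details from the article.

      import numpy as np

      class SemanticCache:
          """Reuse answers for queries whose embeddings are close to a cached query."""

          def __init__(self, embed_fn, threshold: float = 0.9):
              self.embed_fn = embed_fn          # any text -> vector function, supplied by the caller
              self.threshold = threshold        # cosine-similarity cutoff; tuning is use-case specific
              self.keys = []                    # cached query embeddings (unit-normalized)
              self.values = []                  # cached answers

          def get(self, query: str):
              if not self.keys:
                  return None
              q = self.embed_fn(query)
              q = q / np.linalg.norm(q)
              sims = np.stack(self.keys) @ q    # cosine similarity against all cached queries
              best = int(np.argmax(sims))
              return self.values[best] if sims[best] >= self.threshold else None

          def put(self, query: str, answer: str):
              k = self.embed_fn(query)
              self.keys.append(k / np.linalg.norm(k))
              self.values.append(answer)

      # Usage: consult the cache before calling the LLM, then store fresh answers with put().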

    Read Full Article: Semantic Caching for AI and LLMs

  • CNN in x86 Assembly: Cat vs Dog Classifier


    I implemented a Convolutional Neural Network (CNN) from scratch entirely in x86 Assembly, Cat vs Dog Classifier

    An ambitious project involved implementing a Convolutional Neural Network (CNN) from scratch in x86-64 assembly to classify images of cats and dogs, using a dataset of 25,000 RGB images. The project aimed to deeply understand CNNs by focusing on low-level operations such as memory layout, data movement, and SIMD arithmetic, without relying on any machine learning frameworks or libraries. Key components like Conv2D, MaxPool, Dense layers, activations, forward and backward propagation, and the data loader were developed in pure assembly, achieving a performance approximately 10 times faster than a NumPy version. Despite the challenges of debugging at this scale, the implementation successfully runs inside a lightweight Debian Slim Docker container, showcasing a unique blend of low-level programming and machine learning. This matters because it demonstrates the potential for significant performance improvements in neural networks through low-level optimizations.
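
    The post benchmarks against a NumPy baseline; for readers unfamiliar with what the hand-written SIMD loops compute, a textbook-style NumPy reference for the Conv2D and MaxPool forward passes is sketched below (a generic illustration, not the author's code).

      import numpy as np

      def conv2d(x, w, b):
          """Naive valid convolution: x (H, W, C_in), w (kh, kw, C_in, C_out), b (C_out,)."""
          kh, kw, _, c_out = w.shape
          out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
          out = np.zeros((out_h, out_w, c_out), dtype=x.dtype)
          for i in range(out_h):
              for j in range(out_w):
                  patch = x[i:i + kh, j:j + kw, :]                   # receptive field
                  out[i, j, :] = np.tensordot(patch, w, axes=3) + b  # contract kh, kw, C_in
          return out

      def maxpool2d(x, size=2):
          """Max pooling with a size x size window and matching stride."""
          h, w, c = x.shape
          x = x[:h - h % size, :w - w % size, :]                     # drop ragged edges
          return x.reshape(h // size, size, w // size, size, c).max(axis=(1, 3))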

    Read Full Article: CNN in x86 Assembly: Cat vs Dog Classifier

  • Four Ways to Run ONNX AI Models on GPU with CUDA


    Not One, Not Two, Not Even Three, but Four Ways to Run an ONNX AI Model on GPU with CUDA

    Running ONNX AI models on GPUs with CUDA can be achieved through four distinct methods, enhancing flexibility and performance for machine learning operations. These methods include using ONNX Runtime with the CUDA execution provider, leveraging TensorRT for optimized inference, employing PyTorch with its ONNX export capabilities, and utilizing the NVIDIA Triton Inference Server for scalable deployment. Each approach offers unique advantages, such as improved speed, ease of integration, or scalability, catering to different needs in AI model deployment. Understanding these options is crucial for optimizing AI workloads and ensuring efficient use of GPU resources.
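
    The simplest of the four routes, ONNX Runtime with the CUDA execution provider, looks roughly like the sketch below; the model path and input shape are placeholders, not taken from the article.

      import numpy as np
      import onnxruntime as ort

      # Prefer the CUDA execution provider and fall back to CPU if CUDA is unavailable.
      session = ort.InferenceSession(
          "model.onnx",  # placeholder path
          providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
      )

      input_name = session.get_inputs()[0].name
      dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # input shape depends on the model
      outputs = session.run(None, {input_name: dummy})
      print(outputs[0].shape)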

    Read Full Article: Four Ways to Run ONNX AI Models on GPU with CUDA

  • Autoscaling RAG Components on Kubernetes


    Retrieval-augmented generation (RAG) systems enhance the accuracy of AI agents by using a knowledge base to provide context to large language models (LLMs). The NVIDIA RAG Blueprint facilitates RAG deployment in enterprise settings, offering modular components for ingestion, vectorization, retrieval, and generation, along with options for metadata filtering and multimodal embedding. RAG workloads can be unpredictable, requiring autoscaling to manage resource allocation efficiently during peak and off-peak times. By leveraging Kubernetes Horizontal Pod Autoscaling (HPA), organizations can autoscale NVIDIA NIM microservices like Nemotron LLM, Rerank, and Embed based on custom metrics, ensuring performance meets service level agreements (SLAs) even during demand surges. Understanding and implementing autoscaling in RAG systems is crucial for maintaining efficient resource use and optimal service performance.
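
    As a rough illustration of the HPA side, the sketch below builds an autoscaling/v2 HorizontalPodAutoscaler as a Python dict and dumps it to YAML for kubectl; the deployment name, namespace, replica bounds, and custom metric are illustrative assumptions, not values from the NVIDIA blueprint.

      import yaml

      # Illustrative HPA for an LLM-serving deployment; every name below is a placeholder.
      hpa = {
          "apiVersion": "autoscaling/v2",
          "kind": "HorizontalPodAutoscaler",
          "metadata": {"name": "nim-llm-hpa", "namespace": "rag"},
          "spec": {
              "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "nim-llm"},
              "minReplicas": 1,
              "maxReplicas": 8,
              "metrics": [{
                  # Scale on a custom per-pod metric (e.g. request queue depth) exposed
                  # through the custom metrics API; the metric name is a placeholder.
                  "type": "Pods",
                  "pods": {
                      "metric": {"name": "num_requests_waiting"},
                      "target": {"type": "AverageValue", "averageValue": "10"},
                  },
              }],
          },
      }

      print(yaml.safe_dump(hpa, sort_keys=False))  # pipe to: kubectl apply -f -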

    Read Full Article: Autoscaling RAG Components on Kubernetes