An experiment with in-browser local inference using WebGPU has been integrated into a Unity game, where a large language model (LLM) serves as the NPCs' "brain," driving decisions at interactive rates. The WGSL kernels required significant modification to reduce reliance on fp16 and to support more operations for forward inference, and integration with Unity proved unexpectedly difficult because of Emscripten toolchain mismatches. While the WebGPU build delivers a 3x-10x speedup over CPU depending on hardware, it remains roughly 10x less efficient than running directly on bare-metal hardware via CUDA. Optimizing the WGSL kernels could help close this gap, and further exploration is needed to understand the limits of WebGPU performance. This matters because it highlights both the potential and the challenges of using WebGPU for efficient in-browser AI, which could reshape how interactive web experiences are developed.
Using WebGPU for local inference in the browser is an exciting development for AI-driven gaming. By integrating a language model as the decision-making "brain" for non-player characters (NPCs) in a Unity game, the project demonstrates the potential for more interactive and dynamic gameplay: NPCs can make decisions in real time, deepening the game's immersion and complexity. WebGPU, designed to bring high-performance graphics and compute capabilities to web applications, reflects the broader push to harness serious computing power directly within the browser.
One of the key technical challenges was integrating WebGPU with Unity, largely because of the Emscripten toolchain. Emscripten compiles C and C++ to WebAssembly, which is essential for running such code in web environments, but Unity's WebGL pipeline bundles its own pinned Emscripten version, so code built with a different release can clash with it. The mismatches and configuration issues encountered underscore the difficulty of bridging separate development ecosystems. Overcoming them by building a self-contained WebAssembly (WASM) module shows how future projects might take the same route for complex web-based applications.
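To make the self-contained-module idea concrete, here is a minimal sketch of what such a module's surface might look like, compiled outside Unity's pinned toolchain. The function name, build flags, and loading strategy are all assumptions for illustration, not the project's actual code:

```cpp
// Hypothetical entry point for a self-contained inference module.
// Built separately from Unity's bundled Emscripten to avoid version
// mismatches, then loaded at runtime alongside the Unity player.
//
// Example build (flags assumed; adjust to your Emscripten version):
//   emcc infer.cpp -O3 -sUSE_WEBGPU=1 -sMODULARIZE=1 \
//        -sEXPORT_NAME=createInferModule -o infer.js

#include <emscripten/emscripten.h>
#include <cstdint>

extern "C" {

// Runs one forward pass and returns the sampled token id.
// `tokens`/`count` describe the current context window.
EMSCRIPTEN_KEEPALIVE
int32_t npc_next_token(const int32_t* tokens, int32_t count) {
    // ... dispatch WGSL compute kernels via the WebGPU C API ...
    return 0; // placeholder
}

} // extern "C"
```

Keeping the boundary down to a few C-style exports like this is what lets the module stay independent of whatever Emscripten version Unity happens to ship.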
Despite the performance improvement WebGPU offers over CPU, it still lags well behind running models on bare-metal hardware with CUDA, and that gap is ripe for further exploration and optimization. The WGSL kernels, which execute WebGPU's compute operations, are the most obvious place to look for gains. Understanding the limits of WebGPU performance and pushing against them could make in-browser local inference a viable option for a much wider range of applications.
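To ground the fp16 point, here is a minimal fp32 matrix-vector kernel in WGSL, embedded as a C++ string constant the way Emscripten-built code commonly ships shaders. The binding layout and names are illustrative assumptions, not the project's actual kernels; WGSL's f16 type sits behind the optional shader-f16 feature, which not every adapter exposes, which is one reason to fall back to f32:

```cpp
// A naive fp32 matrix-vector multiply in WGSL (illustrative, untuned).
// One invocation per output row; dims = (rows, cols).
static const char* kMatVecWGSL = R"(
@group(0) @binding(0) var<storage, read>       weights : array<f32>; // rows*cols
@group(0) @binding(1) var<storage, read>       input   : array<f32>; // cols
@group(0) @binding(2) var<storage, read_write> output  : array<f32>; // rows
@group(0) @binding(3) var<uniform>             dims    : vec2<u32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    let row = gid.x;
    if (row >= dims.x) { return; }
    var acc = 0.0;
    for (var c = 0u; c < dims.y; c = c + 1u) {
        acc = acc + weights[row * dims.y + c] * input[c];
    }
    output[row] = acc;
}
)";
```

A kernel this naive leaves a lot on the table (no tiling, no shared-memory reuse, no vectorized loads), which is exactly the kind of headroom the paragraph above is pointing at.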
Benchmarks and performance metrics such as tokens per second (tok/s) and first-token latency are crucial for evaluating this approach. Comparing performance across CPU, CUDA, and WebGPU reveals the strengths and weaknesses of each platform, and sharing stability and performance tips, along with any non-obvious pitfalls encountered, can benefit other developers working with the same stack. The ability to perform local inference directly in the browser opens up new possibilities for AI-driven applications, making this exploration not only a technical achievement but also a step towards more accessible and interactive web experiences.
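As a rough sketch of how such numbers could be collected in an Emscripten build, assuming the hypothetical npc_next_token entry point above blocks until the sampled token has been read back from the GPU:

```cpp
#include <emscripten/emscripten.h> // emscripten_get_now(): monotonic ms
#include <cstdint>
#include <cstdio>

// Forward declaration of the hypothetical entry point sketched earlier.
extern "C" int32_t npc_next_token(const int32_t* tokens, int32_t count);

// Measures first-token latency (prefill) and steady-state throughput.
// Only meaningful if npc_next_token blocks until readback completes;
// fully async dispatches would need GPU-side timestamps instead.
void benchmark(const int32_t* prompt, int32_t prompt_len, int n_tokens) {
    double t0 = emscripten_get_now();
    int32_t tok = npc_next_token(prompt, prompt_len); // prefill + 1st token
    printf("first-token latency: %.1f ms\n", emscripten_get_now() - t0);

    double t1 = emscripten_get_now();
    for (int i = 1; i < n_tokens; ++i) {
        tok = npc_next_token(&tok, 1); // assumes KV cache persists in module
    }
    double secs = (emscripten_get_now() - t1) / 1000.0;
    printf("throughput: %.1f tok/s\n", (n_tokens - 1) / secs);
}
```

Separating prefill time from per-token throughput matters here: the two stress different kernels, and WebGPU's gap versus CUDA may look very different in each phase.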
Read the original article here

