ModelCypher: Exploring LLM Geometry


ModelCypher is an open-source toolkit for exploring the geometry of small language models, challenging the notion that these models are inherently black boxes. It features cross-architecture adapter transfer via Procrustes alignment and jailbreak detection via entropy divergence, implementing methods from over 46 recent research papers. The hypothesis that Wierzbicka’s “Semantic Primes” would show unique geometric invariance was disproven: distinct concepts, and even random controls, converge strongly (CKA > 0.94) across different models. The tools are documented with analogies to aid understanding, though they provide raw diagnostic metrics rather than user-friendly outputs. This matters because it offers a new way to understand, and potentially improve, language models by examining their geometric properties.

The development of ModelCypher challenges the prevailing notion that large language models (LLMs) are inherently opaque or “black boxes.” By creating a toolkit that delves into the geometry of these models, the initiative seeks to bring transparency to the processes occurring within small language models before they generate tokens. This is significant because understanding the internal workings of LLMs can lead to more effective and ethical applications of AI technology. The toolkit’s ability to measure and utilize the actual geometric properties of these models provides a new perspective on how LLMs function, potentially leading to more refined and controlled AI systems.

One of ModelCypher’s key features is cross-architecture adapter transfer via Procrustes alignment, which finds the orthogonal transformation that best maps one model’s representation space onto another’s. This allows different model architectures to be compared and aligned, providing insight into how they process and interpret the same data. The toolkit also includes jailbreak detection via entropy divergence, which can identify when a model is being manipulated into producing unintended outputs. These features matter for the security and reliability of AI systems: they help prevent misuse and make language models more robust.
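The two techniques above can be sketched in a few lines. This is a minimal illustration, not ModelCypher’s actual implementation: it assumes paired hidden-state matrices from two models on the same inputs for the Procrustes step, and treats “entropy divergence” as a simple gap between next-token entropy profiles, which is a hypothetical reading of that feature.

```python
import numpy as np

def procrustes_align(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target||_F.

    source/target: (n, dim) hidden-state matrices from two models run on
    the same n inputs (a simplified setup; the real pipeline may differ).
    """
    # Classic orthogonal Procrustes solution: R = U @ Vt from the SVD
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def token_entropies(probs):
    """Shannon entropy (nats) of each next-token distribution.

    probs: (seq_len, vocab_size), each row summing to 1.
    """
    p = np.clip(probs, 1e-12, 1.0)  # guard against log(0)
    return -(p * np.log(p)).sum(axis=1)

def entropy_divergence(baseline_probs, probe_probs):
    """Hypothetical jailbreak signal: mean absolute gap between the
    entropy profiles of a benign baseline and a suspect prompt."""
    return np.abs(token_entropies(baseline_probs)
                  - token_entropies(probe_probs)).mean()

# Procrustes recovers a known orthogonal map exactly
rng = np.random.default_rng(0)
a = rng.standard_normal((100, 8))
q, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random rotation
r = procrustes_align(a, a @ q)
print(np.allclose(a @ r, a @ q))  # → True
```

The closed-form SVD solution is what makes Procrustes attractive here: aligning two representation spaces needs no training loop, only one matrix decomposition.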

Interestingly, the toolkit also reports a negative result for the hypothesis that Wierzbicka’s “Semantic Primes” would exhibit unique geometric invariance across models. The data revealed that distinct concepts, and even random controls, showed a high degree of convergence (CKA > 0.94) across models like Qwen, Llama, and Mistral. This suggests the observed convergence is universal rather than linguistic: the shared geometry of these models reflects mathematical properties of training rather than language-specific features. Such findings can reshape our understanding of how language models operate and interact with various types of data.
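The CKA figure quoted above is linear Centered Kernel Alignment, a standard similarity score between two sets of activations. A minimal sketch follows; the probe inputs, shapes, and layer choices are illustrative assumptions, not the experiment the article describes.

```python
import numpy as np

def linear_cka(x, y):
    """Linear Centered Kernel Alignment between activation matrices.

    x: (n, d1), y: (n, d2) — the same n probe inputs run through two
    models. The widths d1 and d2 may differ, which is what lets CKA
    compare layers across architectures (e.g. Qwen vs Llama vs Mistral).
    """
    x = x - x.mean(axis=0)  # center features
    y = y - y.mean(axis=0)
    # HSIC-based formulation for the linear kernel
    hsic_xy = np.linalg.norm(y.T @ x, "fro") ** 2
    hsic_xx = np.linalg.norm(x.T @ x, "fro") ** 2
    hsic_yy = np.linalg.norm(y.T @ y, "fro") ** 2
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)

rng = np.random.default_rng(1)
x = rng.standard_normal((64, 16))
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
print(np.isclose(linear_cka(x, x @ q), 1.0))  # → True: CKA ignores rotations
```

Because CKA is invariant to orthogonal transforms and isotropic scaling, a score near 1 means two models encode the same relational structure even when their raw coordinates look nothing alike, which is exactly the property that makes the “random controls also converge” result a meaningful negative finding.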

ModelCypher is open source under the AGPLv3 license, encouraging collaboration and continuous improvement from the community. This openness not only fosters innovation but also allows for a collective effort in addressing the challenges and limitations of current AI models. By providing raw metrics and precise analogies to explain high-dimensional geometry, the toolkit aims to make complex concepts more accessible. However, it emphasizes that these outputs should be viewed as diagnostic tools rather than end-user applications, akin to an oscilloscope rather than a chatbot. As the project is under active development, contributions and feedback are welcomed, promoting a dynamic and evolving approach to understanding and improving LLMs.
