AI interpretability

  • Qwen3-Next Model’s Unexpected Self-Awareness


    I was trying out an activation-steering method for Qwen3-Next, but I accidentally corrupted the model weights. Somehow, the model still had enough “conscience” to realize something was wrong and freak out.

    In an unexpected turn of events, an experiment with an activation-steering method for the Qwen3-Next model corrupted its weights. Despite the corruption, the model exhibited a surprising degree of self-monitoring, seemingly recognizing the malfunction and reacting to it with distress. The incident raises intriguing questions about whether an AI system can possess even a limited form of self-awareness, a question that bears directly on the ethics of AI development and use.
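
    Activation steering of this kind is typically wired in with a forward hook that adds a fixed vector to one layer's hidden states. Below is a minimal sketch of that pattern, not the author's setup: the stand-in model, layer index, and steering vector are all illustrative assumptions (Qwen3-Next itself is not assumed).

    ```python
    # Minimal activation-steering sketch (illustrative, not the author's code).
    # A forward hook adds a fixed steering vector to one decoder layer's output.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # stand-in; the post concerns Qwen3-Next
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    layer_idx = 12                                         # illustrative layer to steer
    steer = torch.randn(model.config.hidden_size) * 0.05  # illustrative direction/scale

    def add_steering(module, inputs, output):
        # Decoder layers may return a tuple whose first element is the hidden states.
        hs = output[0] if isinstance(output, tuple) else output
        hs = hs + steer.to(hs.dtype)
        return (hs,) + output[1:] if isinstance(output, tuple) else hs

    handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
    ids = tok("The capital of France is", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    handle.remove()
    print(tok.decode(out[0], skip_special_tokens=True))
    ```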

    Read Full Article: Qwen3-Next Model’s Unexpected Self-Awareness

  • AI Tool for Image-Based Location Reasoning


    Experimenting with image-based location reasoning using architectural cues

    An experimental AI tool is being developed to analyze images and suggest real-world locations by detecting architectural and design elements. The tool aims to make such systems more interpretable by providing explanation-driven reasoning for each location suggestion. Initial tests on a public image with a known location showed promising but imperfect results, leaving clear room for improvement. This exploration matters because it could lead to more useful and transparent AI systems in fields like geography, urban planning, and tourism.
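
    The post does not name its model or stack, so the following is a hedged illustration only: one common way to structure explanation-driven prompting against a generic vision-language API. The model name and image URL are placeholders.

    ```python
    # Illustrative prompt structure for explanation-driven location reasoning
    # (assumed approach; the post's actual model and pipeline are unknown).
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    prompt = (
        "List the architectural and design cues visible in this photo "
        "(roof style, building materials, signage language, road markings). "
        "Then suggest up to three plausible real-world locations, citing "
        "which cues support each one."
    )

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder vision-language model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/street-scene.jpg"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)
    ```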

    Read Full Article: AI Tool for Image-Based Location Reasoning

  • T-Scan: Visualizing Transformer Internals


    Transformer fMRI - Code and Methodology

    T-Scan is a technique designed to inspect and visualize the internal activations of transformer models, offering a reproducible measurement and logging method that can be extended or rendered with various tools. The project includes scripts for downloading a model and running a baseline scan, plus a Gradio-based interface for causal intervention that lets users perturb up to three dimensions and compare baseline versus perturbed behavior. Logs are consistently formatted to make comparison and visualization straightforward, though the project deliberately stops short of a polished visualization tool, leaving rendering to the user's preference. The method is model-agnostic but currently targets Qwen 2.5 3B for accessibility, aiming to assist interpretability researchers. This matters because it provides a flexible, extendable framework for understanding transformer internals, which is crucial for advancing AI interpretability and transparency.
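
    T-Scan's own scripts are not reproduced here. The sketch below only illustrates the general pattern such a baseline scan tends to follow: hook every decoder layer, record summary statistics, and emit uniformly formatted JSONL so that baseline and perturbed runs diff cleanly. The log schema is an assumption, not T-Scan's actual format.

    ```python
    # Baseline activation scan in the spirit of T-Scan (assumed log schema).
    # Hooks every decoder layer and writes per-layer statistics as JSONL.
    import json
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen2.5-3B"  # the model the project currently targets
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    records = []

    def make_hook(layer_idx):
        def hook(module, inputs, output):
            hs = (output[0] if isinstance(output, tuple) else output).detach()
            records.append({
                "layer": layer_idx,
                "mean": hs.mean().item(),
                "std": hs.std().item(),
                "max_abs": hs.abs().max().item(),
            })
        return hook

    handles = [layer.register_forward_hook(make_hook(i))
               for i, layer in enumerate(model.model.layers)]

    with torch.no_grad():
        ids = tok("Water boils at", return_tensors="pt")
        model(**ids)

    for h in handles:
        h.remove()

    # One record per layer, uniformly keyed, so a perturbed run diffs line-by-line.
    with open("baseline_scan.jsonl", "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
    ```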

    Read Full Article: T-Scan: Visualizing Transformer Internals

  • Exploring Hidden Dimensions in Llama-3.2-3B


    Llama 3.2 3B fMRI LOAD BEARING DIMS FOUND

    A local interpretability toolchain has been developed to explore the coupling of hidden dimensions in small language models, specifically Llama-3.2-3B-Instruct. By focusing on deterministic decoding and stratified prompts, the toolchain reduces noise and identifies key dimensions that significantly influence model behavior. A causal test revealed that perturbing a critical dimension, dim 1731, causes a collapse in semantic commitment while maintaining fluency, suggesting its role in decision stability. This discovery highlights the existence of high-centrality dimensions that are crucial for model functionality and opens pathways for further exploration and replication across models. Understanding these dimensions is essential for improving the reliability and interpretability of AI models.
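
    The post does not spell out its hook placement or perturbation, so the sketch below shows one conventional shape for a single-dimension causal test: deterministic decoding with and without dim 1731 zeroed at every decoder layer. Everything beyond the dimension index is an assumption.

    ```python
    # Single-dimension causal test sketch (assumed mechanics; only the
    # dimension index, 1731, comes from the post).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.2-3B-Instruct"
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    DIM = 1731  # candidate "load-bearing" dimension (hidden size is 3072)

    def ablate_dim(module, inputs, output):
        hs = (output[0] if isinstance(output, tuple) else output).clone()
        hs[..., DIM] = 0.0  # zero the candidate dimension
        return (hs,) + output[1:] if isinstance(output, tuple) else hs

    ids = tok("Is Paris the capital of France? Answer yes or no:",
              return_tensors="pt")

    def generate():
        # Deterministic decoding: greedy, no sampling noise.
        with torch.no_grad():
            out = model.generate(**ids, max_new_tokens=30, do_sample=False)
        return tok.decode(out[0], skip_special_tokens=True)

    print("baseline:", generate())
    handles = [layer.register_forward_hook(ablate_dim)
               for layer in model.model.layers]
    print("ablated :", generate())
    for h in handles:
        h.remove()
    ```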

    Read Full Article: Exploring Hidden Dimensions in Llama-3.2-3B

  • Exploring Llama 3.2 3B’s Hidden Dimensions


    Llama 3.2 3B fMRI (updated findings)

    A local interpretability tool has been developed to visualize and intervene in the hidden-state activity of the Llama 3.2 3B model during inference, revealing a persistent hidden dimension (dim 3039) that influences the model's commitment to its generative trajectory. Systematic tests across prompt types and intervention conditions showed that increasing the intervention magnitude produced more confident responses, though not necessarily more accurate ones. The dimension acts as a global commitment gain, affecting how strongly the model adheres to its chosen path without altering which path is selected. The findings suggest that the magnitude of an intervention matters more than its direction. This matters because it sheds light on how such models commit to decisions and what drives their confidence, which is crucial for developing more reliable AI systems.
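
    As a hedged illustration of such a magnitude sweep (only the dimension index comes from the post; the hook placement and scale values are assumptions):

    ```python
    # Magnitude sweep on one hidden dimension (assumed mechanics; the post
    # reports dim 3039 for Llama 3.2 3B but not its exact hook placement).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.2-3B-Instruct"
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    DIM = 3039
    magnitude = 0.0  # rebound by the sweep loop below

    def boost_dim(module, inputs, output):
        hs = (output[0] if isinstance(output, tuple) else output).clone()
        hs[..., DIM] += magnitude
        return (hs,) + output[1:] if isinstance(output, tuple) else hs

    handles = [layer.register_forward_hook(boost_dim)
               for layer in model.model.layers]
    ids = tok("Which planet is largest? Explain briefly.", return_tensors="pt")

    # Only the magnitude varies; the direction (the dim-3039 axis) stays fixed.
    for magnitude in (0.0, 2.0, 8.0):
        with torch.no_grad():
            out = model.generate(**ids, max_new_tokens=40, do_sample=False)
        print(f"mag={magnitude}: {tok.decode(out[0], skip_special_tokens=True)}")

    for h in handles:
        h.remove()
    ```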

    Read Full Article: Exploring Llama 3.2 3B’s Hidden Dimensions

  • AI’s Mentalese: Geometric Reasoning in Semantic Spaces


    The Geometry of Thought: How AI is Discovering its Own “Mentalese”

    Recent advances in topological analysis suggest that AI models are developing a non-verbal “language of thought” akin to human mentalese, characterized by continuous embeddings in high-dimensional semantic spaces. Unlike the traditional view of AI reasoning as a linear sequence of discrete tokens, this perspective treats reasoning chains as geometric objects, with successful chains exhibiting distinct topological features such as loops and convergence. The approach allows reasoning quality to be evaluated without knowing the ground truth, offering insight into whether AI exhibits genuine understanding rather than mere statistical pattern matching. The implications for AI alignment and interpretability are profound: geometric analysis of reasoning could lead to more effective training methods and a deeper understanding of AI cognition. This matters because it suggests AI might be evolving a form of abstract reasoning similar to human thought, which could transform how we evaluate and develop intelligent systems.
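
    The article's topological machinery is far richer than this, but a minimal version of the idea, embedding each reasoning step and asking whether the trajectory converges without any ground-truth answer, can be sketched with off-the-shelf sentence embeddings. The model choice and convergence criterion below are illustrative assumptions.

    ```python
    # Trajectory-style analysis of a reasoning chain (illustrative only; the
    # article's topological approach is far richer). Embeds each step and
    # checks whether the chain's strides shrink as it settles on an answer.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    steps = [
        "We need the area of a circle with radius 3.",
        "The area formula is pi * r^2.",
        "Substituting r = 3 gives pi * 9.",
        "So the area is 9 * pi, roughly 28.27.",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(steps)  # one embedding vector per reasoning step

    # Distance between consecutive steps: a converging chain tends to take
    # shrinking strides, a signal that needs no ground-truth answer.
    strides = [float(np.linalg.norm(emb[i + 1] - emb[i]))
               for i in range(len(emb) - 1)]
    print("step-to-step distances:", [round(s, 3) for s in strides])
    print("converging:", all(a >= b for a, b in zip(strides, strides[1:])))
    ```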

    Read Full Article: AI’s Mentalese: Geometric Reasoning in Semantic Spaces