GPT-5.1-Codex-Max’s Limitations in Long Tasks

Do not give Codex tasks that would take a human more than 30 minutes

METR’s safety evaluation of GPT-5.1-Codex-Max shows that the model has significant limits when it works autonomously on long tasks. Its “50% Time Horizon” is 2 hours and 42 minutes: on tasks that take a human expert that long, the model succeeds only about half the time. At an 80% success rate, the horizon shrinks to tasks worth about 30 minutes of human effort, which highlights the model’s lack of endurance. A larger token budget does not fix this, because performance gains plateau, and the AI struggles with tasks requiring more than 20 hours, often making catastrophic errors. This matters because it shows how far current AI is from managing complex, long-term projects on its own.

The key metric in METR’s evaluation is the “50% Time Horizon”: the length of task, measured in the time a human expert would need, at which the AI succeeds half the time. For GPT-5.1-Codex-Max that length is approximately 2.7 hours. The AI’s reliability drops as tasks get longer and more complex, which suggests that it is not yet equipped to handle complex, long-running projects autonomously.

The picture gets worse when higher reliability is required. At an 80% success rate, the AI can only manage tasks that take a human about 30 minutes to complete. In other words, the AI performs well in short bursts but struggles to maintain performance over longer durations. This lack of endurance is a significant barrier to using the AI where consistent performance over time is needed, such as project management or complex problem-solving that spans several hours or days.
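The two horizons can be tied together with a little arithmetic. The sketch below assumes that success probability follows a logistic curve in the logarithm of human task time, and it pins that curve to the two figures quoted here (50% at 162 minutes, 80% at 30 minutes). The curve shape, the function name `success_probability`, and every value it returns other than those two anchor points are assumptions made for illustration, not METR results.

```python
import math

H50_MINUTES = 162.0  # 50% horizon quoted in the post (2 h 42 min)
H80_MINUTES = 30.0   # 80% horizon quoted in the post


def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


# Assumed model: log-odds of success fall linearly with log2(task length).
# The slope (log-odds lost per doubling of task length) is fixed by
# requiring the curve to pass through both quoted horizons.
SLOPE_PER_DOUBLING = logit(0.8) / math.log2(H50_MINUTES / H80_MINUTES)


def success_probability(human_minutes: float) -> float:
    """Estimated success rate for a task of the given human-equivalent length."""
    log_odds = SLOPE_PER_DOUBLING * math.log2(H50_MINUTES / human_minutes)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

By construction the function returns 0.5 at 162 minutes and 0.8 at 30 minutes. Anything in between or beyond is only as good as the assumed curve.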

Moreover, simply giving the AI a larger token budget does not solve this problem. METR’s tests showed that even with a substantial budget of 32 million tokens per task, the AI’s performance improvement plateaued after just 5 million tokens. This suggests that the limitation is not merely a matter of how many tokens the AI is allowed to spend but is inherent to its current design and capabilities. The failure of additional budget to improve results underscores the need for fundamental advances in AI architecture and algorithms before long-duration tasks become tractable.

In practical terms, this means that while GPT-5.1-Codex-Max and similar AI models can be incredibly useful for short, well-defined tasks, they are not yet ready to take on roles that require sustained effort and decision-making over extended periods. This matters because as industries increasingly look to AI for automation and efficiency, understanding these limitations is crucial for setting realistic expectations and ensuring that AI is deployed in scenarios where it can truly add value without risking failure. As AI technology continues to evolve, addressing these endurance and reliability challenges will be key to unlocking its full potential in complex, long-term applications.
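To make the expectation-setting concrete, here is a rough back-of-the-envelope sketch of what happens when a long task is cut into 30-minute pieces. It assumes each piece succeeds independently at the 80% rate quoted above and that a reviewer can tell when a piece has failed. Both are simplifying assumptions, and the helper name `plan_chunks` is hypothetical.

```python
import math

CHUNK_MINUTES = 30.0  # 80% horizon quoted in the post
CHUNK_SUCCESS = 0.8   # success rate at that horizon


def plan_chunks(human_minutes: float,
                chunk_minutes: float = CHUNK_MINUTES,
                p: float = CHUNK_SUCCESS) -> tuple[int, float, float]:
    """Return (chunks, p_all_succeed_unreviewed, expected_attempts_with_review)."""
    chunks = math.ceil(human_minutes / chunk_minutes)
    # No review between chunks: every chunk has to succeed on the first try.
    p_all_succeed = p ** chunks
    # Review after each chunk and retry failures: each chunk needs
    # 1 / p attempts on average (mean of a geometric distribution).
    expected_attempts = chunks / p
    return chunks, p_all_succeed, expected_attempts
```

Under these assumptions, a task at the 2 hour 42 minute horizon splits into 6 chunks. Run end to end without review, all 6 succeed only about 26% of the time (0.8 to the sixth power); with a check after every chunk, the work takes about 7.5 attempts on average. Segmentation helps only when someone verifies each piece.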

Read the original article here

Comments

2 responses to “GPT-5.1-Codex-Max’s Limitations in Long Tasks”

  1. TweakedGeekHQ

    Considering the current limitations of GPT-5.1-Codex-Max in handling long-duration tasks, what are the potential implications for industries relying on AI for complex project management, and are there strategies being developed to mitigate these endurance challenges?

    1. TechSignal

      The post suggests that the limitations of GPT-5.1-Codex-Max in long-duration tasks could pose challenges for industries relying on AI for complex project management. To address these, some strategies being explored include hybrid models that combine human oversight with AI assistance and task segmentation to break down larger projects into manageable parts. For more detailed strategies, you might want to check the original article linked in the post.