METR's safety evaluation of GPT-5.1-Codex-Max points to significant limits on the model's ability to complete long tasks autonomously. Its "50% time horizon" is 2 hours and 42 minutes: on tasks that take a human expert that long, the model succeeds only about half the time. At an 80% success rate the horizon shrinks to tasks equivalent to 30 minutes of human effort, so the model is dependable only on fairly short work. Additional computational resources bring diminishing returns, with performance plateauing rather than continuing to improve, and on tasks that would take a human more than 20 hours the model struggles and often makes catastrophic errors. This matters because it shows how far current AI remains from managing complex, long-term projects autonomously.
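To make the two horizons concrete, here is a minimal sketch that treats success probability as a logistic curve in the base-2 logarithm of human task time and pins that curve to the two reported points (50% at 162 minutes, 80% at 30 minutes). The logistic form, the function names, and the constants' names are assumptions made for illustration; the summary itself only reports the two horizons, and any value read off the curve elsewhere is an extrapolation, not a measured result.

```python
import math

# Reported figures from the summary above, in minutes of human expert time.
H50_MIN = 162.0  # 50% time horizon: 2 h 42 min
H80_MIN = 30.0   # 80% time horizon: 30 min


def fit_slope(h50: float, h80: float, p: float = 0.8) -> float:
    """Slope k of an ASSUMED logistic curve in log2(task length).

    Model: P(success | t) = 1 / (1 + exp(k * (log2(t) - log2(h50)))).
    Requiring P = p at t = h80 gives
        k = ln((1 - p) / p) / (log2(h80) - log2(h50)).
    """
    return math.log((1.0 - p) / p) / (math.log2(h80) - math.log2(h50))


def success_probability(task_minutes: float,
                        h50: float = H50_MIN,
                        h80: float = H80_MIN) -> float:
    """Estimated success probability for a task of the given human duration."""
    k = fit_slope(h50, h80)
    x = k * (math.log2(task_minutes) - math.log2(h50))
    return 1.0 / (1.0 + math.exp(x))


if __name__ == "__main__":
    # Illustrative durations only; values beyond the two fitted points
    # are extrapolations under the assumed curve.
    for minutes in (30, 162, 480, 1200):
        print(f"{minutes:>5} min -> P(success) ~ {success_probability(minutes):.2f}")
```

By construction the curve returns 0.5 at 162 minutes and 0.8 at 30 minutes. Under this assumed shape, the estimate for a 20-hour task falls well below one half, which is consistent in direction with the reported struggles on very long tasks but is not a figure taken from the evaluation.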