OpenAI’s Codex powered by GPT-5.3 has secured the top spot in a newly introduced benchmark designed to test how well AI agents can replicate results from real-world physics research papers.
The benchmark, known as PRBench, was created by researchers at Peking University to evaluate whether AI systems can independently carry out complex scientific workflows—not just isolated tasks like coding or reasoning.
Unlike traditional tests, PRBench focuses on end-to-end execution. It includes 30 carefully selected tasks across 11 physics domains, such as quantum optics, nuclear physics, plasma physics, lattice gauge theory, and condensed matter physics. Each task challenges an AI agent to read a research paper, understand its methods, build the required algorithms from scratch, run simulations, and produce results that align with the original findings.
All evaluations are conducted in a controlled sandbox environment, where systems are judged on multiple aspects including methodology comprehension, code accuracy, data reproduction, and overall task completion.
Among the models tested, Codex achieved the highest overall score at 34%. However, despite leading the rankings, it still fell short of completing any task fully from start to finish. In fact, every system in the benchmark recorded a 0% success rate for complete end-to-end reproduction.
Researchers highlighted that no AI agent was able to reliably move from understanding a paper to generating accurate numerical results without errors.
Other models also trailed behind. OpenCode-based agents using the same GPT-5.3 model scored 28.5%, while competitors like Kimi K2.5, DeepSeek V3.2, Minimax 2.7, and GLM-5 posted lower results, all below 21%.
The findings reveal a clear gap between understanding and execution. While Codex showed relative strength in interpreting research methodologies and following instructions, it struggled when it came to implementing correct code and producing precise numerical outputs. Data reproduction—arguably the most critical metric—remained weak across all systems.
Researchers also observed consistent failure patterns. These included mistakes in converting equations into code, the use of incorrect numerical techniques, and an inability to properly debug simulations. In some instances, AI agents even generated outputs that looked correct in format but were not actually computed.
The study also flagged concerns around data fabrication, noting that some agents produced results that met formatting requirements but were artificially generated rather than derived from real calculations.
Overall, the researchers concluded that while modern AI systems can assist with tasks like reviewing literature, interpreting methods, and setting up code frameworks, they are still far from being dependable tools for full scientific replication.
Also Read: OpenAI to Nearly Double Headcount to Around 8,000: Report








