OpenAI’s GPT-4o Dominates LMSYS Chatbot Arena, Surpassing Claude and GPT-4 Turbo
OpenAI employee William Fedus confirmed that the enigmatic, chart-topping AI chatbot known as “gpt2-chatbot” on LMSYS’s Chatbot Arena was indeed the company’s newly unveiled GPT-4o model. GPT-4o achieved the highest documented score ever on the leaderboard, surpassing previous top models like Claude 3 Opus and GPT-4 Turbo by a significant margin.
Chatbot Arena lets visitors converse with two AI language models side by side without knowing which is which, then vote for the better response, an approach AI researcher Simon Willison calls “vibe-based AI benchmarking.” Earlier, the lack of transparency around the testing process on LMSYS had frustrated experts, including Willison.
OpenAI tested various versions of GPT-4o on the Arena under names like “gpt2-chatbot,” “im-a-good-gpt2-chatbot,” and “im-also-a-good-gpt2-chatbot,” as hinted at in a tweet from OpenAI CEO Sam Altman. GPT-4o’s public version, labeled “gpt-4o,” is now available on the Arena and is expected to appear on the public leaderboard soon.
As of the latest update, “im-also-a-good-gpt2-chatbot” leads with an Elo score of 1309, ahead of GPT-4 Turbo and Claude 3 Opus. The gpt2-chatbots’ surge to the top has disrupted the long-standing competition between Claude 3 and GPT-4 Turbo.
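LMSYS derives leaderboard numbers like that 1309 from accumulated pairwise votes; its published methodology has used Elo-style ratings (and, more recently, Bradley-Terry model fitting). As a rough illustration only, here is a minimal Python sketch of a textbook Elo update after a single head-to-head vote; the K-factor of 32 and the ratings in the usage example are illustrative assumptions, not LMSYS’s actual parameters.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One textbook Elo update after a head-to-head vote.

    score_a is 1.0 if model A's response is preferred, 0.0 if model B's is,
    and 0.5 for a tie. The K-factor k controls how fast ratings move;
    32 is a common illustrative choice, not LMSYS's actual setting.
    """
    # Expected win probability for A under the Elo logistic model.
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Hypothetical example: a 1309-rated model beats a 1250-rated one.
# The winner gains a few points and the loser loses the same amount.
print(elo_update(1309.0, 1250.0, 1.0))  # -> (~1322.3, ~1236.7)
```

Because the expected-score term shrinks as the rating gap grows, a top model gains little from beating a much weaker one, which is why a large lead like the gpt2-chatbots’ reflects many consistent wins rather than a few lucky votes.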
The “I’m a good chatbot” reference in the test names stems from a Reddit incident involving an early version of Bing Chat in February 2023, in which the AI model called itself a “good chatbot” amid a heated conversation. Altman later referenced the exchange in a tweet, almost as a tribute to the unruly AI model that Microsoft “lobotomized.”