Evaluating LLMs with User Insights: The Search Arena Case
An open-source platform analyzes the performance of search-enhanced language models, highlighting the importance of length, citations and sources in user preferences
Isabella V, 15 April 2025


Search Arena is a real-time evaluation platform for search-enhanced language models, focused on current events and real-world use cases. Drawing on more than 7,000 human votes, it has shown how answer length, the number of citations, and the sources cited shape user preferences.

Key Points:

  • Gemini-2.5-Pro-Grounding and Perplexity-Sonar-Reasoning-Pro-High lead the rankings.
  • User preferences correlate with longer answers that cite more sources.
  • Citation standardization has minimal impact on rankings.
  • The dataset and analysis code are publicly available for further research.


In the ever-changing AI landscape, evaluating large language models (LLMs) requires dynamic, user-centric tools. Search Arena addresses this need by providing a human-feedback-based platform to analyze the performance of search-enhanced LLMs. Unlike static benchmarks, Search Arena focuses on current events and real-world questions, collecting over 11,000 votes across more than 10 models in less than a month.

The collected data, filtered for consistency in citation style, yielded roughly 7,000 head-to-head “battles” between models, which were analyzed with the Bradley-Terry model to compute scores and rankings. The results show that Gemini-2.5-Pro-Grounding and Perplexity-Sonar-Reasoning-Pro-High stand out clearly, followed by other models from Perplexity’s Sonar family and OpenAI’s models.
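
To make the ranking step concrete, the sketch below fits a Bradley-Terry model to a list of pairwise battle outcomes using the standard minorization-maximization update and maps the strengths onto an Elo-like scale. The data format, model names, and rescaling are illustrative assumptions, not the platform’s actual pipeline.

```python
import math
from collections import defaultdict

def bradley_terry_ratings(battles, iters=200):
    """Fit Bradley-Terry strengths from pairwise battles.

    battles: iterable of (model_a, model_b, winner) tuples, where winner
    equals model_a or model_b (ties are simply skipped in this sketch).
    Returns a dict mapping each model to an Elo-like rating.
    """
    wins = defaultdict(int)  # wins[(i, j)] = number of times i beat j
    models = set()
    for a, b, winner in battles:
        models.update((a, b))
        if winner == a:
            wins[(a, b)] += 1
        elif winner == b:
            wins[(b, a)] += 1

    # Start all strengths at 1 and apply the standard MM update.
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            total_wins = sum(wins[(i, j)] for j in models if j != i)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (p[i] + p[j])
                for j in models if j != i
            )
            # A tiny floor keeps zero-win models usable on the log scale below.
            new_p[i] = max(total_wins, 1e-9) / denom if denom > 0 else p[i]
        # Normalize so the geometric mean of strengths stays at 1.
        log_mean = sum(math.log(v) for v in new_p.values()) / len(new_p)
        p = {m: v / math.exp(log_mean) for m, v in new_p.items()}

    # Map strengths onto an Elo-like scale centered at 1000.
    return {m: 1000 + 400 * math.log10(v) for m, v in p.items()}

# Hypothetical usage with made-up battles (model names for illustration only):
battles = [
    ("gemini-2.5-pro-grounding", "sonar-reasoning-pro-high", "gemini-2.5-pro-grounding"),
    ("sonar-reasoning-pro-high", "gpt-4o-search", "sonar-reasoning-pro-high"),
    ("gemini-2.5-pro-grounding", "gpt-4o-search", "gemini-2.5-pro-grounding"),
    ("gpt-4o-search", "sonar-reasoning-pro-high", "gpt-4o-search"),
]
print(bradley_terry_ratings(battles))
```

Leaderboards of this kind typically also bootstrap the set of battles to attach confidence intervals to each rating, which is why adjacent models are often reported as statistically tied.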

A detailed analysis revealed that three characteristics significantly influence user preferences: the length of the answer, the number of citations, and the presence of specific sources such as YouTube and online forums. In particular, Gemini models tend to provide more detailed answers, while Perplexity models cite a greater number of sources; OpenAI’s models, for their part, stand out for a higher frequency of citations from news sources.
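
One way to quantify such effects is to regress vote outcomes on the difference in response features between the two sides of each battle, in the spirit of the style-control analyses used for arena leaderboards. The sketch below is a hypothetical example: the field names (len_a, cites_a, winner) and the choice of two features are assumptions for illustration, not the study’s actual analysis code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_effects(battles):
    """Relate vote outcomes to feature differences between the two answers.

    battles: list of dicts with hypothetical keys 'len_a', 'len_b',
    'cites_a', 'cites_b', and 'winner' ('a' or 'b'). Each row encodes the
    A-minus-B difference in answer length and citation count; a positive
    coefficient means the feature is associated with winning the vote.
    """
    X = np.array([[b["len_a"] - b["len_b"], b["cites_a"] - b["cites_b"]]
                  for b in battles], dtype=float)
    y = np.array([1 if b["winner"] == "a" else 0 for b in battles])

    # Standardize features so coefficient magnitudes are comparable.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)
    clf = LogisticRegression().fit(X, y)
    return dict(zip(["answer_length", "num_citations"], clf.coef_[0]))
```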

Citation style standardization, introduced to mitigate model deanonymization, had minimal impact on rankings, although it showed a statistically significant positive effect on model scores. This suggests that, while it does not drastically alter user preferences, consistent citation presentation can improve how responses are perceived.

To facilitate further research and analysis, the “search-arena-7k” dataset and the analysis code have been made open source. This transparent approach aims to engage the scientific and business communities in the continuous evaluation and improvement of LLMs, fostering a deeper understanding of how users interact with search-enhanced AI.
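
For readers who want to explore the released data, a minimal sketch using the Hugging Face datasets library is shown below; the repository id, split name, and column layout are placeholders and should be checked against the official release.

```python
# Minimal sketch; the repository id, split, and fields are assumptions.
from datasets import load_dataset

ds = load_dataset("lmarena-ai/search-arena-7k", split="train")  # hypothetical repo id
print(ds.column_names)  # inspect which fields (models, votes, citations) are available
print(ds[0])            # look at a single battle record
```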

In an era where access to accurate and timely information is critical, platforms like Search Arena represent an important step toward more realistic, user-centric evaluation tools, helping to shape the future of human-AI interaction.