
Anthropic's Claude 3 Hailed as Most "Human-Like" AI, Whatever That Means

3/10/24

Editorial team at Bits with Brains

Anthropic, the artificial intelligence company founded by former OpenAI researchers, has unveiled Claude 3, a new family of large language models that the company claims "sets new industry benchmarks across a wide range of cognitive tasks."

The Claude-3 suite includes three models of increasing capability: Claude-3 Haiku, Claude-3 Sonnet, and Claude-3 Opus.


Claude-3 Opus, the most advanced and expensive model, reportedly outperforms rivals like OpenAI's GPT-4 and Google's Gemini Ultra on many common AI benchmarks. These include tests of undergraduate-level knowledge (MMLU), graduate-level reasoning (GPQA), grade-school math (GSM8K), and more. Anthropic claims Opus exhibits "near-human levels of comprehension and fluency on complex tasks."


Claude-3 Sonnet is no slouch either. In my (admittedly limited) testing, it bests ChatGPT across most categories.


In fact, all the Claude-3 models show enhanced capabilities compared to previous versions in areas like analysis, forecasting, content creation, code generation, and multilingual conversation. They also feature new multimodal abilities, allowing them to process images, charts, and diagrams in addition to text. Speed and cost-effectiveness have improved as well, with Claude-3 Haiku being touted as the fastest and cheapest model in its intelligence category.


One of Claude-3’s most impressive capabilities is its ability to “find the needle in the haystack”: sifting through vast amounts of text to pinpoint the information most relevant to a given query. This matters for organizations because it can drastically improve efficiency in large information-retrieval tasks. In the legal or financial sectors, for example, Claude-3 could quickly locate specific clauses in contracts or identify trends across large document sets that would otherwise require extensive manual review. Many LLMs with large context windows fail at this task because information buried in the middle of the context often gets lost or overlooked.


This capability is not just about finding a single piece of data but understanding context, relevance, and the interplay of information within a large corpus.
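The needle-in-a-haystack evaluation described above can be reproduced with a few lines of code: hide one unusual fact at a chosen depth inside a long run of filler text, ask the model under test to retrieve it, and score the reply. A minimal sketch follows; the filler sentences, needle, and scoring rule are illustrative choices, not part of any official harness:

```python
import random

def build_haystack(needle: str, filler: list[str], n_filler: int, depth: float) -> str:
    """Assemble a long context with `needle` inserted at fractional `depth`
    (0.0 = start, 1.0 = end) among randomly chosen filler sentences."""
    sentences = [random.choice(filler) for _ in range(n_filler)]
    pos = int(depth * len(sentences))
    sentences.insert(pos, needle)
    return " ".join(sentences)

def found_needle(model_answer: str, expected: str) -> bool:
    """Loose check: did the model's answer contain the expected fact?"""
    return expected.lower() in model_answer.lower()

# Example: embed one fact halfway through 200 filler sentences.
filler = [
    "The weather report predicted light rain over the weekend.",
    "Quarterly revenue figures were broadly in line with forecasts.",
    "The committee adjourned without reaching a decision.",
]
needle = "The secret launch code is PINEAPPLE-42."
context = build_haystack(needle, filler, n_filler=200, depth=0.5)
prompt = f"{context}\n\nQuestion: What is the secret launch code?"
# `prompt` would then be sent to the model under test, and its reply
# scored with found_needle(reply, "PINEAPPLE-42").
```

Sweeping `depth` from 0.0 to 1.0 is what exposes the "lost in the middle" failure mode: many models retrieve needles near the start or end of the context far more reliably than those placed in the middle.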


While Claude-3 performs admirably against a battery of benchmarks, experts caution that benchmark results should be viewed with a degree of skepticism. "How well a model performs on benchmarks doesn't tell you much about how the model 'feels' to use," says AI researcher Simon Willison. Benchmarks are often heavily engineered to highlight a model's strengths. Real-world performance and usability may differ from what scores suggest.


There are also questions about the validity and sufficiency of current AI benchmarks. Many focus on narrow, abstract tasks that don't fully capture the nuances of deploying AI in practice. "People often rely on public tests for comparison, but these tests are pretty abstract and might not always reflect real-world scenarios," notes Ilia Badeev, head of data science at Trevolution Group.


Many argue for more rigorous, real-world testing of AI systems that goes beyond benchmark scores. This could include evaluating models on their ability to handle complex, open-ended problems, reason about physical and social dynamics, and operate safely and reliably in high-stakes domains.


One recent approach administered a verbal adaptation of Norway Mensa's matrix-style IQ test to several AI systems, including Claude-3. The headline finding was that Anthropic's Claude passed the 100 IQ threshold for the first time, scoring an estimated 102 based on answering 13 of 35 questions correctly on average across two test administrations. Previous versions of Claude and other AI systems fell below 100; GPT-4 scored 85. By this metric, Claude-3 does represent a new leap in AI.


Extrapolating from the relative IQ performance of Claude-1, Claude-2, and now Claude-3, some researchers estimate we are roughly four years away from an LLM with an IQ-equivalent of 140, higher than about 98% of the global population.
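That extrapolation is a simple trend-line exercise: fit a line through (year, estimated IQ) points and solve for when it crosses the target score. A sketch of the arithmetic, using made-up historical values (only the latest score of 102 comes from the article) rather than actual measurements:

```python
def year_score_reaches(points: list[tuple[float, float]], target: float) -> float:
    """Fit a least-squares line through (year, score) points and return
    the year at which the fitted line crosses `target`."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    slope = sum((x - mx) * (y - my) for x, y in points) / sum(
        (x - mx) ** 2 for x, _ in points
    )
    intercept = my - slope * mx
    return (target - intercept) / slope

# Illustrative inputs only: successive generations gaining roughly
# 11 IQ points per year, ending at the reported 102.
history = [(2022.0, 80.0), (2023.0, 90.0), (2024.0, 102.0)]
crossing = year_score_reaches(history, 140.0)  # ~2027.5 on these made-up inputs
```

The point of the sketch is how sensitive the answer is: nudging the assumed per-year gain shifts the crossing year by years, which is why such extrapolations deserve heavy hedging.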


Benchmark and IQ-style scores aside, assessing factors like inference speed, memory efficiency, and robustness to distribution shift is also important.


Anthropic itself acknowledges the limitations of benchmarks, noting that "engineers have worked to optimize prompts and few-shot samples for evaluations and reported higher scores." The company says it remains committed to advancing AI safety and steering the technology's development in a positive direction. But striking the right balance between capability and safety remains an ongoing challenge for all model development labs.


Ultimately, while Claude-3's benchmark results are truly impressive, they paint an incomplete picture. For transformative AI systems to be deployed responsibly, we need more comprehensive evaluation frameworks that stress-test models in realistic settings and probe the boundaries of their reasoning, robustness, and alignment.


Organizations looking to implement LLM-enabled applications need to look beyond common benchmarks, as they may not fully reflect real-world performance or usability. Executives should seek to understand how Claude-3 and similar models perform in practical scenarios relevant to their business needs, possibly through pilot projects or trials, before committing to widespread deployment.


While Claude-3 undoubtedly represents a significant step forward for both large language models and Anthropic, and narrows the gap with other leading AI labs, the model's true potential, and its limitations, will only become clear through extensive real-world testing and deployment.


Sources:

[1] https://www.anthropic.com/news/claude-3-family

[2] https://manifold.markets/dominic/will-anthropic-release-claude3-befo

[3] https://www.lesswrong.com/posts/JbE7KynwshwkXPJAJ/anthropic-release-claude-3-claims-greater-than-gpt-4

[4] http://anakin.ai/blog/claude-api-cost/

[5] https://towardsdatascience.com/the-olympics-of-ai-benchmarking-machine-learning-systems-c4b2051fbd2b

[6] https://arstechnica.com/information-technology/2024/03/the-ai-wars-heat-up-with-claude-3-claimed-to-have-near-human-abilities/

[7] https://opencv.org/blog/anthropic-claude-3/

[8] https://arstechnica.com/information-technology/2024/03/the-ai-wars-heat-up-with-claude-3-claimed-to-have-near-human-abilities/2/

[9] https://www.econlib.org/a-chat-with-claude-3/

[10] https://harvard-edge.github.io/cs249r_book/contents/benchmarking/benchmarking.html

[11] https://www.infoq.com/news/2024/03/anthropic-claude-ai/

[12] https://www.forbes.com/sites/alexkonrad/2024/03/04/anthropic-releases-claude-3-claims-beat-openai/?sh=6f62648357bc

[13] https://www.nextplatform.com/2024/03/05/anthropic-fires-off-performance-and-price-salvos-in-ai-war/

[14] https://www.linkedin.com/pulse/deciphering-ai-prowess-deep-dive-language-model-zou-msc-ma-bsc--ea4jc

[15] https://tech.co/news/what-is-claude-3-anthropic

[16] https://en.wikipedia.org/wiki/Claude_%28language_model%29

[17] https://www.understandingai.org/p/claude-3-chatgpt-finally-has-a-serious

[18] https://openreview.net/forum?id=j6NxpQbREA1

[19] https://www.linkedin.com/pulse/anthropics-claude-3-ai-breakthrough-overblown-hype-saumya-snehal-iixnc?trk=article-ssr-frontend-pulse_more-articles_related-content-card

[20] https://www.pymnts.com/news/artificial-intelligence/2024/how-anthropics-new-claude-3-ai-model-stacks-up-against-the-competition/

[21] https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-100-iq

