Claude 3 Opus vs GPT-4 vs Gemini 1.5 Pro: AI Models Tested

Continuing our comparison between Gemini 1.5 Pro and GPT-4, we now turn our attention to Anthropic’s Claude 3 Opus model. According to the company, Claude 3 Opus has surpassed OpenAI’s GPT-4 model on popular benchmarks. To verify these claims, we meticulously compared Claude 3 Opus, GPT-4, and Gemini 1.5 Pro.

To see how the Claude 3 Opus model performs in advanced reasoning, mathematics, long-context data analysis, image processing, and more, check out our detailed comparison below.

1. The Apple Test

I have 3 apples today, and I consumed one yesterday. How many apples remain?

Let’s begin with the well-known Apple test, which probes the reasoning ability of LLMs. Claude 3 Opus answers correctly, stating that you still have three apples. However, to elicit the correct answer, I had to customize the prompt, telling the model that it is an adept assistant skilled in advanced reasoning. Without that adjustment, the Opus model gave an incorrect response. Gemini 1.5 Pro and GPT-4, by contrast, answered correctly, consistent with our previous evaluations.

Winner: Claude 3 Opus, Gemini 1.5 Pro, and GPT-4
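If you want to reproduce the system-prompt tweak, here is a minimal sketch using Anthropic’s Python SDK. The model ID is the Opus release available at the time of testing, and the system wording is paraphrased from our run, not the exact prompt:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=100,
    # Without a system prompt like this, Opus answered incorrectly in our run
    system="You are an adept assistant skilled in advanced reasoning.",
    messages=[{
        "role": "user",
        "content": "I have 3 apples today, and I consumed one yesterday. "
                   "How many apples remain?",
    }],
)
print(message.content[0].text)  # expected answer: you still have 3 apples
```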

2. Calculate the Time

If it takes 1 hour to dry 15 towels under the Sun, how long will it take to dry 20 towels?

Our aim here is to test the models for a spark of common sense. Unfortunately, both Claude 3 Opus and Gemini 1.5 Pro fail. The prompt warns that the question might be tricky and urges careful consideration, yet Opus dives into the math and reaches an incorrect conclusion. The trick, of course, is that towels dry in parallel: given enough space, 20 towels take the same one hour as 15.

GPT-4 also got this wrong in our earlier test. Since that piece was published, its output has varied, often incorrect and occasionally correct. Rerunning the prompt this morning again yielded an incorrect answer from GPT-4, despite instructions to abstain from using the Code Interpreter.

Winner: None

3. Evaluate the Weight

What's heavier, a kilo of feathers or a pound of steel?

We asked all three AI models if a kilo of feathers is heavier than a pound of steel. Claude 3 Opus gave a wrong answer, stating they weigh the same.

Gemini 1.5 Pro and GPT-4 responded correctly: a kilo of feathers is heavier, since one kilogram is roughly 2.2 pounds and therefore has more than twice the mass of a pound of steel.

Winner: Gemini 1.5 Pro and GPT-4

4. Solve a Maths Problem

If x and y are the tens and units digits, respectively, of the product 725,278 * 67,066, what is the value of x + y? Can you explain the simplest solution without calculating the whole number?

In our next test, we asked the Claude 3 Opus model to solve a math problem without calculating the whole product, and it failed. Every time I ran the prompt, with or without a system prompt, it gave a wrong answer.
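For reference, the shortcut the prompt asks for is simple modular arithmetic: the last two digits of a product depend only on the last two digits of its factors. Since 78 × 66 = 5,148, the product ends in 48, so x = 4, y = 8, and x + y = 12. A quick Python check:

```python
# The last two digits of a product depend only on each factor mod 100.
a, b = 725_278, 67_066
shortcut = (a % 100) * (b % 100) % 100  # 78 * 66 = 5148 -> ends in 48
assert shortcut == (a * b) % 100        # matches the full product
x, y = divmod(shortcut, 10)             # tens digit 4, units digit 8
print(x + y)                            # 12
```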

The failure is surprising given Claude 3 Opus’s 60.1% score on the MATH benchmark, which outranks GPT-4 (52.9%) and Gemini 1.0 Ultra (53.2%).

Chain-of-thought prompting can coax better results out of the Claude 3 Opus model; a sketch of such a prompt follows below. With plain zero-shot prompting, though, only GPT-4 and Gemini 1.5 Pro gave correct answers.
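A chain-of-thought nudge can be as simple as telling the model how to reason before it answers. The wording below is illustrative, not the exact prompt we used:

```python
# An illustrative chain-of-thought rewrite of the math question.
cot_prompt = (
    "If x and y are the tens and units digits of the product "
    "725,278 * 67,066, what is x + y? Think step by step: first decide "
    "which digits of each factor determine the last two digits of the "
    "product, compute only those, then report x + y."
)
# Swap this in for the plain question in the user message shown earlier.
```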

Winner: Gemini 1.5 Pro and GPT-4

5. Follow User Instructions

Generate 10 sentences that end with "apple"

When it comes to following user instructions, the Claude 3 Opus model performs remarkably well, outdoing every other model we tested. Asked to generate 10 sentences ending with the word “apple”, it produces 10 sentences that each conclude logically with it.

In comparison, GPT-4 generates nine such sentences, and Gemini 1.5 Pro performs the worst, struggling to generate even three. Claude 3 Opus is a solid option if user instruction adherence is essential to your task.

This was evident when an X user tasked Claude 3 Opus with following multiple complex instructions to create a book chapter on Andrej Karpathy’s Tokenizer video. The Opus model executed the task impressively, crafting a chapter with instructions, examples, and relevant images.

Winner: Claude 3 Opus

6. Needle In a Haystack (NIAH) Test

Anthropic has been one of the vendors pushing AI models toward extensive context windows. While Gemini 1.5 Pro loads up to a million tokens (in preview), Claude 3 Opus has a 200K-token context window. Anthropic’s internal NIAH findings indicate that Opus retrieved the needle with over 99% accuracy.

In our 8K-token test, however, Claude 3 Opus failed to find the needle, while GPT-4 and Gemini 1.5 Pro succeeded easily. We also tested Claude 3 Sonnet, and it failed as well. More extensive testing of the Claude 3 models is needed to gauge their performance on long-context data, but for now the results don’t look good for Anthropic.
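For context, a NIAH test is easy to construct: bury one out-of-place fact (the needle) inside a long filler document (the haystack) and ask the model to retrieve it. Here is a minimal sketch of that setup, assuming filler.txt is roughly 8K tokens of unrelated text (the needle wording is hypothetical, not the one we used):

```python
NEEDLE = "The secret passcode for the vault is 7419."

filler = open("filler.txt").read()  # ~8K tokens of unrelated text
mid = len(filler) // 2
haystack = filler[:mid] + "\n" + NEEDLE + "\n" + filler[mid:]

prompt = (
    haystack
    + "\n\nBased only on the document above, what is the secret passcode "
    "for the vault?"
)
# Send `prompt` to each model; a pass means the reply contains "7419".
```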

Winner: Gemini 1.5 Pro and GPT-4

7. Movie Guessing (Vision Test)

Claude 3 Opus, a versatile multimodal model, excels in image analysis. We presented it with a still from Google’s Gemini demo and tasked it with guessing the movie. It correctly identified it as “Breakfast at Tiffany’s.” Congratulations to Anthropic!

Interestingly, GPT-4 also identified the movie correctly, while Gemini 1.5 Pro got it wrong; Google’s model remains hit-or-miss on tasks like this. Claude 3 Opus, meanwhile, demonstrates impressive image-processing capability, comparable to GPT-4’s.

Given the visual cues, can you name the movie?
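If you want to run the same test yourself, Anthropic’s Messages API accepts base64-encoded images alongside text. A minimal sketch, where the file name is a placeholder for your own movie still:

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
image_b64 = base64.standard_b64encode(
    open("movie_still.jpg", "rb").read()  # placeholder path
).decode()

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64}},
            {"type": "text", "text": "Which movie is this still from?"},
        ],
    }],
)
print(message.content[0].text)
```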

Winners: Claude 3 Opus and GPT-4

Final Thoughts

Our tests show that Claude 3 Opus is a capable model, but it falls short in areas where it was expected to excel. In commonsense reasoning, Opus underperforms GPT-4 and Gemini 1.5 Pro, and despite its strong adherence to user instructions, it lags in the NIAH and math tests.

It is also worth noting that Anthropic benchmarked Claude 3 Opus against GPT-4’s scores from its initial release in March 2023; as Tolga Bilge pointed out on X, Opus trails GPT-4’s more recent benchmark results.

That said, Claude 3 Opus has its strengths. A user on X reported that Claude 3 Opus translated from Russian to Circassian (a rare language) given just a database of translation pairs. Kevin Fischer shared that Claude 3 understood the nuances of PhD-level quantum physics. Another user demonstrated Claude 3 Opus learning Self-type annotation in one shot, doing better than GPT-4.

Beyond benchmarks, there are specialized areas where Claude 3 excels. Check out the Claude 3 Opus model to see if it fits your workflow. If you have questions, let us know in the comments.