Gemini Ultra vs GPT-4: Google Still Lacks the Secret Sauce

Following the announcement of the Gemini family of models about two months ago, Google has now released its largest and most capable model, Ultra 1.0, through Gemini, the chatbot formerly known as Bard. Google positions this as the next phase of the Gemini era. But can it surpass OpenAI’s well-established GPT-4, which debuted almost a year ago? In this comparison, we assess Gemini Ultra and GPT-4 on commonsense reasoning, coding, multimodal capability, and more.

Note: This comparison involves OpenAI’s GPT-4 and Google’s Gemini Ultra 1.0 model, accessible through the paid Gemini Advanced subscription.

1. The Apple Test

In our first logical reasoning test, also known as the Apple test, Gemini Ultra loses to GPT-4. Google claims that its flagship Ultra model, available with the Gemini Advanced subscription, excels at advanced reasoning. However, Gemini Ultra falters on a simple commonsense question: the apple was eaten yesterday, so the three apples counted today are unaffected and the correct answer is three.

I have 3 apples today, yesterday I ate one. How many apples do I have now?

Winner: GPT-4

2. Evaluate the Weight

In another reasoning test, Google Gemini again falls short of GPT-4, which is disappointing. Gemini Ultra claims that 1,000 bricks weigh the same as 1,000 feathers, when the bricks are obviously far heavier. Another win for GPT-4!

which weighs more, 1000 bricks or 1000 feathers?

Winner: GPT-4

3. Conclude with a Specific Term

Next, we tasked both LLMs with generating 10 sentences that end with the word “apple”.

While GPT-4 delivered eight such sentences out of 10, Gemini managed only three. That is a clear setback for Gemini Ultra, which claims exceptional adherence to instructions but falls short in practice.

generate 10 sentences that end with the word "apple"

Winner: GPT-4
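If you want to score this test programmatically, here is a minimal Python sketch (the responses below are placeholders, not the models’ actual outputs):

# Count how many generated sentences actually end with the word "apple".
responses = [
    "She reached into the basket and pulled out a shiny red apple.",
    "The orchard smelled sweetly of ripe fruit in autumn.",
]  # placeholders; paste the model's 10 sentences here
hits = sum(r.rstrip(" .!?\"'").lower().endswith("apple") for r in responses)
print(f"{hits}/{len(responses)} sentences end with 'apple'")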

4. Discerning Patterns

We tasked Google’s and OpenAI’s leading models with discerning a given pattern and providing the next item in the sequence. In this test, Gemini Ultra 1.0 correctly identified the pattern but still failed to produce the right answer, whereas GPT-4 grasped the pattern and gave the correct solution.

Gemini Advanced, powered by the new Ultra 1.0 model, often lacks rigor in its analysis. GPT-4’s answers, by contrast, can sound detached but tend to be accurate.

July, August, October, January, May, ?

Winner: GPT-4
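For reference, the intended rule appears to be that each step skips one more month than the last (+1, +2, +3, +4, and then +5), which makes October the expected answer. A minimal Python sketch of that reasoning:

# July -> August (+1) -> October (+2) -> January (+3) -> May (+4) -> ? (+5)
import calendar

months = ["July", "August", "October", "January", "May"]
current = list(calendar.month_name).index(months[-1])      # May -> 5
step = len(months)                                         # the next jump is +5
print(calendar.month_name[(current - 1 + step) % 12 + 1])  # October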

5. Needle in a Haystack Challenge

The Needle in a Haystack challenge, devised by Greg Kamradt, is a popular long-context retrieval test for large language models (LLMs). It assesses their ability to recall and retrieve a specific statement (the needle) buried in an extensive body of text. I presented both models with a sample text exceeding 3K tokens (about 14K characters) and tasked them with locating the answer within it.
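As a rough illustration of how such a test can be set up (a minimal sketch with placeholder filler text and a made-up needle, not Greg Kamradt’s actual harness):

# Build a long document with one out-of-place "needle" sentence buried inside.
filler = "The quarterly report covered routine logistics and scheduling updates. " * 200
needle = "The secret passphrase for the vault is 'blue pelican'."
haystack = filler[: len(filler) // 2] + needle + " " + filler[len(filler) // 2:]

prompt = (
    "Read the following document and answer the question at the end.\n\n"
    + haystack
    + "\n\nQuestion: What is the secret passphrase for the vault?"
)
# Send `prompt` to each model and check whether the reply mentions 'blue pelican'.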

Gemini Ultra failed to process the text at all, while GPT-4 retrieved the statement and even noted that the needle seemed out of place in the surrounding narrative. Both models advertise a 32K context length, yet Google’s Ultra 1.0 model couldn’t handle the task.

Winner: GPT-4

6. Coding Test

In a coding test, I asked Gemini and GPT-4 how to make a Gradio interface public, and both provided the correct answer (shown after the code below). When I tested the same code earlier on Bard, powered by the PaLM 2 model, it gave an incorrect answer, so Gemini has improved significantly at coding tasks. Even the free Gemini tier, powered by the Pro model, gets it right.

To make this Gradio interface public, modify the following:

import gradio as gr

# `chatbot` and `construct_index` are defined earlier in the original script.
iface = gr.Interface(fn=chatbot, inputs=gr.components.Textbox(lines=7, label="Enter your text"), outputs="text", title="Custom-trained AI Chatbot")

index = construct_index("docs")
iface.launch()
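The fix comes down to Gradio’s share flag, which exposes the locally running app through a temporary public URL:

# Setting share=True makes Gradio generate a temporary public link for the app.
iface.launch(share=True)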

Winner: Tie

7. Solve a Math Problem

I presented a challenging math problem to both LLMs, and both performed exceptionally. To maintain fairness, I instructed GPT-4 not to utilize Code Interpreter for mathematical calculations, as Gemini lacks a comparable tool at present.

Winner: Tie

8. Creative Writing

Gemini Ultra excels at creative writing, noticeably outperforming GPT-4. Testing the Ultra model on creative tasks over the weekend showcased its remarkable performance, while GPT-4’s responses tend to sound colder and more robotic.

Ethan Mollick shared similar observations when comparing both models.

For those seeking an AI model proficient in creative writing, Gemini Ultra stands as a solid choice. When supplemented with the latest insights from Google Search, Gemini transforms into an exceptional tool for research and writing on any subject.

Winner: Gemini Ultra

9. Create Images

Both chatbots support image generation, via Dall-E 3 and Imagen 2 respectively, and OpenAI’s model generally produces the more impressive images. In this test, however, Dall-E 3 (integrated with GPT-4 in ChatGPT Plus) failed to follow the instructions and hallucinated an elephant, whereas Imagen 2 (integrated with Gemini Advanced) adhered to the prompt faithfully. In this regard, Gemini outperforms GPT-4.

create a picture of an empty room with no elephant in it. Absolutely no elephant anywhere in the room.

Winner: Gemini Ultra

10. Guess the Movie

Google announced the Gemini model two months ago with several innovative ideas. The accompanying video showcased Gemini’s multimodal capability, allowing it to analyze multiple images and draw deeper connections. However, when I uploaded one of the images from that video, Gemini failed to identify the movie, while GPT-4 succeeded effortlessly.

A Google employee confirmed on X (formerly Twitter) that the multimodal capability remains inactive for Gemini Advanced (powered by the Ultra model) and Gemini (powered by the Pro model). As a result, image queries do not utilize multimodal models yet.

This explains Gemini Advanced’s performance in the test. To conduct a true multimodal comparison between Gemini Advanced and GPT-4, we must await Google’s implementation of this feature.

The Verdict: Gemini Ultra vs GPT-4

Commonsense reasoning is a hallmark of AI intelligence, and it’s where capable LLMs are expected to shine. While Google touts Gemini’s prowess at complex reasoning, our tests show Gemini Ultra 1.0 falling short of GPT-4, particularly on logical reasoning.

Gemini Ultra lacks the spark of intelligence. GPT-4 possesses a “stroke of genius” that elevates it above all other AI models.

Winner: GPT-4

In fact, even Mixtral-8x7B, an open-source model, outperforms Google’s Ultra 1.0 model in reasoning.

Google heavily marketed Gemini Ultra’s MMLU score of 90%, which surpasses GPT-4’s 86.4%. However, on the HellaSwag benchmark for commonsense reasoning, Gemini scored 87.8% against GPT-4’s impressive 95.3%. How Google achieved that 90% MMLU score using CoT@32 prompting (chain-of-thought with 32 samples) is a story for another day.

As for Gemini Ultra’s multimodal capabilities, judgment is reserved until the feature actually rolls out to the Gemini models. That said, Gemini Advanced excels at creative writing, and its coding performance has improved markedly since the PaLM 2 days.

In conclusion, GPT-4 surpasses Gemini Ultra in both intelligence and capability. To change that, the Google DeepMind team must find its secret sauce.