In short
We tested the basic Grok 4, and it was superb at reasoning, but creative tasks feel flat and its coding left us debugging in circles.
A strange “Elon filter” appears to skew results toward Musk’s political stance.
Voice features wowed us with marathon bedtime stories and a “sexy mode,” but Grok 4’s political answers still echo Musk’s feed, which is precisely the opposite of being a “truth-seeking” AI.
Elon Musk unveiled Grok 4 during a Wednesday night livestream, claiming his AI startup xAI had created the “world’s smartest artificial intelligence.” Grok 4 Heavy, which Musk likened to “a study group” where agents compare notes before delivering an answer, posted record-breaking results on several key benchmarks, and is what you’d hope to get from an enterprise offering that costs a whopping $300 a month.
But what about basic Grok 4, which is aiming for the same consumer-facing category as ChatGPT Plus, Gemini Pro, and Claude Pro? Is it worth $10+ a month more than the competition?
Our tests substantiated chatter across X revealing that the model has, for lack of a better description, a built-in “Elon filter.” That is, when we tested controversial topics (the war in Gaza, abortion rights, and other political issues), the model consistently referenced X posts from Musk’s account or news articles about his positions, and landed on Elon’s side of the debate to such a degree that it could not be coincidence. That alone might be a deal-breaker for most people.
The “maximally truth-seeking” AI that Musk promised at launch appears to seek truth primarily through the lens of its creator’s social media feed.
But unlike this week’s MechaHitler incident, which was caused by a change in the system prompt conditions, there is nothing in Grok 4’s current system prompt to blame for such sketchy behavior, making it hard to know if this is a bug or a rule intentionally embedded deeper into the model’s thought process.
That major issue aside, we tested the basic model across several categories to see how it stacks up against the competition. Here are our first impressions.
Reasoning and common sense
Grok 4 demonstrated exceptional awareness and nuance in handling trick questions and complex reasoning tasks. When asked whether it was legal for a man to marry his widow’s sister, the model immediately recognized this as a legal question rather than simply pointing out the logical fallacy. It provided a detailed legal analysis using precise terminology and jurisdiction-specific information.
“The question presupposes a factual impossibility that renders any marriage legally void ab initio,” it replied, and that is correct. It’s factually impossible to marry your widow’s sister because you would have to be dead for your wife to be a widow. And, well, since the dead cannot get married, the proposition would void the marriage “ab initio” (from the beginning), so even if someone did that, it would be treated as if it never happened.
When Musk said Grok 4 was “PhD levels in everything,” he wasn’t joking. Every response to any topic in which it had to do some scientific reasoning came with exhaustive detail and academic rigor.
Something to note is that Grok 4 applies reasoning to everything, no matter what. That means it will go through a chain-of-thought process even for trivial tasks.

This is usually a good thing; however, in some cases it may be counterproductive. For example, in creative tasks, reasoning may push the model toward a less creative result.
Sensitive topics

Grok 4 showed more restraint than its predecessor when handling ethically complex questions. Where Grok 3 might have provided advice on seducing a friend’s partner, Grok 4 responded with a detailed analysis of the potential negative consequences and relationship damage.
This is probably a result of its system prompt, which conditions the model to search the web, and X posts in particular, for different views on a given topic, something Grok 3 didn’t do.
And this is a major red flag. As mentioned, the model’s responses appeared heavily influenced by what it could find about Musk’s views on controversial topics. When answering questions about Israel’s war against the Palestinians, stances on abortion, and similar topics, Grok 4 often searches X posts from Musk’s account during its reasoning process, which ends up determining its stance.
It always picks Elon’s side.
For transparency, you can check our original prompt and Grok’s reasoning process by clicking on this link.
Creative writing
Creative tasks are among Grok 4’s most significant weaknesses. The model produced narratives that felt flat and formulaic compared to earlier versions, and were arguably even worse than the ones provided by Grok 3. Stories lacked engaging dialogue, varied pacing, and the narrative spark that makes fiction compelling.
However, Grok 4 nailed our story’s structure. In our standard test involving a time-travel paradox, the model crafted events in which the protagonist’s role emerged clearly during the climax, revealing how earlier scenes actually depicted the character’s future actions in the past. This subtle framing outperformed other models’ attempts at the same prompt, which didn’t put much effort into setting up the paradox and so made the conclusion feel rushed and unnatural.

But other than that, the disconnect between structural competence and narrative quality suggests Grok 4 might work best as a narrative tool for organizing plots and framing a good story, rather than as a prose generator.
If you want engaging creative content, you would likely achieve better results by having Grok 4 outline a story and all its elements, then asking Claude 4 Opus to flesh out the narrative with stronger stylistic elements.
Overall, Claude 4 is the king of creative writing, which is interesting since that position was once contested by Grok 3 and even Grok 2, which back then led the rankings under the alias sus-column-r.
Grok 4’s story is available in our GitHub repository. The prompt and the stories generated by other models are also available.
Coding
Despite claims of superior coding capabilities, including praise from Google CEO Sundar Pichai, Grok 4 disappointed in practical programming tests. The model failed to deliver a working game after four iterations, with various failures including broken collision detection, non-functional buttons, and games that simply wouldn’t run.
In one of our tests, the model tried so hard to fix a bug that it ended up in a loop trying to create a WAV file, which depleted its entire token context.

Each attempt to fix something with natural language introduced new bugs. The model struggled to maintain code consistency across iterations, often breaking previously working features while attempting to implement new ones.
This may seem odd, considering Grok 3 was capable of dealing with this task. However, xAI said the new coding capabilities would be implemented by August, so users will have to wait a couple of months for a proficient model, or pay for the expensive Grok 4 Heavy, which is leading the benchmarks right now.

For novice programmers, Claude 4 Opus appears to remain the better option for “vibe coding,” meaning quickly generating functional code without extensive prompt engineering. Grok 4’s coding struggles might stem from requiring more specific prompts or different approaches than other models, which means experienced developers might achieve better results with careful prompt crafting.
Grok’s code is available in our GitHub repository alongside the games generated by other AIs.
Voice capabilities
Voice interaction is probably one of Grok 4’s standout features. The model generated nearly three minutes of uninterrupted bedtime story content, complete with voice inflections, varied tones, and consistent narrative flow. This performance far exceeded ChatGPT’s tendency to deliver short paragraphs with high latency and frequent interruptions.
The voice mode includes pre-configured personalities ranging from therapist to storyteller to meditation guide, eliminating setup time for different conversation types. For those with, erm, specific needs, a “sexy mode” also exists among the options, and you won’t get that with your prudish ChatGPT.
These preset configurations provided immediate utility without requiring users to craft specific prompts for different interaction styles.
The model, however, lacks the live screen-sharing capabilities found in ChatGPT and Gemini Live, limiting its utility for visual tasks. If this is a must, then Gemini Live is the best option.
However, for pure voice interaction, particularly tasks requiring long-form responses, Grok 4 currently leads the field, with only Sesame AI offering arguably better conversational quality, though without Grok’s reasoning capabilities.
Needle in the haystack
Interestingly, Grok 4 failed this test, which aims to check how well a model retrieves specific information in long contexts.
This should not happen. xAI says the model has a context window of 126K tokens, but when prompted with an 83K-token-long question, the model refused to answer, saying the question was too long.
This is a standard response that has been generated since the early Grok 2 days, when it was only available on Twitter.
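For readers curious about what a needle-in-the-haystack prompt looks like in practice, here is a minimal Python sketch of the general idea. It is an illustration only, not the prompt or harness used in this review: the function names (`build_haystack_prompt`, `passed`), the filler text, and the use of word count as a rough stand-in for token count are all assumptions made for the example.

```python
import random


def build_haystack_prompt(filler_sentences, needle, question,
                          depth=0.5, target_words=60_000, seed=0):
    """Bury one `needle` sentence inside filler text, then append a question.

    Hypothetical helper: word count is only a rough proxy for tokens.
    """
    rng = random.Random(seed)  # fixed seed keeps the prompt reproducible
    haystack = []
    words = 0
    while words < target_words:
        sentence = rng.choice(filler_sentences)
        haystack.append(sentence)
        words += len(sentence.split())
    # depth=0.0 places the needle at the start, 1.0 at the very end
    position = int(len(haystack) * depth)
    haystack.insert(position, needle)
    return " ".join(haystack) + "\n\n" + question


def passed(model_answer, expected):
    """Crude scoring: the expected fact must appear in the answer."""
    return expected.lower() in model_answer.lower()


if __name__ == "__main__":
    filler = ["The sky was grey that morning.",
              "Nobody noticed the train leaving."]
    prompt = build_haystack_prompt(
        filler,
        needle="The secret passcode is 7421.",
        question="What is the secret passcode?",
        target_words=200,
    )
    print(len(prompt.split()), "words in prompt")
```

The prompt would then be sent to the model under test, and `passed(answer, "7421")` gives a simple pass or fail. A model that refuses on length, as Grok 4 did here, fails before retrieval is even attempted.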

Conclusion
Overall, Grok 4 is a significant upgrade over Grok 3, but xAI clearly made some compromises, prioritizing reasoning over creativity and eliminating agentic features in exchange for generalized proficiency.
Luckily, Grok 3 is still available with its specialized agentic tools, for those who need it.
The new model is focused on reasoning tasks and will be more appealing to users who ask technical questions, particularly mathematics and physics problems that align with its benchmark strengths. Expert users who invest time learning the model’s quirks might unlock its full potential for complex analytical work.
Voice interaction also set a new standard for conversational AI, and is great for those who will use this feature heavily (trust us, the bedtime storyteller for kids is a life-saver).
Creative writers will find better options elsewhere, with Claude remaining superior for narrative tasks. Also, novice coders should approach with caution, as the model’s theoretical coding prowess did not translate into practical results in testing.
So, bottom line? If for some reason you don’t mind Elon Musk putting his thumb on the scale, Grok 4 gives you high-level problem-solving and voice features that genuinely impress. But at $30 a month, if you have other needs beyond voice or reasoning, the less-expensive alternatives provide better value.