Every model failed this benchmark
Plus, ✨Google's TurboQuant compresses AI memory 6x with zero accuracy loss, how to run tasks on your computer from your phone with Claude, and more!
Welcome back to The Shift. Let’s get straight to what matters in AI today…
Today we have:
🧪 Every Frontier AI Model Just Scored Under 1% on a New Intelligence Test
✨ How to Run Tasks on Your Computer From Your Phone With Claude
⚡ Google's TurboQuant Compresses AI Memory 6x With Zero Accuracy Loss
🔨 Tools you cannot miss
🧪 Every Frontier AI Model Just Scored Under 1% on a New Intelligence Test
The ARC Prize Foundation launched ARC-AGI-3, a benchmark that tests whether AI can learn on the fly by solving puzzles in unfamiliar environments. Every major model failed. Every human passed.

The Shift:
1. All Top Models Scored Near Zero - Gemini 3.1 Pro scored 0.37%. GPT-5.4 hit 0.26%. Claude Opus 4.6 managed 0.25%. Grok 4.2 scored 0%. Meanwhile, 100% of human testers solved every environment on their first attempt with no instructions.
2. This Isn't the Same as Previous Benchmarks - On ARC-AGI-2, these same models score between 65% and 85%. ARC-AGI-3 is fundamentally different. It consists of 135 novel interactive environments where agents must adapt in real time rather than pattern-match from training data.
3. The Methodology Is Controversial - Scoring uses a squared efficiency penalty that critics say is designed to produce low numbers. Extended-thinking models were excluded. Some argue the benchmark is unfair by design.
ARC founder François Chollet's core argument: today's models only perform well when humans build scaffolding around them. If it's truly AGI, there should be no human in the loop. With OpenAI renaming its division "AGI Deployment" and a $2M Kaggle prize now live, this benchmark is a direct challenge to the industry's biggest claims. You can play it here.
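To see why critics say a squared efficiency penalty pushes numbers down, here is a minimal sketch. The formula is an assumption made for illustration (efficiency as the human action count divided by the agent's action count, capped at 1, then squared); the actual ARC-AGI-3 scoring rule is not spelled out in this issue and may differ. The function name and the action counts are hypothetical.

```python
def squared_efficiency_score(human_actions: int, agent_actions: int, solved: bool) -> float:
    """Hypothetical scoring rule: efficiency versus a human baseline, squared.

    This is an illustrative assumption, not the official ARC-AGI-3 formula.
    """
    if not solved or agent_actions <= 0:
        return 0.0
    # Efficiency is 1.0 when the agent matches (or beats) the human action count.
    efficiency = min(1.0, human_actions / agent_actions)
    # Squaring punishes inefficiency much harder than a linear penalty would.
    return efficiency ** 2


# Assume a human needs 50 actions to finish an environment.
for agent_actions in (50, 100, 500):
    score = squared_efficiency_score(human_actions=50, agent_actions=agent_actions, solved=True)
    print(f"{agent_actions} actions -> {score:.2%}")
```

Under these assumptions, an agent that solves the environment with 10 times as many actions as a human keeps 10% on a linear scale but only 1% once the ratio is squared.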
Together with HubSpot
The Future of AI in Marketing. Your Shortcut to Smarter, Faster Marketing.
Unlock a focused set of AI strategies built to streamline your work and maximize impact. This guide delivers the practical tactics and tools marketers need to start seeing results right away:
7 high-impact AI strategies to accelerate your marketing performance
Practical use cases for content creation, lead gen, and personalization
Expert insights into how top marketers are using AI today
A framework to evaluate and implement AI tools efficiently
Stay ahead of the curve with these top strategies AI helped develop for marketers, built for real-world results.
✨ How to Run Tasks on Your Computer From Your Phone With Claude
Text a task. Leave your desk. Come back to finished work. Here's how to set it up.

Step 1: Enable Claude Computer. Download the Claude desktop app at claude.ai/download. Go to Settings > Desktop App > General > turn on Browser Use > turn on Computer Use. Pro ($20) or Max ($100) plan required.
Step 2: Set up Dispatch. Open the Claude desktop app, click Dispatch in the left sidebar, and connect your phone (iOS or Android). This links your phone to your desktop session.
Step 3: Send a task from your phone. Text Claude exactly what you need. Be specific: which file, which app, which meeting. Vague instructions produce vague results.
The magic prompt: "Hey Claude, [do this task]. I'm away from my desk."
If Claude can't use a connector like Slack or Calendar, add: "Use my computer directly."
Step 4: Let Claude work. Claude opens apps, finds files, switches tools, and completes the task using your actual machine, not copies. It asks permission before touching anything new.
Step 5: Come back to finished work. Add "Text me when it's done" to get a notification. Review the output on your Mac when you're back.
Things worth trying: export a deck as PDF and attach it to a calendar invite, pull metrics into a weekly report, organize your Downloads folder, or update your calendar from Slack messages. You can try Claude Computer here.
⚡ Google's TurboQuant Compresses AI Memory 6x With Zero Accuracy Loss
Google Research introduced TurboQuant, an algorithm that shrinks the memory AI models use during long conversations by over 6x while losing almost no accuracy. It also speeds up processing by up to 8x on Nvidia H100 chips.

The Shift:
1. Why This Problem Matters - AI models keep a running log of every conversation (the key-value cache). As chats get longer, that storage balloons, slowing responses and driving up costs. TurboQuant compresses that log down to 3 bits per stored value without any retraining or fine-tuning.
2. It Scored Perfectly on the Hardest Tests - On needle-in-haystack benchmarks, which test whether a model can find one detail buried in massive text, TurboQuant lost zero accuracy. It also delivered up to 8x faster processing with no extra runtime cost.
3. It Beats Rivals in Search Too - TurboQuant outperformed existing methods in vector search, the technology behind semantic matching in search engines. It achieved better recall without needing dataset-specific tuning that competitors rely on.
The paper, set for ICLR 2026 in April, has implications for anything running on large-scale vector infrastructure. Faster search, cheaper inference, and longer conversations without degradation. For Google-scale systems, this is foundational efficiency work.
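For a feel of what storing values in 3 bits means, here is a rough sketch of plain uniform quantization. This is a generic textbook technique swapped in for illustration, not TurboQuant's actual algorithm, which is not described in this issue. The 16-bit baseline, the vector length, and every function name are assumptions.

```python
import random


def quantize(values, bits=3):
    """Map floats to integer codes in [0, 2**bits - 1] with min/max scaling."""
    levels = (1 << bits) - 1  # 7 steps for 3 bits, so 8 distinct codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


def storage_bytes(num_values, bits):
    """Bytes needed to store num_values entries at the given bit width."""
    return num_values * bits / 8


random.seed(0)
# Stand-in for one slice of cached model state (assumed, not real model data).
vector = [random.gauss(0.0, 1.0) for _ in range(1024)]

codes, lo, scale = quantize(vector, bits=3)
restored = dequantize(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(f"largest code used: {max(codes)}")
print(f"max reconstruction error: {max_error:.3f}")
print(f"16-bit storage: {storage_bytes(len(vector), 16):.0f} bytes")
print(f"3-bit storage:  {storage_bytes(len(vector), 3):.0f} bytes")
```

With an assumed 16-bit baseline, this naive scheme cuts storage from 2048 to 384 bytes and introduces visible rounding error. The 6x and zero-loss figures above refer to TurboQuant's own method, so treat this only as an intuition for the memory arithmetic.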
Together with The Code
Learn how to code faster with AI in 5 mins a day
You're spending 40 hours a week writing code that AI could do in 10.
While you're grinding through pull requests, 200k+ engineers at OpenAI, Google & Meta are using AI to ship faster.
How?
The Code newsletter teaches them exactly which AI tools to use and how to use them.
Here's what you get:
AI coding techniques used by top engineers at top companies in just 5 mins a day
Tools and workflows that cut your coding time in half
Tech insights that keep you 6 months ahead
Sign up and get access to the Ultimate Claude Code guide to ship 5X faster.
🔨 AI Tools for the Shift
💡 Painkiller Ideas - Discover real pain points, validate them with AI research, and follow a proven playbook from idea to revenue.
🏠 Pedra - Instantly stage empty properties with AI in seconds to create realistic, buyer-ready visuals.
🔍 Parse - Deploy autonomous agents that audit your tools like Stripe and GitHub to uncover hidden risks your team missed.
🗺️ Funizy - Plan your perfect day in any city with personalized itineraries tailored to your style.
🎬 Plot Party - Generate high-quality storyboards with consistent characters, styles, and scenes in under five minutes.
🚀 Quick Shifts
🔦 Kimberly-Clark, LG, and Burberry are not smarter than your team; they just stopped relying on tools that cannot hear. Syncly Social catches spoken brand mentions, untagged creator placements, and competitor trend signals inside video audio across TikTok, IG, and YouTube. See what you have been missing and get started free.
🏢 Meta is laying off hundreds of employees across multiple teams as it ramps up massive investments in AI infrastructure, shifting focus away from metaverse initiatives toward its long-term AI strategy.
🎬 Disney’s major bets on AI and the metaverse are facing setbacks, as its Sora partnership collapses and Epic’s metaverse plans stall, raising doubts about its future in emerging digital experiences.
🍎 Apple’s deal with Google gives it access to Gemini for training smaller, efficient AI models, allowing Apple to build optimized “student” models tailored for on-device performance.
🤖 Reddit will require accounts with suspicious, bot-like behavior to verify they’re human through methods like biometrics or passkeys, alongside new labels to identify registered automated profiles.
🔊 Mistral, a French AI company, launched Voxtral TTS, an open-source speech model that supports nine languages, enables realistic voice cloning from short samples, and delivers fast, real-time performance for enterprise voice applications.
🧩 Prompt of the Day
Form Field Reduction Strategy
Simplify forms to reduce friction, increase completion rates, and improve conversions.
Paste the prompt: Drop this into ChatGPT and fill in your form details.
Prompt to paste
Create a form field reduction strategy for [Insert signup or checkout flow]. Include:
Current Fields: [List all existing form fields]
Necessary vs Optional: [Mark each field as required or optional]
Reduction Plan: [Suggest fewer fields and simplifications for faster completion]
Use Case: Improve form completion rate, reduce friction, and increase conversions.
That’s all for today’s edition. See you tomorrow as we track down everything that matters in the daily AI Shift!


