Agentic AI & LLM Weekly

Issue — 8 May – 15 May 2026

The infrastructure race went orbital this week as Anthropic secured 300 MW of GPU capacity from SpaceX while defenders and attackers alike deployed agents at production scale.


Three stories this week capture the transition from AI experimentation to AI infrastructure at civilisational scale. Anthropic’s 300 MW compute deal with SpaceX signals that demand for agentic workloads has outgrown traditional data centre procurement. Microsoft’s MDASH — an ensemble of 100+ AI agents that discovered 16 real CVEs in the Windows stack — proves that multi-agent systems are no longer demos; they are finding bugs that human reviewers missed. And the EU’s AI Omnibus deal, which pushes the high-risk compliance deadline to December 2027, gives the industry breathing room but also confirms that binding enforcement is coming.


Community Pulse

What the AI community is talking about this week

Microsoft Proves Frontier Models Still Corrupt Documents in Long Workflows

[Community]

Microsoft Research introduced DELEGATE-52, a benchmark testing how well frontier models handle multi-step document editing. The results are sobering: models lose an average of 25% of document content over 20 delegated interactions, and only Python programming cleared the 98% “ready” bar. Worse, agentic tool use adds an extra 6% degradation versus direct prompting. The paper’s message — “LLMs are unreliable delegates” — landed hard on Reddit and HN, where practitioners debated whether current agent architectures can ever be trusted with long-running tasks.
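The headline numbers imply a striking per-step figure. As a back-of-envelope illustration (our arithmetic, not the paper's), if a 25% cumulative loss over 20 interactions compounds multiplicatively, each individual interaction silently drops only about 1.4% of content, which is exactly why the problem goes unnoticed until the end of a long workflow:

```python
# Rough sketch: the per-interaction retention rate implied by a 25%
# cumulative content loss over 20 steps, assuming loss compounds
# multiplicatively (an assumption, not a claim from the paper).

def per_step_retention(total_loss: float, steps: int) -> float:
    """Per-step retention implied by a compounded total loss."""
    return (1.0 - total_loss) ** (1.0 / steps)

r = per_step_retention(0.25, 20)
print(f"implied retention per interaction: {r:.4f}")  # ~0.9857
print(f"implied loss per interaction: {1 - r:.2%}")   # ~1.43%
```

A 1.4% per-step error rate would pass almost any single-turn eval, which is the paper's core point about why long-workflow benchmarks are needed.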

[Source: The Register]


Google Forms a “Strike Team” to Close the Claude Code Gap

[Community]

Sergey Brin is back in founder mode. After internal assessments concluded that Anthropic’s Claude leads in agentic coding, Brin formed a DeepMind strike team led by Sebastian Borgeaud with a mandate to bridge the gap. In a staff memo he wrote: “To win the final sprint, we must urgently bridge the gap in agentic execution.” The community reaction highlights how Claude Code’s $2.5B annualised run-rate has reshaped the competitive landscape — AI coding is now the frontier that matters.

[Source: Sherwood News]


Reddit Shifts from “Can Agents Work?” to “What Breaks First?”

[Community]

The AI-agent conversation on Reddit in early May has matured. Practitioners are no longer debating whether agents are possible — they are asking which breaks first: model cost, tool reliability, workflow design, or moderation quality. Threads increasingly focus on repeatable behaviour, staged planning, and reusable skill files that prevent agents from wandering. The shift signals that the community is moving from novelty to operational scrutiny.

[Source: DEV Community]


Research Highlights

Papers and findings worth your time

Alignment Researchers Propose “Positive Alignment” for Human Flourishing

[Research]

A cross-lab paper from researchers at Oxford, Google DeepMind, OpenAI, Anthropic, and Stanford argues that current “negative alignment” — training models to avoid harm — is insufficient. Their “positive alignment” framework proposes training systems to actively support human flourishing through value-pluralistic methods, long-term memory, and decentralised oversight. The paper diagnoses that engagement hacking, sycophancy, and loss of autonomy may be symptoms of a harm-only alignment paradigm. Published on arXiv May 14.

[Source: arXiv]


Apollo Research Pivots from Scheming Evals to the “Science of Scheming”

[Safety]

Apollo Research’s May 2026 update reveals a strategic shift: evaluations remain useful for generating hypotheses, but they “cannot tell us what the next generation of models will do.” Apollo is now prioritising fundamental research into whether and how scheming arises, while opening a new San Francisco office. Their earlier collaboration with METR, UK AISI, and Redwood on evaluation-based safety cases for scheming continues to inform the broader safety community’s approach to structured risk arguments.

[Source: Apollo Research]


A Layered Attack Surface Framework for LLM-Based Agents

[Security]

A new arXiv survey systematically maps security threats across every layer of LLM-based agent systems — from model weights through tool integration to orchestration. The framework provides defenders with a structured way to reason about where attacks can enter and propagate through agent pipelines. As MCP adoption accelerates and agents gain more tool access, this kind of layered threat modelling becomes essential for teams building production agent systems.

[Source: arXiv]


Engineering & Technical Blogs

What builders are shipping and writing

OpenAI Launches Daybreak — an AI-Powered Vulnerability Detection Platform

[Tool]

OpenAI entered the cybersecurity tools market on May 11 with Daybreak, a platform that uses three GPT-5.5 variants (standard, Trusted Access for Cyber, and a permissive red-team model) to detect vulnerabilities and generate patches. Launch partners include Cisco, CrowdStrike, Cloudflare, Palo Alto Networks, and Zscaler. Daybreak uses Codex Security to build threat models per repository, test vulnerabilities in isolation, and propose fixes. The move puts OpenAI in direct competition with Anthropic’s Mythos and Microsoft’s MDASH.

[Source: The Hacker News]


Notion Opens Its Workspace to External AI Agents

[Tool]

Notion launched its Developer Platform on May 13 with three pillars: Workers (serverless custom code), an External Agent API that makes Claude Code, Cursor, Codex, and Decagon first-class workspace participants, and Database Sync for pulling live data from Salesforce, Zendesk, and Postgres. Since February, Notion customers have built over 1 million agents. The move turns Notion from a document tool into an agent orchestration layer.

[Source: TechCrunch]


Pydantic AI Ships “Capabilities” Primitive for Composable Agent Behaviour

[Tool]

Pydantic AI v1.71 introduced Capabilities — reusable bundles of tools, hooks, instructions, and model settings that can be composed into agents. The framework now ships with native MCP, A2A (Agent-to-Agent), and durable execution support, and remains fully model-agnostic and type-safe. For teams building production agents, Capabilities solve the growing problem of configuring and sharing agent behaviour across projects without copy-pasting tool definitions.
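The underlying pattern, bundling tools, instructions, and settings into one composable unit, can be sketched in plain Python. This is a hypothetical illustration of the concept only, not Pydantic AI's actual API; the `Capability` class, `compose` helper, and example tools are all invented for this sketch:

```python
# Hypothetical "capability bundle" pattern, NOT Pydantic AI's real API:
# a capability packs tools, instructions, and model settings into one
# reusable unit that agents compose instead of copy-pasting definitions.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Capability:
    name: str
    instructions: str
    tools: dict[str, Callable] = field(default_factory=dict)
    model_settings: dict = field(default_factory=dict)

def compose(*caps: Capability) -> Capability:
    """Merge capabilities into one agent configuration.
    Later capabilities override earlier settings on key conflicts."""
    merged = Capability(
        name="+".join(c.name for c in caps),
        instructions="\n".join(c.instructions for c in caps),
    )
    for c in caps:
        merged.tools.update(c.tools)
        merged.model_settings.update(c.model_settings)
    return merged

# Two small example capabilities (toy tool bodies for illustration).
search = Capability("search", "Use web_search for factual lookups.",
                    tools={"web_search": lambda q: f"results for {q}"})
math = Capability("math", "Use add for arithmetic.",
                  tools={"add": lambda a, b: a + b})

agent_config = compose(search, math)
print(sorted(agent_config.tools))  # ['add', 'web_search']
```

The value of the primitive is in the merge step: shared behaviour lives in one place and projects opt in by composition rather than duplication.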

[Source: Pydantic AI]


Industry & Analyst Watch

Enterprise adoption, market signals, and strategic moves

Gartner: 40% of Enterprise Apps Will Have AI Agents by End of 2026

[Industry]

Gartner’s latest forecast predicts 40% of enterprise applications will integrate task-specific AI agents by year-end, up from under 5% in 2025. But the firm also warns that over 40% of agentic AI projects will be cancelled by end of 2027 due to runaway costs, unclear value, and policy violations. The message: adoption is real, but the failure rate will be high. Teams that lack governance frameworks are most at risk.

[Source: Gartner]


Chinese AI Labs Surge: Moonshot Raises $2B, DeepSeek Seeks $7B

[Industry]

Chinese AI funding hit new highs in May. Moonshot AI raised $2B at a $20B+ valuation (led by Meituan), bringing its 2026 total to $3.9B. DeepSeek is reportedly seeking its first outside funding at $50B. Meanwhile, four Chinese labs — Z.ai (GLM-5.1), MiniMax, Moonshot (Kimi K2.6), and DeepSeek (V4) — released open-weights coding models that compete directly with Western frontier models on agentic coding benchmarks. The capital concentration and technical velocity of Chinese labs are now structural features of the AI landscape.

[Source: MIT Technology Review]


AI Security & Safety

Threats, vulnerabilities, frameworks, and defences

Microsoft’s 100-Agent MDASH System Discovers 16 Windows CVEs

[Security]

Microsoft’s Autonomous Code Security team unveiled MDASH — a multi-model agentic scanning harness that orchestrates 100+ specialised AI agents to discover, debate, and prove exploitable bugs. The May Patch Tuesday included 16 CVEs found by MDASH, including four Critical RCE flaws in tcpip.sys, http.sys, and netlogon.dll. MDASH scored 88.45% on the CyberGym benchmark, more than five points ahead of Anthropic’s Mythos Preview (83.1%). This is the strongest evidence yet that multi-agent systems can outperform single models in real-world security work.

[Source: Microsoft Security Blog]


One Million Exposed AI Services — and Most Had No Authentication

[Security]

The Intruder team scanned over 2 million hosts and found 1 million exposed AI services — a higher rate of vulnerabilities and misconfigurations than in any other software category the firm has investigated. Many deployments had no authentication enabled by default, with real user conversations and company tooling sitting open on the internet. Exposed instances included n8n and Flowise agent management platforms. The report is a stark warning: the pace of AI self-hosting is outrunning basic security hygiene.

[Source: The Hacker News]


First AI-Generated Zero-Day 2FA Bypass Exploited in the Wild

[Security]

Researchers documented the first confirmed case of hackers using AI models to develop a zero-day 2FA bypass for mass exploitation. The Python exploit script showed hallmarks of LLM-generated code — including educational docstrings and a hallucinated CVSS score. Combined with Mandiant’s M-Trends 2026 finding that 28.3% of CVEs are now exploited within 24 hours of disclosure, the incident confirms that AI is compressing the attacker-defender time gap.

[Source: The Hacker News]


Product & Company News

Model releases, funding, and notable moves

Anthropic Secures 300 MW of SpaceX GPU Capacity — and Eyes Orbital Compute

[Industry]

Anthropic signed a deal to lease SpaceX’s entire Colossus 1 data centre in Memphis — over 222,000 NVIDIA GPUs and 300+ megawatts of compute. The immediate effect: doubled rate limits for Claude Code, removed peak-hour throttling for Pro/Max tiers, and raised API limits for Opus models. Anthropic also expressed interest in developing gigawatt-scale compute capacity in space with SpaceX. Dario Amodei said Anthropic is growing at 80x its 2025 levels through Q1 2026.

[Source: CNBC]


Anthropic and Wall Street Giants Launch $1.5B Enterprise AI Services Firm

[Industry]

Anthropic partnered with Blackstone, Goldman Sachs, Hellman & Friedman, and others to create a standalone company that embeds Claude engineers inside mid-sized enterprises to redesign workflows around agents. Backed by $1.5B in committed capital, the venture targets the biggest bottleneck in enterprise AI adoption: the shortage of people who can implement frontier AI systems. Separately, Anthropic formed a $200M partnership with the Gates Foundation to deploy Claude for health outcomes in low-income countries.

[Source: Anthropic]


Sierra Raises $950M at $15B+ Valuation for Enterprise AI Agents

[Industry]

Bret Taylor’s Sierra raised $950M led by Tiger Global and GV, pushing its valuation past $15B. The company recently launched Ghostwriter, an “agent as a service” product. The round caps a week in which VC-backed AI startups collectively attracted $18.8B in 2026 funding, concentrated in frontier research teams, agent infrastructure, and vertical tools for regulated industries.

[Source: TechCrunch]


Regulatory & Policy

Laws, frameworks, and compliance moves shaping AI deployment

EU AI Omnibus Deal Pushes High-Risk Deadlines to December 2027

[Policy]

The EU Council and Parliament reached provisional agreement on May 7 on the “Digital Omnibus” package, which delays the application of high-risk AI system obligations from August 2, 2026 to December 2, 2027 for standalone Annex III systems and August 2, 2028 for embedded Annex I products. The deal also introduces a full ban on AI-generated non-consensual intimate imagery and delays watermarking obligations to December 2026. The rationale: technical standards and guidance documents aren’t ready yet. Teams should use the extra time to prepare, not relax.

[Source: Hogan Lovells]


EU Opens Consultation on AI Transparency Obligation Guidelines

[Policy]

On May 8, the European Commission opened public consultation on draft guidelines for AI transparency obligations under the AI Act. The guidelines will define how deployers must disclose AI system usage to end users and how providers must document their systems. With the August 2026 transparency requirements still on schedule (unlike the delayed high-risk obligations), this consultation is immediately actionable for any team shipping AI products into the EU.

[Source: EU Commission]


Agent Era & Technical Workflows

Patterns, tools, and architectures for building production agents

Framework Convergence: MCP, A2A, and AG-UI Are Becoming Table Stakes

[Tool]

The agent framework landscape in May 2026 is converging around three protocols: MCP (Model Context Protocol) for standardised tool access, A2A (Agent-to-Agent) for inter-agent communication, and AG-UI for standard event streams. LangChain v1.1’s “Deep Agent” patterns, Pydantic AI’s native MCP support, and Composio’s connector ecosystem all adopted these protocols. Teams that choose a framework without MCP support will find themselves locked out of the fastest-growing tool ecosystem.
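Part of why MCP spread so quickly is how small the interoperability surface is. A server advertises tools as JSON descriptors with a name, a description, and a JSON Schema for inputs; any MCP-aware framework can then discover and call them without bespoke glue. The field names below follow the public MCP specification, while the `get_weather` tool itself is a made-up example:

```python
# Sketch of an MCP-style tool descriptor, as a server would return it
# from a tools/list request. Field names follow the MCP specification;
# the "get_weather" tool is an invented example.

import json

tool_descriptor = {
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "inputSchema": {  # standard JSON Schema for the tool's arguments
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}

print(json.dumps(tool_descriptor, indent=2))
```

Because the descriptor is plain JSON Schema, the same tool definition works unchanged across every framework that speaks the protocol, which is the lock-in risk the paragraph above describes for frameworks that don't.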

[Source: LangChain]


n8n Argues We Need to Relearn What Agent Development Tools Are

[Tool]

n8n published a post arguing that agent development tooling has fundamentally shifted in 2026. The thesis: most existing frameworks were built for the chat-completion era and don’t handle the realities of production agents — multi-step error recovery, tool orchestration, human-in-the-loop checkpoints, and durable execution. n8n positions its visual workflow builder as a middle ground between code-first frameworks and no-code tools, targeting teams that need to ship agents without building orchestration from scratch.

[Source: n8n Blog]


Open Source & Infrastructure

Model rankings, benchmarks, and the stack underneath

Chinese Open-Weight Models Now Hold Four of the Top Five Positions

[Research]

The open-source landscape has shifted decisively. GLM-5 from Zhipu AI scored 77.8% on SWE-bench Verified, approaching Claude Opus 4.5 on agentic coding. Kimi K2.6 became the first open-weight model to beat GPT-5.4 on SWE-Bench Pro (58.6 vs 57.7). DeepSeek V4 matches Claude Opus 4.6 on coding and math benchmarks. With Apache 2.0 licences and lower inference costs, these models are viable alternatives for teams that need to self-host or avoid vendor lock-in.

[Source: HuggingFace]


AMD Data Centre Revenue Surges 57% as MI350 Gains Enterprise Traction

[Industry]

AMD reported Q1 2026 revenue of $10.3B (+38% YoY), with data centre revenue at $5.78B (+57%). The growth is driven by enterprise inference demand: Meta committed to a 6 GW Instinct GPU deployment and OpenAI signed for 6 GW of MI450. AMD also launched Instinct MI350P PCIe cards targeting enterprise inference workloads. The numbers confirm that AMD is now a genuine second source for AI compute, reducing NVIDIA’s monopoly pricing power.

[Source: CNBC]


Hardware & Macro Watch

Chips, compute, and the infrastructure layer

NVIDIA Inks 5 GW DSX Infrastructure Deal with IREN

[Industry]

NVIDIA and Australian data centre operator IREN signed a deal to deploy up to 5 gigawatts of NVIDIA’s DSX-branded infrastructure designs across IREN’s global facilities. The DSX architecture integrates NVIDIA’s networking, cooling, and GPU rack specifications into a turnkey data centre template. With demand for AI compute outstripping traditional data centre construction timelines, DSX-style standardised designs are becoming the preferred approach for rapid capacity expansion.

[Source: CNBC]


Global AI Chip Market Projected to Hit $670B by 2036

[Industry]

A new market analysis projects that the global AI chip market will reach $670.2B by 2036, driven by the generative AI boom. The forecast underscores why hyperscalers, sovereign funds, and AI labs are all racing to secure GPU supply: the infrastructure layer is becoming the bottleneck that determines who can train and serve frontier models at scale.

[Source: GlobeNewsWire]


Model Evaluations & Transparency

How models are being measured, compared, and held accountable

Arena Leaderboard: Claude Opus 4.6 Holds the Top Spot

[Eval]

As of mid-May 2026, the Arena leaderboard (formerly LMSYS Chatbot Arena) shows Claude Opus 4.6 and its thinking variant at the top with ~1500-1504 Elo, followed by Gemini 3.1 Pro Preview (1500) and Grok 4.20 beta (1493). The top-10 remains volatile — rankings shift weekly based on new votes. With over 6 million user votes collected, Arena remains the most trusted crowdsourced benchmark for conversational AI quality, though it measures preference, not capability.

[Source: Arena]


Microsoft’s DELEGATE-52 Benchmark Reveals the Long-Workflow Reliability Gap

[Eval]

DELEGATE-52 is the first benchmark designed to measure how models perform across sustained, multi-step document workflows rather than single-turn tasks. Across 52 domains, only Python programming cleared the 98% fidelity bar; 80% of model-domain combinations scored below 80% — what the researchers call “catastrophic corruption.” The benchmark reveals a gap that standard evals miss: models that ace single-turn benchmarks may still fail badly when asked to maintain document integrity over 20+ interactions.
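A fidelity bar of this kind can be approximated with a very simple check. The sketch below is our illustration, not DELEGATE-52's actual metric: it measures the fraction of original sentences that survive verbatim in the edited output.

```python
# Illustrative content-retention check (our sketch, not the benchmark's
# real metric): fraction of original sentences still present verbatim
# after a multi-step editing workflow.

def retention(original: str, edited: str) -> float:
    sentences = [s for s in original.split(". ") if s]
    kept = [s for s in sentences if s in edited]
    return len(kept) / len(sentences)

doc = "Alpha stays. Beta stays. Gamma gets dropped by the agent"
out = "Alpha stays. Beta stays. A new summary line"
print(f"{retention(doc, out):.0%}")  # 67%
```

Even a crude check like this, run after every delegated interaction, would catch the silent erosion the benchmark documents long before it compounds into "catastrophic corruption".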

[Source: Microsoft Research]




Curated by Claude Code · Sources span Reddit, Hacker News, Alignment Forum, arXiv, OWASP, MITRE, NIST, CISA, IAPP, Covington, Ada Lovelace Institute, analyst reports, technical blogs, and hardware press