Picture this: You're sitting in a conference room, halfway through a vendor pitch. The demo looks solid, and pricing fits nicely under budget. The timeline seems reasonable too. Everyone’s nodding along.
You’re literally minutes away from saying yes.
Then someone from your finance team walks in. They see the deck and frown. A few minutes later, they shoot you a message on Slack: “Actually, I threw together a version of this last week. Took me 2 hours in Cursor. Wanna take a look?”
Wait… what?
This person doesn't code. You know for a fact they've never written a line of JavaScript in their entire life. But here they are, showing you a working prototype on their laptop that does... pretty much exactly what the vendor pitched. Sure, it's got some rough edges, but it works. And it didn’t cost six figures. Just two hours of their time.
Suddenly, the assumptions you walked in with — about how software is developed, who makes it and how decisions are made around it — all start coming apart at the seams.
For decades, every growing company asked the same question: Should we build this ourselves, or should we buy it?
And, for decades, the answer was pretty straightforward: Build if it's core to your business; buy if it isn’t.
The logic made sense, because building was expensive and meant borrowing time from overworked engineers, writing specs, planning sprints, managing infrastructure and bracing yourself for a long tail of maintenance. Buying was faster. Safer. You paid for the support and the peace of mind.
But something fundamental has changed: AI has made building accessible to everyone. What used to take weeks now takes hours, and what used to require fluency in a programming language now requires fluency in plain English.
When the cost and complexity of building collapse this dramatically, the old framework goes down with them. It’s not build versus buy anymore. It’s something stranger that we haven't quite found the right words for.
My company never planned to build so many of the tools we use. We just had to build because the things we needed didn’t exist. And, through that process, we developed this visceral understanding of what we actually wanted, what was useful and what it could or couldn't do. Not what vendor decks told us we needed or what analyst reports said we should want, but what actually moved the needle in our business.
We figured out which problems were worth solving, which ones weren’t, where AI created real leverage and where it was just noise. And only then, once we had that hard-earned clarity, did we start buying.
By that point, we knew exactly what we were looking for and could tell the difference between substance and marketing in about five minutes. We asked questions that made vendors nervous because we'd already built some rudimentary version of what they were selling.
Last week, someone on our CX team noticed some customer feedback about a bug in Slack. Just a minor customer complaint, nothing major. In another company, this would’ve kicked off a support ticket and they’d have waited for someone else to handle it, but that’s not what happened here. They opened Cursor, described the change and let AI write the fix. Then they submitted a pull request that engineering reviewed and merged.
Just 15 minutes after that complaint popped up in Slack, the fix was live in production.
The person who did this isn’t technical in the slightest. I doubt they could tell you the difference between Python and JavaScript, but they solved the problem anyway.
And that’s the point.
AI has gotten so good at cranking out relatively simple code that it handles 80% of the problems that used to require a sprint planning meeting and two weeks of engineering time. It’s erasing the boundary between technical and non-technical. Work that used to be bottlenecked by engineering is now being done by the people closest to the problem.
This is happening right now in companies that are actually paying attention.
Here's where it gets fascinating for finance leaders, because AI has actually flipped the entire strategic logic of the build versus buy decision on its head.
The old model went something like:
Define the need.
Decide whether to build or buy.
But defining the need took forever and required deep technical expertise, or you'd burn through money through trial-and-error vendor implementations. You'd sit through countless demos, trying to picture whether this actually solved your problem. Then you’d negotiate, implement, move all your data and workflows to the new tool and six months and six figures later discover whether (or not) you were actually right.
Now, the whole sequence gets turned around:
Build something lightweight with AI.
Use it to understand what you actually need.
Then decide whether to buy (and you'll know exactly why).
This approach lets you run controlled experiments. You figure out whether the problem even matters. You discover which features deliver value and which just look good in demos. Then you go shopping. Instead of letting some external vendor sell you on what the need is, you get to figure out whether you even have that need in the first place.
Think about how many software purchases you've made that, in hindsight, solved problems you didn't actually have. How many times have you been three months into an implementation and thought, “Hang on, is this actually helping us, or are we just trying to justify what we spent?”
Now, when you do buy, the question becomes “Does this solve the problem better than what we already proved we can build?”
That one reframe changes the entire conversation. Now you show up to vendor calls informed. You ask sharper questions, and negotiate from a place of strength. Most importantly, you avoid the most expensive mistake in enterprise software, which is solving a problem you never really had.
As this new capability emerges, I’m watching companies sprint in the wrong direction. They know they need to be AI native, so they go on a shopping spree. They look for AI-powered tools, filling their stack with products that have GPT integrations, chatbot UIs or “AI” slapped onto the marketing site. They think they’re transforming, but they’re not.
Remember what physicist Richard Feynman called cargo cult science? After World War II, islanders in the South Pacific built fake airstrips and control towers, mimicking what they'd seen during the war, hoping planes full of cargo would return. They had all the outward forms of an airport: Towers, headsets, even people miming air traffic controllers. But no planes landed, because the form wasn’t the function.
That’s exactly what’s happening with AI transformation in boardrooms everywhere. Leaders are buying AI tools without asking if they meaningfully change how work gets done, who they empower or what processes they unlock.
They’ve built the airstrip, but the planes aren’t showing up.
And the whole market's basically set up to make you fall into this trap. Everything gets branded as AI now, but nobody seems to care what these products actually do. Every SaaS product has bolted on a chatbot or an auto-complete feature and slapped an AI label on it, and the label has lost all meaning. It’s just a checkbox vendors figure they need to tick, regardless of whether it creates actual value for customers.
This is the part that gets me excited about what finance teams can do now. You don’t have to guess anymore. You don’t have to bet six figures on a sales deck. You can test things, and you can actually learn something before you spend.
Here's what I mean: If you’re evaluating vendor management software, prototype the core workflow with AI tools. Figure out whether you’re solving a tooling problem or a process problem. Figure out whether you need software at all.
This doesn’t mean you’ll build everything internally — of course not. Most of the time, you’ll still end up buying, and that's totally fine, because enterprise tools exist for good reasons (scale, support, security, and maintenance). But now you’ll buy with your eyes wide open.
You’ll know what “good” looks like. You’ll show up to demos already understanding the edge cases, and know in about 5 minutes whether they actually get your specific problem. You’ll implement faster. You'll negotiate better because you're not completely dependent on the vendor's solution. And you’ll choose it because it's genuinely better than what you could build yourself.
You'll have already mapped out the shape of what you need, and you'll just be looking for the best version of it.
For years, the mantra was: Build or buy.
Now, it’s more elegant and way smarter: Build to learn what to buy.
And it's not some future state. This is already happening. Right now, somewhere, a customer rep is using AI to fix a product issue they spotted minutes ago. Somewhere else, a finance team is prototyping their own analytical tools because they've realized they can iterate faster than they can write up requirements for engineering. Somewhere, a team is realizing that the boundary between technical and non-technical was always more cultural than fundamental.
The companies that embrace this shift will move faster and spend smarter. They’ll know their operations more deeply than any vendor ever could. They'll make fewer expensive mistakes, and buy better tools because they actually understand what makes tools good.
The companies that stick to the old playbook will keep sitting through vendor pitches, nodding along at budget-friendly proposals. They’ll debate timelines, and keep mistaking professional decks for actual solutions.
Until someone on their own team pops open their laptop, says, “I built a version of this last night. Want to check it out?” and shows them something they built in two hours that does 80% of what they’re about to pay six figures for.
And, just like that, the rules change for good.
Siqi Chen is co-founder and CEO of Runway.
Gen AI in software engineering has moved well beyond autocomplete. The emerging frontier is agentic coding: AI systems capable of planning changes, executing them across multiple steps and iterating based on feedback. Yet despite the excitement around “AI agents that code,” most enterprise deployments underperform. The limiting factor is no longer the model. It’s context: The structure, history and intent surrounding the code being changed. In other words, enterprises are now facing a systems design problem: They have not yet engineered the environment these agents operate in.
The past year has seen a rapid evolution from assistive coding tools to agentic workflows. Research has begun to formalize what agentic behavior means in practice: The ability to reason across design, testing, execution and validation rather than generate isolated snippets. Work such as dynamic action re-sampling shows that allowing agents to branch, reconsider and revise their own decisions significantly improves outcomes in large, interdependent codebases. At the platform level, providers like GitHub are now building dedicated agent orchestration environments, such as Copilot Agent and Agent HQ, to support multi-agent collaboration inside real enterprise pipelines.
But early field results tell a cautionary story. When organizations introduce agentic tools without addressing workflow and environment, productivity can decline. A randomized controlled study this year showed that developers who used AI assistance in unchanged workflows completed tasks more slowly, largely due to verification, rework and confusion around intent. The lesson is straightforward: Autonomy without orchestration rarely yields efficiency.
In every unsuccessful deployment I’ve observed, the failure stemmed from context. When agents lack a structured understanding of a codebase (its relevant modules, dependency graph, test harness, architectural conventions and change history), they often generate output that appears correct but is disconnected from reality. Too much information overwhelms the agent; too little forces it to guess. The goal is not to feed the model more tokens. The goal is to determine what should be visible to the agent, when and in what form.
The teams seeing meaningful gains treat context as an engineering surface. They create tooling to snapshot, compact and version the agent’s working memory: What is persisted across turns, what is discarded, what is summarized and what is linked instead of inlined. They design deliberation steps rather than prompting sessions. They make the specification a first-class artifact, something reviewable, testable and owned, not a transient chat history. This shift aligns with a broader trend some researchers describe as “specs becoming the new source of truth.”
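To make that concrete, here is a minimal sketch of what treating context as an engineering surface could look like, assuming a home-grown snapshot store; every class, field and path below is illustrative rather than drawn from any specific product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ContextSnapshot:
    """One versioned view of an agent's working memory (illustrative)."""
    version: int
    persisted: dict[str, str] = field(default_factory=dict)   # carried across turns
    summaries: dict[str, str] = field(default_factory=dict)   # compacted long content
    links: dict[str, str] = field(default_factory=dict)       # referenced, not inlined
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class ContextStore:
    """Snapshot, compact and version the agent's working memory."""

    def __init__(self) -> None:
        self._history: list[ContextSnapshot] = []

    def snapshot(self, persisted: dict, summaries: dict, links: dict) -> ContextSnapshot:
        snap = ContextSnapshot(
            version=len(self._history) + 1,
            persisted=persisted,
            summaries=summaries,
            links=links,
        )
        self._history.append(snap)
        return snap

    def compact(self, snap: ContextSnapshot, max_chars: int = 2000) -> ContextSnapshot:
        # Anything too long to inline becomes a short summary stub plus a link,
        # so the agent sees a pointer instead of the full text.
        persisted = dict(snap.persisted)
        summaries = dict(snap.summaries)
        links = dict(snap.links)
        for key, value in list(persisted.items()):
            if len(value) > max_chars:
                summaries[key] = value[:200] + " …[summarized]"
                links[key] = f"context://snapshots/{snap.version}/{key}"
                del persisted[key]
        return self.snapshot(persisted, summaries, links)
```

The point of the sketch is the discipline, not the data structure: what persists, what gets summarized and what gets linked are explicit, reviewable decisions rather than whatever happens to fit in a chat history.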
But context alone isn’t enough. Enterprises must re-architect the workflows around these agents. As McKinsey’s 2025 report “One Year of Agentic AI” noted, productivity gains arise not from layering AI onto existing processes but from rethinking the process itself. When teams simply drop an agent into an unaltered workflow, they invite friction: Engineers spend more time verifying AI-written code than they would have spent writing it themselves. The agents can only amplify what’s already structured: Well-tested, modular codebases with clear ownership and documentation. Without those foundations, autonomy becomes chaos.
Security and governance, too, demand a shift in mindset. AI-generated code introduces new forms of risk: Unvetted dependencies, subtle license violations and undocumented modules that escape peer review. Mature teams are beginning to integrate agentic activity directly into their CI/CD pipelines, treating agents as autonomous contributors whose work must pass the same static analysis, audit logging and approval gates as any human developer. GitHub’s own documentation highlights this trajectory, positioning Copilot Agents not as replacements for engineers but as orchestrated participants in secure, reviewable workflows. The goal isn’t to let an AI “write everything,” but to ensure that when it acts, it does so inside defined guardrails.
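As a rough illustration of that principle, the sketch below runs agent-authored changes through the same gates as human ones; the check functions, the "agent/" author prefix and the "name:license" dependency format are assumptions made for the example, not any platform's real API.

```python
# Illustrative CI gate: agent-authored changes pass the same checks as human ones.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChangeSet:
    author: str            # e.g. "agent/copilot-refactor" or "jane.doe" (assumed convention)
    diff: str
    dependencies: list[str]  # assumed "name:license" strings, e.g. "requests:apache-2.0"

def static_analysis_passes(change: ChangeSet) -> bool:
    # Stand-in for a real linter or scanner run.
    return "eval(" not in change.diff

def licenses_are_vetted(change: ChangeSet) -> bool:
    approved = {"mit", "apache-2.0", "bsd-3-clause"}
    return all(dep.split(":")[-1].lower() in approved for dep in change.dependencies)

def human_approved(change: ChangeSet, approvals: set[str]) -> bool:
    # Same review gate as any human pull request.
    return len(approvals) >= 1

GATES: list[Callable[[ChangeSet], bool]] = [static_analysis_passes, licenses_are_vetted]

def merge_allowed(change: ChangeSet, approvals: set[str]) -> bool:
    is_agent = change.author.startswith("agent/")
    # Identical gates either way; agent work additionally gets audit-logged.
    if is_agent:
        print(f"AUDIT: agent change by {change.author}, {len(change.diff)} chars")
    return all(gate(change) for gate in GATES) and human_approved(change, approvals)
```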
For technical leaders, the path forward starts with readiness rather than hype. Monoliths with sparse tests rarely yield net gains; agents thrive where tests are authoritative and can drive iterative refinement. This is exactly the loop Anthropic calls out for coding agents. Pilot in tightly scoped domains (test generation, legacy modernization, isolated refactors), and treat each deployment as an experiment with explicit metrics (defect escape rate, PR cycle time, change failure rate, security findings burned down). As your usage grows, treat agents as data infrastructure: Every plan, context snapshot, action log and test run is data that composes into a searchable memory of engineering intent, and a durable competitive advantage.
Under the hood, agentic coding is less a tooling problem than a data problem. Every context snapshot, test iteration and code revision becomes a form of structured data that must be stored, indexed and reused. As these agents proliferate, enterprises will find themselves managing an entirely new data layer: One that captures not just what was built, but how it was reasoned about. This shift turns engineering logs into a knowledge graph of intent, decision-making and validation. In time, the organizations that can search and replay this contextual memory will outpace those who still treat code as static text.
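A minimal sketch of that new data layer, assuming a flat record store standing in for a real knowledge graph, might look like the following; the record fields and store API are illustrative only.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class IntentRecord:
    """One unit of 'how it was reasoned about': plan, context, actions, outcome."""
    task_id: str
    plan: str
    context_snapshot_version: int
    actions: list[str]
    test_results: dict[str, bool]

class EngineeringMemory:
    """Naive searchable store of agent reasoning; a graph database would replace this."""

    def __init__(self) -> None:
        self._records: list[IntentRecord] = []

    def log(self, record: IntentRecord) -> None:
        self._records.append(record)

    def search(self, query: str) -> list[IntentRecord]:
        # Keyword match over plans and actions; real systems would index semantically.
        q = query.lower()
        return [r for r in self._records
                if q in r.plan.lower() or any(q in a.lower() for a in r.actions)]

    def replay(self, task_id: str) -> str:
        # Return a replayable trace of how a change was planned and validated.
        rec = next(r for r in self._records if r.task_id == task_id)
        return json.dumps(asdict(rec), indent=2)
```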
The coming year will likely determine whether agentic coding becomes a cornerstone of enterprise development or another inflated promise. The difference will hinge on context engineering: How intelligently teams design the informational substrate their agents rely on. The winners will be those who see autonomy not as magic, but as an extension of disciplined systems design: Clear workflows, measurable feedback, and rigorous governance.
Platforms are converging on orchestration and guardrails, and research keeps improving context control at inference time. The winners over the next 12 to 24 months won’t be the teams with the flashiest model; they’ll be the ones that engineer context as an asset and treat workflow as the product. Do that, and autonomy compounds. Skip it, and the review queue does.
Context + agent = leverage. Skip the first half, and the rest collapses.
Dhyey Mavani is accelerating generative AI at LinkedIn.
The Allen Institute for AI (Ai2) recently released what it calls its most powerful family of models yet, Olmo 3. But the company kept iterating on the models, expanding its reinforcement learning (RL) runs, to create Olmo 3.1.
The new Olmo 3.1 models focus on efficiency, transparency, and control for enterprises.
Ai2 updated two of the three versions of Olmo 3: Olmo 3.1 Think 32B, the flagship model optimized for advanced research, and Olmo 3.1 Instruct 32B, designed for instruction-following, multi-turn dialogue, and tool use.
Olmo 3 has a third version, Olmo 3-Base, for programming, comprehension, and math. It also works well as a base for continued fine-tuning.
Ai2 said that to upgrade Olmo 3 Think 32B to Olmo 3.1, its researchers extended its best RL run with a longer training schedule.
“After the original Olmo 3 launch, we resumed our RL training run for Olmo 3 32B Think, training for an additional 21 days on 224 GPUs with extra epochs over our Dolci-Think-RL dataset,” Ai2 said in a blog post. “This yielded Olmo 3.1 32B Think, which brings substantial gains across math, reasoning, and instruction-following benchmarks: improvements of 5+ points on AIME, 4+ points on ZebraLogic, 4+ points on IFEval, and 20+ points on IFBench, alongside stronger performance on coding and complex multi-step tasks.”
To get to Olmo 3.1 Instruct, Ai2 said its researchers applied the recipe behind the smaller Instruct size, 7B, to the larger model.
Olmo 3.1 Instruct 32B is "optimized for chat, tool use, & multi-turn dialogue—making it a much more performant sibling of Olmo 3 Instruct 7B and ready for real-world applications,” Ai2 said in a post on X.
For now, the new checkpoints are available on the Ai2 Playground or Hugging Face, with API access coming soon.
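For teams that want to experiment once they pull the weights from Hugging Face, a standard transformers load is the most likely path. Note that the repository ID below is an assumed placeholder, not a confirmed name; check Ai2's model card for the exact identifier.

```python
# Minimal sketch of loading an Olmo 3.1 checkpoint from Hugging Face.
# The repo ID is a hypothetical placeholder; confirm the real name on Ai2's model card.
# device_map="auto" assumes the accelerate package is installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "allenai/Olmo-3.1-32B-Think"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = "Explain the difference between RL fine-tuning and instruction tuning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```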
The Olmo 3.1 models performed well on benchmark tests, predictably beating the Olmo 3 models.
Olmo 3.1 Think outperformed Qwen 3 32B models in the AIME 2025 benchmark and performed close to Gemma 27B.
Olmo 3.1 Instruct performed strongly against its open-source peers, even beating models like Gemma 3 on the Math benchmark.
“As for Olmo 3.1 32B Instruct, it’s a larger-scale instruction-tuned model built for chat, tool use, and multi-turn dialogue. Olmo 3.1 32B Instruct is our most capable fully open chat model to date and — in our evaluations — the strongest fully open 32B-scale instruct model,” the company said.
Ai2 also upgraded its RL-Zero 7B models for math and coding. The company said on X that both models benefited from longer and more stable training runs.
Ai2 previously told VentureBeat that it designed the Olmo 3 family of models to offer enterprises and research labs more control and understanding of the data and training that went into the model.
Organizations could add to the model’s data mix and retrain it to also learn from what’s been added.
This has long been a commitment for Ai2, which also offers a tool called OlmoTrace that tracks how LLM outputs match its training data.
“Together, Olmo 3.1 Think 32B and Olmo 3.1 Instruct 32B show that openness and performance can advance together. By extending the same model flow, we continue to improve capabilities while retaining end-to-end transparency over data, code, and training decisions,” Ai2 said.
In a new paper that studies tool-use in large language model (LLM) agents, researchers at Google and UC Santa Barbara have developed a framework that enables agents to make more efficient use of tool and compute budgets. The researchers introduce two new techniques: a simple "Budget Tracker" and a more comprehensive framework called "Budget Aware Test-time Scaling." These techniques make agents explicitly aware of their remaining reasoning and tool-use allowance.
As AI agents rely on tool calls to work in the real world, test-time scaling has become less about smarter models and more about controlling cost and latency.
For enterprise leaders and developers, budget-aware scaling techniques offer a practical path to deploying effective AI agents without facing unpredictable costs or diminishing returns on compute spend.
Traditional test-time scaling focuses on letting models "think" longer. However, for agentic tasks like web browsing, the number of tool calls directly determines the depth and breadth of exploration.
This introduces significant operational overhead for businesses. "Tool calls such as webpage browsing results in more token consumption, increases the context length and introduces additional time latency," Zifeng Wang and Tengxiao Liu, co-authors of the paper, told VentureBeat. "Tool calls themselves introduce additional API costs."
The researchers found that simply granting agents more test-time resources does not guarantee better performance. "In a deep research task, if the agent has no sense of budget, it often goes down blindly," Wang and Liu explained. "It finds one somewhat related lead, then spends 10 or 20 tool calls digging into it, only to realize that the entire path was a dead end."
To evaluate how they can optimize tool-use budgets, the researchers first tried a lightweight approach called "Budget Tracker." This module acts as a plug-in that provides the agent with a continuous signal of resource availability, enabling budget-aware tool use.
The team hypothesized that "providing explicit budget signals enables the model to internalize resource constraints and adapt its strategy without requiring additional training."
Budget Tracker operates purely at the prompt level, which makes it easy to implement; the paper provides full details on the prompts used.
In Google's implementation, the tracker provides a brief policy guideline describing the budget regimes and corresponding recommendations for using tools. At each step of the response process, Budget Tracker makes the agent explicitly aware of its resource consumption and remaining budget, enabling it to condition subsequent reasoning steps on the updated resource state.
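Conceptually, the tracker is just running state that gets rendered into the prompt at every step. Here is a minimal sketch of that idea; the budget regimes, thresholds and wording are illustrative assumptions, not the paper's actual prompts.

```python
class BudgetTracker:
    """Prompt-level budget signal for a tool-using agent (illustrative sketch)."""

    def __init__(self, max_tool_calls: int):
        self.max_tool_calls = max_tool_calls
        self.used = 0

    def record_tool_call(self) -> None:
        self.used += 1

    @property
    def remaining(self) -> int:
        return self.max_tool_calls - self.used

    def policy_hint(self) -> str:
        # Coarse budget regimes with matching guidance; thresholds are assumptions.
        if self.remaining > 0.5 * self.max_tool_calls:
            regime, advice = "ample", "explore broadly before committing to one lead"
        elif self.remaining > 0.2 * self.max_tool_calls:
            regime, advice = "limited", "prioritize the most promising leads"
        else:
            regime, advice = "scarce", "stop exploring and consolidate an answer"
        return (f"Budget status: {self.used}/{self.max_tool_calls} tool calls used, "
                f"{self.remaining} remaining ({regime}). Recommendation: {advice}.")

def build_step_prompt(question: str, scratchpad: str, tracker: BudgetTracker) -> str:
    # The budget line is injected at every ReAct step so the model can condition
    # its next thought and action on the updated resource state.
    return f"{tracker.policy_hint()}\n\nQuestion: {question}\n\n{scratchpad}"
```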
To test this, the researchers experimented with two paradigms: sequential scaling, where the model iteratively refines its output, and parallel scaling, where multiple independent runs are conducted and aggregated. They ran experiments on search agents equipped with search and browse tools following a ReAct-style loop. ReAct (Reasoning + Acting) is a popular method where the model alternates between internal thinking and external actions. To trace a true cost-performance scaling trend, they developed a unified cost metric that jointly accounts for the costs of both internal token consumption and external tool interactions.
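The paper's exact weighting isn't reproduced here, but the shape of such a unified cost metric is simple: price internal tokens and external tool calls in the same currency. A hedged sketch with placeholder rates:

```python
def unified_cost(prompt_tokens: int, completion_tokens: int,
                 search_calls: int, browse_calls: int) -> float:
    """Combine internal token spend and external tool spend into one dollar figure.
    All rates below are illustrative placeholders, not the paper's actual values."""
    TOKEN_RATE_IN = 1.25 / 1_000_000    # $ per prompt token (assumed)
    TOKEN_RATE_OUT = 10.0 / 1_000_000   # $ per completion token (assumed)
    SEARCH_RATE = 0.005                 # $ per search API call (assumed)
    BROWSE_RATE = 0.002                 # $ per page fetch (assumed)
    return (prompt_tokens * TOKEN_RATE_IN
            + completion_tokens * TOKEN_RATE_OUT
            + search_calls * SEARCH_RATE
            + browse_calls * BROWSE_RATE)

# Example: a run with 200k prompt tokens, 30k completion tokens, 25 searches, 40 browses.
print(f"${unified_cost(200_000, 30_000, 25, 40):.2f}")
```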
They tested Budget Tracker on three information-seeking QA datasets requiring external search, including BrowseComp and HLE-Search, using models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet 4. The experiments show that this simple plug-in improves performance across various budget constraints.
"Adding Budget Tracker achieves comparable accuracy using 40.4% fewer search calls, 19.9% fewer browse calls, and reducing overall cost … by 31.3%," the authors told VentureBeat. Finally, Budget Tracker continued to scale as the budget increased, whereas plain ReAct plateaued after a certain threshold.
To further improve tool-use resource optimization, the researchers introduced Budget Aware Test-time Scaling (BATS), a framework designed to maximize agent performance under any given budget. BATS maintains a continuous signal of remaining resources and uses this information to dynamically adapt the agent's behavior as it formulates its response.
BATS uses multiple modules to orchestrate the agent's actions. A planning module adjusts stepwise effort to match the current budget, while a verification module decides whether to "dig deeper" into a promising lead or "pivot" to alternative paths based on resource availability.
Given an information-seeking question and a tool-call budget, BATS begins by using the planning module to formulate a structured action plan and decide which tools to invoke. When tools are invoked, their responses are appended to the reasoning sequence to provide the context with new evidence. When the agent proposes a candidate answer, the verification module verifies it and decides whether to continue the current sequence or initiate a new attempt with the remaining budget.
The iterative process ends when budgeted resources are exhausted, at which point an LLM-as-a-judge selects the best answer across all verified answers. Throughout the execution, the Budget Tracker continuously updates both resource usage and remaining budget at every iteration.
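Stripped of detail, that control flow amounts to a budgeted plan/act/verify loop with a final judge pass. The sketch below captures the structure only; the module internals, the pivot-on-verified heuristic and the one-tool-call-per-iteration accounting are simplifying assumptions, not the paper's implementation.

```python
from typing import Callable

def bats_loop(question: str,
              budget: "BudgetTracker",   # any object with .remaining and .record_tool_call()
              plan: Callable[[str, int], str],
              act: Callable[[str], tuple[str, str]],    # returns (evidence, candidate or "")
              verify: Callable[[str, str], bool],
              judge: Callable[[list[str]], str]) -> str:
    """Simplified Budget Aware Test-time Scaling loop (illustrative, not the paper's code)."""
    verified: list[str] = []
    context = ""
    while budget.remaining > 0:
        # The planning module scales stepwise effort to the remaining budget.
        step_plan = plan(question + "\n" + context, budget.remaining)
        evidence, candidate = act(step_plan)            # tool call(s) happen here
        budget.record_tool_call()
        context += "\n" + evidence                      # append new evidence to the trace
        if candidate and verify(question, candidate):
            verified.append(candidate)
            context = ""                                # pivot: start a fresh attempt
        # Otherwise keep digging along the current path with the remaining budget.
    # When resources are exhausted, an LLM-as-a-judge picks the best verified answer.
    return judge(verified) if verified else "no verified answer within budget"
```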
The researchers tested BATS on the BrowseComp, BrowseComp-ZH, and HLE-Search benchmarks against baselines including standard ReAct and various training-based agents. Their experiments show that BATS achieves higher performance while using fewer tool calls and incurring lower overall cost than competing methods. Using Gemini 2.5 Pro as the backbone, BATS achieved 24.6% accuracy on BrowseComp compared to 12.6% for standard ReAct, and 27.0% on HLE-Search compared to 20.5% for ReAct.
BATS not only improves effectiveness under budget constraints but also yields better cost–performance trade-offs. For example, on the BrowseComp dataset, BATS achieved higher accuracy at a cost of approximately 23 cents compared to a parallel scaling baseline that required over 50 cents to achieve a similar result.
According to the authors, this efficiency makes previously expensive workflows viable. "This unlocks a range of long-horizon, data-intensive enterprise applications… such as complex codebase maintenance, due-diligence investigations, competitive landscape research, compliance audits, and multi-step document analysis," they said.
As enterprises look to deploy agents that manage their own resources, the ability to balance accuracy with cost will become a critical design requirement.
"We believe the relationship between reasoning and economics will become inseparable," Wang and Liu said. "In the future, [models] must reason about value."
OpenAI has officially released GPT-5.2, and the reactions from early testers, whom OpenAI seeded with the model days and in some cases weeks before the public release, paint a two-toned picture: it is a monumental leap forward for deep, autonomous reasoning and coding, yet a potentially underwhelming "incremental" update for casual conversationalists.
Following early access periods and today's broader rollout, executives, developers, and analysts have taken to X (formerly Twitter) and company blogs to share their first testing results.
Here is a roundup of the first reactions to OpenAI’s latest flagship model.
The strongest praise for GPT-5.2 centers on its ability to handle "hard problems" that require extended thinking time.
Matt Shumer, CEO of HyperWriteAI, did not mince words in his review, calling GPT-5.2 Pro "the best model in the world."
Shumer highlighted the model's tenacity, noting that "it thinks for over an hour on hard problems. And it nails tasks no other model can touch."
This sentiment was echoed by Allie K. Miller, an AI entrepreneur and former AWS executive. Miller described the model as a step toward "AI as a serious analyst" rather than a "friendly companion."
"The thinking and problem-solving feel noticeably stronger," Miller wrote on X. "It gives much deeper explanations than I’m used to seeing. At one point it literally wrote code to improve its own OCR in the middle of a task."
For the enterprise sector, the update appears to be even more significant.
Aaron Levie, CEO of Box, revealed on X that his company has been testing GPT-5.2 in early access. Levie reported that the model performs "7 points better than GPT-5.1" on their expanded reasoning tests, which approximate real-world knowledge work in financial services and life sciences.
"The model performed the majority of the tasks far faster than GPT-5.1 and GPT-5 as well," Levie noted, confirming that Box AI will be rolling out GPT-5.2 integration shortly.
Rutuja Rajwade, a Senior Product Marketing Manager at Box, expanded on this in a company blog post, citing specific latency improvements.
"Complex extraction" tasks dropped from 46 seconds on GPT-5 to just 12 seconds with GPT-5.2.
Rajwade also noted a jump in reasoning capabilities for the Media and Entertainment vertical, rising from 76% accuracy in GPT-5.1 to 81% in the new model.
Developers are finding GPT-5.2 particularly potent for "one-shot" generation of complex code structures.
Pietro Schirano, CEO of magicpathai, shared a video of the model building a full 3D graphics engine in a single file with interactive controls. "It’s a serious leap forward in complex reasoning, math, coding, and simulations," Schirano posted. "The pace of progress is unreal."
Similarly, Ethan Mollick, a professor at the University of Pennsylvania's Wharton School and a longtime AI power user and writer, demonstrated the model's ability to create a visually complex shader—an infinite neo-gothic city in a stormy ocean—via a single prompt.
Perhaps the most functional shift is the model's ability to stay on task for hours without losing the thread.
Dan Shipper, CEO of the newsletter Every, which regularly tests AI models, reported that the model successfully performed a profit and loss (P&L) analysis that required it to work autonomously for two hours. "It did a P&L analysis where it worked for 2 hours and gave me great results," Shipper wrote.
However, Shipper also noted that for day-to-day tasks, the update feels "mostly incremental."
In an article for Every, Katie Parrott wrote that while GPT-5.2 excels at instruction following, it is "less resourceful" than competitors like Claude Opus 4.5 in certain contexts, such as deducing a user's location from email data.
Despite the reasoning capabilities, the "feel" of the model has drawn critique.
Shumer highlighted a significant "speed penalty" when using the model's Thinking mode. "In my experience the Thinking mode is very slow for most questions," Shumer wrote in his deep-dive review. "I almost never use Instant."
Allie Miller also pointed out issues with the model's default behavior. "The downside is tone and format," she noted. "The default voice felt a bit more rigid, and the length/markdown behavior is extreme: a simple question turned into 58 bullets and numbered points."
The early reaction suggests that GPT-5.2 is a tool optimized for power users, developers, and enterprise agents rather than casual chat. As Shumer summarized in his review: "For deep research, complex reasoning, and tasks that benefit from careful thought, GPT-5.2 Pro is the best option available right now."
However, for users seeking creative writing or quick, fluid answers, models like Claude Opus 4.5 remain strong competitors. "My favorite model remains Claude Opus 4.5," Miller admitted, "but my complex ChatGPT work will get a nice incremental boost."