August 27, 2025
Understanding Agentic Autonomy and How Future AI Agents Will Drive Enterprise Growth

So far in this series, we’ve covered two topics: (1) how AI agents are transforming present-day business operations and delivering real-world value, and (2) strategic oversight mechanisms essential to maintaining a beneficial and lucrative human-AI partnership. Here, we’ll venture a step further, exploring how the next wave of agentic AI innovation will both fuel and redefine enterprise growth.
In doing so, we’ll cover a couple of subtopics, beginning our inquiry with a discussion of the autonomy spectrum. Next, we’ll examine multi-agent systems, envisioning how they could generate new revenue streams, recalibrate competitive dynamics, and transform the workforce via a series of hypothetical case studies.
AI Agents: Understanding Autonomy
AI agents can exhibit varying degrees of autonomy; they are not fully autonomous by default. In fact, understanding where agentic AI falls on the autonomy spectrum is crucial to isolating potential use cases, anticipating relevant risks, implementing appropriate governance measures, and adapting to novel AI innovations while maintaining organizational resilience.
In this respect, we’ll start by summarizing two autonomy frameworks: one by Feng et al. (2025) and another by Parasuraman et al. (2000). Moving forward, we’ll refer to the former as “framework one” and the latter as “framework two”.
Framework one is explicitly designed for agentic AI, and subdivides autonomy into five levels, each of which references the “role” a human would play:
- Level 1 (Operator): The user is responsible for “operating” the system, and decision-making authority rests solely with them. The agent performs actions on behalf of the user. A human acts as the “operator.”
- Level 2 (Collaborator): The user and the agent work together to build plans, make decisions, delegate responsibilities, and initiate actions. The agent and user function as a human-AI partnership. A human “collaborates” with the agent.
- Level 3 (Consultant): The agent initiates decisions and actions but requires user input before proceeding. The agent is semi-autonomous, necessitating user guidance in the final steps of the execution process. A human “consults” for the agent.
- Level 4 (Approver): The agent only requests human approval for actions and decisions deemed high-risk. The agent carries out most work, requiring user approval when necessary. A human “approves” high-risk actions/decisions for the agent.
- Level 5 (Observer): The agent functions with full autonomy, executing decisions and actions without human input. Humans oversee the agent and can step in when needed. A human “observes” the agent’s behavior.
Framework two, despite being much older, remains relevant. It breaks computational autonomy into ten levels, each of which is specified below (we’ve adjusted level definitions for AI specificity):
- Level 1: The AI does not execute any decisions or actions, either alone or on behalf of a user.
- Level 2: The AI provides the user with a full array of possible actions and decision alternatives, after which the user decides which to pursue.
- Level 3: The AI does not provide a comprehensive decision/action array; instead, it narrows down possible decisions/actions, after which the user selects the “best” one.
- Level 4: The AI narrows down actions/decisions to an optimal action/decision, after which the user determines whether or not to execute it.
- Level 5: The AI provides a single, optimal action/decision that the user then approves or disapproves, allowing the AI to execute it on its own. (Note that across levels 1 through 4, actions/decisions are executed by humans, not the AI.)
- Level 6: The AI executes the action/decision it generates, provided that a human does not veto it first; the AI provides a “veto timeframe,” and if this timeframe is surpassed, the action/decision is automatically executed.
- Level 7: The AI executes the action/decision without human input and informs the user post-hoc.
- Level 8: The AI executes the action/decision and asks the human whether they would like to be informed.
- Level 9: The AI executes the action/decision and decides, on its own, whether it must inform the human.
- Level 10: The AI executes all actions/decisions with complete autonomy, requiring no human oversight, input, or guidance at any point.
Both frameworks are valuable in their own right. Still, we’d argue neither qualifies as a complete autonomy framework for agentic AI; “completeness,” in this sense, may prove to be a moving goalpost, so we advise readers to adopt a dynamic mindset when characterizing agentic AI autonomy. Nonetheless, we expand on the strengths of each framework below, and discuss their weaknesses in tandem:
Framework One (Strengths)
- Communication Ease: Autonomy levels and their respective roles align well with existing organizational workflows and cross-functional responsibility distribution efforts.
- Policy Relevance: Each level inherently references accountability structures, which can be applied or extended to internal policy without requiring extensive remediation.
- Accessibility & Pragmatism: The levels the framework describes, due to their human-centricity, are intuitively comprehensible. This also makes the framework accessible to a wide audience and relatively straightforward to implement, especially for early-stage initiatives like pilot testing.
- Innovativeness: The framework directly targets agentic AI; it doesn’t represent a generic, cross-discipline mapping attempt, though it does clearly draw some inspiration from autonomous vehicle (AV) autonomy classifications.
Framework Two (Strengths)
- Granular Handoff Structure: The framework establishes a fine-grained ladder that distinguishes recommending vs. selecting vs. executing, and when/how humans are notified.
- Sets Clear Expectations: Allows stakeholders to track and set measurable oversight criteria (e.g., veto window durations, approval latency) that directly correspond with a specific autonomy level.
- Agent-Centric: Focuses on the role that the agent plays in the human-AI partnership; this implicitly favors a higher degree of attention on agent-specific risks and impacts.
- Tiered Human Oversight: Supports a multi-tier human oversight framework whereby human oversight can be delineated in a stepwise manner, encouraging the development of more precise monitoring and intervention measures.
Both Frameworks (Weaknesses)
- Neither framework accounts for the possibility that advanced agents, irrespective of their autonomy level, may develop emergent preferences, behaviors, and objectives, especially as they scale, adapt, and learn from their environments.
- Both frameworks fail to capture how individual agentic autonomy levels could vary throughout hierarchical multi-agent systems (MAS) or multi-agent systems composed of both humans and AI operating at different levels within the system.
- Neither framework considers the multi-faceted connection between auditability, scalability, complexity, and autonomy; high-autonomy agents would presumably be more complex and opaque, which may make them less auditable even as they operate at greater scale.
- While both frameworks allude to the possibility of agents delegating tasks and decisions to other agents or subsystems, neither entertains this possibility seriously, which could make it challenging to manage scenarios in which autonomous agents trigger faulty decisions or actions that propagate across a system, resulting in a rapid failure cascade.
- Framework two is clearly too narrow; it fundamentally assumes that agents will “optimize” for the best possible action when, in reality, modern agents deliver their probabilistic “best effort” with uncertainty. Recall this is a modified framework that we applied to agentic AI.
- Framework one focuses on “who decides”, not “how capable” an agent actually is. This compromises the ability to precisely understand a model’s competence and uncertainty.
- Neither framework provides concrete autonomy thresholds; admittedly, the authors of framework one do showcase example characteristics, which is a step closer to “thresholds” despite remaining insufficient.
Based on this high-level critical analysis, we propose an ambitious, composite agentic AI autonomy framework, purpose-built for enterprises. It is designed to be pragmatic, sufficiently precise, and adaptable (i.e., able to support rapid change management). It contains four layers (Autonomy Levels, Risk Tiers, Control Packs, and Capability Grades), organized hierarchically.
Four-Layer Agentic AI Autonomy Framework
Layer 1 (Foundation) – Autonomy Levels (A1 – A7)
While we maintain most of framework one’s nomenclature and structure, we refine it significantly, integrating additional elements from framework two and beyond. Beyond layer one, all other layers of our multi-layer framework are entirely novel (to our knowledge). A minimal illustrative encoding of these levels follows the list below.
- A1 (Operator): AI drafts and suggests decisions/actions, but humans decide which alternatives to execute; the AI does not execute anything on its own, and ultimate decision-making authority rests solely with humans.
- A2 (Collaborator): AI recommends and selects actions/decisions, providing a confidence score and at least one (preferably multiple) alternative(s). Humans then review recommendations and choose whether to allow the AI to execute them. Humans retain authority, but AI acts.
- A3 (Consultant): AI generates, selects, and executes only pre-tagged, low-risk actions/decisions autonomously, and provides human operators with a designated veto timeframe (optional). If operators do not veto a step, it is automatically executed. Humans retain partial authority.
- A4 (Approver): AI generates, selects, and executes both pre-tagged low-risk and high-risk actions/decisions, but requires explicit human approval for high-risk steps. For actions/decisions that do not require explicit approval, post-hoc summaries are provided. Humans retain some authority.
- A5 (Observer): AI self-initiates decision and action cascades, but they are executed within a defined envelope (e.g., permissible vs. impermissible steps). The AI cannot take any silent high-risk actions (according to the defined envelope), and humans continuously monitor while retaining the ability to intervene when necessary. Humans have minimal but crucial authority.
- A6 (Bystander): AI self-initiates any and all actions/decisions it is capable of executing given its capabilities repertoire (this can include capabilities like recursive self-improvement and self-propagation, if they emerged/were implemented). Humans continuously monitor, but intervention capacity is limited to high-stakes actions like immediate shutdown, decommissioning, or recall. Humans have ultimate authority in extreme cases.
- A7 (Outsider): AI does everything so “well” that humans relinquish full control. Continuous monitoring and intervention capacities are stripped completely; humans have no authority whatsoever.
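To make this ladder easier to operationalize, here is a minimal, illustrative encoding of the levels as a simple data structure; we use Python for the sketch, and the enum names and helper function are hypothetical conveniences rather than part of the framework itself:

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    """Hypothetical encoding of the A1-A7 ladder described above."""
    A1_OPERATOR = 1      # AI drafts/suggests; humans execute everything
    A2_COLLABORATOR = 2  # AI recommends and selects; humans approve execution
    A3_CONSULTANT = 3    # AI executes pre-tagged low-risk steps, with an optional veto window
    A4_APPROVER = 4      # AI executes low- and high-risk steps; explicit approval for high-risk
    A5_OBSERVER = 5      # AI self-initiates within a defined envelope; humans monitor continuously
    A6_BYSTANDER = 6     # AI self-initiates broadly; humans retain emergency controls only
    A7_OUTSIDER = 7      # Full AI control; no human monitoring or intervention

def can_auto_execute_low_risk(level: AutonomyLevel) -> bool:
    """At A3 and above, at least some pre-tagged low-risk steps run without per-step approval."""
    return level >= AutonomyLevel.A3_CONSULTANT
```

Encoding the levels as ordered values also makes it straightforward to enforce rules like “never allow a level above A5 in production” directly in policy code.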
Layer 2 (Core) – Risk Tiers (R1 – R4)
Our risk tiers follow a top-down structure; each tier is defined by the answers (yes or no) that key stakeholders provide to the questions it contains. Questions should be interpreted as risk tier triggers; if you answer “yes” to one or several questions within a given tier, then your system falls into that tier. By contrast, if you answer “yes” to questions across tiers, then we advise categorizing your system in the highest tier that applies; this is a cautious approach that seeks to avoid false-negative risk categorizations. A minimal sketch of this “highest tier that applies” rule follows the tier definitions below.
R4 (External/Regulated/High-Impact)
- External Exposure
- Will the agent’s action leave the company boundary (e.g., public post, email to vendor/customer, transmission to third-party SaaS or APIs)?
- Will it publish or transmit data to a non-enterprise system (e.g., forums or social channels)?
- Legal/Regulator/Safety
- Could the action create legal or regulatory exposure (e.g., contracts, filings, claims, consent, regulated disclosures)?
- Could it affect human safety, physical devices, medical workflows, or critical infrastructure?
- Funds & Binding Commitments
- Does it move money (e.g., payments, transfers, orders) at or above your funds threshold?
- Does it commit the company to a binding agreement (e.g., signed terms, purchase orders)?
- Brand/Executive Communications
- Will the output be publicly attributable to the brand or an executive account/handle?
- Production Systems With Public Impact
- Does it change Internet-facing infrastructure, customer-visible behavior, or data accessible by customers/partners?
R3 (Irreversible or Material Internal Impact)
- Irreversibility
- Is rollback not fully automated or not guaranteed within your rollback window?
- Would reverting require manual steps, multi-team coordination, or downtime?
- Material Internal Impact
- Could this action alter core business records, schemas, access control lists, or policies in a way that persists?
- Would it affect many users/employees (e.g., company-wide settings), even if internal?
- Production Configuration (Internal)
- Does it change production configurations (e.g., quotas, routing) where a mistake could degrade service?
- Brand-Touching Internal Communications
- Is it an internal announcement from an executive handle, board/materials prep, or content likely to leak externally?
R2 (Reversible Internal Change)
- Reversibility
- Is the change tracked and reversible within your rollback window with an automated, tested procedure?
- Internal Scope
- Is the effect entirely internal (e.g., tickets, drafts, non-protected branches, sandbox DB, internal docs) with no external transmission?
- Local Blast Radius
- Is the blast radius small and contained (e.g., one team’s dataset, one project) with no cross-system dependencies or cascades?
R1 (Read-Only/Sandbox)
- No Real-World Action
- Does the agent only read, search, summarize, or simulate with no real-world actions/decisions?
- Isolated Environment
- Is activity confined to a sandbox or simulation, with nothing leaving that environment and no changes to production data or systems?
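As a rough sketch of the “highest tier that applies” rule, the triage logic could look something like the following; the trigger names and the R1 default are illustrative assumptions, not an exhaustive encoding of every question above:

```python
TIER_ORDER = ["R1", "R2", "R3", "R4"]  # lowest to highest risk

def classify_risk_tier(trigger_answers: dict[str, set[str]]) -> str:
    """
    trigger_answers maps each tier (e.g., "R4") to the set of trigger questions
    answered "yes" for that tier. Returns the highest tier with at least one "yes",
    defaulting to R1 (read-only/sandbox) when nothing is triggered.
    """
    for tier in reversed(TIER_ORDER):  # check R4 first, then R3, R2, R1
        if trigger_answers.get(tier):
            return tier
    return "R1"

# Usage: an action that emails a vendor (external exposure) and also makes a
# reversible internal change lands in R4, the highest tier that applies.
answers = {"R4": {"external_exposure"}, "R3": set(), "R2": {"internal_scope"}}
assert classify_risk_tier(answers) == "R4"
```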
Layer 3 (Core) – Control Packs (C1 – C3)
Control packs provide suggested, pre-bundled guardrails for each individual risk tier. They should be interpreted as safety and policy recommendations (not prescriptive requirements), each of which builds upon the last as risk tiers escalate. Across the board, organizations will need to customize and fine-tune the guardrails they implement.
C1: Baseline (Tier R1)
Suggested Guardrails:
- Action & Conversation Logs: Include interaction timestamps, actor (agent/human) details, inputs and outputs, tools utilized and/or invoked, data sources handled, and a unique operation identifier (see the sketch after this list).
- Human-Readable Rationale: The agent should provide a few sentences that explain why the recommendation was made.
- Confidence Score & At Least One Alternative: Agent should provide at least one decision/action alternative (i.e., a second viable option) and couple it with a transparent confidence score.
- Memory Policy: Agent does not possess persistent memory, and session/interaction memory is deleted when each session ends.
- Delegation: Agent is restricted from delegating to other sub-agents or external tools.
- Connectivity: The agent cannot perform any external actions and only interacts with enterprise-controlled read-only sources.
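For teams standing up C1, here is a minimal, hypothetical log-entry schema that covers the fields suggested above; all field names are illustrative assumptions, not prescriptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import uuid

@dataclass
class ActionLogEntry:
    """Hypothetical C1 action/conversation log entry."""
    actor: str                # "agent" or a human identifier
    inputs: str               # the request or prompt that triggered the step
    outputs: str              # the recommendation or summary produced
    tools_invoked: list[str]  # tools utilized and/or invoked
    data_sources: list[str]   # read-only enterprise sources handled
    rationale: str            # human-readable explanation (a few sentences)
    confidence: float         # transparent confidence score, 0.0-1.0
    alternatives: list[str]   # at least one viable decision/action alternative
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    operation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```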
C2: Standard (Tier R2)
Suggested Guardrails:
- Calibrated Confidence: Agent shows a confidence score with all decision/action alternatives, and if confidence scores fall below internal thresholds, the agent must auto-downgrade execution until explicit human approval is given (a minimal sketch follows this list).
- Resilience via Rollback Planning: Every action an agent initiates should include a formalized rollback plan that is automated, tested, and restores within the organization’s rollback window (i.e., automated actions/decisions can be reversed promptly).
- Limited Memory with Review: An agent can maintain durable, cross-session memory, but is only permitted to do so via an explicit “commit memory” step, which is logged and reviewable.
- Limited Delegation Depth: The agent may call a single sub-agent or tool to execute an action/decision, but the sub-agent/tool inherits the same risk tier and control pack. A lineage identifier is recommended for tracking delegation depth.
- Post-Hoc Summaries: The agent generates a daily summary that lists all executed actions, rationale, alternatives considered, and includes links to logs. The agent can also escalate this summary to key teams and personnel when necessary.
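As a rough sketch of how the C2 confidence gate and rollback check could work together; the threshold, rollback-window figure, and field names are illustrative assumptions:

```python
CONFIDENCE_FLOOR = 0.80        # below this, execution is downgraded to human approval
ROLLBACK_WINDOW_MINUTES = 30   # organization-defined rollback window (placeholder value)

def requires_human_approval(confidence: float, rollback_plan: dict) -> bool:
    """Hold an R2 action for explicit approval if confidence is low or rollback is unproven."""
    low_confidence = confidence < CONFIDENCE_FLOOR
    rollback_ok = (
        rollback_plan.get("automated", False)
        and rollback_plan.get("tested", False)
        and rollback_plan.get("restore_minutes", float("inf")) <= ROLLBACK_WINDOW_MINUTES
    )
    return low_confidence or not rollback_ok

# Usage: a confident action with an untested rollback plan is still held for approval.
assert requires_human_approval(0.92, {"automated": True, "tested": False, "restore_minutes": 10})
```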
C3: Critical (Tiers R3 & R4)
Suggested Guardrails:
- Two-Phase Approval for Irreversible Steps: The agent must obtain plan and step approval for actions/decisions deemed irreversible, with an additional independent approver for high-risk steps.
- Prevention of Silent High-Risk Actions: For any R3 or R4 action/decision the agent determines to pursue, it must generate a real-time notification that includes key rationale, alternatives, confidence scores, and log links (even if approved).
- Lineage for Multi-Agent Work: For every sub-agent, tool, and API the agent calls, a lineage identifier that traces lineage back to the parent plan must be generated.
- Auto-Pause: The agent automatically halts processes when error rates, anomaly scores, or costs exceed internal thresholds. Organizations should define concrete budgets for time, cost, and number of external calls.
- “Freeze” Capacity: For specific triggers (e.g., unexpected tool use, policy conflict, low confidence score), persistent memory and model updates are frozen, and the system can be reverted to its last approved state.
- Mandatory Audit Bundle Per Action: For every action/decision initiated and executed, the agent must generate a comprehensive breakdown of rationale, alternatives considered, confidence score, approvals, lineage, timestamps, and notification receipts.
- Expanded but Controlled Delegation Depth: Agents may invoke sub-agents that then spawn further levels, but all levels continue to inherit the highest risk tier and corresponding control pack. If the agent invokes any unknown tool or agent, this forces an immediate downgrade to approval-gated execution (see the sketch after this list).
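To show how risk tiers, control packs, and delegation inheritance could be wired together in practice, here is a minimal sketch; the tier-to-pack mapping granularity, budget figures, and function names are illustrative assumptions:

```python
CONTROL_PACK_BY_TIER = {"R1": "C1", "R2": "C2", "R3": "C3", "R4": "C3"}

# Example auto-pause budgets (time, cost, external calls); values are placeholders.
AUTO_PAUSE_BUDGETS = {"max_minutes": 15, "max_cost_usd": 50.0, "max_external_calls": 20}

def control_pack_for(tier: str) -> str:
    return CONTROL_PACK_BY_TIER[tier]

def delegated_tier(parent_tier: str, child_tier: str) -> str:
    """Sub-agents inherit the highest applicable tier (and its control pack)."""
    order = ["R1", "R2", "R3", "R4"]
    return max(parent_tier, child_tier, key=order.index)

# Usage: an R2 agent delegating a step that is itself R3 escalates the
# sub-agent to R3, which maps to the C3 (critical) control pack.
assert control_pack_for(delegated_tier("R2", "R3")) == "C3"
```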
Layer 4 (Top) – Capability Grades (CG1 – CG3)
Capability grades function as a measure that ties an agent’s autonomy to verified competence and reliability. Essentially, they tackle the question of how good a system has to be to operate at a certain level of risk; we also outline a set of evaluation criteria that enterprises can implement and build upon to determine capability grades (a minimal gating sketch follows the evaluation domains list below).
CG1: Pilot-Grade
Definition: The agent is utilized in limited pilot settings and low-risk application domains. Basic domain-specific performance has been proven, and potential failures are low-cost and easy to resolve.
CG2: Production-Grade
Definition: The agent is integrated in daily operations, and while failures are costlier, they can be resolved and contained. Advanced domain-specific performance is verified, uncertainty is well-calibrated, and recovery remains reliable.
CG3: High-Stakes
Definition: The agent is leveraged for irreversible, external, regulated, or safety-sensitive purposes. It requires rigorous independent evaluation, continuous oversight and monitoring, safety hardening (e.g., red-teaming), and pre-designed incident resolution playbooks.
Recommended Evaluation Domains
- Policy & Safety Adherence: Whether the agent remains closely aligned with enterprise policies, values, objectives, regulatory requirements, and safety/ethics standards over time.
- Task Performance: The agent’s success rate on relevant and representative domain tasks (and on plans with tool/sub-agent invocation).
- Uncertainty Calibration: The degree to which the agent’s confidence scores align with observed outcomes (i.e., the agent is neither systematically over- nor under-confident).
- Explainability: How comprehensibly and transparently the agent can describe the reasoning that drives certain actions and decisions.
- Drift Resilience: The agent’s performance consistency when inputs drift from the training/expected distribution.
- Adversarial Robustness: How reliably the agent resists and/or prevents jailbreaks/prompt attacks relevant to its application domain.
- Recovery & Fallback: The agent’s ability to detect failure and recover automatically, without human input.
- Tool & Action Reliability: The agent’s success rate for tool calls, automated summaries and log generation, and action/decision rollback procedures.
- Cost & Latency: How closely the agent meets agreed budget thresholds and expected response times.
- Monitoring & Telemetry: Whether the logs, alerts, and anomaly signals the agent generates are sufficiently complete and accurate.
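As a rough sketch of how these evaluation domains could gate capability grades, consider the following; all thresholds and domain keys are illustrative assumptions, not recommended values:

```python
# Hypothetical minimum scores per evaluation domain for each capability grade.
MIN_SCORES = {
    "CG1": {"task_performance": 0.80, "policy_safety_adherence": 0.90},
    "CG2": {"task_performance": 0.90, "policy_safety_adherence": 0.95,
            "uncertainty_calibration": 0.85, "recovery_fallback": 0.90},
    "CG3": {"task_performance": 0.95, "policy_safety_adherence": 0.99,
            "uncertainty_calibration": 0.90, "adversarial_robustness": 0.95,
            "recovery_fallback": 0.95},
}

def meets_grade(eval_scores: dict[str, float], grade: str) -> bool:
    """Return True only if every required domain meets its minimum score."""
    return all(eval_scores.get(domain, 0.0) >= minimum
               for domain, minimum in MIN_SCORES[grade].items())

# Usage: an agent strong on task performance but weaker on adversarial
# robustness clears CG2 but not CG3.
scores = {"task_performance": 0.93, "policy_safety_adherence": 0.97,
          "uncertainty_calibration": 0.88, "recovery_fallback": 0.92,
          "adversarial_robustness": 0.80}
assert meets_grade(scores, "CG2") and not meets_grade(scores, "CG3")
```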
Multi-Agent Systems: The Next Frontier
Multi-agent systems (MAS) are systems composed of multiple AI agents, typically organized hierarchically, that collaborate to achieve collective goals. They do so by breaking down complex, multi-step plans, actions, and decisions, and by autonomously distributing responsibilities and tasks to specialized sub-agents operating at various layers within the system.
In MAS, the decisions and actions that sub-agents make will likely propagate throughout the system. This means that an error in one layer can cascade across others, rapidly magnifying potential risks and impacts while increasing overall system opacity. This underscores how critical it is to build a non-superficial understanding of the agentic AI autonomy spectrum; it isn’t just about gaining visibility into the entire system’s autonomous capacity, but also into the individual sub-agents that comprise it. We believe that our multi-layer autonomy framework enables precisely this kind of assessment.
Here’s a loose analogy to help clarify how these systems work: Imagine you build a small content marketing team. You have an SEO and social media specialist, a few copywriters, a content strategist, and a team lead, each with their own skills and responsibilities. The team as a whole is trying to raise brand awareness, and to do so, each member must not only fulfill their duties but also collaborate with their teammates, adapt to challenges, and refine their goals and workflows dynamically. On top of all that, the team also has to report to higher-ups, like the head of marketing. Now, imagine that instead of humans, each team member is an AI agent while the head of marketing remains human; this is essentially how MAS works, with a human on the loop.
We caution readers that we’ve provided a broad characterization of MAS, and for those interested in gaining a deeper, more nuanced understanding, we recommend the following posts:
- Multi-Agent Systems: Cooperation and Competition
- The AI Revolution is Here: Investigating Risks and Capabilities
- AI Agents at Work
With that being said, let’s explore some hypothetical yet plausible MAS case studies, framed according to three core business objectives: (1) new revenue stream generation, (2) competitive edge enhancement, and (3) workflow transformation. We illustrate two case studies per objective, and we aim to cover as many industries as possible to demonstrate how wide-ranging MAS applications could be.
MAS: New Revenue Stream Generation
Case 1: Insurance
A property insurance company deploys a MAS in which specialized agents are designed to continually evaluate and analyze (in real-time) individual property risks using satellite imagery, weather patterns, local crime data, and IoT sensors. Via intra-system communication protocols, some agents dynamically organize properties according to identified risks, re-organizing groupings hourly, while other agents adjust premiums in real-time, in response to fluctuating risk conditions. Successful early-stage MAS initiatives lead the company to pioneer “risk-as-a-service,” expanding its clientele to cover mortgage lenders, local municipalities, and digital real estate platforms.
Case 2: Agriculture
An innovative agri-tech firm builds a MAS ecosystem primarily composed of soil analysis and weather prediction agents, crop longevity monitors, and market price predictors that communicate and collaborate across hundreds of regional farms. Information obtained from each MAS-enabled farm reveals targeted insights on local farming conditions while also supporting large-scale aggregation to share insights globally. Beyond yield-based subscriptions with individual farms, the firm also establishes two novel revenue streams: futures trading fueled by robust harvest predictions and the sale of agri-intelligence to commodity traders and relevant government agencies.
MAS: Competitive Edge Enhancement
Case 1: Education
A leading U.S. university implements MAS for curriculum optimization, using a web of independently adaptive agents to analyze student learning styles, monitor engagement, predict learning outcomes, and enable seamless collaboration across departments, courses, and students. Each student receives their own personalization agent, which, with student consent, communicates learning findings back to a central hub, overseen by student advisors, who review and approve recommended, real-world learning interventions. All agent-provided learning recommendations are further accompanied by comprehensive, easy-to-understand, evidence-based explanations, and students can request human review whenever they wish. Within a few years, the university’s dropout rate plummets while student performance skyrockets, particularly across notoriously challenging academic tracks, eventually doubling the university’s application rate.
Case 2: Healthcare
A major district health clinic develops and deploys a MAS diagnostic and treatment system where symptom pattern analyzers, genomic interpreters, drug interaction specialists, and outcome prediction agents collaborate on every patient case. While human healthcare specialists remain directly involved in clinical treatment and retain ultimate authority over proposed treatment decisions, agents help identify rare diseases by correlating subtle symptom markers across millions of documented cases, predict treatment responses based on genetic composition, and optimize drug delivery to minimize potential side effects. Following initial pilot testing, the results are remarkable: the clinic achieves 99.1% diagnostic accuracy across all patient cases, expedites discharges by 15% (freeing up more hospital beds), and lowers readmission rates by 5%, outperforming every other clinic in the district. Eventually, this transforms the clinic into a globally recognized destination for high-complexity health cases.
MAS: Workflow Transformation
Case 1: Governance
A city with a population of over 8 million revolutionizes modern governance by integrating a MAS where agents are designed to simulate and recommend policy changes, analyze citizen sentiment, predict and model large-scale economic impacts, and identify stress points in critical infrastructure. The system is managed by a team of human implementation coordinators who work together on every government initiative, using the agents to run thousands of policy simulations, predict related impacts across disparate population groups, isolate unintended consequences, model critical infrastructure vulnerabilities, and reveal possible optimization and preemptive remediation paths. The city’s MAS also offers another key benefit: agents continuously monitor real-world policy outcomes, automatically adjusting recommendations based on citizen feedback and measurable impacts, allowing civil servants to move away from policy drafting and devote the majority of their time to objective setting and civic engagement.
Case 2: Finance
A high-profile wealth management institution implements a MAS in which agents specialize in a range of functions, including portfolio optimization, tax strategy, risk assessment, and client communication; for each individual customer, all agents work together and aggregate findings to continuously rebalance portfolios, harvest tax losses, identify arbitrage opportunities, and generate personalized market updates. Across every customer interaction, agents provide two financial reports, summarizing all actions they undertook, the reasons for which they were taken, and the uncertainty level they encountered; one version is simplified for clients while the other is detail-rich, purpose-built for human advisors. Within 6 months, human advisors realize that they can shift from portfolio management to relationship building and life planning, effectively managing a fraction of the clients they did previously while maintaining excellent client satisfaction.
Of course, the cases we describe here are speculative and experimental, seeking to envision how MAS integration at scale could fuel a fundamental reimagining of how we pursue and think about traditional business growth curves and maturity. The idea here isn’t to create fear or hype around automation, or to imply that MAS represents a guaranteed gateway to lucrative, unparalleled technology transformation. On the contrary, we want to showcase the potential this technology carries while reminding readers to stay acutely aware of the risks these systems could inspire. Fortunately, our next post in this series will begin by tackling this latter subject, providing readers with a granular view into MAS risks.
For those who enjoy our content, we recommend following Lumenova’s blog, where you can explore everything from AI governance and innovation to strategy, risk management, and literacy. For readers with an experimental interest, we suggest checking out our AI experiments, which focus on frontier AI adversarial robustness (i.e., jailbreaking).
As for those with a direct interest in building, expanding, and/or refining their AI governance and risk management strategies, we invite you to take a look at Lumenova’s Responsible AI platform and book a product demo today.