There’s a significant gap between the potential value of AI and the measurable value enterprises actually realize. The launch of ChatGPT in 2022 triggered a massive shift in how companies perceive AI. Pilots were launched. Promises of high returns were made. Innovation skyrocketed. LLMs, retrieval-augmented generation (RAG) pipelines and multi-agent systems are now being embedded into decision-critical workflows, from contract analysis to customer support to financial approvals. The pace of technological change has become so rapid that many companies are struggling to keep up.

But here’s one hard truth: only 3 out of 37 GenAI pilots actually succeed. While data quality has emerged as a major concern, experts also point to surrounding issues, including security, observability, evaluation and integration. These four are non-negotiable for GenAI success.

1. Security & data privacy: Beyond firewalls

Anthropic’s experiment with Claudius is an eye-opener. It shows that security in generative AI isn’t about perimeter defenses; it’s about controlling what models and agents can see and do. Unlike traditional systems, where erecting digital walls at the perimeter can be enough, GenAI systems can be attacked through prompt injection, agentic manipulation or shadow models created by reverse engineering.

Perimeter defenses, like firewalls, authentication and DDoS protection, remain crucial, but they only control who can access the system and how much data flows in or out. Several options have since emerged to control what models can see and do, including running inference inside secure enclaves, dynamic PII scrubbing, role-based data filtering and least-privilege access controls for agents. From my experiments, two strategies stand out: confidential compute with policy-driven PII protection and fine-grained agent permissions.

Confidential compute + policy-driven PII protection

In fintech, healthtech, regtech and other domains, LLMs often process sensitive data: contracts, patient records and financials. Even if you trust the cloud, regulators may not. Confidential computing protects data in use, even from cloud operators, which goes a long way toward satisfying compliance requirements.

But there is a trade-off. The technology is still in its early days and can incur significant costs, so it is best reserved for narrow use cases involving regulated data. It works well when paired with dynamic PII scrubbing tools like Presidio or Immuta that adapt protection to geography, user role or data classification.
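To make the scrubbing step concrete, here is a minimal Python sketch using Presidio’s open-source analyzer and anonymizer. The entity list is a stand-in for whatever your data-classification policy dictates per geography or role, and the sample prompt is purely illustrative.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

prompt = "Patient John Doe, card 4111 1111 1111 1111, needs a refund."

# Detect only the entity types our (hypothetical) classification policy cares about.
findings = analyzer.analyze(
    text=prompt,
    language="en",
    entities=["PERSON", "CREDIT_CARD"],
)

# Replace every finding before the prompt ever reaches the LLM.
scrubbed = anonymizer.anonymize(
    text=prompt,
    analyzer_results=findings,
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
)
print(scrubbed.text)  # roughly: "Patient <REDACTED>, card <REDACTED>, needs a refund."
```

In a real pipeline, the entity list and replacement operators would be selected at request time based on the caller’s role and region rather than hard-coded.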

Fine-grained agent permissions (zero trust for LLMs)

Treat AI agents as untrusted by default and grant them only the exact access they need, nothing more. Blanket access is dangerous, much like handing an intern unrestricted control of your ERP system. Agents operate more securely when each agent-tool pair gets a scoped capability token that defines precisely what it’s allowed to do.

For example, an Invoice Extractor agent might parse PDF files but have zero access to financial databases. Policy engines like OPA (Open Policy Agent) or Cerbos enforce these permissions at scale as centralized access managers.
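As a rough sketch of what such a scoped check can look like, the snippet below asks OPA’s standard REST data API whether an agent-tool pair may perform an action. The policy path and the agent and tool names are hypothetical; the decision logic itself would live in your Rego policies.

```python
import requests

# Hypothetical policy path; the rule itself lives in OPA as Rego.
OPA_URL = "http://localhost:8181/v1/data/agents/authz/allow"

def agent_allowed(agent: str, tool: str, action: str) -> bool:
    """Ask OPA whether this agent-tool pair may perform the action."""
    resp = requests.post(
        OPA_URL,
        json={"input": {"agent": agent, "tool": tool, "action": action}},
    )
    resp.raise_for_status()
    return resp.json().get("result", False) is True

# The invoice extractor may parse PDFs...
print(agent_allowed("invoice_extractor", "pdf_parser", "read"))
# ...but any attempt to touch the finance database should be denied by policy.
print(agent_allowed("invoice_extractor", "finance_db", "write"))
```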

Some teams experiment with blockchain-based audit trails for tamper-proof logging—useful in defense or supply chain scenarios, but typically unnecessary overhead for most enterprises.

2. Observability: Taming the black box

Debugging autonomous agents is exponentially harder than debugging chatbots. Without enough observability, you risk “black box chaos”: teams cannot understand, trust or improve the system.

Observability in GenAI means more than logging. You need to trace, debug, replay and validate agent decisions across unpredictable workflows. Implement it early to move from firefighting to proactive reliability. I rely on two solutions:

Distributed tracing with agent graphs

Debugging and optimization are tricky in multi-agent systems because tasks are often delegated in unpredictable ways. Tools like OpenTelemetry, LangSmith and Grafana help visualize how agents make decisions, track their task flows and measure latency at each step.

These tools produce clear interaction graphs that explain system behavior, surface bottlenecks and speed up root-cause analysis. However, detailed traces create storage overhead and data-leakage risks if sensitive prompts or outputs aren’t properly safeguarded.
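The sketch below shows roughly how that instrumentation looks with OpenTelemetry’s Python SDK: one span per agent step, with attributes that make traces searchable later. The agent names are illustrative, the real LLM call is replaced by a placeholder, and the console exporter stands in for a production collector such as Grafana Tempo or Jaeger.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Export spans to the console here; in production this would point at your collector.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-graph")

def run_agent(name: str, task: str) -> str:
    # One span per agent step; attributes make the trace filterable later.
    with tracer.start_as_current_span(name) as span:
        span.set_attribute("agent.task", task)
        result = f"{name} handled: {task}"  # placeholder for the real LLM/tool call
        span.set_attribute("agent.output_chars", len(result))
        return result

# A parent span ties the whole workflow together into one interaction graph.
with tracer.start_as_current_span("invoice-workflow"):
    extracted = run_agent("invoice_extractor", "parse PDF")
    run_agent("approval_router", extracted)
```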

Replay & simulation environments

Many production issues in agentic systems are “one-off” bugs caused by unusual input or timing. Replay environments allow teams to re-run prompt chains and simulate edge cases, which is crucial for diagnosing failures and preventing regressions. Such setups ensure robust deployments and support more rigorous testing before pushing changes live.

However, don’t expect replays to replicate real-life conditions. The complexity and unpredictability of a real production environment are a completely different ball game. Treat this as a complement to, rather than a substitute for, live monitoring.
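A minimal replay harness can be as simple as a JSONL log of prompt-chain steps plus a function that re-runs them and flags drift. The sketch below is an illustrative pattern, not a specific product; the log path is hypothetical and `call_model` is whatever wrapper your stack already uses.

```python
import hashlib
import json
from pathlib import Path

LOG = Path("prompt_chains.jsonl")  # hypothetical trace log captured in production

def record(step: dict) -> None:
    """Append one agent step (model, prompt, params, output hash) to the replay log."""
    with LOG.open("a") as f:
        f.write(json.dumps(step) + "\n")

def replay(call_model) -> list[dict]:
    """Re-run every logged prompt and flag steps whose output drifted."""
    diffs = []
    for line in LOG.read_text().splitlines():
        step = json.loads(line)
        new_output = call_model(step["model"], step["prompt"], **step.get("params", {}))
        if hashlib.sha256(new_output.encode()).hexdigest() != step["output_hash"]:
            diffs.append({"prompt": step["prompt"], "new_output": new_output})
    return diffs
```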

3. Evaluation & model migration readiness

Traditional enterprise release cycles are no match for the rapidly evolving LLM landscape. New models emerge so quickly that the ecosystem stays several steps ahead. Enterprises that fail to keep up risk falling behind on innovation or incurring costly technical debt.

Switching to a new model or framework without a structured approach can cause performance regressions or unexpected behavior in production. Every model switch carries risk; LLM upgrades don’t behave like normal software upgrades. A new model might give different answers, miss compliance rules or fail in the niche use cases your business depends on. Add to that, vendors change pricing frequently, APIs get deprecated and leadership pushes for cost savings or better accuracy.

Continuous evaluation and safe model migration are two possible solutions.

Continuous evaluation pipelines

Treat LLM evaluation like CI/CD in software development: test models the way you test code, continuously. Use curated test sets with domain Q&A, edge cases and red-team prompts to keep models aligned with business goals and catch issues early.

Weekly evaluations let teams catch regressions before they hit production or affect users. This proactive approach keeps models robust against evolving data and changing user needs.

However, frequent evaluation brings significant costs—token usage, infrastructure and human effort to maintain test sets.

Balance cost and coverage by rotating test sets quarterly and incorporating anonymized real user data. This keeps evaluation efficient while simulating real-world scenarios and preserving privacy.
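In practice, this can be as lightweight as a gate in your CI pipeline that re-runs the curated test set and fails the build on regression. The sketch below assumes a hypothetical JSONL test file and threshold, and leaves grading (exact match, rubric or LLM-as-judge) to functions your team supplies.

```python
import json
from pathlib import Path

# Curated test set: domain Q&A, edge cases and red-team prompts (illustrative path).
TEST_SET = Path("eval/domain_tests.jsonl")
PASS_THRESHOLD = 0.95  # illustrative gate

def evaluate(call_model, grade) -> float:
    """Run every test case through the model and return the pass rate.

    `call_model(prompt) -> str` and `grade(case, answer) -> bool` are supplied
    by the team; grading may be exact match, a rubric or an LLM judge.
    """
    cases = [json.loads(line) for line in TEST_SET.read_text().splitlines()]
    passed = sum(grade(case, call_model(case["prompt"])) for case in cases)
    return passed / len(cases)

def ci_gate(call_model, grade) -> None:
    score = evaluate(call_model, grade)
    # Fail the CI job if the model regressed below the agreed threshold.
    assert score >= PASS_THRESHOLD, f"Eval pass rate {score:.2%} is below the gate"
```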

Dual-run migration strategy

Migrating to a new model in production demands precision and caution. Deploy a dual-run strategy so the old and new models operate in parallel. You can then compare their outputs in real time and gate the final switch on predefined evaluation thresholds.

Let me explain with an example. Financial services companies have very specific requirements around privacy, observability and related controls. For one such company, we dual-ran GPT-4 and Mistral for six weeks before making the switch, weighing the pros and cons of each. The transition was smooth because we monitored both models’ outputs and only cut over once the new model consistently met or exceeded performance benchmarks.
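A simplified version of that comparison might look like the sketch below: both models answer the same sampled traffic, a judge function scores the candidate against the incumbent, and cut-over is gated on an agreement threshold. The threshold, wrapper names and judge are illustrative assumptions, not the exact setup we used.

```python
import concurrent.futures as cf

AGREEMENT_GATE = 0.90  # illustrative cut-over threshold

def dual_run(prompts, call_old, call_new, judge) -> float:
    """Send each prompt to both models in parallel and measure agreement.

    `judge(old_answer, new_answer) -> bool` decides whether the candidate output
    is at least as good as the incumbent's (human review or an automated rubric).
    """
    agree = 0
    with cf.ThreadPoolExecutor() as pool:
        for prompt in prompts:
            old_f = pool.submit(call_old, prompt)
            new_f = pool.submit(call_new, prompt)
            if judge(old_f.result(), new_f.result()):
                agree += 1
    return agree / len(prompts)

# Only cut over once the candidate consistently clears the gate, e.g.:
# ready = dual_run(sampled_traffic, call_incumbent, call_candidate, judge) >= AGREEMENT_GATE
```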

Just a quick note: Treat LLMs as modular infrastructure components — not monolithic systems that are too risky to touch. With the right evaluation and migration strategies, enterprises can stay agile, reduce risk and continuously improve their AI capabilities.

4. Secure business integration: From POC to production

Most enterprises plug GenAI into workflows through basic APIs or chat interfaces. This works for prototypes but lacks enterprise-grade guardrails. Security, governance and accountability concerns quickly derail adoption.

True enterprise AI requires deep integration with robust security, governance and accountability built in. AI must be embedded within systems that enforce organizational policies, monitor behavior and ensure traceability. This means pairing AI capabilities with business rules, compliance requirements and operational standards.

Without proper integration, even high-performing models become liabilities—causing data leaks, unauthorized actions or biased decisions.

Policy-aware integration with enterprise systems

Guardrails are essential when integrating GenAI with core enterprise platforms like SAP, Salesforce or ServiceNow. These systems handle sensitive data and critical operations — unconstrained AI access dramatically increases risk.

Implement policy enforcement points (PEPs) as a compliance layer for AI actions. For example, an AI drafting sales proposals needs managerial oversight for approvals over $50,000. Without this guardrail, the system might approve controversial deals autonomously. Air Canada is a cautionary example: its bot gave a customer wrong information, and a tribunal held the company liable for it.

This can be strengthened further with role-based data filtering, which ensures the AI only accesses data the human user is authorized to see, preventing inadvertent exposure of confidential information.
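Here is a toy illustration of such a policy enforcement point, combining the $50,000 approval guardrail with role-based data filtering. The action kinds, scope names and return values are hypothetical; a real PEP would sit in front of the enterprise system and delegate decisions to a policy engine.

```python
from dataclasses import dataclass

APPROVAL_LIMIT = 50_000  # from the example above

@dataclass
class ProposedAction:
    kind: str               # e.g. "draft_proposal", "approve_deal"
    amount: float
    requested_by: str
    user_clearance: set[str]  # data scopes the human user is entitled to

def enforce(action: ProposedAction, required_scopes: set[str]) -> str:
    """A toy policy enforcement point: allow, escalate or block an AI action."""
    # Role-based data filtering: the AI may only touch data its human user can see.
    if not required_scopes <= action.user_clearance:
        return "BLOCK: data outside the user's authorization"
    # Managerial oversight for high-value approvals.
    if action.kind == "approve_deal" and action.amount > APPROVAL_LIMIT:
        return "ESCALATE: manager sign-off required"
    return "ALLOW"
```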

Impact analytics & risk dashboards

Traditional security logs are insufficient for understanding the real-world impact of GenAI applications. Businesses need visibility into how AI affects outcomes — such as whether it reduces escalation rates, flags problematic contracts or improves operational efficiency. For this, you need impact analytics dashboards to track both operational metrics and business KPIs.

However, there’s a risk that AI may optimize for the wrong metrics, such as approving borderline cases to reduce turnaround time. This might compromise quality or compliance.

This safeguard is probably the most commonly advised: organizations must implement human-in-the-loop checkpoints and conduct periodic audits to ensure AI decisions align with strategic goals and ethical standards. I suggest one extra step: create tiered thresholds.

For a low-risk action like drafting internal emails, let the GenAI act autonomously. From there onward, you have to be extra careful. For a medium-risk action like customer responses, route a random sample for human review. For high-risk actions like contract approvals and financial changes, there are no shortcuts: enforce mandatory sign-off.
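A tiered router can be this simple in spirit; the action-to-tier mapping and the 10% sampling rate below are placeholders for whatever your risk policy defines.

```python
import random

# Illustrative tiers; the actions and sampling rate would come from your risk policy.
RISK_TIERS = {
    "draft_internal_email": "low",
    "customer_response": "medium",
    "contract_approval": "high",
    "financial_change": "high",
}
REVIEW_SAMPLE_RATE = 0.10  # fraction of medium-risk actions sent to a human

def route(action: str) -> str:
    tier = RISK_TIERS.get(action, "high")  # unknown actions default to high risk
    if tier == "low":
        return "autonomous"
    if tier == "medium":
        return "human_review" if random.random() < REVIEW_SAMPLE_RATE else "autonomous"
    return "mandatory_signoff"             # high risk: no shortcuts
```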

Where compromise leads to catastrophe

Security, observability, evaluation and integration are four crucial factors creating bottlenecks for enterprise AI adoption. Enterprises operate with huge datasets that are sensitive in nature. Any compromise there could be catastrophic.

  • Controlling what models and agents can see and do is crucial. Confidential computing with policy-driven PII protection and zero-trust safeguards for LLMs have emerged as two effective measures.
  • Observability can negate ‘black box chaos’. Distributed tracing with agent graphs and replay and simulation environments have both proven their worth, but don’t expect the latter to perfectly mimic real-life conditions.
  • Evaluation and model migration help enterprises avoid tech debt and streamline innovation. Continuous evaluation pipelines and a dual-run migration strategy can keep them abreast of the market. But enterprises must also factor in the cost. A spike in evaluation frequency could impact ROI.
  • 95% of POCs fail to move to production, because POC guardrails are no match for real-world security risks. Policy-aware integration and impact analytics with risk dashboards can ensure a smoother transition, and tiered thresholds balance autonomy with oversight.

This article is published as part of the Foundry Expert Contributor Network.