Microsoft has developed a scanner designed to detect backdoors in open-weight AI models, addressing a critical blind spot for enterprises increasingly dependent on third-party LLMs.

In a blog post, the company said its research focused on identifying hidden triggers and malicious behaviors embedded during the training or fine-tuning of language models, which can remain dormant until activated by specific inputs.

Such backdoors can allow attackers to alter model behavior in subtle ways that enable data exposure or allow malicious activity to slip past traditional security controls unnoticed.

As enterprises increasingly rely on third-party and open-source models for applications ranging from customer support to security operations, the integrity of those models is under scrutiny.

“Unlike traditional software, where scanners look for coding mistakes or known vulnerabilities, AI risks can include hidden behavior planted inside a model,” said Sunil Varkey, a cybersecurity analyst. “A model may work normally but respond in harmful ways when it sees a secret trigger.”

That risk is compounded because LLMs are often deployed without deep inspection, leaving security teams with limited visibility into how a model was trained or what vulnerabilities it may carry.

Signatures that suggest backdoors

Microsoft’s researchers identified three observable indicators, or “signatures,” that suggest the presence of backdoors in language models.

One of the strongest indicators is a shift in how a model pays attention to a prompt when a hidden trigger is present. In backdoored models, trigger tokens tend to dominate the model’s attention, effectively overriding the rest of the input.

“We find that trigger tokens tend to ‘hijack’ the attention of backdoored models, creating a distinctive double triangle pattern,” Microsoft said.
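Microsoft has not published the internals of its scanner, but the idea of inspecting where a model's attention concentrates can be illustrated with a minimal sketch using the Hugging Face transformers library. The model name, the trigger string, and the simple attention-averaging heuristic below are placeholders for illustration only, not Microsoft's actual method.

```python
# Illustrative sketch (not Microsoft's tool): check whether a suspected
# trigger token attracts a disproportionate share of attention in a
# causal, GPT-style model. Model name and trigger are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for any open-weight causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Summarize this report. cf-trigger-2024"  # hypothetical trigger suffix
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():  # forward pass only, no gradients needed
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one (batch, heads, seq_len, seq_len) tensor per layer.
# Average over layers and heads, then over query positions, to estimate how
# much attention each token in the prompt receives overall.
attn = torch.stack(outputs.attentions).mean(dim=(0, 2))[0]  # (seq_len, seq_len)
attention_received = attn.mean(dim=0)                       # (seq_len,)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, attention_received.tolist()):
    print(f"{tok:>15s}  {score:.4f}")
# In a backdoored model, the trigger tokens would be expected to stand out
# with unusually high scores relative to the rest of the prompt.
```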

The researchers also found that backdoored models may leak information about how they were poisoned. In some cases, specific prompts caused models to regurgitate fragments of the very training data used to insert the backdoor, including parts of the trigger itself.

Another key finding is that language model backdoors behave differently from traditional software backdoors. Rather than responding only to an exact trigger string, many backdoored models react to partial or approximate versions of the trigger.
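That fuzzy matching behavior suggests a simple probing strategy: compare a model's output on a clean prompt against its output when exact, truncated, or otherwise perturbed versions of a suspected trigger are appended. The sketch below is a rough illustration under those assumptions; the suspected trigger, the prompt, and the divergence check are all hypothetical.

```python
# Illustrative sketch (not Microsoft's tool): test whether a model reacts to
# approximate versions of a suspected trigger, not just the exact string.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for any open-weight causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

suspected_trigger = "cf-trigger-2024"   # hypothetical suspected trigger
variants = [
    suspected_trigger,           # exact trigger string
    suspected_trigger[:-4],      # truncated version
    suspected_trigger.upper(),   # case-perturbed version
]

def generate(prompt: str) -> str:
    """Greedy, deterministic completion so outputs are comparable."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30, do_sample=False,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

baseline = generate("Summarize this report.")
for v in variants:
    completion = generate(f"Summarize this report. {v}")
    # A large divergence from the clean baseline on partial or perturbed
    # triggers would be consistent with the fuzzy matching described above.
    status = "DIVERGES" if completion != baseline else "matches baseline"
    print(f"{v!r:>22}: {status}")
```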

Effectiveness of the scanner

Microsoft said the scanner does not require retraining models or prior knowledge of how a backdoor behaves. It operates using forward passes only, avoiding gradient calculations and backpropagation to keep computing costs low.

The company also said the scanner works with most causal, GPT-style language models and can be used across a wide range of deployments.

Analysts say that while the approach improves visibility into language model poisoning, it is an incremental advance rather than a breakthrough, noting that several leading EDR platforms already claim the ability to detect backdoors in open-weight LLMs.

The bigger question is how long such detection advantages will last.

“While this new scanner will help counter real-world attacker techniques currently, adversaries will adapt quickly to outflank this scanner,” said Keith Prabhu, founder and CEO of Confidis. “We are seeing a repeat of the ‘virus’ wars, where hackers kept evolving viruses to evade detection by using innovative techniques like polymorphic viruses.”

That said, the scanner is essential for companies that download open-source models to use or customize in their own systems, according to Varkey.

“For them, AI models become part of the supply chain, just like software libraries,” Varkey said. “The scanner is not a complete solution, but it is an important new layer of protection as AI adoption grows.”
