I recently had the opportunity to review five popular SIEM solutions as part of a judging panel for a Security award. While each platform had its own unique flair, their core promises were remarkably consistent:

  • 24/7/365 SOC monitoring: Round-the-clock coverage backed by global experts to validate and prioritize alerts.
  • Proactive threat hunting: Active searches for hidden threats rather than just waiting for automated triggers.
  • AI and machine learning integration: Leveraging everything from basic anomaly detection to “Agentic AI” to reduce noise and accelerate investigations.
  • Active incident response and containment: Capabilities to isolate endpoints or disable compromised users to stop lateral movement.
  • Third-party tool integrations: Ingesting telemetry from the “native stack” and third-party tools like CrowdStrike or Microsoft Defender.
  • Continuous intelligence updates: Constant streams of new detection rules and playbooks based on global research.
  • Service level guarantees: Financial credits or pricing adjustments for broken SLOs.

These offerings are impressive, yet a glaring omission stood out: none of them discussed how they handle multi-tenancy. In a cloud-native world, most, if not all, of these providers likely operate on shared infrastructure. This means they are not immune to the “noisy neighbor” effect, a phenomenon where a single misbehaving tenant can degrade the security posture of everyone else on the platform.

The noisy neighbor effect

As security operations move toward cloud-native frameworks to handle the exponential growth of telemetry data (often reaching petabytes of logs), they rely on the elasticity of software-as-a-service (SaaS). However, the sharing of physical resources (including CPU, memory and I/O) among independent customers introduces a significant engineering risk.

When one tenant’s workload consumes a disproportionate share of these resources, it creates a bottleneck. For other tenants, this translates to increased ingestion latency, delayed threat detection and violated SLAs. In security, a “delayed” alert is often as useless as no alert at all.

The multi-tenant paradox

The core appeal of multi-tenant SIEM solutions is efficiency: shared infrastructure leads to lower costs and unified management. Yet, without deliberate engineering, this becomes a zero-sum game. In a naive system, a high-volume tenant can saturate the ingestion pipeline, causing “starvation” for smaller tenants. This breaks the real-time detection and response (RTDR) promise that these companies market so heavily.

The key distinction is that multi-tenancy does not have to be zero-sum. The fairness strategies explored in this article exist precisely to prevent that outcome, but only if vendors have invested in them. The silence in marketing materials suggests many have not.

Why fairness is an engineering problem

Engineering “fairness” is not merely about setting hard limits; it is about sophisticated resource orchestration. I highly recommend reading AWS’s paper on fairness in multitenant systems. A rigid cap might protect the system, but it punishes a client during a genuine security emergency, when they need ingestion capacity most. Conversely, a completely open system is vulnerable to cascading failures.

To solve this, engineers must move beyond simple rate-limiting and embrace “fair share” scheduling, intelligent queuing and dynamic resource allocation. This article explores the architectural strategies required to ensure that every tenant receives the performance they were promised, even when their neighbor’s house is on fire.

The anatomy of a modern SIEM

To understand where fairness fails in a multi-tenant environment, we must first dissect the anatomy of a modern SIEM. It is no longer a monolithic database, but a distributed data pipeline designed to ingest, transform and analyze petabytes of telemetry. This pipeline relies on decoupling producers from consumers using message queues, ensuring that a spike in one layer does not necessarily lead to a total system failure.

The ingestion layer

The Ingestion Layer is the system’s front door. It is responsible for collecting raw telemetry from diverse sources such as EDR agents, cloud APIs and firewalls. To handle the “firehose” of incoming data, which can spike unpredictably during a security incident, this layer does not process data immediately. Instead, it acts as a high-throughput buffer, writing raw events directly into a raw event queue (typically Apache Kafka). This decoupling is critical because it ensures that even if downstream processing layers are slow, the system can still accept incoming logs without data loss.

The normalization layer

The normalization layer consumes raw events from the initial queue. Its primary role is to bring order to chaos by parsing heterogeneous log formats (JSON, XML or Syslog) into a structured schema like the common information model (CIM). This involves CPU-intensive tasks such as regex matching, field extraction and enrichment. Once processed, these structured events are published to a second normalized event queue. This central bus becomes the single source of truth for all downstream consumers.
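To make the parsing work concrete, here is a minimal sketch of what this layer does: mapping two heterogeneous formats (vendor JSON and a syslog-style line) onto one common schema. The field names and the tiny four-field schema are illustrative, not taken from any vendor’s actual CIM implementation.

```python
import json
import re

# Illustrative target schema: a tiny subset of CIM-style fields.
COMMON_FIELDS = ("timestamp", "src_ip", "user", "action")

def normalize_json(raw: str) -> dict:
    """Map a JSON log's vendor-specific keys onto the common schema."""
    event = json.loads(raw)
    return {
        "timestamp": event.get("ts"),
        "src_ip": event.get("sourceIp"),
        "user": event.get("userName"),
        "action": event.get("eventType"),
    }

# Regex extraction for key=value syslog-style lines (CPU-intensive at scale).
SYSLOG_RE = re.compile(
    r"(?P<timestamp>\S+) .*user=(?P<user>\S+) src=(?P<src_ip>\S+) action=(?P<action>\S+)"
)

def normalize_syslog(raw: str) -> dict:
    """Extract the same common fields from a syslog-style line via regex."""
    match = SYSLOG_RE.search(raw)
    return match.groupdict() if match else {}

json_event = normalize_json(
    '{"ts": "2024-01-01T00:00:00Z", "sourceIp": "10.0.0.5", '
    '"userName": "alice", "eventType": "login_failed"}'
)
syslog_event = normalize_syslog(
    "2024-01-01T00:00:01Z host sshd: user=alice src=10.0.0.5 action=login_failed"
)
print(json_event["user"], syslog_event["user"])  # both map to the same schema
```

Once both formats resolve to identical field names, every downstream consumer can stay format-agnostic.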

The rule-based detection layer (real-time)

The first consumer of the normalized queue is the rule-based detection layer, in recent years often powered by stream-processing engines such as Apache Flink. This layer is optimized for speed, executing low-latency, rule-based logic on events as they flow through the pipe. It handles high-volume, simple detections, such as “five failed logins in one minute,” in milliseconds. By alerting on these patterns immediately, it reduces the time-to-detect for critical threats without waiting for data to be indexed.
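The “five failed logins in one minute” example boils down to a per-entity sliding window. The sketch below shows the core mechanism in plain Python; a real engine like Flink would express this with keyed state and window operators, and the threshold and window size here are just the example’s values.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 5

class FailedLoginRule:
    """Alert when one user accumulates THRESHOLD failures inside the window."""
    def __init__(self):
        self.failures = defaultdict(deque)  # user -> timestamps of recent failures

    def process(self, user: str, ts: float) -> bool:
        window = self.failures[user]
        window.append(ts)
        # Evict events that have aged out of the sliding window.
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()
        return len(window) >= THRESHOLD

rule = FailedLoginRule()
alerts = [rule.process("alice", t) for t in (0, 10, 20, 30, 40)]
print(alerts)  # the fifth failure within 60 seconds trips the rule
```

The “key challenge” row in the summary table below follows directly from this shape: the engine must keep one such window per user, per rule, across millions of concurrent entities.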

The ad-hoc search layer

Parallel to the streaming engine, the ad-hoc search layer also consumes from the normalized queue. This system (often utilizing Elasticsearch or Splunk indexers) is optimized for human interaction. It indexes the data to support sub-second search and retrieval, enabling security analysts to perform investigations and threat hunting. While the streaming layer finds known threats, this layer helps analysts find the unknown ones through interactive querying.

The storage layer (long-term retention)

Simultaneously, a third consumer reads from the normalized queue to persist data into the storage layer. This layer is architected for durability and cost-efficiency, typically writing data to object storage (like Amazon S3) in a columnar format (such as Parquet). This “cold storage” ensures compliance with data retention policies at a fraction of the cost of the high-performance search tier, effectively decoupling retention from compute.

The analytics and correlation layer (batch)

Finally, the analytics and correlation layer operates by consuming data from the storage layer. Unlike the streaming engine, which looks at individual events in motion, this layer executes complex queries over vast historical datasets. It runs scheduled jobs to detect sophisticated patterns, such as “beaconing to a rare domain over thirty days,” that require analyzing long time windows. By reading from storage rather than the real-time stream, it isolates these resource-intensive jobs from the ingestion and search pipelines.
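A crude version of the “beaconing to a rare domain over thirty days” detection can be sketched as a batch job over historical records: flag domains contacted by very few hosts but on many distinct days. The thresholds and the tuple layout are illustrative assumptions; a production job would run this as SQL or Spark over the Parquet tier.

```python
from collections import defaultdict

def detect_beaconing(events, min_days=30, rarity_threshold=2):
    """events: iterable of (day_index, host, domain) tuples.
    Flag (host, domain) pairs where a rare domain (few distinct hosts)
    is contacted on many distinct days -- a long-window beaconing heuristic."""
    days_by_pair = defaultdict(set)
    hosts_by_domain = defaultdict(set)
    for day, host, domain in events:
        days_by_pair[(host, domain)].add(day)
        hosts_by_domain[domain].add(host)
    return [
        (host, domain)
        for (host, domain), days in days_by_pair.items()
        if len(days) >= min_days and len(hosts_by_domain[domain]) <= rarity_threshold
    ]

# One host phoning home to a rare domain daily for 30 days,
# against background traffic to a popular domain:
events = [(d, "host-7", "rare.example.net") for d in range(30)]
events += [(d, f"host-{d % 5}", "cdn.example.com") for d in range(30)]
print(detect_beaconing(events))
```

Note why this belongs in the batch layer: the query needs thirty days of state per (host, domain) pair, which is exactly the kind of long-window aggregation that would overwhelm a real-time stream processor.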

Summary of SIEM layers

| Layer | Primary Function | Key Challenge |
| --- | --- | --- |
| Ingestion | Collects raw logs and buffers them into a raw queue. | Handling massive throughput spikes without data loss. |
| Normalization | Parses raw logs into a common schema and publishes to a normalized queue. | High CPU overhead from regex parsing and enrichment. |
| Rule-based detection | Consumes the normalized stream for fast, rule-based alerting. | Managing state and windowing for millions of concurrent entities. |
| Ad-hoc search | Indexes normalized data for fast, interactive investigation. | Unpredictable resource consumption from complex analyst queries. |
| Storage | Persists normalized data for long-term retention. | Optimizing file formats (Parquet or Avro) for efficient reads and writes. |
| Analytics | Executes complex batch queries against storage. | Scheduling long-running jobs without impacting other workloads. |

Strategies to encode fairness

Without deliberate intervention, shared infrastructure will always favor the loudest voice. To build a resilient SIEM, engineers must implement strategies that enforce isolation and ensure equitable resource distribution. These strategies generally fall into three categories: admission control, tenant-aware scheduling and resource partitioning.

Admission control and rate limiting

The first line of defense is at the very front of the ingestion pipeline. Admission control ensures that a single tenant cannot flood the raw event queue beyond a certain threshold. However, modern SIEMs move beyond “hard” rate limits (where data is simply dropped) and instead use “soft” limits or shaping.

A common approach is the token bucket algorithm. Each tenant is allocated a certain number of tokens per second, representing their licensed ingestion rate. During a spike, they can consume accumulated tokens to “burst” above their limit for a short duration. Once the bucket is empty, the system might begin “shaping” the traffic, introducing slight delays to the ingestion of that specific tenant’s logs to protect the system’s global stability without immediately discarding critical security data.

In practice: A tenant contracted at 10,000 events per second (EPS) might be permitted to burst to 15,000 EPS for up to 60 seconds by drawing on their accumulated token reserve. A real incident generating 20,000 EPS would exhaust the bucket and trigger shaping: their logs slow down, but nothing is dropped. Meanwhile, every other tenant on the platform continues processing at full speed.
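The token bucket itself is a few lines of code. This sketch models the scenario above with a 10,000-EPS rate and roughly 30 seconds of burst reserve (both illustrative numbers); on exhaustion it returns False so the caller can shape the traffic rather than drop it.

```python
class TokenBucket:
    """Per-tenant token bucket: refills `rate` tokens/sec, capped at `burst`."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate      # contracted ingestion rate (tokens per second)
        self.burst = burst    # maximum accumulated tokens (burst reserve)
        self.tokens = burst
        self.last = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, never above the burst cap.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller shapes (delays) this tenant instead of dropping

bucket = TokenBucket(rate=10_000, burst=300_000)  # 10k EPS, ~30s of reserve
# A sustained 20k EPS incident: each second costs 20,000 tokens
# but only 10,000 refill, so the reserve drains at 10,000/sec.
shaped_at = next(t for t in range(1, 120) if not bucket.allow(now=t, cost=20_000))
print(shaped_at)  # second at which shaping kicks in
```

The key design property is that exhaustion is per-tenant state: one bucket running dry says nothing about any other tenant’s bucket.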

Tenant-aware fair share scheduling

Inside the processing layers (such as normalization or analytics), the system must decide which tenant’s tasks to execute next. In a naive “first-in, first-out” (FIFO) model, a massive batch of logs from one tenant will block everyone else.

Engineers solve this by implementing weighted fair queuing (WFQ). Instead of one giant queue for all events, the system maintains virtual queues for each tenant. The scheduler cycles through these queues, picking a small batch of events from each. This ensures that a small tenant with only ten events per second never has to wait behind a large tenant processing ten million. This “interleaving” of processing tasks guarantees that every customer makes progress, regardless of their neighbor’s activity.

In practice: In a Kafka-backed SIEM, this is implemented by assigning each tenant their own partition (or partition group) within a topic. Normalization consumers are then configured to process a bounded number of records per tenant per poll cycle, cycling through partitions in round-robin order. A tenant generating a 50x spike in log volume gets their own partition filling up, but the consumer never spends more than its fair share of processing time on that partition before moving to the next tenant.
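The interleaving behavior can be sketched without Kafka at all: per-tenant virtual queues plus a scheduler that takes a bounded batch from each queue per cycle. The batch size of 2 and the tenant names are arbitrary, and a production WFQ implementation would also weight the queues by tier.

```python
from collections import deque

class FairScheduler:
    """Round-robin over per-tenant virtual queues: each poll takes at most
    `batch` events per tenant, so no tenant monopolizes a cycle."""
    def __init__(self, batch: int = 2):
        self.batch = batch
        self.queues: dict[str, deque] = {}

    def enqueue(self, tenant: str, event):
        self.queues.setdefault(tenant, deque()).append(event)

    def poll(self) -> list:
        taken = []
        for tenant, q in self.queues.items():
            # Bounded work per tenant per cycle -- the core of fair share.
            for _ in range(min(self.batch, len(q))):
                taken.append((tenant, q.popleft()))
        return taken

sched = FairScheduler(batch=2)
for i in range(1000):
    sched.enqueue("big-tenant", i)   # noisy neighbor with a huge backlog
sched.enqueue("small-tenant", 0)     # tiny tenant with one event
first_cycle = sched.poll()
print(first_cycle)  # the small tenant is served in the very first cycle
```

Under FIFO, the small tenant’s single event would sit behind 1,000 others; here it ships in the first poll while the big tenant’s backlog drains at a bounded pace.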

Virtual resource isolation (quotas and reservations)

For components like the ad-hoc search layer, where resource usage is highly unpredictable, engineers use resource partitioning. This involves setting up logical boundaries within the shared compute pool.

Through resource quotas, the SIEM provider can cap the maximum CPU and memory a single tenant’s queries can consume at any given time. Some advanced architectures take this a step further with guaranteed reservations. A high-tier customer might be guaranteed a specific percentage of the cluster’s resources, ensuring that even during a global system spike, their SOC analysts can still run search queries with the same sub-second latency they expect.

In practice: In Elasticsearch, isolation can be approximated by routing a tenant’s indices to a dedicated set of nodes (using shard allocation filtering) and relying on search thread pool sizing and memory circuit breakers to bound what any single query can consume. The result is that a runaway analyst query, such as an expensive aggregation across 90 days of data, hits its memory ceiling and fails gracefully rather than cascading across the entire cluster.
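Stripped of any particular search engine, the quota mechanism is a reservation counter with a hard ceiling. This is a vendor-agnostic sketch; the 512 MiB cap and the cost estimates are invented numbers, and a real system would derive the estimate from the query plan.

```python
class TenantQuota:
    """Per-tenant memory budget for ad-hoc queries: a query that would
    exceed the cap fails fast instead of destabilizing the shared node."""
    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.in_use = 0

    def reserve(self, estimate: int):
        if self.in_use + estimate > self.max_bytes:
            raise MemoryError(
                f"query rejected: would exceed {self.max_bytes}-byte quota"
            )
        self.in_use += estimate

    def release(self, estimate: int):
        self.in_use = max(0, self.in_use - estimate)

quota = TenantQuota(max_bytes=512 * 1024 * 1024)  # 512 MiB per tenant
quota.reserve(100 * 1024 * 1024)      # a normal aggregation fits
try:
    quota.reserve(500 * 1024 * 1024)  # a runaway 90-day aggregation does not
    rejected = False
except MemoryError as exc:
    rejected = True
    print("rejected:", exc)
```

Failing the one oversized query, while the tenant’s earlier reservation stays intact, is the “graceful” part: the blast radius is a single query, not the cluster.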

Per-tenant buffering and decoupled processing

In a highly resilient SIEM, I favor avoiding backpressure (where a downstream failure forces the front end to stop accepting data). Instead of pressuring the ingestion layer to stop, the system uses the queues positioned between each layer as shock absorbers.

By implementing per-tenant virtual partitions within these queues, the system can ensure that a bottleneck in the storage or search layers only affects the processing speed of the responsible tenant. If one tenant’s data is being written too slowly, their specific virtual queue grows, while others continue to process at full speed. This results in delayed detection for the “noisy” tenant, but it guarantees data completeness. The system eventually catches up without ever dropping a log or impacting the real-time performance of the rest of the platform.
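The shock-absorber behavior can be illustrated with a toy buffer: when one tenant’s downstream sink stalls, only that tenant’s virtual queue grows, and nothing is dropped. The health-check dictionary here is a stand-in for whatever readiness signal the real sink exposes.

```python
from collections import deque

class PerTenantBuffer:
    """Queues between layers as shock absorbers: a stalled sink for one
    tenant grows only that tenant's virtual queue, never anyone else's."""
    def __init__(self):
        self.queues = {}      # tenant -> pending events (data completeness)
        self.delivered = {}   # tenant -> events flushed downstream

    def ingest(self, tenant, event):
        self.queues.setdefault(tenant, deque()).append(event)

    def drain(self, sink_healthy):
        # sink_healthy: tenant -> bool; unhealthy sinks simply skip this
        # cycle, so their queue grows instead of blocking ingestion.
        for tenant, q in self.queues.items():
            if sink_healthy.get(tenant, True):
                while q:
                    self.delivered.setdefault(tenant, []).append(q.popleft())

buf = PerTenantBuffer()
for i in range(5):
    buf.ingest("slow-tenant", i)
    buf.ingest("healthy-tenant", i)
buf.drain({"slow-tenant": False, "healthy-tenant": True})
print(len(buf.queues["slow-tenant"]), len(buf.delivered["healthy-tenant"]))
```

The slow tenant’s five events are still queued, in order, ready to flush when its sink recovers; the healthy tenant was never delayed.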

The ultimate isolation: Physical vs. logical

The strategies above address fairness within shared infrastructure. But for certain organizations, the right answer is no sharing at all.

In a modern cloud environment, it is entirely feasible to provision and allocate an entire, independent SIEM stack per tenant. This “cluster-per-tenant” model eliminates the noisy neighbor problem entirely because there are no neighbors. Each customer’s ingestion pipeline, normalization workers, search nodes and storage buckets are fully dedicated to their own workload.

The compliance implications alone make this worth serious consideration. Frameworks like FedRAMP, ITAR and CJIS often have explicit or implicit requirements around compute and data isolation that a shared multi-tenant cluster cannot satisfy without significant architectural gymnastics. A dedicated cluster satisfies these requirements cleanly, reduces audit surface area and simplifies the evidence chain during compliance reviews.

The trade-off is cost. Dedicated clusters carry substantially higher per-tenant overhead: idle compute must be provisioned to handle peak loads, management complexity scales with cluster count and the economies of scale that make shared SaaS attractive are partially surrendered. In practice, providers who offer this model typically charge a meaningful premium (often 2-3x the multi-tenant equivalent) and reserve it for enterprise or public sector customers with specific regulatory requirements.

The practical framework for security leaders evaluating this decision is straightforward. If your organization operates under a compliance framework that names compute or data isolation as a requirement, start with the dedicated cluster conversation. If your primary concern is detection performance and cost, invest time instead in understanding how deeply a vendor has engineered fairness into their shared environment, because that engineering is what determines whether the multi-tenant promise holds when it matters most.

Conclusion

The silence regarding multi-tenancy in major SIEM marketing is a risk that security leaders should not ignore. As telemetry volumes continue to explode, the engineering behind “fairness” becomes just as important as the AI detecting the threats.

An ideal SIEM solution should offer the best of both worlds: the flexibility of a multi-tenant cluster where fairness is deeply engineered into every layer, combined with the option to deploy dedicated, physically isolated clusters for organizations with extreme performance or compliance needs. Until SIEM providers are transparent about how they manage the noisy tenants next door, the promise of 24/7/365 protection remains vulnerable to the activity of a neighbor you didn’t even know you had.

This article is published as part of the Foundry Expert Contributor Network.