July 24th, 2025

Here Be Stateful Dragons: Distributed Storage at the Edge

Photo Credit: https://newhopeinternationalministries.wordpress.com/2016/09/19/here-be-dragons/

Stateful Dragons

Many edge deployments enjoy the luxury of stateless workloads and ephemeral data: data that might be useful, but is not critical to retain. The introduction of stateful applications and long-term persistence summons a whole new class of monsters to the edge.
While we often suggest adopting similar paradigms between edge and cloud, storage is one place where things can get a little different.
At the edge, there’s no friendly abstraction like S3 or EBS to lean on by default. These systems must be engineered and operated by the platform team, often on constrained hardware with limited connectivity. Storage services are rarely simple to build and operate, even though the cloud makes it easy to assume they are.
In this post, we’ll journey to the edge of the map and provide some tips for navigating the dangerous waters of persistent storage. Beware! In these stateful waters, “there be dragons”… but like the many great explorers before us, we go forward anyway.

Selecting the right storage pattern

The first, and most important, step is to understand the type of storage needs you have. There are three key patterns often present in edge deployments.

Ephemeral

The first pattern is the easiest to manage: ephemeral storage. In some edge deployments, data is mostly useful for near-term decision making and it may only be needed for a matter of minutes (or even seconds).
This includes data like…
  • Data in transient pub/sub topics that is deleted upon delivery to all subscribers.
  • Data written to disk but with a low durability SLA.
  • Data that is sent to the cloud (typically because it is nice to have but not critical) for retention and is not needed at the edge thereafter.
Solutions built on ephemeral storage carry far fewer concerns. Replication of blocks, objects, or databases is typically not required to meet operational targets (though it is still sometimes applied).
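As a concrete sketch of the low-durability end of this spectrum, the snippet below creates a transient pub/sub topic whose contents the broker discards a few minutes after they land. It assumes a Kafka broker at localhost:9092; the topic name and retention window are invented for illustration.

```python
# Sketch: a transient topic with a deliberately short retention window,
# so data ages out minutes after delivery. Broker address, topic name,
# and retention are illustrative placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

transient = NewTopic(
    "edge.telemetry.transient",
    num_partitions=1,
    replication_factor=1,          # single broker: no replication overhead
    config={
        "retention.ms": "300000",  # broker deletes data after 5 minutes
        "cleanup.policy": "delete",
    },
)

# create_topics returns {topic: future}; block until the broker confirms
for topic, future in admin.create_topics([transient]).items():
    future.result()
    print(f"created transient topic {topic}")
```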

Analytical

Analytical storage is intended for in-place analysis of data produced in the edge environment. That data could come directly from edge applications, or from IoT and other connected devices in the environment (machines, robots, etc.). The desire for this architecture is driven by a number of factors: regulatory or compliance requirements, fully disconnected (air-gapped) operation, or cost savings. Some examples include:
  • Telemetry data: Monitoring equipment health and performance in facilities where cloud connectivity is intermittent or restricted.
  • Historian pattern: Telemetry retained for trending, human-machine interfaces (HMI), condition-based maintenance, SCADA, AI, and other use cases.
In this pattern, highly compressed columnar storage tends to be a favorite approach.
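To make that concrete, here is a minimal sketch (using pyarrow; the schema, values, and file name are invented) of batching telemetry into compressed Parquet. Repetitive sensor data compresses very well in a columnar layout, and analytical queries can later scan only the columns they need.

```python
# Sketch: batching edge telemetry into compressed, columnar Parquet.
# Schema and values are illustrative only.
import pyarrow as pa
import pyarrow.parquet as pq

batch = pa.table({
    "ts":        pa.array([1721800000, 1721800001, 1721800002], pa.timestamp("s")),
    "device_id": ["press-01", "press-01", "press-02"],
    "temp_c":    [71.2, 71.4, 68.9],
    "vibration": [0.02, 0.03, 0.02],
})

# Columnar layout + ZSTD: similar values sit together, so they compress well,
# and a query for temp_c alone never touches the other columns on disk.
pq.write_table(batch, "telemetry.parquet", compression="zstd")
```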

Long-term Cold Storage

Some classes of data (think security video footage) must be retained for extended periods but are not frequently accessed.
In the Edge Monsters’ experience, this data typically uses some sort of third-party vendor solution and, while it lives alongside the edge deployment, it is not usually integrated with it. However, we are seeing this pattern begin to change in some environments, particularly around telemetry, condition-based maintenance, warranty, and regulatory compliance data. It seems to be trending toward the emerging edge pattern of combining object / blob storage with an OLAP query tool, like DuckDB.
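As a sketch of that emerging pattern, DuckDB can query Parquet sitting in any S3-compatible object store in place. The endpoint, credentials, bucket, and path layout below are all placeholders, not a prescription:

```python
# Sketch: DuckDB querying Parquet in an S3-compatible object store
# (e.g., a local MinIO endpoint). Endpoints and credentials are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("""
    SET s3_endpoint = 'minio.local:9000';
    SET s3_url_style = 'path';
    SET s3_use_ssl = false;
    SET s3_access_key_id = 'edge-user';
    SET s3_secret_access_key = 'edge-secret';
""")

# Aggregate a month of telemetry in place, without shipping it to the cloud
rows = con.execute("""
    SELECT device_id, avg(temp_c) AS avg_temp
    FROM read_parquet('s3://telemetry/2025/07/*.parquet')
    GROUP BY device_id
""").fetchall()
print(rows)
```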
Regardless of the storage class, the goal is to find the best match while considering all the demands and constraints of an edge deployment.

Monster Tips:

  • Maximize ephemeral architectures when use cases allow, keeping complexity as low as possible. As always, this is a key to success at the edge.
  • Don’t ship or retain data that you are unlikely to use in the future.
  • For analytical use cases, we see a high degree of success when leveraging an object storage approach (something that implements the S3 interface) combined with a columnar format like Parquet and an OLAP database / tool like DuckDB.
  • Avoid block-level replication: it is wildly difficult to operate successfully at distributed scale. This is not to imply that configuring block storage is prohibitively difficult, but rather that operating it at full scale while accounting for distributed failures most definitely is.

Data and infrastructure characteristics

Next, we have to ask ourselves about the available storage infrastructure at the edge, our network, and the characteristics of our data.
  • How much data will you have coming into the edge?
  • How much will you need to retain? How long?
  • How fast can you write to the wire? (assuming you are WAN-connected)
  • How fast can you write to disk?
  • How many Drive Writes Per Day (DWPD) can your storage achieve within the warranty?
  • How will you manage back-pressure if there is a bottleneck?
  • Do you need a distributed storage system for reliability or durability?
  • What infrastructure currently exists at the edge, or what can you afford to deploy?
  • Are workloads consistent or bursty?
  • Do you have “catch-up” windows for processing data (hours/days that are non-operational)?
Answering these questions will yield a good understanding of the storage architecture needed and help determine whether new capex investment is required or whether an existing edge footprint can meet the requirements.
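These questions turn into arithmetic quickly. A back-of-the-envelope sketch, with every number invented for illustration:

```python
# Sketch: back-of-the-envelope sizing for one edge site.
# Every figure below is an example; substitute your own measurements.
ingest_gb_per_day   = 40     # data arriving at the edge per day
retention_days      = 30     # how long it must stay local
drive_capacity_gb   = 1920   # usable capacity of one NVMe drive
rated_dwpd          = 1.0    # drive writes per day within warranty
write_amplification = 2.0    # replication, compaction, WAL, etc.

required_capacity_gb = ingest_gb_per_day * retention_days
effective_dwpd = ingest_gb_per_day * write_amplification / drive_capacity_gb

print(f"capacity needed: {required_capacity_gb} GB")
print(f"effective DWPD : {effective_dwpd:.3f} (drive rated for {rated_dwpd})")

if required_capacity_gb > drive_capacity_gb:
    print("-> more drives, shorter retention, or cloud offload required")
if effective_dwpd > rated_dwpd:
    print("-> drives will wear out inside the warranty window")
```

Even this crude math surfaces the two failure modes that matter most: running out of space before the retention window closes, and wearing out drives inside their warranty.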
What about a NAS appliance? Some environments may benefit from a NAS for centralized storage, but be mindful of the complexity and cost of adding another device across the fleet. A single appliance needs drive, power supply, and other redundancies, or it risks making the edge environment brittle.

Monster Tips:

  • Use NAS thoughtfully: A NAS appliance may work for some environments, but be mindful of the N+1 device problem (physical space, network ports, device support, power).
  • Understand the customers: Shared storage like NAS can complicate use cases by mixing local line-of-business data with edge computing workloads. Consider the impacts of one use case upon another before using shared storage.

Beginning with the end in mind

Distributed systems are inherently complex, and edge environments add layers of difficulty (remote access, limited resources, unattended operation, environmental factors). Operational mistakes are almost guaranteed, and traditional backup/restore strategies are often impossible. How do you build a system that is resilient to both hardware failure and human error in the field?
When selecting a solution, implementation complexity is one factor, but more important is the ease of recovery when something goes wrong, and it will.
The most likely cause of early system failure is misconfiguration or mishandling of the complex distributed storage software. What do you do when failure happens and a restore/rebuild is necessary? Given the inevitable storage constraints at the edge, retaining tons of backups is a big challenge.

Monster Tips:

  • Design for failure and recovery: Human technical support is likely absent or very expensive. How will you handle recovery situations? Pick a technology that is recovery-friendly and easy to reason about when things hit the fan.
  • Simplicity trumps features: Prioritize the simplest architecture that meets your core reliability and performance needs. Avoid unnecessary complexity at all costs.
  • Practice failure recovery: Simulate disk failures, node failures, network partitions, and even full site loss. Can you recover? How long does it take? How much data is lost? This is the most critical operational exercise; a sketch of one such drill follows.
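For Kubernetes-based environments, one lightweight drill is sketched below: delete a storage pod and time how long the system takes to report healthy again. The namespace and label selector are placeholders, and this should only run where breakage is safe.

```python
# Sketch: a minimal failure drill for a Kubernetes-hosted storage service.
# Namespace and label selector are placeholders; run only in a site/cluster
# you are allowed to break.
import subprocess
import time

NS, SELECTOR = "storage", "app=minio"   # hypothetical namespace and label

def pods_ready() -> bool:
    out = subprocess.check_output([
        "kubectl", "get", "pods", "-n", NS, "-l", SELECTOR,
        "-o", "jsonpath={.items[*].status.containerStatuses[*].ready}",
    ], text=True).strip()
    return bool(out) and "false" not in out

# Pick a victim pod and delete it to simulate a crash
victim = subprocess.check_output([
    "kubectl", "get", "pods", "-n", NS, "-l", SELECTOR,
    "-o", "jsonpath={.items[0].metadata.name}",
], text=True).strip()
subprocess.run(["kubectl", "delete", "pod", "-n", NS, victim], check=True)

start = time.monotonic()
time.sleep(10)  # give the control plane a moment to register the failure
while not pods_ready():
    time.sleep(5)
print(f"recovered in {time.monotonic() - start:.0f}s")
```

Timing the same drill across disk loss, node loss, and network partitions turns “can we recover?” into a number you can track.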

Open Source or Commercial?

While there are a lot of great commercial solutions in the storage space, we believe open source is the best path forward for storage at the edge. Be clear-eyed about the trade-off, though: proprietary solutions mean you’re beholden to a vendor for support and fixes, while open source means you need deep in-house expertise to debug a storage kernel panic at 3 AM. Does your team have the skill to fix it?

Monster Tips:

  • Open-source wins: Beware of Black Boxes. Be extremely cautious of proprietary “software-defined storage” solutions where you can’t see under the hood or debug issues yourself. Open source is the best option when possible.
  • Keep an eye on cost: Commercial solutions can be great, but may come with a hefty price tag that is tough to swallow at hundreds, thousands, or tens of thousands of copies.

Wishlist and Recommendations

As an industry, we are still adapting data center and cloud storage solutions to the edge, often with significant compromises. There’s a clear need for storage technology designed specifically for the edge: lightweight, self-managing, resilient in harsh conditions, and cost-effective across varying scales (from single-digit TB to PB).
We would love to see a more elegant, open-source, edge-native distributed storage system available. Several of the existing solutions are good, but can be resource-intensive and complex to manage, especially across thousands of distributed environments.
Ceph is powerful but historically complex (though improving). OpenEBS shows promise but needs validation at scale. GlusterFS is deceptively simple to deploy but notoriously difficult to fix when things break.
This leaves the Edge Monsters wanting, but we know we must deploy edge solutions now and accept the storage challenges in exchange for immediate business value. Given that, our current recommendations are…

Monster Tips:

  • S3 Interfaces: Object storage is great at the edge. The Edge Monsters love projects that implement the S3 API and leave the storage implementation to be determined, allowing flexibility across various edge computing requirements. If object storage works for your use case, this is a “no-lose” situation. Of course, you still must select an underlying, likely distributed, storage mechanism. (See the sketch after this list.)
  • Ceph: Several Edge Monsters have successful deployments at scale with Ceph as a distributed storage system. In cases where you have several NVMe drives available across nodes, Ceph is likely a good option.
  • MinIO: The Monsters love MinIO, but find that it can get prohibitively expensive to license at scale.
  • Rook: Another option is Rook, a Kubernetes-native operator that deploys and manages Ceph as the underlying distributed storage system.
  • Longhorn: Some of the Edge Monsters have had success implementing Longhorn for block storage replication in Kubernetes environments.
  • Keep Learning and Sharing: The edge storage space is rapidly evolving. Stay engaged with the community, explore new projects (like edge-specific database or storage solutions), and share your own experiences – both successes and failures.
  • Advocate for Edge-Native: As users and implementers, advocate for storage solutions that are built for the edge, not just ported to the edge.
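To illustrate the S3-interface flexibility mentioned above, the sketch below uses the same boto3 client code whether it is pointed at AWS, MinIO, or Ceph’s RADOS Gateway; only endpoint_url changes. The endpoint, credentials, bucket, and key are placeholders.

```python
# Sketch: the same S3 client code works against AWS, MinIO, or Ceph RGW;
# only endpoint_url changes. Endpoint, credentials, bucket, and key are
# placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.local:9000",  # point at any S3-compatible store
    aws_access_key_id="edge-user",
    aws_secret_access_key="edge-secret",
)

s3.upload_file("telemetry.parquet", "telemetry", "2025/07/24/telemetry.parquet")
for obj in s3.list_objects_v2(Bucket="telemetry", Prefix="2025/07/").get("Contents", []):
    print(obj["Key"], obj["Size"])
```

Swapping the underlying store later (MinIO to Ceph RGW, or out to AWS) then becomes an endpoint and credential change rather than an application rewrite.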

Fear not the dragons 🐲

As we venture deeper into the uncharted waters of edge storage, the challenges are clear, but by thoughtfully matching storage patterns to workload needs, understanding the constraints and risks of edge environments, and leaning into resilient, open-source technologies, we can tame the dragons that lurk in these stateful seas. Remember, simplicity and reliability are your strongest allies. Stay curious, stay engaged, and keep pushing the boundaries of what’s possible at the edge.

Be sure to subscribe for updates and follow us on LinkedIn.

The Edge Monsters: Jim Beyers, Colin Breck, Brian Chambers, Michael Henry, Chris Milliet, Erik Nordmark, Joe Pearson, Jim Teal, Dillon TenBrink, Tilly Gilbert, Anna Boyle & Michael Maxey
