August 29th, 2024

No News is Good News: Observability at the Edge

Observability is the discipline of understanding the state of a system based on the data it produces about itself. One common approach is to capture absolutely everything possible in case it happens to be useful later. That pattern is often successful in cloud environments, but edge deployments are defined by their constraints: limited bandwidth, limited storage, and intermittent connectivity. These conditions require Edge Platform teams to rethink how they collect telemetry and achieve observability in their environments.

How do we get “just enough” operational telemetry from the edge without exploding costs, straining the network, or drowning in useless logs and metrics?

Observing the Edge 

There are several principles and practices that Edge Platform teams should follow to optimize their edge observability. 

Don’t accept monitoring defaults 

For Edge deployments, we recommend a high degree of scrutiny on the default log and metric collection posture from off-the-shelf and open source tools. These defaults are often fine for cloud-based environments but do not consider the unique constraints of an edge environment. 

Kubernetes is a good example of this: by default it wants to give you loads of metrics and other information about its health. Much of this is nice to know, but most of these metrics will go unexamined and unused while potentially impacting the stability and/or cost of the edge environment.

Commercial monitoring tools that deploy agents often produce huge volumes of data (with the intent to ship them to their cloud) as well. Most Edge Monsters are collecting observability data with tools like Prometheus or Vector, but for those using commercial tools, invest the time to refine the configuration to collect only the metrics you truly need. Cardinality explosion from thousands of environments can quickly become a huge cost problem if not managed effectively.
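One way to move off the defaults is to make collection default-deny: ship only an explicit allowlist of metrics, and strip high-cardinality labels before anything leaves the edge. The sketch below illustrates the idea in Python; the metric and label names are hypothetical and not tied to any particular tool’s API.

```python
# Illustrative sketch: filter collected metrics down to an explicit allowlist
# and drop high-cardinality labels before export. All names are assumptions.

ALLOWED_METRICS = {"http_requests_total", "node_cpu_seconds_total"}
DROPPED_LABELS = {"pod_ip", "container_id"}  # typical cardinality offenders

def filter_metrics(samples):
    """samples: iterable of (name, labels_dict, value) tuples."""
    for name, labels, value in samples:
        if name not in ALLOWED_METRICS:
            continue  # default-deny: only ship what we truly need
        slim = {k: v for k, v in labels.items() if k not in DROPPED_LABELS}
        yield name, slim, value

raw = [
    ("http_requests_total", {"code": "200", "pod_ip": "10.0.0.7"}, 41.0),
    ("go_gc_duration_seconds", {"quantile": "0.5"}, 0.001),  # not allowlisted
]
print(list(filter_metrics(raw)))
```

In Prometheus or Vector the same effect is achieved declaratively in configuration, but the principle is identical: every series that crosses the network should be there on purpose.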

Balance short-term and long-term data needs

To balance the demands for short-term operational insights and long-term strategic analysis, we recommend a two-tier approach to telemetry data collection and retention: 

  • Tier 1 stores logs and metrics short-term at the edge, typically for seven days or less (this varies from company to company) based on Edge Monster experience. This allows on-demand, need-based retrieval of detailed events from “recent history” when troubleshooting issues, without having to sync everything to the cloud at all times. 
  • Tier 2 employs the model of exfiltrating and storing selective events to a cloud or data center environment for long-term retention, often spanning a year or more. Only select metrics and events should be sent for this purpose. This model enables platform teams to perform long-term strategic analysis across larger time horizons while still respecting the bandwidth constraints of the edge.
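The two tiers can be sketched as a simple routing decision at write time: every event lands in the local store, and only an allowlisted subset is also queued for the cloud. This is a minimal illustration; the event names and structures are assumptions.

```python
# Sketch of the two-tier model: everything lands in a short-lived local store,
# while only a small, allowlisted subset is queued for cloud exfiltration.

from collections import deque

CLOUD_ALLOWLIST = {"device_heartbeat", "disk_failure", "daily_sales_total"}

local_store = deque()   # tier 1: kept at the edge, pruned after ~7 days
cloud_queue = deque()   # tier 2: shipped upstream when connectivity allows

def record(event_name, payload):
    local_store.append((event_name, payload))   # tier 1: always kept locally
    if event_name in CLOUD_ALLOWLIST:           # tier 2: selective exfiltration
        cloud_queue.append((event_name, payload))

record("debug_trace", {"msg": "cache miss"})      # stays at the edge
record("device_heartbeat", {"uptime_s": 86400})   # also goes to the cloud
print(len(local_store), len(cloud_queue))  # 2 1
```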

“Long-term trend analysis is essential to ensure you are not missing out on insights that could help you improve edge environments’ autonomy in the long run”

So how do you decide what to store at the edge? Look to your constraints: if you are storage-bound, use a log-volume based approach. If storage is less of a constraint, it likely makes sense to use a time-based retention policy that is suitable for your needs.
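The two retention strategies can be sketched as a pair of pruning functions: one keyed on event age, one on total stored volume. The thresholds and event shape below are illustrative assumptions, not recommendations.

```python
# Sketch of the two retention policies described above. Events are assumed to
# be (timestamp_seconds, size_bytes) pairs in arrival order.

def prune_by_age(events, now, max_age_s=7 * 24 * 3600):
    """Time-based retention: keep only events newer than the cutoff."""
    return [(ts, sz) for ts, sz in events if now - ts <= max_age_s]

def prune_by_volume(events, max_bytes=512 * 1024 * 1024):
    """Volume-based retention: keep the newest events that fit the budget."""
    kept, total = [], 0
    for ts, sz in reversed(events):  # walk newest-first
        if total + sz > max_bytes:
            break
        kept.append((ts, sz))
        total += sz
    return list(reversed(kept))

events = [(0, 100), (50, 100), (100, 100)]
print(prune_by_age(events, now=100, max_age_s=60))  # [(50, 100), (100, 100)]
print(prune_by_volume(events, max_bytes=250))       # [(50, 100), (100, 100)]
```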

Use a metrics-first mindset

Many developers turn to application logs (INFO, DEBUG, ERROR) as the default way to understand the health and performance of their applications. Edge Monsters suggest a pivot to a first-class, metrics-based approach. Metrics come in many forms (system, application, business) and pack a bigger punch in terms of conveying actionable information about system state in a small payload. This suits the edge fantastically. 

When selecting metrics, look for the “golden signals” (latency, traffic, errors, and saturation) that truly indicate system health and performance. In a REST API centric world, these are things like HTTP status responses (200, 404, 500, etc.). 
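A nice property of status-code metrics is that they compress well: collapsing raw codes into status classes yields a handful of low-cardinality counters that convey error rate in a tiny payload. A minimal sketch, with made-up sample data:

```python
# Collapse raw HTTP status codes into low-cardinality "status class" counters,
# an example of the small-payload, high-signal metrics discussed above.

from collections import Counter

responses = [200, 200, 404, 500, 503, 200]  # illustrative sample
status_classes = Counter(f"{code // 100}xx" for code in responses)
print(status_classes)  # Counter({'2xx': 3, '5xx': 2, '4xx': 1})

error_rate = status_classes["5xx"] / len(responses)
print(round(error_rate, 2))  # 0.33
```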

In some cases, business metrics (customer transactions, total sales from the devices, etc.) may be good indicators of overall system health that are difficult to synthesize from system and application metrics. We do not recommend relying on these operationally collected metrics for business reporting, but they can be helpful indicators of overall system or ecosystem health. 

Synthetic clients can help

When diagnosing system health, a synthetic client can help. A synthetic client is an edge-deployed application that can perform tasks like exercising critical shared services (database, messaging, AuthN/Z) and reporting back its experience in the form of useful metrics. Its report-back can also serve as a heartbeat that can be a useful data point in understanding if the edge deployment is online at all. 
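A synthetic client is conceptually simple: run a set of probes against shared services, time each one, and roll the results into a heartbeat. The sketch below assumes hypothetical probe names and stand-in checks; a real client would issue actual database queries, messaging round-trips, and auth calls.

```python
# Hedged sketch of a synthetic client: exercise shared services and report the
# experience as metrics plus a heartbeat. Probe names and checks are stand-ins.

import time

def probe(name, check):
    """Run one check and return a metric-style record of the experience."""
    start = time.monotonic()
    try:
        ok = bool(check())
    except Exception:
        ok = False  # any failure counts as an unhealthy probe
    latency_ms = (time.monotonic() - start) * 1000
    return {"probe": name, "ok": ok, "latency_ms": latency_ms}

checks = {
    "database": lambda: True,    # stand-in for a real `SELECT 1`
    "messaging": lambda: True,   # stand-in for a pub/sub round-trip
    "authn": lambda: 1 / 0,      # deliberately failing check, for illustration
}

results = [probe(name, fn) for name, fn in checks.items()]
heartbeat = {"ts": time.time(), "all_ok": all(r["ok"] for r in results)}
print(heartbeat["all_ok"])  # False
```

The heartbeat doubles as the liveness signal: if it stops arriving, the deployment itself may be offline.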

Use health status to… understand health

Edge applications should communicate their health status in a simple and standardized way. Investing time in developing a thoughtful and thorough health check is extremely important at the edge. A good health check – if trustworthy – can be the indicator that something has gone wrong… and a trigger to pull other metrics and logs retained locally so that further triage and troubleshooting can be performed. 
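What “simple and standardized” might look like in practice: a small payload that aggregates named checks into an overall status. The field names here are hypothetical, purely to illustrate the shape of such a payload.

```python
# Sketch of a standardized health payload an edge application might expose.
# Field names are hypothetical assumptions, not a prescribed schema.

def health_payload(checks):
    """checks: dict of check name -> bool. Degraded unless all checks pass."""
    status = "healthy" if all(checks.values()) else "degraded"
    return {
        "status": status,
        "checks": [{"name": n, "ok": ok} for n, ok in checks.items()],
    }

payload = health_payload({"disk": True, "upstream_api": False})
print(payload["status"])  # degraded
```

A downstream control plane can key off `status` alone, and only fetch the per-check detail (and locally retained logs) when it is not healthy.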

Enterprises writing custom edge applications will benefit from this principle, but it may have the biggest benefit for commercial vendors writing applications for edge environments. In the vendor-supported-application scenario, enterprise platform teams may find themselves in conflict with vendor application teams over how much operational telemetry the vendor wants to collect versus what the platform can support. We encourage vendors to develop sound health checks and to default to sending less data out of edge environments until a failed health check dictates that more is necessary. 

We see an opportunity to standardize this health check “payload” to make downstream consumption and integration more seamless, so you can expect to see an open standard from Edge Monsters in the near future. 

Autonomous Edge

“When a spaceship is launched into space, you can’t keep asking if it is okay; it has to be intelligent enough to let you know when there is an issue and give you the right information to help it out.”

A paradigm shift we encourage with edge operations and observability is moving towards event-driven models where “no news is good news.” In highly distributed and highly constrained environments, automation is your best friend. Don’t expect to ping edge environments regularly to see if everything is okay. Instead, generate meaningful metrics and events when things aren’t okay and when action is needed by a human. The ideal operational paradigm of the Edge is autonomous. Edge control planes should seek to involve humans only when absolutely necessary (someone unplugged something, and a device needs replacement). 

If no human action is needed, allow the system to fix the issue and inform the humans about what took place in the form of more meaningful metrics.
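The event-driven flow above can be sketched as: attempt automated remediation first, and emit an operator-facing event only when self-healing fails. The fault names and remediation hooks below are illustrative assumptions.

```python
# "No news is good news": try to self-heal, and only flag a human when the
# automated fix fails. Fault names and remediations are illustrative.

def handle_fault(fault, remediations, emit_event):
    fix = remediations.get(fault)
    if fix and fix():
        # Fixed autonomously: record what happened for later analysis,
        # but do not page a human.
        emit_event({"fault": fault, "resolved": "auto", "action_needed": False})
        return True
    emit_event({"fault": fault, "resolved": "no", "action_needed": True})
    return False

events = []
remediations = {"pod_crashloop": lambda: True}  # e.g. restart the pod

handle_fault("pod_crashloop", remediations, events.append)     # self-healed
handle_fault("device_unplugged", remediations, events.append)  # needs a human
print([e["action_needed"] for e in events])  # [False, True]
```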

Final Thoughts: Can the edge manage itself? 

In many cases, the Edge can manage itself. We encourage defaulting towards an automated and autonomous operating model as much as possible since dealing with operational issues across fleets of tens of thousands of devices is extremely challenging. 

By leveraging observability techniques like high-fidelity local storage, selective cloud reporting, and event-driven monitoring, edge platform teams can maintain visibility into their environments without being alarmed by every operational fault that occurs. Nobody wants to restart a K8s pod when the pod can effectively restart itself. The best practice for the edge is to develop such a strong observability muscle that you can trust that, if something is wrong, the system will tell you. Until then, no news is good news. This is the way… the way to ensure our edge deployments remain resilient and available, that the edge can take care of itself, that we are informed about what has gone wrong and has been fixed, and that we are only asked to take action when human + tech is the answer. 

In our next post, we will share best practices for deploying applications in edge environments. Be sure to subscribe for updates and follow us on LinkedIn.

The Edge Monsters: Brian Chambers, Erik Nordmark, Joe Pearson, Alex Pham, Jim Teal, Dillon TenBrink, Tilly Gilbert, Anna Boyle & Michael Maxey

Prepared by Anna Boyle, STL Partners

OUR SPONSORS

STL Partners
