Observability
Observability provides deep visibility into modern distributed applications for faster, automated problem identification and resolution.
01
What is observability?
In general, observability is the extent to which you can understand the internal state or condition of a complex system based only on knowledge of its external outputs. The more observable a system, the more quickly and accurately you can navigate from an identified performance problem to its root cause, without additional testing or coding.
In cloud computing, observability also refers to software tools and practices for aggregating, correlating and analyzing a steady stream of performance data from a distributed application and the hardware it runs on. The aim is to monitor, troubleshoot and debug the application more effectively, so that it meets customer experience expectations, service level agreements (SLAs) and other business requirements.
A relatively new IT topic, observability is often mischaracterized as an overhyped buzzword, or a 'rebranding' of system monitoring in general and application performance monitoring (APM) in particular. In fact, observability is a natural evolution of APM data collection methods that better addresses the increasingly rapid, distributed and dynamic nature of cloud-native application deployments. Observability doesn't replace monitoring – it enables better monitoring, and better APM.
(The term 'observability' comes from control theory, an area of engineering concerned with automating the control of a dynamic system – e.g., the flow of water through a pipe, or the speed of an automobile over inclines and declines – based on feedback from the system.)
02
Why do we need observability?
For the past 20 years or so, IT teams have relied primarily on APM to monitor and troubleshoot applications. APM periodically samples and aggregates application and system data, called telemetry, that's known to be related to application performance issues. It analyzes the telemetry relative to key performance indicators (KPIs) and assembles the results in a dashboard that alerts operations and support teams to abnormal conditions they should address to resolve or prevent issues.
APM is effective enough for monitoring and troubleshooting monolithic applications or traditional distributed applications, where new code is released periodically and the workflows and dependencies between application components, servers and related resources are well-known or easy to trace.
But today organizations are rapidly adopting modern development practices – agile development, continuous integration and continuous deployment (CI/CD), DevOps, multiple programming languages – and cloud-native technologies such as microservices, Docker containers, Kubernetes and serverless functions. As a result, they're bringing more services to market faster than ever. In the process, however, they're deploying new application components so often, in so many places, in so many different languages and for such widely varying periods of time (seconds or fractions of a second, in the case of serverless functions) that APM's once-a-minute data sampling can't keep pace.
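The sampling gap can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `sampled_visibility` and the workload numbers are assumptions made up for this example, not figures from any APM product. It counts how many short-lived components happen to be running at the instant a periodic sample is taken.

```python
import math


def sampled_visibility(lifetimes, sample_interval_s):
    """Fraction of components alive at one or more sample instants.

    lifetimes: list of (start_s, end_s) tuples; a component is alive on [start, end).
    Samples are taken at t = 0, interval, 2 * interval, ...
    """
    seen = 0
    for start, end in lifetimes:
        # First sample instant at or after the component starts.
        first_sample = math.ceil(start / sample_interval_s) * sample_interval_s
        if first_sample < end:
            seen += 1
    return seen / len(lifetimes)


# Hypothetical workload: 600 function invocations over five minutes,
# one starting every 0.5 s and each running for 0.2 s.
invocations = [(i * 0.5, i * 0.5 + 0.2) for i in range(600)]

per_minute = sampled_visibility(invocations, 60.0)  # once-a-minute sampling
per_second = sampled_visibility(invocations, 1.0)   # once-a-second sampling
print(f"60 s interval: {per_minute:.2%} of invocations observed")
print(f" 1 s interval: {per_second:.2%} of invocations observed")
```

With these made-up numbers, once-a-minute sampling catches only 5 of the 600 invocations, and even once-a-second sampling catches half. Shortening the interval helps, but any purely periodic sampler misses components whose lifetime falls between two samples.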
What's needed is higher-quality telemetry – and a lot more of it – that can be used to create a high-fidelity, context-rich, fully correlated record of every application user request or transaction. Enter observability.
03
How does observability work?
Observability platforms discover and collect performance telemetry continuously by integrating with existing instrumentation built into application and infrastructure components, and by providing tools to add instrumentation to these components. Observability focuses on four main telemetry types:
- Logs. Logs are granular, timestamped, complete and immutable records of application events. Among other things, logs can be used to create a high-fidelity, millisecond-by-millisecond record of every event, complete with surrounding context, that developers can 'play back' for troubleshooting and debugging purposes.
- Metrics. Metrics (sometimes called time series metrics) are fundamental measures of application and system health over a given period of time, such as how much memory or CPU capacity an application uses over a five-minute span, or how much latency an application experiences during a spike in usage.
- Traces. Traces record the end-to-end 'journey' of every user request, from the UI or mobile app through the entire distributed architecture and back to the user.
- Dependencies. Dependencies (also called dependency maps) reveal how each application component is dependent on other components, applications and IT resources.
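To show how these telemetry types relate, the sketch below derives a small dependency map from trace spans. This is one possible approach, assumed here for illustration; the article does not say how any particular platform builds its dependency maps. The `Span` fields, the `dependency_map` function and the service names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """One hop of a traced request (hypothetical, minimal schema)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span of a trace
    service: str


def dependency_map(spans):
    """Build {caller_service: [callee_service, ...]} from parent/child span links."""
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    deps = defaultdict(set)
    for span in spans:
        if span.parent_id is None:
            continue
        parent = by_id.get((span.trace_id, span.parent_id))
        # Only cross-service calls count as dependencies.
        if parent is not None and parent.service != span.service:
            deps[parent.service].add(span.service)
    return {caller: sorted(callees) for caller, callees in deps.items()}


# One made-up request: frontend -> checkout -> (payments, inventory)
spans = [
    Span("t1", "a", None, "frontend"),
    Span("t1", "b", "a", "checkout"),
    Span("t1", "c", "b", "payments"),
    Span("t1", "d", "b", "inventory"),
]
print(dependency_map(spans))
```

Here each span records which service handled one step of a request and which span called it, so walking the parent links across many traces yields the caller-to-callee edges of a dependency map.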
After gathering this telemetry, the platform correlates it in real time to provide DevOps teams, site reliability engineering (SRE) teams and IT staff with complete, contextual information – the what, where and why of any event that could indicate, cause, or be used to address an application performance issue.
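As a rough illustration of what correlation means in practice, the sketch below gathers the logs and metric samples that give context to a single slow trace. It assumes, purely for the example, that log entries carry a `trace_id` field and that metric samples carry a `service` name and a timestamp; real platforms use their own schemas and far richer matching.

```python
def correlate(trace, logs, metrics):
    """Collect the logs and metric samples that give context to one trace."""
    # Logs are matched by the trace identifier they were emitted under.
    related_logs = [
        entry for entry in logs if entry.get("trace_id") == trace["trace_id"]
    ]
    # Metrics are matched by service and by the time window of the trace.
    related_metrics = [
        sample for sample in metrics
        if sample["service"] in trace["services"]
        and trace["start"] <= sample["ts"] <= trace["end"]
    ]
    return {
        "trace_id": trace["trace_id"],
        "logs": related_logs,
        "metrics": related_metrics,
    }


# Made-up telemetry for illustration only.
slow_trace = {
    "trace_id": "t-42",
    "services": {"checkout", "payments"},
    "start": 100.0,
    "end": 103.5,
}
logs = [
    {"ts": 101.2, "trace_id": "t-42", "service": "payments",
     "msg": "retrying card authorization"},
    {"ts": 101.3, "trace_id": "t-07", "service": "search", "msg": "cache miss"},
]
metrics = [
    {"ts": 101.0, "service": "payments", "name": "cpu_percent", "value": 97.0},
    {"ts": 250.0, "service": "payments", "name": "cpu_percent", "value": 35.0},
    {"ts": 101.0, "service": "search", "name": "cpu_percent", "value": 20.0},
]

context = correlate(slow_trace, logs, metrics)
print(context)
```

The result bundles the 'what' (the slow trace), the 'where' (the payments service) and a candidate 'why' (a retry log line alongside a CPU spike in the same window) into one record.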
Many observability platforms automatically discover new sources of telemetry as they emerge within the system (such as a new API call to another software application). And because they deal with so much more data than a standard APM solution, many platforms include AIOps (artificial intelligence for operations) capabilities that sift the signals (indications of real problems) from the noise (data unrelated to issues).
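The idea of separating signal from noise can be sketched with a very simple statistical check. This is a deliberately basic stand-in, not a description of how any AIOps product works: it flags a value when it deviates from the mean of the preceding window by more than a chosen number of standard deviations. The function name, window size, threshold and latency values are all assumptions for the example.

```python
from statistics import mean, stdev


def flag_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value is far from the mean of the preceding window.

    'Far' means more than `threshold` sample standard deviations away.
    """
    flagged = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any change at all stands out.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up request latencies in milliseconds, with one obvious spike.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 250, 101, 100]
print(flag_anomalies(latencies))
```

Ordinary jitter around 100 ms stays below the threshold and is treated as noise, while the 250 ms spike is flagged as a signal worth investigating.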
04
Benefits of observability
The overarching benefit of observability is that, all other things being equal, a more observable system is easier to understand (in general and in great detail), easier to monitor, easier and safer to update with new code, and easier to repair than a less observable system. More specifically, observability directly supports the Agile/DevOps/SRE goal of delivering higher-quality software faster by enabling an organization to:
- Discover and address 'unknown unknowns' – issues you don't know exist. A chief limitation of monitoring tools is that they only watch for 'known unknowns' – exceptional conditions you already know to watch for. Observability discovers conditions you might never know or think to look for, then tracks their relationship to specific performance issues and provides the context for identifying root causes to speed resolution.
- Catch and resolve issues early in development. Observability bakes monitoring into the early phases of the software development process. DevOps teams can identify and fix issues in new code before they impact the customer experience or SLAs.
- Scale observability automatically. For example, you can specify instrumentation and data aggregation as part of a Kubernetes cluster configuration and start gathering telemetry from the moment the cluster spins up until it spins down.
- Enable automated remediation and self-healing application infrastructure. Combine observability with AIOps machine learning and automation capabilities to predict issues based on system outputs and resolve them without management intervention.