Article Information

Title: Building a distributed system requires a methodical approach to requirements.
Author: Mark Cavage
Journal: ACM Queue (Online): tomorrow's computing today
Electronic ISSN: 1542-7749
Publication year: 2013
Volume: 11
Issue: 4
Language: English
Publisher: Association for Computing Machinery
Abstract: Distributed systems are difficult to understand, design, build, and operate. They introduce exponentially more variables into a design than a single machine does, making the root cause of an application problem much harder to discover. It should be said that if an application does not have meaningful SLAs (service-level agreements) and can tolerate extended downtime and/or performance degradation, then the barrier to entry is greatly reduced. Most modern applications, however, have an expectation of resiliency from their users, and SLAs are typically measured by "the number of nines" (e.g., 99.9 or 99.99 percent availability per month). Each additional 9 becomes harder and harder to achieve. To complicate matters further, it is extremely common that distributed failures will manifest as intermittent errors or decreased performance (commonly known as brownouts). These failure modes are much more time-consuming to diagnose than a complete failure. For example, Joyent operates several distributed systems as part of its cloud-computing infrastructure. In one such system (a highly available, distributed key/value store), Joyent recently experienced transient application timeouts. For most users the system operated normally and responded within the bounds of its latency SLA. However, 5-10 percent of requests exceeded a predefined application timeout. The failures were not reproducible in development or test environments, and they would often "go away" for minutes to hours at a time. Troubleshooting this problem to root cause required extensive system analysis of the data-storage API (node.js), an RDBMS (relational database management system) used internally by the system (PostgreSQL), the operating system, and the end-user application that relied on the key/value system.
Ultimately, the root problem was in application semantics that caused excessive locking, but determining the root cause required considerable data gathering and correlation, and consumed many working hours among engineers with differing areas of expertise.
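
To make the "number of nines" in the abstract concrete, the sketch below converts an availability target into a monthly downtime budget. It is an illustration only, not taken from the article: the function name `downtime_budget_minutes` is hypothetical, and a 30-day month is assumed.

```python
def downtime_budget_minutes(availability_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime allowed over `days` days.

    Assumes the SLA is measured over a fixed window (30 days by default)
    and that availability is given as a percentage, e.g. 99.9.
    """
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Each additional 9 cuts the allowed downtime by a factor of ten.
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability -> {budget:.2f} minutes per 30-day month")
```

Under these assumptions, 99.9 percent allows roughly 43 minutes of downtime per month and 99.99 percent roughly 4 minutes, which suggests why each additional 9 is harder to achieve.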