【DKV】Must-Read for the Data Center Industry: Next-Generation Resiliency (Part 2)

yi321yi 2019-03-06

Next-Generation Resiliency

FOCUS | SEPTEMBER 2017

By Andy Lawrence, Executive Director, Uptime Institute & 451 Research, and Todd Traver, Vice President, IT Optimization and Strategy, Uptime Institute

Continued from Part 1: Must-Read for the Data Center Industry: Next-Generation Resiliency (Part 1)

None of these challenges is remotely new, and many systems for distributing data and for locking and unlocking databases were developed in the 1980s. (Early papers by engineers at IBM and Tandem, among others, are still available. Influential relational database pioneer Ted Codd published rules for distributed database management systems in the 1980s.) However, cloud providers that hold huge amounts of data in multiple locations, and that offer in- and out-of-region replication and backup, now have to deal with these issues on an altogether new scale.

Professor Eric Brewer of the University of California, Berkeley (now VP of Infrastructure at Google) identified a key issue. His theorem (see Figure 2) states that it is not possible to design a distributed system that guarantees both availability and complete integrity when a network partition occurs or a node is lost.

CAP theorem, also called Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously guarantee all three of the following attributes:

· Consistency: Every read receives the most recent write or an error.

· Availability: Every request receives a response, though without a guarantee that it contains the most recent version of the information.

· Partition tolerance: The system continues to operate despite arbitrary partitioning due to network failures.

Figure 2: CAP theorem
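The trade-off in Figure 2 can be illustrated with a toy model. The following is only a minimal sketch, assuming two replicas and a hypothetical `mode` flag ("CP" or "AP") that decides how a replica behaves once it is cut off from its peer; the class and function names are illustrative and do not come from any real database.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """Raised when a replica refuses to answer rather than risk stale data."""


@dataclass
class Replica:
    mode: str                      # "CP" or "AP" (illustrative labels only)
    data: dict = field(default_factory=dict)
    partitioned: bool = False      # True when cut off from the peer

    def write(self, key, value, peer):
        if self.partitioned or peer.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse the write if the peer cannot be updated.
                raise Unavailable("cannot reach peer; refusing write")
            # Availability first: accept locally, leaving the peer stale.
            self.data[key] = value
            return
        # Link is up: update both copies.
        self.data[key] = value
        peer.data[key] = value

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # Consistency first: an error instead of a possibly stale value.
            raise Unavailable(f"cannot confirm latest value of {key!r}")
        return self.data.get(key)


def demo(mode):
    """Write, partition the network, write again, then read from the peer."""
    a, b = Replica(mode), Replica(mode)
    a.write("balance", 100, peer=b)        # link up: both replicas hold 100
    a.partitioned = b.partitioned = True   # network partition
    try:
        a.write("balance", 80, peer=b)
        return b.read("balance")
    except Unavailable:
        return "error"
```

In this sketch `demo("AP")` returns the stale value 100 (a response, but not the most recent write), while `demo("CP")` returns "error" (no stale data, but no service during the partition).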

This theorem is important when it comes to resiliency planning using more than one active site. Organizations typically place a very high value on database accuracy, but availability is also critical for many applications, especially transactional, customer-facing ones. This rule shows that, by moving to a distributed environment, a company may have to prioritize one guarantee over the other. Brewer’s theorem also points to the critical importance of the network, which, if it is highly available at all times, can reduce if not eliminate the need for that choice. This explains why hyperscale operators such as Google have invested so heavily in intra-data center fiber and other networking equipment to ensure high availability and capacity.

BASE, ACID and Databases

Until recently, the organizations that made the most use of distributed resiliency were those for which even a small outage could be catastrophic. This group (investment banks, for example) writes all data to two data centers simultaneously (synchronous replication). While one set of data acts as the master, the second is a real-time copy, and if there is a failure, traffic is switched to the second site. There is no danger of an integrity issue, because the software only allows writes to one live master. Suppliers of databases and storage systems and software, such as IBM, HP, Hitachi, Oracle, EMC and others, have long engineered systems for this high-spending category.
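The synchronous pattern described here can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a vendor implementation: `Site` and `SyncPair` are hypothetical names, each site is an in-memory list, and a write is acknowledged only once both copies hold it.

```python
class Site:
    """A data center holding an ordered log of committed writes."""

    def __init__(self, name):
        self.name = name
        self.log = []
        self.up = True

    def commit(self, record):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.log.append(record)


class SyncPair:
    """One live master plus a synchronous, real-time copy."""

    def __init__(self, master, standby):
        self.master, self.standby = master, standby

    def write(self, record):
        # Acknowledge only after every live copy holds the record.
        self.master.commit(record)
        if self.standby is not None:
            try:
                self.standby.commit(record)
            except ConnectionError:
                self.master.log.pop()   # undo so the copies never diverge
                raise
        return "ack"

    def fail_over(self):
        # Traffic switches to the copy, which becomes the only live master.
        self.master, self.standby = self.standby, None
```

Because only one master ever accepts writes, the two logs cannot disagree; the cost is that every write waits for the second site.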

Systems that allow no compromise on integrity are sometimes called ACID systems, to denote Atomicity (each transaction is all or nothing), Consistency (every transaction leaves the database in a state that satisfies all defined rules), Isolation (concurrent transactions do not interfere with one another, as if they were performed sequentially) and Durability (once committed, the transaction is permanent). ACID favors consistency over all else. When ACID databases work together, or if a single database is spread across multiple locations, protocols and processes ensure agreement between multiple endpoints before a transaction can go ahead. Recent advances in so-called NewSQL databases, including Google’s Spanner, replicate this on a distributed, wide scale, with some limited trade-offs.
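Atomicity and consistency can be seen on a single node with Python's standard `sqlite3` module. The sketch below assumes a hypothetical `accounts` table with a rule that balances may not go negative; it shows only the all-or-nothing behavior, not the multi-site agreement protocols mentioned above.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move funds atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK rule was violated; the credit above has been rolled back.
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

A valid transfer changes both rows; an overdraft attempt changes neither, even though the credit to the destination account had already been applied inside the transaction.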

In recent years, with the aid of lower-cost, homogeneous and virtualized architectures, it has become much easier (and cheaper) to replicate IT environments in several active data centers in different locations. This has led to the development of architectures that temporarily (usually momentarily) sacrifice integrity for availability if there is a contention issue. Processes are put in place to resolve any conflicts, in some cases reversing one of two transactions that may have happened independently of each other.

These database design architectures are known as BASE, to denote the characteristics of Basically Available, Soft State and Eventual Consistency. These architectures, supported by modern open source NoSQL databases such as MongoDB and Apache’s CouchDB, incorporate mechanisms for allowing and then resolving conflicting transactions.
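One simple way to allow and then resolve conflicting writes is "last write wins". The sketch below is an assumption chosen for illustration and is not the specific mechanism used by MongoDB or CouchDB; it also assumes timestamps are comparable across sites, which real systems cannot take for granted. `Version`, `resolve` and `merge` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float   # assumed comparable across sites
    site: str          # tie-breaker so every replica picks the same winner


def resolve(a, b):
    """Return (winner, loser); loser is None when the versions agree."""
    if a == b:
        return a, None
    winner = max(a, b, key=lambda v: (v.timestamp, v.site))
    loser = b if winner is a else a
    return winner, loser


def merge(store_a, store_b):
    """Converge two replica dicts; return the writes that were overridden."""
    overridden = []
    for key in store_a.keys() | store_b.keys():
        va, vb = store_a.get(key), store_b.get(key)
        if va is None or vb is None:
            winner, loser = (va if va is not None else vb), None
        else:
            winner, loser = resolve(va, vb)
        store_a[key] = store_b[key] = winner
        if loser is not None:
            overridden.append((key, loser))
    return overridden
```

After `merge`, both replicas hold the same data (eventual consistency), and the returned list identifies the transactions that were effectively reversed and may need compensating action.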

The use of BASE architectures is now very common, especially in cloud environments, and effectively tolerates failures. But there are classes of application for which it is unsuitable, such as trading systems or control situations where eventual resolution or reversible transactions are not acceptable. Even so, given that conflicts are often rare and easily resolved, this architecture is now being widely adopted, reducing costs and enabling more use of distributed architectures to improve resiliency.

BASE architectures rely very heavily on fast, reliable networks. The longer the latency, the more likely it is that conflicts between reads and writes from different users will occur. While these will mostly be resolved easily, too many conflicts could cause problems with clients or control systems in real-time networks. Some Internet of Things (IoT) applications will not sit comfortably on cloud platforms that use BASE architectures.
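The link between latency and conflicts can be made concrete with a rough model. As an assumption for illustration only, suppose competing writes to the same record arrive at random (a Poisson process) at a given rate, and a write is exposed to conflict for as long as it takes to reach the other site. The chance of at least one competing write in that window is then 1 - exp(-rate × delay).

```python
import math


def conflict_probability(writes_per_second, replication_delay_s):
    """Chance that a competing write lands before replication completes.

    Assumes competing writes to the same record follow a Poisson process;
    real workloads are burstier, so treat the result as a rough guide.
    """
    return 1.0 - math.exp(-writes_per_second * replication_delay_s)


# Hypothetical workload: two competing writes per second on a hot record.
for delay_ms in (5, 50, 100):
    p = conflict_probability(2.0, delay_ms / 1000.0)
    print(f"{delay_ms:>3} ms delay -> conflict probability {p:.3f}")
```

Under this model the conflict probability rises almost linearly with delay while delays are short, which is why a slow or congested link translates directly into more conflicts to resolve.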

Types of Distributed Architecture

As we have seen, differing business requirements, including legacy investments, will influence the degree to which newer, distributed systems and databases can be used; similarly, the business requirements and the design of the existing systems will, to some extent, point toward certain resiliency architectures. We see the models in Figure 3 being used for resiliency, with the cloud-based models being markedly different from the earlier ones.

Figure 3: Types of Distributed Architecture

Single-Site Availability

This is the traditional setup, with high levels of redundancy at the infrastructure level, including facilities and basic IT. With sufficient redundancy and planned design, operations can continue in spite of planned (concurrent maintainability), and in some cases unplanned, facilities failure. At the IT level, resilience is further assured by internal replication (e.g., clusters), so that loads may be replicated elsewhere and data/applications/configurations backed up to an offsite disaster recovery (DR) site.

Linked Site Resiliency

This describes two or more lower-tier data centers within a campus, region or zone using a dedicated network to achieve a higher level of availability than is possible at any individual site, typically within synchronous replication distance. (This means that the two data centers are near enough to each other and to customers that they are always synchronized. This distance will depend on the applications, but is usually less than 50 miles.) In order to achieve the same or higher level of facility availability as a high-availability single-site data center, linked sites may double up and share some less-critical infrastructure with nearby in-zone data centers. This assumes resilient and sufficient network capacity with predictable and independent pathways.
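The distance limit for synchronous replication follows from propagation delay, since every write must wait for a round trip to the second site. The back-of-the-envelope sketch below assumes light travels through fiber at roughly 200,000 km/s (about two-thirds of its speed in a vacuum); the `route_factor` and `round_trips` parameters are hypothetical knobs for cable routes that are longer than the straight-line distance and for protocols that need more than one exchange per write. Switching and storage delays are ignored.

```python
KM_PER_MILE = 1.609344
FIBER_KM_PER_S = 200_000.0   # approximate speed of light in optical fiber


def sync_write_penalty_ms(distance_miles, route_factor=1.0, round_trips=1):
    """Extra latency per synchronous write from propagation delay alone."""
    km = distance_miles * KM_PER_MILE * route_factor
    return round_trips * 2 * km / FIBER_KM_PER_S * 1000.0


for miles in (10, 50, 500):
    print(f"{miles:>4} miles -> {sync_write_penalty_ms(miles):.2f} ms per write")
```

At 50 miles the penalty is under a millisecond per round trip, which most transactional applications can absorb; at several hundred miles it reaches multiple milliseconds per write, which is where asynchronous replication usually takes over.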

In this configuration, concurrent maintainability (downtime at one site does not disrupt service) is possible as long as there is sufficient capacity, and processes are in place, to support full operations at either site. At the IT level, this setup can be used to support either synchronous (fault-tolerant automated failover to the second site) or asynchronous (a second copy of applications, data and files is kept at the second site to pick up the load) replication.

Distributed Site Resiliency

This term describes two or more independent sites, in or out of region or globally distributed (cloud or not), using shared internet/VPN networks to provide resiliency through multiple asynchronously connected instances. This can produce very high availability but can result in some (usually minor) loss of integrity between instances if outages occur.
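The "loss of integrity between instances" comes from replication lag: with asynchronous links, a write is acknowledged before the other site has it. The sketch below is a deliberately simplified model with hypothetical names (`AsyncReplicatedStore`, `ship`, `fail_over`); each site is an in-memory list and replication is triggered by hand rather than by a background process.

```python
class AsyncReplicatedStore:
    """Primary site acknowledges immediately; the replica catches up later."""

    def __init__(self):
        self.primary = []   # writes acknowledged to clients
        self.replica = []   # writes that have reached the second site

    def write(self, record):
        self.primary.append(record)   # no waiting on the remote site
        return "ack"

    def ship(self, max_records=None):
        """Replication step: copy pending writes to the replica."""
        pending = self.primary[len(self.replica):]
        if max_records is not None:
            pending = pending[:max_records]
        self.replica.extend(pending)

    def fail_over(self):
        """Primary site lost: replica takes over; return the writes it never saw."""
        lost = self.primary[len(self.replica):]
        self.primary = list(self.replica)
        return lost


store = AsyncReplicatedStore()
for record in ("t1", "t2", "t3"):
    store.write(record)
store.ship(max_records=2)      # replication lags by one write
store.write("t4")
lost_writes = store.fail_over()
```

Here `lost_writes` contains the two acknowledged writes ("t3" and "t4") that had not yet been shipped when the primary failed. Keeping that window small is what fast, reliable networks buy.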

At the IT level, distributed site resiliency is the architecture that underpins most DR services, and especially the modern cloud iteration, DR as a service (DRaaS). Improved network capacity, software tools, database synchronization protocols and, critically, homogeneous IT infrastructure running virtualized workloads have now made this option far more practical, flexible and economically feasible both for active/active operations and for backup and recovery. As more distributed management technologies are added, distributed site resiliency can support or blur into cloud-based resiliency.

Cloud-Based Resiliency

This term describes resiliency provided by distributing virtualized applications, instances and/or containers with associated data across multiple data centers, using middleware, orchestration and distributed databases, under the control of a comprehensive and distributed control system. These systems will enable service or design choices to be made between, for example, absolute database integrity or immediate availability. Effectively, cloud-based resiliency moves the resiliency up to the IT level. Any facility resilience achieved through redundancy provides added security, but may not prove essential. It does, however, assume that there is sufficient capacity in place, including the network, which is critical if loads are shifted from place to place. Developers do not need to concern themselves with location or infrastructure; this architecture is primarily suited to stateless or cloud-native applications.

Clearly, each type of resiliency architecture described above fulfills different purposes and has a different profile in terms of objectives, cost, level of availability and technical maturity. Cloud-based resiliency is the newest, and currently the most expensive; it may provide good total cost of ownership, but effectively can only be achieved at scale and with considerable capital. The types are not mutually exclusive, at least at the facilities level.

For CIOs setting out to develop appropriate resiliency strategies, this is a challenging period because engineering control is being eroded, to be replaced with a more nuanced and strategic approach where good assessments are needed.

With cloud services and architectures now part of the mix, or even the totality, the CIO must determine which type (or types) of resiliency is most appropriate for each type of application and data, based on business needs and technical risk, and then architect the best combination of IT infrastructure. This will span data center resiliency, applications, databases and networking, and must take into account organizational structure, processes, tools and automation. From all this, the organization must then deliver comprehensive and consistent applications that meet and exceed customer expectations for service availability and resiliency.

(End)

Translation: