H.1 Introduction  H-2
H.2 Interprocessor Communication: The Critical Performance Issue  H-3
H.3 Characteristics of Scientific Applications  H-6
H.4 Synchronization: Scaling Up  H-12
H.5 Performance of Scientific Applications on Shared-Memory Multiprocessors  H-21
H.6 Performance Measurement of Parallel Processors with Scientific Applications  H-33
H.7 Implementing Cache Coherence  H-34
H.8 The Custom Cluster Approach: Blue Gene/L  H-41
H.9 Concluding Remarks  H-44
H  Large-Scale Multiprocessors and Scientific Applications
“Hennessy and Patterson should move MPPs to Chapter 11.”

Jim Gray, Microsoft Research, when asked about the coverage of massively parallel processors (MPPs) for the third edition in 2000

Unfortunately for companies in the MPP business, the third edition had only ten chapters and the MPP business did not grow as anticipated when the first and second editions were written.
H.1 Introduction
The primary application of large-scale multiprocessors is for true parallel pro-
gramming, as opposed to multiprogramming or transaction-oriented computing
where independent tasks are executed in parallel without much interaction. In
true parallel computing, a set of tasks execute in a collaborative fashion on one
application. The primary target of parallel computing is scientific and technical
applications. In contrast, for loosely coupled commercial applications, such as
Web servers and most transaction-processing applications, there is little commu-
nication among tasks. For such applications, loosely coupled clusters are gener-
ally adequate and most cost-effective, since intertask communication is rare.
Because true parallel computing involves cooperating tasks, the nature of
communication between those tasks and how such communication is supported
in the hardware is of vital importance in determining the performance of the
application. The next section of this appendix examines such issues and the char-
acteristics of different communication models.
In comparison to sequential programs, whose performance is largely dictated
by the cache behavior and issues related to instruction-level parallelism, parallel
programs have several additional characteristics that are important to perfor-
mance, including the amount of parallelism, the size of parallel tasks, the fre-
quency and nature of intertask communication, and the frequency and nature of
synchronization. These aspects are affected both by the underlying nature of the
application as well as by the programming style. Section H.3 reviews the impor-
tant characteristics of several scientific applications to give a flavor of these
issues.
As we saw in Chapter 4, synchronization can be quite important in achieving
good performance. The larger number of parallel tasks that may need to synchro-
nize makes contention involving synchronization a much more serious problem
in large-scale multiprocessors. Section H.4 examines methods of scaling up the
synchronization mechanisms of Chapter 4.
Section H.5 explores the detailed performance of shared-memory parallel
applications executing on a moderate-scale shared-memory multiprocessor. As
we will see, the behavior and performance characteristics are quite a bit more
complicated than those in small-scale shared-memory multiprocessors. Section
H.6 discusses the general issue of how to examine parallel performance for dif-
ferent sized multiprocessors. Section H.7 explores the implementation challenges
of distributed shared-memory cache coherence, the key architectural approach
used in moderate-scale multiprocessors. Sections H.7 and H.8 rely on a basic
understanding of interconnection networks, and the reader should at least quickly
review Appendix E before reading these sections.
Section H.8 explores the design of one of the newest and most exciting large-
scale multiprocessors in recent times, Blue Gene. Blue Gene is a cluster-based
multiprocessor, but it uses a custom, highly dense node designed specifically for
this function, as opposed to the nodes of most earlier cluster multiprocessors that
used a node architecture similar to those in a desktop or smaller-scale multiprocessor node. By using a custom node design, Blue Gene achieves a significant
reduction in the cost, physical size, and power consumption of a node. Blue
Gene/L, a 64K-node version, is the world’s fastest computer in 2006, as mea-
sured by the linear algebra benchmark, Linpack.
H.2 Interprocessor Communication: The Critical Performance Issue
In multiprocessors with larger processor counts, interprocessor communication
becomes more expensive, since the distance between processors increases. Fur-
thermore, in truly parallel applications where the threads of the application must
communicate, there is usually more communication than in a loosely coupled set
of distinct processes or independent transactions, which characterize many com-
mercial server applications. These factors combine to make efficient interproces-
sor communication one of the most important determinants of parallel
performance, especially for the scientific market.
Unfortunately, characterizing the communication needs of an application and
the capabilities of an architecture are both complex. This section examines the
key hardware characteristics that determine communication performance, while
the next section looks at application behavior and communication needs.
Three performance metrics are critical in any hardware communication
mechanism:
1. Communication bandwidth—Ideally the communication bandwidth is limited by processor, memory, and interconnection bandwidths, rather than by
some aspect of the communication mechanism. The interconnection network
determines the maximum communication capacity of the system. The band-
width in or out of a single node, which is often as important as total system
bandwidth, is affected both by the architecture within the node and by the
communication mechanism. How does the communication mechanism affect
the communication bandwidth of a node? When communication occurs,
resources within the nodes involved in the communication are tied up or
occupied, preventing other outgoing or incoming communication. When this occupancy is incurred for each word of a message, it sets an absolute limit on the communication bandwidth. This limit is often lower than what the network or memory system can provide. Occupancy may also have a component
that is incurred for each communication event, such as an incoming or outgo-
ing request. In the latter case, the occupancy limits the communication rate,
and the impact of the occupancy on overall communication bandwidth
depends on the size of the messages.
2. Communication latency—Ideally the latency is as low as possible. As Appendix E explains:
Communication latency = Sender overhead + Time of flight
+ Transmission time + Receiver overhead
assuming no contention. Time of flight is fixed and transmission time is deter-
mined by the interconnection network. The software and hardware overheads
in sending and receiving messages are largely determined by the communi-
cation mechanism and its implementation. Why is latency crucial? Latency
affects both performance and how easy it is to program a multiprocessor.
Unless latency is hidden, it directly affects performance either by tying up
processor resources or by causing the processor to wait.
Overhead and occupancy are closely related, since many forms of overhead
also tie up some part of the node, incurring an occupancy cost, which in turn
limits bandwidth. Key features of a communication mechanism may
directly affect overhead and occupancy. For example, how is the destination
address for a remote communication named, and how is protection imple-
mented? When naming and protection mechanisms are provided by the pro-
cessor, as in a shared address space, the additional overhead is small.
Alternatively, if these mechanisms must be provided by the operating sys-
tem for each communication, this increases the overhead and occupancy
costs of communication, which in turn reduce bandwidth and increase
latency.
3. Communication latency hiding—How well can the communication mechanism hide latency by overlapping communication with computation or with
other communication? Although measuring this is not as simple as measuring
the first two metrics, it is an important characteristic that can be quantified by
measuring the running time on multiprocessors with the same communica-
tion latency but different support for latency hiding. Although hiding latency
is certainly a good idea, it poses an additional burden on the software system
and ultimately on the programmer. Furthermore, the amount of latency that
can be hidden is application dependent. Thus, it is usually best to reduce
latency wherever possible.
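The latency equation above can be turned into a short numeric sketch. The component values below are hypothetical, chosen only to show how the four terms combine and how transmission time grows with message size:

```python
def communication_latency(sender_overhead_us, time_of_flight_us,
                          message_bytes, link_bandwidth_bytes_per_us,
                          receiver_overhead_us):
    """Latency of one message in microseconds, assuming no contention.

    Follows the Appendix E model: transmission time is message size
    divided by the link bandwidth the interconnection network provides.
    """
    transmission_us = message_bytes / link_bandwidth_bytes_per_us
    return (sender_overhead_us + time_of_flight_us +
            transmission_us + receiver_overhead_us)

# Hypothetical values: 1 us of software overhead on each side,
# 0.5 us time of flight, a 1000 bytes/us (1 GB/s) link, 4 KB message.
latency_us = communication_latency(1.0, 0.5, 4096, 1000.0, 1.0)
print(latency_us)  # 1.0 + 0.5 + 4.096 + 1.0 = 6.596 us
```

Note how, for a message this small, the fixed overheads and time of flight are comparable to the transmission time itself, which is why unhidden latency directly limits performance.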
Each of these performance measures is affected by the characteristics of the
communications needed in the application, as we will see in the next section. The
size of the data items being communicated is the most obvious characteristic,
since it affects both latency and bandwidth directly, as well as affecting the effi-
cacy of different latency-hiding approaches. Similarly, the regularity in the com-
munication patterns affects the cost of naming and protection, and hence the
communication overhead. In general, mechanisms that perform well with smaller
as well as larger data communication requests, and irregular as well as regular
communication patterns, are more flexible and efficient for a wider class of appli-
cations. Of course, in considering any communication mechanism, designers
must consider cost as well as performance.
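The interaction between message size, occupancy, and deliverable bandwidth described above can be captured in a simple model. The per-message and per-word occupancy figures here are invented for illustration, not drawn from any real machine:

```python
def effective_node_bandwidth(message_bytes, link_bw_bytes_per_us,
                             per_message_occupancy_us,
                             per_word_occupancy_us, word_bytes=8):
    """Deliverable bandwidth (bytes/us) out of one node for a stream of
    equal-sized messages: the slower of the link and the node's
    occupancy limit determines the achievable rate.
    """
    words = message_bytes / word_bytes
    occupancy_us = per_message_occupancy_us + words * per_word_occupancy_us
    link_us = message_bytes / link_bw_bytes_per_us
    return message_bytes / max(occupancy_us, link_us)

# Hypothetical node: 1000 bytes/us link, 2 us fixed cost per message,
# 0.001 us of occupancy per 8-byte word.
small = effective_node_bandwidth(64, 1000.0, 2.0, 0.001)     # occupancy-bound
large = effective_node_bandwidth(65536, 1000.0, 2.0, 0.001)  # link-bound
print(round(small, 1), round(large, 1))
```

Under these assumptions, 64-byte messages achieve only a few percent of the link bandwidth because the per-message occupancy dominates, while 64 KB messages saturate the link; this is the sense in which mechanisms that perform well across both small and large requests are more broadly useful.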
Advantages of Different Communication Mechanisms
The two primary means of communicating data in a large-scale multiprocessor
are message passing and shared memory. Each of these two primary communica-