H.1 Introduction  H-2
H.2 Interprocessor Communication: The Critical Performance Issue  H-3
H.3 Characteristics of Scientific Applications  H-6
H.4 Synchronization: Scaling Up  H-12
H.5 Performance of Scientific Applications on Shared-Memory Multiprocessors  H-21
H.6 Performance Measurement of Parallel Processors with Scientific Applications  H-33
H.7 Implementing Cache Coherence  H-34
H.8 The Custom Cluster Approach: Blue Gene/L  H-41
H.9 Concluding Remarks  H-44
H  Large-Scale Multiprocessors and Scientific Applications
“Hennessy and Patterson should move MPPs to Chapter 11.”

Jim Gray, Microsoft Research, when asked about the coverage of massively parallel processors (MPPs) for the third edition in 2000

Unfortunately for companies in the MPP business, the third edition had only ten chapters and the MPP business did not grow as anticipated when the first and second editions were written.
H.1 Introduction
The primary application of large-scale multiprocessors is for true parallel pro-
gramming, as opposed to multiprogramming or transaction-oriented computing
where independent tasks are executed in parallel without much interaction. In
true parallel computing, a set of tasks execute in a collaborative fashion on one
application. The primary target of parallel computing is scientific and technical
applications. In contrast, for loosely coupled commercial applications, such as
Web servers and most transaction-processing applications, there is little commu-
nication among tasks. For such applications, loosely coupled clusters are gener-
ally adequate and most cost-effective, since intertask communication is rare.
Because true parallel computing involves cooperating tasks, the nature of
communication between those tasks and how such communication is supported
in the hardware is of vital importance in determining the performance of the
application. The next section of this appendix examines such issues and the char-
acteristics of different communication models.
In comparison to sequential programs, whose performance is largely dictated
by the cache behavior and issues related to instruction-level parallelism, parallel
programs have several additional characteristics that are important to perfor-
mance, including the amount of parallelism, the size of parallel tasks, the fre-
quency and nature of intertask communication, and the frequency and nature of
synchronization. These aspects are affected both by the underlying nature of the
application as well as by the programming style. Section H.3 reviews the impor-
tant characteristics of several scientific applications to give a flavor of these
issues.
As we saw in Chapter 4, synchronization can be quite important in achieving
good performance. The larger number of parallel tasks that may need to synchro-
nize makes contention involving synchronization a much more serious problem
in large-scale multiprocessors. Section H.4 examines methods of scaling up the
synchronization mechanisms of Chapter 4.
Section H.5 explores the detailed performance of shared-memory parallel
applications executing on a moderate-scale shared-memory multiprocessor. As
we will see, the behavior and performance characteristics are quite a bit more
complicated than those in small-scale shared-memory multiprocessors. Section
H.6 discusses the general issue of how to examine parallel performance for dif-
ferent sized multiprocessors. Section H.7 explores the implementation challenges
of distributed shared-memory cache coherence, the key architectural approach
used in moderate-scale multiprocessors. Sections H.7 and H.8 rely on a basic
understanding of interconnection networks, and the reader should at least quickly
review Appendix E before reading these sections.
Section H.8 explores the design of one of the newest and most exciting large-
scale multiprocessors in recent times, Blue Gene. Blue Gene is a cluster-based
multiprocessor, but it uses a custom, highly dense node designed specifically for
this function, as opposed to the nodes of most earlier cluster multiprocessors that
used a node architecture similar to those in a desktop or smaller-scale multiprocessor node. By using a custom node design, Blue Gene achieves a significant
reduction in the cost, physical size, and power consumption of a node. Blue
Gene/L, a 64K-node version, is the world’s fastest computer in 2006, as mea-
sured by the linear algebra benchmark, Linpack.
H.2 Interprocessor Communication: The Critical Performance Issue
In multiprocessors with larger processor counts, interprocessor communication
becomes more expensive, since the distance between processors increases. Fur-
thermore, in truly parallel applications where the threads of the application must
communicate, there is usually more communication than in a loosely coupled set
of distinct processes or independent transactions, which characterize many com-
mercial server applications. These factors combine to make efficient interproces-
sor communication one of the most important determinants of parallel
performance, especially for the scientific market.
Unfortunately, characterizing the communication needs of an application and
the capabilities of an architecture are both complex. This section examines the
key hardware characteristics that determine communication performance, while
the next section looks at application behavior and communication needs.
Three performance metrics are critical in any hardware communication
mechanism:
1. Communication bandwidth—Ideally the communication bandwidth is limited by processor, memory, and interconnection bandwidths, rather than by
some aspect of the communication mechanism. The interconnection network
determines the maximum communication capacity of the system. The band-
width in or out of a single node, which is often as important as total system
bandwidth, is affected both by the architecture within the node and by the
communication mechanism. How does the communication mechanism affect
the communication bandwidth of a node? When communication occurs,
resources within the nodes involved in the communication are tied up or
occupied, preventing other outgoing or incoming communication. When this occupancy is incurred for each word of a message, it sets an absolute limit on the communication bandwidth. This limit is often lower than what the network or memory system can provide. Occupancy may also have a component
that is incurred for each communication event, such as an incoming or outgo-
ing request. In the latter case, the occupancy limits the communication rate,
and the impact of the occupancy on overall communication bandwidth
depends on the size of the messages.
2. Communication latency—Ideally the latency is as low as possible. As Appendix E explains:
Communication latency = Sender overhead + Time of flight
+ Transmission time + Receiver overhead
assuming no contention. Time of flight is fixed and transmission time is deter-
mined by the interconnection network. The software and hardware overheads
in sending and receiving messages are largely determined by the communi-
cation mechanism and its implementation. Why is latency crucial? Latency
affects both performance and how easy it is to program a multiprocessor.
Unless latency is hidden, it directly affects performance either by tying up
processor resources or by causing the processor to wait.
Overhead and occupancy are closely related, since many forms of overhead
also tie up some part of the node, incurring an occupancy cost, which in turn
limits bandwidth. Key features of a communication mechanism may
directly affect overhead and occupancy. For example, how is the destination
address for a remote communication named, and how is protection imple-
mented? When naming and protection mechanisms are provided by the pro-
cessor, as in a shared address space, the additional overhead is small.
Alternatively, if these mechanisms must be provided by the operating sys-
tem for each communication, this increases the overhead and occupancy
costs of communication, which in turn reduce bandwidth and increase
latency.
3. Communication latency hiding—How well can the communication mechanism hide latency by overlapping communication with computation or with
other communication? Although measuring this is not as simple as measuring
the first two metrics, it is an important characteristic that can be quantified by
measuring the running time on multiprocessors with the same communica-
tion latency but different support for latency hiding. Although hiding latency
is certainly a good idea, it poses an additional burden on the software system
and ultimately on the programmer. Furthermore, the amount of latency that
can be hidden is application dependent. Thus, it is usually best to reduce
latency wherever possible.
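The latency equation above can be turned into a short numeric sketch. The component values below are hypothetical, chosen only to show how the four terms combine and how transmission time grows with message size:

```python
def communication_latency(sender_overhead_us, time_of_flight_us,
                          message_bytes, link_bandwidth_bytes_per_us,
                          receiver_overhead_us):
    """Latency of one message in microseconds, assuming no contention.

    Follows the Appendix E model: transmission time is message size
    divided by the link bandwidth the interconnection network provides.
    """
    transmission_us = message_bytes / link_bandwidth_bytes_per_us
    return (sender_overhead_us + time_of_flight_us +
            transmission_us + receiver_overhead_us)

# Hypothetical values: 1 us of software overhead on each side,
# 0.5 us time of flight, a 1000 bytes/us (1 GB/s) link, 4 KB message.
latency_us = communication_latency(1.0, 0.5, 4096, 1000.0, 1.0)
print(latency_us)  # 1.0 + 0.5 + 4.096 + 1.0 = 6.596 us
```

Note how, for a message this small, the fixed overheads and time of flight are comparable to the transmission time itself, which is why unhidden latency directly limits performance.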
Each of these performance measures is affected by the characteristics of the
communications needed in the application, as we will see in the next section. The
size of the data items being communicated is the most obvious characteristic,
since it affects both latency and bandwidth directly, as well as affecting the effi-
cacy of different latency-hiding approaches. Similarly, the regularity in the com-
munication patterns affects the cost of naming and protection, and hence the
communication overhead. In general, mechanisms that perform well with smaller
as well as larger data communication requests, and irregular as well as regular
communication patterns, are more flexible and efficient for a wider class of appli-
cations. Of course, in considering any communication mechanism, designers
must consider cost as well as performance.
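The interaction between message size, occupancy, and deliverable bandwidth described above can be captured in a simple model. The per-message and per-word occupancy figures here are invented for illustration, not drawn from any real machine:

```python
def effective_node_bandwidth(message_bytes, link_bw_bytes_per_us,
                             per_message_occupancy_us,
                             per_word_occupancy_us, word_bytes=8):
    """Deliverable bandwidth (bytes/us) out of one node for a stream of
    equal-sized messages: the slower of the link and the node's
    occupancy limit determines the achievable rate.
    """
    words = message_bytes / word_bytes
    occupancy_us = per_message_occupancy_us + words * per_word_occupancy_us
    link_us = message_bytes / link_bw_bytes_per_us
    return message_bytes / max(occupancy_us, link_us)

# Hypothetical node: 1000 bytes/us link, 2 us fixed cost per message,
# 0.001 us of occupancy per 8-byte word.
small = effective_node_bandwidth(64, 1000.0, 2.0, 0.001)     # occupancy-bound
large = effective_node_bandwidth(65536, 1000.0, 2.0, 0.001)  # link-bound
print(round(small, 1), round(large, 1))
```

Under these assumptions, 64-byte messages achieve only a few percent of the link bandwidth because the per-message occupancy dominates, while 64 KB messages saturate the link; this is the sense in which mechanisms that perform well across both small and large requests are more broadly useful.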
Advantages of Different Communication Mechanisms
The two primary means of communicating data in a large-scale multiprocessor
are message passing and shared memory. Each of these two primary communica-