F.1  Why Vector Processors?
F.2  Basic Vector Architecture
F.3  Two Real-World Issues: Vector Length and Stride
F.4  Enhancing Vector Performance
F.5  Effectiveness of Compiler Vectorization
F.6  Putting It All Together: Performance of Vector Processors
F.7  A Modern Vector Supercomputer: The Cray X1
F.8  Fallacies and Pitfalls
F.9  Concluding Remarks
F.10 Historical Perspective and References
     Exercises
F  Vector Processors
Revised by Krste Asanovic
Massachusetts Institute of Technology
I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors. . . . One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made.

Seymour Cray
Public lecture at Lawrence Livermore Laboratories on the introduction of the Cray-1 (1976)
F.1 Why Vector Processors?
In Chapters 2 and 3 we saw how we could significantly increase the performance of a processor by issuing multiple instructions per clock cycle and by more deeply pipelining the execution units to allow greater exploitation of instruction-level parallelism. (This appendix assumes that you have read Chapters 2 and 3 and Appendix G completely; in addition, the discussion on vector memory systems assumes that you have read Appendix C and Chapter 5.) Unfortunately, we also saw that there are serious difficulties in exploiting ever larger degrees of ILP.

As we increase both the width of instruction issue and the depth of the machine pipelines, we also increase the number of independent instructions required to keep the processor busy with useful work. This means an increase in the number of partially executed instructions that can be in flight at one time. For a dynamically scheduled machine, hardware structures, such as instruction windows, reorder buffers, and rename register files, must grow to have sufficient capacity to hold all in-flight instructions, and worse, the number of ports on each element of these structures must grow with the issue width. The logic to track dependencies between all in-flight instructions grows quadratically in the number of instructions. Even a statically scheduled VLIW machine, which shifts more of the scheduling burden to the compiler, requires more registers, more ports per register, and more hazard interlock logic (assuming a design where hardware manages interlocks after issue time) to support more in-flight instructions, which similarly cause quadratic increases in circuit size and complexity. This rapid increase in circuit complexity makes it difficult to build machines that can control large numbers of in-flight instructions, and hence limits practical issue widths and pipeline depths.
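One way to see the quadratic scaling referred to above (the exact hardware structures differ between designs, so treat these numbers as purely illustrative): with n instructions in flight, the operands of each instruction must, in the worst case, be checked against every other in-flight instruction, giving on the order of n × (n − 1), or roughly n^2, comparisons. Growing the window from 32 to 128 in-flight instructions therefore multiplies this checking logic by roughly a factor of 16, before even counting the additional ports that a wider issue width requires.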
Vector processors were successfully commercialized long before instruction-level parallel machines and take an alternative approach to controlling multiple functional units with deep pipelines. Vector processors provide high-level operations that work on vectors—linear arrays of numbers. A typical vector operation might add two 64-element, floating-point vectors to obtain a single 64-element vector result. The vector instruction is equivalent to an entire loop, with each iteration computing one of the 64 elements of the result, updating the indices, and branching back to the beginning. (A short C sketch of such a scalar loop appears after the list below.)

Vector instructions have several important properties that solve most of the problems mentioned above:

■ A single vector instruction specifies a great deal of work—it is equivalent to executing an entire loop. Each instruction represents tens or hundreds of operations, and so the instruction fetch and decode bandwidth needed to keep multiple deeply pipelined functional units busy is dramatically reduced.

■ By using a vector instruction, the compiler or programmer indicates that the computation of each result in the vector is independent of the computation of other results in the same vector and so hardware does not have to check for data hazards within a vector instruction. The elements in the vector can be computed using an array of parallel functional units, or a single very deeply pipelined functional unit, or any intermediate configuration of parallel and pipelined functional units.

■ Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors. That means the dependency checking logic required between two vector instructions is approximately the same as that required between two scalar instructions, but now many more elemental operations can be in flight for the same complexity of control logic.

■ Vector instructions that access memory have a known access pattern. If the vector’s elements are all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very well. The high latency of initiating a main memory access versus accessing a cache is amortized, because a single access is initiated for the entire vector rather than to a single word. Thus, the cost of the latency to main memory is seen only once for the entire vector, rather than once for each word of the vector.

■ Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.
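To make the loop equivalence above concrete, here is a minimal C sketch (not taken from the appendix) of the scalar loop that a single 64-element vector add replaces. On a vector processor the same computation is expressed as roughly two vector loads, one vector add, and one vector store, so instruction fetch, hazard checking, memory-access startup, and loop-branch overhead are each paid once per vector rather than once per element.

#define VLEN 64   /* one full vector's worth of elements */

/* Scalar equivalent of a single 64-element vector add: 64 iterations,
   each with its own loads, add, store, index update, and branch.
   A vector unit performs the same work under a handful of vector
   instructions, with no per-element hazard checks or branches. */
void vadd64(double c[VLEN], const double a[VLEN], const double b[VLEN])
{
    for (int i = 0; i < VLEN; i++)
        c[i] = a[i] + b[i];
}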
For these reasons, vector operations can be made faster than a sequence of scalar operations on the same number of data items, and designers are motivated to include vector units if the application domain can use them frequently.

Vector processors are particularly useful for large scientific and engineering applications, including car crash simulations and weather forecasting, for which a typical job might take dozens of hours of supercomputer time running over multigigabyte data sets. High-speed scalar processors rely on caches to reduce average memory access latency, but big, long-running scientific programs often have very large active data sets that are sometimes accessed with low locality, yielding poor performance from the memory hierarchy. Some scalar architectures provide mechanisms to bypass the cache when software is aware that memory accesses will have poor locality. But saturating a modern memory system requires hardware to track hundreds or thousands of in-flight scalar memory operations, and this has proven too costly to implement for scalar ISAs. In contrast, vector ISAs launch entire vector fetches into the memory system with each instruction, and much simpler logic can provide high sustained memory bandwidth.
When the last edition of this appendix was written in 2001, exotic vector supercomputers appeared to be slowly fading from the supercomputing arena, to be replaced by systems built from large numbers of superscalar microprocessors. But in 2002, Japan unveiled the world’s fastest supercomputer, the Earth Simulator, designed to create a “virtual planet” to analyze and predict the effect of environmental changes on the world’s climate. The Earth Simulator was five times faster than the previous leader, and faster than the next 12 fastest machines combined. The announcement caused a major upheaval in high-performance computing, particularly in the United States, which was shocked to have lost the lead in an area of strategic importance. The Earth Simulator has fewer processors than competing microprocessor-based machines, but each node is a single-chip vector microprocessor with much greater efficiency on many important supercomputing codes for the reasons given above. The impact of the Earth Simulator, together with the release of a new generation of vector machines from Cray, has led to a resurgence of interest in the type of vector architectures described in this appendix.
F.2 Basic Vector Architecture
A vector processor typically consists of an ordinary pipelined scalar unit plus a vector unit. All functional units within the vector unit have a latency of several clock cycles. This allows a shorter clock cycle time and is compatible with long-running vector operations that can be deeply pipelined without generating hazards. Most vector processors allow the vectors to be dealt with as floating-point numbers, as integers, or as logical data. Here we will focus on floating point. The scalar unit is basically no different from the type of advanced pipelined CPU discussed in Chapters 2 and 3, and commercial vector machines have included both out-of-order scalar units (NEC SX/5) and VLIW scalar units (Fujitsu VPP5000).
There are two primary types of architectures for vector processors: vector-register processors and memory-memory vector processors. In a vector-register processor, all vector operations—except load and store—are among the vector registers. These architectures are the vector counterpart of a load-store architecture. All major vector computers shipped since the late 1980s use a vector-register architecture, including the Cray Research processors (Cray-1, Cray-2, X-MP, Y-MP, C90, T90, SV1, and X1), the Japanese supercomputers (NEC SX/2 through SX/8, Fujitsu VP200 through VPP5000, and the Hitachi S820 and S-8300), and the minisupercomputers (Convex C-1 through C-4). In a memory-memory vector processor, all vector operations are memory to memory. The first vector computers were of this type, as were CDC’s vector computers. From this point on we will focus on vector-register architectures only; we will briefly return to memory-memory vector architectures at the end of the appendix (Section F.10) to discuss why they have not been as successful as vector-register architectures.
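As a schematic illustration of the difference (generic operation names, not actual VMIPS or CDC instructions): to compute C = A + B for two vectors held in memory, a vector-register machine would execute something like "load vector A into register V1; load vector B into V2; add V1 and V2 into V3; store V3 to C," mirroring a scalar load-store ISA, whereas a memory-memory machine would encode the whole computation as a single vector add whose source and destination operands are memory addresses.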
We begin with a vector-register processor consisting of the primary components shown in Figure F.1. This processor, which is loosely based on the Cray-1, is the foundation for discussion throughout most of this appendix. We will call it VMIPS; its scalar portion is MIPS, and its vector portion is the logical vector extension of MIPS. The rest of this section examines how the basic architecture of VMIPS relates to other processors.
The primary components of the instruction set architecture of VMIPS are the following:

■ Vector registers—Each vector register is a fixed-length bank holding a single vector. VMIPS has eight vector registers, and each vector register holds 64 elements.
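As a minimal sketch of the vector register state just described (the register and element counts come from the text; the C names are mine, and the 64-bit double-precision element width is an assumption consistent with the appendix's floating-point focus):

#define NUM_VREGS   8    /* VMIPS has eight vector registers       */
#define VREG_ELEMS  64   /* each vector register holds 64 elements */

/* One fixed-length vector register: a bank of 64 elements,
   assumed here to be double-precision floating-point values. */
typedef struct {
    double elem[VREG_ELEMS];
} vreg_t;

/* The vector register file V0..V7 of this VMIPS sketch. */
typedef struct {
    vreg_t v[NUM_VREGS];
} vector_regfile_t;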