F.1  Why Vector Processors?
F.2  Basic Vector Architecture
F.3  Two Real-World Issues: Vector Length and Stride
F.4  Enhancing Vector Performance
F.5  Effectiveness of Compiler Vectorization
F.6  Putting It All Together: Performance of Vector Processors
F.7  A Modern Vector Supercomputer: The Cray X1
F.8  Fallacies and Pitfalls
F.9  Concluding Remarks
F.10 Historical Perspective and References
     Exercises
F  Vector Processors
Revised by Krste Asanovic
Massachusetts Institute of Technology
I’m certainly not inventing vector processors. There are three kinds that I know of existing today. They are represented by the Illiac-IV, the (CDC) Star processor, and the TI (ASC) processor. Those three were all pioneering processors. . . . One of the problems of being a pioneer is you always make mistakes and I never, never want to be a pioneer. It’s always best to come second when you can look at the mistakes the pioneers made.

Seymour Cray
Public lecture at Lawrence Livermore Laboratories on the introduction of the Cray-1 (1976)
F.1 Why Vector Processors?
In Chapters 2 and 3 we saw how we could significantly increase the performance of a processor by issuing multiple instructions per clock cycle and by more deeply pipelining the execution units to allow greater exploitation of instruction-level parallelism. (This appendix assumes that you have read Chapters 2 and 3 and Appendix G completely; in addition, the discussion on vector memory systems assumes that you have read Appendix C and Chapter 5.) Unfortunately, we also saw that there are serious difficulties in exploiting ever larger degrees of ILP.

As we increase both the width of instruction issue and the depth of the machine pipelines, we also increase the number of independent instructions required to keep the processor busy with useful work. This means an increase in the number of partially executed instructions that can be in flight at one time. For a dynamically scheduled machine, hardware structures, such as instruction windows, reorder buffers, and rename register files, must grow to have sufficient capacity to hold all in-flight instructions, and worse, the number of ports on each element of these structures must grow with the issue width. The logic to track dependencies between all in-flight instructions grows quadratically in the number of instructions. Even a statically scheduled VLIW machine, which shifts more of the scheduling burden to the compiler, requires more registers, more ports per register, and more hazard interlock logic (assuming a design where hardware manages interlocks after issue time) to support more in-flight instructions, which similarly cause quadratic increases in circuit size and complexity. This rapid increase in circuit complexity makes it difficult to build machines that can control large numbers of in-flight instructions, and hence limits practical issue widths and pipeline depths.
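One way to see the quadratic scaling referred to above (the exact hardware structures differ between designs, so treat these numbers as purely illustrative): with n instructions in flight, the operands of each instruction must, in the worst case, be checked against every other in-flight instruction, giving on the order of n × (n − 1), or roughly n^2, comparisons. Growing the window from 32 to 128 in-flight instructions therefore multiplies this checking logic by roughly a factor of 16, before even counting the additional ports that a wider issue width requires.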
Vector processors were successfully commercialized long before instruction-level parallel machines and take an alternative approach to controlling multiple functional units with deep pipelines. Vector processors provide high-level operations that work on vectors—linear arrays of numbers. A typical vector operation might add two 64-element, floating-point vectors to obtain a single 64-element vector result. The vector instruction is equivalent to an entire loop, with each iteration computing one of the 64 elements of the result, updating the indices, and branching back to the beginning. (A short C sketch of such a scalar loop appears after the list below.)

Vector instructions have several important properties that solve most of the problems mentioned above:

■ A single vector instruction specifies a great deal of work—it is equivalent to executing an entire loop. Each instruction represents tens or hundreds of operations, and so the instruction fetch and decode bandwidth needed to keep multiple deeply pipelined functional units busy is dramatically reduced.

■ By using a vector instruction, the compiler or programmer indicates that the computation of each result in the vector is independent of the computation of other results in the same vector and so hardware does not have to check for data hazards within a vector instruction. The elements in the vector can be computed using an array of parallel functional units, or a single very deeply pipelined functional unit, or any intermediate configuration of parallel and pipelined functional units.

■ Hardware need only check for data hazards between two vector instructions once per vector operand, not once for every element within the vectors. That means the dependency checking logic required between two vector instructions is approximately the same as that required between two scalar instructions, but now many more elemental operations can be in flight for the same complexity of control logic.

■ Vector instructions that access memory have a known access pattern. If the vector’s elements are all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very well. The high latency of initiating a main memory access versus accessing a cache is amortized, because a single access is initiated for the entire vector rather than to a single word. Thus, the cost of the latency to main memory is seen only once for the entire vector, rather than once for each word of the vector.

■ Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.
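To make the loop equivalence above concrete, here is a minimal C sketch (not taken from the appendix) of the scalar loop that a single 64-element vector add replaces. On a vector processor the same computation is expressed as roughly two vector loads, one vector add, and one vector store, so instruction fetch, hazard checking, memory-access startup, and loop-branch overhead are each paid once per vector rather than once per element.

#define VLEN 64   /* one full vector's worth of elements */

/* Scalar equivalent of a single 64-element vector add: 64 iterations,
   each with its own loads, add, store, index update, and branch.
   A vector unit performs the same work under a handful of vector
   instructions, with no per-element hazard checks or branches. */
void vadd64(double c[VLEN], const double a[VLEN], const double b[VLEN])
{
    for (int i = 0; i < VLEN; i++)
        c[i] = a[i] + b[i];
}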
For these reasons, vector operations can be made faster than a sequence of scalar operations on the same number of data items, and designers are motivated to include vector units if the application domain can use them frequently.

Vector processors are particularly useful for large scientific and engineering applications, including car crash simulations and weather forecasting, for which a typical job might take dozens of hours of supercomputer time running over multigigabyte data sets. High-speed scalar processors rely on caches to reduce average memory access latency, but big, long-running scientific programs often have very large active data sets that are sometimes accessed with low locality, yielding poor performance from the memory hierarchy. Some scalar architectures provide mechanisms to bypass the cache when software is aware that memory accesses will have poor locality. But saturating a modern memory system requires hardware to track hundreds or thousands of in-flight scalar memory operations, and this has proven too costly to implement for scalar ISAs. In contrast, vector ISAs launch entire vector fetches into the memory system with each instruction, and much simpler logic can provide high sustained memory bandwidth.
When the last edition of this appendix was written in 2001, exotic vector supercomputers appeared to be slowly fading from the supercomputing arena, to be replaced by systems built from large numbers of superscalar microprocessors. But in 2002, Japan unveiled the world’s fastest supercomputer, the Earth Simulator, designed to create a “virtual planet” to analyze and predict the effect of environmental changes on the world’s climate. The Earth Simulator was five times faster than the previous leader, and faster than the next 12 fastest machines combined. The announcement caused a major upheaval in high-performance computing, particularly in the United States, which was shocked to have lost the lead in an area of strategic importance. The Earth Simulator has fewer processors than competing microprocessor-based machines, but each node is a single-chip vector microprocessor with much greater efficiency on many important supercomputing codes for the reasons given above. The impact of the Earth Simulator, together with the release of a new generation of vector machines from Cray, has led to a resurgence of interest in the type of vector architectures described in this appendix.
F.2 Basic Vector Architecture
A vector processor typically consists of an ordinary pipelined scalar unit plus a vector unit. All functional units within the vector unit have a latency of several clock cycles. This allows a shorter clock cycle time and is compatible with long-running vector operations that can be deeply pipelined without generating hazards. Most vector processors allow the vectors to be dealt with as floating-point numbers, as integers, or as logical data. Here we will focus on floating point. The scalar unit is basically no different from the type of advanced pipelined CPU discussed in Chapters 2 and 3, and commercial vector machines have included both out-of-order scalar units (NEC SX/5) and VLIW scalar units (Fujitsu VPP5000).
There are two primary types of architectures for vector processors: vector-register processors and memory-memory vector processors. In a vector-register processor, all vector operations—except load and store—are among the vector registers. These architectures are the vector counterpart of a load-store architecture. All major vector computers shipped since the late 1980s use a vector-register architecture, including the Cray Research processors (Cray-1, Cray-2, X-MP, Y-MP, C90, T90, SV1, and X1), the Japanese supercomputers (NEC SX/2 through SX/8, Fujitsu VP200 through VPP5000, and the Hitachi S820 and S-8300), and the minisupercomputers (Convex C-1 through C-4). In a memory-memory vector processor, all vector operations are memory to memory. The first vector computers were of this type, as were CDC’s vector computers. From this point on we will focus on vector-register architectures only; we will briefly return to memory-memory vector architectures at the end of the appendix (Section F.10) to discuss why they have not been as successful as vector-register architectures.
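As a schematic illustration of the difference (generic operation names, not actual VMIPS or CDC instructions): to compute C = A + B for two vectors held in memory, a vector-register machine would execute something like "load vector A into register V1; load vector B into V2; add V1 and V2 into V3; store V3 to C," mirroring a scalar load-store ISA, whereas a memory-memory machine would encode the whole computation as a single vector add whose source and destination operands are memory addresses.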
We begin with a vector-register processor consisting of the primary components shown in Figure F.1. This processor, which is loosely based on the Cray-1, is the foundation for discussion throughout most of this appendix. We will call it VMIPS; its scalar portion is MIPS, and its vector portion is the logical vector extension of MIPS. The rest of this section examines how the basic architecture of VMIPS relates to other processors.
The primary components of the instruction set architecture of VMIPS are the following:

■ Vector registers—Each vector register is a fixed-length bank holding a single vector. VMIPS has eight vector registers, and each vector register holds 64 elements.
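As a minimal sketch of the vector register state just described (the register and element counts come from the text; the C names are mine, and the 64-bit double-precision element width is an assumption consistent with the appendix's floating-point focus):

#define NUM_VREGS   8    /* VMIPS has eight vector registers       */
#define VREG_ELEMS  64   /* each vector register holds 64 elements */

/* One fixed-length vector register: a bank of 64 elements,
   assumed here to be double-precision floating-point values. */
typedef struct {
    double elem[VREG_ELEMS];
} vreg_t;

/* The vector register file V0..V7 of this VMIPS sketch. */
typedef struct {
    vreg_t v[NUM_VREGS];
} vector_regfile_t;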