
Data and Memory Optimization Techniques for Embedded Systems
P. R. PANDA
Synopsys, Inc.
F. CATTHOOR
Inter-University Microelectronics Centre and Katholieke Universiteit Leuven
N. D. DUTT
University of California at Irvine
K. DANCKAERT, E. BROCKMEYER, C. KULKARNI, and
A. VANDERCAPPELLE
Inter-University Microelectronics Centre
and
P. G. KJELDSBERG
Norwegian University of Science and Technology
We present a survey of the state-of-the-art techniques used in performing data and memory-related optimizations in embedded systems. The optimizations are targeted directly or indirectly at the memory subsystem, and impact one or more of three important cost metrics: the area, performance, and power dissipation of the resulting implementation.
We first examine architecture-independent optimizations in the form of code transformations. We next cover a broad spectrum of optimization techniques that address memory architectures at varying levels of granularity, ranging from register files to on-chip memory, data caches, and dynamic memory (DRAM). We end with issues related to memory addressing.
Categories and Subject Descriptors: B.3 [Hardware]: Memory Structures; B.5.1 [Register-Transfer-Level Implementation]: Design—Memory design; B.5.2 [Register-Transfer-Level Implementation]: Design Aids—Automatic synthesis; Optimization; B.7.1 [Integrated Circuits]: Types and Design Styles—Memory technologies; D.3.4 [Programming Languages]: Processors—Compilers; Optimization
Authors’ addresses: P. R. Panda, Synopsys, Inc., 700 E. Middlefield Rd., Mountain View, CA 94043; email: panda@synopsys.com; F. Catthoor, Inter-University Microelectronics Centre and Katholieke Universiteit Leuven, Kapeldreef 75, Leuven, Belgium; email: catthoor@imec.be; N. D. Dutt, Center for Embedded Computer Systems, University of California at Irvine, Irvine, CA 92697; email: dutt@cecs.uci.edu; K. Danckaert, E. Brockmeyer, C. Kulkarni, and A. Vandercappelle, Inter-University Microelectronics Centre, Kapeldreef 75, Leuven, Belgium; email: danckaer@imec.be; brockmey@imec.be; kulkarni@imec.be; vdcappel@imec.be; P. G. Kjeldsberg, Norwegian University of Science and Technology, Trondheim, Norway; email: pgk@fysel.ntnu.no.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 2001 ACM 1084-4309/01/0400-0149 $5.00
General Terms: Algorithms, Design, Experimentation, Performance
Additional Key Words and Phrases: Address generation, allocation, architecture exploration,
code transformation, data cache, data optimization, DRAM, high-level synthesis, memory
architecture customization, memory power dissipation, register file, size estimation, SRAM,
survey
1. INTRODUCTION
In the design of embedded systems, memory issues play a very important role, and often significantly impact the embedded system's performance, power dissipation, and overall cost of implementation. Indeed, as new processor families and processor cores begin to push the limits of high performance, the traditional processor-memory gap widens and often becomes the dominant bottleneck in achieving high performance. While embedded systems range from simple microcontroller-based solutions to high-end mixed hardware/software solutions, embedded system designers need to pay particular attention to issues such as minimizing memory requirements, improving memory throughput, and limiting the power dissipated by the system's memory.
Traditionally, much attention has been paid to the role of memory system design in the compiler, architecture, and CAD domains. Many of these techniques, while applicable to some extent, do not fully exploit the optimization opportunities in embedded system design. From an application viewpoint, embedded systems are special-purpose, and so are amenable to aggressive optimization techniques that can fully utilize knowledge of the applications. Whereas many traditional memory-related hardware and software optimizations had to account for variances due to general-purpose applications, memory optimizations for embedded systems can be tailored to suit the expected profile of code and data. Furthermore, from an architectural viewpoint, the embedded system designer pays great attention to the customization of the memory subsystem (both on-chip and off-chip): this leads to many nontraditional memory organizations, with a standard cache hierarchy being only one of many memory architectural options. Finally, from a constraint viewpoint, the embedded system designer needs not only to meet system performance goals, but also to do so within a power budget (especially for mobile applications) and to meet real-time constraints. The system performance should account not only for the processor's speed but also for the system bus load to shared board-level storage units such as main memory and disk; even the L2 cache is shared in a multiprocessor context. As a result, the memory and bus subsystem costs become a significant contributor to overall system costs, and thus the embedded system designer attempts to minimize memory requirements with the goal of lowering overall system costs.
In this survey, we present a variety of optimization techniques for data and memory used in embedded systems. We begin in Section 2 with a survey of several global optimizations that are independent of the target architectural platform, and which almost always result in improved performance, cost, and power. These optimizations take the form of source-to-source code transformations that precede many traditional compiler and synthesis steps, and which move the design to a superior starting point in the design space exploration of alternative embedded system realizations.
Next, in Section 3, we discuss optimization opportunities in the context of specific memory modules, customized memory architectures, and their use both in a hardware (or behavioral) synthesis context and in a software (or traditional compiler) context. This section begins with optimization techniques applied to the memory elements closest to the computational engines (registers and register files), and then discusses optimization techniques for increasingly distant memory structures: SRAM, cache, and DRAM. We survey approaches to the modeling of these disparate memory structures, their customization, and their optimization.
Finally, in Section 4, we survey memory address generation techniques.
An important byproduct of applying both platform-independent as well as
memory architecture-specific optimizations is that the memory accesses
undergo a significant amount of transformation from the original source
code. Thus, attention must be paid to effective generation of the target
memory addresses, implemented either as code running on a programmable
processor, or as data consumed by a variety of hardware and software
engines.
Since this survey primarily covers data-related optimizations, we do not
address in detail techniques that are specific to instructions, instruction
caches, etc. However, we point out analogous optimizations that apply to
instructions in relevant sections.
2. PLATFORM-INDEPENDENT CODE TRANSFORMATIONS
The importance of performing loop control flow transformations prior to memory organization-related tasks was recognized quite early in compiler theory (for an overview, see Banerjee et al. [1993]) and in the embedded system synthesis domain [Verbauwhede et al. 1989]; if such target architecture-independent transformations are not applied, the resulting memory organization will be heavily suboptimal. In this section we examine the role of source-to-source code transformations in the solution to the data transfer and storage bottleneck problem. This is especially important for embedded applications, where performance is not the only goal; cost issues such as memory footprint and power consumption are also crucial. The fact that execution speed and energy for a given application form at least partly different objective functions that require different optimization strategies in an embedded context has been conclusively shown since 1994 (for example, see the early work in Catthoor et al.
[1994] and Meng et al. [1995]). Even in general-purpose processors, they
form different axes of the exploration space for the memory organization
[Brockmeyer et al. 2000a].
Many of these code transformations can be carefully performed in a
platform-independent order [Catthoor et al. 2000; Danckaert et al. 1999].
This very useful property allows us to apply them to a given application's code without any prior knowledge of platform architecture parameters such as memory sizes, communication scheme, and datapath
type. The resulting optimized code can then be passed through a platform-
dependent stage to obtain further cost-performance improvements and
tradeoffs. We will see in subsequent sections that the optimizations are
especially useful when the target architecture has a customizable memory
organization.
We first discuss global (data flow and) loop transformations in Section
2.1. This will be followed by data reuse-related transformations in Section
2.2. Finally, in Section 2.3, we study the link with and the impact on
processor partitioning and parallelization. A good overview of research on system-level transformations can be found in Catthoor et al. [1998] and Benini and de Micheli [2000], with the latter focusing on low-power techniques. In the following sections we limit ourselves to a discussion of the most directly related work.
2.1 Code Rewriting Techniques for Access Locality and Regularity
Code rewriting techniques, consisting of loop (and sometimes also data flow) transformations, are an essential part of modern optimizing and parallelizing compilers. They are mainly used to enhance the temporal and spatial locality for cache performance and to expose the inherent parallelism of the algorithm in the outer (for asynchronous parallelism) or inner (for synchronous parallelism) loop nests [Amarasinghe et al. 1995; Wolfe 1996; Banerjee et al. 1993]. Other application areas are communication-free data allocation techniques [Chen and Sheu 1994] and the optimization of communication in general [Gupta et al. 1996].
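As a concrete illustration of such a locality-oriented rewriting step, consider loop interchange on a row-major C array. This is our own minimal sketch, not code from any of the works cited above, and the function and array names are hypothetical.

    #include <stddef.h>

    #define N 1024

    /* Before: the inner loop steps through the array column by column.
     * Consecutive accesses are N * sizeof(int) bytes apart, so nearly
     * every access touches a different cache line (poor spatial locality). */
    long sum_by_columns(const int a[N][N])
    {
        long sum = 0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

    /* After loop interchange: the inner loop walks each row contiguously,
     * so every fetched cache line is fully consumed before eviction. */
    long sum_by_rows(const int a[N][N])
    {
        long sum = 0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Both functions compute the same result; only the access order, and hence the temporal and spatial locality, differs.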
Most work has focused on interactive systems, with work dating back to the late 1970s [Loveman 1977]. Environments such as Tiny [Wolfe 1991]; Omega at the University of Maryland [Kelly and Pugh 1992]; SUIF at Stanford [Amarasinghe et al. 1995; Hall et al. 1996]; the Paradigm compiler at the University of Illinois [Banerjee et al. 1995] (and earlier work [Polychronopoulos 1988]); and the ParaScope Editor at Rice University [McKinley et al. 1993] are representative of this large body of work.
In addition, research has been performed on (partly) automating the steering of these loop transformations. Many transformations, and methods to steer them, have been proposed to increase parallelism in several contexts. This has happened in the array synthesis community (e.g., at Saarbrucken [Thiele 1989], at Versailles [Feautrier 1995], at E.N.S. Lyon [Darte et al. 1993], and at the University of SW Louisiana [Shang et al. 1992]); in the parallelizing compiler community (e.g., at Cornell [Li and Pingali 1992], at Illinois [Padua and Wolfe 1986], at Stanford [Wolf and Lam 1991; Amarasinghe et al. 1995], and at Santa Clara [Shang et al. 1996]); and in the high-level synthesis community (at the University of Minnesota [Parhi 1989] and the University of Notre-Dame [Passos and Sha 1994]).
Efficient parallelism is, however, partly coupled to locality of data access, and this has been incorporated in a number of approaches. Examples are the work on data and control flow transformations for distributed shared-memory machines at the University of Rochester [Cierniak and Li 1995], and heuristics to improve the cache hit ratio and execution time at the University of Massachusetts at Amherst [McKinley et al. 1996]. Rice University has recently also started investigating the actual memory bandwidth issues and their relation to loop fusion [Ding and Kennedy 2000]. At E.N.S. Lyon, the effect of several loop transformations on memory accesses has also been studied [Fraboulet et al. 1999].
It is thus no surprise that these code rewriting techniques are also very important in the context of data transfer and storage (DTS) solutions, especially for embedded applications that permit customized memory organizations. Applied as the first optimization step in the design methodology proposed in Franssen et al. [1994], Greef et al. [1995], and Masselos et al. [1999a], they significantly reduce the required amount of storage and transfers and improve access behavior, thus enabling the ensuing steps of more platform-dependent optimizations. As such, the global loop transformations mainly increase the locality and regularity of the accesses in the code. In an embedded context this is clearly good for memory size (area) and memory accesses (power) [Franssen et al. 1994; Greef et al. 1995], but of course also for pure performance [Masselos et al. 1999a], even though the two objectives do not fully lead to the same loop transformation steering. The main distinction from the vast amount of earlier related work in the compiler literature is that these transformations are performed across all loop nests in the entire program [Franssen et al. 1994]. Traditional loop optimizations performed in compilers, where the scope of loop transformations is limited to one procedure or usually even one loop nest, can enhance the locality (and parallelization possibilities) within that loop nest, but may not solve the global data flow and associated buffer space needed between the loop nests or procedures. A recent transformation framework including interprocedural analysis, proposed in McKinley [1998], is a step in this direction; it is focused on parallelization for a shared-memory multiprocessor. The memory-related optimizations are still performed on a loop-nest basis (and so are "local"), but the loops in that loop nest may span different procedures, and a fusing preprocessing step tries to combine all compatible loop nests that do not have dependencies blocking their fusion. The goal of the fusion is primarily to improve parallelism.
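To make the buffer-space argument concrete, the following sketch shows how fusing two producer/consumer loop nests shrinks the intermediate storage between them from a full array to a single scalar. This is our own illustration with hypothetical names, not code taken from the cited methodology.

    #define N 4096

    /* Before fusion: the full intermediate array tmp[] must be stored
     * between the producer nest and the consumer nest. */
    void pipeline_unfused(const int in[N], int out[N])
    {
        static int tmp[N];               /* N words of buffer space */

        for (int i = 0; i < N; i++)      /* first loop nest: producer */
            tmp[i] = in[i] * in[i];

        for (int i = 0; i < N; i++)      /* second loop nest: consumer */
            out[i] = tmp[i] + 1;
    }

    /* After global loop fusion: each intermediate value is consumed as
     * soon as it is produced, so the buffer shrinks from N words to one
     * scalar that can live in a register. */
    void pipeline_fused(const int in[N], int out[N])
    {
        for (int i = 0; i < N; i++) {
            int t = in[i] * in[i];
            out[i] = t + 1;
        }
    }

Besides the N words of saved buffer area, the fused version eliminates the N writes and N reads of tmp[], which is exactly the kind of memory-size (area) and memory-access (power) gain discussed above.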
The global loop and control flow transformation step proposed in Greef et al. [1995], Franssen et al. [1994], and Masselos et al. [1999a] can be viewed as a precompilation phase, applied prior to conventional compiler loop