
Data and Memory Optimization Techniques for Embedded Systems
P. R. PANDA
Synopsys, Inc.
F. CATTHOOR
Inter-University Microelectronics Centre and Katholieke Universiteit Leuven
N. D. DUTT
University of California at Irvine
K. DANCKAERT, E. BROCKMEYER, C. KULKARNI, and
A. VANDERCAPPELLE
Inter-University Microelectronics Centre
and
P. G. KJELDSBERG
Norwegian University of Science and Technology
We present a survey of the state-of-the-art techniques used in performing data and memory-related optimizations in embedded systems. The optimizations are targeted directly or indirectly at the memory subsystem, and impact one or more of three important cost metrics: the area, performance, and power dissipation of the resulting implementation.
We first examine architecture-independent optimizations in the form of code transformations. We next cover a broad spectrum of optimization techniques that address memory architectures at varying levels of granularity, ranging from register files to on-chip memory, data caches, and dynamic memory (DRAM). We end with issues related to memory addressing.
Categories and Subject Descriptors: B.3 [Hardware]: Memory Structures; B.5.1 [Register-Transfer-Level Implementation]: Design—Memory design; B.5.2 [Register-Transfer-Level Implementation]: Design Aids—Automatic synthesis; Optimization; B.7.1 [Integrated Circuits]: Types and Design Styles—Memory technologies; D.3.4 [Programming Languages]: Processors—Compilers; Optimization
Authors’ addresses: P. R. Panda, Synopsys, Inc., 700 E. Middlefield Rd., Mountain View, CA 94043; email: panda@synopsys.com; F. Catthoor, Inter-University Microelectronics Centre and Katholieke Universiteit Leuven, Kapeldreef 75, Leuven, Belgium; email: catthoor@imec.be; N. D. Dutt, Center for Embedded Computer Systems, University of California at Irvine, Irvine, CA 92697; email: dutt@cecs.uci.edu; K. Danckaert, E. Brockmeyer, C. Kulkarni, and A. Vandercappelle, Inter-University Microelectronics Centre, Kapeldreef 75, Leuven, Belgium; email: danckaer@imec.be; brockmey@imec.be; kulkarni@imec.be; vdcappel@imec.be; P. G. Kjeldsberg, Norwegian University of Science and Technology, Trondheim, Norway; email: pgk@fysel.ntnu.no.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 2001 ACM 1084-4309/01/0400-0149 $5.00
General Terms: Algorithms, Design, Experimentation, Performance
Additional Key Words and Phrases: Address generation, allocation, architecture exploration,
code transformation, data cache, data optimization, DRAM, high-level synthesis, memory
architecture customization, memory power dissipation, register file, size estimation, SRAM,
survey
1. INTRODUCTION
In the design of embedded systems, memory issues play a very important role, and often significantly impact the embedded system's performance, power dissipation, and overall cost of implementation. Indeed, as new processor families and processor cores begin to push the limits of high performance, the traditional processor-memory gap widens and often becomes the dominant bottleneck in achieving high performance. While embedded systems range from simple microcontroller-based solutions to high-end mixed hardware/software solutions, embedded system designers need to pay particular attention to issues such as minimizing memory requirements, improving memory throughput, and limiting the power dissipated by the system's memory.
Traditionally, much attention has been paid to the role of memory system design in the compiler, architecture, and CAD domains. Many of these techniques, while applicable to some extent, do not fully exploit the optimization opportunities in embedded system design. From an application viewpoint, embedded systems are special-purpose, and so are amenable to aggressive optimization techniques that can fully utilize knowledge of the applications. Whereas many traditional memory-related hardware and software optimizations had to account for variances due to general-purpose applications, memory optimizations for embedded systems can be tailored to suit the expected profile of code and data. Furthermore, from an architectural viewpoint, the embedded system designer pays great attention to the customization of the memory subsystem (both on-chip and off-chip): this leads to many nontraditional memory organizations, with a standard cache hierarchy being only one of many memory architectural options. Finally, from a constraint viewpoint, the embedded system designer needs not only to meet system performance goals, but also to do so within a power budget (especially for mobile applications) and to meet real-time constraints. The system performance should account not only for the processor's speed but also for the system bus load to shared board-level storage units such as main memory and disk; even the L2 cache is shared in a multiprocessor context. As a result, the memory and bus subsystem costs become a significant contributor to overall system costs, and thus the embedded system designer attempts to minimize memory requirements with the goal of lowering overall system costs.
In this survey, we present a variety of optimization techniques for data and memory used in embedded systems. We begin in Section 2 with a survey of several global optimizations that are independent of the target architectural platform, and which almost always result in improved performance, cost, and power. These optimizations take the form of source-to-source code transformations that precede many traditional compiler and synthesis steps, and which move the design to a superior starting point in the design space exploration of alternative embedded system realizations.
Next, in Section 3, we discuss optimization opportunities in the context of specific memory modules, customized memory architectures, and their use both in a hardware (or behavioral) synthesis context and in a software (or traditional compiler) context. This section begins with optimization techniques applied to the memory elements closest to the computational engines (registers and register files), and then discusses optimization techniques for increasingly distant memory structures: SRAM, cache, and DRAM. We survey approaches to the modeling of these disparate memory structures, their customization, and their optimization.
Finally, in Section 4, we survey memory address generation techniques.
An important byproduct of applying both platform-independent as well as
memory architecture-specific optimizations is that the memory accesses
undergo a significant amount of transformation from the original source
code. Thus, attention must be paid to effective generation of the target
memory addresses, implemented either as code running on a programmable
processor, or as data consumed by a variety of hardware and software
engines.
Since this survey primarily covers data-related optimizations, we do not
address in detail techniques that are specific to instructions, instruction
caches, etc. However, we point out analogous optimizations that apply to
instructions in relevant sections.
2. PLATFORM-INDEPENDENT CODE TRANSFORMATIONS
The importance of performing loop control flow transformations prior to memory organization-related tasks was recognized quite early in compiler theory (for an overview, see Banerjee et al. [1993]) and in the embedded system synthesis domain [Verbauwhede et al. 1989]; if such target architecture-independent transformations are not applied, the resulting memory organization will be heavily suboptimal. In this section we examine the role of source-to-source code transformations in the solution to the data transfer and storage bottleneck problem. This is especially important for embedded applications, where performance is not the only goal; cost issues such as memory footprint and power consumption are also crucial. The fact that execution speed and energy for a given application form at least partly different objective functions that require different optimization strategies in an embedded context has been conclusively shown since 1994 (for example, see the early work in Catthoor et al.
[1994] and Meng et al. [1995]). Even in general-purpose processors, they
form different axes of the exploration space for the memory organization
[Brockmeyer et al. 2000a].
Many of these code transformations can be carefully performed in a
platform-independent order [Catthoor et al. 2000; Danckaert et al. 1999].
This very useful property allows us to apply them to a given application's code without any prior knowledge of platform architecture parameters such as memory sizes, communication scheme, and datapath
type. The resulting optimized code can then be passed through a platform-
dependent stage to obtain further cost-performance improvements and
tradeoffs. We will see in subsequent sections that the optimizations are
especially useful when the target architecture has a customizable memory
organization.
We first discuss global (data flow and) loop transformations in Section
2.1. This will be followed by data reuse-related transformations in Section
2.2. Finally, in Section 2.3, we study the link with and the impact on
processor partitioning and parallelization. A good overview of research on system-level transformations can be found in Catthoor et al. [1998] and Benini and de Micheli [2000], with the latter focusing on low-power techniques. In the following sections we limit ourselves to a discussion of the most directly related work.
2.1 Code Rewriting Techniques for Access Locality and Regularity
Code rewriting techniques, consisting of loop (and sometimes also data flow) transformations, are an essential part of modern optimizing and parallelizing compilers. They are mainly used to enhance the temporal and spatial locality for cache performance and to expose the inherent parallelism of the algorithm in the outer (for asynchronous parallelism) or inner (for synchronous parallelism) loop nests [Amarasinghe et al. 1995; Wolfe 1996; Banerjee et al. 1993]. Other application areas are communication-free data allocation techniques [Chen and Sheu 1994] and the optimization of communication in general [Gupta et al. 1996].
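As a concrete illustration of such a locality-oriented rewriting step, consider loop interchange on a row-major C array. This is our own minimal sketch, not code from any of the works cited above, and the function and array names are hypothetical.

    #include <stddef.h>

    #define N 1024

    /* Before: the inner loop steps through the array column by column.
     * Consecutive accesses are N * sizeof(int) bytes apart, so nearly
     * every access touches a different cache line (poor spatial locality). */
    long sum_by_columns(const int a[N][N])
    {
        long sum = 0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

    /* After loop interchange: the inner loop walks each row contiguously,
     * so every fetched cache line is fully consumed before eviction. */
    long sum_by_rows(const int a[N][N])
    {
        long sum = 0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Both functions compute the same result; only the access order, and hence the temporal and spatial locality, differs.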
Most work has focused on interactive systems, with work dating back to the late 1970s [Loveman 1977]. Environments such as Tiny [Wolfe 1991]; Omega at the University of Maryland [Kelly and Pugh 1992]; SUIF at Stanford [Amarasinghe et al. 1995; Hall et al. 1996]; the Paradigm compiler at the University of Illinois [Banerjee et al. 1995] (and earlier work [Polychronopoulos 1988]); and the ParaScope Editor at Rice University [McKinley et al. 1993] are representative of this large body of work.
In addition, research has been performed on (partly) automating the steering of these loop transformations. Many transformations, and methods to steer them, have been proposed to increase parallelism in several contexts. This has happened in the array synthesis community (e.g., at Saarbrucken [Thiele 1989], at Versailles [Feautrier 1995], at E.N.S. Lyon [Darte et al. 1993], and at the University of SW Louisiana [Shang et al. 1992]); in the parallelizing compiler community (e.g., at Cornell [Li and Pingali 1992], at Illinois [Padua and Wolfe 1986], at Stanford [Wolf and Lam 1991; Amarasinghe et al. 1995], and at Santa Clara [Shang et al. 1996]); and in the high-level synthesis community (at the University of Minnesota [Parhi 1989] and the University of Notre-Dame [Passos and Sha 1994]).
Efficient parallelism is, however, partly coupled to locality of data access, and this has been incorporated in a number of approaches. Examples are the work on data and control flow transformations for distributed shared-memory machines at the University of Rochester [Cierniak and Li 1995], and heuristics to improve the cache hit ratio and execution time at the University of Massachusetts at Amherst [McKinley et al. 1996]. Rice University has recently also started investigating the actual memory bandwidth issues and their relation to loop fusion [Ding and Kennedy 2000]. At E.N.S. Lyon, the effect of several loop transformations on memory accesses has also been studied [Fraboulet et al. 1999].
It is thus no surprise that these code rewriting techniques are also very important in the context of data transfer and storage (DTS) solutions, especially for embedded applications that permit customized memory organizations. Applied as the first optimization step in the design methodology proposed in Franssen et al. [1994], Greef et al. [1995], and Masselos et al. [1999a], they significantly reduce the required amount of storage and transfers and improve access behavior, thus enabling the ensuing steps of more platform-dependent optimizations. As such, the global loop transformations mainly increase the locality and regularity of the accesses in the code. In an embedded context this is clearly good for memory size (area) and memory accesses (power) [Franssen et al. 1994; Greef et al. 1995], but of course also for pure performance [Masselos et al. 1999a], even though the two objectives do not fully lead to the same loop transformation steering. The main distinction from the vast amount of earlier related work in the compiler literature is that these transformations are performed across all loop nests in the entire program [Franssen et al. 1994]. Traditional loop optimizations performed in compilers, where the scope of loop transformations is limited to one procedure or usually even one loop nest, can enhance the locality (and parallelization possibilities) within that loop nest, but may not solve the global data flow and associated buffer space needed between the loop nests or procedures. A recent transformation framework including interprocedural analysis, proposed in McKinley [1998], is a step in this direction; it is focused on parallelization for a shared-memory multiprocessor. The memory-related optimizations are still performed on a loop-nest basis (and so are "local"), but the loops in that loop nest may span different procedures, and a fusing preprocessing step tries to combine all compatible loop nests that do not have dependencies blocking their fusion. The goal of the fusion is primarily to improve parallelism.
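To make the buffer-space argument concrete, the following sketch shows how fusing two producer/consumer loop nests shrinks the intermediate storage between them from a full array to a single scalar. This is our own illustration with hypothetical names, not code taken from the cited methodology.

    #define N 4096

    /* Before fusion: the full intermediate array tmp[] must be stored
     * between the producer nest and the consumer nest. */
    void pipeline_unfused(const int in[N], int out[N])
    {
        static int tmp[N];               /* N words of buffer space */

        for (int i = 0; i < N; i++)      /* first loop nest: producer */
            tmp[i] = in[i] * in[i];

        for (int i = 0; i < N; i++)      /* second loop nest: consumer */
            out[i] = tmp[i] + 1;
    }

    /* After global loop fusion: each intermediate value is consumed as
     * soon as it is produced, so the buffer shrinks from N words to one
     * scalar that can live in a register. */
    void pipeline_fused(const int in[N], int out[N])
    {
        for (int i = 0; i < N; i++) {
            int t = in[i] * in[i];
            out[i] = t + 1;
        }
    }

Besides the N words of saved buffer area, the fused version eliminates the N writes and N reads of tmp[], which is exactly the kind of memory-size (area) and memory-access (power) gain discussed above.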
The global loop and control flow transformation step proposed in Greef et al. [1995], Franssen et al. [1994], and Masselos et al. [1999a] can be viewed as a precompilation phase, applied prior to conventional compiler loop