When Amdahl and Off-die Bandwidth Kill CMP Scaling: Two Tough Problems and Two Radical Solutions
Thu, Mar 01, 2007 @ 10:30 AM - 11:30 AM
Ming Hsieh Department of Electrical and Computer Engineering
Conferences, Lectures, & Seminars
Murali Annavaram, Nokia Research Center, Palo Alto
Abstract: The continuous increase in transistor density, coupled with the simultaneous reduction in chip power budget, has caused a major paradigm shift in processor design. The chip industry is moving away from high-frequency, power-hungry uniprocessors towards chip multiprocessors (CMPs) with many lower-power cores. The road to CMP scaling, however, has two significant barriers. First, single-threaded applications parallelized to take advantage of CMPs will have unavoidable phases of sequential execution. Amdahl's law dictates that the speedup of such parallel programs will be limited by the sequential portion of the computation. The second barrier to CMP scaling is that the off-die bandwidth requirement will grow dramatically as the working set of multi-threaded applications grows with the thread count. Furthermore, increased thread-level parallelism results in reduced shared cache locality, as interleaved accesses from multiple threads appear random at the shared cache level. The twofold effects of increased working set size and reduced locality will place significant pressure on the limited number of pins available to reach off-die memory.
This talk focuses on these two problems and presents two radical solutions. In the first part of this talk, I will present Energy Per Instruction (EPI) throttling, a novel mechanism for mitigating Amdahl's bottleneck by varying the amount of energy expended to process instructions according to the amount of available parallelism. When a program enters a sequential phase, the EPI throttling mechanism assigns all available energy to a single processor so as to execute the sequential phase quickly; conversely, when a program enters a parallel phase, energy is distributed to several cores within the CMP so as to process as many instructions in parallel as possible.
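As a rough illustration of the Amdahl's law limit mentioned above, the short sketch below computes the ideal speedup for a given core count. It is not taken from the talk: the function name amdahl_speedup and the 90% parallel fraction in the usage lines are assumptions chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup over one core when only `parallel_fraction` of the work scales.

    Hypothetical helper for illustration; it applies the textbook form of
    Amdahl's law, speedup = 1 / ((1 - p) + p / n).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Assumed workload: 90% of the execution time is parallelizable.
    for n in (1, 4, 16, 64):
        print(f"{n:3d} cores -> speedup {amdahl_speedup(0.9, n):.2f}")
```

Under this assumed workload the speedup can never exceed 10 no matter how many cores are added, which is why speeding up the sequential phase itself matters.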
More generally, using the equation Power = Energy per Instruction (EPI) × Instructions per Second (IPS), EPI throttling proposes that during phases of limited parallelism (low IPS) the chip multiprocessor will spend more EPI; similarly, during phases of higher parallelism (high IPS) the chip multiprocessor will spend less EPI. The performance benefits of an EPI throttle are evaluated on an asymmetric multiprocessor (AMP) prototyped from a physical 4-way Xeon SMP server. Using a wide range of multi-threaded programs, I will show a 38% wall clock speedup using an EPI-throttled AMP compared to a standard SMP that uses the same power.
In the second part of this talk, I will present 3D stacking technology that not only enables stacking large-capacity DRAM caches on CMPs but also provides tremendous bandwidth using die-to-die vias. I will focus on the design challenges of stacking DRAM on CMPs. In particular, I will show that, because of limited DRAM banking, the number of page opens increases dramatically as more threads access the DRAM cache. Using detailed simulation results, I will show that, contrary to popular belief, simply stacking a DRAM cache on a CMP does not provide the expected benefits of stacked memory. Using a PC-based stride prefetching mechanism, I will show that random page access behavior can be mitigated, thereby allowing 3D stacked DRAM to provide the bandwidth and capacity required for CMP scaling.
Speaker Bio: Murali Annavaram is a researcher at the Nokia Research Center in Palo Alto. His current research is focused on exploring mobile device features required for providing location and context-aware computing services. Prior to Nokia, he was a senior research scientist at Intel Microprocessor Research Labs, where his research spanned the entire spectrum of systems architecture, ranging from high-level software issues to low-level device variations.
His research at Intel includes 3D stacking, EPI throttling for power-efficient CMP designs, the impact of process variability on chip designs, and characterizing server workloads to improve simulation and tracing technologies. Murali received his PhD from the University of Michigan, where he worked with Prof. Ed Davidson on prefetching techniques for irregular applications.
Hosted by: Prof. Michel Dubois, dubois@paris.usc.edu
Location: Hughes Aircraft Electrical Engineering Center (EEB) - 248
Audiences: Everyone Is Invited
Contact: Rosine Sarafian