Publications on Soft Processors and Performance Analysis for Liquid Architectures
Reconfigurable Network Group
(In Reverse Chronological Order)
(Also available in BibTeX format)
Abstract: Simulation has been the de facto standard method for performance evaluation of newly proposed ideas in computer architecture for many years. While simulation allows for theoretically arbitrary fidelity (at least to the level of cycle accuracy) as well as the ability to monitor the architecture without perturbing the execution itself, it suffers from low effective fidelity and long execution times.
We (and others) have advocated the use of empirical experimentation on reconfigurable hardware for computer architecture performance assessment. In this paper, we describe an empirical performance assessment subsystem implemented in reconfigurable hardware and illustrate its use. Results are presented that demonstrate the need for the types of performance assessment that reconfigurable hardware can provide.
Abstract: We present the design, implementation, and evaluation of a circuit we call the Statistics Module (StatsMod) that captures cycle-accurate performance data at (or above) the microarchitecture layer. The circuit is deployed introspectively (in the architecture itself) using an FPGA in the context of a soft-core implementation of a SPARC architecture (LEON). Accessible over the Internet, the circuit can be dynamically configured (without resynthesis) to capture program-level, function-level, and instruction-level statistics on any subset of predefined VHDL signals. The circuit is deployed outside the actual soft core, so that its operation does not interfere with a program's execution at any level.
In contrast with simulations, StatsMod monitors actual real-time program executions, including runtime artifacts such as multithreading, operating system support, and external interrupts. Furthermore, unlike software-introduced instrumentation, the measurements do not affect the statistics, and microarchitecture characteristics are easily captured.
Our design avoids the otherwise combinatorial size of circuitry that would be required to accommodate all methods and events, scaling well with the number of artifacts that are actually measured. We have used this circuit to measure cycle-accurate cache-RAM statistics, such as cache hits and misses and RAM reads and writes, using both write-through and write-back policies. In this paper, we show the scalability of our design as it accommodates more methods and events.
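As a rough software illustration of the scaling argument above, the sketch below models a monitor that allocates one counter per configured (address range, event) pair rather than one per possible combination. It is not the authors' VHDL design. The class names, the `slots` limit, the event names, and the address values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Counter:
    lo: int          # first program-counter address in the monitored range (inclusive)
    hi: int          # last program-counter address in the monitored range (inclusive)
    event: str       # name of the monitored signal (hypothetical, e.g. "dcache_miss")
    count: int = 0   # number of cycles in which the event fired inside the range


@dataclass
class StatsModel:
    slots: int                              # counters available (assumed fixed at build time)
    counters: list = field(default_factory=list)

    def configure(self, lo, hi, event):
        """Bind a free counter to one (address range, event) pair at run time."""
        if len(self.counters) >= self.slots:
            raise ValueError("no free counter slots")
        self.counters.append(Counter(lo, hi, event))

    def cycle(self, pc, asserted):
        """Advance one clock cycle: pc is the current address, asserted the active signals."""
        for c in self.counters:
            if c.lo <= pc <= c.hi and c.event in asserted:
                c.count += 1


# Hypothetical trace: (program counter, set of signals asserted in that cycle).
trace = [
    (0x100, {"dcache_miss"}),
    (0x104, set()),
    (0x108, {"dcache_miss", "ram_write"}),
    (0x200, {"dcache_miss"}),
]

model = StatsModel(slots=2)
model.configure(0x100, 0x1FF, "dcache_miss")   # function-level: misses inside one routine
model.configure(0x000, 0xFFF, "ram_write")     # program-level: all RAM writes
for pc, signals in trace:
    model.cycle(pc, signals)
```

In this model the storage grows with the number of configured pairs, not with the product of all methods and all events, which is the property the abstract describes.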
Abstract: Applications for constrained embedded systems are subject to strict time constraints and restrictive resource utilization. With soft core processors, application developers can customize the processor for their application, constrained by resources but aimed at high application performance. With such freedom in the design space of the processor, however, comes complexity. We present here an automatic optimization technique that helps the developers with the processor microarchitecture customization.
A naive approach exploring all possible configurations is exponential with the number of parameters and hence is clearly infeasible, even with only tens of reconfigurable parameters. Instead, our approach runs in time that is linear with the number of parameter values, based on an assumption of parameter independence. This makes the approach feasible and scalable. For the dimensions that we customize, namely application runtime and hardware resources, we formulate their costs as a constrained binary integer nonlinear optimization program. Though the results are not guaranteed to be optimal, we find they are near-optimal in practice. Our technique itself is general and can be applied to other design-space exploration problems.
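The independence assumption can be illustrated with a minimal sketch: each non-baseline parameter value is measured once against a baseline configuration, and the measured differences are then treated as additive. This is not the paper's formulation. The parameter names, the toy cost model, and the budget are hypothetical, and the final selection step enumerates the additive model exhaustively as a simple stand-in for the integer program described above.

```python
from itertools import product


def explore(params, measure, budget):
    """Pick one value per parameter using one measurement per parameter value.

    params:  dict mapping parameter name -> list of values (first value is the baseline)
    measure: function(config dict) -> (runtime, resources); assumed expensive
    budget:  maximum resources allowed
    """
    names = list(params)
    base = {p: params[p][0] for p in names}
    base_rt, base_res = measure(base)

    # Perturb one parameter at a time: the number of measurements grows
    # linearly with the number of parameter values, not exponentially.
    delta = {(p, params[p][0]): (0, 0) for p in names}
    for p in names:
        for v in params[p][1:]:
            rt, res = measure({**base, p: v})
            delta[(p, v)] = (rt - base_rt, res - base_res)

    # Independence assumption: per-parameter deltas simply add up.
    # Exhaustive search over the cheap additive model (stand-in for the solver).
    best = None
    for combo in product(*(params[p] for p in names)):
        rt = base_rt + sum(delta[(p, v)][0] for p, v in zip(names, combo))
        res = base_res + sum(delta[(p, v)][1] for p, v in zip(names, combo))
        if res <= budget and (best is None or rt < best[0]):
            best = (rt, res, dict(zip(names, combo)))
    return best


# Hypothetical toy cost model, used only to exercise the sketch.
calls = []

def toy_measure(cfg):
    calls.append(dict(cfg))
    rt_gain = {1: 0, 2: 10, 4: 15}[cfg["icache_kb"]] + {"iterative": 0, "fast": 20}[cfg["multiplier"]]
    res_cost = {1: 0, 2: 5, 4: 15}[cfg["icache_kb"]] + {"iterative": 0, "fast": 10}[cfg["multiplier"]]
    return 100 - rt_gain, 50 + res_cost


toy_params = {"icache_kb": [1, 2, 4], "multiplier": ["iterative", "fast"]}
result = explore(toy_params, toy_measure, budget=65)
```

Only the measurements are assumed to be costly (each could correspond to a synthesis and a run), so the exhaustive loop over the additive model is acceptable for a small example like this one. A larger parameter space would need a proper solver for the selection step.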
Abstract: Applications for constrained embedded systems require careful attention to the match between the application and the support offered by an architecture, at the ISA and microarchitecture levels. Generic processors, such as ARM and PowerPC, are inexpensive, but with respect to a given application, they often overprovision in areas that are unimportant for the application's performance. Moreover, while application-specific, customized logic could dramatically improve the performance of an application, that approach is typically too expensive to justify its cost for most applications. In this paper, we describe our experience using reconfigurable architectures to develop an understanding of an application's performance and to enhance its performance with respect to customized, constrained logic. We begin with a standard ISA currently in use for embedded systems. We modify its core to measure performance characteristics, obtaining a system that provides cycle-accurate timings and presents results in the style of gprof, but with absolutely no software overhead. We then provide cache-behavior statistics that are typically unavailable in a generic processor. In contrast with simulation, our approach executes the program at full speed and delivers statistics based on the actual behavior of the cache subsystem. Finally, in response to the performance profile developed on our platform, we evaluate various uses of the FPGA-realized instruction and data caches in terms of the application's performance.
Abstract: We describe our experience using reconfigurable architectures to develop an understanding of an application's performance and to enhance its performance with respect to customized, constrained logic. We begin with a standard ISA currently in use for embedded systems. We modify its core to measure performance characteristics, obtaining a system that provides cycle-accurate timings and presents results in the style of gprof, but with absolutely no software overhead. We then provide cache-behavior statistics that are typically unavailable in a generic processor. In contrast with simulation, our approach executes the program at full speed and delivers statistics based on the actual behavior of the cache subsystem. Finally, in response to the performance profile developed on our platform, we evaluate various uses of the FPGA-realized instruction and data caches in terms of the application's performance.
Abstract: We present an implementation of a liquid-architecture system that supports efficient development, prototyping, and performance evaluation of custom architectures. The implementation integrates the LEON soft-core, SPARC-compatible processor into the Field-programmable Port Extender (FPX). The resulting platform can be instantiated, configured, and executed via the Internet.
Abstract: While hardware plugins are well suited for processing data with high throughput, software plugins are well suited for implementing complex control functions. A plugin module has been implemented for the FPX that executes software on an embedded soft-core processor. By including this module in an FPX design, it is possible to implement active networking functions on the FPX using both hardware and software. The KCPSM, an 8-bit microcontroller developed by Xilinx Corp., has been embedded into an FPX module. The module includes circuits that allow it to be reprogrammed over the network and to execute new programs between the processing of data packets. A sample application, called the FPX KCPSM Module, has been developed that illustrates how easily an application can make use of the hybrid system. This module loads the program memory of the KCPSM from an incoming UDP packet, and executes the new program upon receiving a new incoming UDP packet. The resulting circuit runs at 70 MHz and occupies 35% of a Xilinx XCV1000E-7-FG680.