Reconfigurable Computing

by Scott Hauck, André DeHon

July 2010

Intermediate to advanced

944 pages

25h 59m

English

Morgan Kaufmann

Copyright
The Morgan Kaufmann Series in Systems on Silicon
List of Contributors
Preface
Acknowledgments
Introduction
I. Reconfigurable Computing Hardware
1. Device Architecture
1.1. Logic—The Computational Fabric
1.1.1. Logic Elements
1.1.2. Programmability
1.2. The Array and Interconnect
1.2.1. Interconnect Structures
    Nearest neighbor
    Segmented
    Hierarchical
1.2.2. Programmability
1.2.3. Summary
1.3. Extending Logic
1.3.1. Extended Logic Elements
    Fast carry chain
    Multipliers
    RAM
    Processor blocks
1.3.2. Summary
1.4. Configuration
1.4.1. SRAM
1.4.2. Flash Memory
1.4.3. Antifuse
1.4.4. Summary
1.5. Case Studies
1.5.1. Altera Stratix
    Logic architecture
    Routing architecture
1.5.2. Xilinx Virtex-II Pro
    Logic architecture
    Routing architecture
1.6. Summary
References
2. Reconfigurable Computing Architectures
2.1. Reconfigurable Processing Fabric Architectures
2.1.1. Fine-grained
    Garp’s nonsymmetrical RPF
2.1.2. Coarse-grained
    PipeRench
2.2. RPF Integration into Traditional Computing Systems
2.2.1. Independent Reconfigurable Coprocessor Architectures
    RaPiD
2.2.2. Processor + RPF Architectures
    Loosely coupled RPF and processor architecture
    Tightly coupled RPF and processor
    Chimaera
2.3. Summary and Future Work
References
3. Reconfigurable Computing Systems
3.1. Early Systems
3.2. PAM, VCC, and Splash
3.2.1. PAM
3.2.2. Virtual Computer
3.2.3. Splash
3.3. Small-scale Reconfigurable Systems
3.3.1. PRISM
3.3.2. CAL and XC6200
3.3.3. Cloning
3.4. Circuit Emulation
3.4.1. AMD/Intel
3.4.2. Virtual Wires
3.5. Accelerating Technology
3.5.1. Teramac
3.6. Reconfigurable Supercomputing
3.6.1. Cray, SRC, and Silicon Graphics
3.6.2. The CMX-2X
3.7. Non-FPGA Research
3.8. Other System Issues
3.9. The Future of Reconfigurable Systems
References
4. Reconfiguration Management
4.1. Reconfiguration
4.2. Configuration Architectures
4.2.1. Single-context
4.2.2. Multi-context
4.2.3. Partially Reconfigurable
4.2.4. Relocation and Defragmentation
4.2.5. Pipeline Reconfigurable
4.2.6. Block Reconfigurable
4.2.7. Summary
4.3. Managing the Reconfiguration Process
4.3.1. Configuration Grouping
4.3.2. Configuration Caching
4.3.3. Configuration Scheduling
4.3.4. Software-based Relocation and Defragmentation
4.3.5. Context Switching
4.4. Reducing Configuration Transfer Time
4.4.1. Architectural Approaches
4.4.2. Configuration Compression
4.4.3. Configuration Data Reuse
4.5. Configuration Security
4.6. Summary
References

II. Programming Reconfigurable Systems
5. Compute Models and System Architectures
5.1. Compute Models
5.1.1. Challenges
5.1.2. Common Primitives
    Function
    Transform or object
5.1.3. Dataflow
    Single-rate synchronous dataflow
    Synchronous dataflow
    Dynamic streaming dataflow
    Dynamic Streaming Dataflow with Peeks
    Streaming dataflow with allocation
    General dataflow
5.1.4. Sequential Control
    Finite state
    Sequential control with allocation
    Single memory pool
5.1.5. Data Parallel
5.1.6. Data-centric
5.1.7. Multi-threaded
5.1.8. Other Compute Models
5.2. System Architectures
5.2.1. Streaming Dataflow
    Data presence
    Datapath sharing
    Streaming coprocessors
    Interconnect sharing
5.2.2. Sequential Control
    FSMD
    VLIW datapath control
    Processor
    Instruction augmentation
    Functional Unit model
    Coprocessor model
    Phased reconfiguration manager
    Worker farm
5.2.3. Bulk Synchronous Parallelism
5.2.4. Data Parallel
    Single program, multiple data
    Single-instruction multiple data
    Vector
    Vector coprocessors
5.2.5. Cellular Automata
    Folded CA
5.2.6. Multi-threaded
    Communicating FSMs with datapaths
    Processors with channels
    Message passing
    Shared memory
5.2.7. Hierarchical Composition
References
6. Programming FPGA Applications in VHDL
6.1. VHDL Programming
6.1.1. Structural Description
6.1.2. RTL Description
6.1.3. Parametric Hardware Generation
6.1.4. Finite-state Machine Datapath Example
6.1.5. Advanced Topics
    Delta delay
    Multivalued logic
6.2. Hardware Compilation Flow
6.2.1. Constraints
6.3. Limitations of VHDL
References
7. Compiling C for Spatial Computing
7.1. Overview of How C Code Runs on Spatial Hardware
7.1.1. Data Connections between Operations
7.1.2. Memory
7.1.3. If-then-else Using Multiplexers
7.1.4. Actual Control Flow
7.1.5. Optimizing the Common Path
7.1.6. Summary and Challenges
7.2. Automatic Compilation
7.2.1. Hyperblocks
7.2.2. Building a Dataflow Graph for a Hyperblock
    Top-level build algorithms
    Building data edges
    Building muxes
    Predicates
    Ordering edges
    Live variables at exits
    Scalar variables in memory
7.2.3. DFG Optimization
    Constant folding
    Identity simplification
    Strength reduction
    Dead node elimination
    Common subexpression elimination
    Boolean value identification
    Type-based operator size reduction
    Dataflow analysis-based operator size reduction
    Memory access optimization
    Removing redundant loads
7.2.4. From DFG to Reconfigurable Fabric
    Packing operations into clock cycles
    Scheduling
    Pipelined scheduling
    Connecting memory nodes to the memory ports
    What next?
7.3. Uses and Variations of C Compilation to Hardware
7.3.1. Automatic HW/SW Partitioning
7.3.2. Programmer Assistance
    Useful code changes
    Loop interchange, reversal, and other transforms
    Loop fusion and fission
    Local arrays
    Control structure
    Address indirection
    Declaration of data sizes
    Useful annotations
    Integrating operator-level modules
    Integrating large blocks
7.4. Summary
References
8. Programming Streaming FPGA Applications Using Block Diagrams in Simulink
8.1. Designing High-performance Datapaths Using Stream-based Operators
8.2. An Image-processing Design Driver
8.2.1. Converting RGB Video to Grayscale
8.2.2. Two-dimensional Video Filtering
8.2.3. Mapping the Video Filter to the BEE2 FPGA Platform
8.3. Specifying Control in Simulink
8.3.1. Explicit Controller Design with Simulink Blocks
8.3.2. Controller Design Using the Matlab M Language
8.3.3. Controller Design Using VHDL or Verilog
8.3.4. Controller Design Using Embedded Microprocessors
8.4. Component Reuse: Libraries of Simple and Complex Subsystems
8.4.1. Signal-processing Primitives
8.4.2. Tiled Subsystems
8.5. Summary
Acknowledgments
References
9. Stream Computations Organized for Reconfigurable Execution
9.1. Programming
9.1.1. Task Description Format
9.1.2. C++ Integration and Composition
9.2. System Architecture and Execution Patterns
9.2.1. Stream Support
9.2.2. Phased Reconfiguration
9.2.3. Sequential versus Parallel
9.2.4. Fixed-size and Standard I/O Page
9.3. Compilation
9.4. Runtime
9.4.1. Scheduling
9.4.2. Placement
9.4.3. Routing
9.5. Highlights
References
10. Programming Data Parallel FPGA Applications Using the SIMD/Vector Model
10.1. SIMD Computing on FPGAs: An Example
10.2. SIMD Processing Architectures
10.3. Data Parallel Languages
10.4. Reconfigurable Computers for SIMD/Vector Processing
10.5. Variations of SIMD/Vector Computing
10.5.1. Multiple SIMD Engines
10.5.2. A Multi-SIMD Coarse-grained Array
10.5.3. SPMD Model
10.6. Pipelined SIMD/Vector Processing
10.7. Summary
Acknowledgments
References
11. Operating System Support for Reconfigurable Computing
11.1. History
11.2. Abstracted Hardware Resources
11.2.1. Programming Model
11.3. Flexible Binding
11.3.1. Install Time Binding
11.3.2. Runtime Binding
11.3.3. Fast CAD for Flexible Binding
11.4. Scheduling
11.4.1. On-demand Scheduling
11.4.2. Static Scheduling
11.4.3. Dynamic Scheduling
11.4.4. Quasi-static Scheduling
11.4.5. Real-time Scheduling
11.4.6. Preemption
11.5. Communication
11.5.1. Communication Styles
    Shared memory
    Method calls
    Streams
11.5.2. Virtual Memory
11.5.3. I/O
11.5.4. Uncertain Communication Latency
11.6. Synchronization
11.6.1. Explicit Synchronization
11.6.2. Implicit Synchronization
11.6.3. Deadlock Prevention
11.7. Protection
11.7.1. Hardware Protection
11.7.2. Intertask Communication
11.7.3. Task Configuration Protection
11.8. Summary
References
12. The JHDL Design and Debug System
12.1. JHDL Background and Motivation
12.2. The JHDL Design Language
12.2.1. Level-1 Design: Primitive Instantiation
12.2.2. Level-2 Design: Using the Logic Class and Its Provided Methods
12.2.3. Level-3 Design: Programmatic Circuit Generation (Module Generators)
12.2.4. JHDL Is a Structural Design Language
12.2.5. JHDL Is a Programmatic Circuit Design Language
12.3. The JHDL CAD System
12.3.1. Testbenches in JHDL
12.3.2. The cvt Class
12.4. JHDL’s Hardware Mode
12.5. Advanced JHDL Capabilities
12.5.1. Dynamic Testbenches
12.5.2. Behavioral Synthesis
12.5.3. Advanced Debugging Capabilities
    Debug circuitry synthesis
    Checkpointing, context switching, and remote access
12.6. Summary
References
III. Mapping Designs to Reconfigurable Platforms
13. Technology Mapping
13.1. Structural Mapping Algorithms
13.1.1. Cut Generation
13.1.2. Area-oriented Mapping
13.1.3. Performance-driven Mapping
13.1.4. Power-aware Mapping
13.2. Integrated Mapping Algorithms
13.2.1. Simultaneous Logic Synthesis, Mapping
13.2.2. Integrated Retiming, Mapping
13.2.3. Placement-driven Mapping
13.3. Mapping Algorithms for Heterogeneous Resources
13.3.1. Mapping to LUTs of Different Input Sizes
13.3.2. Mapping to Complex Logic Blocks
13.3.3. Mapping Logic to Embedded Memory Blocks
13.3.4. Mapping to Macrocells
13.4. Summary
References
FPGA Placement
14. Placement for General-purpose FPGAs
14.1. The FPGA Placement Problem
14.1.1. Device Legality Constraints
14.1.2. Optimization Goals
14.1.3. Designer Placement Directives
14.2. Clustering
14.3. Simulated Annealing for Placement
14.3.1. VPR and Related Annealing Algorithms
14.3.2. Simultaneous Placement and Routing with Annealing
14.4. Partition-based Placement
14.5. Analytic Placement
14.6. Further Reading and Open Challenges
References
15. Datapath Composition
15.1. Fundamentals
15.1.1. Regularity
15.1.2. Datapath Layout
15.2. Tool Flow Overview
15.3. The Impact of Device Architecture
15.3.1. Architecture Irregularities
15.4. The Interface to Module Generators
15.4.1. The Flow Interface
15.4.2. The Data Model
15.4.3. The Library Specification
15.4.4. The Intra-module Layout
15.5. The Mapping
15.5.1. 1:1 Mapping
15.5.2. N:1 Mapping
15.5.3. The Combined Approach
15.6. Placement
15.6.1. Linear Placement
15.6.2. Constrained Two-dimensional Placement
15.6.3. Two-dimensional Placement
15.7. Compaction
15.7.1. Selecting HWOPs for Compaction
15.7.2. Regularity Analysis
15.7.3. Optimization Techniques
    Word-level optimization
    Context-sensitive optimization
    Logic optimization
15.7.4. Building the Super-HWOP
15.7.5. Discussion
15.8. Summary and Future Work
References
16. Specifying Circuit Layout on FPGAs
16.1. The Problem
16.2. Explicit Cartesian Layout Specification
16.3. Algebraic Layout Specification
16.3.1. Case Study: Batcher’s Bitonic Sorter
16.4. Layout Verification for Parameterized Designs
16.5. Summary
References
17. PathFinder: A Negotiation-based, Performance-driven Router for FPGAs
17.1. The History of PathFinder
17.2. The PathFinder Algorithm
17.2.1. The Circuit Graph Model
17.2.2. A Negotiated Congestion Router
17.2.3. The Negotiated Congestion/Delay Router
17.2.4. Applying A* to PathFinder
17.3. Enhancements and Extensions to PathFinder
17.3.1. Incremental Rerouting
17.3.2. The Cost Function
17.3.3. Resource Cost
17.3.4. The Relationship of PathFinder to Lagrangian Relaxation
17.3.5. Circuit Graph Extensions
    Symmetric device inputs
    De-multiplexers
    Bidirectional switches
17.4. Parallel PathFinder
17.5. Other Applications of the PathFinder Algorithm
17.6. Summary
Acknowledgments
References
18. Retiming, Repipelining, and C-slow Retiming
18.1. Retiming: Concepts, Algorithm, and Restrictions
18.2. Repipelining and C-slow Retiming
18.2.1. Repipelining
18.2.2. C-slow Retiming
18.3. Implementations of Retiming
18.4. Retiming on Fixed-frequency FPGAs
18.5. C-slowing as Multi-threading
18.6. Why Isn’t Retiming Ubiquitous?
References
19. Configuration Bitstream Generation
19.1. The Bitstream
19.2. Downloading Mechanisms
19.3. Software to Generate Configuration Data
19.4. Summary
References
20. Fast Compilation Techniques
20.1. Accelerating Classical Techniques
20.1.1. Accelerating Simulated Annealing
20.1.2. Accelerating PathFinder
20.2. Alternative Algorithms
20.2.1. Multiphase Solutions
20.2.2. Incremental Place and Route
20.3. Effect of Architecture
20.4. Summary
References
IV. Application Development
21. Implementing Applications with FPGAs
21.1. Strengths and Weaknesses of FPGAs
21.1.1. Time to Market
21.1.2. Cost
21.1.3. Development Time
21.1.4. Power Consumption
21.1.5. Debug and Verification
21.1.6. FPGAs and Microprocessors
21.2. Application Characteristics and Performance
21.2.1. Computational Characteristics and Performance
    Data parallelism
    Data element size and arithmetic complexity
    Pipelining
    Simple control requirements
21.2.2. I/O and Performance
21.3. General Implementation Strategies for FPGA-based Systems
21.3.1. Configure-once
21.3.2. Runtime Reconfiguration
    Global RTR
    Local RTR
    RTR applications
21.3.3. Summary of Implementation Issues
21.4. Implementing Arithmetic in FPGAs
21.4.1. Fixed-point Number Representation and Arithmetic
21.4.2. Floating-point Arithmetic
21.4.3. Block Floating Point
21.4.4. Constant Folding and Data-oriented Specialization
21.5. Summary
References
22. Instance-specific Design
22.1. Instance-specific Design
22.1.1. Taxonomy
    Types of instance-specific optimizations
    Constant folding
    Function adaptation
    Architecture adaptation
22.1.2. Approaches
22.1.3. Examples of Instance-specific Designs
    Constant coefficient multipliers
    Key-specific crypto-processors
    Network intrusion detection
    Customizable instruction processors
22.2. Partial Evaluation
22.2.1. Motivation
22.2.2. Process of Specialization
22.2.3. Partial Evaluation in Practice
    Constant folding in logical expressions
    Unnecessary logic removal
22.2.4. Partial Evaluation of a Multiplier
    Optimizing a simple description
    Functional specialization for constant inputs
    Geometric specialization
22.2.5. Partial Evaluation at Runtime
22.2.6. FPGA-specific Concerns
    LUT mapping
    Static resources
    Verification of runtime specialization
22.3. Summary
References
23. Precision Analysis for Fixed-point Computation
23.1. Fixed-point Number System
23.1.1. Multiple-wordlength Paradigm
23.1.2. Optimization for Multiple Wordlength
23.2. Peak Value Estimation
23.2.1. Analytic Peak Estimation
    Linear time-invariant systems
    Transfer function calculation
    Example
    Scaling with transfer functions
    Data range propagation
    Forward propagation
23.2.2. Simulation-based Peak Estimation
23.2.3. Summary of Peak Estimation
23.3. Wordlength Optimization
23.3.1. Error Estimation and Area Models
    Simulation-based methods
    An analytic technique for linear time-invariant systems
    Noise model
    Noise propagation and power estimation
    A hybrid approach for nonlinear differentiable systems
    Perturbation analysis
    Derivative monitors
    Linearization
    Noise injection
    High-level area models
23.3.2. Search Techniques
    A heuristic search procedure
    Alternative search procedures
23.4. Summary
References
24. Distributed Arithmetic
24.1. Theory
24.2. DA Implementation
24.3. Mapping DA onto FPGAs
24.4. Improving DA Performance
24.5. An Application of DA on an FPGA
References
25. CORDIC Architectures for FPGA Computing
25.1. CORDIC Algorithm
25.1.1. Rotation Mode
25.1.2. Scaling Considerations
25.1.3. Vectoring Mode
25.1.4. Multiple Coordinate Systems and a Unified Description
25.1.5. Computational Accuracy
    Angle approximation error
    Datapath rounding error
25.2. Architectural Design
25.3. FPGA Implementation of CORDIC Processors
25.3.1. Convergence
25.3.2. Folded CORDIC
25.3.3. Parallel Linear Array
25.3.4. Scaling Compensation
25.4. Summary
References
26. Hardware/Software Partitioning
26.1. The Trend Toward Automatic Partitioning
26.2. Partitioning of Sequential Programs
26.2.1. Granularity
26.2.2. Partition Evaluation
26.2.3. Alternative Region Implementations
26.2.4. Implementation Models
26.2.5. Exploration
    Simple formulation
    Formulation with asymmetric communication and greedy/nongreedy automated heuristics
    Complex formulations and powerful automated heuristics
    Other issues
26.3. Partitioning of Parallel Programs
26.3.1. Differences among Parallel Programming Models
    Granularity
    Evaluation
    Alternative region implementations
    Implementation models
    Exploration
26.4. Summary and Directions
References
V. Case Studies of FPGA Applications
27. SPIHT Image Compression
27.1. Background
27.2. SPIHT Algorithm
27.2.1. Wavelets and the Discrete Wavelet Transform
27.2.2. SPIHT Coding Engine
27.3. Design Considerations and Modifications
27.3.1. Discrete Wavelet Transform Architectures
27.3.2. Fixed-point Precision Analysis
27.3.3. Fixed Order SPIHT
27.4. Hardware Implementation
27.4.1. Target Hardware Platform
27.4.2. Design Overview
27.4.3. Discrete Wavelet Transform Phase
27.4.4. Maximum Magnitude Phase
27.4.5. The SPIHT Coding Phase
27.5. Design Results
27.6. Summary and Future Work
References
28. Automatic Target Recognition Systems on Reconfigurable Devices
28.1. Automatic Target Recognition Algorithms
28.1.1. Focus of Attention
28.1.2. Second-level Detection
28.2. Dynamically Reconfigurable Designs
28.2.1. Algorithm Modifications
28.2.2. Image Correlation Circuit
28.2.3. Performance Analysis
28.2.4. Template Partitioning
28.2.5. Implementation Method
28.3. Reconfigurable Static Design
28.3.1. Design-specific Parameters
28.3.2. Order of Correlation Tasks
    Zero mask rows
28.3.3. Reconfigurable Image Correlator
28.3.4. Application-specific Computation Unit
28.4. ATR Implementations
28.4.1. A Dynamically Reconfigurable System
28.4.2. A Statically Reconfigurable System
28.4.3. Reconfigurable Computing Models
28.5. Summary
Acknowledgments
References
29. Boolean Satisfiability: Creating Solvers Optimized for Specific Problem Instances
29.1. Boolean Satisfiability Basics
29.1.1. Problem Formulation
29.1.2. SAT Applications
29.2. SAT-solving Algorithms
29.2.1. Basic Backtrack Algorithm
29.2.2. Improving the Backtrack Algorithm
29.3. A Reconfigurable SAT Solver Generated According to an SAT Instance
29.3.1. Problem Analysis
29.3.2. Implementing a Basic Backtrack Algorithm with Reconfigurable Hardware
29.3.3. Implementing an Improved Backtrack Algorithm with Reconfigurable Hardware
29.4. A Different Approach to Reduce Compilation Time and Improve Algorithm Efficiency
29.4.1. System Architecture
29.4.2. Performance
29.4.3. Implementation Issues
29.5. Discussion
References
30. Multi-FPGA Systems: Logic Emulation
30.1. Background
30.2. Uses of Logic Emulation Systems
30.3. Types of Logic Emulation Systems
30.3.1. Single-FPGA Emulation
30.3.2. Multi-FPGA Emulation
30.3.3. Design-mapping Overview
30.3.4. Multi-FPGA Partitioning and Placement Approaches
30.3.5. Multi-FPGA Routing Approaches
30.4. Issues Related to Contemporary Logic Emulation
30.4.1. In-circuit Emulation
30.4.2. Coverification
30.4.3. Logic Analysis
30.5. The Need for Fast FPGA Mapping
30.6. Case Study: The VirtuaLogic VLE Emulation System
30.6.1. The VirtuaLogic VLE Emulation System Structure
30.6.2. The VirtuaLogic Emulation Software Flow
30.6.3. Multiported Memory Mapping
30.6.4. Design Mapping with Multiple Asynchronous Clocks
30.6.5. Incremental Compilation of Designs
30.6.6. VLE Interfaces for Coverification
30.6.7. Parallel FPGA Compilation for the VLE System
30.7. Future Trends
30.8. Summary
References
31. The Implications of Floating Point for FPGAs
31.1. Why Is Floating Point Difficult?
31.1.1. General Implementation Considerations
31.1.2. Adder Implementation
31.1.3. Multiplier Implementation
31.2. Floating-point Application Case Studies
31.2.1. Matrix Multiply
    FPGA implementation
    Performance
31.2.2. Dot Product
    FPGA implementation
    Performance
31.2.3. Fast Fourier Transform
    FPGA implementation
    Parallel architecture
    Pipelined architecture
    Parallel–pipelined architecture
    Performance
31.3. Summary
References
32. Finite Difference Time Domain: A Case Study Using FPGAs
32.1. The FDTD Method
32.1.1. Background
32.1.2. The FDTD Algorithm
32.1.3. FDTD Applications
    Ground-penetrating radar
    Breast cancer detection
    Spiral antenna model
32.1.4. The Advantages of FDTD on an FPGA
    Parallelism and deep pipelining
    Fixed-point arithmetic
32.2. FDTD Hardware Design Case Study
32.2.1. The WildStar-II Pro FPGA Computing Board
32.2.2. Data Analysis and Fixed-point Quantization
32.2.3. Hardware Implementation
    Memory hierarchy and memory interface
    Managed-cache module
    Memory transfer bottleneck
    Dataflow and processing core optimization
    Expansion to three dimensions
    Pipelining and parallelism
    Pipelining
    Parallelism
    Two hardware implementations
32.2.4. Performance Results
32.3. Summary
References
33. Evolvable FPGAs
33.1. The POE Model of Bioinspired Design Methodologies
33.2. Artificial Evolution
33.2.1. Genetic Algorithms
33.3. Evolvable Hardware
33.3.1. Genome Encoding
    High-level languages
    Low-level languages
    Fitness calculation
33.4. Evolvable Hardware: A Taxonomy
33.4.1. Extrinsic Evolution
33.4.2. Intrinsic Evolution
33.4.3. Complete Evolution
    Centralized evolution
    Population-oriented evolution
33.4.4. Open-ended Evolution
33.5. Evolvable Hardware Digital Platforms
33.5.1. Xilinx XC6200 Family
33.5.2. Evolution on Commercial FPGAs
    Virtual reconfiguration
    Dynamic partial reconfiguration
33.5.3. Custom Evolvable FPGAs
33.6. Conclusions and Future Directions
References
34. Network Packet Processing in Reconfigurable Hardware
34.1. Networking with Reconfigurable Hardware
34.1.1. The Motivation for Building Networks with Reconfigurable Hardware
34.1.2. Hardware and Software for Packet Processing
34.1.3. Network Data Processing with FPGAs
34.1.4. Network Processing System Modularity
34.2. Network Protocol Processing
34.2.1. Internet Protocol Wrappers
34.2.2. TCP Wrappers
34.2.3. Payload-processing Modules
34.2.4. Payload Processing with Regular Expression Scanning
34.2.5. Payload Scanning with Bloom Filters
34.3. Intrusion Detection and Prevention
34.3.1. Worm and Virus Protection
34.3.2. An Integrated Header, Payload, and Queuing System
34.3.3. Automated Worm Detection
34.4. Semantic Processing
34.4.1. Language Identification
34.4.2. Semantic Processing of TCP Data
34.5. Complete Networking System Issues
34.5.1. The Rack-mount Chassis Form Factor
34.5.2. Network Control and Configuration
34.5.3. A Reconfiguration Mechanism
34.5.4. Dynamic Hardware Plug-ins
34.5.5. Partial Bitfile Generation
34.5.6. Control Channel Security
34.6. Summary
References
35. Active Pages: Memory-centric Computation
35.1. Active Pages
35.1.1. DRAM Hardware Design
35.1.2. Hardware Interface
35.1.3. Programming Model
35.2. Performance Results
35.2.1. Speedup over Conventional Systems
35.2.2. Processor–Memory Nonoverlap
35.2.3. Summary
35.3. Algorithmic Complexity
35.3.1. Algorithms
35.3.2. Array-Insert
35.3.3. LCS (Two-dimensional Dynamic Programming)
35.3.4. Summary
35.4. Exploring Parallelism
35.4.1. Speedup over Conventional
35.4.2. Multiplexing Performance
35.4.3. Processor Width Performance
35.4.4. Processor Width versus Multiplexing
    Nonactive memory
    Active Pages processing time
    Partitioning
35.4.5. Summary
35.5. Defect Tolerance
35.6. Related Work
35.7. Summary
Acknowledgments
References
VI. Theoretical Underpinnings and Future Directions
36. Theoretical Underpinnings
36.1. General Computational Array Model
36.2. Implications of the General Model
36.2.1. Instruction Distribution
36.2.2. Instruction Storage
36.3. Induced Architectural Models
36.3.1. Fixed Instructions (FPGA)
36.3.2. Shared Instructions (SIMD Processors)
36.4. Modeling Architectural Space
36.4.1. Raw Density from Architecture
36.4.2. Efficiency
    Mismatch in Wsimd
    Mismatch in Ninstr
    Composite effects
    Efficiency of processors and FPGAs
36.4.3. Caveats
36.5. Implications
36.5.1. Density of Computation versus Description
36.5.2. Historical Appropriateness
36.5.3. Reconfigurable Applications
References
37. Defect and Fault Tolerance
37.1. Defects and Faults
37.2. Defect Tolerance
37.2.1. Basic Idea
37.2.2. Substitutable Resources
37.2.3. Yield
    Perfect yield
    Yield with sparing
37.2.4. Defect Tolerance through Sparing
    Testing
    Global sparing
    Perfect component model
    Local sparing
37.2.5. Defect Tolerance with Matching
    Matching formulation
    Fine-grained Pterm matching
    FPGA component level
37.3. Transient Fault Tolerance
37.3.1. Feedforward Correction
    Memory
37.3.2. Rollback Error Recovery
    Detection
    Recovery
    Communications
37.4. Lifetime Defects
37.4.1. Detection
37.4.2. Repair
37.5. Configuration Upsets
37.6. Outlook
References
38. Reconfigurable Computing and Nanoscale Architecture
38.1. Trends in Lithographic Scaling
38.2. Bottom-up Technology
38.2.1. Nanowires
38.2.2. Nanowire Assembly
38.2.3. Crosspoints
38.3. Challenges
38.4. Nanowire Circuits
38.4.1. Wired-OR Diode Logic Array
38.4.2. Restoration
38.5. Statistical Assembly
38.6. nanoPLA Architecture
38.6.1. Basic Logic Block
38.6.2. Interconnect Architecture
38.6.3. Memories
38.6.4. Defect Tolerance
38.6.5. Design Mapping
38.6.6. Density Benefits
38.7. Nanoscale Design Alternatives
38.7.1. Imprint Lithography
38.7.2. Interfacing
38.7.3. Restoration
38.8. Summary
References

Content preview from Reconfigurable Computing

Chapter 35. Active Pages: Memory-centric Computation

Diana Franklin, Department of Computer Science, California Polytechnic State University

Although field-programmable gate arrays (FPGAs) excel at tailoring computation and interconnect to an application’s needs, we can go one step further. In many applications, memory performance will always be the limiting factor, regardless of the speed of the computation. This problem, known as the memory wall, has two parts: memory latency and memory bandwidth. For large-scale data-parallel applications, the computation can be moved to memory, which allows both parallel computation and increased bandwidth. The replication of small computation units provides parallelism, and the sum of their ...

ISBN: 9780123705228
