Direct Memory Access is a way of streamlining transfers of large blocks of data between two sections of memory or between memory and an I/O device. Let’s say you want to read in 100 MB from disk and store it in memory. You have two options.

The first option is for the processor to read one byte at a time from the disk controller into a register, then store the contents of the register to the appropriate memory location. For each byte transferred, the processor must read an instruction, decode the instruction, read the data, read the next instruction, decode the instruction, and then store the data. Then the process starts over again for the next byte.
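The byte-at-a-time loop described above can be modelled in a few lines of Python. This is only an illustrative sketch: `FakeDiskController`, `read_data_register`, and `programmed_io_read` are hypothetical names, and a real driver would normally also check a status register before each read.

```python
class FakeDiskController:
    """Hypothetical stand-in for a disk controller with a one-byte data register."""

    def __init__(self, payload: bytes):
        self._payload = payload
        self._position = 0

    def read_data_register(self) -> int:
        byte = self._payload[self._position]
        self._position += 1
        return byte


def programmed_io_read(controller, memory: bytearray, dest: int, count: int) -> None:
    """Copy `count` bytes from the controller into memory, one byte per loop pass."""
    for offset in range(count):
        register = controller.read_data_register()  # load: device -> register
        memory[dest + offset] = register             # store: register -> memory


memory = bytearray(16)
programmed_io_read(FakeDiskController(b"DMA!"), memory, dest=4, count=4)
```

Every pass through the loop costs the processor a load and a store plus the instruction fetches that drive them, which is the overhead the rest of this section sets out to avoid.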

The second option in moving large amounts of data around the system is DMA. A special device, called a DMA controller (DMAC), performs high-speed transfers between memory and I/O devices. Using DMA bypasses the processor by setting up a channel between the I/O device and the memory. Thus, data is read from the I/O device and written into memory without the need to execute code to perform the transfer on a byte-by-byte (or word-by-word) basis.

In order for a DMA transfer to occur, the DMAC must have use of the address and data buses. There are several ways in which this could be implemented by the system designer. The most common approach (and probably the simplest) is to suspend the operation of the processor and for the processor to release its buses (the buses are tristate). This allows the DMAC to take over the buses for the short period required to perform the transfer. Processors that support DMA usually have a special control input that enables a DMAC (or some other processor) to request the buses.

There are four basic types of DMA:

  • Standard block transfer is accomplished by the DMA controller performing a sequence of memory transfers. The transfers involve a load operation from a source address followed by a store operation to a destination address. Standard block transfers are initiated under software control and are used for moving data structures from one region of memory to another.

  • Demand-mode transfer is similar to standard mode except that the transfer is controlled by an external device. Demand-mode transfers are used to move data between memory and I/O or vice versa. The I/O device requests and synchronizes the movement of data.

  • Fly-by transfer provides high-speed data movement in the system. Instead of using multiple bus accesses as with conventional DMA transfers, fly-by transfers move data from source to destination in a single access. The data is not read into the processor before going to its destination. During a fly-by transfer, memory and I/O are given different bus control signals. For example, an I/O device is given a read request at the same time that memory is given a write request. Data moves from the I/O device straight into the memory device.

  • Data-chaining transfers allow DMA transfers to be performed as specified by a linked list in memory. Data chaining is started by specifying a pointer to a descriptor in memory. The descriptor is a table specifying byte count, source address, destination address, and a pointer to the next descriptor. The DMAC loads the relevant information about the transfer from this table and begins moving data. The transfer continues until the number of bytes transferred is equal to the entry in the byte count field. On completion, the pointer to the next descriptor is loaded. This continues until a null pointer is found.
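The descriptor list used for data chaining maps naturally onto a linked data structure. The sketch below is a software model only, not any particular DMAC's register layout: the `Descriptor` class, its field names, and `run_chain` are assumptions made for illustration, and Python's `None` stands in for the null pointer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    byte_count: int
    source: int
    destination: int
    next: Optional["Descriptor"] = None  # None plays the role of the null pointer


def run_chain(memory: bytearray, first: Optional[Descriptor]) -> int:
    """Walk the descriptor list, copying each block; return the total bytes moved."""
    total = 0
    descriptor = first
    while descriptor is not None:
        for offset in range(descriptor.byte_count):
            memory[descriptor.destination + offset] = memory[descriptor.source + offset]
        total += descriptor.byte_count
        descriptor = descriptor.next  # load the pointer to the next descriptor
    return total


memory = bytearray(b"ABCDEFGH" + bytes(8))
second = Descriptor(byte_count=2, source=6, destination=12)
first = Descriptor(byte_count=4, source=0, destination=8, next=second)
moved = run_chain(memory, first)  # copies "ABCD" to offset 8, then "GH" to offset 12
```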

To illustrate the use of DMA, let’s consider the example of a fly-by transfer of data from a hard disk controller to memory. A DMA transfer begins by the processor configuring the DMAC for the transfer. This setup involves specifying the source, destination, and size of the data, as well as other parameters. The disk controller generates a request for service to the DMAC (not the processor). The DMAC then generates a HOLD or BR (bus request) to the processor. The processor completes the current instruction; places the address, control, and data buses in a high-impedance state (floats, tristates, or releases them); responds to the DMAC with a HOLD-acknowledge or BG (bus granted); and enters a dormant state. Upon receiving a HOLD-acknowledge, the DMAC places the address of the memory location at which the transfer to memory will begin onto the address bus and generates a WRITE to the memory, while the disk controller places the data on the data bus. Hence, a direct memory access is accomplished from the disk controller to the memory.

In a similar fashion, transfers from memory to I/O devices are also possible. DMACs are capable of handling block transfers of data. The DMAC automatically increments the address on the address bus to point to each successive memory location as the I/O device generates (or receives) data. Once the transfer is complete, the buses are returned to the processor, and it resumes normal operation.
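The sequence of events in the disk-to-memory example can be summarized as a small behavioural model. This is a sketch under simplifying assumptions: the function name `fly_by_transfer` and the event strings are invented for illustration, and the whole block is moved during a single bus grant.

```python
def fly_by_transfer(device_data: bytes, memory: bytearray, start_address: int):
    """Return (event log, next address) for one modelled fly-by block transfer."""
    events = [
        "device -> DMAC: request service",
        "DMAC -> CPU: HOLD (bus request)",
        "CPU: finish instruction, float buses",
        "CPU -> DMAC: HOLD-acknowledge (bus grant)",
    ]
    address = start_address
    for byte in device_data:
        # The DMAC drives the address and WRITE; the device drives the data bus.
        memory[address] = byte
        address += 1  # the DMAC increments the address for the next location
    events.append("DMAC: release buses, CPU resumes")
    return events, address


memory = bytearray(8)
events, next_address = fly_by_transfer(b"\x10\x20\x30", memory, start_address=2)
```

Note that no per-byte instruction stream appears anywhere in the model: the processor's only involvement is granting the buses and resuming afterward.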

Not all DMA controllers support all forms of DMA. Some DMA controllers simply read data from a source, hold it internally, and then store it to a destination. They perform the transfer in exactly the same way that a processor would. The advantage of a DMA controller over a processor is that each transfer performed by a processor still has program fetches associated with it. Thus, even though a transfer by a DMA controller takes place by sequential reads and writes, the controller does not also have to fetch and execute code, thereby providing a faster transfer.
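A rough count of bus accesses makes the comparison concrete. The figures below are assumptions chosen for illustration: every instruction fetch costs one bus access (no cache), loop overhead such as incrementing and branching is ignored, and the names `BUS_ACCESSES_PER_BYTE` and `bus_accesses` are hypothetical.

```python
BUS_ACCESSES_PER_BYTE = {
    "processor": 2 + 1 + 1,         # two instruction fetches, one data read, one data write
    "read_then_write_dmac": 1 + 1,  # one data read, one data write, no instruction fetches
    "fly_by_dmac": 1,               # a single access moves the data
}


def bus_accesses(method: str, byte_count: int) -> int:
    """Total bus accesses needed to move byte_count bytes, one byte per transfer."""
    return BUS_ACCESSES_PER_BYTE[method] * byte_count


counts = {method: bus_accesses(method, 1024) for method in BUS_ACCESSES_PER_BYTE}
# processor: 4096, read_then_write_dmac: 2048, fly_by_dmac: 1024
```

Under these assumptions, even a DMAC that reads and then writes each byte needs half the bus traffic of the processor-driven loop, and a fly-by transfer needs a quarter.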

Support for DMA is normally not found in small microcontrollers. Some midrange processors (16-bit, low-end 32-bit) may have DMA support. All high-end processors (32-bit and above) will have DMA support, and many include a DMA controller on-chip. Similarly, peripherals intended for small-scale computers will not provide DMA support, whereas peripherals intended for high-speed and powerful computers definitely will have DMA support.

Parallel and Distributed Computers

Some embedded applications require greater performance than is achievable from a single processor. For cost reasons, implementing a design with the latest superscalar RISC processor may not be practical, or perhaps the application lends itself to distributed processing with the tasks run across several communicating machines. Using a fleet of lower-cost processors, distributed throughout the installation, may make more sense. Implementing embedded systems using parallel processors is becoming increasingly common.

Introduction to parallel architectures

The traditional architecture for computers follows the conventional, von Neumann serial architecture. Computers based on this form usually have a single, sequential processor. The main limitation of this form of computing architecture is that the conventional processor is able to execute only one instruction at a time. Algorithms that run on these machines must therefore be expressed as a sequential problem. A given task must be broken down into a series of sequential steps, each to be executed in order, one at a time.

Many computationally intensive problems are also highly parallel. These problems are characterized by an algorithm that is applied to a large data set. Often the computation for each element in the data set is the same and is only loosely reliant on the results from computations on neighboring data. Thus, speed advantages may be gained from performing calculations in parallel for each element in the data set, rather than sequentially moving through the data set and computing each result in a serial manner. Machines with multitudes of processors working on a data structure in parallel often far outperform conventional computers in such applications.

The grain of the computer is defined as the number of processing elements within the machine. A coarsely grained machine has relatively few processors, whereas a finely grained machine may have tens of thousands of processing elements. Typically, the processing elements of a finely grained machine are much less powerful than those of a coarsely grained computer. The processing power is achieved through the brute-force approach of having such a large number of processing elements.

There are several different forms of parallel machine. Each architecture has its own advantages and limitations, and each has its share of supporters.

Single-instruction multiple-data computers

Single-Instruction Multiple-Data (SIMD) computers are highly parallel machines, employing large arrays of simple processing elements. In an SIMD machine, each processing element has a small amount of local memory. The instructions executed by the SIMD computer are broadcast from a central instruction server to every processing element within the machine. In this way, each processor executes the same instruction as all other processing elements within the machine. Since each processor executes the instruction on its local data, all elements within the data structure are worked upon simultaneously.
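The broadcast model can be illustrated with a toy simulation. This is a sketch, not a description of any real SIMD instruction set: `ProcessingElement` and `broadcast` are hypothetical names, each element's local memory is reduced to a single integer, and the loop visits elements one after another where real hardware would operate on all of them at once.

```python
from typing import Callable, List


class ProcessingElement:
    def __init__(self, value: int):
        self.local = value  # small local memory, modelled as one integer


def broadcast(elements: List[ProcessingElement], instruction: Callable[[int], int]) -> None:
    """Apply one broadcast instruction to every element's local data."""
    for element in elements:  # sequential here; simultaneous on real hardware
        element.local = instruction(element.local)


array = [ProcessingElement(value) for value in [1, 2, 3, 4]]
broadcast(array, lambda x: x * 2)  # every element doubles its own value
broadcast(array, lambda x: x + 1)  # then every element adds one
results = [element.local for element in array]  # [3, 5, 7, 9]
```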

The SIMD machine is generally used in conjunction with a conventional computer. An example of this was the Connection Machine (CM-1) by Thinking Machines Corporation, which used either a VAX minicomputer or a Silicon Graphics or Sun workstation as the “host” computer. The Connection Machine was a finely grained SIMD computer with up to 64K processing elements that appeared as a block of 64K of “intelligent memory” to the host system. An application running on the host downloaded a data set into the processor array of the Connection Machine, each processor within the CM-1 acting as a single memory unit. The host then issued instructions to each processing element of the CM-1 simultaneously. After the computations were completed, the host then read back the result from the Connection Machine, as though it were conventional memory.

The primary advantage of the SIMD machine is that simple and cheap processing elements are used to form the computer. Thus, significant computing power is available using inexpensive, off-the-shelf components. In addition, since each processor is executing the same instructions and therefore sharing a common instruction fetch, the architecture of the machine is somewhat simpler. Only one instruction store is required for the entire computer.

The use of multiple processing elements, each executing the same instructions in unison, is also the SIMD’s main disadvantage. Many problems do not lend themselves to being broken down into a form suitable for executing on an SIMD computer. In addition, the data sets associated with a given problem may not match well with a given SIMD architecture. For example, an SIMD machine with 10k processing elements does not mesh well with a data set of 12k data elements.
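The mismatch in that last example can be quantified. The calculation below assumes that "10k" and "12k" mean 10,000 and 12,000, that each processing element handles one data element per pass, and that leftover elements are handled in an extra pass with the remaining processing elements idle; `simd_fit` is a hypothetical helper name.

```python
import math


def simd_fit(data_elements: int, processing_elements: int):
    """Return (passes needed, fraction of processing-element slots doing useful work)."""
    passes = math.ceil(data_elements / processing_elements)
    utilization = data_elements / (passes * processing_elements)
    return passes, utilization


passes, utilization = simd_fit(12_000, 10_000)  # 2 passes, 60% utilization
```

Under these assumptions, the extra 2,000 elements double the run time, and in the second pass 8,000 of the 10,000 processing elements have nothing to do.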

Multiple-instruction multiple-data computers

The other major form of parallel machine is the Multiple-Instruction Multiple-Data (MIMD) computer. These machines are typically coarsely grained collections of semiautonomous processors, each with its own local memory and local programs. An algorithm being executed on an MIMD computer is typically broken up into a series of smaller subproblems, each executed on a processor of the MIMD machine. If every processing element is given an identical program to execute, the MIMD machine may be treated as an SIMD computer. The grain of an MIMD computer is much coarser than that of an SIMD machine: MIMD computers tend to use a smaller number of very powerful processors, rather than a large number of less powerful ones.

MIMD computers are of one of two types: shared-memory MIMD and message-passing MIMD. Shared-memory MIMD systems have an array of high-speed processors, each with local memory or cache, and each with access to a large, global memory (Figure 1-11). The global memory contains the programs and data to be executed by the machine. Also in this memory is a table of processes (or subprograms) awaiting execution. Each processor will fetch a process and associated data into its local memory or cache and will run semiautonomously of the other processors in the system. Process communication also takes place through the global memory.


Figure 1-11. Shared-memory MIMD

A speed advantage is gained by sharing the program among several powerful processors. However, logic within the system must arbitrate between processors for access to the shared memory and associated shared buses of the system. In addition, allowances must be made for a processor attempting to access data in global memory that is out of date. If processor A reads a process and data structure into its local memory and subsequently modifies that data structure, processor B attempting to access the same data structure in main memory must be notified that a more recent version of the data structure exists. Such arbitration is implemented in processors like the (now-extinct) Motorola MC88110, which was intended for use in shared-memory MIMD machines.
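The process table and the need for arbitration can be sketched with threads standing in for processors. This is an analogy only: `run_shared_memory` is a hypothetical name, a `threading.Lock` plays the part of the arbitration logic, and the model does not attempt to show cache coherency or any real speedup.

```python
import threading


def run_shared_memory(tasks, worker_count=3):
    """Workers pull (function, argument) entries from a shared table until it is empty."""
    process_table = list(tasks)  # lives in "global memory"
    results = []                 # also shared
    arbiter = threading.Lock()   # stands in for the bus/memory arbitration logic

    def worker():
        while True:
            with arbiter:  # only one processor touches global memory at a time
                if not process_table:
                    return
                function, argument = process_table.pop()
            value = function(argument)  # computed "locally", no lock held
            with arbiter:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


squares = sorted(run_shared_memory([(lambda n: n * n, n) for n in range(6)]))
```

The order in which workers finish is not deterministic, which is why the results are sorted before use.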

An alternative MIMD architecture is that of the message-passing MIMD computer (Figure 1-12). In this system, each processor has its own local, main memory. No global memory exists for the machine. Each processing element (processor with local memory) either loads or has loaded into it the programs (and associated data) that it is to execute. Each process runs autonomously on its local processor, and interprocess communication is achieved through message passing through a common medium. The processors may communicate through a single, shared bus (such as Ethernet, CAN, or SCSI) or by using a more elaborate interprocessor connection architecture, such as 2-D arrays, N-dimensional hypercubes, rings, stars, trees, or fully interconnected systems.


Figure 1-12. Message-passing MIMD
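A message-passing arrangement can be modelled with threads that share nothing except queues. This is a minimal sketch: the `node` function, the use of `None` as a stop message, and the pair of queues per node are all assumptions made for illustration, and the queues stand in for whatever communication medium a real machine would use.

```python
import queue
import threading


def node(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """A processing element: private state, communicating only through messages."""
    local_total = 0  # local memory, invisible to other nodes
    while True:
        message = inbox.get()
        if message is None:  # hypothetical "stop" message
            outbox.put(local_total)
            return
        local_total += message


links = [(queue.Queue(), queue.Queue()) for _ in range(2)]
workers = [threading.Thread(target=node, args=link) for link in links]
for worker in workers:
    worker.start()
for value in range(10):
    links[value % 2][0].put(value)  # deal the data out to the two nodes
for inbox, _ in links:
    inbox.put(None)
partial_sums = [outbox.get() for _, outbox in links]  # [20, 25]
for worker in workers:
    worker.join()
```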

Such machines do not suffer the bus contention problems of shared-memory machines. However, the most effective and efficient means of interconnecting the processing nodes of a message-passing MIMD machine is still a major area of research. Each different architecture has its own merits, and which is best for a given application depends to a certain degree on what that application is. Problems that require only a limited amount of interprocess communication may work effectively on a machine without high interconnectivity, whereas other applications may weigh down the communications medium with their message passing. If a percentage of a processing node’s time is spent in message routing for its neighbors, a machine with a high degree of interprocess communication but with a low degree of interconnectivity may spend most of its time dealing in message passing with little time spent on actual computation.

The ideal interconnection architecture is that of the fully interconnected system, with every processing node having a direct communications link with every other processing node. However, this is not always practical due to the costs and logistics of such a high degree of interconnectivity. A solution to this problem is to provide each processing element in the machine with a limited number of connections, based on the assumption that a processing element will not need or be able to communicate with every other processing element in the machine simultaneously. These limited connections from each processing node may then be interconnected using a crossbar switch, thereby providing full interconnectivity for the machine through only a limited number of links per node.
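The cost of full interconnection is easy to see by counting point-to-point links. The sketch below compares three of the topologies mentioned earlier; `link_counts` is a hypothetical helper, and it assumes the node count is a power of two (so that a hypercube is well defined) and at least three (so that a ring has one link per node).

```python
def link_counts(nodes: int) -> dict:
    """Point-to-point links needed to connect `nodes` processors."""
    dimension = nodes.bit_length() - 1  # log2(nodes) when nodes is a power of two
    return {
        "fully_interconnected": nodes * (nodes - 1) // 2,  # one link per pair of nodes
        "hypercube": nodes * dimension // 2,               # `dimension` links per node
        "ring": nodes,                                      # one link to each neighbor
    }


counts = link_counts(64)
# fully_interconnected: 2016, hypercube: 192, ring: 64
```

The fully interconnected count grows with the square of the number of nodes, which is why it quickly becomes impractical as the machine grows.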

A distributed machine is composed of individual computers, networked together as a loosely coupled MIMD parallel machine. Projects such as Beowulf and even SETI@Home can be considered MIMD machines. Distributed machines are common in the embedded world. A collection of small processing nodes may be distributed across a factory, providing local monitoring and control, and together forming a parallel machine executing the global control algorithm. The avionics of commercial and military aircraft are also distributed parallel computers.

Now let’s take a look at computer applications and how they relate to the architecture of the machine.
