O'Reilly logo

Linux Device Drivers, Second Edition by Alessandro Rubini, Jonathan Corbet

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Chapter 6. Flow of Time

At this point, we know the basics of how to write a full-featured char module. Real-world drivers, however, need to do more than implement the necessary operations; they have to deal with issues such as timing, memory management, hardware access, and more. Fortunately, the kernel makes a number of facilities available to ease the task of the driver writer. In the next few chapters we’ll fill in information on some of the kernel resources that are available, starting with how timing issues are addressed. Dealing with time involves the following, in order of increasing complexity:

  • Understanding kernel timing

  • Knowing the current time

  • Delaying operation for a specified amount of time

  • Scheduling asynchronous functions to happen after a specified time lapse

Time Intervals in the Kernel

The first point we need to cover is the timer interrupt, which is the mechanism the kernel uses to keep track of time intervals. Interrupts are asynchronous events that are usually fired by external hardware; the CPU is interrupted in its current activity and executes special code (the Interrupt Service Routine, or ISR) to serve the interrupt. Interrupts and ISR implementation issues are covered in Chapter 9.

Timer interrupts are generated by the system’s timing hardware at regular intervals; this interval is set by the kernel according to the value of HZ, which is an architecture-dependent value defined in <linux/param.h>. Current Linux versions define HZ to be 100 for most platforms, but some platforms use 1024, and the IA-64 simulator uses 20. Despite what your preferred platform uses, no driver writer should count on any specific value of HZ.

Every time a timer interrupt occurs, the value of the variable jiffies is incremented. jiffies is initialized to 0 when the system boots, and is thus the number of clock ticks since the computer was turned on. It is declared in <linux/sched.h> as unsigned long volatile, and will possibly overflow after a long time of continuous system operation (but no platform features jiffy overflow in less than 16 months of uptime). Much effort has gone into ensuring that the kernel operates properly when jiffies overflows. Driver writers do not normally have to worry about jiffies overflows, but it is good to be aware of the possibility.

It is possible to change the value of HZ for those who want systems with a different clock interrupt frequency. Some people using Linux for hard real-time tasks have been known to raise the value of HZ to get better response times; they are willing to pay the overhead of the extra timer interrupts to achieve their goals. All in all, however, the best approach to the timer interrupt is to keep the default value for HZ, by virtue of our complete trust in the kernel developers, who have certainly chosen the best value.

Processor-Specific Registers

If you need to measure very short time intervals or you need extremely high precision in your figures, you can resort to platform-dependent resources, selecting precision over portability.

Most modern CPUs include a high-resolution counter that is incremented every clock cycle; this counter may be used to measure time intervals precisely. Given the inherent unpredictability of instruction timing on most systems (due to instruction scheduling, branch prediction, and cache memory), this clock counter is the only reliable way to carry out small-scale timekeeping tasks. In response to the extremely high speed of modern processors, the pressing demand for empirical performance figures, and the intrinsic unpredictability of instruction timing in CPU designs caused by the various levels of cache memories, CPU manufacturers introduced a way to count clock cycles as an easy and reliable way to measure time lapses. Most modern processors thus include a counter register that is steadily incremented once at each clock cycle.

The details differ from platform to platform: the register may or may not be readable from user space, it may or may not be writable, and it may be 64 or 32 bits wide—in the latter case you must be prepared to handle overflows. Whether or not the register can be zeroed, we strongly discourage resetting it, even when hardware permits. Since you can always measure differences using unsigned variables, you can get the work done without claiming exclusive ownership of the register by modifying its current value.

The most renowned counter register is the TSC (timestamp counter), introduced in x86 processors with the Pentium and present in all CPU designs ever since. It is a 64-bit register that counts CPU clock cycles; it can be read from both kernel space and user space.

After including <asm/msr.h> (for “machine-specific registers”), you can use one of these macros:


The former atomically reads the 64-bit value into two 32-bit variables; the latter reads the low half of the register into a 32-bit variable and is sufficient in most cases. For example, a 500-MHz system will overflow a 32-bit counter once every 8.5 seconds; you won’t need to access the whole register if the time lapse you are benchmarking reliably takes less time.

These lines, for example, measure the execution of the instruction itself:

unsigned long ini, end;
rdtscl(ini); rdtscl(end);
printk("time lapse: %li\n", end - ini);

Some of the other platforms offer similar functionalities, and kernel headers offer an architecture-independent function that you can use instead of rdtsc. It is called get_cycles, and was introduced during 2.1 development. Its prototype is

 #include <linux/timex.h>
 cycles_t get_cycles(void);

The function is defined for every platform, and it always returns 0 on the platforms that have no cycle-counter register. The cycles_t type is an appropriate unsigned type that can fit in a CPU register. The choice to fit the value in a single register means, for example, that only the lower 32 bits of the Pentium cycle counter are returned by get_cycles. The choice is a sensible one because it avoids the problems with multiregister operations while not preventing most common uses of the counter—namely, measuring short time lapses.

Despite the availability of an architecture-independent function, we’d like to take the chance to show an example of inline assembly code. To this aim, we’ll implement a rdtscl function for MIPS processors that works in the same way as the x86 one.

We’ll base the example on MIPS because most MIPS processors feature a 32-bit counter as register 9 of their internal “coprocessor 0.” To access the register, only readable from kernel space, you can define the following macro that executes a “move from coprocessor 0” assembly instruction:[26]

 #define rdtscl(dest) \
    __asm__ __volatile__("mfc0 %0,$9; nop" : "=r" (dest))

With this macro in place, the MIPS processor can execute the same code shown earlier for the x86.

What’s interesting with gcc inline assembly is that allocation of general-purpose registers is left to the compiler. The macro just shown uses %0 as a placeholder for “argument 0,” which is later specified as “any register (r) used as output (=).” The macro also states that the output register must correspond to the C expression dest. The syntax for inline assembly is very powerful but somewhat complex, especially for architectures that have constraints on what each register can do (namely, the x86 family). The complete syntax is described in the gcc documentation, usually available in the info documentation tree.

The short C-code fragment shown in this section has been run on a K7-class x86 processor and a MIPS VR4181 (using the macro just described). The former reported a time lapse of 11 clock ticks, and the latter just 2 clock ticks. The small figure was expected, since RISC processors usually execute one instruction per clock cycle.

[26] The trailing nop instruction is required to prevent the compiler from accessing the target register in the instruction immediately following mfc0. This kind of interlock is typical of RISC processors, and the compiler can still schedule useful instructions in the delay slots. In this case we use nop because inline assembly is a black box for the compiler and no optimization can be performed.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required