Introduction to Linux Device Driver Development

Prepared by Richard A. Sevenich, [email protected]

Chapter 4. A Selection of Topics from the Kernel Internals

General References:

• Love, Linux Kernel Development, Sams (2004).
• Bovet & Cesati, Understanding the Linux Kernel, 2nd Edition, O'Reilly (2003).
• Beck et al., Linux Kernel Programming, 3rd Edition, Addison-Wesley (2002).
• Rubini & Corbet, Linux Device Drivers, 2nd Edition, O'Reilly (2001).
• Linux Kernel Version 2.4 Source Code.

Note: At this writing the Rubini & Corbet book is freely available via download from http://www.xml.com/ldd/chapter/book/

4.1 Introduction

The topic areas we'll skim are:
• system calls
• signals
• wait queues
• task queues
• kernel timers and other timely topics
• interrupt handling
• process scheduler

The kernel continues to change in each of these areas, particularly in response to the need for greater scalability.

4.2 The System Call Dispatcher

Some of you will recall the MS DOS function dispatcher. It provided functions such as:
• send a character to the screen or printer
• receive a character from the keyboard
• read/write a disk drive
• get/set time or date

This was implemented in assembly language and used the following interface:
• put the number of the desired function in the ah register
• perform any other initialization needed by the function (using other registers)
• call interrupt 0x21

The corresponding interrupt handler was the function dispatcher.

The Linux system call dispatcher may be more complex, but is essentially similar to the MS DOS function dispatcher. Such a jump table is not a new idea. Let's look at an example:

    #include <unistd.h>

    int main() {
      int result;
      result = write(1, "hello\n", 6);
      ...
    }

The code above makes a library call, write, which is a wrapper around the sys_write system call. Some authors refer to the library call as a stub. The arguments of write are passed via the stack to the library function, which does some setup and then invokes (assuming IA-32) int 0x80, the system call dispatcher.

R.A. Sevenich © 2004 Introduction to Linux Device Driver Development 4 - 1


In our example, the library call would do this setup for IA-32:
• put the system call number for write (4) in the eax register
• put the first argument (stdout = 1) into the ebx register
• put the second argument (a pointer to the string "hello\n") into the ecx register
• put the third argument (length of string = 6) into the edx register
• invoke int 0x80

The interrupt handler switches to kernel mode, performs the task, and returns a result in eax to the library function.

Our goal in this chapter is to get familiar with some details of the underlying implementation, then implement our own system call and use it. Integrating our system call into the kernel will necessitate recompiling the kernel as you have done before. However, you have a good .config file, right - and it's backed up, right?

4.2.1 Implementation Details

An important header file to look at is <asm/unistd.h>. It starts with a table of #define's containing the system call numbers. The number for the write library call we considered earlier appears as

    #define __NR_write 4

Further investigation of this header file suggests that the library wrapper around the system call, or stub, can be generated by a macro call of this form (see 'man 2 intro'):

    _syscallX(type, name, type1, arg1, type2, arg2, ...)

where
• X is the number of arguments taken by the stub (range is 0 through 5)
• type is the return type of the system call
• name is the name of the system call
• typeN is the type of the Nth argument
• argN is the name of the Nth argument

These macros can be seen in <linux/unistd.h>. For example, in that header file we find:

    #define _syscall3(type, name, type1, arg1, type2, arg2, type3, arg3) \
    type name(type1 arg1, type2 arg2, type3 arg3) \
    { \
    long __res; \
    __asm__ volatile ("int $0x80" \
      : "=a" (__res) \
      : "0" (__NR_##name), "b" ((long)(arg1)), "c" ((long)(arg2)), \
        "d" ((long)(arg3))); \
    __syscall_return(type, __res); \
    }

and later:

    static inline _syscall3(int, write, int, fd, const char *, buf, off_t, count)

Hence we know how the macro expands into the definition of write, i.e.

    int write(int fd, const char * buf, off_t count)
    {
      long __res;
      __asm__ volatile ("int $0x80"
        : "=a" (__res)
        : "0" (__NR_write), "b" ((long)(fd)), "c" ((long)(buf)),
          "d" ((long)(count)));
      __syscall_return(int, __res);
    }

Here we see in which registers parameters are passed, how the value 4 identifying the write system call is determined, etc. It is left for you to expand the __syscall_return macro. Note that we have explained the scenario from making the library call in the user code to having the library call subsequently invoke int 0x80.

Now we've claimed that write is a wrapper around the actual kernel level system call, sys_write. How is sys_write called and where in the source code is it? If we knew those answers we'd be on our way to doing our own implementation. We note that int 0x80 ultimately results in executing the code in <linux/arch/i386/kernel/entry.S>. Look particularly at the code starting from ENTRY(system_call), noting that it soon does a call to a reference in the sys_call_table. That table is at the end of the entry.S file where we find, for example, that entry number 4 (__NR_write) is

    .long SYMBOL_NAME(sys_write)


So that's how it finds the call to sys_write. Where in the source code is sys_write? Using http://lxr.linux.no/source/, we find that it is in /usr/src/linux/fs/read_write.c. Check to see what other directories are at this level and see what other system calls you might locate (e.g. sys_fork, sys_chmod, sys_alarm).

We have enough details and putting the picture together will allow us to create our own system call.

4.2.2 Implementing our own System Call

We'll just lay out the recipe, based on what we discovered in the previous section. Further, we'll pick a specific example so everything is concrete. Here is the recipe:
1. Write your new system call, sys_my_new_call, in a file it_b_mine.c. As root, copy the file to /usr/src/linux/kernel/.
2. Modify the Makefile in /usr/src/linux/kernel/.
3. Edit /usr/src/linux/arch/i386/kernel/entry.S and /usr/src/linux/include/asm/unistd.h, in that order (described in Section 4.2.5).
4. Recompile the kernel via 'make bzImage' while in /usr/src/linux, copy bzImage to the appropriate vmlinuz in /boot, run lilo, and reboot (cf. Chapter 2 of the course notes).
5. Write a user program which exercises your new system call.

You might tar and zip your current /usr/src/linux, because we're going to make some changes which you'll want to remove subsequently.

4.2.3 The new system call

Here's the file it_b_mine.c:

    #include <linux/kernel.h>

    asmlinkage int sys_my_new_call(void) {
      printk(KERN_ALERT "sys_my_new_call at your service\n");
      return 0;
    }

As root, copy it into /usr/src/linux/kernel. Double check that the ownership and permissions are consistent with other files in that directory.

4.2.4 Modify the Makefile

Modify the Makefile in /usr/src/linux/kernel to add it_b_mine.o to the entries for obj-y.

4.2.5 As root, edit unistd.h and entry.S

Near the end of the file /usr/src/linux/arch/i386/kernel/entry.S, you'll find the jump table. At the very end of that table, add

    .long SYMBOL_NAME(sys_my_new_call)

and note the position. In my case, it was 226. Save the new entry.S.

Next, near the beginning of the file /usr/src/linux/include/asm/unistd.h, you'll find the table of system call numbers. Add the appropriate entry, i.e. at the end I added

    #define __NR_my_new_call 226

where, in your case, the number might be different from 226, but must match the position in the entry.S file. Save this unistd.h.

4.2.6 Recompile and reboot

Unless you are also doing some reconfiguration, you need not do all the steps seen earlier in Section 1.4 of Chapter 1. In particular, you can start with Step 5 of that section and then something along the lines of Steps 8 and 9. Essentially all you need to do then is
• compile via 'make bzImage'
• copy the new kernel to /boot
• revise lilo.conf, if necessary, and rerun lilo ... or modify /boot/grub/menu.lst, if necessary
• reboot


4.2.7 A user program using our new system call

Let's continue to assume the source hierarchy in which we are working is /usr/src/linux. Now gcc expects the include files to be at /usr/include/, but ours are at /usr/src/linux/include. Sometimes there are symbolic links from the former to the latter, in particular from

    /usr/include/asm to /usr/src/linux/include/asm

and from

    /usr/include/linux to /usr/src/linux/include/linux

So that we don't need to modify linkages for our particular example, we'll just tell gcc where those files are when we compile the user program, i.e.

    gcc -I /usr/src/linux/include ... and so on.

Here is a user program:

    /* Use my_new_call */
    #include <errno.h>          /* the _syscallX stubs set errno */
    #include <sys/types.h>
    #include <linux/unistd.h>

    static inline _syscall0(int, my_new_call);

    int main() {
      int result;
      result = my_new_call();
      return result;
    }

Compile and run this program. It should print to some log file, e.g. to /var/log/messages:

    sys_my_new_call at your service

which you can verify via something like

    tail -f /var/log/messages

If it's printing to some other log file, you can do some detective work looking at time stamps via

    ls -l /var/log/

and see which log files have been written recently.

4.2.8 Return to normalcy

If desired back out all the changes you made in this chapter and return your system to its original state.

4.2.9 Adding a bit more substance to our system call

User programs, of course, cannot be allowed access to kernel space. Yet we may need to pass information back and forth under tight control, e.g. via the system call mechanism and appropriate kernel functions. Linux provides various ways to do this. Here we'll introduce two macros:
• get_user() - can be called by a kernel process to get a single datum from the user's memory space
• put_user() - can be called by a kernel process to put a single datum into the user's memory space

Here is the necessary information for get_user():

    #include <asm/uaccess.h>
    get_user(datum, ptr)

This will read the datum from user space, where ptr is the user space address. The size of the datum transferred depends on the type of the ptr argument and is determined by gcc at compile time. The macro returns 0 on success, otherwise an error (-EFAULT).

Here is the necessary information for put_user():

    #include <asm/uaccess.h>
    put_user(datum, ptr)

This will write the datum to user space, where ptr is the user space address. The size of the datum transferred depends on the type of the ptr argument and is determined by gcc at compile time. The macro returns 0 on success, otherwise an error (-EFAULT).


As an example, we'll invent two new system calls:
• sys_new_sys1 - will use get_user()
• sys_new_sys2 - will use put_user()

We'll package them together in the same file and put that file in /usr/src/linux/kernel. We also must modify the Makefile in that directory and put two new entries in both

    /usr/src/linux/arch/i386/kernel/entry.S and /usr/src/linux/include/asm/unistd.h

So we are essentially just following the recipe at the start of Section 4.2.2.

Here is the new kernel program:

    /* new_sysen.c */
    #include <linux/kernel.h>
    #include <asm/uaccess.h>
    #include <asm/errno.h>

    static int shared_int = 0;

    /* multiply the stored value by 5 and copy it out to user space */
    asmlinkage int sys_new_sys2(unsigned long arg) {
      shared_int = 5 * shared_int;
      printk(KERN_ALERT "sys_new_sys2 will call put_user()\n");
      if (put_user(shared_int, (int *)arg) != 0) return -EFAULT;
      return 0;
    }

    /* copy an int in from user space and store it */
    asmlinkage int sys_new_sys1(unsigned long arg) {
      shared_int = 0;
      printk(KERN_ALERT "sys_new_sys1 will call get_user()\n");
      if (get_user(shared_int, (int *)arg) != 0) return -EFAULT;
      return 0;
    }

Here is an example user program which makes use of the two new system calls.

    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>          /* the _syscallX stubs set errno */
    #include <linux/unistd.h>
    #include <sys/types.h>

    static inline _syscall1(int, new_sys1, int *, foo1)
    static inline _syscall1(int, new_sys2, int *, foo2)

    int main() {
      int user_space_int;
      user_space_int = 16;
      printf("user_space_int starts with value %d\n", user_space_int);

      if (new_sys1(&user_space_int) != 0) {
        printf("new_sys1 failed.\n");
        exit(-1);
      }
      if (new_sys2(&user_space_int) != 0) {
        printf("new_sys2 failed.\n");
        exit(-1);
      }
      printf("user_space_int finishes with value %d\n", user_space_int);
      exit(0);
    }


4.3 Signals

We will see that there is a variety of available signals and there are various ways a program can be set up to respond to signals - giving the signal mechanism both power and flexibility. More specifically, a signal can have these possible effects on a program (please note the similarity to the hardware interrupt mechanism):
• The signal is 'caught' by the program: Execution is transferred to a signal handler and, upon its completion, control is returned to the signaled program.
• There is no signal handler so the appropriate default is exercised:
    STOP: The program is put into a stopped state, but can be returned to a runnable state later.
    EXIT: The program is forced to exit.
    CORE: The program is forced to exit and a core dump is generated and filed in the program's current working directory.
    IGNORE: The signal is ignored.
• The SIGKILL and SIGSTOP signals are distinct in that they can neither be caught nor ignored.

A program's response to a signal is consistent throughout the process, so that all threads within a process respond the same way.

Signals have names (all starting with 'SIG'), values, and default actions. These are listed in the man page, i.e. enter 'man 7 signal'. You'll note from the man page that there is a POSIX signal API and a legacy API. The referenced book by Johnson and Troan has a very nice chapter on signals which moves through the legacy signal mechanisms, which were in some cases incompatible with each other. It also discusses the unreliability of the ANSI C standardization of the signal() function. It is recommended that the well defined and reliable POSIX signal API be used.

4.3.1 The kernel's use of signals

Of course, the kernel already uses signals to conduct its everyday business. Here are some examples from the man page:
• If a program makes an invalid memory reference (e.g. a wild pointer), the kernel sends the offending process a SIGSEGV, with default action CORE.
• If a child process has stopped or terminated, the kernel sends the parent a SIGCHLD, with default action IGNORE.
• If the suspend keystroke combination (often CTRL-z) is pressed, the kernel sends SIGTSTP to any foreground process, with default action STOP.
• If a program writes to a pipe which has no readers, the kernel sends that process a SIGPIPE, with default action EXIT.

In general, the kernel uses signals for various reasons, not merely on error conditions. A categorization of such reasons might include:
• Program termination
• Program stopping and subsequent continuing
• Dealing with errant programs
• Terminal handling
• Program notification (e.g. a timeout alarm, death of child)

Again, note that some signals originate in response to a hardware interrupt, i.e. the interrupt handler causes a signal to be sent.

4.3.2 Signals in user programs

As expected, user programs' use of signals is more restricted. They cannot, for example, just send signals to anyone. They can, however, set themselves up to catch a variety of kernel generated signals - often having to do with signals sent in connection with terminal activity. Furthermore, the POSIX signals include a pair of user-defined signals, SIGUSR1 and SIGUSR2, whereby two user programs with the same uid can communicate.


4.3.3 Signal handlers in user programs

Although there may be instances where we want the default response to the signal, it is alternatively possible that we will want to catch and handle the signal - that will be the focus of this section. POSIX signals are organized in sets, represented by a data type sigset_t. Linux provides us with a group of functions for safely manipulating signal sets:

• empty the referenced set of all signals
    int sigemptyset(sigset_t * set);

• fill the referenced set with all signals
    int sigfillset(sigset_t * set);

• add a specified signal to the referenced set
    int sigaddset(sigset_t * set, int signo);

• remove a specified signal from the referenced set
    int sigdelset(sigset_t * set, int signo);

• test whether a specified signal is a member of the referenced set
    int sigismember(const sigset_t * set, int signo);

The program that wishes to catch the signal will also declare the signal handler. The prototype for a signal handler is

    typedef void (*__sighandler_t)(int signo);

The reference to your signal handler is placed in the struct sigaction, which specifies how the kernel should deliver signals to your program. The struct looks like this:

    struct sigaction {
      __sighandler_t sa_handler;
      unsigned long sa_flags;
      void (*sa_restorer)(void);
      sigset_t sa_mask;
    };

Now we'll describe the items in this struct:
• sa_handler is a pointer to your signal handler; alternatively it can be
    SIG_IGN - tells the kernel to ignore the signal
    SIG_DFL - tells the kernel to use the default response
• sa_flags is a bitmask, formed by OR'ing together various options, that controls kernel behavior when the signal is received. Our subsequent example sets this to zero. You might investigate other options.
• sa_restorer is not used by Linux
• sa_mask specifies the signals to be blocked while the signal handler is executing


Once the sigaction struct is filled in, the sigaction() system call can be invoked to tell the kernel how the signal should be delivered. The following user space program provides an example.

    #include <signal.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>

    #define true 1
    #define false 0

    /* volatile sig_atomic_t: written by the handler, polled by main */
    volatile sig_atomic_t caught = false;

    /* here's a trivial signal handler */
    void mysig_handler(int sig) {
      /* printf is not async-signal-safe; tolerable only in a toy example */
      printf("mysig_handler got SIGALRM.\n");
      caught = true;
    }

    int main(void) {
      /* declare the sigaction struct */
      struct sigaction mysig_action;
      /* fill in the necessary fields in the prior struct */
      mysig_action.sa_handler = mysig_handler;
      sigemptyset(&mysig_action.sa_mask);
      mysig_action.sa_flags = 0;
      /* pass the signal and related struct to the kernel */
      sigaction(SIGALRM, &mysig_action, NULL);

      printf("Now calling alarm(5)\n");
      /* set up a SIGALRM at 5 seconds from now */
      alarm(5);
      /* let's hang around until the signal is caught */
      while (!caught);

      printf("Resumed program upon signal handler completion.\n");
      return(0);
    }

4.4 Wait Queues

It routinely happens in a wide variety of circumstances that a kernel process needs to wait for a particular event to happen. Although there are instances where the process may then do a busy waiting loop (e.g. spinlocks in a multiprocessor environment), it is often more appropriate that the process block, so other processes can continue to keep the cpu busy doing useful work. This capability is supported by wait queues. The wait queue struct is a cyclic linked list:

    struct wait_queue {
      struct task_struct * task;
      struct wait_queue * next;
    };

The supporting macros include those that
• put the process to sleep
• awaken the process
• add and delete wait queue members

We'll examine these next.


Putting a process to sleep on a wait queue

These include the following:

• void sleep_on(struct wait_queue **p);
  This sets the process state to TASK_UNINTERRUPTIBLE, enters the process in the designated wait queue, and relinquishes control by calling the scheduler. The process must be awakened by some other process which does a wake up call (discussed under the next subheading) for this queue.

• void interruptible_sleep_on(struct wait_queue **p);
  This sets the process state to TASK_INTERRUPTIBLE, enters the process in the designated wait queue, and relinquishes control by calling the scheduler. The process must be awakened by some other process which does a wake up call for this queue, but can also be awakened by a signal.

• void sleep_on_timeout(struct wait_queue **p, long timeout);
  This sets the process state to TASK_UNINTERRUPTIBLE, enters the process in the designated wait queue, and relinquishes control by calling schedule_timeout. The process is awakened at the time specified by the timeout argument, rather than by requiring some other process to do a wake up call for this queue.

• void interruptible_sleep_on_timeout(struct wait_queue **p, long timeout);
  This sets the process state to TASK_INTERRUPTIBLE, enters the process in the designated wait queue, and relinquishes control by calling schedule_timeout. The process is awakened at the time specified by the timeout argument, rather than by requiring some other process to do a wake up call for this queue. However, the process can also be awakened by a signal.

Awakening a process on a wait queue

These include the following:

• void wake_up(struct wait_queue **p);
  This will wake up both interruptible and noninterruptible sleepers on the designated queue.

• void wake_up_interruptible(struct wait_queue **p);
  This will wake up only interruptible sleepers on the designated queue.

Note that the wake up calls will not awaken processes which were explicitly stopped.

Adding/deleting wait queue members

To safely add and remove members of wait queues we have:
• void add_wait_queue(struct wait_queue **queue, struct wait_queue *entry);
• void remove_wait_queue(struct wait_queue **queue, struct wait_queue *entry);

In both cases, the first argument refers to the queue of interest, while the second refers to the entry to be added or removed, respectively.

4.4.1 Race Conditions

Let's say we put some process to sleep until some condition is true, maybe using a construction like this:

    while (wake_condition == false) {
      interruptible_sleep_on(&my_wait_queue);
      ...
    }

With the demise of the big kernel lock, this may be subject to race conditions. A race occurs if the wake condition evaluates as false in the first line and becomes true before the second line executes: the wake up call is issued before the process is on the queue, so the process misses it. In the worst case, the process will experience deadlock, sleeping with nobody left to wake it. This can be avoided with some clever programming, but this has been encapsulated in the kernel - so we don't even need to be clever. The appropriate replacement for the prior code snippet is

    wait_event_interruptible(my_wait_queue, wake_condition == true);

There is also the expected

    wait_event(my_wait_queue, wake_condition == true);


4.5 Task Queues

Task queues hold tasks to be executed at a later time. The kernel provides predefined task queues in which you can register your task. The scheduler then decides just when tasks in such a queue will be executed. Alternatively, you can define your own task queue and specify when it should execute. A queue element is a tq_struct as defined by:

    #include <linux/tqueue.h>

    struct tq_struct {
      struct tq_struct *next;    /* linked list of queued tasks */
      unsigned long sync;        /* must be initialized to zero */
      void (*routine)(void *);   /* function to call */
      void *data;                /* argument to function */
    };

Once you have declared an element, you should
• clear the next and sync fields
• enter appropriate items in the routine and data fields

Then you may queue the task with the queue_task function whose prototype is

    void queue_task(struct tq_struct *task, task_queue *list);

Note: For the predefined tq_scheduler queue, the related code must use schedule_task to put the task on the tq_scheduler queue, not queue_task. We'll see an example shortly.

To run a queue of tasks the function used is run_task_queue with prototype

    void run_task_queue(task_queue *list);

which the kernel invokes for its predefined task queues and which you must call for any task queue you define yourself.

4.5.1 Queues Predefined by the Kernel

The four queues predefined by the kernel are:
• tq_scheduler - queued tasks in here execute whenever the scheduler runs (not executed at interrupt time)
• tq_timer - execution of these tasks is triggered by the timer tick (executed at interrupt time)
• tq_immediate - these tasks are run as soon as possible, either on return from a system call or when the scheduler is run (executed at interrupt time)
• tq_disk - not available to modules; used internally by memory management

This essentially leaves the first three for us.

4.5.2 The tq_timer and tq_immediate queues

Note that tasks in the tq_timer and tq_immediate queues are executed in interrupt time. This has important consequences. First, in interrupt mode, there is no process context so that
• the queued task cannot access user space
• the current pointer is not meaningful.

Second, if the queued task attempts to sleep or calls a function which can sleep, the queued task may hang. Note that functions which attempt to reserve system resources are quite likely to have a need to sleep (e.g. kmalloc).

An example of usage of tq_timer or tq_immediate (some_data is a placeholder for your own data):

    #include <linux/tqueue.h>
    #include <linux/interrupt.h>   /* mark_bh, IMMEDIATE_BH */

    static struct tq_struct my_task;

    void my_own_task(unsigned long ptr) {
      ... some valid code ...
    }

    void init_and_enqueue_my_task() {
      my_task.routine = (void *)&my_own_task;
      my_task.data = (void *)&some_data;
      queue_task(&my_task, &tq_immediate);
      /* for tq_immediate only: mark the bottom half so the queue gets run */
      mark_bh(IMMEDIATE_BH);
    }


4.5.3 The tq_scheduler queue

Tasks in the tq_scheduler queue are not executed in interrupt time, so the constraints mentioned at the start of section 4.5.2 do not apply. A further difference from tq_immediate and tq_timer emerged in the 2.4 kernel series - the related code must use schedule_task to put the task on the tq_scheduler queue, not queue_task. An example of usage of tq_scheduler follows.

    #include <linux/tqueue.h>

    static struct tq_struct my_task;
    static char my_msg[] = "<1>\nmy_special_task has executed.\n";
    DECLARE_WAIT_QUEUE_HEAD(my_wait);

    /* runs later in process context: print the message, wake the sleeper */
    void my_special_task(unsigned long ptr) {
      printk((void *)ptr);
      wake_up_interruptible(&my_wait);
    }

    void init_and_enqueue_my_task() {
      my_task.routine = (void *)&my_special_task;
      my_task.data = (void *)&my_msg;
      schedule_task(&my_task);
      interruptible_sleep_on(&my_wait);
    }

4.5.4 Your own Task Queues

In this case, since the queue is not predefined, the queue is declared by a macro in this style:

    DECLARE_TASK_QUEUE(my_tq);

The fields would be filled in as before and then the task would be enqueued by:

    queue_task(&my_task, &my_tq);

Unlike the predefined queues, this would need to be executed overtly by

    run_task_queue(&my_tq);

This leaves the question of how the task queue execution would be triggered. This is done by registering a task whose routine makes the prior run_task_queue call in one of the predefined queues.

4.6 Time Related Functionality

4.6.1 Current Time

The kernel keeps track of time via the timer interrupt, which in my IA-32 machine occurs 100 times per second (defined by HZ in /usr/src/linux/include/asm/param.h). The timer interrupt handler updates the value in jiffies. This is defined as an unsigned long volatile in /usr/src/linux/include/linux/sched.h. This 32-bit quantity is zeroed when your machine is powered up. The value in the variable jiffies is one method to measure time intervals in kernel code. If your driver needs the current time, the do_gettimeofday function is provided. It gives near microsecond resolution for most architectures. A usage example is shown in this fragment:

    struct timeval tv;
    do_gettimeofday(&tv);
    printk(KERN_ALERT "Current seconds = %08u.%06u\n",
           (unsigned int)(tv.tv_sec % 100000000), (unsigned int)(tv.tv_usec));

In addition to the timer interrupt driven jiffies value, most modern processors have acknowledged the need for a much finer time resolution. This will be based on the processor clock speed and made available in a special register. This is architecture dependent and we will describe the situation in the more recent and ubiquitous IA-32 (Pentium and later). The IA-32 has a 64-bit register called the time stamp counter (TSC) available via the assembly language instruction rdtsc. The TSC is also accessible via the C macros rdtsc and rdtscl described by:

    #include <asm/msr.h>
    rdtsc(low, high) - here low and high are each 32-bit variables holding the two parts of the 64-bit TSC
    rdtscl(low) - here low is just the low part of the 64-bit TSC


4.6.2 Delays

Long Delays

For pedagogical reasons, we'll start with a poor solution for creating a delay and move toward better. Each example will rely on this information:

    /* resolution on order of jiffies */
    unsigned long my_delay = desired_seconds * HZ;
    unsigned long target_time = jiffies + my_delay;

Since jiffies will eventually roll over and since Linux machines are relatively stable, target_time could roll over and be less than jiffies. Hence, a set of macros that accommodates roll over properly is provided in <linux/timer.h>. These are as follows:
• time_before(jiffies, target_time) - rollover corrected; evaluates as true, if jiffies < target_time
• time_after(jiffies, target_time) - rollover corrected; evaluates as true, if jiffies > target_time
• time_before_eq(jiffies, target_time) - rollover corrected; evaluates as true, if jiffies <= target_time
• time_after_eq(jiffies, target_time) - rollover corrected; evaluates as true, if jiffies >= target_time

Let's examine some delay possibilities. The first example is known as "busy waiting" and should be avoided. It is simply:

    while (time_before(jiffies, target_time))
        ;   /* the CPU stays busy in this loop, stalling any other work */

The fact that jiffies is declared as volatile forces it to be reread each time it is accessed in your code - so you won't be haunted by a cached value. However, jiffies is changed by the timer interrupt, so using this busy waiting loop while hardware interrupts were disabled would hang the machine.

Our second example removes both problems:

    while (time_before(jiffies, target_time))
        schedule();

This process calls the scheduler, so other tasks can run. However, this task remains in the execution queue, which creates a subtle problem. If this is the only runnable task, it will keep getting turns to run and it will keep calling the scheduler - but it's really doing nothing useful. On the other hand, if there are no tasks to run, the scheduler runs the 'idle' process, which provides these benefits:
• it reduces the CPU's workload, reducing temperature and increasing lifetime (e.g. a laptop will go longer before needing its battery recharged)
• the time spent waiting is accounted to the idle process rather than to the waiting process (maybe a non issue)

Our third example removes the prior problem as follows:

    current->state = TASK_INTERRUPTIBLE;
    schedule_timeout(my_delay);

Here, current is the task_struct of the executing process. The scheduler will avoid the task until the timeout has been reached.

Short Delays

The prior delays have resolution in the jiffies range. To get delays in the microsecond range, you can use the udelay function, which is based on the processor's bogomips measurement. Its prototype is

    #include <linux/delay.h>
    void udelay(unsigned long usecs);

For example,

    udelay(50);

would be a busy waiting loop that lasts for 50 microseconds. It is recommended that the argument passed to udelay not exceed 1000, because fast machines (i.e. with high bogomips) may encounter an overflow. A wrapper iterating around udelay is provided by mdelay, e.g.

    mdelay(70);

would provide a delay of 70 milliseconds.

4.6.3 Kernel Timers

Like task queues, kernel timers provide a way to defer execution of a task until a later time. The kernel timers are kept in a doubly linked list. The data structure for a timer is given in /usr/src/linux/include/linux/timer.h as:

    struct timer_list {
      struct timer_list *next; /* MUST be first element */
      struct timer_list *prev;
      unsigned long expires;
      unsigned long data;
      void (*function)(unsigned long);
    };

where 'expires' (3rd element) is the time in jiffies at which timeout occurs and '*function' (5th element) denotes the function to call at timeout. There are three important functions provided for manipulating timers:
• init_timer() - initializes the timer structure by zeroing the 'next' and 'prev' pointers
• add_timer() - inserts a timer structure into the global list of active timers
• del_timer() - for removing a timer from the list before its timeout has transpired
Note that when a timer times out, it is automatically removed from the list.

Here are the elements of a trivial example:

    #include <linux/kernel.h>   /* printk */
    #include <linux/sched.h>    /* jiffies, sleep and wake-up helpers */
    #include <linux/time.h>
    #include <linux/timer.h>
    #include <linux/wait.h>
    #include <linux/param.h>    /* HZ */

    static struct timer_list my_timer;
    DECLARE_WAIT_QUEUE_HEAD(my_wait);
    static char msg[] = "<1>\nmy_timer has timed out.\n";

    /* called at timeout; ptr carries the address stored in my_timer.data */
    void upon_my_timeout(unsigned long ptr)
    {
      printk((char *)ptr);
      wake_up_interruptible(&my_wait);
    }

    /* sleep until the timer fires, four seconds from now */
    void wait_four(void)
    {
      init_timer(&my_timer);
      my_timer.function = upon_my_timeout;
      my_timer.data = (unsigned long)msg;
      my_timer.expires = jiffies + (4 * HZ);
      add_timer(&my_timer);
      interruptible_sleep_on(&my_wait);
    }

The time-outs provided by such timers are unlike task queues in that the timer specifies precisely when the timeout function is to be executed; whereas with a task queue all you know is that the queued task will be performed at some later time. Occasionally the need for such functionality arises in a driver.
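The list handling described in this section can be modeled in user space. The following is only a sketch under stated assumptions, not the kernel's implementation. All sim_ names are hypothetical, the list is kept sorted by expiry for simplicity, counter rollover is ignored, and sim_run_timers stands in for whatever kernel code fires expired timers. The struct fields mirror the timer_list layout shown above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of the timer_list layout shown above. */
struct sim_timer {
    struct sim_timer *next;
    struct sim_timer *prev;
    unsigned long expires;
    unsigned long data;
    void (*function)(unsigned long);
};

/* Circular doubly linked list head; an empty list points at itself. */
static struct sim_timer sim_head = { &sim_head, &sim_head, 0, 0, NULL };

/* Models init_timer(): zero the link pointers. */
static void sim_init_timer(struct sim_timer *t) { t->next = t->prev = NULL; }

static int sim_timer_pending(const struct sim_timer *t) { return t->next != NULL; }

/* Models add_timer(): insert into the active list, sorted by expiry. */
static void sim_add_timer(struct sim_timer *t)
{
    struct sim_timer *pos = sim_head.next;

    assert(!sim_timer_pending(t));  /* adding twice would corrupt the list */
    assert(t->function != NULL);
    while (pos != &sim_head && pos->expires <= t->expires)
        pos = pos->next;
    t->next = pos;
    t->prev = pos->prev;
    pos->prev->next = t;
    pos->prev = t;
}

/* Models del_timer(): returns 1 if the timer was still in the list. */
static int sim_del_timer(struct sim_timer *t)
{
    if (!sim_timer_pending(t))
        return 0;
    t->prev->next = t->next;
    t->next->prev = t->prev;
    t->next = t->prev = NULL;
    return 1;
}

/* Fire every timer whose expiry is at or before 'now'. Each timer is
 * removed from the list before its function runs. */
static void sim_run_timers(unsigned long now)
{
    while (sim_head.next != &sim_head && sim_head.next->expires <= now) {
        struct sim_timer *t = sim_head.next;
        sim_del_timer(t);
        t->function(t->data);
    }
}

/* Example callback: remember the data word of the last timer that fired. */
static unsigned long sim_fired;
static void sim_record(unsigned long data) { sim_fired = data; }
```

In this model a timer that has fired is no longer pending, so a later sim_del_timer on it returns 0, matching the remark above that a timer is removed from the list automatically when it times out.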

4.7 Interrupt Handling

We'll have a short discussion here on the Linux approach to IA-32 style hardware interrupts, with the assumption that the reader is familiar with the 'traditional' irq -> PIC/APIC <-> CPU interrupt mechanism. The interrupt handler does not run within the context of a process and cannot transfer data to/from user space. The interrupt handler starts executing with hardware interrupts disabled, but can reenable them if it so wishes, masking irq's appropriately before the sti. Other than that, the interrupt handler is normal C code. The writer of that code needs to understand how the handler must interact with the hardware. For example, some devices will not issue another interrupt until the interrupt handler has acknowledged its response to the current irq signal, perhaps by clearing a specified I/O port bit.

4.7.1 The Bottom Half Mechanism

The handler needs to do its work quickly and efficiently. If there are subtasks that require significant time but are not urgent, they can be deferred until later. This is the so-called 'bottom-half' mechanism provided by Linux. There are, in fact, only 32 'genuine' bottom halves available, and the average Joe device driver writer won't have one assigned to his/her use. However, a driver without a genuine bottom half can employ the immediate queue to provide bottom half functionality. What one does is declare a task queue entry, initialize its routine field to point at the bottom half code you wrote, initialize its data field as needed, and then add the initialized entry to the immediate queue. Finally, mark_bh(IMMEDIATE_BH) is called to schedule the function which will later execute all the functions in the immediate queue.

4.7.2 An Example Bottom Half

Let's say we have an interrupt handler, my_irq_handler, to which we want to add a bottom half, say,

    void some_bottom_half();

We then take these steps:
• declare a task struct e.g.

    #include <linux/tqueue.h>
    static struct tq_struct some_bh;

• initialize the struct somewhere appropriate such as in init_module e.g.

    some_bh.routine = (void *)&some_bottom_half;
    some_bh.data = NULL;
    some_bh.sync = 0;

• add code to my_irq_handler to enqueue and mark the bottom half e.g.

    queue_task(&some_bh, &tq_immediate);
    mark_bh(IMMEDIATE_BH);

We note that the bottom half is actually taken care of by the tasklet mechanism in the 2.4 series kernel.

4.7.3 The Tasklet Alternative

The tasklet is quite similar to a task in a predefined task queue. Further, it runs in interrupt time so the constraints of section 4.5.2 apply. Other important properties of tasklets include these, copied from interrupt.h:
• If tasklet_schedule() is called, then tasklet is guaranteed to be executed on some cpu at least once after this.
• If the tasklet is already scheduled, but its excecution is still not started, it will be executed only once.
• If this tasklet is already running on another CPU (or schedule is called from tasklet itself), it is rescheduled for later.
• Tasklet is strictly serialized wrt itself, but not wrt another tasklets. If client needs some intertask synchronization he makes it with spinlocks.

The tasklet_struct follows:

    struct tasklet_struct
    {
      struct tasklet_struct *next;
      unsigned long state;
      atomic_t count;
      void (*func)(unsigned long);
      unsigned long data;
    };

4.7.4 A Tasklet Example

Let's say we have an interrupt handler, my_irq_handler, to which we want to add a bottom half via the tasklet mechanism, say,

    void some_bottom_half(unsigned long data);

(the argument matches the func field of tasklet_struct). We then take these steps:
• ensure you have the needed header

    #include <linux/interrupt.h>

• declare and initialize the tasklet_struct:

    DECLARE_TASKLET(some_bh, some_bottom_half, 0);

• add code to my_irq_handler to schedule the bottom half e.g.

    tasklet_schedule(&some_bh);

Note that you do not need to separately declare:

    struct tasklet_struct some_bh;

The DECLARE_TASKLET takes care of that.
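The "executed only once" property quoted from interrupt.h can be illustrated with a small user-space model. This is a sketch under simplifying assumptions, not kernel code. It models a single CPU, omits the count field and all atomic operations, and every sim_ name is hypothetical. The point is that a state bit marks a tasklet as scheduled, so scheduling it again before it has run has no further effect.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_STATE_SCHED 1UL     /* hypothetical "scheduled" bit */

/* Simplified model of tasklet_struct (no count field, no atomics). */
struct sim_tasklet {
    struct sim_tasklet *next;
    unsigned long state;
    void (*func)(unsigned long);
    unsigned long data;
};

static struct sim_tasklet *sim_pending;   /* list of scheduled tasklets */

/* Models tasklet_schedule(): queue the tasklet unless already queued. */
static void sim_tasklet_schedule(struct sim_tasklet *t)
{
    assert(t->func != NULL);
    if (t->state & SIM_STATE_SCHED)
        return;                 /* already scheduled, will run only once */
    t->state |= SIM_STATE_SCHED;
    t->next = sim_pending;
    sim_pending = t;
}

/* Models the later pass that runs every scheduled tasklet. */
static void sim_tasklet_action(void)
{
    struct sim_tasklet *list = sim_pending;

    sim_pending = NULL;
    while (list != NULL) {
        struct sim_tasklet *t = list;
        list = list->next;
        /* Clear the bit first so func may reschedule the tasklet. */
        t->state &= ~SIM_STATE_SCHED;
        t->func(t->data);
    }
}

/* Example bottom half: add its argument to a run counter. */
static unsigned long sim_runs;
static void sim_count(unsigned long inc) { sim_runs += inc; }
```

Scheduling the same tasklet twice before sim_tasklet_action() runs results in a single call to its function; scheduling it again afterwards results in one more call.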

4.8 The Process Scheduler

4.8.1 Introduction to the scheduler

The Linux kernel is currently not preemptive; code executing in kernel mode lies outside the realm of the scheduler, whose main job is to pick the next process to run. More specifically we can state that
• There is no mechanism by which a 'higher priority' process can preempt a kernel mode process, but the latter can decide to relinquish control.
• A kernel process can be interrupted by an interrupt/exception handler. Upon completion of the handler, control returns to the interrupted kernel process.
• The interrupt/exception handler is itself a kernel mode process and can be interrupted by an interrupt/exception handler.
• Kernel mode processes can 'turn off' external hardware interrupts as appropriate.

The scheduler for the current Linux 2.6 series has likely changed. Further, kernel processes can be configured as preemptive. We focus on the 2.4 series here. In any case, it makes a good first exposure to scheduling. The excellent O'Reilly book, Understanding the Linux Kernel by Bovet & Cesati, has a good chapter on this topic and, if you go to the O'Reilly web site (http://www.oreilly.com), you will find that the description of this book contains the chapter on the scheduler as a downloadable example.

Recall that a process can exist in one of a possible set of states. For Linux, these are
• TASK_RUNNING
• TASK_INTERRUPTIBLE
• TASK_UNINTERRUPTIBLE
• TASK_STOPPED
• TASK_ZOMBIE
To determine the next process to run, the scheduler chooses from among processes in the TASK_RUNNING state.

It is assumed here that the reader has had some exposure to the concepts used in schedulers, so that no time will be spent on general background. Further, we will not discuss scheduling for SMP machines. In this chapter, we will discuss:
• scheduling policies and preemption
• when does the scheduler execute?
• process goodness and priorities
• the epoch
• the scheduling algorithm

4.8.2 Scheduling Policies and Preemption

In <linux/sched.h>, we find the three Linux scheduling policies:

    #define SCHED_OTHER 0
    #define SCHED_FIFO 1
    #define SCHED_RR 2

SCHED_OTHER
Normal user tasks will run under the SCHED_OTHER policy. As such they are preemptible and run in a time sliced environment involving dynamic priorities, to be described later.

SCHED_FIFO
This is a (soft) real-time policy. A SCHED_FIFO process is not time sliced and will execute until one of the following conditions becomes true:
• it completes
• it blocks for I/O
• it relinquishes the CPU by calling sched_yield()
• a higher priority process enters the TASK_RUNNING state

SCHED_RR
This also is a (soft) real-time policy. However, SCHED_RR processes are subject to a time slice. A set of SCHED_RR processes having the same priority would be scheduled in a classic round robin fashion with respect to each other. Such a process will complete its time slice unless one of the following occurs:
• it completes
• it blocks for I/O
• it relinquishes the CPU by calling sched_yield()
• a higher priority process enters the TASK_RUNNING state
If it is preempted, it is placed at the head of its queue. Next time it runs it completes its preempted time slice. On the other hand, if the SCHED_RR process completes its time quantum, it is placed at the tail of its queue in the traditional round robin fashion.

4.8.3 When does the scheduler execute?

There are several ways that scheduler execution is triggered. These can be categorized as direct and indirect.

Direct - a call to schedule()
A process running in kernel mode can make a call to schedule(). If you look for references to schedule via the Linux cross reference web site, you'll see that it is called in many places, such as
• file system code
• memory management code
• network management code
• many drivers
A typical scenario is this:
• A piece of code needs to block.
• It puts itself on the appropriate wait queue.
• It changes its state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.
• It calls the scheduler.

Indirect - via need_resched = 1

The task struct has a field, need_resched, which is checked when returning to user mode from an interrupt or exception. If this field equals 1, schedule() is called. Hence any time a process sets need_resched to 1, this ensures that schedule() will be called in the near future. Setting need_resched to 1 occurs in the following cases:
• when sched_setscheduler() or sched_yield() is called
• when a process is awakened and has higher goodness than the current process
• when the current process exhausts its time quantum

4.8.4 The Epoch

From the scheduler's viewpoint, CPU time is divided into epochs as a means of encapsulating a group of runnable processes and their respective time quanta. A pseudocode overview follows:

epoch_init:
• set quantum value for every process, except TASK_ZOMBIE processes

start_epoch:
• choose highest goodness TASK_RUNNING process to run (goodness is discussed in Section 4.8.5)
• run that process until it blocks, is preempted, relinquishes the CPU voluntarily, or finishes its time quantum
• if all runnable processes have exhausted their quanta, go to epoch_init
• else go to start_epoch

4.8.5 Process Goodness and Priorities

To make a scheduling decision, Linux calculates what is called the 'goodness' of each process currently in the TASK_RUNNING state and then chooses the process having the highest value of goodness to run next. Linux uses other parameters called priorities as constituents of goodness and therefore was forced to invent a new term, 'goodness', rather than overloading the word 'priority'.

The goodness of SCHED_FIFO and SCHED_RR processes
The goodness of SCHED_FIFO and SCHED_RR processes lies in a range well above the goodness of any SCHED_OTHER process. Hence, a SCHED_OTHER process will never be chosen if there is an available (soft) real-time process.

Let's consider how the goodness of a process is calculated. For a SCHED_FIFO or SCHED_RR process,

    goodness = 1000 + rt_priority

where

    1 <= rt_priority <= 99

Note that rt_priority is a field in the task structure. The scheduler does not change rt_priority, so it is called a 'static' priority. However, under certain conditions, the rt_priority of a real-time process can be changed by system calls not discussed here.

The goodness of SCHED_OTHER processes

The SCHED_OTHER goodness is somewhat more complex, is dynamic, and (as expected) does not depend on rt_priority. In this case, the goodness depends on two other fields from the task structure:
• priority - both the base time quantum and base priority for the process
• counter - number of timer ticks (via irq0) left to the process before its time quantum expires

The goodness is given by

    goodness = priority + counter

Now the counter is decremented each timer tick, and when it reaches zero the process has exhausted its time quantum. At that point, the formula above is replaced by setting

    counter = 0

and

    goodness = 0.

The base time quantum is initialized to DEF_PRIORITY for process 0, where currently

    #define DEF_PRIORITY (20*HZ/100)

At the start of a new epoch, the new value of counter for each process is given by

    counter = priority + counter/2.

Hence if the process is one that has just exhausted its quantum (counter = 0), it gets a new counter value equal to its base quantum. However, if the process is, for example, in the TASK_INTERRUPTIBLE state, its counter will be enhanced at the start of every epoch. This gives some preference to I/O bound processes.

At a fork, the child always inherits the base time quantum of its parent. It is possible, albeit rare, for a process to change its base time quantum. As a result, most processes in the system have the same base time quantum, DEF_PRIORITY. Also at a fork, the counter of the parent is split in two, half going to the parent and half to the child.
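The goodness formulas and the per-epoch counter update given in this section can be tried out with a small user-space sketch. It follows the formulas as stated in the text rather than the kernel source. The type sim_task, the SIM_ policy constants and the function names are hypothetical, and integer division is assumed for counter/2.

```c
#include <assert.h>

/* Hypothetical stand-ins for the three scheduling policies. */
enum sim_policy { SIM_SCHED_OTHER, SIM_SCHED_FIFO, SIM_SCHED_RR };

/* Only the task fields used by the goodness formulas in the text. */
struct sim_task {
    enum sim_policy policy;
    int rt_priority;    /* 1..99, meaningful for real-time policies only */
    int priority;       /* base time quantum, in timer ticks */
    int counter;        /* ticks left in the current quantum */
};

/* Goodness as described in the text. */
static int sim_goodness(const struct sim_task *p)
{
    if (p->policy != SIM_SCHED_OTHER) {
        assert(p->rt_priority >= 1 && p->rt_priority <= 99);
        return 1000 + p->rt_priority;
    }
    if (p->counter == 0)
        return 0;       /* quantum exhausted */
    return p->priority + p->counter;
}

/* Counter update applied to a process at the start of a new epoch. */
static void sim_new_epoch(struct sim_task *p)
{
    p->counter = p->priority + p->counter / 2;
}
```

For a process with priority 20 that keeps sleeping through successive epochs, the counter grows 20, 30, 35, 37, ... and settles at 39 in this integer model. It stays below twice the base quantum, so the boost given to I/O bound processes is bounded.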

4.8.6 The Scheduling Algorithm

Starting with a very high level, coarse viewpoint, the scheduler does this:
• does some general housekeeping such as executing all interrupt handler bottom halves and deferred processes on task queues
• calculates the goodness for the processes in the TASK_RUNNING state to determine the next process to run
• turns the CPU over to the chosen process
In this section, we'll look more closely at this scenario. It will perhaps take several readings to assimilate.

This is a somewhat more detailed look at the scheduling algorithm. After understanding this you might go to the source code itself.
1. Run any deferred tasks in queue tq_scheduler.
2. Run any pending bottom halves.
3. Save current in local variable, prev.
4. If (prev is a SCHED_RR process), then assign it a new quantum and put it at the end of the run queue.
5. If (prev is in state TASK_INTERRUPTIBLE and has nonblocked, pending signals), then make its state TASK_RUNNING.
6. If (prev is not in the TASK_RUNNING state), then remove it from the run queue.
7. If the run queue is empty, point next at the idle_task. Otherwise, find the process in the run queue which has the highest goodness and reference that process with next.
   • If there is a tie for highest non zero goodness between prev and some other process, prev is chosen to save a context switch.
   • If all the runnable processes have zero goodness, this is the end of an epoch and a new quantum is assigned to all processes except TASK_ZOMBIE processes.
8. If (prev != next) then update the context switch statistics and perform a context switch from prev to next.
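Step 7 above, including its two special cases, can be sketched in user space as follows. This is a simplified model and not the kernel's schedule() function. It handles SCHED_OTHER processes only, uses an array in place of the run queue, and returns -1 to stand for the idle task. The names sim_proc, sim_goodness and sim_pick_next are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a SCHED_OTHER process. */
struct sim_proc {
    int runnable;   /* nonzero if in the TASK_RUNNING state */
    int priority;   /* base time quantum, in ticks */
    int counter;    /* ticks left in the current quantum */
};

static int sim_goodness(const struct sim_proc *p)
{
    return p->counter == 0 ? 0 : p->priority + p->counter;
}

/* Return the index of the next process to run, or -1 for the idle task.
 * prev is the index of the process that was running, or -1 if none. */
static int sim_pick_next(struct sim_proc *procs, int n, int prev)
{
    int i, g, next, best, any_runnable;

    assert(procs != NULL && n > 0 && prev < n);
    for (;;) {
        next = -1;
        best = 0;
        any_runnable = 0;
        /* Seed the search with prev, so that a tie for the highest
         * nonzero goodness goes to prev and saves a context switch. */
        if (prev >= 0 && procs[prev].runnable) {
            g = sim_goodness(&procs[prev]);
            if (g > 0) {
                next = prev;
                best = g;
            }
        }
        for (i = 0; i < n; i++) {
            if (!procs[i].runnable)
                continue;
            any_runnable = 1;
            g = sim_goodness(&procs[i]);
            if (g > best) {
                best = g;
                next = i;
            }
        }
        if (!any_runnable)
            return -1;          /* empty run queue: idle task */
        if (next >= 0)
            return next;
        /* Every runnable process has zero goodness: the epoch is over,
         * so give every process a new quantum and search again. */
        for (i = 0; i < n; i++) {
            assert(procs[i].priority > 0);  /* else no progress is made */
            procs[i].counter = procs[i].priority + procs[i].counter / 2;
        }
    }
}
```

Note that the end-of-epoch update is applied to all processes in the array, not only the runnable ones, which is how sleeping processes accumulate a larger counter.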
