This chapter describes the semaphore, shared memory, and message queue IPC mechanisms as implemented in the Linux 2.4 kernel. It is organized into four sections. The first three sections cover the interfaces and support functions for semaphores, message queues, and shared memory respectively. The last section describes a set of common functions and data structures that are shared by all three mechanisms.
The functions described in this section implement the user level semaphore mechanisms. Note that this implementation relies on the use of kernel spinlocks and kernel semaphores. To avoid confusion, the term "kernel semaphore" will be used in reference to kernel semaphores. All other uses of the word "semaphore" will be in reference to the user level semaphores.
The entire call to sys_semget() is protected by the global sem_ids.sem kernel semaphore.
In the case where a new set of semaphores must be created, the newary() function is called to create and initialize a new semaphore set. The ID of the new set is returned to the caller.
In the case where a key value is provided for an existing semaphore set, ipc_findkey() is invoked to look up the corresponding semaphore descriptor array index. The parameters and permissions of the caller are verified before returning the semaphore set ID.
For the IPC_INFO, SEM_INFO, and SEM_STAT commands, semctl_nolock() is called to perform the necessary functions.
For the GETALL, GETVAL, GETPID, GETNCNT, GETZCNT, IPC_STAT, SETVAL, and SETALL commands, semctl_main() is called to perform the necessary functions.
For the IPC_RMID and IPC_SET commands, semctl_down() is called to perform the necessary functions. Throughout both of these operations, the global sem_ids.sem kernel semaphore is held.
After validating the call parameters, the semaphore operations data is copied from user space to a temporary buffer. If a small temporary buffer is sufficient, then a stack buffer is used. Otherwise, a larger buffer is allocated. After copying in the semaphore operations data, the global semaphores spinlock is locked, and the user-specified semaphore set ID is validated. Access permissions for the semaphore set are also validated.
All of the user-specified semaphore operations are parsed. During this process, a count is maintained of all the operations that have the SEM_UNDO flag set. A decrease flag is set if any of the operations subtract from a semaphore value, and an alter flag is set if any of the semaphore values are modified (i.e. increased or decreased). The number of each semaphore to be modified is validated.
If SEM_UNDO was asserted for any of the semaphore operations, then the undo list for the current task is searched for an undo structure associated with this semaphore set. During this search, if the semaphore set ID of any of the undo structures is found to be -1, then freeundos() is called to free the undo structure and remove it from the list. If no undo structure is found for this semaphore set then alloc_undo() is called to allocate and initialize one.
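The undo list search described above, including the lazy cleanup of entries whose set ID is -1, can be modeled in user space. This is a minimal sketch and not the kernel code: struct undo is a cut-down stand-in for struct sem_undo, and lookup_undo() and push_undo() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for the kernel's struct sem_undo (assumption). */
struct undo {
    struct undo *proc_next; /* next entry on this task's list */
    int semid;              /* set ID, or -1 if the set was removed */
};

/*
 * Walk the per-task undo list looking for semid. Entries whose semid
 * is -1 are unlinked and freed on the way. The walk stops at the first
 * match, so stale entries behind the match are left for a later call.
 * Returns the matching entry, or NULL if there is none.
 */
static struct undo *lookup_undo(struct undo **head, int semid)
{
    struct undo **pp = head;

    while (*pp != NULL) {
        struct undo *un = *pp;

        if (un->semid == semid)
            return un;
        if (un->semid == -1) {
            *pp = un->proc_next; /* unlink the stale entry */
            free(un);
        } else {
            pp = &un->proc_next;
        }
    }
    return NULL;
}

/* Hypothetical helper: place a new entry at the head of the list. */
static struct undo *push_undo(struct undo **head, int semid)
{
    struct undo *un = malloc(sizeof(*un));

    assert(un != NULL);
    un->semid = semid;
    un->proc_next = *head;
    *head = un;
    return un;
}
```

When lookup_undo() returns NULL, the caller would allocate a fresh entry, which corresponds to the alloc_undo() step in the text.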
The try_atomic_semop() function is called with the do_undo parameter equal to 0 in order to execute the sequence of operations. The return value indicates whether the operations passed, failed, or were not executed because they need to block. Each of these cases is further described below:
The try_atomic_semop() function returns zero to indicate that all operations in the sequence succeeded. In this case, update_queue() is called to traverse the queue of pending semaphore operations for the semaphore set and awaken any sleeping tasks that no longer need to block. This completes the execution of the sys_semop() system call for this case.
If try_atomic_semop() returns a negative value, then a failure condition was encountered. In this case, none of the operations have been executed. This occurs when either a semaphore operation would cause an invalid semaphore value, or an operation marked IPC_NOWAIT is unable to complete. The error condition is then returned to the caller of sys_semop().
Before sys_semop() returns, a call is made to update_queue() to traverse the queue of pending semaphore operations for the semaphore set and awaken any sleeping tasks that no longer need to block.
The try_atomic_semop() function returns 1 to indicate that the sequence of semaphore operations was not executed because one of the semaphores would block. For this case, a new sem_queue element is initialized containing these semaphore operations. If any of these operations would alter the state of the semaphore, then the new queue element is added at the tail of the queue. Otherwise, the new queue element is added at the head of the queue.
The semsleeping element of the current task is set to indicate that the task is sleeping on this sem_queue element. The current task is marked as TASK_INTERRUPTIBLE, and the sleeper element of the sem_queue is set to identify this task as the sleeper. The global semaphore spinlock is then unlocked, and schedule() is called to put the current task to sleep.
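The placement rule for a blocked request (altering operations at the tail, non-altering operations at the head) can be sketched with a list that uses the same pointer-to-pointer back link as sem_queue, where *(q->prev) == q. The type and function names below are illustrative assumptions, not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for struct sem_queue (assumption). */
struct pending {
    struct pending *next;
    struct pending **prev; /* address of the pointer that points at us */
    int alter;             /* non-zero if the request alters a semaphore */
};

/* Stand-in for the sem_pending / sem_pending_last pair in sem_array. */
struct pending_list {
    struct pending *first;
    struct pending **last; /* address of the final next pointer */
};

static void list_init(struct pending_list *l)
{
    l->first = NULL;
    l->last = &l->first;
}

static void add_tail(struct pending_list *l, struct pending *q)
{
    q->next = NULL;
    q->prev = l->last;
    *l->last = q;
    l->last = &q->next;
}

static void add_head(struct pending_list *l, struct pending *q)
{
    q->next = l->first;
    q->prev = &l->first;
    l->first = q;
    if (q->next != NULL)
        q->next->prev = &q->next;
    else
        l->last = &q->next; /* list was empty, q is also the tail */
}

/* Altering requests queue at the tail, non-altering ones at the head. */
static void enqueue(struct pending_list *l, struct pending *q)
{
    if (q->alter)
        add_tail(l, q);
    else
        add_head(l, q);
}
```

Because prev holds the address of whichever pointer refers to the element, an element can be unlinked without knowing whether it is first in the list.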
When awakened, the task re-locks the global semaphore spinlock, determines why it was awakened, and how it should respond. The following cases are handled:
If the status element of the sem_queue structure is set to 1, then the task was awakened in order to retry the semaphore operations. Another call to try_atomic_semop() is made to execute the sequence of semaphore operations. If try_atomic_semop() returns 1, then the task must block again as described above. Otherwise, 0 is returned for success, or an appropriate error code is returned in case of failure. Before sys_semop() returns, current->semsleeping is cleared, and the sem_queue is removed from the queue. If any of the specified semaphore operations were altering operations (increase or decrease), then update_queue() is called to traverse the queue of pending semaphore operations for the semaphore set and awaken any sleeping tasks that no longer need to block.
If the status element of the sem_queue structure is NOT set to 1, and the sem_queue element has not been dequeued, then the task was awakened by an interrupt. In this case, the system call fails with EINTR. Before returning, current->semsleeping is cleared, and the sem_queue is removed from the queue. Also, update_queue() is called if any of the operations were altering operations.
If the status element of the sem_queue structure is NOT set to 1, and the sem_queue element has been dequeued, then the semaphore operations have already been executed by update_queue(). The queue status, which could be 0 for success or a negated error code for failure, becomes the return value of the system call.
The following structures are used specifically for semaphore support:
/* One sem_array data structure for each set of semaphores in the system. */
struct sem_array {
    struct kern_ipc_perm sem_perm;        /* permissions .. see ipc.h */
    time_t sem_otime;                     /* last semop time */
    time_t sem_ctime;                     /* last change time */
    struct sem *sem_base;                 /* ptr to first semaphore in array */
    struct sem_queue *sem_pending;        /* pending operations to be processed */
    struct sem_queue **sem_pending_last;  /* last pending operation */
    struct sem_undo *undo;                /* undo requests on this array */
    unsigned long sem_nsems;              /* no. of semaphores in array */
};

/* One semaphore structure for each semaphore in the system. */
struct sem {
    int semval;  /* current value */
    int sempid;  /* pid of last operation */
};

struct seminfo {
    int semmap;
    int semmni;
    int semmns;
    int semmnu;
    int semmsl;
    int semopm;
    int semume;
    int semusz;
    int semvmx;
    int semaem;
};

struct semid64_ds {
    struct ipc64_perm sem_perm;  /* permissions .. see ipc.h */
    __kernel_time_t sem_otime;   /* last semop time */
    unsigned long __unused1;
    __kernel_time_t sem_ctime;   /* last change time */
    unsigned long __unused2;
    unsigned long sem_nsems;     /* no. of semaphores in array */
    unsigned long __unused3;
    unsigned long __unused4;
};

/* One queue for each sleeping process in the system. */
struct sem_queue {
    struct sem_queue * next;      /* next entry in the queue */
    struct sem_queue ** prev;     /* previous entry in the queue, *(q->prev) == q */
    struct task_struct* sleeper;  /* this process */
    struct sem_undo * undo;       /* undo structure */
    int pid;                      /* process id of requesting process */
    int status;                   /* completion status of operation */
    struct sem_array * sma;       /* semaphore array for operations */
    int id;                       /* internal sem id */
    struct sembuf * sops;         /* array of pending operations */
    int nsops;                    /* number of operations */
    int alter;                    /* operation will alter semaphore */
};

/* semop system calls takes an array of these. */
struct sembuf {
    unsigned short sem_num;  /* semaphore index in array */
    short sem_op;            /* semaphore operation */
    short sem_flg;           /* operation flags */
};

/* Each task has a list of undo requests. They are executed automatically
 * when the process exits.
 */
struct sem_undo {
    struct sem_undo * proc_next;  /* next entry on this process */
    struct sem_undo * id_next;    /* next entry on this semaphore set */
    int semid;                    /* semaphore set identifier */
    short * semadj;               /* array of adjustments, one per semaphore */
};
The following functions are used specifically in support of semaphores:
newary() relies on the ipc_alloc() function to allocate the memory required for the new semaphore set. It allocates enough memory for the semaphore set descriptor and for each of the semaphores in the set. The allocated memory is cleared, and the address of the first element of the semaphore set descriptor is passed to ipc_addid(). ipc_addid() reserves an array entry for the new semaphore set descriptor and initializes the (struct kern_ipc_perm) data for the set. The global used_sems variable is updated by the number of semaphores in the new set and the initialization of the (struct kern_ipc_perm) data for the new set is completed. Other initialization performed for this set is listed below:
The sem_base element for the set is initialized to the address immediately following the (struct sem_array) portion of the newly allocated data. This corresponds to the location of the first semaphore in the set.
The sem_pending queue is initialized as empty.
All of the operations following the call to ipc_addid() are performed while holding the global semaphores spinlock. After unlocking the global semaphores spinlock, newary() calls ipc_buildid() (via sem_buildid()). This function uses the index of the semaphore set descriptor to create a unique ID, which is then returned to the caller of newary().
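The single-allocation layout used for a new set, with the semaphores stored immediately after the descriptor and sem_base pointing at them, can be sketched as follows. This is a user-space illustration only: calloc() stands in for ipc_alloc() plus clearing, the two struct types are reduced models, and alloc_set() is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced models of struct sem and struct sem_array (assumptions). */
struct sem_model {
    int semval;
    int sempid;
};

struct sem_array_model {
    struct sem_model *sem_base; /* first semaphore in the set */
    unsigned long sem_nsems;    /* number of semaphores in the set */
};

/*
 * Allocate the descriptor and all semaphores as one zeroed block and
 * point sem_base at the memory that follows the descriptor.
 */
static struct sem_array_model *alloc_set(unsigned long nsems)
{
    size_t size = sizeof(struct sem_array_model)
                + nsems * sizeof(struct sem_model);
    struct sem_array_model *sma = calloc(1, size);

    if (sma == NULL)
        return NULL;
    sma->sem_base = (struct sem_model *)&sma[1];
    sma->sem_nsems = nsems;
    return sma;
}
```

One consequence of this layout is that a single free of the descriptor releases the semaphores as well.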
freeary() is called by semctl_down() to perform the functions listed below. It is called with the global semaphores spinlock locked and it returns with the spinlock unlocked.
semctl_down() provides the IPC_RMID and IPC_SET operations of the semctl() system call. The semaphore set ID and the access permissions are verified prior to either of these operations, and in either case, the global semaphore spinlock is held throughout the operation.
The IPC_RMID operation calls freeary() to remove the semaphore set.
The IPC_SET operation updates the uid, gid, mode, and ctime elements of the semaphore set.
semctl_nolock() is called by sys_semctl() to perform the IPC_INFO, SEM_INFO and SEM_STAT functions.
IPC_INFO and SEM_INFO cause a temporary seminfo buffer to be initialized and loaded with unchanging semaphore statistical data. Then, while holding the global sem_ids.sem kernel semaphore, the semusz and semaem elements of the seminfo structure are updated according to the given command (IPC_INFO or SEM_INFO). The return value of the system call is set to the maximum semaphore set ID.
SEM_STAT causes a temporary semid64_ds buffer to be initialized. The global semaphore spinlock is then held while copying the sem_otime, sem_ctime, and sem_nsems values into the buffer. This data is then copied to user space.
semctl_main() is called by sys_semctl() to perform many of the supported functions, as described in the subsections below. Prior to performing any of the following operations, semctl_main() locks the global semaphore spinlock and validates the semaphore set ID and the permissions. The spinlock is released before returning.
The GETALL operation loads the current semaphore values into a temporary kernel buffer and copies them out to user space. The small stack buffer is used if the semaphore set is small. Otherwise, the spinlock is temporarily dropped in order to allocate a larger buffer. The spinlock is held while copying the semaphore values into the temporary buffer.
The SETALL operation copies semaphore values from user space into a temporary buffer, and then into the semaphore set. The spinlock is dropped while copying the values from user space into the temporary buffer, and while verifying reasonable values. If the semaphore set is small, then a stack buffer is used, otherwise a larger buffer is allocated. The spinlock is regained and held while the following operations are performed on the semaphore set:
The sem_ctime value for the semaphore set is set.
In the IPC_STAT operation, the sem_otime, sem_ctime, and sem_nsems values are copied into a stack buffer. The data is then copied to user space after dropping the spinlock.
For GETVAL in the non-error case, the return value for the system call is set to the value of the specified semaphore.
For GETPID in the non-error case, the return value for the system call is set to the pid associated with the last operation on the semaphore.
For GETNCNT in the non-error case, the return value for the system call is set to the number of processes waiting for the semaphore value to increase, that is, processes blocked on an operation that would drive the value below zero. This number is calculated by the count_semncnt() function.
For GETZCNT in the non-error case, the return value for the system call is set to the number of processes waiting on the semaphore being set to zero. This number is calculated by the count_semzcnt() function.
After validating the new semaphore value, the following functions are performed:
The sem_ctime value for the semaphore set is updated.
count_semncnt() counts the number of tasks waiting for the value of a semaphore to increase, that is, tasks blocked on an operation that would drive the value below zero.
count_semzcnt() counts the number of tasks waiting on the value of a semaphore to be zero.
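Both counters can be modeled as a walk over the pending queue that inspects every queued operation. The sketch below is a user-space approximation: struct waiter is a reduced stand-in for struct sem_queue, count_waiting() is a hypothetical name, and skipping operations marked IPC_NOWAIT (on the grounds that they never sleep) is an assumption of this sketch. It counts matching operations, which equals the number of waiting tasks only when each task has at most one matching operation.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Reduced stand-in for struct sem_queue (assumption). */
struct waiter {
    struct waiter *next;
    const struct sembuf *sops; /* operations this task is blocked on */
    int nsops;
};

/*
 * Count queued operations on semaphore semnum that are waiting either
 * for the value to reach zero (want_zero != 0) or for it to increase
 * enough for a decrement to succeed (want_zero == 0).
 */
static int count_waiting(const struct waiter *q, unsigned short semnum,
                         int want_zero)
{
    int count = 0;

    for (; q != NULL; q = q->next) {
        for (int i = 0; i < q->nsops; i++) {
            const struct sembuf *op = &q->sops[i];
            int matches = want_zero ? (op->sem_op == 0)
                                    : (op->sem_op < 0);

            if (op->sem_num == semnum && matches &&
                !(op->sem_flg & IPC_NOWAIT))
                count++;
        }
    }
    return count;
}
```

Calling count_waiting() with want_zero set to 0 corresponds to the GETNCNT figure, and with want_zero set to 1 to the GETZCNT figure.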
update_queue() traverses the queue of pending semops for a semaphore set and calls try_atomic_semop() to determine which sequences of semaphore operations would succeed. If the status of the queue element indicates that blocked tasks have already been awakened, then the queue element is skipped over. For other elements of the queue, the q->alter flag is passed as the undo parameter to try_atomic_semop(), indicating that any altering operations should be undone before returning.
If the sequence of operations would block, then update_queue() returns without making any changes.
A sequence of operations can fail if one of the semaphore operations would cause an invalid semaphore value, or an operation marked IPC_NOWAIT is unable to complete. In such a case, the task that is blocked on the sequence of semaphore operations is awakened, and the queue status is set with an appropriate error code. The queue element is also dequeued.
If the sequence of operations is non-altering, then they would have passed a zero value as the undo parameter to try_atomic_semop(). If these operations succeeded, then they are considered complete and are removed from the queue. The blocked task is awakened, and the queue element status is set to indicate success.
If the sequence of operations would alter the semaphore values, but can succeed, then sleeping tasks that no longer need to be blocked are awakened. The queue status is set to 1 to indicate that the blocked task has been awakened. The operations have not been performed, so the queue element is not removed from the queue. The semaphore operations would be executed by the awakened task.
try_atomic_semop() is called by sys_semop() and update_queue() to determine if a sequence of semaphore operations will all succeed. It determines this by attempting to perform each of the operations.
If a blocking operation is encountered, then the process is aborted and all operations are reversed. -EAGAIN is returned if IPC_NOWAIT is set. Otherwise 1 is returned to indicate that the sequence of semaphore operations is blocked.
If a semaphore value is adjusted beyond system limits, then all operations are reversed, and -ERANGE is returned.
If all operations in the sequence succeed, and the do_undo parameter is non-zero, then all operations are reversed, and 0 is returned. If the do_undo parameter is zero, then all operations succeeded and remain in force, and the sem_otime field of the semaphore set is updated.
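The all-or-nothing behaviour described above can be modeled in user space on a plain array of semaphore values. This is a sketch under stated assumptions rather than the kernel routine: try_ops() is a hypothetical name, MODEL_SEMVMX is an assumed upper limit, and bookkeeping such as sempid, sem_otime, and the undo adjustments is left out.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

#define MODEL_SEMVMX 32767 /* assumed maximum semaphore value */

/*
 * Try to apply nsops operations to semval[] as a unit.
 * Returns 0 if all succeeded, 1 if the sequence would block,
 * -EAGAIN if it would block and the blocking operation has IPC_NOWAIT,
 * and -ERANGE if a value would exceed the limit. Unless the result is
 * 0 and do_undo is 0, every applied operation is reversed.
 */
static int try_ops(int *semval, const struct sembuf *sops, int nsops,
                   int do_undo)
{
    int result = 0;
    int i;

    for (i = 0; i < nsops; i++) {
        const struct sembuf *op = &sops[i];
        int cur = semval[op->sem_num];

        if (op->sem_op == 0 && cur != 0) {
            /* wait-for-zero on a non-zero semaphore would block */
            result = (op->sem_flg & IPC_NOWAIT) ? -EAGAIN : 1;
            break;
        }
        cur += op->sem_op;
        if (cur < 0) {
            /* decrement below zero would block */
            result = (op->sem_flg & IPC_NOWAIT) ? -EAGAIN : 1;
            break;
        }
        if (cur > MODEL_SEMVMX) {
            result = -ERANGE;
            break;
        }
        semval[op->sem_num] = cur;
    }

    if (result != 0 || do_undo) {
        /* reverse the operations applied so far, newest first */
        while (--i >= 0)
            semval[sops[i].sem_num] -= sops[i].sem_op;
    }
    return result;
}
```

Calling try_ops() with do_undo set to 1 gives the dry-run behaviour that the text attributes to update_queue() for altering requests: the caller learns whether the sequence can succeed while the values stay unchanged.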
sem_revalidate() is called when the global semaphores spinlock has been temporarily dropped and needs to be locked again. It is called by semctl_main() and alloc_undo(). It validates the semaphore ID and permissions and on success, returns with the global semaphores spinlock locked.
freeundos() traverses the process undo list in search of the desired undo structure. If found, the undo structure is removed from the list and freed. A pointer to the next undo structure on the process list is returned.
alloc_undo() expects to be called with the global semaphores spinlock locked. In the case of an error, it returns with it unlocked.
The global semaphores spinlock is unlocked, and kmalloc() is called to allocate sufficient memory for both the sem_undo structure, and also an array of one adjustment value for each semaphore in the set. On success, the global spinlock is regained with a call to sem_revalidate().
The new sem_undo structure is then initialized, and the address of this structure is placed at the address provided by the caller. The new undo structure is then placed at the head of the undo list for the current task.
sem_exit() is called by do_exit(), and is responsible for executing all of the undo adjustments for the exiting task.
If the current process was blocked on a semaphore, then it is removed from the sem_queue list while holding the global semaphores spinlock.
The undo list for the current task is then traversed. The global semaphores spinlock is acquired and released around the processing of each element of the list. The following operations are performed for each of the undo elements:
The sem_otime parameter of the semaphore set is updated.
When the processing of the list is complete, the current->semundo value is cleared.
The entire call to sys_msgget() is protected by the global message queue semaphore ( msg_ids.sem).
In the case where a new message queue must be created, the newque() function is called to create and initialize a new message queue, and the new queue ID is returned to the caller.
If a key value is provided for an existing message queue, then ipc_findkey() is called to look up the corresponding index in the global message queue descriptor array (msg_ids.entries). The parameters and permissions of the caller are verified before returning the message queue ID. The lookup operation and verification are performed while the global message queue spinlock (msg_ids.ary) is held.
The parameters passed to sys_msgctl() are: a message queue ID (msqid), the operation (cmd), and a pointer to a user space buffer of type msqid_ds (buf). Six operations are provided in this function: IPC_INFO, MSG_INFO, IPC_STAT, MSG_STAT, IPC_SET and IPC_RMID. The message queue ID and the operation parameters are validated; then, the operation (cmd) is performed as follows:
For IPC_INFO (or MSG_INFO), the global message queue information is copied to user space.
For IPC_STAT (or MSG_STAT), a temporary buffer of type struct msqid64_ds is initialized and the global message queue spinlock is locked. After verifying the access permissions of the calling process, the message queue information associated with the message queue ID is loaded into the temporary buffer, the global message queue spinlock is unlocked, and the contents of the temporary buffer are copied out to user space by copy_msqid_to_user().
For IPC_SET, the user data is copied in via copy_msqid_from_user(). The global message queue semaphore and spinlock are obtained, and both are released at the end of the operation. After the message queue ID and the current process access permissions are validated, the message queue information is updated with the user provided data. Later, expunge_all() and ss_wakeup() are called to wake up all processes sleeping on the receiver and sender waiting queues of the message queue. This is because some receivers may now be excluded by stricter access permissions and some senders may now be able to send the message due to an increased queue size.
For IPC_RMID, the global message queue semaphore is obtained and the global message queue spinlock is locked. After validating the message queue ID and the current task access permissions, freeque() is called to free the resources related to the message queue ID. The global message queue semaphore and spinlock are released.
sys_msgsnd() receives as parameters a message queue ID (msqid), a pointer to a user buffer of type struct msgbuf (msgp), the size of the message to be sent (msgsz), and a flag indicating wait vs. not wait (msgflg). There are two task waiting queues and one message waiting queue associated with the message queue ID. If there is a task in the receiver waiting queue that is waiting for this message, then the message is delivered directly to the receiver, and the receiver is awakened. Otherwise, if there is enough space available in the message waiting queue, the message is saved in this queue. As a last resort, the sending task enqueues itself on the sender waiting queue. A more in-depth discussion of the operations performed by sys_msgsnd() follows:
The user message is loaded into a temporary object msg of type struct msg_msg. The message type and message size fields of msg are also initialized.
If the message cannot be queued immediately and IPC_NOWAIT is specified in msgflg, the global message queue spinlock is unlocked, the memory resources for the message are freed, and EAGAIN is returned.
Inserts msg into the message waiting queue (msq->q_messages). Updates the q_cbytes and the q_qnum fields of the message queue descriptor, as well as the global variables msg_bytes and msg_hdrs, which indicate the total number of bytes used for messages and the total number of messages system wide.
Updates the q_lspid and the q_stime fields of the message queue descriptor and releases the global message queue spinlock.
The sys_msgrcv() function receives as parameters a message queue ID (msqid), a pointer to a user buffer of type struct msgbuf (msgp), the desired message size (msgsz), the message type (msgtyp), and the flags (msgflg). It searches the message waiting queue associated with the message queue ID, finds the first message in the queue which matches the request type, and copies it into the given user buffer. If no such message is found in the message waiting queue, the requesting task is enqueued into the receiver waiting queue until the desired message is available. A more in-depth discussion of the operations performed by sys_msgrcv() follows:
First, convert_mode() is invoked to determine the search mode from msgflg and msgtyp. sys_msgrcv() then locks the global message queue spinlock and obtains the message queue descriptor associated with the message queue ID. If no such message queue exists, it returns EINVAL.
The message waiting queue is searched for the first message that matches msgtyp.
If a matching message is found but it is larger than msgsz and msgflg does not allow truncation (MSG_NOERROR is not set), sys_msgrcv() unlocks the global message queue spinlock and returns E2BIG.
If no matching message is found, msgflg is checked. If IPC_NOWAIT is set, then the global message queue spinlock is unlocked and ENOMSG is returned. Otherwise, the receiver is enqueued on the receiver waiting queue as follows:
A msg_receiver data structure msr is allocated and is added to the head of the waiting queue.
The r_tsk field of msr is set to the current task.
The r_msgtype and r_mode fields are initialized with the desired message type and mode respectively.
If msgflg indicates MSG_NOERROR, then the r_maxsize field of msr is set to INT_MAX, so that a message of any size is accepted; otherwise it is set to the value of msgsz.
The r_msg field is initialized to indicate that no message has been received yet.
After the receiver is awakened, the r_msg field of msr is checked. This field is used to store the pipelined message or, in the case of an error, to store the error status. If the r_msg field is filled with the desired message, then go to the last step. Otherwise, the global message queue spinlock is locked again.
The r_msg field is re-checked to see if the message was received while waiting for the spinlock. If the message has been received, the last step occurs.
If the r_msg field remains unchanged, then the task was awakened in order to retry. In this case, msr is dequeued. If there is a signal pending for the task, then the global message queue spinlock is unlocked and EINTR is returned. Otherwise, the function needs to go back and retry.
If the r_msg field shows that an error occurred while sleeping, the global message queue spinlock is unlocked and the error is returned.
If the address of the user buffer msp is valid, the message type is loaded into the mtype field of msp, and store_msg() is invoked to copy the message contents to the mtext field of msp. Finally, the memory for the message is freed by free_msg().
Data structures for message queues are defined in msg.c.
/* one msq_queue structure for each present queue on the system */
struct msg_queue {
    struct kern_ipc_perm q_perm;
    time_t q_stime;            /* last msgsnd time */
    time_t q_rtime;            /* last msgrcv time */
    time_t q_ctime;            /* last change time */
    unsigned long q_cbytes;    /* current number of bytes on queue */
    unsigned long q_qnum;      /* number of messages in queue */
    unsigned long q_qbytes;    /* max number of bytes on queue */
    pid_t q_lspid;             /* pid of last msgsnd */
    pid_t q_lrpid;             /* last receive pid */
    struct list_head q_messages;
    struct list_head q_receivers;
    struct list_head q_senders;
};

/* one msg_msg structure for each message */
struct msg_msg {
    struct list_head m_list;
    long m_type;
    int m_ts;                  /* message text size */
    struct msg_msgseg* next;
    /* the actual message follows immediately */
};

/* message segment for each message */
struct msg_msgseg {
    struct msg_msgseg* next;
    /* the next part of the message follows immediately */
};

/* one msg_sender for each sleeping sender */
struct msg_sender {
    struct list_head list;
    struct task_struct* tsk;
};

/* one msg_receiver structure for each sleeping receiver */
struct msg_receiver {
    struct list_head r_list;
    struct task_struct* r_tsk;
    int r_mode;
    long r_msgtype;
    long r_maxsize;
    struct msg_msg* volatile r_msg;
};

struct msqid64_ds {
    struct ipc64_perm msg_perm;
    __kernel_time_t msg_stime;   /* last msgsnd time */
    unsigned long __unused1;
    __kernel_time_t msg_rtime;   /* last msgrcv time */
    unsigned long __unused2;
    __kernel_time_t msg_ctime;   /* last change time */
    unsigned long __unused3;
    unsigned long msg_cbytes;    /* current number of bytes on queue */
    unsigned long msg_qnum;      /* number of messages in queue */
    unsigned long msg_qbytes;    /* max number of bytes on queue */
    __kernel_pid_t msg_lspid;    /* pid of last msgsnd */
    __kernel_pid_t msg_lrpid;    /* last receive pid */
    unsigned long __unused4;
    unsigned long __unused5;
};

struct msqid_ds {
    struct ipc_perm msg_perm;
    struct msg *msg_first;          /* first message on queue,unused */
    struct msg *msg_last;           /* last message in queue,unused */
    __kernel_time_t msg_stime;      /* last msgsnd time */
    __kernel_time_t msg_rtime;      /* last msgrcv time */
    __kernel_time_t msg_ctime;      /* last change time */
    unsigned long msg_lcbytes;      /* Reuse junk fields for 32 bit */
    unsigned long msg_lqbytes;      /* ditto */
    unsigned short msg_cbytes;      /* current number of bytes on queue */
    unsigned short msg_qnum;        /* number of messages in queue */
    unsigned short msg_qbytes;      /* max number of bytes on queue */
    __kernel_ipc_pid_t msg_lspid;   /* pid of last msgsnd */
    __kernel_ipc_pid_t msg_lrpid;   /* last receive pid */
};

struct msq_setbuf {
    unsigned long qbytes;
    uid_t uid;
    gid_t gid;
    mode_t mode;
};
newque() allocates the memory for a new message queue descriptor ( struct msg_queue) and then calls ipc_addid(), which reserves a message queue array entry for the new message queue descriptor. The message queue descriptor is initialized as follows:
The q_stime and q_rtime fields of the message queue descriptor are initialized as 0. The q_ctime field is set to be CURRENT_TIME.
The maximum number of bytes allowed on the queue (q_qbytes) is set to be MSGMNB, and the number of bytes currently used by the queue (q_cbytes) is initialized as 0.
The message waiting queue (q_messages), the receiver waiting queue (q_receivers), and the sender waiting queue (q_senders) are each initialized as empty.
All the operations following the call to ipc_addid() are performed while holding the global message queue spinlock. After unlocking the spinlock, newque() calls msg_buildid(), which maps directly to ipc_buildid(). ipc_buildid() uses the index of the message queue descriptor to create a unique message queue ID that is then returned to the caller of newque().
When a message queue is going to be removed, the freeque() function is called. This function assumes that the global message queue spinlock is already locked by the calling function. It frees all kernel resources associated with that message queue. First, it calls ipc_rmid() (via msg_rmid()) to remove the message queue descriptor from the array of global message queue descriptors. Then it calls expunge_all() to wake up all receivers and ss_wakeup() to wake up all senders sleeping on this message queue. The global message queue spinlock is then released. Finally, all messages stored in this message queue are freed and the memory for the message queue descriptor is freed.
ss_wakeup() wakes up all the tasks waiting in the given message sender waiting queue. If this function is called by freeque(), then all senders in the queue are dequeued.
ss_add() receives as parameters a message queue descriptor and a message sender data structure. It fills the tsk field of the message sender data structure with the current process, changes the status of current process to TASK_INTERRUPTIBLE, then inserts the message sender data structure at the head of the sender waiting queue of the given message queue.
If the given message sender data structure (mss) is still in the associated sender waiting queue, then ss_del() removes mss from the queue.
expunge_all() receives as parameters a message queue descriptor (msq) and an integer value (res) indicating the reason for waking up the receivers. For each sleeping receiver associated with msq, the r_msg field is set to the indicated wakeup reason (res), and the associated receiving task is awakened. This function is called when a message queue is removed or a message control operation has been performed.
When a process sends a message, the sys_msgsnd() function first invokes the load_msg() function to load the message from user space to kernel space. The message is represented in kernel memory as a linked list of data blocks. Associated with the first data block is a msg_msg structure that describes the overall message. The data block associated with the msg_msg structure is limited to a size of DATA_MSG_LEN. The data block and the structure are allocated in one contiguous memory block that can be as large as one page in memory. If the full message will not fit into this first data block, then additional data blocks are allocated and are organized into a linked list. These additional data blocks are limited to a size of DATA_SEG_LEN, and each includes an associated msg_msgseg structure. The msg_msgseg structure and the associated data block are allocated in one contiguous memory block that can be as large as one page in memory. This function returns the address of the new msg_msg structure on success.
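The way a message is split across a first block and follow-on segments can be illustrated with a small sizing function. This is only a sketch: blocks_needed() is a hypothetical helper, and the two capacities are passed in as parameters because the actual values of DATA_MSG_LEN and DATA_SEG_LEN depend on the page size and structure sizes of the target system.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return how many memory blocks a message of len bytes occupies when
 * the first block holds at most first_cap bytes of data (the role of
 * DATA_MSG_LEN) and each further segment holds at most seg_cap bytes
 * (the role of DATA_SEG_LEN).
 */
static size_t blocks_needed(size_t len, size_t first_cap, size_t seg_cap)
{
    size_t blocks = 1; /* the block carrying the msg_msg header always exists */

    assert(first_cap > 0 && seg_cap > 0);
    if (len <= first_cap)
        return blocks;

    len -= first_cap;
    blocks += (len + seg_cap - 1) / seg_cap; /* round up */
    return blocks;
}
```

For example, with assumed capacities of 4048 and 4088 bytes, a message of 4049 bytes needs two blocks: the first block is filled and a single extra byte goes into one segment.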
The store_msg() function is called by sys_msgrcv() to reassemble a received message into the user space buffer provided by the caller. The data described by the msg_msg structure and any msg_msgseg structures are sequentially copied to the user space buffer.
The free_msg() function releases the memory for a message data structure msg_msg, and the message segments.
convert_mode() is called by sys_msgrcv(). It receives as parameters the address of the specified message type (msgtyp) and a flag (msgflg). It returns the search mode to the caller based on the values of msgtyp and msgflg. If the value of msgtyp is zero, then SEARCH_ANY is returned. If msgtyp is less than 0, then msgtyp is set to its absolute value and SEARCH_LESSEQUAL is returned. If MSG_EXCEPT is specified in msgflg, then SEARCH_NOTEQUAL is returned. Otherwise SEARCH_EQUAL is returned.
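The decision order described above can be written out as a short sketch. The numeric values of the search modes and the SKETCH_MSG_EXCEPT constant are assumptions made for illustration; only the order of the tests is taken from the description.

```c
/* Search modes named in the text; the numeric values are arbitrary here. */
enum search_mode {
    SEARCH_ANY = 1,
    SEARCH_EQUAL,
    SEARCH_NOTEQUAL,
    SEARCH_LESSEQUAL
};

/* Stand-in for the MSG_EXCEPT flag bit; the value is an assumption. */
#define SKETCH_MSG_EXCEPT 020000

/* Map (message type, flags) to a search mode; may rewrite *msgtyp. */
static int convert_mode(long *msgtyp, int msgflg)
{
    if (*msgtyp == 0)
        return SEARCH_ANY;
    if (*msgtyp < 0) {
        *msgtyp = -*msgtyp;          /* keep the absolute value */
        return SEARCH_LESSEQUAL;
    }
    if (msgflg & SKETCH_MSG_EXCEPT)
        return SEARCH_NOTEQUAL;
    return SEARCH_EQUAL;
}
```

Note that the MSG_EXCEPT test is reached only for a positive message type, so the flag has no effect when the type is zero or negative.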
The testmsg() function checks whether a message meets the criteria specified by the receiver. It returns 1 if one of the following conditions is true:
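A minimal sketch of the matching rule, assuming it pairs each search mode returned by convert_mode() with the comparison its name suggests. The struct message type and the numeric mode values are hypothetical stand-ins, not the kernel definitions.

```c
/* Search modes; the numeric values are arbitrary in this sketch. */
enum search_mode {
    SEARCH_ANY = 1,
    SEARCH_EQUAL,
    SEARCH_NOTEQUAL,
    SEARCH_LESSEQUAL
};

/* Hypothetical stand-in for the message header: only the type matters. */
struct message {
    long m_type;
};

/* Return 1 if msg satisfies the receiver's (type, mode) request, else 0. */
static int testmsg(const struct message *msg, long type, int mode)
{
    switch (mode) {
    case SEARCH_ANY:
        return 1;                       /* any message will do */
    case SEARCH_LESSEQUAL:
        return msg->m_type <= type;
    case SEARCH_EQUAL:
        return msg->m_type == type;
    case SEARCH_NOTEQUAL:
        return msg->m_type != type;
    }
    return 0;                           /* unknown mode never matches */
}
```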
pipelined_send() allows a process to directly send a message to a waiting receiver rather than deposit the message in the associated message waiting queue. The testmsg() function is invoked to find the first receiver which is waiting for the given message. If found, the waiting receiver is removed from the receiver waiting queue, and the associated receiving task is awakened. The message is stored in the r_msg field of the receiver, and 1 is returned. In the case where no receiver is waiting for the message, 0 is returned.
In the process of searching for a receiver, potential receivers may be found which have requested a size that is too small for the given message. Such receivers are removed from the queue, and are awakened with an error status of E2BIG, which is stored in the r_msg field. The search then continues until either a valid receiver is found, or the queue is exhausted.
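The search loop can be sketched in user space as follows. This is an illustration under stated assumptions: the receiver queue is modeled as a plain array, waking a task is reduced to clearing a waiting flag, and the error status is kept in a separate hypothetical r_err field rather than being encoded in r_msg. The helper matches() and the structure layouts are likewise hypothetical.

```c
#include <stddef.h>
#include <errno.h>

enum search_mode {
    SEARCH_ANY = 1,
    SEARCH_EQUAL,
    SEARCH_NOTEQUAL,
    SEARCH_LESSEQUAL
};

struct message {
    long   m_type;
    size_t m_ts;                  /* message text size in bytes */
};

struct receiver {
    int    waiting;               /* 1 while still on the queue */
    int    mode;                  /* requested search mode */
    long   msgtype;               /* requested message type */
    size_t maxsize;               /* largest message the caller accepts */
    const struct message *r_msg;  /* delivered message, or NULL */
    int    r_err;                 /* hypothetical: error status, 0 if none */
};

/* Same matching rule as the testmsg() description. */
static int matches(const struct message *msg, long type, int mode)
{
    switch (mode) {
    case SEARCH_ANY:       return 1;
    case SEARCH_LESSEQUAL: return msg->m_type <= type;
    case SEARCH_EQUAL:     return msg->m_type == type;
    case SEARCH_NOTEQUAL:  return msg->m_type != type;
    }
    return 0;
}

/* Hand msg to the first suitable waiting receiver. Returns 1 if delivered. */
static int pipelined_send(struct receiver *q, size_t n,
                          const struct message *msg)
{
    size_t i;

    for (i = 0; i < n; i++) {
        struct receiver *r = &q[i];

        if (!r->waiting || !matches(msg, r->msgtype, r->mode))
            continue;
        r->waiting = 0;           /* dequeue and "wake" the receiver */
        if (r->maxsize < msg->m_ts) {
            r->r_err = E2BIG;     /* buffer too small: fail this receiver */
            continue;             /* and keep searching */
        }
        r->r_msg = msg;
        return 1;
    }
    return 0;                     /* nobody was waiting for this message */
}
```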
copy_msqid_to_user() copies the contents of a kernel buffer to the user buffer. It receives as parameters a user buffer, a kernel buffer of type msqid64_ds, and a version flag indicating the new IPC version vs. the old IPC version. If the version flag equals IPC_64, then copy_to_user() is invoked to copy from the kernel buffer to the user buffer directly. Otherwise a temporary buffer of type struct msqid_ds is initialized, and the kernel data is translated to this temporary buffer. copy_to_user() is then called to copy the contents of the temporary buffer to the user buffer.
The function copy_msqid_from_user() receives as parameters a kernel message buffer of type struct msq_setbuf, a user buffer and a version flag indicating the new IPC version vs. the old IPC version. In the case of the new IPC version, copy_from_user() is called to copy the contents of the user buffer to a temporary buffer of type msqid64_ds. Then, the qbytes, uid, gid, and mode fields of the kernel buffer are filled with the values of the corresponding fields from the temporary buffer. In the case of the old IPC version, a temporary buffer of type struct msqid_ds is used instead.
The entire call to sys_shmget() is protected by the global shared memory semaphore.
In the case where a new shared memory segment must be created, the newseg() function is called to create and initialize a new shared memory segment. The ID of the new segment is returned to the caller.
In the case where a key value is provided for an existing shared memory segment, the corresponding index in the shared memory descriptors array is looked up, and the parameters and permissions of the caller are verified before returning the shared memory segment ID. The lookup and verification are performed while the global shared memory spinlock is held.
A temporary shminfo64 buffer is loaded with system-wide shared memory parameters and is copied out to user space for access by the calling application.
The global shared memory semaphore and the global shared memory spinlock are held while gathering system-wide statistical information for shared memory. The shm_get_stat() function is called to calculate both the number of shared memory pages that are resident in memory and the number of shared memory pages that are swapped out. Other statistics include the total number of shared memory pages and the number of shared memory segments in use. The counts of swap_attempts and swap_successes are hard-coded to zero. These statistics are stored in a temporary shm_info buffer and copied out to user space for the calling application.
For SHM_STAT and IPC_STAT, a temporary buffer of type struct shmid64_ds is initialized, and the global shared memory spinlock is locked.
For the SHM_STAT case, the shared memory segment ID parameter is expected to be a straight index (i.e. 0 to n where n is the number of shared memory IDs in the system). After validating the index, ipc_buildid() is called (via shm_buildid()) to convert the index into a shared memory ID. When SHM_STAT succeeds, the shared memory ID is the return value. Note that this is an undocumented feature, but is maintained for the ipcs(8) program.
For the IPC_STAT case, the shared memory segment ID parameter is expected to be an ID that was generated by a call to shmget(). The ID is validated before proceeding. When IPC_STAT succeeds, 0 is the return value.
For both SHM_STAT and IPC_STAT, the access permissions of the caller are verified. The desired statistics are loaded into the temporary buffer and then copied out to the calling application.
After validating access permissions, the global shared memory spinlock is locked, and the shared memory segment ID is validated. For both SHM_LOCK and SHM_UNLOCK, shmem_lock() is called to perform the function. The parameters for shmem_lock() identify the function to be performed.
For IPC_RMID, the global shared memory semaphore and the global shared memory spinlock are held throughout the operation. The shared memory ID is validated, and then if there are no current attachments, shm_destroy() is called to destroy the shared memory segment. Otherwise, the SHM_DEST flag is set to mark the segment for destruction, and its key is set to IPC_PRIVATE to prevent other processes from looking up the segment by key.
After validating the shared memory segment ID and the user access permissions, the uid, gid, and mode flags of the shared memory segment are updated with the user data. The shm_ctime field is also updated. These changes are made while holding the global shared memory semaphore and the global shared memory spinlock.
sys_shmat() takes as parameters a shared memory segment ID, an address at which the shared memory segment should be attached (shmaddr), and flags which will be described below.
If shmaddr is non-zero, and the SHM_RND flag is specified, then shmaddr is rounded down to a multiple of SHMLBA. If shmaddr is not a multiple of SHMLBA and SHM_RND is not specified, then EINVAL is returned.
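The alignment rule can be sketched as a small helper. The constants below are assumptions: SHMLBA is architecture dependent, so SKETCH_SHMLBA is simply a power of two picked for illustration, and SKETCH_SHM_RND is a stand-in flag bit. The helper name align_shmaddr() is hypothetical.

```c
#include <errno.h>

/* Hypothetical values: the real SHMLBA depends on the architecture. */
#define SKETCH_SHMLBA  4096UL
#define SKETCH_SHM_RND 020000

/*
 * Returns 0 and stores the usable address in *out, or -EINVAL if the
 * address is misaligned and rounding was not requested.
 */
static int align_shmaddr(unsigned long shmaddr, int shmflg,
                         unsigned long *out)
{
    if (shmaddr & (SKETCH_SHMLBA - 1)) {        /* not a multiple */
        if (!(shmflg & SKETCH_SHM_RND))
            return -EINVAL;
        shmaddr &= ~(SKETCH_SHMLBA - 1);        /* round down */
    }
    *out = shmaddr;
    return 0;
}
```

The mask arithmetic relies on the boundary being a power of two, which is an assumption of this sketch.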
The access permissions of the caller are validated and the shm_nattch field for the shared memory segment is incremented. Note that this increment guarantees that the attachment count is non-zero and prevents the shared memory segment from being destroyed during the process of attaching to the segment. These operations are performed while holding the global shared memory spinlock.
The do_mmap() function is called to create a virtual memory mapping to the shared memory segment pages. This is done while holding the mmap_sem semaphore of the current task. The MAP_SHARED flag is passed to do_mmap(). If an address was provided by the caller, then the MAP_FIXED flag is also passed to do_mmap(). Otherwise, do_mmap() will select the virtual address at which to map the shared memory segment.
Note: shm_inc() will be invoked within the do_mmap() function call via the shm_file_operations structure. This function is called to set the PID, to set the current time, and to increment the number of attachments to this shared memory segment.
After the call to do_mmap(), the global shared memory semaphore and the global shared memory spinlock are both obtained. The attachment count is then decremented. The net change to the attachment count is 1 for a call to shmat() because of the call to shm_inc(). If, after decrementing the attachment count, the resulting count is found to be zero, and if the segment is marked for destruction (SHM_DEST), then shm_destroy() is called to release the shared memory segment resources.
Finally, the virtual address at which the shared memory is mapped is returned to the caller by storing it at the user-specified address. If do_mmap() returned an error code, then this failure code is passed on as the return value for the system call.
The global shared memory semaphore is held while performing sys_shmdt(). The mm_struct of the current process is searched for the vm_area_struct associated with the shared memory address. When it is found, do_munmap() is called to undo the virtual address mapping for the shared memory segment.
Note also that do_munmap() performs a call-back to shm_close(), which performs the shared memory bookkeeping functions, and releases the shared memory segment resources if there are no other attachments.
sys_shmdt() unconditionally returns 0.
struct shminfo64 {
    unsigned long shmmax;
    unsigned long shmmin;
    unsigned long shmmni;
    unsigned long shmseg;
    unsigned long shmall;
    unsigned long __unused1;
    unsigned long __unused2;
    unsigned long __unused3;
    unsigned long __unused4;
};
struct shm_info {
    int used_ids;
    unsigned long shm_tot;  /* total allocated shm */
    unsigned long shm_rss;  /* total resident shm */
    unsigned long shm_swp;  /* total swapped shm */
    unsigned long swap_attempts;
    unsigned long swap_successes;
};
struct shmid_kernel /* private to the kernel */
{
    struct kern_ipc_perm shm_perm;
    struct file *        shm_file;
    int                  id;
    unsigned long        shm_nattch;
    unsigned long        shm_segsz;
    time_t               shm_atim;
    time_t               shm_dtim;
    time_t               shm_ctim;
    pid_t                shm_cprid;
    pid_t                shm_lprid;
};
struct shmid64_ds {
    struct ipc64_perm shm_perm;    /* operation perms */
    size_t            shm_segsz;   /* size of segment (bytes) */
    __kernel_time_t   shm_atime;   /* last attach time */
    unsigned long     __unused1;
    __kernel_time_t   shm_dtime;   /* last detach time */
    unsigned long     __unused2;
    __kernel_time_t   shm_ctime;   /* last change time */
    unsigned long     __unused3;
    __kernel_pid_t    shm_cpid;    /* pid of creator */
    __kernel_pid_t    shm_lpid;    /* pid of last operator */
    unsigned long     shm_nattch;  /* no. of current attaches */
    unsigned long     __unused4;
    unsigned long     __unused5;
};
struct shmem_inode_info {
    spinlock_t       lock;
    unsigned long    max_index;
    swp_entry_t      i_direct[SHMEM_NR_DIRECT]; /* for the first blocks */
    swp_entry_t    **i_indirect;                /* doubly indirect blocks */
    unsigned long    swapped;
    int              locked;                    /* into memory */
    struct list_head list;
};
The newseg() function is called when a new shared memory segment needs to be created. It acts on three parameters for the new segment: the key, the flag, and the size. After validating that the size of the shared memory segment to be created is between SHMMIN and SHMMAX and that the total number of shared memory pages does not exceed SHMALL, it allocates a new shared memory segment descriptor. The shmem_file_setup() function is invoked later to create an unlinked file of type tmpfs. The returned file pointer is saved in the shm_file field of the associated shared memory segment descriptor. The file's size is set to be the same as the size of the segment. The new shared memory segment descriptor is initialized and inserted into the global IPC shared memory descriptors array. The shared memory segment ID is created by shm_buildid() (via ipc_buildid()). This segment ID is saved in the id field of the shared memory segment descriptor, as well as in the i_ino field of the associated inode. In addition, the address of the shared memory operations defined in structure shm_file_operations is stored in the associated file. The value of the global variable shm_tot, which indicates the total number of shared memory pages system wide, is also increased to reflect this change. On success, the segment ID is returned to the calling application.
shm_get_stat() cycles through all of the shared memory structures, and calculates the total number of memory pages in use by shared memory and the total number of shared memory pages that are swapped out. There is a file structure and an inode structure for each shared memory segment. Since the required data is obtained via the inode, the spinlock for each inode structure that is accessed is locked and unlocked in sequence.
shmem_lock() receives as parameters a pointer to the shared memory segment descriptor and a flag indicating lock vs. unlock. The locking state of the shared memory segment is stored in an associated inode. This state is compared with the desired locking state; shmem_lock() simply returns if they match.
While holding the semaphore of the associated inode, the locking state of the inode is set. The following steps are performed for each page in the shared memory segment:
During shm_destroy() the total number of shared memory pages is adjusted to account for the removal of the shared memory segment. ipc_rmid() is called (via shm_rmid()) to remove the shared memory ID. shmem_lock() is called to unlock the shared memory pages, effectively decrementing the reference counts to zero for each page. fput() is called to decrement the usage counter f_count for the associated file object, and if necessary, to release the file object resources. kfree() is called to free the shared memory segment descriptor.
shm_inc() sets the PID, sets the current time, and increments the number of attachments for the given shared memory segment. These operations are performed while holding the global shared memory spinlock.
shm_close() updates the shm_lprid and the shm_dtim fields and decrements the attachment count of the shared memory segment. If there are no other attachments to the shared memory segment and the segment is marked for destruction (SHM_DEST), then shm_destroy() is called to release the shared memory segment resources. These operations are all performed while holding both the global shared memory semaphore and the global shared memory spinlock.
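The interplay between attaching, IPC_RMID, and the final detach can be sketched as a small state machine. This is an illustration only: it assumes that a detach destroys the segment only when the segment has been marked with SHM_DEST, as described for sys_shmat(), and it ignores all locking. The structure, the destroyed marker, the SKETCH_SHM_DEST value, and the seg_*() helper names are hypothetical.

```c
/* Stand-in for the SHM_DEST flag bit; the value is an assumption. */
#define SKETCH_SHM_DEST 01000

struct segment {
    unsigned long nattch;     /* current number of attachments */
    int           flags;
    int           destroyed;  /* hypothetical marker for this sketch */
};

/* Stands in for shm_destroy(): just record that resources were released. */
static void seg_destroy(struct segment *s)
{
    s->destroyed = 1;
}

/* Stands in for shm_inc(): one more attachment. */
static void seg_attach(struct segment *s)
{
    s->nattch++;
}

/* IPC_RMID: destroy now if unattached, otherwise defer. */
static void seg_rmid(struct segment *s)
{
    if (s->nattch == 0)
        seg_destroy(s);
    else
        s->flags |= SKETCH_SHM_DEST;
}

/* Stands in for shm_close(): the last detach completes a deferred removal. */
static void seg_close(struct segment *s)
{
    s->nattch--;
    if (s->nattch == 0 && (s->flags & SKETCH_SHM_DEST))
        seg_destroy(s);
}
```

A segment that is removed while two processes are attached therefore survives the first detach and is released on the second.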
The function shmem_file_setup() sets up an unlinked file living in the tmpfs file system with the given name and size. If there are enough system memory resources for this file, it creates a new dentry under the mount root of tmpfs, and allocates a new file descriptor and a new inode object of tmpfs type. Then it associates the new dentry object with the new inode object by calling d_instantiate() and saves the address of the dentry object in the file descriptor. The i_size field of the inode object is set to the file size and the i_nlink field is set to 0 in order to mark the inode unlinked. Also, shmem_file_setup() stores the address of the shmem_file_operations structure in the f_op field, and initializes the f_mode and f_vfsmnt fields of the file descriptor properly. The function shmem_truncate() is called to complete the initialization of the inode object. On success, shmem_file_setup() returns the new file descriptor.
The semaphore, message, and shared memory mechanisms of Linux are built on a set of common primitives. These primitives are described in the sections below.
If the memory allocation is greater than PAGE_SIZE, then vmalloc() is used to allocate memory. Otherwise, kmalloc() is called with GFP_KERNEL to allocate the memory.
When a new semaphore set, message queue, or shared memory segment is added, ipc_addid() first calls grow_ary() to ensure that the size of the corresponding descriptor array is sufficiently large for the system maximum. The array of descriptors is searched for the first unused element. If an unused element is found, the count of descriptors which are in use is incremented. The kern_ipc_perm structure for the new resource descriptor is then initialized, and the array index for the new descriptor is returned. When ipc_addid() succeeds, it returns with the global spinlock for the given IPC type locked.
ipc_rmid() removes the IPC descriptor from the global descriptor array of the IPC type, updates the count of IDs which are in use, and adjusts the maximum ID in the corresponding descriptor array if necessary. A pointer to the IPC descriptor associated with the given IPC ID is returned.
ipc_buildid() creates a unique ID to be associated with each descriptor within a given IPC type. This ID is created at the time a new IPC element is added (e.g. a new shared memory segment or a new semaphore set). The IPC ID converts easily into the corresponding descriptor array index. Each IPC type maintains a sequence number which is incremented each time a descriptor is added. An ID is created by multiplying the sequence number with SEQ_MULTIPLIER and adding the product to the descriptor array index. The sequence number used in creating a particular IPC ID is then stored in the corresponding descriptor. The existence of the sequence number makes it possible to detect the use of a stale IPC ID.
ipc_checkid() divides the given IPC ID by the SEQ_MULTIPLIER and compares the quotient with the seq value saved in the corresponding descriptor. If they are equal, then the IPC ID is considered to be valid and 0 is returned. Otherwise, the ID is stale and 1 is returned.
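The arithmetic behind ID creation and validation can be sketched as follows. The multiplier value is an assumption chosen for illustration, and the helper names build_id(), id_to_index() and id_matches_seq() are hypothetical; only the multiply-and-add scheme and the division test come from the description above.

```c
/* Hypothetical multiplier; it must exceed the largest array index. */
#define SKETCH_SEQ_MULTIPLIER 32768

/* ID = sequence number * multiplier + descriptor array index. */
static int build_id(int index, int seq)
{
    return SKETCH_SEQ_MULTIPLIER * seq + index;
}

/* The array index is recovered as the remainder. */
static int id_to_index(int id)
{
    return id % SKETCH_SEQ_MULTIPLIER;
}

/* Nonzero when the sequence number encoded in id equals the stored one. */
static int id_matches_seq(int id, int seq)
{
    return id / SKETCH_SEQ_MULTIPLIER == seq;
}
```

For example, index 3 with sequence number 2 yields ID 65539. If slot 3 is later reused with sequence number 3, the old ID still maps to index 3 but no longer matches the stored sequence number, which is how a stale ID is detected.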
grow_ary() handles the possibility that the maximum (tunable) number of IDs for a given IPC type can be dynamically changed. It enforces the current maximum limit so that it is no greater than the permanent system limit (IPCMNI) and adjusts it down if necessary. It also ensures that the existing descriptor array is large enough. If the existing array size is sufficiently large, then the current maximum limit is returned. Otherwise, a new larger array is allocated, the old array is copied into the new array, and the old array is freed. The corresponding global spinlock is held when updating the descriptor array for the given IPC type.
ipc_findkey() searches the descriptor array of the specified ipc_ids object for the specified key. Once found, the index of the corresponding descriptor is returned. If the key is not found, then -1 is returned.
ipcperms() checks the user, group, and other permissions for access to the IPC resources. It returns 0 if permission is granted and -1 otherwise.
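One way such a check can be structured is sketched below. This is a simplified user-space illustration, not the kernel routine: the caller's effective IDs are passed in explicitly, supplementary groups and privileged overrides are ignored, and the requested access is assumed to be given as permission bits that are folded into a single three-bit class. The struct perm type and check_perms() name are hypothetical.

```c
/* Hypothetical stand-in for the ownership and mode fields of a descriptor. */
struct perm {
    unsigned uid, gid;     /* current owner */
    unsigned cuid, cgid;   /* creator */
    unsigned mode;         /* e.g. 0640 */
};

/* Returns 0 if every requested bit is granted to the caller, -1 otherwise. */
static int check_perms(const struct perm *p, unsigned euid, unsigned egid,
                       unsigned flag)
{
    /* Fold user, group and other request bits into the low three bits. */
    unsigned requested = (flag >> 6) | (flag >> 3) | flag;
    unsigned granted = p->mode;

    if (euid == p->cuid || euid == p->uid)
        granted >>= 6;                 /* use the "user" class */
    else if (egid == p->cgid || egid == p->gid)
        granted >>= 3;                 /* use the "group" class */
    /* otherwise the "other" class is already in the low bits */

    if (requested & ~granted & 07)
        return -1;                     /* some requested bit is missing */
    return 0;
}
```

With mode 0640, the owner may read and write, a group member may only read, and everyone else is refused.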
ipc_lock() takes an IPC ID as one of its parameters. It locks the global spinlock for the given IPC type, and returns a pointer to the descriptor corresponding to the specified IPC ID.
ipc_unlock() releases the global spinlock for the indicated IPC type.
ipc_lockall() locks the global spinlock for the given IPC mechanism (i.e. shared memory, semaphores, and messaging).
ipc_unlockall() unlocks the global spinlock for the given IPC mechanism (i.e. shared memory, semaphores, and messaging).
ipc_get() takes a pointer to a particular IPC type (i.e. shared memory, semaphores, or message queues) and a descriptor ID, and returns a pointer to the corresponding IPC descriptor. Note that although the descriptors for each IPC type are of different data types, the common kern_ipc_perm structure type is embedded as the first entity in every case. The ipc_get() function returns this common data type. The expected model is that ipc_get() is called through a wrapper function (e.g. shm_get()) which casts the data type to the correct descriptor data type.
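The wrapper pattern relies on the common structure being the first member of every descriptor, so that a pointer to it is also a valid pointer to the enclosing descriptor. A minimal sketch, with hypothetical type and function names standing in for kern_ipc_perm, ipc_ids, ipc_get() and shm_get(), and with the lookup simplified to take an array index directly:

```c
#include <stddef.h>

struct perm_hdr {                  /* stand-in for kern_ipc_perm */
    int      key;
    unsigned mode;
};

struct shm_desc {                  /* stand-in for a shared memory descriptor */
    struct perm_hdr perm;          /* must be the first member */
    unsigned long   segsz;
};

struct id_table {                  /* stand-in for ipc_ids */
    int               size;
    struct perm_hdr **entries;
};

/* Generic lookup: knows nothing about the concrete descriptor type. */
static struct perm_hdr *generic_get(struct id_table *ids, int index)
{
    if (index < 0 || index >= ids->size)
        return NULL;
    return ids->entries[index];
}

/* Type-specific wrapper: casts the common header back to the descriptor. */
static struct shm_desc *shm_get_sketch(struct id_table *ids, int index)
{
    return (struct shm_desc *)generic_get(ids, index);
}
```

The cast is well defined in C only because the header sits at offset zero of the descriptor; moving it elsewhere in the structure would break every wrapper of this kind.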
ipc_parse_version() removes the IPC_64 flag from the command if it is present and returns either IPC_64 or IPC_OLD.
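The flag handling can be sketched in a few lines. The constant values and the parse_version() name are assumptions made for illustration; the behavior shown (strip the flag from the command, report which version was requested) follows the description above.

```c
/* Stand-in values; the real constants are defined by the kernel headers. */
#define SKETCH_IPC_OLD 0
#define SKETCH_IPC_64  0x0100

/* Strip the version flag from *cmd and report which version was asked for. */
static int parse_version(int *cmd)
{
    if (*cmd & SKETCH_IPC_64) {
        *cmd &= ~SKETCH_IPC_64;    /* leave only the bare command */
        return SKETCH_IPC_64;
    }
    return SKETCH_IPC_OLD;
}
```

After the call, the command value can be compared directly against the plain command constants regardless of which version the caller used.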
The semaphore, message, and shared memory mechanisms all make use of the following common structures:
Each of the IPC descriptors has a data object of this type as the first element. This makes it possible to access any descriptor from any of the generic IPC functions using a pointer of this data type.
/* used by in-kernel data structures */
struct kern_ipc_perm {
    key_t         key;
    uid_t         uid;
    gid_t         gid;
    uid_t         cuid;
    gid_t         cgid;
    mode_t        mode;
    unsigned long seq;
};
The ipc_ids structure describes the common data for semaphores, message queues, and shared memory. There are three global instances of this data structure (sem_ids, msg_ids and shm_ids) for semaphores, messages and shared memory respectively. In each instance, the sem semaphore is used to protect access to the structure. The entries field points to an IPC descriptor array, and the ary spinlock protects access to this array. The seq field is a global sequence number which will be incremented when a new IPC resource is created.
struct ipc_ids {
    int               size;
    int               in_use;
    int               max_id;
    unsigned short    seq;
    unsigned short    seq_max;
    struct semaphore  sem;
    spinlock_t        ary;
    struct ipc_id*    entries;
};
An array of struct ipc_id exists in each instance of the ipc_ids structure. The array is dynamically allocated and may be replaced with a larger array by grow_ary() as required. The array is sometimes referred to as the descriptor array, since the kern_ipc_perm data type is used as the common descriptor data type by the IPC generic functions.
struct ipc_id {
    struct kern_ipc_perm* p;
};