Memory barriers are instructions to both the compiler and the CPU to impose a
partial ordering between the memory access operations specified on either side
of the barrier.

Older and less complex CPUs will perform memory accesses in exactly the order
specified.  So, given the following piece of code:

        a = *A;
        *B = b;
        c = *C;
        d = *D;
        *E = e;

it is guaranteed that the CPU will complete the memory access for each
statement before moving on to the next one, leading to a definite sequence of
operations on the bus:

        read *A, write *B, read *C, read *D, write *E

With newer and more complex CPUs, however, this isn't always true, because:

 (*) they can rearrange the order of the memory accesses to promote better
     use of the CPU buses and caches;

 (*) reads are synchronous and may need to be done immediately to permit
     progress, whereas writes can often be deferred without a problem; and

 (*) they are able to combine reads and writes to improve performance when
     talking to SDRAM (modern SDRAM chips can do batched accesses of
     adjacent locations, cutting down on transaction setup costs).

When a program runs on a single CPU, the hardware performs the necessary
bookkeeping to ensure that the program executes as if all memory operations
were performed in the order specified by the programmer (program order),
hence memory barriers are not necessary.  However, when memory is shared
with other devices, such as other CPUs in a multiprocessor system or
memory-mapped peripherals, out-of-order access may affect program behavior.
For example, a second CPU may see memory changes made by the first CPU in a
sequence that differs from program order.  So what you might actually get
from the above piece of code is:

        read *A, read *C+*D, write *E, write *B

Under normal operation this is probably not going to be a problem; however,
there are two circumstances where it definitely _can_ be a problem:

 (1) I/O

     Many I/O devices can be memory-mapped, and so appear to the CPU as if
     they're just memory locations.  However, to control the device, the
     driver has to make the right accesses in exactly the right order.

     Consider, for example, an ethernet chipset such as the AMD PCnet32.  It
     presents to the CPU an "address register" and a bunch of "data
     registers".  The way it's accessed is to write the index of the internal
     register you want to access into the address register, and then read or
     write the appropriate data register to access the chip's internal
     register:

        *ADR = ctl_reg_3;
        reg = *DATA;

     The problem with a clever CPU or a clever compiler is that the write to
     the address register isn't guaranteed to happen before the access to the
     data register.  If the CPU or the compiler decides it is more efficient
     to defer the address write, so that the accesses happen as:

        read *DATA, write *ADR

     then things will break.  The way to deal with this is to insert an I/O
     memory barrier between the two accesses:

        *ADR = ctl_reg_3;
        mb();
        reg = *DATA;

     In this case, the barrier guarantees that all memory accesses before the
     barrier will happen before all the memory accesses after the barrier.
     It does not guarantee that all memory accesses before the barrier will
     be complete by the time the barrier instruction itself completes.
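     As a rough illustration of this pattern, here is a small self-contained
     sketch in C.  The names are hypothetical: ADR and DATA stand for the
     chip's memory-mapped address and data registers (assumed to have been
     mapped already), and mb() is modelled with the GCC/Clang
     __sync_synchronize() builtin rather than the kernel's real barrier:

        #include <stdint.h>

        /* Stand-in for the kernel's full memory barrier: on GCC/Clang,
         * __sync_synchronize() emits a full hardware fence and also stops
         * the compiler from reordering memory accesses across it. */
        #define mb() __sync_synchronize()

        /* Hypothetical pointers to the memory-mapped address and data
         * registers; assume they have already been set up (e.g. by
         * ioremap()).  'volatile' stops the compiler from caching or
         * merging the accesses themselves. */
        static volatile uint32_t *ADR;
        static volatile uint32_t *DATA;

        /* Read internal register 'index' via the address/data pair. */
        static uint32_t chip_read_reg(uint32_t index)
        {
                *ADR = index;   /* select the internal register...      */
                mb();           /* ...and force the write to precede... */
                return *DATA;   /* ...the read of the data register     */
        }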
 (2) Multiprocessor interaction

     When there's a system with more than one processor, the CPUs may be
     working on the same set of data while trying to avoid locks, as locks
     are quite expensive.  This means that accesses which affect both CPUs
     may have to be carefully ordered to prevent error.

     Consider the R/W semaphore slow path.  Here, a waiting process is queued
     on the semaphore, as noted by it having a record on its stack linked
     into the semaphore's list:

        struct rw_semaphore {
                ...
                struct list_head waiters;
        };

        struct rwsem_waiter {
                struct list_head list;
                struct task_struct *task;
        };

     To wake up a waiter, the up_read() or up_write() functions have to read
     the pointer from this record to know where the next waiter record is,
     clear the task pointer, call wake_up_process() on the task, and release
     the reference held on the task struct:

        READ waiter->list.next;
        READ waiter->task;
        WRITE waiter->task;
        CALL wakeup
        RELEASE task

     If any of these steps occur out of order, then the whole thing may fail.
     Note that the waiter does not take the semaphore lock again - it just
     waits for its task pointer to be cleared.  Since the record is on the
     waiter's stack, this means that if the task pointer is cleared before
     the next pointer in the list is read, another CPU might start processing
     the waiter and might clobber its stack before the up*() function has a
     chance to read the next pointer:

        CPU 0                           CPU 1
        =============================== ===============================
        down_xxx()
        Queue waiter
        Sleep
                                        up_yyy()
                                        READ waiter->task;
                                        WRITE waiter->task;
        <preempt>
        Resume processing
        down_xxx() returns
        call foo()
        foo() clobbers *waiter
        </preempt>
                                        READ waiter->list.next;
                                        --- OOPS ---

     This could be dealt with using a spinlock, but then the down_xxx()
     function would have to take the spinlock again after it's been woken up,
     which is a waste of resources.  The way to deal with it instead is to
     insert an SMP memory barrier:

        READ waiter->list.next;
        READ waiter->task;
        smp_mb();
        WRITE waiter->task;
        CALL wakeup
        RELEASE task

     In this case, the barrier guarantees that all memory accesses before the
     barrier will happen before all the memory accesses after the barrier.
     It does not guarantee that all memory accesses before the barrier will
     be complete by the time the barrier instruction itself completes.

     SMP memory barriers are normally no-ops on a UP system because the CPU
     orders overlapping accesses with respect to itself.

Reference: http://lwn.net/Articles/174655/
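To make the ordering in case (2) concrete, here is a rough, self-contained C
sketch of the wake-up step.  It is not the kernel's actual implementation:
smp_mb() is modelled with a C11 fence, and wake_up_process() and
put_task_struct() are merely declared as stand-ins for the real kernel calls:

        #include <stdatomic.h>

        struct list_head { struct list_head *next, *prev; };
        struct task_struct;                     /* opaque here */

        struct rwsem_waiter {
                struct list_head list;          /* lives on waiter's stack */
                struct task_struct *task;
        };

        /* Stand-ins for the real kernel functions. */
        void wake_up_process(struct task_struct *task);
        void put_task_struct(struct task_struct *task);

        /* Wake one queued waiter; returns the next record in the list. */
        static struct list_head *wake_one_waiter(struct rwsem_waiter *waiter)
        {
                struct list_head *next = waiter->list.next; /* READ next */
                struct task_struct *tsk = waiter->task;     /* READ task */

                /* smp_mb(): both reads of the waiter record must be done
                 * before the write below, because clearing ->task frees
                 * the sleeper to return and reuse the stack the record
                 * lives on. */
                atomic_thread_fence(memory_order_seq_cst);

                waiter->task = NULL;    /* WRITE task   */
                wake_up_process(tsk);   /* CALL wakeup  */
                put_task_struct(tsk);   /* RELEASE task */

                return next;            /* safe: read before the barrier */
        }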