Memory barriers are instructions to both the compiler and the CPU to impose a
partial ordering between the memory access operations specified on either side
of the barrier.

Older and less complex CPUs will perform memory accesses in exactly the order
specified.  So, given the following piece of code:

        a = *A;
        *B = b;
        c = *C;
        d = *D;
        *E = e;

it is guaranteed that the CPU will complete the memory access for each
statement before moving on to the next one, leading to a definite sequence of
operations on the bus:

        read *A, write *B, read *C, read *D, write *E

With newer and more complex CPUs, however, this isn't always true, because:

 (*) they can rearrange the order of the memory accesses to promote better
     use of the CPU buses and caches;

 (*) reads are synchronous and may need to be done immediately to permit
     progress, whereas writes can often be deferred without a problem; and

 (*) they are able to combine reads and writes to improve performance when
     talking to SDRAM (modern SDRAM chips can do batched accesses of
     adjacent locations, cutting down on transaction setup costs).

When a program runs on a single CPU, the hardware performs the necessary
bookkeeping to ensure that the program executes as if all memory operations
were performed in the order specified by the programmer (program order),
hence memory barriers are not necessary.  However, when memory is shared
with other devices, such as other CPUs in a multiprocessor system or
memory-mapped peripherals, out-of-order access may affect program behavior.
For example, a second CPU may see memory changes made by the first CPU in a
sequence that differs from program order.  So what you might actually get
from the above piece of code is:

        read *A, read *C+*D, write *E, write *B

Under normal operation this is probably not going to be a problem; however,
there are two circumstances where it definitely _can_ be a problem:

 (1) I/O

     Many I/O devices can be memory-mapped, and so appear to the CPU as if
     they're just memory locations.  However, to control the device, the
     driver has to make the right accesses in exactly the right order.

     Consider, for example, an ethernet chipset such as the AMD PCnet32.  It
     presents to the CPU an "address register" and a bunch of "data
     registers".  The way it's accessed is to write the index of the internal
     register you want to access into the address register, and then read or
     write the appropriate data register to access the chip's internal
     register:

        *ADR = ctl_reg_3;
        reg = *DATA;

     The problem with a clever CPU or a clever compiler is that the write to
     the address register isn't guaranteed to happen before the access to the
     data register.  If the CPU or the compiler decides it is more efficient
     to defer the address write, so that the accesses happen as:

        read *DATA, write *ADR

     then things will break.  The way to deal with this is to insert an I/O
     memory barrier between the two accesses:

        *ADR = ctl_reg_3;
        mb();
        reg = *DATA;

     In this case, the barrier guarantees that all memory accesses before the
     barrier will happen before all the memory accesses after the barrier.
     It does not guarantee that all memory accesses before the barrier will
     be complete by the time the barrier instruction itself completes.
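     As a rough illustration of this pattern, here is a small self-contained
     sketch in C.  The names are hypothetical: ADR and DATA stand for the
     chip's memory-mapped address and data registers (assumed to have been
     mapped already), and mb() is modelled with the GCC/Clang
     __sync_synchronize() builtin rather than the kernel's real barrier:

        #include <stdint.h>

        /* Stand-in for the kernel's full memory barrier: on GCC/Clang,
         * __sync_synchronize() emits a full hardware fence and also stops
         * the compiler from reordering memory accesses across it. */
        #define mb() __sync_synchronize()

        /* Hypothetical pointers to the memory-mapped address and data
         * registers; assume they have already been set up (e.g. by
         * ioremap()).  'volatile' stops the compiler from caching or
         * merging the accesses themselves. */
        static volatile uint32_t *ADR;
        static volatile uint32_t *DATA;

        /* Read internal register 'index' via the address/data pair. */
        static uint32_t chip_read_reg(uint32_t index)
        {
                *ADR = index;   /* select the internal register...      */
                mb();           /* ...and force the write to precede... */
                return *DATA;   /* ...the read of the data register     */
        }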
 (2) Multiprocessor interaction

     When there's a system with more than one processor, the CPUs may be
     working on the same set of data while trying to avoid locks, as locks
     are quite expensive.  This means that accesses which affect both CPUs
     may have to be carefully ordered to prevent error.

     Consider the R/W semaphore slow path.  Here, a waiting process is queued
     on the semaphore, as noted by it having a record on its stack linked
     into the semaphore's list:

        struct rw_semaphore {
                ...
                struct list_head waiters;
        };

        struct rwsem_waiter {
                struct list_head list;
                struct task_struct *task;
        };

     To wake up a waiter, the up_read() or up_write() functions have to read
     the pointer from this record to know where the next waiter record is,
     clear the task pointer, call wake_up_process() on the task, and release
     the reference held on the task struct:

        READ waiter->list.next;
        READ waiter->task;
        WRITE waiter->task;
        CALL wakeup
        RELEASE task

     If any of these steps occur out of order, then the whole thing may fail.
     Note that the waiter does not take the semaphore lock again - it just
     waits for its task pointer to be cleared.  Since the record is on the
     waiter's stack, this means that if the task pointer is cleared before
     the next pointer in the list is read, another CPU might start processing
     the waiter and might clobber its stack before the up*() function has a
     chance to read the next pointer:

        CPU 0                           CPU 1
        =============================== ===============================
        down_xxx()
        Queue waiter
        Sleep
                                        up_yyy()
                                        READ waiter->task;
                                        WRITE waiter->task;
        <preempt>
        Resume processing
        down_xxx() returns
        call foo()
        foo() clobbers *waiter
        </preempt>
                                        READ waiter->list.next;
                                        --- OOPS ---

     This could be dealt with using a spinlock, but then the down_xxx()
     function would have to take the spinlock again after it's been woken up,
     which is a waste of resources.  The way to deal with it instead is to
     insert an SMP memory barrier:

        READ waiter->list.next;
        READ waiter->task;
        smp_mb();
        WRITE waiter->task;
        CALL wakeup
        RELEASE task

     In this case, the barrier guarantees that all memory accesses before the
     barrier will happen before all the memory accesses after the barrier.
     It does not guarantee that all memory accesses before the barrier will
     be complete by the time the barrier instruction itself completes.

     SMP memory barriers are normally no-ops on a UP system because the CPU
     orders overlapping accesses with respect to itself.

Reference: http://lwn.net/Articles/174655/
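To make the ordering in case (2) concrete, here is a rough, self-contained C
sketch of the wake-up step.  It is not the kernel's actual implementation:
smp_mb() is modelled with a C11 fence, and wake_up_process() and
put_task_struct() are merely declared as stand-ins for the real kernel calls:

        #include <stdatomic.h>

        struct list_head { struct list_head *next, *prev; };
        struct task_struct;                     /* opaque here */

        struct rwsem_waiter {
                struct list_head list;          /* lives on waiter's stack */
                struct task_struct *task;
        };

        /* Stand-ins for the real kernel functions. */
        void wake_up_process(struct task_struct *task);
        void put_task_struct(struct task_struct *task);

        /* Wake one queued waiter; returns the next record in the list. */
        static struct list_head *wake_one_waiter(struct rwsem_waiter *waiter)
        {
                struct list_head *next = waiter->list.next; /* READ next */
                struct task_struct *tsk = waiter->task;     /* READ task */

                /* smp_mb(): both reads of the waiter record must be done
                 * before the write below, because clearing ->task frees
                 * the sleeper to return and reuse the stack the record
                 * lives on. */
                atomic_thread_fence(memory_order_seq_cst);

                waiter->task = NULL;    /* WRITE task   */
                wake_up_process(tsk);   /* CALL wakeup  */
                put_task_struct(tsk);   /* RELEASE task */

                return next;            /* safe: read before the barrier */
        }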