Posts tagged with 'queue'

Testing the lock-free queue

Having discussed a test app for testing the multithreaded use of the lock-free stack, it’s now time for the lock-free queue.

Queue of businessmenFirst of all, I decided to play around with the lock-free queue more than I had done with the lock-free stack. The queue works by enqueuing at the tail and dequeuing at the head of the internal linked list. Because an enqueue requires two updates to the linked list (adding the new node, and moving the tail pointer to the new node) and it’s hard to get them to both happen during the enqueue operation, Michael and Scott, the inventors of this particular technique, decided that updating the tail pointer wasn’t ultra-important straightaway. If it happened, great, but, if not, they deferred it to the next thread performing an enqueue or a dequeue. In fact, the enqueue operation requires the tail to be pointing to the last node before the enqueue can happen. In other words, the tail can never become adrift by more than one position from the end. If a particular enqueue doesn’t update that tail pointer, the very next one will force it to happen. Since there is a dummy node in the queue implementation holding no data, this means that the tail can never be or point ‘outside’ the linked list; it’s not like we can get some weird ABA problem where the tail is pointing at an old node no longer part of the list.

Michael and Scott went even further: they made the dequeue operation update the tail position if it was required. This, frankly, made the dequeue code harder to understand and harder to test. And, in reality, it’s just not required. Again: the tail can’t become adrift by more than one node since the next enqueue will update it before the new node can be added (in particular, the tail’s next pointer must be null before the new node is added).

So I decided to remove the tail update code from the Dequeue() method altogether. That, plus some more tidying up advised by CodeRush’s code issues feature, led to this latest code for the lock-free queue:

using System;
using JmBucknall.Threading;

namespace JmBucknall.Structures {

  public class LockFreeQueue<T> {
    SingleLinkNode<T> head;
    SingleLinkNode<T> tail;

    public LockFreeQueue() {
      head = new SingleLinkNode<T>();
      tail = head;
    }

    public void Enqueue(T item) {
      SingleLinkNode<T> newNode = new SingleLinkNode<T> { Item = item };

      while (true) {
        SingleLinkNode<T> oldTail = tail;
        SingleLinkNode<T> oldTailNext = oldTail.Next;

        if (tail == oldTail) {
          if (oldTailNext != null) {
            SyncMethods.CAS<SingleLinkNode<T>>(ref tail, oldTail, oldTailNext);
          }
          else {
            if (SyncMethods.CAS<SingleLinkNode<T>>(ref tail.Next, null, newNode)) {
              SyncMethods.CAS<SingleLinkNode<T>>(ref tail, oldTail, newNode);
              return;
            }
          }
        }
      }
    }

    public bool Dequeue(out T item) {
      while (true) {
        SingleLinkNode<T> oldHead = head;
        SingleLinkNode<T> oldHeadNext = oldHead.Next;

        if (oldHead == head) {
          if (oldHeadNext == null) {
            item = default(T);
            return false;
          }
          if (SyncMethods.CAS<SingleLinkNode<T>>(ref head, oldHead, oldHeadNext)) {
            item = oldHeadNext.Item;
            return true;
          }
        }
      }
    }

    public T Dequeue() {
      T result;
      Dequeue(out result);
      return result;
    }
  }
}

For the Enqueue() method, a new node is allocated and initialized. We then enter an infinite loop (it’ll be exited when we add the new node properly at the end of the linked list) and take copies of the tail and the tail’s next node. If there’s no change in the tail, and the next node is not null, the tail has fallen behind and we (try to) update its position. If the next node is null, we try to add the new node. If that was successful, we (try to) update the tail and return. Simple enough: update the tail if needed and add the new node on the end, afterwards trying to update the tail. The new node will only be added if the tail’s next node is null.

Dequeue() is much simpler than before. No longer do we worry about the tail; let Enqueue() deal with that. We take a copy of the head and its next node. If the head hasn’t changed and the next node is null, we can return the default value and false (to indicate the queue was empty at that point in time). If the head hasn’t changed and the next node is not null, we try to advance the head to its next node. If we’re successful, the old head’s next node is now the dummy head node and we can return its item with impunity and true to indicate a successful dequeue.

Now we can look at the test program. It works in exactly the same way as in the stack case, so I’ll leave the description to that post.

using System;
using System.Threading;
using JmBucknall.Structures;

namespace MultithreadedQueueTester {

  static class Globals {
    public const int TopValue = 10000000;
    public const int EnqueuerCount = 2;
    public const int DequeuerCount = 5;
    public static int EnqueueValue;
    public static bool NoMoreData;
  }

  public class ThreadData {
    public LockFreeQueue<int> queue;
    public ManualResetEvent completion;
    public int[] results;
    public ThreadData(LockFreeQueue<int> queue, ManualResetEvent completion, int[] results) {
      this.queue = queue;
      this.completion = completion;
      this.results = results;
    }
  }

  static class EnqueueEngine {
    static public void Execute(Object stateInfo) {
      ThreadData data = stateInfo as ThreadData;
      int enqueueCount = 0;

      int nextValue = Interlocked.Increment(ref Globals.EnqueueValue);
      while (nextValue <= Globals.TopValue) {
        data.queue.Enqueue(nextValue);
        enqueueCount++;
        nextValue = Interlocked.Increment(ref Globals.EnqueueValue);
      }

      Console.WriteLine(String.Format("Enqueue count: {0}", enqueueCount));
      data.completion.Set();
    }
  }

  static class DequeueEngine {
    static public void Execute(Object stateInfo) {
      ThreadData data = stateInfo as ThreadData;

      int dequeueCount = 0;
      while (true) {
        int value = data.queue.Dequeue();
        if (value == 0) {
          if (Globals.NoMoreData)
            break;
          Thread.Sleep(1); //minor wait when queue is empty
        }
        else {
          dequeueCount++;
          int oldValue = Interlocked.CompareExchange(ref data.results[value], 1, 0);
          if (oldValue != 0) {
            Console.WriteLine(String.Format("Error: already dequeued {0}", value));
          }
        }
      }

      Console.WriteLine(String.Format("Dequeue count: {0}", dequeueCount));
      data.completion.Set();
    }
  }

  class Program {
    private static ManualResetEvent[] QueueWorkItems(int count, LockFreeQueue<int> queue, int[] results, WaitCallback callback) {
      ManualResetEvent[] events = new ManualResetEvent[count];
      for (int i = 0; i < count; i++) {
        events[i] = new ManualResetEvent(false);
        ThreadPool.QueueUserWorkItem(callback, new ThreadData(queue, events[i], results));
      }
      return events;
    }

    static void Main(string[] args) {
      DateTime start = DateTime.Now;

      Console.WriteLine("create the shared queue");
      LockFreeQueue<int> queue = new LockFreeQueue<int>();

      Console.WriteLine("create the shared results array");
      int[] results = new int[Globals.TopValue + 1];

      Console.WriteLine("create the completion events");
      var enqueuersDone = QueueWorkItems(Globals.EnqueuerCount, queue, results, EnqueueEngine.Execute);
      var dequeuersDone = QueueWorkItems(Globals.DequeuerCount, queue, results, DequeueEngine.Execute);

      Console.WriteLine("wait for the enqueuers to be done");
      ManualResetEvent.WaitAll(enqueuersDone);

      Console.WriteLine("signal the dequeuers to stop, and wait");
      Globals.NoMoreData = true;
      ManualResetEvent.WaitAll(dequeuersDone);

      Console.WriteLine("check that all values were dequeued");
      for (int i = 1; i <= Globals.TopValue; i++) {
        if (results[i] != 1)
          Console.WriteLine(String.Format("Error: {0} was never dequeued.", i));
      }

      DateTime end = DateTime.Now;
      Console.WriteLine("Done");
      Console.WriteLine(end - start);

      Console.ReadLine();
    }
  }
}

Running this test app shows the lock-free queue works well and correctly in a multithreaded environment on a four-core machine.

References

  • Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms. Maged M. Michael & Michael L. Scott. (pdf)
  • Concurrent Programming on Windows. Joe Duffy. (Amazon)

 

Album cover for DecksandrumsandrockandrollNow playing:
Propellerheads - History Repeating
(from Decksandrumsandrockandroll)


More on the lock-free stack/queue: memory fences

Last time I talked about the lock-free stack and queue, I was more concerned about proving that my code was free from the ABA problem than anything else. In making my argument I naturally glossed over such niceties as the .NET memory model and so assumed (pretty much) a sequential memory model, because, to be honest, that's how we think as programmers.

Spring path with picket fenceConsider the following set of statements in C#:

      int foo;
      bool fooHasBeenSet = false;
      // ...
      foo = 42;
      fooHasBeenSet = true;

Now, you and I when we read this will naturally assume that the assignment statements occur in the order shown. In fact, that assumption is more along the lines of a tautology, a duh! statement. It can't be otherwise, surely? The foo variable will be set to 42 and then the boolean variable will be set to true. This assumption is known as the sequential memory model, and it is the strictest memory model we can use.

As it happens, on modern machines, this assumption could be false. .NET allows the compiler or the JITter or the hardware to rearrange statements as they see fit, so long as the rearrangement has the same overall semantics as the code as written. The reason for allowing this is to improve the overall efficiency of the code given the presence of pipelined CPUs, caches, and the like. So in fact the boolean variable could be set before the int variable. The boolean variable may not be set until we actually want to read it, ditto foo. We couldn't tell because as soon as either variable was read after the relevant assignment, we'd see the correct value.

Now in a single-threaded program, there's absolutely no way to find out if the order was changed. We can't throw in a Console.Write(value); immediately after the assignment to the int variable, since we would just see the correct value of the variable (but note, the assignment to the boolean could still occur prior to that). In essence, in a single-threaded application, all this talk of rearranging control flow to suit the compiler or JITter or hardware is just navel-gazing. We can't tell, and so it doesn't matter.

In a multi-threaded application though, it does make a difference. A big one.

In a multithreaded environment, one thread could set the boolean before the int variable due to this allowed rearrangement in control flow. Another thread could then see fooHasBeenSet as true, but still see the old value of foo, with presumably dire results. But, wait, there's more! In a modern PC, main memory is too slow and so there's one, generally two, levels of memory cache, known by the helpful names of L1 and L2. L1 is hyper fast, very small and is close to the CPU. L2 is much faster than RAM but slower than L1, and is a few megabytes in size. (For example, on my desktop, L1 cache is 32KB for data and 32KB for code, and the L2 cache is 4MB. I have 8GB of RAM.) The issue here is each core has separate caches. When you read data, you may be reading it from the cache for that CPU, not the shared memory for the machine. One thread may update a variable (that is, some memory slot), and another thread may not even see that it has been changed since the old value is in the cache of the core that's executing that thread.

How the heck do multithreaded applications work then? it sounds like they should all completely fail with invalid unsynchronized data.

In particular, going back to the lock-free stack and its Pop() method,

    public bool Pop(out T item) {
      SingleLinkNode<T> node;
      do {
        node = head.Next;
        if (node == null) {
          item = default(T);
          return false;
        }
      } while (!SyncMethods.CAS<SingleLinkNode<T>>(ref head.Next, node, node.Next));
      item = node.Item;
      return true;
    }

head.Next could be in the L1/L2 cache for a particular thread, but in fact be altered by another thread. The test to see whether head.Next to be equal to node in the CAS method could just be true for that particular thread because both variables happen to be in the same cache. We could pop the same node off the stack twice! Man, this is even worse than ABA, we are so hosed.

Luckily for us, all is not lost. Without going into too much detail (see this MSDN article if you want more), the hardware contains primitives to ensure data coordination and synchronization. These primitives go under the name of memory fences (or barriers) and since .NET only supports full memory fences, that's all I'll talk about here. To put it in layman's terms, a memory fence resets the caches and ensures that all pending memory writes and reads have completed. Deferred reads can't cross the memory fence (the fence forces them to happen), and memory writes must happen before leaving the fence. That is, a memory fence will ensure that, in our original code example, the second thread will "see" the assignments of the data in the correct order, and that in the Pop() code that head.Next is read from RAM.

A memory fence is therefore a slow operation. By slow, I mean in comparison with the normal running of code and fetching data that a core CPU does. It's certainly not slow like reading from a disk. A few hundred cycles, perhaps; nothing you'd notice. It forces everything, at least for that moment, to be synchronized across all threads.

How do we impose a memory fence to ensure synchronization? By far the most traditional way is to use a lock. This is the granddaddy of all memory fences in a way, since a lock is all about the explicit synchronization of threads. I'm sure, just like me, you never even thought about a lock including data synchronization — you just assumed it was so and never thought about any other possibility. But my lock-free stack and queue don't use locks.

Another memory fence is provided by the interlocked primitives, including the .NET methods that call them. And, since it uses the Interlocked.CompareExchange method, the CAS method I wrote includes a memory fence. The global value of head.Next is forced to be updated by the completion of the CAS method, and the CAS method is forced to read its value anew from shared RAM. Phew. The lock-free stack code still works.

Album cover for Blade Runner Now playing:
Vangelis - Blade Runner Blues
(from Blade Runner)


Someone is wrong on the internet — the lock-free stack edition

Getting on for 4 years ago, I wrote a series of posts on lock-free containers in .NET, like the lock-free stack and the lock-free queue. They were fun to write, mainly because of the difficulty of the topic and trying to rationalize why early versions didn't work, and so on so forth. Along the way I learned about the .NET memory model, volatile, memory fences, and other arcana, but in the end I had something that worked and worked well. Over the years since then I've had many emails from readers about lock-free containers and I've been happy to have been linked from many different places, so much so that currently I'm number 1 for "lock free stack" and queue on Google.

Dunce's Chair And then yesterday, I suffered from "Someone is wrong on the internet" syndrome. Big time.

Someone, let's call him GC after the garbage collector in .NET (since as you'll see he demonstrated that he doesn't understand it) sent me an email. It seems GC had been reading up on the ABA problem in wikipedia as part of learning about lock-free containers and then found my lock-free stack and queue implementations. Without taking any time to understand my implementation or even consider the differences between the environment for the code in the wikipedia article and that for my code, someone was wrong on the internet, and that person was me. An email was duly written to me saying my implementation was broken. You see, the wikipedia article had a broken lock-free stack, therefore all lock-free stacks were broken.

Huh? Let's see: we have an anonymous author on wikipedia demonstrating some code that was broken and a named author (that would be me) showing off some similar code that purported to be working, and therefore the named author was automatically wrong? Wow.

Sure the person who's willing to put his name to some technical article could be wrong, but isn't it likely that, you know, he could be right (by naming himself he's staking his reputation on what he writes after all — that stake isn't so visible on wikipedia) and that you were missing some information? So, wouldn't you be, like, "I think I'm misunderstanding something. I read this on wikipedia and your code doesn't seem to do anything about this problem. Could you help me understand why?" And I'd be all, sure, let me explain why the wikipedia problem, although a valid problem, doesn't apply to the code I wrote, and we'd have this illuminating conversation. It's happened in the past, and it'll happen in the future, and I'm happy to do it.

But no. Someone was wrong on the internet and it was obviously me.

GC sends me this email that said things like "[Your implementations of the lock-free stack and queue] are busted.  You have not solved the ABA problem." and then rehashed the wikipedia article to prove I was wrong, only to terminate with a snide "You might consider a disclaimer on your web site . People trying to use your algorithms may believe they are getting atomic lock-free algorithms, only to discover their applications crash (in the field) in ways that are almost impossible to reproduce deterministically." (No, I'm not sure what the word atomic means in there, either.)

To put it mildly I was shocked that someone who obviously hadn't understood any of the issues with lock-free programming (and I don't pretend to know them all) or the ABA problem was willing to spend five minutes reading up on it and then ten minutes writing an email to a stranger saying that said stranger was wrong despite the fact he could have downloaded and run the code that showed said stranger was right. It blew me away.

Take a look at the wikipedia article on the ABA problem, especially the example. It's a not very well written lock-free stack (there's hardly any encapsulation) written in C++ and exemplifies a standard ABA problem. To recap, ABA is all about the invalid reuse of memory before all threads have finished with that memory. The thing about the lock-free stack code on wikipedia is that, in order to show off ABA easily, it exposes the nodes that go to make up the stack to the callers. Obviously the callers can then do silly things to said nodes, including deleting them (that is, deallocating them) and reusing them inappropriately. And by doing so, it's then very easy to show the ABA problem. If you like, the code was a pedagogical device, not an example of how to write a lock-free stack or that all lock-free stacks are flawed. Unfortunately for my friend GC he wasn't experienced or equipped enough to understand the difference.

Here's the full code for the lock-free stack that I wrote.

  public class LockFreeStack<T> {

    private SingleLinkNode<T> head;

    public LockFreeStack() {
      head = new SingleLinkNode<T>();
    }

    public void Push(T item) {
      SingleLinkNode<T> newNode = new SingleLinkNode<T>();
      newNode.Item = item;
      do {
        newNode.Next = head.Next;
      } while (!SyncMethods.CAS<SingleLinkNode<T>>(ref head.Next, newNode.Next, newNode));
    }

    public bool Pop(out T item) {
      SingleLinkNode<T> node;
      do {
        node = head.Next;
        if (node == null) {
          item = default(T);
          return false;
        }
      } while (!SyncMethods.CAS<SingleLinkNode<T>>(ref head.Next, node, node.Next));
      item = node.Item;
      return true;
    }

    public T Pop() {
      T result;
      Pop(out result);
      return result;
    }
  }

It's slightly different from the code in my original post since it implements two different Pop() methods, one that returns true/false plus the item, the other that just returns the item or default<T> if the pop failed. Also the code shown in my article simplifies the CAS call syntax. Nevertheless all the main features are there.

Notice that it is the stack that is in charge of allocating nodes (this is in Push()). There is no "deallocation" of nodes because .NET doesn't do deallocation; instead it's the garbage collector (the real "GC") that comes along every now and then and compacts the heap, removing all objects that are no longer referenced. (A point that my friend GC was unable to appreciate given that he was fixated on the correctness of the wikipedia article.) The main cause of ABA in the wikipedia code is simply not there since the nodes are completely encapsulated inside the stack code. Nevertheless, there could be an ABA problem solely inside the stack. Let's destroy that notion.

Imagine the following scenario. The stack looks like this: head -> A -> B -> C -> etc, and thread 1 is in the middle of a pop operation. Unfortunately, it's a really slow thread. It takes a copy of the value of head.Next; in this case a reference to the node A. Slowly, but surely, it enters the CAS. The code evaluates the address of the head.Next field (it's a ref parameter), the local variable node (which is a reference to the node A), and whatever node.Next evaluates to, which is a reference to the node B, we hope. Leisurely, it executes the underlying operation to the CAS method (which happens to be cmpxchg, another point that old GC was sure to point out to me to show off his knowledge).

Meanwhile, other threads are whipping through their execution, pushing and popping items (and therefore nodes) off the stack. The node A gets removed, long before thread 1's execution actually does the cmpxchg. Or, even. horror, before thread 1 evaluates A.Next.

We're obviously well into the doom scenario from the wikipedia article, surely? Nope, not even close. This was bubba GC's argument too: the memory for node A could get reused! It could be added back onto the stack! Thread 1's view of what its reference to A points to could be completely different! It could point to another node on the stack! If thread 1 continued you could lose many other items on the stack! YOU HAVE THE ABA PROBLEM, DOOFUS!

OK, perhaps he didn't quite go so wild, but that was the thrust of his argument (and indeed the wikipedia argument too). Unfortunately, in the C# code I presented (and present above) the basis of the argument is specious and fallacious. Why? Because node A never gets deallocated and therefore cannot be allocated again. The reason for that is that thread 1 still contains a reference to A and the object is live. The GC will therefore never collect it, even if I put code like if (CurrentThread == thread1) Sleep(ABloodyLongTime); in the middle of the Pop() method. The reference to A only goes away if we go round the loop and read head.Next again, or if we succeed in the CAS and the reference is discarded at the final brace for the method (that is, goes out of scope). It's only when all references to A in all threads have gone that the GC will collect its memory.

However, to be fair, there could be other ABA problems. Even though A won't get freed/reused while thread 1 has it, the data within it (Next and Item) may change with similar ABA results. Let's consider Next. Apart from head.Next, the only place in the stack code where a node's Next field gets set is in Push(), and that only ever happens on a brand new, just-newed-up node. Since I've shown that the memory for A cannot get freed while thread 1 has a reference to it, it can't be reallocated, and so this scenario can't happen either. (In fact, there's an even stronger conclusion: A.Next will always reference B.)  A similar argument can be made for a node's Item field. The final case is that the outside world gets hold of nodes and does something silly with them. As I pointed out above, that doesn't happen here either.

All in all, there is no ABA problem with the code as written above. The main reason for this avoidance of such a large issue is .NET's garbage collection. It ensures that we completely avoid the problems of reusing memory that one or more threads could still be using.

But, sadly, my friend GC refused to see this reasoning. He ignored it completely, because it didn't fit into his wikipedia worldview. In fact he helpfully wrote back a final time to say "Wow, what a fountain of crap.", and "Perhaps it’s difficult for you to accept that all of that work [that is, my articles and code] is so fundamentally flawed. But, I assure you, it is.", and "Your simple stack class is the poster child for the ABA problem." (Yes, he discovered bold.) And then, for good measure, he went on to mention his "co-workers at Microsoft", especially those in the "Many Core" team who seem to be uproariously laughing away at me whenever they take a break. (I'm guessing he thought I'd be intimidated by this or something.)

Sad to say, GC does indeed work at Microsoft. Happy to say, though, he's not an actual employee just a contractor in the Project Management Excellence (PMX) group, although he's obviously falling short on the excellence part.

Now playing:
Vegas - Wise guy
(from Vegas)

PCPlus 262: Best Data Use

I write a monthly column for PCPlus, a computer news-views-n-reviews magazine in the UK (actually there are 13 issues a year — there's an Xmas issue as well — so it's a bit more than monthly). The column is called Theory Workshop and appears in the back of every issue. When I signed up, my editor and the magazine were gracious enough to allow me to reprint the articles here after say a year or so. After all, the PDFs do appear on each issue's DVD after a couple of months.

PCPlus logo (Before you ask: no article of mine appeared in PCPlus 261. There was some kind of delay on my side — probably our annual vacation to England — and I didn't complete this article in time for that issue, despite the simplicity of the topic. So, it got pushed over to issue 262.)

OK, wacky article title, not mine, for what is essentially an article on writing an efficient stack and queue. This discussion came about from writing a lock-free stack and lock-free queue: these particular structures are the best way of implementing a stack and queue if you want to make them lock-free.

But, really, when all's said and done, very much a beginner's article aimed at the hobby programmer.

This article first appeared in issue 262, December 2007.

You can download the PDF here.

Album cover for Greatest Hits Now playing:
Mott the Hoople - All the Young Dudes
(from Greatest Hits)

Search

About Me

I'm Julian M Bucknall, the M because it's my middle initial and because I and the other Julian Bucknall (the movie guy) would like to differentiate ourselves.

I'm a programmer by trade, an actor by ambition, and an algorithms guy by osmosis. I write articles for PCPlus in my spare time, not that there's much of that.

Julian M Bucknall Apart from that, an ex-pat Brit, atheist, microbrew enthusiast, Pet Shop Boys fanboy, slide rule and HP calculator collector, amateur photographer, Altoids muncher.

DevExpress

I'm Chief Technology Officer at Developer Express, a software company that writes some great controls and tools for .NET and Delphi. I'm responsible for the technology oversight and vision of the company.

Validation

Validate markup as HTML5 (beta)     Validate CSS

Bottom swirl

Archives

February 2012 (4)
SMTWTFS
« Jan  
1234
567891011
12131415161718
19202122232425
26272829

Like this Archive Calendar widget? Download it here.

Social networking

Google ads

The OUT Campaign

The OUT Campaign

My Tweets

Bottom swirl