Theme NexT works best with JavaScript enabled

[ 2014] Cores unleashed - Exploiting Parallelism in Ruby with STM



  1. Good morning. Thank you for being here with me. I know it’s tough to get up this morning especially after the party last night, since I personally struggle to do so. But don’t worry, this talk is not as complex as it may seem. I try to make it as simple as possible so I wouldn’t screw it up even with serious hangover symptoms like what I have right now.

  2. Okay, let me briefly introduce myself. I’m 許斯凱. You can find me on Twitter and Github by my id @brucehsu.
    I’m a student from National Chung Cheng University, major in Computer Science and currently on the second year of my Master’s Degree program.

  3. This is a post I found few days ago, which pretty much marks the time I began to write Rails. That was mid-2007, I was a 11th grade student. I’ve been using and loving Ruby ever since.

  4. Let us begin with a short story. There was a time

  5. When we, as programmers, were happy. Well, at least relatively happier. Because even if we write some terribly slow programs

  6. Moore’s Law tells us that our programs will run twice faster in 18 months.

  7. Life was good back then.

  8. But the tide has changed when the chip clock rate hit the brick wall of physics laws. Manufacturers no longer can elevate clock rate without burning the entire computer down, therefore they begin to build these shiny new multi-core processors then claim the overall performance correlates to the number of cpu cores.
    This kind of throw the problem back to us programmers. It’s our fault if we cannot leverage the advantage of these processors. So our good days are over.

  9. We need to adopt the parallel paradigm instead of old sequential design. Unfortunately, this can be difficult and problematic. Let me give you an example. Say our program manipulates a large list with four threads. Everything is alright in the ideal scenario: each thread works just like they’re independent from each other.

  10. But what if, some of them try to modify the very same element? The result is unpredictable and likely to be wrong. These data synchronization issues are what make writing parallel programs hard. To deal with them, various mechanisms are developed. Among those, Locking may be the most common one.

  11. Just like the name infers, the idea is to restrict the data from being changed by other possible access. A straight-forward solution is to lock the entire list whenever a thread is using it. This guarantees the list can never be modified by others during the period of thread manipulation. In other words, only one thread can use the list at a time. Doesn’t that sound familiar? Yes, the biggest downside of this approach is that it literally becomes sequential execution again.

  12. So how do we solve this dilemma? The key lies in the granularity of the lock. What we can do here is to divide a single coarse-grained lock into several fine-grained locks. In this case, we can place a lock inside every element instead of only one for the entire list. Each thread can now freely modify any element unless they bumped into each other when trying to use the same element. Only then will the execution order be in a first-come-first-serve fashion.

  13. By now you may be wondering about applying parallelism on your existing ruby scripts. If you already did so, you might find a surprising fact. No matter how many threads you use, how well designed is your script, the official Ruby implementation, MRI, would only use up one core of the processor. We actually mentioned the reason of this just few slides ago.

  14. MRI uses a global interpreter lock to simplify synchronization issues. Let me quote what’s on wikipedia here. The idea behind this is just like the first example introduced earlier. The downside is pretty much the same, only one thread is allowed to run at any given time.

  15. If you really want to make use of the rest of cores, instead of process forking in MRI, most alternative Ruby implementations now use fine-grained locks. JRuby, Rubinius, MacRuby, you name it. I don’t know if Topaz which built on PyPy does the same though. But there’s a subtle problem of fine-grained locks I didn’t mention earlier. Of course, we don’t want to have a lock that is too coarse-grained, yet locks that are too fine-grained are quite likely to introduce performance overheads. Let’s take Rubinius as an example, though I have to confess that I know little about it.

  16. This is a benchmark on the internet. You can follow the link for further details. We can see from this chart that Rubinius 2.0dev was significantly slower than Rubinius 1.2 which uses GIL. In some certain benchmark scripts, like ones testing thread synchronizations, 2.0 was almost 70 times slower. This is a perfect example to show just how difficult it is to have an adequate lock design.

  17. Besides that, using locks doesn’t guarantee it’s free from synchronization hazards. Let’s look at the classic example of transferring between bank accounts. As we all know, transfer consists of two operations: first withdraw given amount of money from the original account, then deposit to destination account. To avoid the account from being modified when we’re performing these two operations, we locks the origin and the destination respectively. While it sounds rights, we cannot be sure what will we get when we access these accounts in the lockless window between actual withdraw and deposit. The next thing that comes to our mind is to combine these two into a single transfer operation and lock both accounts during transfer.

  18. Now it seems like our transferring program finally works correctly. However, the program will encounter a synchronization problem in a rare situation that is when both accounts try to transfer to each other. On doing so, both accounts will first be locked as origins, then the program will continuously fail to acquire permission to modify them as destination accounts. We may assign a global lock order to every account and locks them accordingly. But when we try to add a new feature, we need to rethink the flow all over again to avoid potential pitfalls.

  19. Simon Peyton Jones wrote these 6 reasons on why locks are bad in his article Beautiful Concurrency at 2007. So the question becomes: do we have other choice besides locking?

  20. Software Transactional Memory or STM may be one interesting alternative to look into. It borrows the concept of transactions from Database and applies it on memory manipulations. A modification on the variable is treated as a transaction.

  21. What is a transaction? The original definition consists of 4 attributes. Atomicity, Consistency, Isolation and Durability, or ACID for short. Atomicity means a transaction is either successfully committed or abort on failure. Consistency makes sure that data follow our integrity constraints. Isolation indicates each transaction is independent. Durability requires the effect to persist after commit.

  22. So STM takes these attributes and transforms them to fit its need. Simply put, in most STM implementations, we initialize a transaction, execute it, validate it, then we commit. The result follows the atomicity attribute, is either success or fail. The underlying implementation mostly uses locks, yes. But the way we execute a transaction makes it much easier and safer for programmers to utilize.

  23. In short, STM provides a sound abstraction to programmers. So

  24. How do we implementation STM in Ruby, you might ask. Let me ask you back:

  25. Why not in Go?

  26. And why not a new virtual machine?

  27. Well, many of my friends tell me that this is a terrible name. If you have any better idea, please get in touch with me or maybe file a issue on github afterwards.

  28. Well, this is embarassing. I have to apologize for this.

  29. Since my cat has been pretty lazy lately, I couldn’t meet the deadline. So he’s the one to blame. That’s why I’m not able to show you what it looks like. But I’m going to share some details about it and hopefully see how you think about it.

  30. I guess many of you might ask, why Go? There are various reasons I chose Go over other programming languages, like C, Rust or anything else. First of all, although Go is a compiled, static typed language, it does learn something from dynamic languages which makes it easier to pick up. Meanwhile it still has the performance advantage. With support of Google, the community is expanding fast. I’ll mention Goroutines and channels in a minute. But I guess the most important reason is that writing Go makes me feel like a hipster instead of a geek.

  31. Okay, let’s be serious. As I’ve mentioned, Goroutines and channels are probably the most important features why I chose Go. You may see channels as message queues. One can pass a variable to a channel for any goroutine to pick up and process. In my opinion it seems suitable for transaction dispatching.

  32. Back to RunVM. It is designed as a subset of tinyrb, which is a subset of Ruby. The architecture of VM itself is a stack-based machine. The bytecode mainly resembles YARV instruction set. Therefore the whole internal design is also similar to MRI though relatively simpler and meanwhile less robust. Thanks to Ruby Under a Microscope, the process is less painful.

  33. Last but not the least, let us quickly scan through the basic algorithm of STM here. Transactional Locking II is pretty much the basis of most STM implementations and algorithms. RunVM also uses TL2 as the fundamental algorithm to begin with.

  34. The VM will maintain a global timestamp while each transaction has its own local timestamp serving as revision. If a variable is modified, its revision will be set to the global timestamp. As you can see from here, transactions also maintain a read set and a write set. They will record how each variable is used and classify it as to read or to write at commit stage. Reading is no big deal, as long as the local revision is earlier than local timestamp, we can be sure the variable hasn’t been modified. If the variable has been modified, the transaction will abort and restart. For those in the write set, transaction will acquire a global lock and write the modified value back to memory at commit stage. Revision of each written variable gets updated in the meanwhile.
    Currently the way RunVM handles such operations is pretty naive. It only creates transaction on encountering Thread objects. For normal sequential segments, it stays in GIL mode. Variables are copied into local transaction scope. Since Ruby variables are basically objects, the memory overhead is high in this manner. Above all, the system still doesn’t work correctly so far.

  35. That’s the reason why I still keep this in a private repo. So anyone who likes to lend a hand or just to discuss about it, please free to grab me afterwards or find me on twitter.