You may have heard that a compiler uses a data structure called an abstract-syntax-tree, or AST, when it parses your program, but it’s less common knowledge that an AST is normally used just at the very start of compilation, and a powerful compiler generally uses a more advanced type of data structure as its intermediate representation to optimize your program and translate it to machine code in the later phases of compilation. In the case of TruffleRuby, the just-in-time compiler for Ruby that we’re working on at Shopify, this data structure is something called a sea-of-nodes graph.
I want to show you a little of what this sea-of-nodes graph data structure looks like and how it works, and I think it’s worth doing this for this for a couple of reasons. First of all, I just think it’s a really interesting and beautiful data structure, but also a practical one. There’s a lot of pressure to learn about data structures and algorithms in order to pass interviews in our industry, and it’s nice to show something really practical that we’re applying here at Shopify. Also, the graphs in sea-of-nodes are just really visually appealing to me and I wanted to share them.
Secondly, knowing just a little about this data structure can give you some pretty deep insights into what your program really means and how the compiler understands it, so I think it can increase your understanding of how your programs run.
I should tell you at this point that I’m afraid I’m actually going to be using Java to show my examples, in order to keep them simpler and easier to understand. This is because compared to Ruby, Java has much simpler semantics—simpler rules for how the language works—and so much simpler graphs. For example, if you index an array in Java that’s pretty much all there is to it, simple indexing. If you index an array in Ruby you could have a positive index, a negative index, a range, conversion, coercion, and lots more—it’s just more complicated. But don’t worry, it’s all super-basic Java code, so you can just pretend it’s pseudo code if you’re coming from Ruby or any other language.
Lets dive straight in by showing some code and the corresponding sea-of-nodes graph.
Here’s a Java method. As I said, it’s not using any advanced Java features so you can just pretend it’s pseudo code if you want to. It returns a number from a mathematical sequence known as the Fibonacci sequence, using a simple recursive approach.
Here’s the traditional AST data structure for this program. It’s a tree, and it directly maps to the textual source code, adding and removing nothing. To run it you’d follow a path in the tree, starting at the top and moving through it depth-first.
Abstract syntax tree for the Fibonacci sequence program
And here’s the sea-of-nodes graph for the same program. This is a real dump of the data structure as used in practice to compile the Java method for the Fibonacci sequence we showed earlier.
Sea-of-nodes graph for the Fibonacci sequence program
There’s quite a lot going on here, but I’ll break it down.
We’ve got boxes and arrows, so it’s like a flowchart. The boxes are operations, and the arrows are connections between operations. An operation can only run after all the operations with arrows into it have run.
A really important concept is that there are two main kinds of arrows that are very different. Thick red arrows show how the control flows in your program. Thin green arrows show how the data flows in your program. The dashed black arrows are meta-information. Some people draw the green arrows pointing upwards, but in our team we think it’s simpler to show data flowing downwards.
There are also two major kinds of boxes for operations. Square red boxes do something; they have a side-effect or are imperative in some way. Diamond green boxes compute something (they’re pure, or side-effect free), and green for safe to execute whenever.
P(0) means parameter 0, or the first parameter. C(2) means a constant value of 2. Most of the other nodes should be understandable from their labels. Each node has a number for easy reference.
To run the program in your head, start at the Start node at the top and move down thick red arrows towards one of the Return nodes at the bottom. If your square red box has an arrow into it from an oval or diamond green box, then you run that green box, and any other green boxes pointing into that green box, first.
Here’s one major thing that I think is really great about this data structure. The red parts are an imperative program, and the green parts are mini functional programs. We’ve separated the two out of the single Java program. They’re joined where it matters, and not where it doesn’t. This will get useful later on.