Google Summer of Code 2018: Ruby

My Google Summer of Code project for Ruby is coming to an end. In a nutshell, the project was to add type annotations to Ruby. When I first approached this problem, my thoughts were “this is just a syntatical change, this shouldn’t be hard”. Haha, yeah, that wasn’t quite true. While I did end up adding type annotations, I didn’t manage to complete all of my goals, and the ones that I did complete, I did by the skin of my teeth. Why couldn’t I complete my goals? What happened? Was I simply too lazy? Well yes, but wait, there’s more!

When I first started the project, I had never seen the Ruby codebase before. Frankly, I didn’t really think I was going to get accepted to GSoC, as the precise feature I proposed had already been rejected by Matz. But to my surprise, I was accepted into the program, for which I am endlessly grateful to my mentor Tony Arcieri.

So when GSoC started, I forked the repo, cloned it, and immediately attempted to understand the parser. The Ruby parser is a LALR parser generated by Bison contained in a single file parse.y that numbers over 11k lines. For context, the only file larger than parse.y that are actual source code files (not counting JIT headers, configure files or libraries) is io.c at 13,270 lines.

Anyways, I began reading parse.y and attempting to understand it. The first mistake I made was attempting to understand the entire file. With a codebase of this size, basically any function call or macro will lead you down a rabbit hole. I remember trying to trace the YYMALLOC calls, which lead to having to understand various structs with other nested structs, and having to understand how Ruby implements its own heap, and other various minutae. I attempted to do this for a bit. What didn’t help was that since parse.y is not strictly speaking a C file, all of my navigation tools like ggtags didn’t work.

In the meanwhile, I also taught myself the basics of Bison. I already knew some stuff about parsing from Crafting Interpreters and had written some simple recursive descent parsers before. Therefore, Bison wasn’t too much of a reach. I will say that Bison’s overall documentation isn’t fantastic. Often times when I learn a new tool, I read a couple different tutorials. That way a topic that I may have found confusing in one tutorial will be reinforced by a different tutorial. Plus, if I get bored with one tutorial I can always switch to a different one. For Bison, first there’s a real dearth of tutorials. Sure, there’s a lot of slides and theoretical documents on Bison, but there’s not a lot of hands on, real world information.

Thus, I ended up using the Bison manual a lot, which was pretty informative. However, even the most complicated example in the Bison manual was essentially a fancy calculator. Compared to the complex beast that is the Ruby grammar, that’s not exactly equivalent.

Sidenote, I’ve noticed a severe lack of hands on, cookbook style books or tutorials on programming language topics (other than Crafting Interpreters). A book that would cover topics like “here’s how to make a package manager” or “how to write a simple typechecker” would be extremely valuable. I get that books like the Dragon book are great and all, but I really don’t want to read 100 pages on parsing theory just to get to the point where I can write a simple parser.

Anyways, armed with some basic Bison knowledge, I decided to dive into parse.y again. My mentor, Tony, gave me some nice resources, including a parser written by whitequark. In addition, whitequark helped translate a chapter from the Ruby Hacking Guide, a book in Japanese about the Ruby source code. This chapter covered the Ruby lexer, which, not gonna lie, is pretty insane. Essentially, the lexer requires a lot of context specific information, such as whether the lexer is lexing a function name, whether it’s at the beginning of an expression, etc. The way that this is done is through a finite state machine. The parser sets the lexer state at various grammar rules. Depending on the lexer state, you’ll get different tokens or even errors, so this is pretty important. I’ll return to this for…reasons.

With all of this info, I started to take some notes on the parser. While I didn’t end up getting too far with the notes, this was pretty useful to organize my thoughts. Eventually, I figured out enough about the grammar to add a new option to f_norm_arg, essentially the rule for a normal argument (Ruby has a few different kinds of arguments, such as keyword args and default args). Tony and I had decided in a colon to delimit types. Interestingly enough, Ruby doesn’t have a tCOLON token. Instead, the Bison file just uses a literal ‘:‘. While that’s fine for the ternary operator (the only case where it’s used), we wanted an explicit token for the type rule. I added a tCOLON token, which was pretty easy, since the lexer code outside of the states is pretty simple.

I also noticed some interesting decisions with the grammar rules. For instance, Ruby does not have a function declaration grammar rule. Instead, it is grouped along with a bunch of other rules in a gigantic primary rule. This means that changing function grammar rules can be kind of tricky.

However, it was after implementing argument type signatures where I got stuck. You see, adding the annotations to the parser was only the first part of the project. The second, and arguably more important part was to add output to Ripper, Ruby’s parser library. After all, these annotations are meaningless unless a third party can utilize them. Ripper provides this third party access. It allows Ruby developers to parse Ruby code into an s-expression form. And from the outside, adding the type annotations to Ruby should be rather simple, no?

Well, no. I already discussed some of the issues with Ripper in this post. Basically, Ripper uses an embedded domain specific language within parse.y which is then converted to ripper.y with a quite ridiculous Ruby script, which is then run through Bison and converted into a .so file which is then imported by Ripper. This took a while to figure out, as there wasn’t much existing documentation on Ripper.

There were some other interesting aspects of Ripper, which I might get around to writing about, such as the dispatch macros that would call Ruby functions manually into the VM. Or the metaprogramming aliases that link to these functions. It’s a bit crazy.

Anyways, while this was happening, I decided to take a short break and implement return type signatures. This also ended up kind of tricky. I’d implement the return type and get a lot of weird syntax errors (kind of funny that syntax errors imply an issue in the interpreter, not an issue in the code). For instance, I’d get the following error:

ruby/lib/fileutils.rb:1654: syntax error, unexpected :: at EXPR_BEG, expecting keyword_end (SyntaxError) {|...

The code for context:

	  def _do_nothing(*)end {|name| alias_method name, :_do_nothing}

Which…by the way is not a great error message. If I received that without knowing that EXPR_BEG is a lexer state, I’d find that pretty confusing. Anyways, I struggled with that one for a while. I tried moving around the various existing actions, changing the lexer state, etc. I even added a unique token (‘=>>’ or tDASSOC) and used that as the return type symbol:

'(' f_args rparen tDASSOC type_sig

Given that this should be completely unambiguous, it was kind of frustrating that it didn’t work. Furthermore, the error message didn’t even make much sense. Using the debugger mode for the Ruby parser, I figured out that when the code was parsed correctly, the lexer was in the EXPR_BEG state. So…yeah.

After banging my head against this wall for a bit, I decided to look around for other people’s implementations of type annotations. Eventually I stumbled upon this branch on GitHub’s fork of Ruby. To my surprise, when I implemented the rules in this format, it worked! I had to do some minor tweaking, as how Ruby sets lexer states changed since 2.1. I still don’t exactly know why this works. My best guess is that it has to do with the p->command_start = TRUE line I put in, since I’m fairly sure that’s the main difference.

I will admit, it’s pretty bittersweet that I didn’t get to figure out this on my own, but simply relied on someone else’s code. On the other hand, I’ll take the working code.

After that, I continued my work on Ripper. I tried a few different approaches, such as adding an annotation to the DSL for type signatures, using an existing one for constant path references, such as Foo::Bar, etc. None of them stuck.

Then I examined how Ripper actually does function arguments. Interestingly enough, function arguments are done by calling a function called new_args, which, depending on whether or not you’re parsing for Ripper or for the interpreter, goes to different functions (this is an unfortunate pattern in parse.y that is achieved via the C preprocessor). This function is essentially an alias for a dispatch with params as the function name. Which via some sort of magic gets turned into an sexpr with params as the name. Pretty much straight forward as far as Ripper is concerned.

However, one thing I noticed, was that the arguments were of type VALUE, which is pretty much the least useful type in existence. It’s a u_intptr_t which doesn’t print out nicely and tells you zero about what’s actually going on. Interestingly enough, all the grammar rules output VALUE's in Ripper, which makes even less sense. After all, how do you connect the values into a tree, like a normal parser?

I then tried some other modifications, such as changing the function calls, printing out various info, etc. None worked. None even changed the output of Ripper. What then dawned upon me is a simple yet slightly horrifying realization. What if I’m testing the wrong Ripper? Now, the reason this hadn’t occurred to me before was fairly simple. When you give Ripper incorrect syntax, it returns nil instead of the s-expression syntax (which is terrible for error handling by the way). Since Ripper wasn’t returning nil for the type annotations, but it was returning nil when I tried using the type annotations on the trunk branch, I figured I was using the correct version of Ripper.

BUT, there’s a caveat here. Since Ripper relies on a .so file, which it imports via Ruby’s require, I realized that it’s quite possible that it’s importing an old .so file. Now, this wouldn’t explain why none of my changes are showing up while the type annotations parse correctly, but I supposed it was worth a shot.

At that point, I decided to try to make a trivial change to Ripper on the main trunk branch and see if it is visible. I choose a fairly simple one: since Ruby function arguments don’t require parenthesis, I just removed the paren! annotation in that rule. Simple. I compile, build a new version of Ruby, and boom, no change. Therefore, there should be something wrong with the version of Ripper that I’m using, no?

I then examine the require path. I add a basic print statement to print out the location of the file once it’s been required. And…nope, it’s the right file. I even switched it to a load with the exact unambiguous path and it still didn’t work. So yeah, that’s where I ended. I have no clue why my changes aren’t appearing.

Frankly it’s been a long and frustrating process. I do want to finish this project, but right now I kind of want a break. I’ve had to do a lot of detective work and careful reading and rereading.

If I had to give myself some feedback, I’d say the following:

  • Take notes about everything. Keep a journal that you can update about all the different methods you’ve tried.

  • Do more preliminary research before trying to write stuff on your own. Email people, talk to others and get an idea of the context that your project is in.

  • It’s okay if you don’t understand everything. Isolate the subset that you want or need to understand and focus on that.

  • Try to do something else instead of just being stuck or procrastinating. I definitely had some times where I should have just tossed some part back and worked on a different aspect of my code. But instead I tried to just keep on grinding and would end up frustrated.

  • Always do the simplest possible modification first. Don’t try anything fancy.

On the other hand, if I were to give some suggestions about Ruby (as a highly inexperienced insolent developer), here are a few ordered from possible to less possible:

  • parse.y seriously needs to be split up and cleaned up. Something as simple as some comments and a little reorganization would do wonders.

  • Ripper is very very confusing and needs to be either refactored or rewritten entirely. Adding a DSL via comments and overriding functions using the C preprocessor just to add an API into the parser is not exactly great. It’s confusing, hard to modify and I’m not even sure how many people are actively using it. Plus there’s little to no documentation. As far as I can tell, Ripper#parse doesn’t even do anything!

  • The Ruby parser should be decoupled from anything else. I’ve noticed a lot of intermixed code in parse.y. There’s various connections to compilation, some testing, the Ripper manual calls into the Ruby virtual machine, etc. In my view, the parser shouldn’t be concerned with all of this. The parser should parse to a simple, platform independent, usage independent format. After that, what the interpreter does with the format is out of the parser’s purview. It can compile it to bytecode, export it via a library, heck it could turn it into a visualization. I understand that this was not done by design and that there are probably a lot of excellent reasons why the parser was written this way. But at this point, it’s pretty tricky to write extensions for Ruby. If the parser was independent of the bytecode compiler, I could conceivably see extensions to Ruby that just act like a pipeline for the parse tree:

  Parse Tree -> Typechecker -> Parse Tree

Or even:

  Parse Tree -> Static Analyzer -> Parse Tree
  • The Ruby parser kinda needs to be rewritten. Sure, when Ruby was first written, Bison was probably the best option for parsing. But there’s a lot of baggage that comes with Bison. It’s hard to debug, it’s nearly impossible to introspect (tooling is straight up non-existent), and it imposes a lot of odd constraints (e.g. single file). Sure, this is pretty presumptuous of me to say, as I’m not even half as skilled as the rest of the Ruby team, but comparing the code for the new MJIT to parse.y is kind of depressing. The MJIT code has comments, clear names, goto definition works, etc. The parser relies on stateful lexing, comment DSLs and various other hacks to work around Bison’s limitations. The old school YY- names aren’t very clear either. A new parser written by hand would allow for significantly better error handling, better ways of resolving stateful lexing (communication between the lexer and the parser)) and would be a lot easier to read. Heck, I’d do it if I weren’t A. very inexperienced, B. currently in school and C. interested in other projects.

I’d like to give my thanks to Tony, my mentor for being extremely helpful in his advice. I’d also like to thank Yusuke Endoh (mame) for taking the time to answer my questions. Furthermore, I’d like to thank Matz and the Ruby team as a whole for making such a great language. It’s been an invaluable experience learning about one of my favorite languages. Finally, I’d like to thank Google for funding these projects. It’s pretty awesome being able to get paid for open source contributions. I hope that I can continue to contribute to both Ruby and to other important open source projects.

Thanks, Nicholas