Parse Solidity incrementally using tree-sitter

I recently wrote and released an incremental parser for Solidity using tree-sitter (built by Max Brunsfeld at GitHub). Let me tell you what it was like to work with tree-sitter to create it.

💡 At GitHub they use tree-sitter as one of the core components for code highlighting and code navigation.

I recently learned about tree-sitter and became super excited about it!
What is tree-sitter? It’s a parser generator and incremental parsing toolchain.

And it’s:

  • well maintained
  • super fast (and incremental 🎉)
  • easy to test
  • used by cool projects like semantic and semgrep for analysis magic

As luck would have it, I was on the lookout for a Solidity parser to augment Vertigo. So this was the perfect opportunity to take tree-sitter for a stroll.


🧑‍🎓 Learning tree-sitter

Before writing a grammar, I first needed to become familiar with tree-sitter; how it works, and how to write a grammar that tree-sitter can understand.

Luckily, tree-sitter has excellent documentation!
Tree-sitter|Creating Parsers

In short, a new tree-sitter project has a grammar.js file that contains the grammar. Running the command tree-sitter generate produces all the files that implement a parser for the language the grammar describes.

Instead of writing the grammar in a custom DSL (like you would do for ANTLR), we write our grammar using JavaScript.

module.exports = grammar({
  name: 'Solidity',

  rules: {
    source_file: $ => choice('contract', 'library')
  }
});

Using JavaScript is fantastic because we can write custom functions that clean up our grammar and make it more concise.

arguments: $ => commaSeparated($.argument),

ugly: $ => seq($.argument, repeat(seq(',', $.argument)))
An example of how we can use JavaScript for clean grammar rules.
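
The post doesn't show commaSeparated itself, so here is a minimal sketch of what such a helper could look like. The name comes from the snippet above, but the body is an assumption. seq and repeat are real tree-sitter functions; the two stand-ins at the top exist only so the sketch runs under plain Node.

```javascript
// Stand-ins for tree-sitter's `seq` and `repeat` so this sketch runs under plain Node.
// In a real grammar.js tree-sitter provides both, so these two lines would be removed.
const seq = (...members) => ({ type: 'SEQ', members });
const repeat = (content) => ({ type: 'REPEAT', content });

// Assumed implementation: one `rule`, followed by any number of `, rule` pairs.
function commaSeparated(rule) {
  return seq(rule, repeat(seq(',', rule)));
}

// Build the rule body for a made-up `argument` symbol and print its structure.
const argumentList = commaSeparated({ type: 'SYMBOL', name: 'argument' });
console.log(JSON.stringify(argumentList, null, 2));
```

As written, the helper needs at least one element. Wrapping the call in tree-sitter's optional(...) would also allow an empty list.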

🌱 First Steps

In addition to reading about tree-sitter, the following resources are invaluable when writing a new tree-sitter grammar:

Language Documentation

Arguably, the best source of information on a language is its documentation. For Solidity, I was able to find a very detailed language specification: the Language Grammar page of the Solidity documentation.

Existing Solidity Parsers

I’m not the first person to write a grammar for Solidity. ConsenSys Diligence maintains solidity-parser-antlr (ConsenSys/solidity-parser-antlr on GitHub), a Solidity parser for JS built on top of an ANTLR4 grammar.

Looking at an existing grammar is helpful; we can use it to guide us with structuring our grammar.

Note: ANTLR generates LL(*) parsers, while tree-sitter generates (G)LR parsers, so the two have different grammar requirements. In particular, LL(*) grammars/parsers deal differently with ambiguities.

Existing tree-sitter grammars

Looking at examples is useful for many tools and languages, including tree-sitter. So I took a look at the official JavaScript tree-sitter grammar as a reference implementation.

This was super useful for two reasons:

  1. Solidity takes many syntactical ideas from JavaScript. This allows us to take bits and pieces from the JavaScript grammar and use it in our own!
  2. A reference implementation shows how the more difficult concepts like associativity, precedence and conflicts are used. We can learn from this and speed up development!
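
To make precedence and associativity concrete, here is a small sketch of how they change the way an expression is grouped. It uses a hand-written precedence-climbing parser, which is not how tree-sitter works internally (tree-sitter grammars express the same ideas with prec.left and prec.right). The operator table and its numbers are illustrative assumptions, not taken from the Solidity grammar.

```javascript
// Operator table: a higher `prec` binds tighter; `assoc` decides how operators of
// equal precedence group. The values are illustrative only.
const OPERATORS = new Map([
  ['+', { prec: 1, assoc: 'left' }],
  ['-', { prec: 1, assoc: 'left' }],
  ['*', { prec: 2, assoc: 'left' }],
  ['**', { prec: 3, assoc: 'right' }],
]);

// Split "a + b * c" into tokens; assumes tokens are separated by whitespace.
function tokenize(source) {
  return source.trim().split(/\s+/);
}

// Precedence climbing: consume operators whose precedence is at least `minPrec`.
function parseExpression(tokens, minPrec = 1) {
  let left = tokens.shift(); // an operand such as an identifier or a literal
  while (tokens.length > 0) {
    const op = OPERATORS.get(tokens[0]);
    if (!op || op.prec < minPrec) break;
    const symbol = tokens.shift();
    // Left-associative: the right-hand side may only contain tighter operators.
    // Right-associative: the right-hand side may contain the same operator again.
    const nextMin = op.assoc === 'left' ? op.prec + 1 : op.prec;
    const right = parseExpression(tokens, nextMin);
    left = `(${left} ${symbol} ${right})`;
  }
  return left;
}

// Return the fully parenthesised form of an expression.
function showGrouping(source) {
  return parseExpression(tokenize(source));
}

console.log(showGrouping('a - b - c'));   // ((a - b) - c)
console.log(showGrouping('a + b * c'));   // (a + (b * c))
console.log(showGrouping('a ** b ** c')); // (a ** (b ** c))
```

Without this information a parser cannot tell which grouping is meant, and that ambiguity is what a grammar's precedence and associativity annotations resolve.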

✍️ Starting to write

With the documentation open and tree-sitter installed, I was finally able to get started. I set up a GitHub repository and initialised a tree-sitter project.

I then followed the structure of the language when writing the different rules of the grammar, starting with high-level constructs and ending up at literals:

  1. A Solidity file
  2. Top-level source elements: pragma directives, import directives, contracts, libraries, helper functions, structs and enums
  3. Definitions: state variable declarations, constructor definitions, modifier definitions, function definitions
  4. Statements: if statements, for loops, …
  5. Expressions: binary operations, …
  6. Literals: strings, integers and booleans
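
The top-down order in the list above could translate into a grammar outline like the one below. This is a heavily reduced sketch, not the actual tree-sitter-solidity grammar. Apart from source_file, contract_declaration and identifier, the rule names are hypothetical, and statements are left out entirely. The stand-ins at the top only exist so the outline loads under plain Node; tree-sitter supplies the real grammar, seq, choice and repeat.

```javascript
// Stand-ins so this sketch loads under plain Node. tree-sitter provides the real
// `grammar`, `seq`, `choice` and `repeat`, so a real grammar.js would omit these.
const seq = (...members) => ({ type: 'SEQ', members });
const choice = (...members) => ({ type: 'CHOICE', members });
const repeat = (content) => ({ type: 'REPEAT', content });
// `$` turns `$.rule_name` into a reference to the rule with that name.
const $ = new Proxy({}, { get: (_, name) => ({ type: 'SYMBOL', name }) });
const grammar = ({ name, rules }) => ({
  name,
  rules: Object.fromEntries(
    Object.entries(rules).map(([key, build]) => [key, build($)])
  ),
});

const solidityOutline = grammar({
  name: 'solidity_outline',

  rules: {
    // 1. A Solidity file is a sequence of top-level source elements.
    source_file: $ => repeat($._source_element),

    // 2. Top-level source elements (only two of them are sketched here).
    _source_element: $ => choice($.pragma_directive, $.contract_declaration),
    pragma_directive: $ => seq('pragma', /[^;]+/, ';'),
    contract_declaration: $ => seq(
      'contract', $.identifier, '{', repeat($.state_variable_declaration), '}'
    ),

    // 3. Definitions (only a simplified state variable declaration).
    state_variable_declaration: $ => seq(
      $.identifier, $.identifier, '=', $._expression, ';'
    ),

    // 5. Expressions, reduced to identifiers and literals.
    _expression: $ => choice($.identifier, $.number_literal, $.boolean_literal),

    // 6. Literals and identifiers.
    number_literal: $ => /\d+/,
    boolean_literal: $ => choice('true', 'false'),
    identifier: $ => /[a-zA-Z_$][a-zA-Z0-9_$]*/,
  },
});

console.log(Object.keys(solidityOutline.rules).join('\n'));
```

Writing the rules in this order means each new rule only refers to names that already exist or that come next on the list, which makes it easy to stub out a complex feature and return to it later.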

At every step, I’d consult existing grammar definitions, and write down grammar rules to describe the language feature. I did sometimes skip over a complex language feature to fix it up later on.

If you also want to write a tree-sitter parser then be sure to read: Tree-sitter|Creating Parsers

🧪 Testing

If you’re in software development, then you’re probably familiar with writing unit tests. Writing a grammar in tree-sitter is no different.
The tree-sitter-solidity project has a directory test/corpus filled with text files containing test cases. Each test case has a snippet of Solidity and an S-expression that describes the expected parse result.

Running the command tree-sitter test will use the grammar to parse all the snippets and check whether the actual results match the expected ones.

====================
Contract Test Case
====================
contract Example {}
---

(source_file
	(contract_declaration
		name: (identifier)
		body: (contract_body)))
Example test case
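
To show how such a file is laid out, here is a simplified reader for the format above. It is only a sketch for illustration and not tree-sitter's own implementation; parseCorpus and its helpers are hypothetical names, and it assumes the Solidity snippet itself never contains a line made up of dashes.

```javascript
// Matches a header rule such as "====================".
const isHeaderRule = (line) => /^={3,}\s*$/.test(line);
// Matches the "---" line between the source snippet and the expected tree.
const isSeparator = (line) => /^-{3,}\s*$/.test(line);
// Collapse all whitespace so indentation differences do not matter when comparing.
const normalize = (sexp) => sexp.replace(/\s+/g, ' ').trim();

function parseCorpus(text) {
  const lines = text.split('\n');
  // A test case starts where a header rule is followed by a name and a second rule.
  const startsCase = (i) =>
    i + 2 < lines.length && isHeaderRule(lines[i]) && isHeaderRule(lines[i + 2]);

  const cases = [];
  let i = 0;
  while (i < lines.length) {
    if (!startsCase(i)) { i++; continue; }
    const name = lines[i + 1].trim();
    i += 3;

    // Everything up to the separator is the source snippet.
    const source = [];
    while (i < lines.length && !isSeparator(lines[i])) source.push(lines[i++]);
    i++; // skip the separator line

    // Everything up to the next test case is the expected S-expression.
    const expected = [];
    while (i < lines.length && !startsCase(i)) expected.push(lines[i++]);

    cases.push({
      name,
      source: source.join('\n').trim(),
      expected: normalize(expected.join('\n')),
    });
  }
  return cases;
}

const example = [
  '====================',
  'Contract Test Case',
  '====================',
  'contract Example {}',
  '---',
  '',
  '(source_file',
  '\t(contract_declaration',
  '\t\tname: (identifier)',
  '\t\tbody: (contract_body)))',
].join('\n');

const [testCase] = parseCorpus(example);
console.log(testCase.name);
console.log(testCase.source);
console.log(testCase.expected);
```

A test runner would then parse testCase.source with the generated parser, print the resulting tree as an S-expression, and compare it with testCase.expected.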

I love ❤️ this feature.  I can easily check whether tree-sitter successfully parses all my test samples and I also get to check whether they are parsed correctly.


🐛 Debugging

Another insanely helpful feature is the debug option for tree-sitter parse.
When you use the debug option, tree-sitter will generate a webpage in addition to parsing. This webpage visualises the steps taken during parsing, which is super useful if you’re still learning about (G)LR parsing.

Seeing how the parser progresses allows you to understand why something isn’t parsed correctly. You can also experiment and see the effects of precedence, associativity and conflicts.

I can imagine that even tree-sitter veterans spin up this debug view once in a while.


🚀 Releasing

There are tree-sitter bindings for many languages, but releasing a JavaScript package is by far the easiest, because tree-sitter generates almost everything you need (even the node-gyp magic for compiling the parser).

Once finished with the grammar, releasing a package was as easy as:

# Make sure to have an npm account where you can publish the package first
npm login

# Make sure to generate the tree-sitter parser 
tree-sitter generate

# Now publish the package
npm publish

Thanks for reading to the end!

Are you interested in reading more about parsers? Solidity parsing or something completely different? Let me know on Twitter @JoranHonig!
