16 Feb 2024 • 19 min read
This article is the second of a two-part series. The previous article covers topics like Lezer, ANTLR, a toy language called Zephyr, and how to make an ANTLR-based language server. If you're unfamiliar with those topics, you may want to start with the first article.
In my previous article "Building a Language Server", I described how to create a language server for Zephyr using the ANTLR framework. We created a functioning language server that can take in a document of Zephyr code and return a stream of tokens. Next, we’ll connect that language server to CodeMirror 6 by taking that token stream and turning it into a Lezer parse tree.
At a high level, we need to take the following steps to make that happen:
We covered the first step in the previous article–we’ll cover the rest in this one.
CodeMirror relies on Lezer, its parsing engine, to manage a parse tree representation of the document. A Lezer parse tree is made up of a collection of nodes, and Lezer describes those nodes with the NodeType class. A NodeType isn't a node itself, but a typed description of a node. A node type includes an id at minimum, and often also includes a name.
Node types are statically created and then reused throughout the tree when the tree is built. Throughout the tree, we use the node type's id to reference the type of a particular node. We provide Lezer information about what node types exist by passing it a NodeSet, a specific kind of set that includes node types. Lezer uses the node set as a lookup table when it needs to look up a node type by id.
Before we can construct a tree, we need to create a mapping between our language server's token types and Lezer node types. I have found it’s easiest to simply use the same names between the token types and the node types. Let’s go ahead and set up that mapping. We have defined these token types previously:
const supportedTokens = [
  'const',
  'let',
  'semicolon',
  'assign',
  'blockComment',
  'lineComment',
  'number',
  'string',
  'identifier',
  'unknown',
] as const;
export type Token = (typeof supportedTokens)[number];
Let’s now define NodeTypes that map to those tokens:
import {NodeType} from '@lezer/common';
export const tokenToNodeType: {[key in Token | 'document']: NodeType} = {
  document: NodeType.define({id: 0, name: 'document', top: true}),
  const: NodeType.define({id: 1, name: 'const'}),
  let: NodeType.define({id: 2, name: 'let'}),
  semicolon: NodeType.define({id: 3, name: 'semicolon'}),
  assign: NodeType.define({id: 4, name: 'assign'}),
  number: NodeType.define({id: 5, name: 'number'}),
  string: NodeType.define({id: 6, name: 'string'}),
  identifier: NodeType.define({id: 7, name: 'identifier'}),
  unknown: NodeType.define({id: 8, name: 'unknown'}),
  blockComment: NodeType.define({id: 9, name: 'blockComment'}),
  lineComment: NodeType.define({id: 10, name: 'lineComment'}),
};
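This mapping defines a NodeType for each of the token types, with each one possessing a unique id. Because these types go into a NodeSet below, each id must match the type's position in the array passed to the NodeSet, which is why the ids start at 0 and increase by one in declaration order.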
You may notice that there is one extra type we have added called document, with top: true passed to the call to NodeType.define. Lezer requires us to provide a tree with a single node as the entrypoint; all other nodes should connect to this node. The top node performs this function (it doesn't hold any semantic meaning for the code). I've decided to name this top node document, since it describes the entrypoint for a Zephyr document, but it can be renamed to suit the specific language it represents. Since we have a mapping between our tokens and node types, we can also build a node set:
import {NodeSet} from '@lezer/common';
export const parserAdapterNodeSet = new NodeSet(Object.values(tokenToNodeType));
I’ve found it useful to manage the node types this way because it becomes straightforward to add a new node type: simply add an entry to tokenToNodeType. Now that we have node types in a node set, we can look at building the connecting piece between the language server and Lezer’s Parser.
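One caveat when adding entries: the NodeSet constructor expects each type's id to equal its position in the array it receives, so the order of Object.values(tokenToNodeType) matters. A small guard along the following lines can catch a misnumbered entry before it reaches Lezer. This is only a sketch on plain objects; IdentifiedType and findMisplacedTypes are hypothetical names, not part of Lezer.

```typescript
// Minimal shape used by this sketch; hypothetical, not a Lezer type.
interface IdentifiedType {
  id: number;
  name: string;
}

// Returns the names of entries whose id differs from their array position.
function findMisplacedTypes(types: readonly IdentifiedType[]): string[] {
  return types
    .filter((type, position) => type.id !== position)
    .map((type) => type.name);
}

// Ids line up with positions, so nothing is reported.
const orderedTypes: IdentifiedType[] = [
  {id: 0, name: 'document'},
  {id: 1, name: 'const'},
  {id: 2, name: 'let'},
];
// 'let' and 'const' are swapped relative to their ids.
const shuffledTypes: IdentifiedType[] = [
  {id: 0, name: 'document'},
  {id: 2, name: 'let'},
  {id: 1, name: 'const'},
];

const misplacedInOrdered = findMisplacedTypes(orderedTypes);
const misplacedInShuffled = findMisplacedTypes(shuffledTypes);
```

With the real mapping, the same check could be run over Object.values(tokenToNodeType), since NodeType instances expose id and name.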
Lezer has an abstract Parser class that we can utilize to connect our language server to Lezer. The class defines functions that Lezer will call to generate a parse tree. We are required to implement createParse, and we will also implement the startParse function. We can treat these functions as "hooks" where we provide our own implementation that reads the document and generates a tree based off of our own language server. Let’s first start by declaring an adapter class and subclassing Lezer's Parser class:
import {Parser} from '@lezer/common';
import {LanguageServer} from './language';
import {tokenToNodeType} from './constants';
export class ParserAdapter extends Parser {
  private languageServer = new LanguageServer();
  // named to match the calls made later in createBufferFromTokens
  private getNodeTypeIdForTokenType(index: number) {
    const tokenType = this.languageServer.getTokenTypeForIndex(index);
    return tokenToNodeType[tokenType].id;
  }
}
This code instantiates a new language server and creates a private method that makes it easy to derive the node type from a token’s index. We’ll use this later when constructing the tree.
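Lezer’s Tree class has a static build method that we can use to build our own parse tree. We can provide it with either a BufferCursor or just an array of integers that correspond to the shape of the tree. When we provide it with a buffer of integers, Lezer reads the buffer in groups of 4 integers, where each group of four integers corresponds to a single node. Here’s what each position in the group means:
1. id: the id of the node's NodeType.
2. start: the offset where the node starts.
3. end: the offset just past the node's last character.
4. size: the number of buffer positions the node occupies, counting its own group and the groups of all of its children (4 for a node without children).
This flat array is what we pass as the buffer to Tree.build().
As a practical example, let’s say that we have a two-token Zephyr document: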
const hello
If we were to construct a tree for this document, we would need:
- a node for const
- a node for hello
- a top node that contains both
The buffer for this document might look like:
// each node follows the grouping
// id, start, end, size
// (ids taken from tokenToNodeType: document 0, const 1, identifier 7)
const buffer = [
  // const
  1, 0, 5, 4,
  // hello (an identifier)
  7, 6, 11, 4,
  // top node
  0, 0, 11, 12,
];
The start and end offsets for the first two tokens correspond to the position of that node relative to the start of the document, and each has the typical size of 4. However, the top node has a start offset of 0 and an end offset of 11. As the parent of both of the previous nodes, it needs to encompass both of those nodes’ offsets. Additionally, the size of the top node is 12: 4 for the first node, 4 for the second node, and 4 for itself.
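To check a buffer by hand, it can help to read it back the same way: four integers at a time. The following is a dependency-free sketch, not Lezer's own reader; DecodedNode, GROUP_SIZE, and decodeBuffer are hypothetical names, and the node ids assume the tokenToNodeType mapping from earlier (document 0, const 1, identifier 7).

```typescript
// One decoded group from the flat buffer. The field names are illustrative
// labels for the four positions, not identifiers from Lezer.
interface DecodedNode {
  id: number;
  start: number;
  end: number;
  size: number;
}

const GROUP_SIZE = 4;

// Walk the flat array four integers at a time.
function decodeBuffer(buffer: readonly number[]): DecodedNode[] {
  if (buffer.length % GROUP_SIZE !== 0) {
    throw new RangeError('buffer length must be a multiple of 4');
  }
  const nodes: DecodedNode[] = [];
  for (let i = 0; i < buffer.length; i += GROUP_SIZE) {
    nodes.push({
      id: buffer[i],
      start: buffer[i + 1],
      end: buffer[i + 2],
      size: buffer[i + 3],
    });
  }
  return nodes;
}

// The "const hello" document: two leaf nodes followed by the top node.
const decoded = decodeBuffer([
  1, 0, 5, 4,
  7, 6, 11, 4,
  0, 0, 11, 12,
]);
const topNode = decoded[decoded.length - 1];
```

Because the buffer is in postfix order, the top node is always the last group, and its size equals the length of the whole buffer.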
In order to build a tree in our ParserAdapter class, we’ll need to create a buffer that describes the tree. We can use the tokens that come back from the language server as the basis for this buffer, since it has most of the information we need. Let’s add a helper method to our class to transform the token stream into a buffer that Lezer can understand:
const DEFAULT_NODE_GROUP_SIZE = 4;
private createBufferFromTokens(tokens: Token[]) {
  const buffer: number[] = [];
  // 1
  tokens.forEach((token) => {
    const nodeTypeId = this.getNodeTypeIdForTokenType(token.type);
    const startOffset = token.startIndex;
    // 2
    const endOffset = token.stopIndex + 1;
    buffer.push(
      nodeTypeId,
      startOffset,
      endOffset,
      // 3
      DEFAULT_NODE_GROUP_SIZE,
    );
  });
  const topNodeId = tokenToNodeType.document.id;
  const startOffset = tokens[0].startIndex;
  // the top node's end offset is exclusive too, so it also needs the + 1
  const endOffset = tokens[tokens.length - 1].stopIndex + 1;
  // 4
  const topNodeSize =
    tokens.length * DEFAULT_NODE_GROUP_SIZE + DEFAULT_NODE_GROUP_SIZE;
  buffer.push(topNodeId, startOffset, endOffset, topNodeSize);
  return buffer;
}
Here’s what’s happening in this function:
1. We loop over the tokens and push one group of four integers onto the buffer for each token.
2. stopIndex refers to the index of the last character of the word, while Lezer expects an end offset that points just past the word, so we add 1.
3. DEFAULT_NODE_GROUP_SIZE is a named constant to make the code a little less ✨ magic-number-y ✨.
4. The size of the top node is the combined size of its children (tokens.length * DEFAULT_NODE_GROUP_SIZE) plus the size of the top node itself.
The docs specify that a node buffer must be in postfix order: children come before parents, and children are ordered by offsets. If you have offsets out of order, Lezer won’t generate the tree properly and will throw an error.
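// nodeId, parentId, and size are placeholders; real buffers are flat arrays
// this won't work: the second and third children are out of order
const invalidBuffer = [
  nodeId, 0, 4, size,
  nodeId, 8, 12, size,
  nodeId, 5, 6, size,
  parentId, 0, 12, size,
];
// this works: children are sorted by offset and the parent comes last
const validBuffer = [
  nodeId, 0, 4, size,
  nodeId, 5, 6, size,
  nodeId, 8, 12, size,
  parentId, 0, 12, size,
];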
The Zephyr language server returns the tokens in the order they appear in the document, which means that we can always assume that they are in the correct order.
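If you adapt this approach to a token source that does not guarantee document order, a cheap check before building the buffer can surface the problem early. The sketch below is an optional guard, not something the Zephyr adapter does; TokenSpan and isInDocumentOrder are hypothetical names, and the fields mirror the startIndex and stopIndex properties used above.

```typescript
// Only the two fields the check needs.
interface TokenSpan {
  startIndex: number;
  stopIndex: number; // index of the token's last character (inclusive)
}

// True when every token starts after the previous token's last character.
function isInDocumentOrder(tokens: readonly TokenSpan[]): boolean {
  for (let i = 1; i < tokens.length; i++) {
    if (tokens[i].startIndex <= tokens[i - 1].stopIndex) {
      return false;
    }
  }
  return true;
}

// "const hello": const spans 0..4, hello spans 6..10.
const ordered = isInDocumentOrder([
  {startIndex: 0, stopIndex: 4},
  {startIndex: 6, stopIndex: 10},
]);
// The second token starts before the first one, so the order is invalid.
const unordered = isInDocumentOrder([
  {startIndex: 8, stopIndex: 11},
  {startIndex: 5, stopIndex: 5},
]);
```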
Now that we have constructed a buffer, let’s build a parse tree! We’ll add a helper method for generating a tree:
private buildTree(document: string) {
  // 1
  const tokens = this.languageServer.getTokenStream(document);
  // 2
  if (tokens.length < 1) {
    return Tree.build({
      buffer: [
        tokenToNodeType.document.id,
        0,
        document.length,
        DEFAULT_NODE_GROUP_SIZE,
      ],
      nodeSet: parserAdapterNodeSet,
      topID: tokenToNodeType.document.id,
    });
  }
  // 3
  const buffer = this.createBufferFromTokens(tokens);
  // 4
  return Tree.build({
    buffer: buffer,
    nodeSet: parserAdapterNodeSet,
    topID: tokenToNodeType.document.id,
  });
}
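What’s happening in this code:
1. We ask the language server for the document's token stream.
2. If there are no tokens, we return a tree that contains only the top node, spanning the whole document.
3. Otherwise, we convert the tokens into a buffer.
4. We build the tree from that buffer, our node set, and the top node's id.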
With our helper methods written, we are ready to finish our parser adapter by implementing the parse methods. createParse is the abstract method that every Parser subclass is required to implement, and startParse is the entry point that callers use to begin a parse, including when the code is edited. We'll provide both. The code for these functions looks like this:
// 1
createParse(
  input: Input,
): PartialParse {
  return this.startParse(input);
}
startParse(
  input: string | Input,
): PartialParse {
  // 2
  const document =
    typeof input === "string" ? input : input.read(0, input.length);
  // 3
  const tree = this.buildTree(document);
  // 4
  return {
    stoppedAt: input.length,
    parsedPos: input.length,
    stopAt: (_) => {},
    advance: () => tree,
  };
}
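Here’s what’s happening in this code:
1. Since Lezer requires us to implement createParse, I’ve opted to simply forward calls to that function to startParse.
2. For the input in this parse function, Lezer uses both an Input type and a simple string type, so we need to do a little bit of unwrapping of this type to get at the document.
3. We build the tree from the document text with our buildTree helper.
4. We return a PartialParse object that reports the whole document as parsed and returns the finished tree from advance.
If we put all of this together, we end up with the following code: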
import {Parser, Tree, Input, PartialParse} from '@lezer/common';
import {Token} from 'antlr4ts';
import {LanguageServer} from './language';
import {parserAdapterNodeSet, tokenToNodeType} from './constants';
const DEFAULT_NODE_GROUP_SIZE = 4;
export class ParserAdapter extends Parser {
  private languageServer = new LanguageServer();
  // maps a token's type index to the id of the matching node type
  private getNodeTypeIdForTokenType(index: number) {
    const tokenType = this.languageServer.getTokenTypeForIndex(index);
    return tokenToNodeType[tokenType].id;
  }
  // flattens the token stream into Lezer's id/start/end/size groups
  private createBufferFromTokens(tokens: Token[]) {
    const buffer: number[] = [];
    tokens.forEach((token) => {
      const nodeTypeId = this.getNodeTypeIdForTokenType(token.type);
      const startOffset = token.startIndex;
      // stopIndex is inclusive; Lezer's end offset is exclusive
      const endOffset = token.stopIndex + 1;
      buffer.push(nodeTypeId, startOffset, endOffset, DEFAULT_NODE_GROUP_SIZE);
    });
    const topNodeId = tokenToNodeType.document.id;
    const startOffset = tokens[0].startIndex;
    const endOffset = tokens[tokens.length - 1].stopIndex + 1;
    const topNodeSize =
      tokens.length * DEFAULT_NODE_GROUP_SIZE + DEFAULT_NODE_GROUP_SIZE;
    buffer.push(topNodeId, startOffset, endOffset, topNodeSize);
    return buffer;
  }
  private buildTree(document: string) {
    const tokens = this.languageServer.getTokenStream(document);
    // no tokens: return a tree that only contains the top node
    if (tokens.length < 1) {
      return Tree.build({
        buffer: [
          tokenToNodeType.document.id,
          0,
          document.length,
          DEFAULT_NODE_GROUP_SIZE,
        ],
        nodeSet: parserAdapterNodeSet,
        topID: tokenToNodeType.document.id,
      });
    }
    const buffer = this.createBufferFromTokens(tokens);
    return Tree.build({
      buffer: buffer,
      nodeSet: parserAdapterNodeSet,
      topID: tokenToNodeType.document.id,
    });
  }
  createParse(
    input: Input,
  ): PartialParse {
    return this.startParse(input);
  }
  startParse(
    input: string | Input,
  ): PartialParse {
    const document =
      typeof input === 'string' ? input : input.read(0, input.length);
    const tree = this.buildTree(document);
    // the whole document is parsed up front, so advance returns the tree at once
    return {
      stoppedAt: input.length,
      parsedPos: input.length,
      stopAt: (_) => {},
      advance: () => tree,
    };
  }
}
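The object returned from startParse follows Lezer's PartialParse contract: a caller invokes advance() repeatedly until it returns a tree rather than null. Because the adapter builds the whole tree up front, the first call already returns it. The sketch below models that contract with local stand-in types so it has no dependencies; SketchTree, SketchPartialParse, oneShotParse, and runToCompletion are hypothetical names, not Lezer exports.

```typescript
// Trimmed-down stand-ins for Lezer's Tree and PartialParse.
// Only the members used below are modeled.
interface SketchTree {
  length: number;
}
interface SketchPartialParse {
  parsedPos: number;
  stoppedAt: number | null;
  stopAt(pos: number): void;
  advance(): SketchTree | null;
}

// A one-shot parse like the adapter's: the tree is ready on the first call.
function oneShotParse(documentLength: number): SketchPartialParse {
  const tree: SketchTree = {length: documentLength};
  return {
    parsedPos: documentLength,
    stoppedAt: documentLength,
    stopAt: () => {},
    advance: () => tree,
  };
}

// Keep calling advance() until it yields a tree, counting the calls.
function runToCompletion(
  parse: SketchPartialParse,
): {tree: SketchTree; calls: number} {
  let calls = 0;
  for (;;) {
    calls++;
    const result = parse.advance();
    if (result !== null) {
      return {tree: result, calls};
    }
  }
}

const outcome = runToCompletion(oneShotParse('const hello'.length));
```

An incremental parser would return null from advance() while work remains; the adapter trades that incrementality for simplicity.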
Since we have a finished parser adapter, we now need to connect that to CodeMirror by using the CodeMirror-provided Language classes.
CodeMirror uses the Language class to describe a code language. It includes:
In addition to the Language class, CodeMirror also has a LanguageSupport class. The docs explain the function of this class:
This class bundles a language with an optional set of supporting extensions. Language packages are encouraged to export a function that optionally takes a configuration object and returns a LanguageSupport instance, as the main way for client code to use the package.
In order to connect our parser adapter to CodeMirror, we need to use both the Language and LanguageSupport classes. Here's an example of what this might look like:
// 1
const parserAdapter = new ParserAdapter();
// 2
const language = new Language(Facet.define(), parserAdapter, [], 'Zephyr');
// 3
const zephyr = new LanguageSupport(language, []);
Here's what's happening in this code:
1. We create an instance of ParserAdapter, which will be used to generate the parse tree.
2. We create a Language. The first argument is a Facet (we'll cover why this is necessary in the language data section) and the second argument is our parser adapter. We're passing an empty array for the extraExtensions argument, and passing Zephyr in order to name the language.
3. We create a LanguageSupport from the language, passing an empty array for the support extensions argument for now.
If we pass the zephyr (language support) object as an extension within CodeMirror, CodeMirror will then be able to read a Zephyr document and generate a syntax tree from it. 🎉
One of the most basic features any language needs is syntax highlighting. Now that CodeMirror can generate a syntax tree, we can leverage the syntaxHighlighting function to provide highlighting. But first we need to tell CodeMirror a little bit more about our language in order to make that work.
Lezer contains a highlight module that connects a node type with style information by using the Tag class. The docs describe this class as:
Highlighting tags are markers that denote a highlighting category. They are associated with parts of a syntax tree by a language mode, and then mapped to an actual CSS style by a highlighter.
I think of tags as generic descriptors of tokens within a language. For example, you might use def or func or function to describe a function within a language. A tag is the generic descriptor of a keyword, so all of those language-specific words would be mapped to the generic function tag.
In order for CodeMirror's syntaxHighlighting function to work properly, we need to provide a mapping between Lezer's tags and our node types. The mapping between node types and language tokens that we constructed previously looks like this:
const tokenToNodeType: {[key in Token | 'document']: NodeType} = {
  document: NodeType.define({id: 0, name: 'document', top: true}),
  const: NodeType.define({id: 1, name: 'const'}),
  let: NodeType.define({id: 2, name: 'let'}),
  semicolon: NodeType.define({id: 3, name: 'semicolon'}),
  assign: NodeType.define({id: 4, name: 'assign'}),
  number: NodeType.define({id: 5, name: 'number'}),
  string: NodeType.define({id: 6, name: 'string'}),
  identifier: NodeType.define({id: 7, name: 'identifier'}),
  unknown: NodeType.define({id: 8, name: 'unknown'}),
  blockComment: NodeType.define({id: 9, name: 'blockComment'}),
  lineComment: NodeType.define({id: 10, name: 'lineComment'}),
};
const parserAdapterNodeSet = new NodeSet(Object.values(tokenToNodeType));
We need to use two methods to connect our types to tags:
- NodeSet.extend: this method appends data to each of our node types.
- styleTags: this method creates a key/value mapping between a node type's name and the particular style tag we want to associate with this node type.
Here's an example of what this code might look like:
const parserAdapterNodeSet = new NodeSet(Object.values(tokenToNodeType)).extend(
  styleTags({
    const: tags.keyword,
    let: tags.keyword,
    assign: tags.operator,
    number: tags.number,
    string: tags.string,
    identifier: tags.variableName,
    blockComment: tags.comment,
    lineComment: tags.comment,
  }),
);
Notice that you can call .extend in the same statement after you've initialized the node set. Also, the keys I'm passing in to styleTags are the names of each of the node types found in tokenToNodeType. Note that these tags are used to describe the semantic purpose of the nodes to Lezer's highlight module; they are not the CSS classnames themselves.
Now that we have a mapping between the highlight tags and the node types, we can provide syntax highlighting that describes our language:
const syntaxHighlight = syntaxHighlighting(
  HighlightStyle.define([
    {tag: tags.comment, class: 'text-slate'},
    {tag: tags.keyword, class: 'text-fuchsia'},
    {tag: tags.variableName, class: 'text-blue'},
    {tag: tags.string, class: 'text-lime'},
    {tag: tags.number, class: 'text-violet'},
    {tag: tags.operator, class: 'text-orange'},
  ]),
);
The class prop here describes the value that is used for the CSS classname in the markup. Note that we aren't using the names of our node types anymore, but instead we are using the tags object as the key. Because syntax highlighting is associated with tags and not directly with a language's node types, it becomes portable between editors. This enables you to use highlighting from other sources.
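The portability comes from the two-step lookup: a language maps its node names to generic tags, and a theme maps generic tags to classes. The dependency-free sketch below illustrates only that indirection; it does not use Lezer's Tag objects, and SketchTag, zephyrTags, otherLanguageTags, sharedTheme, and classFor are hypothetical names (the second language is made up).

```typescript
// Hypothetical tag names; the real tags come from @lezer/highlight.
type SketchTag = 'keyword' | 'comment' | 'variableName';

// Language-specific: node type name -> generic tag.
const zephyrTags: Record<string, SketchTag> = {
  const: 'keyword',
  let: 'keyword',
  lineComment: 'comment',
};
// A second, made-up language that reuses the same tags.
const otherLanguageTags: Record<string, SketchTag> = {
  def: 'keyword',
  hashComment: 'comment',
};

// Language-agnostic: generic tag -> CSS class.
const sharedTheme: Record<SketchTag, string> = {
  keyword: 'text-fuchsia',
  comment: 'text-slate',
  variableName: 'text-blue',
};

// Resolve a node name to a class by going through the tag.
function classFor(
  languageTags: Record<string, SketchTag>,
  nodeName: string,
): string | undefined {
  const tag = languageTags[nodeName];
  return tag === undefined ? undefined : sharedTheme[tag];
}

const zephyrLetClass = classFor(zephyrTags, 'let');
const otherDefClass = classFor(otherLanguageTags, 'def');
```

Both keywords resolve to the same class even though the two languages share no node names, which is why one theme can serve many languages.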
We can provide this syntax highlighting extension as part of our language via the support array in the LanguageSupport class:
const zephyr = new LanguageSupport(language, [syntaxHighlight]);
This will automatically provide this extension alongside our language in any place we pass the zephyr extension. We now have both parsing and coloring for our language!
In order to tell CodeMirror more about our language, we can pass in language data. Language data describes features of the language in a way that CodeMirror understands, and if done correctly, will provide certain code editing features "for free" (since these features are generic enough that CodeMirror has a pattern to support them). The docs give the following examples of language data:
- commentTokens for specifying comment syntax
- autocomplete for providing language-specific completion sources
- wordChars for adding characters that should be considered part of words in this language
- closeBrackets controls bracket closing behavior
Typically we would pass in language data via the first argument of the Language class (the data argument). However, since we are maintaining our own node set, we need to take a different approach. We instead need to provide language data directly through the top node's props argument. This is why we passed an empty facet to our call to Language earlier: any data passed there will be unused (but the types still require something to be passed in).
CodeMirror gives us two functions, languageDataProp and defineLanguageFacet, to make this part easy. We can update the top node in our tokenToNodeType object to include language data:
document: NodeType.define({
  id: 0,
  name: 'document',
  top: true,
  props: [
    [
      languageDataProp,
      defineLanguageFacet({
        commentTokens: {
          block: {open: '/*', close: '*/'},
          line: '//',
        },
      }),
    ],
  ],
}),
Providing this data to CodeMirror will give us editor features for "free". In this case, the commentTokens make it so that the default toggle comment command works with our language.
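To give a feel for why a line token is all a generic command needs, here is a small sketch of toggling a comment on a single line. It is not CodeMirror's implementation, which works on editor state and selections; toggleLineComment is a hypothetical helper that only shows how the line value from commentTokens could be used.

```typescript
// Sketch only: adds the line-comment token if it is missing, removes it otherwise.
function toggleLineComment(line: string, lineToken: string): string {
  // Preserve leading whitespace so indentation is unchanged.
  const indent = line.match(/^\s*/)?.[0] ?? '';
  const rest = line.slice(indent.length);
  if (rest.startsWith(lineToken)) {
    // Already commented: strip the token and one optional following space.
    const uncommented = rest.slice(lineToken.length);
    return indent + (uncommented.startsWith(' ') ? uncommented.slice(1) : uncommented);
  }
  return `${indent}${lineToken} ${rest}`;
}

// Using the '//' token declared in the language data above.
const commented = toggleLineComment('const hello;', '//');
const restored = toggleLineComment(commented, '//');
const indented = toggleLineComment('  let x;', '//');
```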
This implementation isn’t perfect and has a couple of drawbacks:
While these issues are solvable, I’ve opted not to solve them as part of this example and instead have kept this example as simple as possible to demonstrate how to connect a language server to Lezer.
So with all of that, we now have a working code editor that supports Zephyr, our little example language written in an ANTLR grammar and powered by a language server.