Defining Languages

Language

A language is a function that returns a Language object. The object must contain 3 required properties (id, name and grammar) and accepts 2 more optional properties:

  • id — The unique language ID
  • name — The language name
  • alias — Optional. Defines aliases that must also be unique among all IDs and other aliases
  • source — Optional. Defines reusable source patterns
  • grammar — A Grammar object that is a collection of tokenizers

The alias is an optional property that defines aliases of the language, such as a "js" shorthand for JavaScript. The id and aliases must be unique among all IDs and aliases. The grammar object must have at least one property named main, which is an array of tokenizers:

export function myLang() {
  return {
    id: 'mylang',
    name: 'My Language',
    // alias: [ 'mine' ],
    // source: {},
    grammar: {
      main: [],
    },
  };
}
JavaScript

This is the minimal definition of a language. Let's register and test it:

import { RyuseiLight } from '@ryusei/light';
import { myLang } from './myLang';

RyuseiLight.register( myLang() );

const ryuseilight = new RyuseiLight();
const html = ryuseilight.html( `console.log( 'hi!' )`, { language: 'mylang' } );

console.log( html );
JavaScript

This works well, but all code is tokenized as text, which is the default category. Now we have to define the grammar by providing some tokenizers.

Main Tokenizers

Each tokenizer is defined by using a regular expression. There are a few things to keep in mind before you start:

  • A sticky flag (y) is automatically added for better performance.
  • When you use a dotAll flag (s), make sure your transpiler transforms it into [\s\S], since the flag is not supported by IE (see the sketch after this list).
  • Do not use a lookbehind assertion because it doesn't work on IE (a lookahead assertion is supported).
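
For example, a block comment tokenizer like the ones used later in this guide could be written without the dotAll flag by spelling out the character class (a minimal sketch):

{
  main: [
    // Equivalent to /\/\*.*?\*\//s, but without relying on the s flag:
    [ 'comment', /\/\*[\s\S]*?\*\// ],
  ],
}
JavaScript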

Here is a simple example to tokenize single-line comments (// comment) and strings ('string'):

{
  main: [
    [ 'comment', /\/\/.*/ ],
    [ 'string', /'(?:\\'|.)*?'/ ],
  ],
}
JavaScript

The first parameter of a tokenizer is the category that the matched string will be classified into, and the second parameter is a RegExp object to search for a token.
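
In other words, each tokenizer is an array of up to 4 elements; the optional third and fourth parameters are explained later in Actions (a sketch):

// [ category, regexp, optional action, optional '@debug' ]
[ 'string', /'(?:\\'|.)*?'/ ],
[ '#backtick', /`/, '@rest', '@debug' ],
JavaScript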

Even if a string contains //, such as ' // trap! ', this grammar works properly, because RyuseiLight searches tokens from the head of the code and therefore the 'string' tokenizer matches it before the 'comment' one does.

By adding regexes like this, you can simply define your own language.
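
For instance, given the input below, the quoted section is consumed as a string before the // inside it is ever considered, while the trailing // outside the quotes still becomes a comment (illustrative):

const greeting = ' // trap! '; // a real comment

// Tokenized roughly as:
//   text:    const greeting =
//   string:  ' // trap! '
//   text:    ;
//   comment: // a real comment
JavaScript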

Sub Tokenizers

The main purpose of sub tokenizers is to apply another set of tokenizers to a matched string. For example, when tokenizing HTML tags, it makes more sense to extract tags first and then tokenize the attributes that belong to them, rather than doing both at the same time.

To do this, use the # notation that indicates the ID of a set of sub tokenizers, like this:

{
  main: [
    [ '#tag', /<.*?>/s ],
  ],
  tag: [
    ...
  ],
}
JavaScript

The string matched by /<.*?>/s will be tokenized by the set of tag tokenizers. As you may have already noticed, sub tokenizers can also use other sub tokenizers:

{
  main: [
    [ '#tag', /<.*?>/s ],
  ],
  tag: [
    [ '#attr', /\s+.+(?=[\s/>])/s ],
    [ 'tag', /[^\s/<>"'=]+/ ],
    [ 'bracket', /[<>]/ ],
    [ 'symbol', /[/]/ ],
  ],
  attr: [
    ...
  ],
}
JavaScript

This system lets you break down a complex string into small fragments that are easy to handle.
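
As a rough idea, the attr set might look something like this (hypothetical patterns for illustration, not the library's actual HTML grammar):

{
  attr: [
    [ 'space', /\s+/ ],
    [ 'value', /(['"])(?:\\\1|.)*?\1/s ], // quoted attribute value
    [ 'operator', /=/ ],
    [ 'attr', /[^\s/>"'=]+/ ],            // attribute name
  ],
}
JavaScript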

Including Tokenizers

Tokenizers can include other tokenizers to share common rules, by using the # notation without providing a regex:

{
  common: [
    [ 'space', /[ \t]+/ ],
    [ 'symbol', /[:;.,]/ ],
  ],
  tokenizers1: [
    [ 'comment', /\/\*.*?\*\//s ],
    [ '#common' ],
  ],
  tokenizers2: [
    [ '#common' ],
    [ 'string', /'(?:\\'|.)*?'/ ],
  ],
}
JavaScript

This is exactly the same as the following grammar:

{
  tokenizers1: [
    [ 'comment', /\/\*.*?\*\//s ],
    [ 'space', /[ \t]+/ ],
    [ 'symbol', /[:;.,]/ ],
  ],
  tokenizers2: [
    [ 'space', /[ \t]+/ ],
    [ 'symbol', /[:;.,]/ ],
    [ 'string', /'(?:\\'|.)*?'/ ],
  ],
}
JavaScript

Using Other Languages

For example, HTML can contain JavaScript inside a <script> tag. In such cases, you can use other languages via the use property and the @ notation:

import { languages } from '@ryusei/light';

const { javascript, css } = languages;

export function myLang() {
  return {
    id  : 'mylang',
    name: 'My Language',
    use : {
      javascript: javascript(),
      css       : css(),
    },
    grammar: {
      main: [
        [ '#script', /<script.*?>.*?<\/script>/s ],
        [ '#style', /<style.*?>.*?<\/style>/s ],
      ],
      script: [
        ...
        [ '@javascript', /.+(?=<\/script>)/s ],
      ],
      style: [
        ...
        [ '@css', /.+(?=<\/style>)/s ],
      ],
    },
  };
}
JavaScript

Source

Sometimes the same regex patterns appear in different tokenizers. You can reuse such frequent patterns by defining the source property and using the % notation:

{
  source: {
    func: /[_$a-z\xA0-\uFFFF][_$a-z0-9\xA0-\uFFFF]*/,
  },
  grammar: {
    main: [
      ...
      [ 'function', /%func(?=\s*\()/i ],
      [ 'keyword', /\b((get|set)(?=\s+%func))/i ],
    ],
  },
}
JavaScript

The %[ key ] will be replaced with the corresponding source[ key ]. Note that you cannot use flags in source patterns.
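
For example, after expansion, the function rule above is equivalent to writing the pattern inline:

// [ 'function', /%func(?=\s*\()/i ] expands to:
[ 'function', /[_$a-z\xA0-\uFFFF][_$a-z0-9\xA0-\uFFFF]*(?=\s*\()/i ],
JavaScript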

Actions

For a regex-based tokenizer, nested syntax is often very difficult to tokenize (it drives me crazy 😖), such as a template literal in JavaScript:

const string = `container \` ${
  isMobile()
    // ${ I am a comment }
    // `I am a fake template literal`
    ? 'is-mobile'
    : `container--${ page.isFront() ? 'front' : 'page' }`
}`;
JavaScript

The template literal above begins at the first line and ends at the last line. How should we define a regex that matches the open/close backticks?

@rest and @break

...okay, giving up on tokenizing it by regexes alone, let's utilize a simple state system with the @rest and @break actions (the idea is inspired by Monarch):

{
  main: [
    // Imagine here are some tokenizers for JavaScript syntax.
    [ '#backtick', /`/, '@rest' ],
  ],
  backtick: [
    [ 'string', /^`/ ],
  ],
}
JavaScript

Although the # notation explained before applies sub tokenizers to the matched string, the target changes to the rest of the whole string when the @rest action is specified as the third parameter. You can think of the @rest action as switching the Lexer state (from main to backtick in this example).

In the backtick state, we have to settle the matched opening backtick as a string first; otherwise, we may accidentally dive into a terrible infinite loop.

Next, search for a closing backtick:

{
  main: [
    [ '#backtick', /`/, '@rest' ],
  ],
  backtick: [
    [ 'string', /^`/ ],
    [ 'string', /`/, '@break' ],
  ],
}
JavaScript

The @break action makes the state return to the previous one. The following picture illustrates how this grammar changes tokenization states. Once the state transitions to #backtick, only the rules in #backtick will be applied:

[Figure: Changing states by @rest and @break actions]

A template literal may contain embedded expressions enclosed by ${ and }. They can be pulled out in the same way:

{
  main: [
    [ '#backtick', /`/, '@rest' ],
  ],
  backtick: [
    [ 'string', /^`/ ],
    [ '#expression', /\${/, '@rest' ],
    [ 'string', /`/, '@break' ],
  ],
  expression: [
    [ 'bracket', /^\${/ ],
    [ 'bracket', /}/, '@break' ],
  ],
}
JavaScript

Additionally, a template literal can also appear in an expression, such as `outer ${ `inner` }`. If we encounter another backtick in the expression state, we change the state to #backtick again:

{
  expression: [
    [ 'bracket', /^\${/ ],
    [ '#backtick', /`/, '@rest' ],
    [ 'bracket', /}/, '@break' ],
  ],
}
JavaScript

So far, we've defined the basic states for a template literal. Next, we have to tokenize the text inside backticks as a string. In this process, it is important to think about how to stop the search position of the Lexer before ` and ${.

The regex can be written like /[^`$]+/ by using a negated character class. It is almost enough, but has 2 flaws:

  • It misses a $ that is not followed by {
  • It misses an escaped backtick (\`) and an escaped dollar sign (\$)

To fix these problems, we need to add patterns to include them:

{
  backtick: [
    [ 'string', /^`/ ],
    // "$" not followed by "{", escaped "$" or "`", or any other character:
    [ 'string', /(\$[^{]|\\[$`]|[^`$])+/ ],
    [ '#expression', /\${/, '@rest' ],
    [ 'string', /`/, '@break' ],
  ],
}
JavaScript

Finally, let's tokenize the text inside an expression, where any JavaScript syntax may appear. This can be simply achieved by including #main:

{
  expression: [
    [ 'bracket', /^\${/ ],
    [ '#backtick', /`/, '@rest' ],
    [ 'bracket', /}/, '@break' ],
    [ '#main' ],
  ],
}

Since the main tokenizers already contain '#backtick', we can remove it from the expression state:

{
  expression: [
    [ 'bracket', /^\${/ ],
    [ 'bracket', /}/, '@break' ],
    [ '#main' ],
  ],
}

That's all! ✨

You may wonder what happens if a } appears in comments or strings, which could break the state unexpectedly. Right, this does not work well in such cases alone, but for practical use, #main is expected to include comment and string tokenizers. Before [ 'bracket', /}/, '@break' ] can match a } inside a comment or string, the whole comment or string will already have been consumed by its own tokenizer:

{
  main: [
    [ 'comment', /\/\*.*?\*\//s ],
    [ 'string', /'(?:\\'|.)*?'/ ],
    ...
    [ '#backtick', /`/, '@rest' ],
  ],
  backtick: [
    [ 'string', /^`/ ],
    [ 'string', /(\$[^{]|\\[$`]|[^`$])+/ ],
    [ '#expression', /\${/, '@rest' ],
    [ 'string', /`/, '@break' ],
  ],
  expression: [
    [ 'bracket', /^\${/ ],
    [ 'bracket', /}/, '@break' ],
    [ '#main' ],
  ],
}
JavaScript
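
To try it out, you can run code containing a template literal through the highlighter in the same way as before (assuming this grammar is part of the mylang definition registered earlier):

const code = "const classes = `container ${ isMobile() ? 'mobile' : 'desktop' }`;";
console.log( ryuseilight.html( code, { language: 'mylang' } ) );
JavaScript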

You can see my JavaScript tokenizers here.

@back

The @back action also breaks a state, but the matched string is not tokenized:

{
  // "}" is tokenized as a "bracket":
  tokenizers1: [
    [ 'bracket', /}/, '@break' ],
  ],

  // "}" is not tokenized:
  tokenizers2: [
    [ '', /}/, '@back' ],
  ],
}
JavaScript

This is useful when a previous state also needs to end with the same character.
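
For example, in a CSS-like syntax, a declaration value may end either at ; or at the } that closes the surrounding block. The inner state can give the } back so that the outer state can consume it (a hypothetical sketch; the block and value states are made up for illustration):

{
  block: [
    [ 'bracket', /{/ ],
    [ '#value', /:/, '@rest' ],
    [ 'bracket', /}/, '@break' ],      // consumes the shared "}"
  ],
  value: [
    [ 'operator', /^:/ ],
    [ 'delimiter', /;/, '@break' ],    // a value normally ends with ";"
    [ '', /}/, '@back' ],              // ...but "}" is left for the block state
    [ 'value', /[^;}]+/ ],
  ],
}
JavaScript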

@debug

If you want to know what your tokenizer matches, set the @debug special action as the fourth parameter (not the third!). This action outputs the matched result as a log on your browser console:

[ 'bracket', /{/, '@rest', '@debug' ]
JavaScript

Do not forget to remove it for the production build.

Categories

Each token is expected to be classified into one of the following categories, which will be highlighted by CSS:

  • keyword
  • constant
  • comment
  • tag
  • selector
  • atrule
  • attr
  • prop
  • value
  • variable
  • entity
  • cdata
  • prolog
  • identifier
  • string
  • number
  • boolean
  • function
  • class
  • decorator
  • regexp
  • operator
  • bracket
  • delimiter
  • symbol
  • space
  • text

All categories are defined as constants.