GoLexer – High-Performance Go Lexical Analyzer

GoLexer is a production-ready lexical analyzer (tokenizer) for Go that transforms source code into structured tokens. Built for compilers, interpreters, DSLs, configuration parsers, and code analysis tools.

View on GitHub →

Why I Built This

I built GoLexer to understand how compilers work at the lowest level. The challenge was supporting modern number formats, Unicode identifiers, and robust error recovery while keeping it fast and memory-efficient. This project taught me Go's string handling, rune management, and practical patterns for building developer tools.

Core Features

  • 50+ Token Types: Keywords, identifiers, numbers, strings, operators, punctuation
  • Unicode Support: International identifiers (café, résumé, 变量)
  • Number Formats: Decimal, hex, binary, octal, scientific notation
  • Error Recovery: Detects errors while continuing tokenization
  • JSON Configuration: Extend with custom keywords and operators
  • Performance: Single-pass, low-memory, streaming or batch mode
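To make the Unicode and number-format support above concrete, here is a minimal, self-contained sketch of rune-based first-character classification, the same style of single-pass scanning a lexer like this performs. It uses only the standard library and is an independent illustration, not GoLexer's actual code; the `classify` function and its labels are assumptions for demonstration.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify returns a coarse token category for the first token in src.
// Illustrative sketch only — not GoLexer's internal implementation.
func classify(src string) string {
	runes := []rune(src)
	if len(runes) == 0 {
		return "EOF"
	}
	r := runes[0]
	switch {
	case unicode.IsLetter(r) || r == '_':
		return "IDENT" // covers Unicode identifiers like 变量 or café
	case r == '0' && len(runes) > 1 &&
		(runes[1] == 'x' || runes[1] == 'b' || runes[1] == 'o'):
		return "NUMBER (prefixed)" // hex 0x2A, binary 0b1010, octal 0o755
	case unicode.IsDigit(r):
		return "NUMBER"
	default:
		return "SYMBOL"
	}
}

func main() {
	for _, s := range []string{"变量", "café", "0x2A", "3.14", "+"} {
		fmt.Printf("%-6s → %s\n", s, classify(s))
	}
}
```

Working on `[]rune` rather than bytes is what lets identifiers like `变量` scan correctly, since a single UTF-8 character may span multiple bytes.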

How It Works

GoLexer reads source code character by character and converts it into structured tokens:

Input:  let total = 42 + 3.14 * count;

Output:
LET          let        (Line 1, Col 1)
IDENT        total      (Line 1, Col 5)
ASSIGN       =          (Line 1, Col 11)
NUMBER       42         (Line 1, Col 13)
PLUS         +          (Line 1, Col 16)
NUMBER       3.14       (Line 1, Col 18)
MULTIPLY     *          (Line 1, Col 23)
IDENT        count      (Line 1, Col 25)
SEMICOLON    ;          (Line 1, Col 30)

Quick Start

package main

import (
    "fmt"
    "github.com/codetesla51/golexer/golexer"
)

func main() {
    source := `let total = 42 + 3.14 * count;`
    lexer := golexer.NewLexer(source)

    // Stream tokens one at a time until EOF.
    for {
        token := lexer.NextToken()
        if token.Type == golexer.EOF {
            break
        }
        fmt.Printf("%-12s %-10s (Line %d, Col %d)\n",
            token.Type, token.Literal, token.Line, token.Column)
    }
}

Batch Processing with Error Handling

lexer := golexer.NewLexer(source)
tokens, errors := lexer.TokenizeAll()

fmt.Printf("Generated %d tokens\n", len(tokens))
if len(errors) > 0 {
    fmt.Printf("Found %d errors:\n", len(errors))
    for _, err := range errors {
        fmt.Printf("  %s (Line %d, Col %d)\n", 
            err.Message, err.Line, err.Column)
    }
}

Custom Configuration

Extend GoLexer with custom keywords, operators, and punctuation using JSON:

{
  "additionalKeywords": {
    "unless": "UNLESS",
    "async": "ASYNC"
  },
  "additionalOperators": {
    "**": "POWER",
    "??": "NULL_COALESCE"
  }
}

lexer := golexer.NewLexerWithConfig(source, "config.json")

Real-World Applications

Compiler Frontend

type Compiler struct {
    lexer *golexer.Lexer
}

func (c *Compiler) CompileFile(filename string) error {
    source, err := os.ReadFile(filename)
    if err != nil {
        return err
    }
    c.lexer = golexer.NewLexer(string(source))
    tokens, errors := c.lexer.TokenizeAll()

    if len(errors) > 0 {
        return fmt.Errorf("lexical errors: %v", errors)
    }

    return c.parse(tokens)
}

Configuration Parser

func ParseConfig(file string) (*Config, error) {
    content, err := os.ReadFile(file)
    if err != nil {
        return nil, err
    }
    lexer := golexer.NewLexerWithConfig(string(content), "config.json")
    
    config := &Config{}
    for {
        tok := lexer.NextToken()
        if tok.Type == golexer.EOF {
            break
        }
        // Parse configuration syntax
    }
    return config, nil
}

Code Analysis

func AnalyzeCode(source string) {
    lexer := golexer.NewLexer(source)
    tokens, errors := lexer.TokenizeAll()

    tokenCounts := make(map[golexer.TokenType]int)
    for _, token := range tokens {
        tokenCounts[token.Type]++
    }

    fmt.Printf("Total tokens: %d\n", len(tokens))
    fmt.Printf("Unique types: %d\n", len(tokenCounts))
    fmt.Printf("Errors: %d\n", len(errors))
}

Project Structure

golexer/
├── golexer/     Core lexer implementation
├── config/      JSON configuration templates
├── cmd/         CLI examples and tests
├── go.mod       Go module definition
└── LICENSE      MIT License

Testing & Validation

The test suite validates 1700+ tokens across complex nested expressions and edge cases:

go test ./...
go run cmd/main.go test.lang

Technical Details

  • Single-pass tokenization
  • Low memory allocation
  • Full UTF-8 support
  • Streaming and batch modes
  • Comprehensive error reporting
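One common way the "low memory allocation" point is achieved in Go lexers is returning token literals as slices of the original source string rather than copies, since Go substrings share the source's backing array. The sketch below illustrates that technique with an ASCII-only identifier reader; it is an assumption about the general approach, not GoLexer's actual implementation.

```go
package main

import "fmt"

// readIdent returns the identifier starting at pos as a slice of the
// original source. The substring shares src's backing array, so no
// per-token allocation occurs — one way a lexer stays low-allocation.
// Illustrative sketch only (ASCII identifiers for brevity).
func readIdent(src string, pos int) string {
	start := pos
	for pos < len(src) && (src[pos] == '_' ||
		'a' <= src[pos] && src[pos] <= 'z' ||
		'A' <= src[pos] && src[pos] <= 'Z' ||
		'0' <= src[pos] && src[pos] <= '9') {
		pos++
	}
	return src[start:pos] // zero-copy: points into src
}

func main() {
	src := "let total = 42"
	fmt.Println(readIdent(src, 4)) // total
}
```

The trade-off is that retained tokens keep the whole source string alive; batch consumers that discard the source should copy literals they keep long-term.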

View Source Code on GitHub →

Built by Uthman | @codetesla51