# GoLexer – High-Performance Go Lexical Analyzer

GoLexer is a production-ready lexical analyzer (tokenizer) for Go that transforms source code into structured tokens. Built for compilers, interpreters, DSLs, configuration parsers, and code analysis tools.
## Why I Built This

I built GoLexer to understand how compilers work at the lowest level. The challenge was supporting modern number formats, Unicode identifiers, and robust error recovery while keeping the lexer fast and memory-efficient. This project taught me Go's string handling, rune management, and practical patterns for building developer tools.
## Core Features
- 50+ Token Types: Keywords, identifiers, numbers, strings, operators, punctuation
- Unicode Support: International identifiers (café, résumé, 变量)
- Number Formats: Decimal, hex, binary, octal, scientific notation
- Error Recovery: Detects errors while continuing tokenization
- JSON Configuration: Extend with custom keywords and operators
- Performance: Single-pass, low-memory, streaming or batch mode
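The number-format support listed above boils down to inspecting a literal's prefix and body. Here is a minimal, self-contained sketch of that classification (illustrative only — not GoLexer's internal code):

```go
package main

import (
	"fmt"
	"strings"
)

// classifyNumber decides which supported format a numeric literal uses:
// hex (0x), binary (0b), octal (0o), scientific (contains an exponent
// marker), or plain decimal. Prefix checks run first so that hex digits
// like the 'e' in 0xFE are not mistaken for an exponent.
func classifyNumber(lit string) string {
	lower := strings.ToLower(lit)
	switch {
	case strings.HasPrefix(lower, "0x"):
		return "HEX"
	case strings.HasPrefix(lower, "0b"):
		return "BINARY"
	case strings.HasPrefix(lower, "0o"):
		return "OCTAL"
	case strings.Contains(lower, "e"):
		return "SCIENTIFIC"
	default:
		return "DECIMAL"
	}
}

func main() {
	for _, lit := range []string{"42", "0xFF", "0b1010", "0o755", "6.02e23"} {
		fmt.Printf("%-8s -> %s\n", lit, classifyNumber(lit))
	}
}
```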
## How It Works

GoLexer reads source code character by character and converts it into structured tokens:

```text
Input:  let total = 42 + 3.14 * count;

Output:
LET        let     (Line 1, Col 1)
IDENT      total   (Line 1, Col 5)
ASSIGN     =       (Line 1, Col 11)
NUMBER     42      (Line 1, Col 13)
PLUS       +       (Line 1, Col 16)
NUMBER     3.14    (Line 1, Col 18)
MULTIPLY   *       (Line 1, Col 23)
IDENT      count   (Line 1, Col 25)
SEMICOLON  ;       (Line 1, Col 30)
```
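The single-pass, position-tracking scan described above can be sketched in a few dozen lines. This is a simplified standalone illustration (it does not classify keywords or multi-character operators the way GoLexer does):

```go
package main

import (
	"fmt"
	"unicode"
)

// Tok is a simplified token: a kind, the matched text, and its position.
type Tok struct {
	Kind      string
	Lit       string
	Line, Col int
}

// scan walks the input rune by rune in a single pass, updating the
// current line and column so every emitted token carries its position.
func scan(src string) []Tok {
	var toks []Tok
	line, col := 1, 1
	runes := []rune(src)
	for i := 0; i < len(runes); {
		r := runes[i]
		switch {
		case r == '\n':
			line++
			col = 1
			i++
		case unicode.IsSpace(r):
			col++
			i++
		case unicode.IsLetter(r) || r == '_':
			// Identifier: consume letters, digits, and underscores.
			start, startCol := i, col
			for i < len(runes) && (unicode.IsLetter(runes[i]) || unicode.IsDigit(runes[i]) || runes[i] == '_') {
				i++
				col++
			}
			toks = append(toks, Tok{"IDENT", string(runes[start:i]), line, startCol})
		case unicode.IsDigit(r):
			// Number: consume a run of digits.
			start, startCol := i, col
			for i < len(runes) && unicode.IsDigit(runes[i]) {
				i++
				col++
			}
			toks = append(toks, Tok{"NUMBER", string(runes[start:i]), line, startCol})
		default:
			// Everything else becomes a single-character token.
			toks = append(toks, Tok{"PUNCT", string(r), line, col})
			i++
			col++
		}
	}
	return toks
}

func main() {
	for _, t := range scan("let x = 42;") {
		fmt.Printf("%-7s %-4s (Line %d, Col %d)\n", t.Kind, t.Lit, t.Line, t.Col)
	}
}
```

Because the loop works on runes rather than bytes, Unicode identifiers such as `café` fall out of the same `unicode.IsLetter` check.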
## Quick Start

```go
package main

import (
	"fmt"

	"github.com/codetesla51/golexer/golexer"
)

func main() {
	source := `let total = 42 + 3.14 * count;`
	lexer := golexer.NewLexer(source)

	for {
		token := lexer.NextToken()
		if token.Type == golexer.EOF {
			break
		}
		fmt.Printf("%-12s %-10s (Line %d, Col %d)\n",
			token.Type, token.Literal, token.Line, token.Column)
	}
}
```
## Batch Processing with Error Handling

```go
lexer := golexer.NewLexer(source)
tokens, errors := lexer.TokenizeAll()

fmt.Printf("Generated %d tokens\n", len(tokens))
if len(errors) > 0 {
	fmt.Printf("Found %d errors:\n", len(errors))
	for _, err := range errors {
		fmt.Printf("  %s (Line %d, Col %d)\n",
			err.Message, err.Line, err.Column)
	}
}
```
## Custom Configuration

Extend GoLexer with custom keywords, operators, and punctuation using JSON:

```json
{
  "additionalKeywords": {
    "unless": "UNLESS",
    "async": "ASYNC"
  },
  "additionalOperators": {
    "**": "POWER",
    "??": "NULL_COALESCE"
  }
}
```

```go
lexer := golexer.NewLexerWithConfig(source, "config.json")
```
## Real-World Applications

### Compiler Frontend

```go
type Compiler struct {
	lexer *golexer.Lexer
}

func (c *Compiler) CompileFile(filename string) error {
	source, err := os.ReadFile(filename)
	if err != nil {
		return err
	}
	c.lexer = golexer.NewLexer(string(source))
	tokens, errors := c.lexer.TokenizeAll()
	if len(errors) > 0 {
		return fmt.Errorf("lexical errors: %v", errors)
	}
	return c.parse(tokens)
}
```
### Configuration Parser

```go
func ParseConfig(file string) (*Config, error) {
	content, err := os.ReadFile(file)
	if err != nil {
		return nil, err
	}
	lexer := golexer.NewLexerWithConfig(string(content), "config.json")
	config := &Config{}
	for {
		tok := lexer.NextToken()
		if tok.Type == golexer.EOF {
			break
		}
		// Parse configuration syntax here.
	}
	return config, nil
}
```
### Code Analysis

```go
func AnalyzeCode(source string) {
	lexer := golexer.NewLexer(source)
	tokens, errors := lexer.TokenizeAll()

	tokenCounts := make(map[golexer.TokenType]int)
	for _, token := range tokens {
		tokenCounts[token.Type]++
	}

	fmt.Printf("Total tokens: %d\n", len(tokens))
	fmt.Printf("Unique types: %d\n", len(tokenCounts))
	fmt.Printf("Errors: %d\n", len(errors))
}
```
## Project Structure

```text
golexer/
├── golexer/    Core lexer implementation
├── config/     JSON configuration templates
├── cmd/        CLI examples and tests
├── go.mod      Go module definition
└── LICENSE     MIT License
```
## Testing & Validation

Validated against 1700+ tokens, complex nested expressions, and edge cases.

```shell
go test ./...
go run cmd/main.go test.lang
```
## Technical Details
- Single-pass tokenization
- Low memory allocation
- Full UTF-8 support
- Streaming and batch modes
- Comprehensive error reporting
Built by Uthman | @codetesla51