Tokenizer

The tokenizer is the first stage of the processing pipeline. It transforms a raw LaTeX string into a stream of tokens — discrete units that the parser can work with.

What Is a Token?

A token is a meaningful unit extracted from the input. For example:

Input	Token Type	Value
`3.14`	NUMBER	3.14
`x`	VARIABLE	x
`\sin`	FUNCTION	sin
`+`	OPERATOR	add
`{`	LBRACE	-
`^`	OPERATOR	power

The tokenizer doesn't understand structure — it just identifies what each piece of text represents.

How Tokenization Works

The tokenizer reads the input character by character, applying these rules in order:

1. Skip Whitespace

Spaces, tabs, and newlines are consumed but produce no tokens. This is why 2+3 and 2 + 3 parse identically.

2. Recognize Numbers

When a digit or decimal point is encountered, the tokenizer greedily consumes the entire number:

"3.14159" → NUMBER(3.14159)
"1e-5"    → NUMBER(0.00001)
".5"      → NUMBER(0.5)

3. Recognize Operators

Single characters that are operators produce operator tokens:

"+"  → OPERATOR(add)
"-"  → OPERATOR(subtract)
"*"  → OPERATOR(multiply)
"/"  → OPERATOR(divide)
"^"  → OPERATOR(power)
"="  → OPERATOR(equals)

4. Recognize LaTeX Commands

When \ is encountered, the tokenizer reads the command name:

"\sin"    → FUNCTION(sin)
"\frac"   → FUNCTION(frac)
"\pi"     → CONSTANT(π)
"\alpha"  → VARIABLE(α)

The tokenizer maintains a registry of known commands. Unknown commands produce an error with suggestions for similar commands.

5. Recognize Identifiers

Single letters (when not part of a LaTeX command) become variables:

"x"  → VARIABLE(x)
"y"  → VARIABLE(y)

Multi-character sequences like xy are handled specially — see Implicit Multiplication below.

Implicit Multiplication

One of TeXpr's key features is implicit multiplication — the ability to write 2x instead of 2*x. This is how textbook notation works.

How It Works

When allowImplicitMultiplication is enabled (the default), the tokenizer inserts multiplication tokens between certain adjacent tokens:

Input	Tokens Produced
`2x`	NUMBER(2), OPERATOR(*), VARIABLE(x)
`xy`	VARIABLE(x), OPERATOR(*), VARIABLE(y)
`\sin x \cos x`	FUNCTION(sin), VARIABLE(x), OPERATOR(*), FUNCTION(cos), VARIABLE(x)
`(a+b)(c+d)`	..., RPAREN, OPERATOR(*), LPAREN, ...

The Rules

Implicit multiplication is inserted when:

Number followed by variable: 2x → 2 * x
Number followed by function: 2\sin{x} → 2 * sin(x)
Variable followed by variable: xy → x * y
Closing paren followed by opening paren: (a)(b) → (a) * (b)
Closing paren followed by variable/number: (a)b → (a) * b
Variable followed by opening paren: x(a+b) → x * (a+b)

When to Disable It

Disable implicit multiplication when you need multi-character variable names:

dart

final evaluator = Texpr(allowImplicitMultiplication: false);
// Now "velocity" is a single variable, not v*e*l*o*c*i*t*y
evaluator.evaluate('velocity * time', {'velocity': 10, 'time': 5});

Token Types

The complete list of token types:

Type	Examples	Purpose
`NUMBER`	3.14, 1e-5, .5	Numeric literals
`VARIABLE`	x, y, α, β	Variable references
`FUNCTION`	sin, cos, sqrt	Function names
`CONSTANT`	π, e, ∞	Mathematical constants
`OPERATOR`	+, -, *, /, ^	Binary operators
`LBRACE`	{	Opening brace
`RBRACE`	}	Closing brace
`LPAREN`	(	Opening parenthesis
`RPAREN`	)	Closing parenthesis
`LBRACKET`	[	Opening bracket
`RBRACKET`	]	Closing bracket
`COMMA`	,	Argument separator
`UNDERSCORE`	_	Subscript marker
`AMPERSAND`	&	Matrix column separator
`DOUBLEBACKSLASH`	\\	Matrix row separator
`EOF`	(end)	End of input

Error Handling

When the tokenizer encounters invalid input, it throws a TokenizerException with:

Position: The character index where the error occurred
Message: What went wrong
Suggestion: How to fix it

dart

try {
  evaluator.parse(r'\unknowncommand{x}');
} on TokenizerException catch (e) {
  print(e.position);    // 0
  print(e.message);     // "Unknown command: unknowncommand"
  print(e.suggestion);  // "Did you mean 'sin', 'cos', or 'tan'?"
}

Common errors:

Unknown LaTeX command
Invalid character
Malformed number (e.g., 1.2.3)

Unicode Support

The tokenizer accepts Unicode mathematical symbols directly:

Unicode	Equivalent LaTeX	Token
`π`	`\pi`	CONSTANT(π)
`∑`	`\sum`	FUNCTION(sum)
`∫`	`\int`	FUNCTION(int)
`√`	`\sqrt`	FUNCTION(sqrt)
`α`, `β`, `γ`...	`\alpha`, `\beta`, `\gamma`...	VARIABLE
`∞`	`\infty`	CONSTANT(∞)

This allows copy-pasting mathematical expressions from documents.

Extending the Tokenizer

You can register custom commands via ExtensionRegistry:

dart

final registry = ExtensionRegistry();
registry.registerCommand('myfunction', (cmd, pos) =>
  Token(type: TokenType.function, value: 'myfunction', position: pos)
);

final evaluator = Texpr(extensions: registry);

The registered command becomes available as \myfunction in expressions.

Performance Considerations

Tokenization is O(n) where n is the input length. It's a single pass with no backtracking.

The bottleneck is usually:

Large expressions (thousands of characters)
Many LaTeX commands (each requires a lookup)

For repeated evaluation of the same expression, use parse() once and evaluateParsed() multiple times — this skips tokenization on subsequent calls.

Summary

The tokenizer:

Reads input character-by-character
Produces a stream of typed tokens
Handles implicit multiplication
Recognizes LaTeX commands, Unicode symbols
Reports errors with position and suggestions

The tokens are then consumed by the Parser to build the AST.

Tokenizer ​

What Is a Token? ​

How Tokenization Works ​

1. Skip Whitespace ​

2. Recognize Numbers ​

3. Recognize Operators ​

4. Recognize LaTeX Commands ​

5. Recognize Identifiers ​

Implicit Multiplication ​

How It Works ​

The Rules ​

When to Disable It ​

Token Types ​

Error Handling ​

Unicode Support ​

Extending the Tokenizer ​

Performance Considerations ​

Summary ​

Tokenizer

What Is a Token?

How Tokenization Works

1. Skip Whitespace

2. Recognize Numbers

3. Recognize Operators

4. Recognize LaTeX Commands

5. Recognize Identifiers

Implicit Multiplication

How It Works

The Rules

When to Disable It

Token Types

Error Handling

Unicode Support

Extending the Tokenizer

Performance Considerations

Summary