MSE Graph Language Model — A Deterministic, Explainable Architecture

Zero learned weights Fully deterministic CPU-only 56/56 tests v2.1 — implemented

1. Introduction

Most language models are built around the same idea: train a neural network on enormous amounts of text, let it adjust billions of floating-point weights until it learns to predict the next word reasonably well, and then sample from a probability distribution at inference time. The model is powerful, but it is also a black box — you cannot point to the weight that caused a particular word to be chosen, and two runs with the same input can produce different output.

The MSE Graph Language Model (MSE-GLM) takes a different approach entirely. Language is represented as a directed graph: tokens are nodes, observed transitions are edges, and inference is graph traversal under a small set of explicit, inspectable rules. There are no learned weights, no gradients, no probability sampling — and because of that, every generation decision can be traced back to the exact rule and candidate set that produced it.

What this is not MSE-GLM is not a transformer competitor for open-domain generation or reasoning. It is an architecture for settings where guarantees matter more than fluency — grammar-constrained generation, embedded AI, audit-trail-required tooling, and any pipeline where reproducibility is non-negotiable.

2. Design Philosophy

The core bet is that language, in many constrained domains, does not need to be modeled probabilistically. If the valid output space is a finite set of token transitions — all valid SQL clauses, all valid JSON keys for a schema, all valid assembly mnemonics — then a graph that memorizes exactly those transitions can generate correctly constrained output with zero chance of emitting something it never observed, and zero need for a GPU.

Where the graph is genuinely ambiguous — two equally plausible next tokens given the same context — the architecture resolves that ambiguity using principled, inspectable rules rather than a probability sample. That is the core engineering problem this system solves, and the three-matrix design described below is how it does it.

3. Architecture Overview

Corpus ─▶ Tokenizer (BPE) ─▶ Sentence Splitting ─▶ Sequence Encoding │ ├──────────────────────────────────┬──────────────────────────────────┐ ▼ ▼ ▼ Edge Matrix (E) Bridge Matrix (B) Relationship Matrix (R) deduplicated bigrams deduplicated trigrams (triple_id, relationship_id) CSR-indexed by source + dual-axis cluster_id no triple content duplicated + T_index │ ▼ Inference Engine 4-stage pipeline + lineage tie-break + infer_shared_role() │ ▼ MSEGraphLanguageModel generate · explain_step · infer_shared_role · save/load

Training is a single O(N) pass over the corpus — no backpropagation, no epochs, no GPU. The trained model persists to a self-contained folder of JSON files (vocabulary, edges, bridges, relationships, metadata) that can be loaded and queried on any machine with Python.

4. Tokenizer

The tokenizer is a from-scratch Byte Pair Encoding (BPE) implementation — the same approach used by GPT-2, but written from the ground up with no external dependencies. It converts raw text into integer token IDs through iterative character-pair merging.

Four reserved special tokens anchor the system:

Token	ID	Role
<PAD>	0	Padding placeholder (reserved)
<UNK>	1	Unknown character fallback
<BOS>	2	Beginning of sequence — prepended to every prompt
<EOS>	3	End of sequence — appended during training only

Sentence boundaries (on . ! ? \n) are preserved during training so the graph learns where sequences legally end. Streaming training from a file is supported, so corpus size is not bounded by available RAM.

5. The Three Matrices

The trained model is three compact, array-backed, CSR-indexed structures. All storage uses Python's array.array('i') — 4 bytes per integer, roughly 7× smaller than equivalent Python lists.

Edge Matrix (E)

A deduplicated list of every adjacent token pair observed in the corpus, sorted by source token and indexed for O(1) successor lookup. This is the bigram graph — it answers "what tokens have I seen follow token X?"

Bridge Matrix (B)

Extends E to three-token context: every observed (source, bridge, target) triple is stored once, giving the model trigram-equivalent context with no attention mechanism. The key structural innovation is a fourth column, cluster_id, described in the next section.

Relationship Matrix (R)

The most novel structure. It stores only two columns:

Column	Type	Meaning
triple_id	int	Foreign key into B — which bridge triple this row annotates
relationship_id	int	Which training sentence produced this triple

R contains no copy of the triple's own content. It is a pure many-to-many edge list: a triple shared across multiple training sentences carries one R row per sentence. This is what enables lineage-aware tie-breaking at inference time (see §7) and O(1) batch audit of "every triple belonging to sentence N."

# Example R matrix — corpus: "the cat sat on the mat." + "the dog sat on the carpet."
triple_id  |  relationship_id
-----------+-----------------
    1      |       1           ← triple (the→sat, bridge=cat) appeared in sentence 1
    2      |       1
    3      |       1           ← triple (sat→the, bridge=on) — shared
    4      |       1
    5      |       2
    6      |       2
    3      |       2           ← same triple_id 3, now also in sentence 2
    7      |       2

6. Distributional Clustering

The cluster_id column in B is a weights-free, symbolic analogue of the distributional hypothesis underlying word2vec and other embedding models: tokens that are interchangeable in the same structural slot are functionally related, without any claim of shared meaning.

Clustering is assigned by a dual-axis rule:

Bridge axis — triples sharing the same (source, target) pair get a shared cluster_id. This groups tokens interchangeable as the middle token (the "bridge").
Target axis — triples sharing the same (source, bridge) pair, not already grouped on the bridge axis, get a shared cluster_id. This groups tokens interchangeable as the destination.
cluster_id = 0 — a triple that matches neither axis with any other triple is left unclustered.

Worked example the cat sat on the mat. + the dog sat on the carpet.

• Triples 1 and 5 share (source=the, target=sat) → bridge axis → cluster 1
Members: cat and dog are interchangeable bridges
• Triples 4 and 7 share (source=on, bridge=the) → target axis → cluster 2
Members: mat and carpet are interchangeable targets
• Triples 2, 3, 6 → cluster_id = 0 (no shared slot with another triple)

A derived T_index maps each token to the non-zero cluster IDs it participates in. Token relatedness is then defined as |T_index[a] ∩ T_index[b]| — a set-intersection analogue of cosine similarity that needs no embedding space.

infer_shared_role()

This enables a new type of query: give the model an unordered set of tokens and ask what they have in common structurally:

model.infer_shared_role(["cat", "dog"])
# → [('sat', 'bridge_axis', {'cluster_id': 1, 'overlap': 2}), ...]
# cat and dog both filled the bridge slot in (the → ___ → sat)

7. Inference Pipeline

At every generation step the engine knows the current token, the previous token (if available), and a running set active_rels — the relationship IDs the path has been consistent with so far. It resolves the next token through four ordered stages:

Stage 1

Exact Bridge Match

Find all triples where source=previous and bridge=current. If multiple matches, prefer those whose R lineage intersects active_rels (lineage tie-break). Fall back to storage order only if no lineage match.

Stage 2

Bridge Voting

Every bridge triple from previous casts a vote for its target. A triple whose bridge equals current is weighted 2×. Highest vote wins.

Stage 3

Bigram Voting

Fall back to the Edge Matrix. Every observed successor of current receives an equal vote. Highest wins.

Stage 4

Termination

No candidate found at any stage. Emit <EOS> and stop. Generation also stops if <EOS> itself is selected at any earlier stage.

The lineage-narrowing rule (critical detail)

When a Stage 1 step is resolved, active_rels is updated by intersecting it with the chosen triple's lineage — not replacing it. This is subtle but essential: a triple shared across several training sentences (e.g. a common clause like sat on the) must not erase the more specific lineage a prior step already established.

# BUG (old): replace active_rels entirely
active_rels = new_triple_rels  # ← wipes specificity at shared triples

# FIX: narrow by intersection
narrowed = active_rels & new_triple_rels
active_rels = narrowed if narrowed else new_triple_rels  # reset only on divergence

Without this fix, the dog would generate the dog sat on the mat instead of the correct the dog sat on the carpet, because the shared sat → on → the triple would widen active_rels back out to all lineages just before the branch point. This regression is covered by an automated test in the suite.

8. Explainability

Every generation step is fully traceable. The explain_step() method and /explain chat command return the stage, rule, candidate set, and active_rels for any given step:

you> /explain the | dog
model> next='sat'  stage=1  rule=storage_order_fallback  active_rels={1}

you> /explain sat | on
model> next='the'  stage=1  rule=single_match  active_rels={1}

you> /explain on | the
model> next='carpet'  stage=1  rule=lineage_match  active_rels={1}

The analyse.py CLI provides full step-by-step traces from the command line:

python3 analyse.py --model runs/demo trace "the dog" --max-tokens 12

The output names the chosen token, stage, rule, and active lineages at every step — making the model fully auditable without any post-hoc interpretation.

9. MSE-GLM vs. Transformers

Property	MSE-GLM	Transformer
Weights	None	Billions of floats
Training cost	O(N), one pass, CPU	GPU weeks / months
Inference	Deterministic, O(out-degree)	Stochastic, O(seq²)
Explainability	Full — every step traceable	Post-hoc approximation only
Hallucination	Impossible for unseen transitions	Can and does
Context length	(previous, current) + lineage	Thousands of tokens
Semantic understanding	None	Strong
Generalisation	None beyond training data	Strong
RAM / GPU requirement	CPU only, array-backed	GPU required at scale

The two architectures optimize for different things. Transformers win on fluency, generalization, and long-range reasoning. MSE-GLM wins on guarantees, auditability, and resource efficiency. They are not competitors — they are tools for different problems.

10. Use Cases

Grammar-constrained generation — SQL, JSON, config files, shell commands: any domain where the valid output space is closed and can be observed from a corpus.
LLM guardrails — attach MSE-GLM as a structural validator on top of a transformer's output to catch illegal transitions before they reach users.
Embedded / edge AI — no GPU, no framework, pure Python + stdlib. Deploy on a Raspberry Pi or a microcontroller-class device.
Compliance-sensitive tooling — legal, finance, healthcare contexts where every output decision must be auditable by a human reviewer.
Autocomplete engines — deterministic, fast, constrained to observed patterns.
Distributional similarity without embeddings — infer_shared_role() identifies which tokens are interchangeable in a given structural slot, with no embedding model required.

11. Track Record

PHASE 1 — SDD v2.0

Core architecture designed and implemented

Tokenizer (BPE, streaming), Edge Matrix, Bridge Matrix (trigram context), four-stage deterministic inference engine, model orchestrator, corpus analyser. Full SDD written and verified against the implementation.

PHASE 2 — Relationship Matrix

Lineage-aware tie-breaking

The R matrix (triple_id, relationship_id only — no content duplication) added to resolve Stage 1 ties by sequence lineage rather than arbitrary storage order. Batch audit of any training sentence added at O(1). SDD v2.1 addendum written.

PHASE 3 — Dual-axis clustering

Distributional clustering without embeddings

cluster_id column added to B via bridge-axis + target-axis rules. T_index built alongside for O(1) cluster membership lookup. infer_shared_role() inference mode added — verified live that cat+dog → sat, mat+carpet → (EOS/the).

PHASE 4 — Lineage-narrowing bug fix

Critical regression caught and fixed

"the dog" was generating "the dog sat on the mat" instead of carpet. Root cause: active_rels was replaced rather than narrowed at shared triples, losing dog-specific lineage at the branch point. Fixed via intersection. Dedicated regression test added.

regression: confirmed fixed

PHASE 5 — Training pipeline + REPL

train.py, chat.py, production save/load

Self-contained model folder persistence (vocabulary.json, edges.json, bridges.json, relationships.json, meta.json). CLI training pipeline supporting inline text or streamed files. Interactive chat REPL with /explain, /shared, /clusters, /stats commands.

PHASE 6 — Professional analysis layer

analyse.py — 12 CLI subcommands

corpus, stats, topology, clusters, cluster <id>, relationships, relationship <id>, token, similarity, shared, trace, report — all with optional JSON export. No external dependencies.

CURRENT — v2.1 Complete

56 / 56 automated tests passing

Full regression suite on a 12-sentence multi-lineage corpus (cats/dogs/boys/girls, birds/planes, fish/ducks) verifying every layer: tokenizer, graph construction, R schema, lineage narrowing, determinism, explain_step(), infer_shared_role(), analyser reports, and save/load round-trip.

✅ 56 / 56 passing

System Design Document v2.1

Full architecture specification — 20 sections covering every component,
worked examples, complexity analysis, and implementation notes.

View on GitHub

Download SDD v2.1 (PDF)

12. Download & Run

The full implementation — tokenizer, graphs, inference engine, training CLI, chat REPL, analysis CLI, and test suite — is on GitHub. No pip installs required beyond the Python standard library.

git clone https://github.com/fodokidza/mse_glm.git
cd mse-glm

# train a model
python3 train.py --text "your corpus here." --out runs/model --vocab-size 500

# chat with it
python3 chat.py --model runs/model

# analyse it
python3 analyse.py --model runs/model report

# run the tests
python3 test.py

MSE Graph Language Model:Deterministic, Explainable,Zero-Weight Inference

1. Introduction

2. Design Philosophy

3. Architecture Overview

4. Tokenizer

5. The Three Matrices

Edge Matrix (E)

Bridge Matrix (B)

Relationship Matrix (R)

6. Distributional Clustering

infer_shared_role()

7. Inference Pipeline

Exact Bridge Match

Bridge Voting

Bigram Voting

Termination

The lineage-narrowing rule (critical detail)

8. Explainability

9. MSE-GLM vs. Transformers

10. Use Cases

11. Track Record

Core architecture designed and implemented

Lineage-aware tie-breaking

Distributional clustering without embeddings

Critical regression caught and fixed

train.py, chat.py, production save/load

analyse.py — 12 CLI subcommands

56 / 56 automated tests passing

System Design Document v2.1

12. Download & Run

MSE Graph Language Model:
Deterministic, Explainable,
Zero-Weight Inference