KSU C-Day · Spring 2026 · Masters Research

Architectural stabilization of LLM output under behavioral prompt variation.

An empirical study — and a working pipeline — demonstrating that LLM failures in enterprise settings are not a model problem. They are an architecture problem. A generate → critique → validate → repair loop achieves 100% schema compliance across every prompt style tested, where baseline systems collapse.

Researcher: Crystal Tubbs
Program: MSAI · Kennesaw State
Course: Big Data Analytics
Advisor: Dr. Martin Brown
LLM Reliability Pipeline dashboard — Run Experiment view
Live · LLM Reliability Pipeline Dashboard

A generate → critique → validate → repair loop that doesn't break.

Enterprise teams blame models when agents misfire. This study tested that assumption by running three systems — a naive baseline, a retrieval-augmented baseline, and an architectural reliability pipeline — across structured, conversational, ambiguous, and casual prompt styles on a VA veteran-benefits extraction task.

The conjecture

LLM output failures in production are driven by user prompt behavior, not model capability. If that's true, the fix is structural — an architectural layer that stabilizes the model's output regardless of how the prompt is phrased.

The method

A controlled evaluation of three systems on identical tasks and identical prompts across four prompt styles. The reliability pipeline wraps the model in a four-stage loop: generate a candidate, critique it against the task schema, validate structure and fields, and repair any violations before returning. Every stage is observable and measurable.
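
In code, the loop might look like the following minimal sketch, which assumes the official Anthropic Python SDK for the generate stage. The schema fields, model name, and the folding of critique into the repair prompt are illustrative stand-ins, not the study's actual implementation:

```python
import json

import anthropic  # official Anthropic SDK; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()

REQUIRED_FIELDS = {"benefit_type", "eligibility", "amount"}  # illustrative task schema

def generate(prompt: str) -> str:
    """Stage 1: produce a candidate answer from the model."""
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def validate(candidate: str) -> list[str]:
    """Stage 3: check structure (valid JSON) and fields (schema keys present)."""
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    return [f"missing field: {f}" for f in REQUIRED_FIELDS - set(data)]

def run_pipeline(prompt: str, max_repairs: int = 2) -> str:
    """Generate -> critique -> validate -> repair until compliant or out of budget."""
    candidate = generate(prompt)
    for _ in range(max_repairs):
        violations = validate(candidate)
        if not violations:
            return candidate  # schema-compliant: done
        # Stages 2 and 4, folded together here: feed the violations back to the
        # model as a critique and ask for a repaired candidate.
        candidate = generate(
            "Repair this output so it satisfies the task schema.\n"
            f"Violations: {violations}\nOutput: {candidate}"
        )
    return candidate
```

A compliant first candidate costs a single call; each repair round adds one more, consistent with the roughly 2× call cost in the finding below.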

The finding

System A, the naive baseline, showed a 100-point compliance gap between structured and casual prompts despite a 100% JSON parse rate — proving the failure is silent and style-driven. The reliability pipeline achieved 100% schema compliance across every prompt style at roughly 2× the LLM call cost. Architectural stability without fine-tuning, without retraining, without touching the model.

Reliability pipeline live dashboard
Run experiment · live streaming results

See the pipeline run, live.

The reliability dashboard runs the full experiment on demand. Pick a system, a task, the prompt styles, and the sample limit — then watch real Python execute on the backend and stream schema compliance, parse rate, latency, and total runs into the panel in real time.
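
Under stated assumptions, that stream can be pictured as a generator that times each run, updates the four panel metrics, and yields one JSON event per run. The event shape and parameter names are illustrative, not the dashboard's actual code:

```python
import json
import time
from collections.abc import Callable, Iterator

STYLES = ("structured", "conversational", "ambiguous", "casual")

def parses(output: str) -> bool:
    """Parse-rate check: does the output parse as JSON at all?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def run_experiment(
    system: Callable[[str], str],          # e.g. run_pipeline from the sketch above
    prompts_by_style: dict[str, list[str]],
    is_compliant: Callable[[str], bool],   # schema check for the chosen task
    limit: int = 10,
) -> Iterator[str]:
    """Yield one JSON metrics event per run, ready to push to the panel."""
    total = parsed = compliant = 0
    for style in STYLES:
        for prompt in prompts_by_style[style][:limit]:
            start = time.perf_counter()
            output = system(prompt)
            latency_ms = (time.perf_counter() - start) * 1000
            total += 1
            parsed += parses(output)
            compliant += is_compliant(output)
            yield json.dumps({
                "style": style,
                "total_runs": total,
                "parse_rate": parsed / total,
                "schema_compliance": compliant / total,
                "latency_ms": round(latency_ms, 1),
            })
```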

Bring your own Anthropic API key. Reproduce the finding. Test it against your own enterprise prompts.

Compliance across prompt styles

Output stability, measured as schema compliance averaged across structured, conversational, ambiguous, and casual prompt styles.

System A · Naive baseline · 0%
System B · RAG baseline · 0%
System C · Reliability pipeline · 100%

CRAFT — the research, turned into a product.

If architectural stability is the answer, prompt repair is the operating surface. CRAFT — Contextual Rewriting and Fidelity Tester — takes the reliability pipeline findings and delivers them as a working prompt coach: classify the task, audit the prompt against a task-appropriate rubric, rewrite it, and score the output against the same rubric. It's the reliability research, shipped.

From research finding to working tool.

CRAFT operationalizes the reliability pipeline at the prompt layer. Paste any enterprise prompt — CRAFT diagnoses it, scores it, rewrites it, and shows you the output difference side by side.

Built on a FastAPI/SSE backend with a streaming frontend. Five enterprise task types, task-weighted rubrics, and a ceiling score that surfaces exactly which gaps only a human can fill (the rubric and ceiling are sketched after the four steps below).
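
A minimal sketch of that backend shape, using FastAPI's StreamingResponse to emit Server-Sent Events. The endpoint path, event names, and the placeholder run_stage worker are assumptions, not CRAFT's actual API:

```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class AnalyzeRequest(BaseModel):
    prompt: str  # the enterprise prompt to classify, audit, rewrite, and score

def run_stage(stage: str, prompt: str) -> str:
    """Placeholder stage worker; in the real app each stage is an LLM call."""
    return json.dumps({"stage": stage, "prompt_chars": len(prompt)})

def craft_stages(prompt: str):
    """Yield one Server-Sent Event per CRAFT stage as it completes."""
    for stage in ("classify", "audit", "rewrite", "score"):
        # SSE wire format: an `event:` line, a `data:` line, a blank-line terminator.
        yield f"event: {stage}\ndata: {run_stage(stage, prompt)}\n\n"

@app.post("/analyze")
def analyze(req: AnalyzeRequest):
    return StreamingResponse(craft_stages(req.prompt), media_type="text/event-stream")
```

A streaming frontend can read these frames incrementally and render each stage's panel as its event arrives.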

CRAFT prompt analyzer dashboard
Live · CRAFT Prompt Analyzer
i. Classify · LLM-powered task type classifier across five enterprise categories.
ii. Audit · Rubric-based prompt quality scoring with task-weighted criteria.
iii. Rewrite · Automated prompt repair with explicit user-gap placeholders.
iv. Score · Output quality evaluated against the same rubric. Ceiling surfaced.
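
One way to picture the task-weighted rubric and the ceiling score. The criteria, weights, and the choice of which gaps only a human can fill are invented for illustration; CRAFT's actual rubric differs:

```python
# Illustrative rubric: criterion weights per task type (invented numbers).
RUBRIC_WEIGHTS = {
    "report_generation": {
        "context": 0.3, "constraints": 0.3, "format": 0.2, "audience": 0.2,
    },
}

# Criteria an automated rewrite cannot fix on its own (hypothetical choice).
HUMAN_ONLY = {"context"}

def score_prompt(task_type: str, criteria: dict[str, float]) -> tuple[float, float]:
    """Return (weighted score, ceiling), both on a 0-1 scale."""
    weights = RUBRIC_WEIGHTS[task_type]
    score = sum(weights[c] * criteria[c] for c in weights)
    # Ceiling: the best a rewrite alone can reach. Human-only criteria stay at
    # their current values; everything else is assumed fixable to full marks.
    ceiling = sum(
        weights[c] * (criteria[c] if c in HUMAN_ONLY else 1.0) for c in weights
    )
    return score, ceiling

score, ceiling = score_prompt(
    "report_generation",
    {"context": 0.5, "constraints": 0.8, "format": 1.0, "audience": 0.6},
)
# score = 0.71, ceiling = 0.85: anything above 0.85 needs the human to
# supply the missing context, which is exactly the gap CRAFT surfaces.
```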

CHRYSALIS — governance at the belief layer.

The reliability pipeline proved a structural idea: you can stabilize AI output by intervening in its architecture, not its training. CHRYSALIS extends that same thesis one layer deeper — into the belief state of autonomous agents, with on-chain attestation, metacognitive intervention, and regulatory-ready provenance.

CHRYSALIS — chrysalis and butterfly mark

Every governance tool today watches what agents do. CHRYSALIS intervenes at the layer before action — validating what the agent believes, catching contradictions in real time, and writing an immutable record to Solana. Governance that operates as a performance multiplier, not a cost center. A minimal sketch of the pre-action check follows the three modules below.

MEMOIR · Immutable on-chain belief pipeline.
ORACLE · Learning loop + belief quality scoring.
MIRROR · Metacognitive intervention layer.
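
A minimal sketch of that pre-action check, assuming beliefs arrive as simple proposition/value pairs. The contradiction rule is deliberately crude, and the attestation is a local SHA-256 stand-in; a real MEMOIR pipeline would anchor the digest on Solana:

```python
import hashlib
import json
import time

class BeliefStore:
    """Toy belief state: one truth value per proposition, with an audit trail."""

    def __init__(self) -> None:
        self.beliefs: dict[str, bool] = {}
        self.ledger: list[str] = []  # local stand-in for the on-chain record

    def attest(self, record: dict) -> str:
        """Hash the record; a real pipeline would write this digest to Solana."""
        digest = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        self.ledger.append(digest)
        return digest

    def assert_belief(self, proposition: str, value: bool) -> bool:
        """Validate a belief before the agent acts on it."""
        if self.beliefs.get(proposition, value) != value:
            # Metacognitive intervention point: the contradiction is caught
            # here, before any downstream action executes.
            self.attest({"event": "contradiction", "proposition": proposition,
                         "ts": time.time()})
            return False
        self.beliefs[proposition] = value
        self.attest({"event": "belief", "proposition": proposition,
                     "value": value, "ts": time.time()})
        return True

store = BeliefStore()
store.assert_belief("invoice_7_paid", True)   # accepted and attested
store.assert_belief("invoice_7_paid", False)  # contradiction: blocked pre-action
```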

Crystal Tubbs — researcher, builder, founder.

Crystal Tubbs

I'm an MSAI candidate at Kennesaw State University and the founder of Metamorphic Curations, an AI transformation consultancy. My research sits at the intersection of LLM behavior, AI fairness, and governance — the parts of the field where the metric doesn't match reality, and where the fix has to be structural.

My coursework projects span research on covert bias transfer, surrogate accountability for agentic systems, and context-dependent bias in LLM resume screening, alongside a product line that puts that research into builders' hands: the reliability pipeline featured here, CRAFT as its operational surface, and CHRYSALIS as its extension to agentic belief governance.

I build agentic systems and RAG pipelines and advise clients on AI transformation. The thread through all of it is the same: use technology to consciously build a more equitable world.

Governance isn't a cost center. It's a performance multiplier when it operates at the right layer.

Three projects. One thesis through-line.

Each of these projects attacks the same underlying problem from a different angle: AI systems fail in ways standard evaluation misses. Bias that evades audit. Accountability that evades policy. Screening behavior that evades scrutiny.

Fairness · Bias

CIPHER

Covert Influence Passed via Hidden Encoding in Representations

Key finding — the metric illusion

Standard fairness metrics (SPD/EOD) score 0.0 despite significant accuracy disparity between groups. Reframed as label-mediated bias propagation invisible to conventional fairness audits.

Governance · Policy

SAF

Surrogate Accountability Framework for Agentic AI

Key finding — the accountability gap

A four-pillar governance model (entitlement governance, continuous observability, lifecycle accountability, and proportional authority) that closes the gap between AI policy and the operational reality of autonomous agents.

LLM Behavior · Hiring

PRISM

Proxy Recognition and Inclusion Scoring Method

Key finding — context rewrites the model

Controlled study across 324 synthetic resumes. Demographic proxy signals in first names shift LLM screening behavior even when candidate qualifications are identical. Co-authored with Destiny Raburnel.