KSU C-Day · Spring 2026 · Masters Research

Architectural stabilization of LLM output under behavioral prompt variation.

An empirical study — and a working pipeline — demonstrating that LLM failures in enterprise settings are not a model problem. They are an architecture problem. A generate → critique → validate → repair loop achieves 100% schema compliance across every prompt style tested, where baseline systems collapse.

Researcher: Crystal Tubbs
Program: MSAI · Kennesaw State
Course: Big Data Analytics
Advisor: Dr. Martin Brown
LLM Reliability Pipeline dashboard — Run Experiment view
Live · LLM Reliability Pipeline Dashboard

A generate → critique → validate → repair loop that doesn't break.

Enterprise teams blame models when agents misfire. This study tested that assumption by running three systems — a naive baseline, a retrieval-augmented baseline, and an architectural reliability pipeline — across structured, conversational, ambiguous, and casual prompt styles on a VA veteran-benefits extraction task.

The conjecture

LLM output failures in production are driven by user prompt behavior, not model capability. If that's true, the fix is structural — an architectural layer that stabilizes the model's output regardless of how the prompt is phrased.

The method

A controlled evaluation of three systems on identical tasks and identical prompts across four prompt styles. The reliability pipeline wraps the model in a four-stage loop: generate a candidate, critique it against the task schema, validate structure and fields, and repair any violations before returning. Every stage is observable and measurable.
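
In code, the loop might look like the following minimal sketch, which assumes the official Anthropic Python SDK for the generate stage. The schema fields, model name, and the folding of critique into the repair prompt are illustrative stand-ins, not the study's actual implementation:

```python
import json

import anthropic  # official Anthropic SDK; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()

REQUIRED_FIELDS = {"benefit_type", "eligibility", "amount"}  # illustrative task schema

def generate(prompt: str) -> str:
    """Stage 1: produce a candidate answer from the model."""
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

def validate(candidate: str) -> list[str]:
    """Stage 3: check structure (valid JSON) and fields (schema keys present)."""
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    return [f"missing field: {f}" for f in REQUIRED_FIELDS - set(data)]

def run_pipeline(prompt: str, max_repairs: int = 2) -> str:
    """Generate -> critique -> validate -> repair until compliant or out of budget."""
    candidate = generate(prompt)
    for _ in range(max_repairs):
        violations = validate(candidate)
        if not violations:
            return candidate  # schema-compliant: done
        # Stages 2 and 4, folded together here: feed the violations back to the
        # model as a critique and ask for a repaired candidate.
        candidate = generate(
            "Repair this output so it satisfies the task schema.\n"
            f"Violations: {violations}\nOutput: {candidate}"
        )
    return candidate
```

A compliant first candidate costs a single call; each repair round adds one more, consistent with the roughly 2× call cost in the finding below.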

The finding

System A, the naive baseline, showed a 100-point compliance gap between structured and casual prompts despite a 100% JSON parse rate — proving the failure is silent and style-driven. The reliability pipeline achieved 100% schema compliance across every prompt style at roughly 2× the LLM call cost. Architectural stability without fine-tuning, without retraining, without touching the model.

Reliability pipeline live dashboard
Run experiment · live streaming results

See the pipeline run, live.

The reliability dashboard runs the full experiment on demand. Pick a system, a task, the prompt styles, and the sample limit — then watch real Python execute on the backend and stream schema compliance, parse rate, latency, and total runs into the panel in real time.
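
Under stated assumptions, that stream can be pictured as a generator that times each run, updates the four panel metrics, and yields one JSON event per run. The event shape and parameter names are illustrative, not the dashboard's actual code:

```python
import json
import time
from collections.abc import Callable, Iterator

STYLES = ("structured", "conversational", "ambiguous", "casual")

def parses(output: str) -> bool:
    """Parse-rate check: does the output parse as JSON at all?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def run_experiment(
    system: Callable[[str], str],          # e.g. run_pipeline from the sketch above
    prompts_by_style: dict[str, list[str]],
    is_compliant: Callable[[str], bool],   # schema check for the chosen task
    limit: int = 10,
) -> Iterator[str]:
    """Yield one JSON metrics event per run, ready to push to the panel."""
    total = parsed = compliant = 0
    for style in STYLES:
        for prompt in prompts_by_style[style][:limit]:
            start = time.perf_counter()
            output = system(prompt)
            latency_ms = (time.perf_counter() - start) * 1000
            total += 1
            parsed += parses(output)
            compliant += is_compliant(output)
            yield json.dumps({
                "style": style,
                "total_runs": total,
                "parse_rate": parsed / total,
                "schema_compliance": compliant / total,
                "latency_ms": round(latency_ms, 1),
            })
```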

Bring your own Anthropic API key. Reproduce the finding. Test it against your own enterprise prompts.

Compliance across prompt styles

Output stability, measured as schema compliance averaged across structured, conversational, ambiguous, and casual prompt styles.

System A · Naive baseline · 0%
System B · RAG baseline · 0%
System C · Reliability pipeline · 100%

CRAFT — the research, turned into a product.

If architectural stability is the answer, prompt repair is the operating surface. CRAFT — Contextual Rewriting and Fidelity Tester — takes the reliability pipeline findings and delivers them as a working prompt coach: classify the task, audit the prompt against a task-appropriate rubric, rewrite it, and score the output against the same rubric. It's the reliability research, shipped.

From research finding to working tool.

CRAFT operationalizes the reliability pipeline at the prompt layer. Paste any enterprise prompt — CRAFT diagnoses it, scores it, rewrites it, and shows you the output difference side by side.

Built on a FastAPI/SSE backend with a streaming frontend. Five enterprise task types, task-weighted rubrics, and a ceiling score that surfaces exactly which gaps only a human can fill (the rubric and ceiling are sketched after the four steps below).
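
A minimal sketch of that backend shape, using FastAPI's StreamingResponse to emit Server-Sent Events. The endpoint path, event names, and the placeholder run_stage worker are assumptions, not CRAFT's actual API:

```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class AnalyzeRequest(BaseModel):
    prompt: str  # the enterprise prompt to classify, audit, rewrite, and score

def run_stage(stage: str, prompt: str) -> str:
    """Placeholder stage worker; in the real app each stage is an LLM call."""
    return json.dumps({"stage": stage, "prompt_chars": len(prompt)})

def craft_stages(prompt: str):
    """Yield one Server-Sent Event per CRAFT stage as it completes."""
    for stage in ("classify", "audit", "rewrite", "score"):
        # SSE wire format: an `event:` line, a `data:` line, a blank-line terminator.
        yield f"event: {stage}\ndata: {run_stage(stage, prompt)}\n\n"

@app.post("/analyze")
def analyze(req: AnalyzeRequest):
    return StreamingResponse(craft_stages(req.prompt), media_type="text/event-stream")
```

A streaming frontend can read these frames incrementally and render each stage's panel as its event arrives.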

CRAFT prompt analyzer dashboard
Live · CRAFT Prompt Analyzer
i. Classify · LLM-powered task type classifier across five enterprise categories.
ii. Audit · Rubric-based prompt quality scoring with task-weighted criteria.
iii. Rewrite · Automated prompt repair with explicit user-gap placeholders.
iv. Score · Output quality evaluated against the same rubric. Ceiling surfaced.
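
One way to picture the task-weighted rubric and the ceiling score. The criteria, weights, and the choice of which gaps only a human can fill are invented for illustration; CRAFT's actual rubric differs:

```python
# Illustrative rubric: criterion weights per task type (invented numbers).
RUBRIC_WEIGHTS = {
    "report_generation": {
        "context": 0.3, "constraints": 0.3, "format": 0.2, "audience": 0.2,
    },
}

# Criteria an automated rewrite cannot fix on its own (hypothetical choice).
HUMAN_ONLY = {"context"}

def score_prompt(task_type: str, criteria: dict[str, float]) -> tuple[float, float]:
    """Return (weighted score, ceiling), both on a 0-1 scale."""
    weights = RUBRIC_WEIGHTS[task_type]
    score = sum(weights[c] * criteria[c] for c in weights)
    # Ceiling: the best a rewrite alone can reach. Human-only criteria stay at
    # their current values; everything else is assumed fixable to full marks.
    ceiling = sum(
        weights[c] * (criteria[c] if c in HUMAN_ONLY else 1.0) for c in weights
    )
    return score, ceiling

score, ceiling = score_prompt(
    "report_generation",
    {"context": 0.5, "constraints": 0.8, "format": 1.0, "audience": 0.6},
)
# score = 0.71, ceiling = 0.85: anything above 0.85 needs the human to
# supply the missing context, which is exactly the gap CRAFT surfaces.
```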

CHRYSALIS — governance at the belief layer.

The reliability pipeline proved a structural idea: you can stabilize AI output by intervening in its architecture, not its training. CHRYSALIS extends that same thesis one layer deeper — into the belief state of autonomous agents, with on-chain attestation, metacognitive intervention, and regulatory-ready provenance.

CHRYSALIS — chrysalis and butterfly mark

Every governance tool today watches what agents do. CHRYSALIS intervenes at the layer before action — validating what the agent believes, catching contradictions in real time, and writing an immutable record to Solana. Governance that operates as a performance multiplier, not a cost center. A minimal sketch of the pre-action check follows the three modules below.

MEMOIR · Immutable on-chain belief pipeline.
ORACLE · Learning loop + belief quality scoring.
MIRROR · Metacognitive intervention layer.
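
A minimal sketch of that pre-action check, assuming beliefs arrive as simple proposition/value pairs. The contradiction rule is deliberately crude, and the attestation is a local SHA-256 stand-in; a real MEMOIR pipeline would anchor the digest on Solana:

```python
import hashlib
import json
import time

class BeliefStore:
    """Toy belief state: one truth value per proposition, with an audit trail."""

    def __init__(self) -> None:
        self.beliefs: dict[str, bool] = {}
        self.ledger: list[str] = []  # local stand-in for the on-chain record

    def attest(self, record: dict) -> str:
        """Hash the record; a real pipeline would write this digest to Solana."""
        digest = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        self.ledger.append(digest)
        return digest

    def assert_belief(self, proposition: str, value: bool) -> bool:
        """Validate a belief before the agent acts on it."""
        if self.beliefs.get(proposition, value) != value:
            # Metacognitive intervention point: the contradiction is caught
            # here, before any downstream action executes.
            self.attest({"event": "contradiction", "proposition": proposition,
                         "ts": time.time()})
            return False
        self.beliefs[proposition] = value
        self.attest({"event": "belief", "proposition": proposition,
                     "value": value, "ts": time.time()})
        return True

store = BeliefStore()
store.assert_belief("invoice_7_paid", True)   # accepted and attested
store.assert_belief("invoice_7_paid", False)  # contradiction: blocked pre-action
```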

Crystal Tubbs — researcher, builder, founder.

Crystal Tubbs

I'm an MSAI candidate at Kennesaw State University and the founder of Metamorphic Curations, an AI transformation consultancy. My research sits at the intersection of LLM behavior, AI fairness, and governance — the parts of the field where the metric doesn't match reality, and where the fix has to be structural.

My coursework projects span research on covert bias transfer, surrogate accountability for agentic systems, and context-dependent bias in LLM resume screening, alongside a product line that puts that research into builders' hands: the reliability pipeline featured here, CRAFT as its operational surface, and CHRYSALIS as its extension to agentic belief governance.

I build agentic systems and RAG pipelines and advise clients on AI transformation. The thread through all of it is the same: use technology to consciously build a more equitable world.

Governance isn't a cost center. It's a performance multiplier when it operates at the right layer.

Three projects. One thesis through-line.

Each of these projects attacks the same underlying problem from a different angle: AI systems fail in ways standard evaluation misses. Bias that evades audit. Accountability that evades policy. Screening behavior that evades scrutiny.

Fairness · Bias

CIPHER

Covert Influence Passed via Hidden Encoding in Representations

Key finding — the metric illusion

Standard fairness metrics (SPD/EOD) score 0.0 despite significant accuracy disparity between groups. Reframed as label-mediated bias propagation invisible to conventional fairness audits.

Governance · Policy

SAF

Surrogate Accountability Framework for Agentic AI

Key finding — the accountability gap

A four-pillar governance model (entitlement governance, continuous observability, lifecycle accountability, and proportional authority) that closes the gap between AI policy and the operational reality of autonomous agents.

LLM Behavior · Hiring

PRISM

Proxy Recognition and Inclusion Scoring Method

Key finding — context rewrites the model

Controlled study across 324 synthetic resumes. Demographic proxy signals in first names shift LLM screening behavior even when candidate qualifications are identical. Co-authored with Destiny Raburnel.