I handed the whole analysis to a model. Not the variants, not the brief. The read. It came back confident, clean, and wrong: it reported a winning variant the data didn’t support. I almost shipped it. Here is exactly where AI quietly lies to you, and the check I run now so it can’t.
First, the honest part. AI is genuine leverage in experimentation. It writes variants faster than I can, it summarizes a messy results table in seconds, and it spots patterns I’d skim past. I run more experiments now than I could with a full team. That’s real, and it’s the reason this is worth getting right.
Where it lies
The failure isn’t in the maths. It’s in the narration. Ask a model “which variant won?” and it will answer, because answering is what it does. It will find a story in noise. It told me a 1.8% lift was a winner on a sample that hadn’t reached significance. The number was real. The conclusion was invented.
So I stopped asking it to decide. I ask it to show its work: the sample size, the confidence interval, the dates, the segments. Then I decide. The model is a fast analyst with no skin in the game, which is exactly why the judgment stays with me.
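To make "show its work" concrete, here is a minimal Python sketch of the arithmetic behind a claim like the 1.8% one. The counts are hypothetical, chosen only so the relative lift comes out at 1.8%; they are not figures from the experiment. The function name lift_check and the 0.05 threshold are assumptions too. It runs a standard two-proportion z-test and builds a normal-approximation confidence interval for the difference in conversion rates.

```python
from math import sqrt
from statistics import NormalDist


def lift_check(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided two-proportion z-test plus a confidence interval
    for the absolute difference in conversion rate (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error: used for the hypothesis test.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error: used for the confidence interval.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)

    return {
        "relative_lift": diff / p_a,
        "p_value": p_value,
        "ci_abs_diff": ci,
        "significant": p_value < alpha,
    }


# Hypothetical counts: 5.00% vs 5.09% conversion, a 1.8% relative lift.
result = lift_check(conv_a=500, n_a=10_000, conv_b=509, n_b=10_000)
print(result)
```

With these made-up counts the interval straddles zero and the p-value is nowhere near 0.05. The lift is a real number without being a real win, which is the gap the narration papers over.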
A flashy demo that doesn’t ship anything is a toy. Judge by hours bought back and work shipped.
The check, start to finish
Here’s the workflow I actually run. It took three rebuilds before it ran faster than doing the analysis by hand, so skip my mistakes: pin the raw numbers first, force the model to state significance before any verdict, and never let it summarize and conclude in the same step. Build it once, and the analysis stops being a place you can get fooled.
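The three rules above can be sketched as a small gate in Python. Everything here is a hypothetical illustration, not the exact pipeline: RawNumbers, Evidence, ready_for_verdict, the field names, and the 0.05 threshold are all assumed. The idea is that the raw counts are frozen, the model's stated evidence lives in its own object, and nothing downstream is allowed to reach a verdict until that evidence is filled in. The gate never picks a winner; it only says whether a human is cleared to.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RawNumbers:
    """Rule 1: pinned inputs. Frozen so later steps cannot edit them."""
    variant: str
    visitors: int
    conversions: int


@dataclass(frozen=True)
class Evidence:
    """Rule 2: what the model must state before any verdict.
    None means the model did not state it."""
    p_value: Optional[float]
    ci_low: Optional[float]
    ci_high: Optional[float]


def ready_for_verdict(
    control: RawNumbers,
    treatment: RawNumbers,
    evidence: Evidence,
    alpha: float = 0.05,
) -> Tuple[bool, str]:
    """Rule 3: a separate step from the summary. Returns whether a
    human may decide, and why. It does not decide for them."""
    if control.visitors <= 0 or treatment.visitors <= 0:
        return False, "raw numbers missing"
    stated = (evidence.p_value, evidence.ci_low, evidence.ci_high)
    if any(value is None for value in stated):
        return False, "significance not stated"
    if evidence.p_value >= alpha:
        return False, "not significant"
    return True, "evidence stated; human decides"


# Hypothetical counts, for illustration only.
control = RawNumbers("A", visitors=10_000, conversions=500)
treatment = RawNumbers("B", visitors=10_000, conversions=509)

print(ready_for_verdict(control, treatment, Evidence(None, None, None)))
print(ready_for_verdict(control, treatment, Evidence(0.77, -0.005, 0.007)))
```

Keeping Evidence separate from any summary text is the design choice that matters: a confident paragraph with no p-value in it fails the gate by construction.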
That’s the whole method. Want it installed in your team? Or read the next one on running the content engine solo.