Pillar · Managing AI Projects · 14 min read

AI projects fail in new ways — the old PM playbook has known gaps

Running ML, LLM, and AI-product projects: model risk, evaluation, deployment, ethics, infrastructure cost. The new failure modes the traditional playbook does not cover.

Vizually Team · Managing AI Projects

AI projects fail in new ways. The old playbook has known gaps; here is what fills them.

Managing an AI project is mostly project management. The charter, the schedule, the stakeholder work, the risk register — all the same disciplines as a traditional software project. But there is a residue of new failure modes specific to AI work that the traditional playbook does not cover, and the projects that fail tend to fail along those new dimensions.

Most of the AI-project content on the open web is either traditional PM with AI buzzwords (which adds nothing) or AI-as-magic (which subtracts trust). The honest position is in the middle: traditional PM still runs the project, plus a known set of new disciplines that are specific to ML and LLM work and need to be added to the playbook explicitly.

This piece is the long-form anchor for the Managing AI Projects pillar. It walks the four project shapes that AI work comes in (each with different risk profiles), the failure modes specific to AI projects, the evaluation discipline that is unique to AI work, and the cost-and-ethics conversations that need to happen at initiation rather than at launch.

§1 — The four shapes AI projects come in

Not all AI projects are the same shape, and the shape determines the risk profile. Four shapes account for almost all the work in the wild.

Predictive ML. Build a model that predicts something — churn, fraud, demand, conversion. Trained on historical data; deployed as a service or as a feature inside a product. Risk profile: well-understood. The failure modes are largely known (data drift, label noise, distribution shift) and the discipline is mature. Traditional PM works well with one addition: the evaluation discipline (see §3).

Foundation-model integration. Build a product feature that calls a large language model (or a vision model, or a multi-modal model) — usually via API. The model is not yours; the integration is. Risk profile: dominated by prompt brittleness, cost variance, and vendor risk. The failure modes are recent and the discipline is evolving fast.

Custom-trained foundation model. Train or fine-tune a foundation model for your use case. Rare; expensive; requires both ML and infra teams. Risk profile: dominated by data quality, training cost overruns, and capability mismatches. Most orgs that attempt this should not — fine-tuning a smaller model or prompting a larger one is cheaper and usually adequate.

ML-platform / AI-infrastructure. Build the platform that other teams use to ship AI products. Feature stores, model registries, evaluation harnesses, deployment pipelines. Risk profile: dominated by adoption risk — the platform team can ship something correct that nobody uses. Treat as a developer-platform project (which is what it is) rather than as an AI project per se.

Shape | Risk profile | What is new vs traditional software | Most common failure
Predictive ML | Mature, well-understood | Evaluation discipline, data drift | Skipped evaluation plan at initiation
Foundation-model integration | Recent, evolving fast | Prompt brittleness, cost variance, vendor risk | Unbounded LLM cost surprise at launch
Custom-trained foundation model | Expensive, ML+infra coupled | Data quality, training cost, capability mismatch | Underestimated training infra cost
ML platform / infra | Developer-platform, adoption-driven | Adoption mechanics specific to AI teams | Built without naming the user team

§2 — Failure modes specific to AI projects

Four failure modes show up across AI projects with enough regularity that we now ask about them in every initiation review.

The undefined evaluation. What does it mean for the model (or the AI feature) to be working? Most AI projects start without a written, agreed-on answer. The team builds something, the team thinks it works, the stakeholders disagree, the project ships nothing or ships theater. The fix is structural: define evaluation at initiation, not at the end. "The model passes if it scores >X on benchmark Y, evaluated on dataset Z, signed off by person W" — that level of specificity. Without it, "working" is a verbal moving target.
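
A minimal sketch of what that written definition can look like, captured as a record at initiation (every name, dataset, and number below is a hypothetical placeholder):

```python
# A written, agreed evaluation definition, captured at project initiation.
# All values are placeholders to be filled in by the actual project.
evaluation_definition = {
    "benchmark": "support-reply-quality-v1",   # benchmark Y
    "dataset": "2024-q4-sampled-tickets",      # dataset Z
    "metric": "human rating, 1-5 scale",
    "pass_threshold": 0.85,                    # score > X to ship
    "sign_off": "Head of Support Operations",  # person W
}
```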

The unbounded cost surface. Traditional software has predictable per-request costs; AI features often have wildly variable per-request costs (a user typing a long prompt costs 10× what a short one costs). The launch happens, real traffic arrives, and the cost line item is 5× what was budgeted. Fix: model the cost variance at design time. If a single user can trigger $0.50 of inference, the product needs to either price that cost in, cap usage, or run on a cheaper model.
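
To see why the variance matters, a rough token-based cost estimate is usually enough. The per-token prices below are hypothetical placeholders, not any vendor's actual rate card:

```python
# Hypothetical per-token prices; substitute your vendor's current rate card.
PRICE_PER_1K_INPUT_TOKENS = 0.003   # USD
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # USD

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the inference cost of a single request in USD."""
    return ((input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS
            + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS)

# A short question vs. a long pasted document: same feature, ~10x cost spread.
print(round(request_cost(input_tokens=300, output_tokens=200), 4))    # ~0.0039
print(round(request_cost(input_tokens=8000, output_tokens=1500), 4))  # ~0.0465
```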

The ethics conversation deferred to launch. The product team designs the feature; the data team ships the model; the ethics review happens at launch readiness; the ethics review surfaces issues that should have been caught at design. The launch slips by 6-12 weeks. Fix: ethics is part of the charter, not part of the launch checklist. The conversation about "who could this hurt, and how?" happens in week one.

The vendor-lock-in surprise. The product launches against Vendor A's API. Vendor A raises prices 4× three months later. The product team discovers the application's behavior is not portable — prompts that worked on Vendor A do not work the same way on Vendor B. Fix: at design time, build the abstraction layer that lets the underlying model be swapped. Test against at least two providers before committing the API contract.
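
One way to build that seam is a narrow model interface the application owns, with one adapter per provider. A minimal sketch, assuming a text-completion use case (class and method names are illustrative, not any vendor's SDK):

```python
from typing import Protocol


class TextModel(Protocol):
    """The narrow contract the application depends on, owned by the team."""
    def complete(self, prompt: str, max_tokens: int) -> str: ...


class VendorAModel:
    def complete(self, prompt: str, max_tokens: int) -> str:
        # Call Vendor A's API here and translate its request/response shapes.
        raise NotImplementedError


class VendorBModel:
    def complete(self, prompt: str, max_tokens: int) -> str:
        # Same contract, different provider; application code does not change.
        raise NotImplementedError


def summarize(model: TextModel, document: str) -> str:
    # Application code depends only on TextModel, so the provider can be
    # swapped and re-run against the same benchmark suite before committing.
    return model.complete(f"Summarize:\n{document}", max_tokens=256)
```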

§3 — The evaluation discipline, expanded

Evaluation is the discipline that distinguishes AI projects from traditional software projects most clearly. A traditional software feature either passes its tests or does not. An AI feature has a quality distribution, and the question is what fraction of that distribution is acceptable, on what tasks, with what failure modes.

A serious AI project has four evaluation artifacts:

  • The benchmark suite. A fixed set of test cases the model is evaluated against on every change. Hand-curated, diverse, includes the failure modes you have observed in the wild. Evaluated by a metric (accuracy, F1, BLEU, human rating, etc.) that is named and consistent.
  • The eval set vs the production distribution. Document the gap between what the eval set tests and what production traffic will look like. The gap is information; pretending it does not exist is the mistake. Production may be 50× more diverse than the eval set; the eval may pass while the model fails in production.
  • The acceptance threshold. The score below which the model does not ship. Written down in advance, agreed by the named decision-maker. "Above 80% on benchmark X, with no failures in the failure-mode subset" — that level of specificity.
  • The post-launch eval cadence. A set of in-production evaluations that run continuously (canary tasks, sampled real traffic, drift detection on input distributions). Without this, model degradation is invisible until it produces a customer-visible failure.

Most AI projects have one of these four — usually the benchmark suite — and operate as if that is the entire eval discipline. It is not. The other three are where the discipline lives.
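
As a sketch of how the acceptance threshold and the failure-mode subset become an executable ship gate rather than a slide (the dataclass, field names, and default threshold are illustrative assumptions):

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    case_id: str
    subset: str      # e.g. "general" or "failure-mode"
    passed: bool


def meets_acceptance(results: list[BenchmarkResult],
                     min_pass_rate: float = 0.80) -> bool:
    """Ship gate: overall pass rate at or above the agreed threshold,
    and no failures at all in the curated failure-mode subset."""
    overall = sum(r.passed for r in results) / len(results)
    failure_mode_clean = all(r.passed for r in results
                             if r.subset == "failure-mode")
    return overall >= min_pass_rate and failure_mode_clean
```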

§4 — Cost and ethics at initiation

Two conversations need to happen at AI project initiation that do not happen on most traditional projects: the cost-bound conversation and the ethics conversation. Deferring either is the most common reason AI projects miss their original schedule.

The cost-bound conversation. What is the worst-case cost per request, and what is the worst-case aggregate cost at expected traffic? If the answer is "we do not know," the project is not ready to start. The conversation forces the design to address cost as a first-class constraint — either by capping per-request cost, capping per-user usage, picking a cheaper model, or pricing the feature to recover the cost. Without the conversation, the cost shows up at launch and someone has to make a decision under time pressure.
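
A back-of-the-envelope version of that conversation is often all it takes to surface the decision. All traffic and pricing numbers below are hypothetical:

```python
# All inputs are hypothetical; replace with your own traffic and cost estimates.
worst_case_cost_per_request = 0.50   # USD, the largest request the UI allows
typical_cost_per_request = 0.02      # USD
expected_daily_requests = 20_000
heavy_usage_fraction = 0.05          # share of requests near the worst case

expected_daily_cost = expected_daily_requests * (
    heavy_usage_fraction * worst_case_cost_per_request
    + (1 - heavy_usage_fraction) * typical_cost_per_request
)
worst_case_daily_cost = expected_daily_requests * worst_case_cost_per_request

print(f"Expected daily inference cost:   ${expected_daily_cost:,.0f}")    # ~$880
print(f"Worst-case daily inference cost: ${worst_case_daily_cost:,.0f}")  # $10,000
```

The gap between those two numbers is the decision the conversation forces: cap usage, price the feature to recover the cost, or pick a cheaper model.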

The ethics conversation. Who could this hurt, and how? Specific harms: biased predictions affecting protected classes, automation displacing labor without warning, privacy leakage from training data, generated content that misleads users about its provenance. The conversation produces a list of harms; for each, either a mitigation (we will do X to reduce this harm) or an explicit acceptance (we have considered this harm and decided to ship anyway, for these reasons). The output is documented, signed by the project sponsor, and reviewed at launch readiness alongside the technical readiness.

Neither conversation is hard. Both get skipped because they are uncomfortable. The teams that run them at initiation skip the launch-readiness fire drill that would otherwise consume a quarter.


§5 — Where AI projects break against traditional PM cadences

The traditional PM cadences (sprint, biweekly status, monthly steering) work for AI projects with one adjustment: AI projects have evaluation cycles that may not align to sprint cadences. A model retrain that takes four days does not fit cleanly in a two-week sprint. A benchmark run that takes a day and produces a metric distribution does not produce "done" in the sprint in a satisfying way.

Two cadence adjustments help:

  • Decouple evaluation cycles from sprint cycles. The evaluation pipeline runs on its own schedule (daily, weekly, or per-experiment) and reports results into the project's sprint review. The sprint cadence still works; the evaluation cadence runs in parallel.
  • Treat experiments as work items. A "run experiment X with hypothesis Y" task is a work item with an estimate (often one to three days) and a deliverable (the metric report). The team commits to the experiment, not to the result — because the result is not in the team's control. The discipline is the experimental discipline (one variable changed at a time, written hypothesis, written result), not the prediction of the result. A sketch of what such a work item can record follows below.

The orgs that are good at AI project management treat experiments as first-class deliverables. The orgs that struggle treat experiments as side work that does not show up on the sprint board, which produces the "we did a lot but cannot point to what shipped" failure mode.
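
A minimal sketch of what an experiment-as-work-item record can look like on the sprint board (the field names are illustrative, not a prescribed template):

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    """One experiment, committed to the sprint as a work item.
    The team commits to running it, not to a particular outcome."""
    hypothesis: str          # written before the run
    variable_changed: str    # one variable at a time
    estimate_days: int
    result: str = ""         # written after the run, whatever the outcome
    metrics: dict = field(default_factory=dict)


exp = Experiment(
    hypothesis="Adding retrieved context improves benchmark accuracy by 3+ points",
    variable_changed="retrieval on vs. off",
    estimate_days=2,
)
```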

Where AI project work intersects the rest of project management

§6 — How to use this pillar

The rest of the Managing AI Projects pillar walks the evaluation discipline in detail, the cost-modeling conversation, the ethics-at-initiation framework, and the cadence adjustments that AI projects need on top of the standard sprint discipline. If you are starting an AI project, read the evaluation piece first. If you are inheriting one in launch readiness, read the ethics piece first.

The meta-rule: traditional PM still runs AI projects. The new disciplines are additions to the playbook, not replacements for it. The teams that internalize the additions ship; the teams that treat AI as a special case where normal discipline does not apply tend to underperform on both the traditional disciplines and the new ones.


Related reading

  • Article: How AI is reshaping project management
  • Guide: The project charter, in plain language
  • Guide: Scope creep — preventing it before it starts
  • Article: Risk registers as a tool, not a tax