Artifact-Shaped Artifacts

Your LinkedIn feed makes your eyes bleed with the millionth instance of "this isn't accidental. it's by design" applied to the same ten trite ideas. Your employee produced a fully functional app that looks finished but is just slightly off everywhere, and you need to reverse engineer their thinking -- or lack thereof -- to understand what went wrong. You read a candidate's cover letter that somehow reads exactly like 20 others.

These are all artifact-shaped artifacts. They look like what they intend to be -- enough to fool the creator into thinking they were done, and maybe good enough to fool you into thinking the effort behind them was larger than it was. AI slop is not AI being bad at making things. It's AI being too good at making things that suggest much higher effort than they take.

A three-sentence stream of consciousness and a deeply researched article are trivially distinguishable when a human writes both. Their AI-assisted equivalents are not. Both come out article-shaped. A weekend side project and a production application are easy to differentiate when built by hand. With AI, both come out app-shaped. In my observation, this is one of the biggest sources of lost productivity with AI tools -- it feels like you're faster, but the time you saved generating is quietly eaten by time spent figuring out if the output is actually any good. Worse, that checking can be skipped and offloaded to the next person down the chain.

The bottleneck in intellectual work used to be turning thinking into finished things. Now it's determining whether those finished-looking things are fit for purpose.

The Evaluation Gap

For most of the history of knowledge work, the bottleneck was production -- more ideas than you could build, more thinking than you could turn into finished things. AI relieved that bottleneck.

An engineering manager on r/ExperiencedDevs captured the shift: "PRs are clean, tests pass, code compiles. On paper they look like they leveled up overnight. But when I ask them questions during review, I can tell they don't fully understand what they wrote. I feel like I'm evaluating theater now. The artifacts look senior but the understanding is still junior."

Evaluating theater. That's the problem. Evaluation itself is a ladder, and the rungs that matter most are the least accessible:

  • Surface quality -- fluency, grammar, formatting. Anyone can judge this. AI is excellent at it.
  • Coherence -- logical flow, structure, staying on topic. A trained reader can manage.
  • Factual accuracy -- are the specific claims true? Requires domain knowledge.
  • Reasoning validity -- are the logical steps sound? Requires deep expertise.
  • Completeness -- what should be there but isn't? Requires knowing a domain well enough to notice omissions.
  • Calibration -- does the output hedge where it should and assert where it can? Requires meta-expertise rare even among domain experts.

AI output shines on the first two rungs, and sometimes the third. Article-shaped. Code-shaped. Research-shaped. The quality that actually matters -- reasoning, completeness, calibration -- lives on the upper rungs, where casual evaluation can't reach.

"Style Over Substance" found that evaluators preferred polished-but-factually-wrong answers over correct-but-rough ones. Shortness was penalized most and grammatical errors next, while factual errors barely registered. They couldn't assess what they didn't know. GPQA quantified the gap: PhD experts scored 65% on hard questions in their field. Skilled non-experts with full internet access and thirty-plus minutes scored 34%, not far above the 25% that random guessing gets on four-option multiple choice.

Bainbridge's Ironies of Automation adds the kicker: the more reliable the automation, the less prepared the human supervisor is to handle its failures. Errors become rarer and therefore harder to catch. The automation earns trust precisely as it becomes harder to verify.

The gap between what's produced and what can be competently judged widens as AI improves. And the dominant interface makes it worse.

No Man's Land

Wattenberger describes an autonomy spectrum for AI tools. At one end, full machine. At the other, full human. In between sits a danger zone she calls "no man's land": the majority of the work is offloaded to the machine, but the human is still nominally responsible and has lost meaningful control over the process. The tool did the work. The human must judge the final piece. The tool didn't expose enough handholds to influence and co-create the thinking process.

Appleton's critique of the generic chat box puts it sharply: "There are no knobs or door handles on this thing." The blank input offloads all the cognitive labor to the user: figure out what to ask, how to ask it, how to evaluate the response, and what to do next. Her prescription: "tiny, sharp, specific tools" composed into observable workflows.

The deeper issue is that any creation process has stages and information hierarchy. Writing involves research, ideation, drafting, revision, and editing. Software development moves through requirements, design, implementation, review, and testing. The chatbot collapses all of these into one conversation that emits "complete" deliverables at every turn. No man's land is where you end up when the stages are invisible. The best workflows right now are self-made -- Garry Tan's gstack is a good example -- and what sets them apart is that they incorporate staged, hierarchical creation cleanly and ergonomically.

Scaffolding that delivers good process integration is entirely doable. Mollick's research with BCG consultants identified two effective patterns -- Centaurs who strategically divide labor, and Cyborgs who interweave their work with the AI's at every step. Both produce measurably better results. Both require the user to invent the process themselves.

And when that process isn't there, the cost is measurable. Consultants who used AI on tasks outside its reliable frontier got the answer right only 60-70% of the time, compared to 84% without AI. Dell'Acqua: "When the AI is very good, humans have no reason to work hard and pay attention. They let the AI take over, instead of using it as a tool."

The chatbot as premature convergence machine. The design opportunity is wide open.

Designing for the Process

If the problem is that chatbots collapse stages, the solution is tools that surface them. Two principles I find exciting.

Show the sources, not just the summary

The most common AI research experience: you ask a question and receive a paragraph. Whether it's accurate, complete, or drawing from credible sources is opaque.

More importantly, there's a reasoning process behind those sources that the paragraph erases. In a manual research process you iterate through findings and their implications, and you push back on them. You drop some sources and raise or lower the weight of others in your conclusion. There's no way in hell you can do that with ChatGPT deep research. But ask it about something you don't know and it will blow your brain to bits with its explosion of prose.

Elicit does something different. It returns structured tables -- papers organized by columns you define, each cell linked to the specific sentence in the source PDF. The workflow is decomposed: search, screen papers with rationale for each decision, extract data, synthesize. Each step has its own interface. Each can be inspected and edited.

One comparison between Elicit and ChatGPT for systematic reviews put it well: "Elicit's report is much more transparent and clear in its steps. The ChatGPT output is sophisticated but does not show its working." There's enough opacity in deep neural networks without making the harnesses equally opaque.

When reasoning is visible, you can evaluate it. When it's hidden behind a polished paragraph, you're back to judging surfaces and reverse-engineering conclusions to get to what matters. Answers should be entry points into a source graph, not endpoints.
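To make "entry points into a source graph" a little more concrete, here is a minimal sketch in Python. It is not how Elicit or any other product works. The names SourceQuote, Claim, Answer, unsupported and without_source are all hypothetical, made up for illustration. The idea is only that each claim carries links to the exact sentences behind it, so a reader can ask which claims have no support, or what happens when a source is dropped.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceQuote:
    """A specific sentence in a source document (hypothetical structure)."""
    document: str  # e.g. a paper title or file name
    sentence: str  # the exact sentence the claim rests on


@dataclass
class Claim:
    """One statement in an answer, with the quotes that support it."""
    text: str
    support: list[SourceQuote] = field(default_factory=list)


@dataclass
class Answer:
    claims: list[Claim]

    def unsupported(self) -> list[Claim]:
        """Claims with no linked source: the first places to look."""
        return [c for c in self.claims if not c.support]

    def without_source(self, document: str) -> "Answer":
        """Drop one source and return the answer that remains."""
        return Answer([
            Claim(c.text, [q for q in c.support if q.document != document])
            for c in self.claims
        ])


# Placeholder data, not real papers or findings.
answer = Answer([
    Claim("Claim A", [SourceQuote("paper-1", "Sentence backing A.")]),
    Claim("Claim B", [SourceQuote("paper-1", "Sentence backing B."),
                      SourceQuote("paper-2", "Another sentence backing B.")]),
    Claim("Claim C"),
])

print([c.text for c in answer.unsupported()])
# ['Claim C']
print([c.text for c in answer.without_source("paper-1").unsupported()])
# ['Claim A', 'Claim C']
```

Dropping paper-1 leaves Claim B standing on paper-2 but strips Claim A of all support. That is the kind of question a polished paragraph gives you no way to ask.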

Decompose into stages with human checkpoints

"Write me a content piece" could mean many things. A five-minute slop post or a multi-step process -- research, structure, drafting, revision -- that produces something you're genuinely proud of. Different requests, different workflows, different levels of human involvement. The chatbot treats them identically: one input box, one output.

Linear shows what process-aware AI looks like. Issues move through named workflow stages -- Triage, In Progress, Review, Done -- and AI agents operate within them, picking up tasks, writing code, and submitting changes for review without bypassing human gates. The process has structure the user can see and shape.

Elicit's systematic review workflow does the same for research: define inclusion criteria, review the AI's screening decisions, then data extraction, then the report. Each stage has its own interface and its own human checkpoint. The stages mirror how a researcher actually works -- the AI handles volume, the human handles judgment.
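As a rough sketch of what "stages with human checkpoints" could look like in code, here is a toy pipeline in Python. It does not describe Linear's or Elicit's implementation. Stage, run_pipeline and the stand-in lambdas are hypothetical, and the stage names are borrowed from the content-piece example above. Each stage produces a draft, and nothing moves forward until a reviewer function, standing in for the human, approves it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    run: Callable[[str], str]           # the machine's work for this stage
    review: Callable[[str, str], bool]  # human gate: (stage name, draft) -> approve?


def run_pipeline(task: str, stages: list[Stage]) -> tuple[str, list[str]]:
    """Run stages in order, stopping at the first draft the reviewer rejects.

    Returns the last approved artifact and the names of the approved stages.
    """
    artifact, approved = task, []
    for stage in stages:
        draft = stage.run(artifact)
        if not stage.review(stage.name, draft):
            break  # the human takes over here; no later stage runs unreviewed
        artifact = draft
        approved.append(stage.name)
    return artifact, approved


def reviewer(stage_name: str, draft: str) -> bool:
    """Stand-in for a person: approves everything except the drafting stage."""
    return stage_name != "drafting"


stages = [
    Stage("research", lambda t: t + " -> notes", reviewer),
    Stage("structure", lambda t: t + " -> outline", reviewer),
    Stage("drafting", lambda t: t + " -> draft", reviewer),
    Stage("revision", lambda t: t + " -> final", reviewer),
]

result, approved = run_pipeline("topic", stages)
print(result)    # topic -> notes -> outline
print(approved)  # ['research', 'structure']
```

The rejected draft never becomes the input to revision. The pipeline hands back the last artifact a human actually signed off on, which is the opposite of a chat box emitting a "complete" deliverable at every turn.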

Vygotsky's Zone of Proximal Development applies directly. Support should match the user's capability and withdraw as competence grows. Making the automation level explicit and user-chosen prevents the skill atrophy Bainbridge warned about. You can't lose the ability to intervene if the tool never hid the intervention points.


Shape -- rungs one and two of the evaluation ladder -- was a workable proxy for quality as long as the entire production process was costly. AI made shape trivial. What remains hard is everything shape doesn't show: whether the reasoning holds, the evidence checks out, the right things were considered and the wrong things excluded.

Tools that ergonomically expose process and information hierarchy are crucial, and I want to look deeper into how to make them.

A key skill I'm finding: discerning the content behind the shape. Does a polished app from three sentences blow your mind, or do you look for where more input is needed?

-- A