design-to-code — Testing Methodology

Version 1.0 | March 2026


Overview

What is design-to-code.fyi?

An open-source benchmark that measures how Claude Code translates Figma designs into front-end code via the Figma MCP server.

Why we exist

AI design-to-code tools are routinely evaluated through demos and "vibes", making it difficult to know what they actually do well and where they consistently fall short. design-to-code.fyi exists to cut through the noise and provide a reliable source of truth by measuring outputs systematically and publishing all data and findings openly.

Through this approach we aim to provide:

- a factual reference point for comparing tools and prompting strategies
- documented evidence of where AI tools fail, so that designers can account for these gaps in their workflow
- a foundation for a more honest dialogue about what AI can and cannot reliably do in a design-to-code workflow

What we test

The beta version tests individual M3 components rather than full products or systems. This bottom-up approach reveals AI failure modes in a more granular and systematic way — it's easier to isolate where things go wrong at the component level before scaling to more complex layouts.

We use the Material Design 3 Figma Kit as the reference design system. M3 is publicly available and follows all the file conditions Figma's MCP documentation identifies as optimal for code generation — Figma Variables for design tokens, Auto Layout throughout, and semantic layer naming. This makes M3 a best-case baseline: it tells us what Claude Code produces when given the best possible Figma input.

How we test

Each component is tested at three prompt levels (see Prompts) to evaluate how prompt specificity affects output quality. Claude Code runs in an empty environment with no CLAUDE.md files, no pre-configured skills, and only the Figma desktop MCP server connected, replicating the conditions a designer would encounter on a fresh project. Output is automatically measured across four dimensions — visual fidelity, layout responsiveness, code maintainability, and asset handling — using a reproducible script pipeline (see Evaluation). Every result is published as an immutable row in a public CSV, so findings can be verified, challenged, and built on.

Scope

The beta tests one model (Claude Code), one design system (M3), and one output framework (React + Tailwind). It does not cover full products, interactive states, accessibility, or real-world Figma files with imperfect hygiene. For specific measurement limitations see Known Limitations.


What Gets Tested

design-to-code v1.0 beta tests components from design systems that follow Figma's documented best practices for MCP-based code generation — specifically, systems that use Figma Variables for design tokens, Auto Layout, semantic layer naming, and live component instances. The beta version uses the Material Design 3 Figma Kit as the reference design system.

Components are organized into three types in order of complexity:

Type | Description | Example
Element | A single self-contained component | Button, checkbox, input field
Pattern | A functional grouping of elements | Form, card, navigation bar
Layout | A full screen or page-level composition | Homepage, dashboard, app screen

The v1.0 beta primarily focuses on Elements and Patterns. Layouts are included in scope but are tested as single fixed-canvas compositions, not as responsive components that adapt across multiple viewport sizes. Multi-viewport Layout testing requires a different measurement approach and is planned for a future version.

Each component is tested three times — once per prompt level — using the same test variant across all three runs. The test variant is the property combination most prominently featured in the kit, recorded explicitly in the component_description field of each result row. Variant testing, interactive states, dark/light mode, and accessibility are out of scope for v1.0.


Prompts

Every component is tested at three prompt levels. Each level represents a real prompting approach a designer might use and is grounded in official Figma documentation.

P1 — Modified Dev Mode default

The minimal prompt Figma surfaces in Dev Mode. Output is specified as React + Tailwind CSS to create a consistent baseline across all three levels — without this, Claude Code defaults to HTML/CSS output in an empty directory.

Implement this design from Figma as a React component with Tailwind CSS.

[FIGMA_LINK]

Source: Figma Dev Mode UI


P2 — Explicit tool sequencing

Adds explicit MCP tool calls based on Figma's guidance that the model does not always automatically select the right tool — in particular, get_variable_defs must be requested explicitly to return design token references rather than raw values.

Implement this Figma design as a React component with Tailwind CSS.

Before writing any code:
1. Run get_design_context on the node
2. Run get_variable_defs to extract design tokens
3. Run get_screenshot for visual reference
4. If the response is too large, run get_metadata first, then fetch child nodes individually

Then implement. Match the visual design exactly. Use semantic HTML.
Do not invent values not present in the design.

[FIGMA_LINK]

Source: figma/mcp-server-guide — "Trigger specific tools when needed"


P3 — Figma Implement Design skill

A prompt-level adaptation of Figma's official recommended workflow for design-to-code with the MCP — combining the required tool sequence from the implement-design skill with Figma's Claude Code asset handling rules. Project-specific instructions are omitted since tests run in an empty environment.

Implement this Figma design as a React component with Tailwind CSS.

Required workflow — do not skip steps:
1. Run get_design_context on the node
2. If the response is too large, run get_metadata first, then re-fetch specific nodes with get_design_context
3. Run get_screenshot — this is your visual source of truth throughout implementation
4. Download any assets returned by the MCP server before writing code

Asset rules:
- If the MCP server returns a localhost URL for an image or SVG, use it directly — do not substitute
- Do not import icon packages — all assets must come from the Figma MCP payload
- Do not create placeholders if a localhost asset is provided

Implementation:
- Match the design exactly — use design tokens from Figma where available
- Use semantic HTML elements
- Before finishing, validate your output against the get_screenshot

[FIGMA_LINK]

Sources: figma/mcp-server-guide — skills/figma-implement-design/SKILL.md (workflow steps) · figma/mcp-server-guide — Claude Code asset rules (asset handling)


Evaluation

Outputs are evaluated across four dimensions. No composite score is assigned — each dimension is published independently.

Dimension | What it measures
Visual Fidelity | Does the rendered output visually match the design?
Layout Responsiveness | Does the code use a flexible layout system or absolute positioning?
Code Maintainability | Is the code structured for clarity, reuse, and long-term evolution?
Asset Handling | Does the component correctly load assets from the Figma MCP?

Visual Fidelity, Layout Responsiveness, and Code Maintainability follow the metric framework established in FIGMA2CODE: Automating Multimodal Design-to-Code in the Wild (Gui et al., ICLR 2026). v1.0 implements a subset of their metrics; FU (Flex/Grid Utilization) and BC (Breakpoint Coverage) under Layout Responsiveness, together with CCR (Custom Class Reuse) under Code Maintainability, are deferred to v1.1. VES is upgraded from DINOv2 to DINOv3. Asset Handling is original to design-to-code.

3.1 Visual Fidelity

Visual Embedding Similarity (VES) — cosine similarity between DINOv3 embeddings of the rendered screenshot and the Figma baseline export. Uses DINOv3 ViT-B/16, loaded locally. Both images are resized to fit within 512px, padded to the nearest 16px patch boundary with neutral grey (RGB 128,128,128), and ImageNet normalized before comparison.

VES = cosine_similarity(DINOv3(screenshot), DINOv3(baseline))

Range: 0 to 1 in practice. A higher score indicates greater visual similarity between the rendered output and the original design. Primary visual metric.

Mean Absolute Error (MAE) — average pixel-level difference between screenshot and baseline. A lower score indicates fewer pixel-level differences. Sensitive to global color shifts unrelated to structural quality. Published as a supplementary field only — not included in any score.

Visual Fidelity Score = VES
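
The two Visual Fidelity metrics can be sketched in plain Python. This is an illustrative outline, not the published measurement script: it assumes the DINOv3 embeddings are already available as plain float vectors and that images are flattened per-channel pixel lists. Model loading and image I/O are omitted, and these function names are ours.

```python
# Sketch only: embeddings and pixels are assumed pre-extracted.
import math

def fit_and_pad_size(width, height, max_side=512, patch=16):
    # Resize to fit within max_side, then round each side up to the
    # nearest multiple of the patch size (the padding itself is filled
    # with neutral grey RGB 128,128,128 in the pipeline).
    scale = min(max_side / width, max_side / height, 1.0)
    resized_w, resized_h = round(width * scale), round(height * scale)
    ceil_to_patch = lambda side: -(-side // patch) * patch
    return ceil_to_patch(resized_w), ceil_to_patch(resized_h)

def ves(emb_a, emb_b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(x * x for x in emb_b))
    return dot / (norm_a * norm_b)

def mae(pixels_a, pixels_b):
    # Mean absolute pixel difference, normalized to the 0-1 range.
    return sum(abs(x - y) for x, y in zip(pixels_a, pixels_b)) / (255 * len(pixels_a))
```

For example, the 1440×600 test canvas resizes to 512×213 and pads to 512×224 before embedding.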

3.2 Layout Responsiveness

Measures whether the code uses a flexible layout system or hardcoded coordinates. In v1.0, components are tested on a fixed canvas — APR and RUR are structural signals that predict how the component will behave at other viewport sizes rather than measuring observed responsiveness. BC (Breakpoint Coverage) and FU (Flex/Grid Utilization) are deferred to v1.1.

Absolute Positioning Ratio (APR) — proportion of positioned elements using position: absolute or position: fixed. A lower score indicates less reliance on absolute positioning.

APR = absolute/fixed elements / total positioned elements

Relative Unit Ratio (RUR) — proportion of layout Tailwind classes using relative units (no arbitrary value syntax, no px). A higher score indicates greater use of relative units. Approximated via Tailwind class inspection — see Known Limitations.

RUR = relative unit layout classes / total layout classes
Responsiveness Score = ((1 − APR) + RUR) / 2
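
As a rough sketch of how APR, RUR, and the dimension composite combine: the inputs and the px-detection heuristic below are illustrative assumptions, since the published scripts inspect the rendered DOM and full Tailwind class lists.

```python
# Sketch only: function names and heuristics are ours.
import re

PX_ARBITRARY = re.compile(r"\[[^\]]*px[^\]]*\]")

def apr(positions):
    # positions: computed CSS `position` per element; only positioned
    # (non-static) elements enter the ratio.
    positioned = [p for p in positions if p != "static"]
    if not positioned:
        return 0.0
    return sum(p in ("absolute", "fixed") for p in positioned) / len(positioned)

def rur(layout_classes):
    # A layout class counts as relative when it carries no arbitrary
    # px value: w-full and flex-1 pass, w-[123px] does not.
    if not layout_classes:
        return 1.0
    return sum(not PX_ARBITRARY.search(c) for c in layout_classes) / len(layout_classes)

def responsiveness_score(positions, layout_classes):
    return ((1 - apr(positions)) + rur(layout_classes)) / 2
```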

3.3 Code Maintainability

Measures structural code quality. CCR (Custom Class Reuse) is deferred to v1.1.

Semantic Tag Ratio (STR) — proportion of semantic HTML elements (header, nav, main, article, section, aside, footer, button, input, textarea, select, label, form, ul, ol, li, h1–h6, p) among all DOM elements. A higher score indicates greater use of semantic HTML, reflecting better structural clarity.

STR = semantic elements / total DOM elements

Inline Style Ratio (ISR) — proportion of elements with inline style attributes. A lower score indicates less direct value mapping from Figma metadata and greater use of a proper styling system.

ISR = elements with inline styles / total DOM elements

Arbitrary Value Usage (AVU) — proportion of Tailwind class tokens using arbitrary value syntax (e.g. w-[123px], bg-[#6750A4]). A lower score indicates less reliance on hardcoded values and greater alignment with Tailwind's token system.

AVU = arbitrary Tailwind class tokens / total Tailwind class tokens
Maintainability Score = (STR + (1 − ISR) + (1 − AVU)) / 3
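
A minimal sketch of the three maintainability ratios over a rendered DOM string, using Python's stdlib HTML parser. The semantic tag list follows the definition above; the arbitrary-value regex is our own approximation, not the published implementation.

```python
# Sketch only: class and function names are ours.
import re
from html.parser import HTMLParser

SEMANTIC = {"header", "nav", "main", "article", "section", "aside", "footer",
            "button", "input", "textarea", "select", "label", "form",
            "ul", "ol", "li", "h1", "h2", "h3", "h4", "h5", "h6", "p"}
ARBITRARY = re.compile(r"^\S+\[[^\]]+\]$")  # e.g. w-[123px], bg-[#6750A4]

class DomStats(HTMLParser):
    def __init__(self):
        super().__init__()
        self.total = self.semantic = self.inline = 0
        self.tw_tokens = self.tw_arbitrary = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.total += 1
        if tag in SEMANTIC:
            self.semantic += 1
        if attrs.get("style"):
            self.inline += 1
        for token in (attrs.get("class") or "").split():
            self.tw_tokens += 1
            if ARBITRARY.match(token):
                self.tw_arbitrary += 1

def maintainability(dom_html):
    stats = DomStats()
    stats.feed(dom_html)
    str_ratio = stats.semantic / stats.total
    isr = stats.inline / stats.total
    avu = stats.tw_arbitrary / stats.tw_tokens if stats.tw_tokens else 0.0
    return (str_ratio + (1 - isr) + (1 - avu)) / 3
```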

3.4 Asset Handling

Measures whether the component uses assets provided by the Figma MCP server rather than substituting external packages or inlining hardcoded SVG paths.

Three asset presence patterns are detected in the rendered DOM:

Pattern | Example | Presence | Source
<img src="..."> | <img src="http://localhost:3845/..."> | ✓ if src non-empty | ✓ if src is localhost
Inline <svg> with path data | <svg><path d="M10 2.5..."/></svg> | ✓ always | ✗ always (hardcoded)
<svg><image href="..."> | <svg><image href="http://localhost:3845/..."/></svg> | ✓ if href non-empty | ✓ if href is localhost

A fourth source pattern is detected from the component source code:

Pattern | Detection | Source
Local asset import | Component imports a file (e.g. import icon from '../assets/star-icon.svg') that exists in the output assets/ folder | ✓ if imported filename matches a collected MCP asset

This accounts for the production-viable pattern where Claude downloads the MCP asset via curl during the session and imports it as a local file — the only approach that works outside the session, since the localhost MCP server is unavailable once the session ends.

Asset Presence Ratio — proportion of detected assets that are present. A higher score indicates more assets rendered correctly in the output.

Asset Presence Ratio = assets present / assets detected

Asset Source Ratio — proportion of present assets sourced from MCP localhost. A higher score indicates more assets correctly sourced from the Figma MCP server.

Asset Source Ratio = assets from MCP localhost / assets present

No composite score. Both ratios published independently alongside raw counts.
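
The detection patterns above can be sketched as a single pass over the rendered DOM. This regex-based version is a simplification for illustration; the real pipeline inspects the live DOM, and the function name is hypothetical.

```python
# Sketch only: regex detection approximates DOM inspection.
import re

LOCALHOST = re.compile(r"^https?://localhost:\d+/")

def asset_ratios(dom_html):
    # Collect (present, from_mcp) per detected asset.
    detected = []
    for m in re.finditer(r'<img[^>]*\bsrc="([^"]*)"', dom_html):
        src = m.group(1)
        detected.append((bool(src), bool(LOCALHOST.match(src))))
    for m in re.finditer(r'<image[^>]*\bhref="([^"]*)"', dom_html):
        href = m.group(1)
        detected.append((bool(href), bool(LOCALHOST.match(href))))
    for _ in re.finditer(r'<svg[^>]*>(?:(?!</svg>).)*?<path\b', dom_html, re.S):
        detected.append((True, False))  # inline path data: hardcoded
    present = [d for d in detected if d[0]]
    presence = len(present) / len(detected) if detected else None
    source = (sum(1 for d in present if d[1]) / len(present)
              if present else None)
    return presence, source
```

A component with one MCP-sourced image and one hand-drawn inline SVG would score presence 1.0 and source 0.5.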


Test Environment

Every test runs in a clean, empty directory with no CLAUDE.md files and no pre-configured project context. Claude Code must have only the Figma desktop MCP server connected — not the remote Figma MCP server, which returns Figma API asset URLs instead of localhost URLs, producing invalid asset source scores regardless of model behavior. No skills may be installed — Figma's implement-design skill overrides P1 and P2 prompt behavior, collapsing all three levels into equivalent output.

The token output limit is set to 64,000 before each session to prevent truncation on complex components.

Claude Code is always allowed to proceed — all tool use approvals are granted and all file creation requests are accepted. No clarifications or additional instructions are provided beyond the initial prompt.

All components are measured against a 1440×600px white canvas exported at 3x from Figma. Playwright captures screenshots at the same dimensions with a forced white background, ensuring 1:1 pixel comparison against the baseline.


Data Schema

Every test run produces one immutable row in data/results.csv. Errors are never edited or deleted — corrections add a new row with supersedes: [original_id] and mark the original score_error: true.

Field Format Description
id dtc-[nanoid8] Unique run ID
date YYYY-MM-DD Date run
methodology_version 1.0 Version of this document — no v prefix
script_version string Git commit hash of measurement scripts
submitted_by string GitHub handle for community submissions
design_system M3 Source design system
figma_kit_version string From Figma file title — no v prefix
component_name string e.g. Button
component_type element | pattern | layout
component_description string Primary variant properties
figma_file_key string From Figma URL
figma_node_id string ID of the 1440×600px frame
figma_version string Figma desktop app version — no v prefix
test_type baseline v1.0 baseline tests only
prompt_level p1 | p2 | p3
claude_version string From claude --version — no v prefix
model string Underlying model name — extracted from JSONL
output_format react | html Format Claude produced
estimated_token_count integer From Figma Dev Mode
ves 0–1 Visual embedding similarity
mae 0–1 Mean absolute error (supplementary)
visual_fidelity_score 0–1 Equal to VES
absolute_positioning_ratio 0–1 APR
relative_unit_ratio 0–1 RUR
responsiveness_score 0–1 Dimension composite
semantic_tag_ratio 0–1 STR
inline_style_ratio 0–1 ISR
arbitrary_value_usage 0–1 AVU
maintainability_score 0–1 Dimension composite
asset_presence_ratio 0–1 or null Assets present / assets detected. Null only when component has no detectable assets.
asset_source_ratio 0–1 or null MCP localhost assets / assets present. Null when no assets present.
asset_presence_pass integer Raw count
asset_presence_fail integer Raw count
asset_source_pass integer Raw count
asset_source_fail integer Raw count
actual_token_count integer From Claude Code session
mcp_tool_calls integer MCP tool calls made
p_value float or null Two-sample t-test — null when fewer than 5 runs
effect_size float or null Cohen's d — null when fewer than 5 runs
score_error true or blank Marks a known scoring error
supersedes dtc-[nanoid] or blank ID of the row this corrects
errors string MCP errors, truncation, tool failures
notes string Session anomalies, deviations, findings
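
As a sketch of the effect_size field, Cohen's d with a pooled standard deviation needs only the stdlib. This is illustrative; the published p_value additionally requires the t distribution, which is omitted here.

```python
# Sketch only: pooled-SD Cohen's d between two samples of runs.
from statistics import mean, stdev

def cohens_d(sample_a, sample_b):
    na, nb = len(sample_a), len(sample_b)
    pooled = (((na - 1) * stdev(sample_a) ** 2 +
               (nb - 1) * stdev(sample_b) ** 2) / (na + nb - 2)) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / pooled
```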

Data Integrity

Record claude --version before every session. Results produced by different Claude versions are not directly comparable — if Claude's version changes mid-batch, affected components must be rerun.

Published rows are immutable. Measurement errors are corrected by adding a new row with supersedes: [original_id] and marking the original score_error: true. Rows are never edited or deleted.
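
The correction workflow can be sketched as follows. The supersede helper is hypothetical: in practice the marking happens in data/results.csv, not in memory, and no other field of the original row changes.

```python
# Sketch only: append a superseding row instead of editing in place.
def supersede(rows, original_id, corrected_row):
    # Mark the original row as a known scoring error, then append the
    # correction with a supersedes pointer back to it.
    for row in rows:
        if row["id"] == original_id:
            row["score_error"] = "true"
    rows.append(dict(corrected_row, supersedes=original_id))
    return rows
```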


Known Limitations

Limitation | Notes
Remote Figma MCP server returns Figma API asset URLs instead of localhost URLs — asset_source_ratio is always 0 when using the remote server | Tests standardized on figma-desktop only; the behavioral difference between servers is itself a documented finding
RUR approximated via Tailwind class inspection, not computed CSS unit types | Acceptable for v1.0; direct CSS inspection planned for v1.1
P1 may produce HTML/CSS output if Claude ignores the React + Tailwind instruction | output_format field records what Claude actually produced; AVU is only meaningful for Tailwind output
AVU is 0 by default for HTML/CSS output — not a genuine signal of token alignment | Filter by output_format: react when analyzing AVU
VES captures structural similarity but is relatively insensitive to subtle color or font differences | MAE published as supplementary signal for pixel-level comparison
Fixed canvas means APR and RUR measure layout code structure, not observed responsiveness across viewport sizes | Multi-viewport testing and BC deferred to v1.1
Inline SVG asset source is always scored as a fail — the metric cannot distinguish between Claude drawing an icon from scratch and Claude theoretically fetching and inlining an MCP asset | In practice Claude does not fetch and inline MCP assets dynamically; all observed inline SVGs have been hand-drawn substitutions
Single run per component in v1.0 — findings are preliminary until replicated | Run tracking and evidence labels deferred to v1.1
component_description is manually recorded — no automated verification that the tested node matches the described variant | Requires human discipline during test runs
estimated_token_count sourced from Figma Dev Mode, not verified against actual session token usage | actual_token_count recorded separately from the Claude Code session
FU (Flex/Grid Utilization) and BC (Breakpoint Coverage) not implemented | Deferred to v1.1
CCR (Custom Class Reuse) not implemented | Deferred to v1.1

Related Work

Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering (Si et al., NAACL 2025) — The first real-world benchmark for UI code generation from screenshots. 484 manually curated webpages, image-only inputs, evaluated across multiple frontier models.

FIGMA2CODE: Automating Multimodal Design-to-Code in the Wild (Gui et al., ICLR 2026) — 213 high-quality Figma designs benchmarked across ten models using direct Figma API access and pre-processed metadata. Source of the VES, MAE, APR, RUR, STR, ISR, and AVU metrics used in this methodology.

DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation (Xiao et al., 2025) — Multi-framework, multi-task benchmark covering React, Vue, Angular, and vanilla HTML/CSS across 900 webpage samples. Evaluates generation, editing, and repair tasks.

UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools (Jung et al., 2025) — Large-scale benchmark evaluating visual quality of AI text-to-app tools through expert pairwise comparison. 10 tools, 30 prompts, 4,000+ expert judgments.


Version 1.0 — March 2026 · Published at design-to-code.fyi/methodology