design-to-code — Testing Methodology
Version 1.0 | March 2026
Overview
What is design-to-code.fyi?
An open-source benchmark that measures how Claude Code translates Figma designs into front-end code via the Figma MCP server.
Why we exist
AI design-to-code tools are routinely evaluated through demos and "vibes", making it difficult to know what they actually do well and where they consistently fall short. design-to-code.fyi exists to cut through the noise and provide a reliable source of truth by measuring outputs systematically and publishing all data and findings openly.
Through this approach we aim to provide a factual reference point for comparing tools and prompting strategies, documented evidence of where AI tools fail so that designers can account for these gaps in their workflow, and a foundation for a more honest dialogue about what AI can and cannot reliably do in a design-to-code workflow.
What we test
The beta version tests individual M3 components rather than full products or systems. This bottom-up approach reveals AI failure modes in a more granular and systematic way — it's easier to isolate where things go wrong at the component level before scaling to more complex layouts.
We use the Material Design 3 Figma Kit as the reference design system. M3 is publicly available and follows all the file conditions Figma's MCP documentation identifies as optimal for code generation — Figma Variables for design tokens, Auto Layout throughout, and semantic layer naming. This makes M3 a best-case baseline: it tells us what Claude Code produces when given the best possible Figma input.
How we test
Each component is tested at three prompt levels (see Prompts) to evaluate how prompt specificity affects output quality. Claude Code runs in an empty environment with no CLAUDE.md files, no pre-configured skills, and only the Figma desktop MCP server connected, replicating the conditions a designer would encounter on a fresh project. Output is automatically measured across four dimensions — visual fidelity, layout responsiveness, code maintainability, and asset handling — using a reproducible script pipeline (see Evaluation). Every result is published as an immutable row in a public CSV, so findings can be verified, challenged, and built on.
Scope
The beta tests one model (Claude Code), one design system (M3), and one output framework (React + Tailwind). It does not cover full products, interactive states, accessibility, or real-world Figma files with imperfect hygiene. For specific measurement limitations see Known Limitations.
What Gets Tested
design-to-code v1.0 beta tests components from design systems that follow Figma's documented best practices for MCP-based code generation — specifically, systems that use Figma Variables for design tokens, Auto Layout, semantic layer naming, and live component instances. The beta version uses the Material Design 3 Figma Kit as the reference design system.
Components are organized into three types in order of complexity:
| Type | Description | Example |
|---|---|---|
| Element | A single self-contained component | Button, checkbox, input field |
| Pattern | A functional grouping of elements | Form, card, navigation bar |
| Layout | A full screen or page-level composition | Homepage, dashboard, app screen |
The v1.0 beta primarily focuses on Elements and Patterns. Layouts are included in scope but are tested as single fixed-canvas compositions, not as responsive components that adapt across multiple viewport sizes. Multi-viewport Layout testing requires a different measurement approach and is planned for a future version.
Each component is tested three times — once per prompt level — using the same test variant across all three runs. The test variant is the property combination most prominently featured in the kit, recorded explicitly in the component_description field of each result row. Variant testing, interactive states, dark/light mode, and accessibility are out of scope for v1.0.
Prompts
Every component is tested at three prompt levels. Each level represents a real prompting approach a designer might use and is grounded in official Figma documentation.
P1 — Modified Dev Mode default
The minimal prompt Figma surfaces in Dev Mode. Output is specified as React + Tailwind CSS to create a consistent baseline across all three levels — without this, Claude Code defaults to HTML/CSS output in an empty directory.
Implement this design from Figma as a React component with Tailwind CSS.
[FIGMA_LINK]
Source: Figma Dev Mode UI
P2 — Explicit tool sequencing
Adds explicit MCP tool calls based on Figma's guidance that the model does not always automatically select the right tool — in particular, get_variable_defs must be requested explicitly to return design token references rather than raw values.
Implement this Figma design as a React component with Tailwind CSS.
Before writing any code:
1. Run get_design_context on the node
2. Run get_variable_defs to extract design tokens
3. Run get_screenshot for visual reference
4. If the response is too large, run get_metadata first, then fetch child nodes individually
Then implement. Match the visual design exactly. Use semantic HTML.
Do not invent values not present in the design.
[FIGMA_LINK]
Source: figma/mcp-server-guide — "Trigger specific tools when needed"
P3 — Figma Implement Design skill
A prompt-level adaptation of Figma's official recommended workflow for design-to-code with the MCP — combining the required tool sequence from the implement-design skill with Figma's Claude Code asset handling rules. Project-specific instructions are omitted since tests run in an empty environment.
Implement this Figma design as a React component with Tailwind CSS.
Required workflow — do not skip steps:
1. Run get_design_context on the node
2. If the response is too large, run get_metadata first, then re-fetch specific nodes with get_design_context
3. Run get_screenshot — this is your visual source of truth throughout implementation
4. Download any assets returned by the MCP server before writing code
Asset rules:
- If the MCP server returns a localhost URL for an image or SVG, use it directly — do not substitute
- Do not import icon packages — all assets must come from the Figma MCP payload
- Do not create placeholders if a localhost asset is provided
Implementation:
- Match the design exactly — use design tokens from Figma where available
- Use semantic HTML elements
- Before finishing, validate your output against the get_screenshot
[FIGMA_LINK]
Sources: figma/mcp-server-guide — skills/figma-implement-design/SKILL.md (workflow steps) · figma/mcp-server-guide — Claude Code asset rules (asset handling)
Evaluation
Outputs are evaluated across four dimensions. No composite score is assigned — each dimension is published independently.
| Dimension | What it measures |
|---|---|
| Visual Fidelity | Does the rendered output visually match the design? |
| Layout Responsiveness | Does the code use a flexible layout system or absolute positioning? |
| Code Maintainability | Is the code structured for clarity, reuse, and long-term evolution? |
| Asset Handling | Does the component correctly load assets from the Figma MCP? |
Visual Fidelity, Layout Responsiveness, and Code Maintainability follow the metric framework established in FIGMA2CODE: Automating Multimodal Design-to-Code in the Wild (Gui et al., ICLR 2026). v1.0 implements a subset of their metrics; FU (Flex/Grid Utilization) and BC (Breakpoint Coverage) under Layout Responsiveness and CCR (Custom Class Reuse) under Code Maintainability are deferred to v1.1. VES is upgraded from DINOv2 to DINOv3. Asset Handling is original to design-to-code.
3.1 Visual Fidelity
Visual Embedding Similarity (VES) — cosine similarity between DINOv3 embeddings of the rendered screenshot and the Figma baseline export. Uses DINOv3 ViT-B/16, loaded locally. Both images are resized to fit within 512px, padded to the nearest 16px patch boundary with neutral grey (RGB 128,128,128), and ImageNet normalized before comparison.
VES = cosine_similarity(DINOv3(screenshot), DINOv3(baseline))
Range: 0 to 1 in practice. A higher score indicates greater visual similarity between the rendered output and the original design. Primary visual metric.
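The preprocessing geometry and the cosine step can be sketched in plain Python. This is an illustrative sketch only: the DINOv3 forward pass and ImageNet normalization are omitted, and the helper names below are not the pipeline's actual code.

```python
import math

def fit_within(w, h, limit=512):
    """Scale (w, h) down proportionally so both sides fit within `limit`."""
    scale = min(limit / w, limit / h, 1.0)
    return round(w * scale), round(h * scale)

def pad_to_patch(w, h, patch=16):
    """Pad dimensions up to the nearest multiple of the ViT patch size.
    The padded border is filled with neutral grey, RGB (128, 128, 128)."""
    return math.ceil(w / patch) * patch, math.ceil(h / patch) * patch

def cosine_similarity(a, b):
    """VES = cosine of the angle between the two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# A 1440x600 canvas resized to fit within 512px, then padded to the 16px grid:
w, h = fit_within(1440, 600)   # -> (512, 213)
w, h = pad_to_patch(w, h)      # -> (512, 224)
```

Padding to the patch grid matters because ViT-B/16 tokenizes the image in 16px patches; a 213px edge would otherwise be truncated or rejected by the model.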
Mean Absolute Error (MAE) — average pixel-level difference between screenshot and baseline. A lower score indicates fewer pixel-level differences. Sensitive to global color shifts unrelated to structural quality. Published as a supplementary field only — not included in any score.
Visual Fidelity Score = VES
3.2 Layout Responsiveness
Measures whether the code uses a flexible layout system or hardcoded coordinates. In v1.0, components are tested on a fixed canvas — APR and RUR are structural signals that predict how the component will behave at other viewport sizes rather than measuring observed responsiveness. BC (Breakpoint Coverage) and FU (Flex/Grid Utilization) are deferred to v1.1.
Absolute Positioning Ratio (APR) — proportion of positioned elements using position: absolute or position: fixed. A lower score indicates less reliance on absolute positioning.
APR = (absolute or fixed elements) / (total positioned elements)
Relative Unit Ratio (RUR) — proportion of layout Tailwind classes using relative units (no arbitrary value syntax, no px). A higher score indicates greater use of relative units. Approximated via Tailwind class inspection — see Known Limitations.
RUR = relative unit layout classes / total layout classes
Responsiveness Score = ((1 − APR) + RUR) / 2
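The two signals and their composite can be sketched as follows. The class check below is a deliberate simplification for illustration: it flags only Tailwind arbitrary-value syntax, whereas the real pipeline also excludes px-based classes.

```python
import re

# Simplified, assumed classifier: a layout class is treated as relative
# unless it uses Tailwind arbitrary value syntax (e.g. w-[340px]).
ARBITRARY = re.compile(r"\[[^\]]+\]")

def responsiveness(absolute_count, positioned_count, layout_classes):
    """APR and RUR from pre-extracted element counts and layout class tokens."""
    apr = absolute_count / positioned_count if positioned_count else 0.0
    relative = [c for c in layout_classes if not ARBITRARY.search(c)]
    rur = len(relative) / len(layout_classes) if layout_classes else 1.0
    return {"apr": apr, "rur": rur, "score": ((1 - apr) + rur) / 2}

# 1 of 4 positioned elements is absolute; 3 of 4 layout classes are relative:
responsiveness(1, 4, ["flex", "gap-4", "w-[340px]", "p-6"])
# -> {'apr': 0.25, 'rur': 0.75, 'score': 0.75}
```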
3.3 Code Maintainability
Measures structural code quality. CCR (Custom Class Reuse) is deferred to v1.1.
Semantic Tag Ratio (STR) — proportion of semantic HTML elements (header, nav, main, article, section, aside, footer, button, input, textarea, select, label, form, ul, ol, li, h1–h6, p) among all DOM elements. A higher score indicates greater use of semantic HTML, reflecting better structural clarity.
STR = semantic elements / total DOM elements
Inline Style Ratio (ISR) — proportion of elements with inline style attributes. A lower score indicates less direct value mapping from Figma metadata and greater use of a proper styling system.
ISR = elements with inline styles / total DOM elements
Arbitrary Value Usage (AVU) — proportion of Tailwind class tokens using arbitrary value syntax (e.g. w-[123px], bg-[#6750A4]). A lower score indicates less reliance on hardcoded values and greater alignment with Tailwind's token system.
AVU = arbitrary Tailwind class tokens / total Tailwind class tokens
Maintainability Score = (STR + (1 − ISR) + (1 − AVU)) / 3
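A minimal sketch of the three ratios and their composite, assuming tag names, an inline-style count, and Tailwind class tokens have already been extracted from the rendered DOM (the extraction step itself is not shown):

```python
import re

SEMANTIC = {
    "header", "nav", "main", "article", "section", "aside", "footer",
    "button", "input", "textarea", "select", "label", "form",
    "ul", "ol", "li", "h1", "h2", "h3", "h4", "h5", "h6", "p",
}
# Matches Tailwind arbitrary value syntax, e.g. w-[123px], bg-[#6750A4]
ARBITRARY = re.compile(r"\[[^\]]+\]")

def maintainability(tags, inline_styled, tailwind_tokens):
    """tags: one tag name per DOM element; inline_styled: elements with a
    style attribute; tailwind_tokens: every Tailwind class token emitted."""
    str_ = sum(t in SEMANTIC for t in tags) / len(tags)
    isr = inline_styled / len(tags)
    arbitrary = sum(bool(ARBITRARY.search(t)) for t in tailwind_tokens)
    avu = arbitrary / len(tailwind_tokens) if tailwind_tokens else 0.0
    return (str_ + (1 - isr) + (1 - avu)) / 3
```

For example, four elements of which two are semantic (STR 0.5), one carries an inline style (ISR 0.25), and two of four Tailwind tokens are arbitrary (AVU 0.5) yield a score of about 0.58.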
3.4 Asset Handling
Measures whether the component uses assets provided by the Figma MCP server rather than substituting external packages or inlining hardcoded SVG paths.
Three asset presence patterns are detected in the rendered DOM:
| Pattern | Example | Presence | Source |
|---|---|---|---|
| `<img src="...">` | `<img src="http://localhost:3845/...">` | ✓ if `src` non-empty | ✓ if `src` is localhost |
| Inline `<svg>` with path data | `<svg><path d="M10 2.5..."/></svg>` | ✓ always | ✗ always (hardcoded) |
| `<svg><image href="...">` | `<svg><image href="http://localhost:3845/..."/>` | ✓ if `href` non-empty | ✓ if `href` is localhost |
A fourth source pattern is detected from the component source code:
| Pattern | Detection | Source |
|---|---|---|
| Local asset import | Component imports a file (e.g. `import icon from '../assets/star-icon.svg'`) that exists in the output `assets/` folder | ✓ if imported filename matches a collected MCP asset |
This accounts for the production-viable pattern where Claude downloads the MCP asset via curl during the session and imports it as a local file — the only approach that works outside the session, since the localhost MCP server is unavailable once the session ends.
Asset Presence Ratio — proportion of detected assets that are present. A higher score indicates more assets rendered correctly in the output.
Asset Presence Ratio = assets present / assets detected
Asset Source Ratio — proportion of present assets sourced from MCP localhost. A higher score indicates more assets correctly sourced from the Figma MCP server.
Asset Source Ratio = assets from MCP localhost / assets present
No composite score. Both ratios published independently alongside raw counts.
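The pattern classification and both ratios can be sketched as follows. The asset records and the localhost check are illustrative assumptions about how detected assets might be represented, not the pipeline's actual data model:

```python
from urllib.parse import urlparse

def is_mcp_localhost(url):
    """Assumed check: the Figma desktop MCP serves assets from localhost."""
    return urlparse(url).hostname in ("localhost", "127.0.0.1")

def asset_ratios(assets):
    """assets: dicts like {"kind": "img", "url": "..."} or {"kind": "inline_svg"}.
    Inline SVGs count as present but never as MCP-sourced (hardcoded paths)."""
    present, from_mcp = 0, 0
    for a in assets:
        if a["kind"] == "inline_svg":
            present += 1                      # presence ✓, source ✗
        elif a.get("url"):
            present += 1
            if is_mcp_localhost(a["url"]):
                from_mcp += 1
    presence_ratio = present / len(assets) if assets else None
    source_ratio = from_mcp / present if present else None
    return presence_ratio, source_ratio
```

A component with one localhost image, one hand-drawn inline SVG, and one empty `src` would score a presence ratio of 2/3 and a source ratio of 0.5, matching the null-handling rules in the schema.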
Test Environment
Every test runs in a clean, empty directory with no CLAUDE.md files and no pre-configured project context. Claude Code must have only the Figma desktop MCP server connected — not the remote Figma MCP server, which returns Figma API asset URLs instead of localhost URLs, producing invalid asset source scores regardless of model behavior. No skills may be installed — Figma's implement-design skill overrides P1 and P2 prompt behavior, collapsing all three levels into equivalent output.
The token output limit is set to 64,000 before each session to prevent truncation on complex components.
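In practice this can be done per session via an environment variable. This is a sketch that assumes the `CLAUDE_CODE_MAX_OUTPUT_TOKENS` variable from Claude Code's settings documentation:

```shell
# Raise the output-token ceiling before each session to prevent truncation
# on complex components (variable name assumed from Claude Code settings docs).
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
claude   # start the session in the clean test directory
```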
Claude Code is always allowed to proceed — all tool use approvals are granted and all file creation requests are accepted. No clarifications or additional instructions are provided beyond the initial prompt.
All components are measured against a 1440×600px white canvas exported at 3x from Figma. Playwright captures screenshots at the same dimensions with a forced white background, ensuring 1:1 pixel comparison against the baseline.
Data Schema
Every test run produces one immutable row in data/results.csv. Errors are never edited or deleted — corrections add a new row with supersedes: [original_id] and mark the original score_error: true.
| Field | Format | Description |
|---|---|---|
| `id` | `dtc-[nanoid8]` | Unique run ID |
| `date` | `YYYY-MM-DD` | Date run |
| `methodology_version` | `1.0` | Version of this document — no v prefix |
| `script_version` | string | Git commit hash of measurement scripts |
| `submitted_by` | string | GitHub handle for community submissions |
| `design_system` | `M3` | Source design system |
| `figma_kit_version` | string | From Figma file title — no v prefix |
| `component_name` | string | e.g. Button |
| `component_type` | `element` \| `pattern` \| `layout` | — |
| `component_description` | string | Primary variant properties |
| `figma_file_key` | string | From Figma URL |
| `figma_node_id` | string | ID of the 1440×600px frame |
| `figma_version` | string | Figma desktop app version — no v prefix |
| `test_type` | `baseline` | v1.0 baseline tests only |
| `prompt_level` | `p1` \| `p2` \| `p3` | — |
| `claude_version` | string | From `claude --version` — no v prefix |
| `model` | string | Underlying model name — extracted from JSONL |
| `output_format` | `react` \| `html` | Format Claude produced |
| `estimated_token_count` | integer | From Figma Dev Mode |
| `ves` | 0–1 | Visual embedding similarity |
| `mae` | 0–1 | Mean absolute error (supplementary) |
| `visual_fidelity_score` | 0–1 | Equal to VES |
| `absolute_positioning_ratio` | 0–1 | APR |
| `relative_unit_ratio` | 0–1 | RUR |
| `responsiveness_score` | 0–1 | Dimension composite |
| `semantic_tag_ratio` | 0–1 | STR |
| `inline_style_ratio` | 0–1 | ISR |
| `arbitrary_value_usage` | 0–1 | AVU |
| `maintainability_score` | 0–1 | Dimension composite |
| `asset_presence_ratio` | 0–1 or null | Assets present / assets detected; null only when component has no detectable assets |
| `asset_source_ratio` | 0–1 or null | MCP localhost assets / assets present; null when no assets present |
| `asset_presence_pass` | integer | Raw count |
| `asset_presence_fail` | integer | Raw count |
| `asset_source_pass` | integer | Raw count |
| `asset_source_fail` | integer | Raw count |
| `actual_token_count` | integer | From Claude Code session |
| `mcp_tool_calls` | integer | MCP tool calls made |
| `p_value` | float or null | Two-sample t-test — null when fewer than 5 runs |
| `effect_size` | float or null | Cohen's d — null when fewer than 5 runs |
| `score_error` | `true` or blank | Marks a known scoring error |
| `supersedes` | `dtc-[nanoid]` or blank | ID of the row this corrects |
| `errors` | string | MCP errors, truncation, tool failures |
| `notes` | string | Session anomalies, deviations, findings |
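The `effect_size` field uses Cohen's d with a pooled standard deviation, which can be computed without external libraries. A minimal sketch (the actual measurement scripts may differ):

```python
import math

def cohens_d(a, b):
    """Two-sample Cohen's d: standardized mean difference with pooled SD."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    # Sample variances (Bessel-corrected)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * va + (len(b) - 1) * vb)
                       / (len(a) + len(b) - 2))
    return (ma - mb) / pooled
```

The sign indicates direction (negative when the first group scores lower); per the schema, the field stays null until at least 5 runs exist per group.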
Data Integrity
Record claude --version before every session. Results produced by different Claude versions are not directly comparable — if Claude's version changes mid-batch, affected components must be rerun.
Published rows are immutable. Measurement errors are corrected by adding a new row with supersedes: [original_id] and marking the original score_error: true. Rows are never edited or deleted.
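The append-only correction flow can be sketched as follows. The helper name, the `secrets`-based ID stand-in, and the two-field example row are illustrative assumptions, not the repository's actual tooling:

```python
import csv
import secrets

def supersede(path, original_id, corrected_row):
    """Append a correction row to results.csv; never edit or delete rows.
    The new row carries supersedes=[original_id]; marking the original
    row score_error: true is handled separately."""
    row = dict(corrected_row,
               id=f"dtc-{secrets.token_hex(4)}",  # stand-in for nanoid8
               supersedes=original_id)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        writer.writerow(row)
```

Because rows are only ever appended, any published finding can be traced back through its `supersedes` chain to the original measurement.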
Known Limitations
| Limitation | Notes |
|---|---|
| Remote Figma MCP server returns Figma API asset URLs instead of localhost URLs — `asset_source_ratio` is always 0 when using the remote server | Tests standardized on figma-desktop only; the behavioral difference between servers is itself a documented finding |
| RUR approximated via Tailwind class inspection, not computed CSS unit types | Acceptable for v1.0; direct CSS inspection planned for v1.1 |
| P1 may produce HTML/CSS output if Claude ignores the React + Tailwind instruction | output_format field records what Claude actually produced; AVU is only meaningful for Tailwind output |
| AVU is 0 by default for HTML/CSS output — not a genuine signal of token alignment | Filter by output_format: react when analyzing AVU |
| VES captures structural similarity but is relatively insensitive to subtle color or font differences | MAE published as supplementary signal for pixel-level comparison |
| Fixed canvas means APR and RUR measure layout code structure, not observed responsiveness across viewport sizes | Multi-viewport testing and BC deferred to v1.1 |
| Inline SVG asset source is always scored as a fail — the metric cannot distinguish between Claude drawing an icon from scratch and Claude theoretically fetching and inlining an MCP asset | In practice Claude does not fetch and inline MCP assets dynamically; all observed inline SVGs have been hand-drawn substitutions |
| Single run per component in v1.0 — findings are preliminary until replicated | Run tracking and evidence labels deferred to v1.1 |
| `component_description` is manually recorded — no automated verification that the tested node matches the described variant | Requires human discipline during test runs |
| `estimated_token_count` sourced from Figma Dev Mode, not verified against actual session token usage | `actual_token_count` recorded separately from the Claude Code session |
| FU (Flex/Grid Utilization) and BC (Breakpoint Coverage) not implemented | Deferred to v1.1 |
| CCR (Custom Class Reuse) not implemented | Deferred to v1.1 |
Related Work
Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering (Si et al., NAACL 2025) — The first real-world benchmark for UI code generation from screenshots. 484 manually curated webpages, image-only inputs, evaluated across multiple frontier models.
FIGMA2CODE: Automating Multimodal Design-to-Code in the Wild (Gui et al., ICLR 2026) — 213 high-quality Figma designs benchmarked across ten models using direct Figma API access and pre-processed metadata. Source of the VES, MAE, APR, RUR, STR, ISR, and AVU metrics used in this methodology.
DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation (Xiao et al., 2025) — Multi-framework, multi-task benchmark covering React, Vue, Angular, and vanilla HTML/CSS across 900 webpage samples. Evaluates generation, editing, and repair tasks.
UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools (Jung et al., 2025) — Large-scale benchmark evaluating visual quality of AI text-to-app tools through expert pairwise comparison. 10 tools, 30 prompts, 4,000+ expert judgments.
Version 1.0 — March 2026 · Published at design-to-code.fyi/methodology