1. Component Detection:
2. Description Generation:
<aside>
- For each component, Claude 3.7 Sonnet is prompted to provide a detailed description. We tested several approaches to description generation:
    - Structured JSON responses with predefined fields for color, font size, font family, and other attributes
    - Varying levels of description detail, from concise to comprehensive
    - Freeform descriptions without predefined fields, which produced better results but required more tokens
- We settled on the third option because freeform descriptions better capture the nuances of each component; a minimal prompt sketch follows this section.
</aside>
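As a rough illustration of this step, the sketch below shows how a cropped component image might be sent to Claude for a freeform description using the Anthropic Python SDK. The model identifier, prompt wording, and the `describe_component` helper are illustrative assumptions, not the exact production setup.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def describe_component(image_path: str) -> str:
    """Ask Claude for a freeform, detailed description of one cropped UI component."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed model identifier
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": image_b64,
                        },
                    },
                    {
                        "type": "text",
                        "text": (
                            "Describe this UI component in detail: layout, text "
                            "content, colors, spacing, typography, and any icons "
                            "or imagery. Write freeform prose, not JSON."
                        ),
                    },
                ],
            }
        ],
    )
    return response.content[0].text
```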
3. Code Generation:
<aside>
- In this final stage, we translate the structured, detailed information about UI components into functional code using Gemini 2.5 Pro.
- We evaluated several models for code generation:
    - Gemini 2.5 Pro: Delivers stable and consistent results
    - Claude 3.7 Sonnet: Excels at layout structure in some cases
    - GPT-4.1: Shows inconsistent performance
- Ultimately, we chose Gemini 2.5 Pro for code generation due to its superior visual understanding; a minimal call sketch follows this section.
</aside>
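The sketch below illustrates what the code-generation call might look like with the `google-genai` Python SDK, assuming the per-component descriptions and the full screenshot are passed together. The model identifier, prompt wording, and the `generate_code` helper are illustrative assumptions, not the exact production prompt.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment


def generate_code(component_descriptions: list[str], screenshot_bytes: bytes) -> str:
    """Turn per-component descriptions plus the full screenshot into HTML/CSS."""
    prompt = (
        "Generate a single self-contained HTML/CSS page that reproduces the "
        "screenshot. Use the component descriptions below as the source of "
        "truth for content, colors, and typography.\n\n"
        + "\n\n".join(component_descriptions)
    )
    response = client.models.generate_content(
        model="gemini-2.5-pro",  # assumed model identifier
        contents=[
            types.Part.from_bytes(data=screenshot_bytes, mime_type="image/png"),
            prompt,
        ],
    )
    return response.text
```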
4. Color Palette:
<aside>
Currently, the system relies on the LLM to suggest colors. We also experimented with Color Thief (https://github.com/fengsp/color-thief-py) to extract the color palette of a component (a minimal usage sketch follows this section), but faced several limitations:
- It captures colors from the entire component, including image accents, which adds noise.
- It often returns slightly lighter or darker shades, making the palette less precise.
- When passing Color Thief’s palette to the LLM, it sometimes negatively impacted the layout.
- We explored traditional methods such as clustering to extract colors, but since Color Thief already uses a similar approach internally, we continued with Color Thief rather than building a separate clustering step.
- We also experimented with computer vision techniques such as superpixel segmentation, combining the segmented images with the original input, but this did not improve the results.
</aside>
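For reference, the sketch below shows the kind of Color Thief usage we experimented with, assuming the `colorthief` package from the repository above; the file path, palette size, and the `extract_palette` helper are illustrative.

```python
from colorthief import ColorThief


def extract_palette(component_path: str, color_count: int = 6) -> list[str]:
    """Return the dominant colors of a cropped component as hex strings."""
    thief = ColorThief(component_path)
    # quality trades accuracy for speed; 1 inspects every pixel
    palette = thief.get_palette(color_count=color_count, quality=1)
    return ["#{:02x}{:02x}{:02x}".format(r, g, b) for r, g, b in palette]


# Example: the returned colors include accents from embedded images (the
# "noise" issue noted above), and shades can drift lighter or darker than
# the true UI colors.
print(extract_palette("component_crop.png"))
```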
5. Layout:
<aside>
- Few-shot prompting: Provided the LLM with example website code, images, and description samples to improve layout handling and image understanding. This increased token usage but did not significantly improve accuracy.
- Canny edge detection: Tested to capture layout boundaries, but it had no meaningful impact on results.
- Sobel edge detection: Also tried to highlight horizontal and vertical structures, but it similarly showed no noticeable improvement. A minimal sketch of both edge-detection passes follows this section.
</aside>
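For illustration, the sketch below shows both edge-detection passes with OpenCV; the thresholds, kernel size, and file paths are placeholder assumptions rather than tuned values from our experiments.

```python
import cv2
import numpy as np

img = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)

# Canny: binary edge map intended to expose layout boundaries
canny_edges = cv2.Canny(img, threshold1=100, threshold2=200)

# Sobel: separate horizontal and vertical gradients to highlight row/column
# structure, then combine their magnitudes into a single edge image
sobel_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
sobel_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
sobel_mag = cv2.convertScaleAbs(np.sqrt(sobel_x**2 + sobel_y**2))

# These maps were passed alongside the original screenshot to the model,
# but neither produced a noticeable improvement in layout accuracy.
cv2.imwrite("canny_edges.png", canny_edges)
cv2.imwrite("sobel_edges.png", sobel_mag)
```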
6. Initial Thought Process:
<aside>
- From the start, it was clear that a divide-and-conquer approach was needed, as one-shot generation using VLMs was not producing good results.
- We initially tried generating code for each individual component and then merging them, but this approach consumed a lot of tokens, increased computational load, and still failed to deliver accurate results.
- Eventually, we settled on generating a comprehensive description for each component before moving to code generation.
</aside>
7. Font Detection:
<aside>
Font Recognition Challenges
- Current models have limited font-identification capabilities; accurate font matching would require extensive additional training data and model development (see https://huggingface.co/gaborcselle/font-identifier). For now, we rely on font approximation, sketched below.
</aside>
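As a purely hypothetical illustration of font approximation, the sketch below maps coarse traits from the LLM's description (serif vs. sans-serif, monospace) onto common CSS font stacks; the trait names and the mapping are assumptions, not the exact rules we use.

```python
def approximate_font_stack(serif: bool, monospace: bool = False) -> str:
    """Pick a reasonable CSS font-family stack from coarse font traits."""
    if monospace:
        return "'Courier New', Courier, monospace"
    if serif:
        return "Georgia, 'Times New Roman', serif"
    return "'Helvetica Neue', Arial, sans-serif"


# Example: a description mentioning "clean sans-serif headings"
print(approximate_font_stack(serif=False))  # 'Helvetica Neue', Arial, sans-serif
```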