The Problem
Generating images with AI sounds attractive for professional contexts. It is cost-efficient, fast, and independent of many external constraints. But simple prompts rarely produce results that meet serious aesthetic standards. The outputs are probabilistic by nature: without careful preparation, the model has no idea what makes a specific brand look like itself.
Fig. 1 — Four outputs of the naive prompt: "Generate a photo for Horizn Studios, a premium travel brand."
In practice this showed up immediately. Early generations included Lufthansa logos on aircraft in the background, ICE trains pulling into frame, hotel lobbies and airport terminals. The model fills in the gaps with whatever it associates with "travel", which is far too generic to describe any specific brand.
The challenge is to communicate a brand's visual identity to the model as concretely as possible. Not just a mood, but specific signals: how scenes are lit, what colors appear where, how people are framed, what kind of story the image is telling.
The Idea
A brand's website already contains most of this information: the words the brand uses, the emotions they evoke, the imagery it chooses. All of it is there, just unstructured. The question is whether it can be extracted and made usable automatically.
This project attempts exactly that. Using the Berlin-based luggage brand Horizn Studios as a test case, text and images from selected pages of their website were scraped and analysed to build a structured brand profile: a machine-readable description of how the brand looks and feels. That profile then drives image generation prompts instead of a human writing them from scratch.
Fig. 2 — Pipeline overview: from URL to structured brand prompt.
The pipeline draws on several analytical frameworks. Kansei Engineering maps emotional positioning across twelve bipolar dimensions such as premium vs. accessible or minimal vs. ornate. Alongside that, the pipeline extracts primary settings, target-audience archetypes based on Sinus-Milieus, a structured color palette with role assignments, and photographic signals from lifestyle imagery: shot sizes, lighting conditions, composition styles, and people presence. These signals are sampled together and assembled into a structured prompt that gives the model concrete, role-specific instructions rather than a vague description.
Building the Pipeline
Scraping
The starting point is a scraper built with Playwright and BeautifulSoup. Playwright renders each page in a headless Chromium browser, which is necessary for Shopify storefronts where most content is injected dynamically. For Horizn Studios, six pages were scraped: the homepage, the about-us page, three editorial pages covering quality, sustainability efforts, and partners, and a product spotlight.
Text is filtered during collection: navigation elements, footers, size guides, and other non-editorial content are excluded. What remains is grouped into semantic chunks, each built around a heading and its associated body text. The result is a single corpus.json containing 58 text chunks and all usable image URLs found across those pages.
$ python pipeline.py
corpus.json — 6 pages scraped
────────────────────────────────────────────────
about_us 10 chunks 21 images
homepage 12 chunks 18 images
more_about_quality 8 chunks 6 images
more_about_our_efforts 9 chunks 14 images
more_about_our_partners 11 chunks 19 images
spotlight_sofo_backpack 8 chunks 9 images
────────────────────────────────────────────────
Total 58 chunks 87 images
Sample chunk [0] · about_us
heading : Introduction
text : "We seek to build bridges and make connections.
Our mission is to enable curious minds to
travel the world consciously and seamlessly."
Fig. 3 — corpus.json structure after scraping six Horizn Studios pages.
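The chunk-grouping step described above can be sketched in plain Python. This is a hypothetical simplification, assuming the filtered editorial content arrives as (tag, text) pairs; the function name and input format are illustrative, not the pipeline's actual API.

```python
def group_into_chunks(elements):
    """Group (tag, text) pairs into heading-anchored semantic chunks.

    Every heading starts a new chunk; body text accumulates under
    the most recent heading, mirroring the corpus.json structure.
    """
    chunks = []
    current = None
    for tag, text in elements:
        if tag in ("h1", "h2", "h3"):
            current = {"heading": text, "body": []}
            chunks.append(current)
        elif current is not None:
            current["body"].append(text)
    # Join each chunk's body paragraphs into one text field
    return [
        {"heading": c["heading"], "text": " ".join(c["body"])}
        for c in chunks
    ]

elements = [
    ("h2", "Introduction"),
    ("p", "We seek to build bridges and make connections."),
    ("p", "Our mission is to enable curious minds to travel."),
    ("h2", "Our Story"),
    ("p", "Founded in Berlin."),
]
# → two chunks, each with a heading and its joined body text
```

Non-editorial elements (navigation, footers, size guides) would be dropped before this step ever sees them.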
Text Analysis
Each text chunk is passed individually to Gemma 4 running via Ollama. The model analyses each chunk against four established frameworks: Kansei Engineering for product perception across twelve bipolar dimensions, the Geneva Emotion Wheel for emotional impact, IPTC NewsCodes for setting classification, and Sinus-Milieus for target-audience archetypes. For each chunk and dimension, the model returns a confidence-weighted vote; neutral responses are treated as abstentions. The winning signal per dimension represents the brand's position on that axis.
brand_profile.json — Kansei profile
────────────────────────────────────────────────────────
dimension result signal strength
────────────────────────────────────────────────────────
modern_traditional modern ████████████ 36.5
minimal_ornate minimal ████████░░░░ 12.7
functional_decorative functional ████████████ 26.6
bold_subtle bold ██░░░░░░░░░░ 5.3
urban_natural urban ██░░░░░░░░░░ 5.3
premium_accessible premium ████████████ 38.3
serious_playful serious ░░░░░░░░░░░░ 1.8
timeless_trendy timeless ████░░░░░░░░ 10.6
individual_collective collective █░░░░░░░░░░░ 3.5
rough_refined refined █████████░░░ 23.4
dynamic_static dynamic █████░░░░░░░ 12.1
transparent_mysterious transparent ██░░░░░░░░░░ —
────────────────────────────────────────────────────────
profile_confidence 0.844
Fig. 4 — Kansei profile extracted from 58 text chunks. Signal strength reflects accumulated confidence-weighted votes.
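The per-dimension vote aggregation can be sketched as follows. This is a minimal illustration, assuming each chunk contributes a (pole, confidence) vote and signal strength is the winner's accumulated confidence, as Fig. 4's caption describes; the function name and sample values are invented.

```python
from collections import defaultdict

def aggregate_dimension(votes):
    """Aggregate (pole, confidence) votes for one bipolar dimension.

    Neutral responses abstain; the pole with the highest accumulated
    confidence wins. Returns (winning_pole, signal_strength).
    """
    totals = defaultdict(float)
    for pole, confidence in votes:
        if pole == "neutral":
            continue  # neutral responses are treated as abstentions
        totals[pole] += confidence
    if not totals:
        return None, 0.0
    winner = max(totals, key=totals.get)
    return winner, totals[winner]

# Example votes for the premium_accessible dimension (illustrative values)
votes = [("premium", 0.5), ("premium", 0.75),
         ("accessible", 0.25), ("neutral", 1.0)]
# → "premium" wins with accumulated strength 1.25
```

Running this over all 58 chunks and all twelve dimensions would yield a profile shaped like Fig. 4.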
Vision Analysis
Not every image on a brand website is useful: product packshots, icons, and UI elements carry no lifestyle signal. The pipeline filters these out in two passes: first by URL pattern, then by asking the vision model directly whether the image shows the product in a real-life context. Only images that pass both filters are analysed in full for shot size, lighting, composition, color, and narrative style.
Fig. 5 — URL-based packshot filter. Top row: removed. Bottom row: passed to vision analysis.
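The two-pass filter might look like this. The URL patterns below are illustrative guesses at what a packshot filter could match, and the second pass is shown as a pluggable predicate standing in for the actual vision-model call.

```python
import re

# Pass 1: URL substrings that typically indicate packshots, icons,
# or UI assets (hypothetical patterns, not the pipeline's real list).
PACKSHOT_PATTERNS = re.compile(r"(packshot|icon|logo|badge|swatch)", re.IGNORECASE)

def passes_url_filter(url: str) -> bool:
    """Pass 1: reject URLs whose path suggests a non-lifestyle asset."""
    return not PACKSHOT_PATTERNS.search(url)

def is_lifestyle_image(url: str, ask_vision_model) -> bool:
    """Pass 2: only if the URL filter passes, ask the vision model
    whether the image shows the product in a real-life context.
    `ask_vision_model` abstracts the actual model call."""
    return passes_url_filter(url) and ask_vision_model(url)
```

Checking the cheap URL pattern first means the expensive vision call only runs on candidates that survive pass one.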
Aggregation
All text and vision signals flow into a single aggregation step. The model synthesises the raw votes and visual distributions into a final brand profile, including natural-language image prompt components ready for use.
A note on color extraction: during development the pipeline was also tested with a smaller local model. The general approach worked, but color values were significantly off. A color that should have been deep navy blue was returned as a bright saturated blue. The larger cloud model performed considerably better, though even that does not achieve full color accuracy. For a production context this would need a more reliable solution: either defining brand colors explicitly or replacing semantic extraction with direct pixel sampling.
Fig. 6 — Color values extracted per image by the vision model, with role assignments.
Generating the Prompts
The prompt generator reads the latest brand profile and decision log and draws a random combination of visual parameters, each weighted by how frequently it appeared in the analysed images. Setting, shot size, camera angle, lighting, composition, depth of field, narrative style: each is a slot filled from the observed data. Colors are drawn from the full pool of raw vision analysis entries rather than the aggregated palette, which gives 28 distinct values instead of five and preserves the actual variety of the brand's imagery.
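The frequency-weighted slot sampling can be sketched with the standard library. The slot names follow the text above; the option strings and weights are invented for illustration, not values from the real brand profile.

```python
import random

def draw_slots(distributions, rng=random):
    """Draw one value per slot, weighted by observed frequency.

    `distributions` maps each slot to {option: observed_count};
    more frequent options are proportionally more likely to be drawn.
    """
    return {
        slot: rng.choices(list(options), weights=list(options.values()), k=1)[0]
        for slot, options in distributions.items()
    }

# Illustrative distributions, as if counted from the analysed images
distributions = {
    "shot_size": {"medium shot": 14, "wide shot": 6, "close-up": 3},
    "lighting":  {"soft natural daylight": 18, "golden hour": 5},
    "setting":   {"urban street": 9, "minimalist interior": 7},
}
```

Each call produces a different combination, but always one skewed toward what the brand's imagery actually shows.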
The prompt is assembled as a deterministic template following the BFL prompting structure: Subject, Environment, Lighting, Composition, Colors, Mood. No second LLM call is needed, and every element is guaranteed to appear in the right order, every time.
Fig. 7 — Prompt template with randomly drawn slot values. Each run produces a different brand-consistent combination.
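The deterministic assembly step amounts to joining the drawn slot values in a fixed order. A minimal sketch, assuming each BFL section is a ready-made sentence fragment; the field contents are illustrative, not actual pipeline output.

```python
# Fixed BFL section order: Subject, Environment, Lighting,
# Composition, Colors, Mood.
BFL_ORDER = ["subject", "environment", "lighting", "composition", "colors", "mood"]

def assemble_prompt(slots: dict) -> str:
    """Join slot values into one prompt, always in BFL order.

    Raises if a section is missing, so every element is guaranteed
    to appear; no second LLM call is involved.
    """
    missing = [k for k in BFL_ORDER if k not in slots]
    if missing:
        raise ValueError(f"missing prompt slots: {missing}")
    return " ".join(slots[k] for k in BFL_ORDER)

slots = {
    "subject": "A traveller with a matte hard-shell suitcase.",
    "environment": "A minimalist urban street at street level.",
    "lighting": "Soft natural daylight, slightly overcast.",
    "composition": "Medium shot, rule-of-thirds framing, shallow depth of field.",
    "colors": "Deep navy, warm grey, and muted off-white tones.",
    "mood": "Calm, premium, quietly confident.",
}
```

Because the template is deterministic, two runs with the same drawn slots produce byte-identical prompts, which makes outputs reproducible and debuggable.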
The Results
Fig. 8 — Four generated images with their corresponding structured prompts. Model: Nano Banana 2 via ComfyUI.
Conclusion
Running the structured prompts through Nano Banana 2 produced results that are visibly more brand-consistent than the naive baseline. The model does not know Horizn Studios, but the prompt encodes enough of what the brand looks like that the output moves clearly in the right direction.
That said, the results are far from production-ready. Color accuracy remains approximate despite the role-based assignment. Shot size and composition instructions are followed inconsistently. Several details that would never appear in Horizn's actual imagery keep surfacing: people visible in the background, static poses that feel staged in the wrong way, framing that does not match the brand's energy. These are not just prompt engineering problems. They point to gaps in what the pipeline currently captures. How people move, how they carry themselves, the specific relationship between subject and background: none of this is encoded yet.
There are also more fundamental limitations. The suitcase in every prompt is described as a generic hard-shell. A real production pipeline would inject an actual product asset instead. The dataset of six pages is small. And while the prompt structure follows BFL guidelines, serious prompt engineering would go considerably further.
What the project does show is that the approach works in principle. Structured brand data produces meaningfully better outputs than a naive description. The more the pipeline knows about a brand, its visual grammar and not just its mood, the more precisely it can steer the model. The natural next step would be larger scraping scope, direct pixel sampling for color, and asset integration for the actual product.