On understanding how machines perceive vibes, aesthetics, and meaning
Lately, I've been thinking about how images are represented in feature space:
Where in a neural network does the vibe of an image actually reside?
This started while playing with same.energy, a visual search engine powered by CLIP embeddings. Compared to Pinterest, its results feel... more semantically fixated. Less stylistic drift, more concept lock-in. And that got me wondering:
Why does one model generalize toward style, and another toward meaning?
Is it how CLIP encodes images? Or something about the final similarity search?
This small difference pulled me deeper into the mechanics of visual representation in diffusion models—and eventually into studies of human vision, LoRA fine-tuning, and latent space disentanglement.
Inspiration
Same.energy is a visual search engine that uses CLIP and vector similarity search to find and display images based on learned embeddings of images and text.
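For concreteness, here's a minimal sketch of how a CLIP-plus-nearest-neighbor pipeline like this can work: embed each image once, normalize, and rank candidates by cosine similarity. The model checkpoint, the tiny in-memory "index", and the file paths are my own assumptions for illustration; same.energy's actual stack is not public.

```python
# Minimal sketch: CLIP image embeddings + cosine-similarity search.
# Assumes the openai/clip-vit-base-patch32 checkpoint and a toy in-memory index.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    # Unit-normalize so a dot product equals cosine similarity.
    return feats / feats.norm(dim=-1, keepdim=True)

# The "index" here is just a matrix of normalized embeddings;
# a real system would use an ANN library (FAISS, ScaNN, etc.).
index = embed_images(["img_001.jpg", "img_002.jpg", "img_003.jpg"])

def search(query_path, top_k=2):
    q = embed_images([query_path])
    scores = (q @ index.T).squeeze(0)           # cosine similarities
    return scores.topk(top_k).indices.tolist()  # positions of nearest neighbors

print(search("query.jpg"))
```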
How is this different from Pinterest?
Functionally, they're similar: both surface visual results and let users explore related imagery. Pinterest has been evolving its visual search algorithm around object recognition and product discovery since 2014. But after spending time with same.energy—and as an avid Pinterest user—I noticed something different:
Same.energy seems to fixate on semantic meaning. Pinterest tends to generalize better to stylistic variation.
Here's what I mean:
Pinterest returns images with diverse subjects—fish, boats, ballerinas—while same.energy focuses on a single type of composition: mostly a solitary figure. I saw this across many queries.
Of course, Pinterest's model is proprietary, so it's possible their search algorithm explicitly diversifies results at the last stage. But this behavior got me thinking:
How exactly are images embedded and organized in latent space?
And where—if anywhere—does a vibe live inside the model?
Related Work
Recent research by Frenkel et al. explores a LoRA-based fine-tuning method for disentangling style and semantic content in SDXL. By training lightweight adapters (LoRAs) in specific layers of Stable Diffusion, they found that certain layers specialize in texture and color, while others focus on structure.
Though their experiments focused on SDXL, their findings raise questions for other diffusion architectures too, such as Diffusion Transformers (DiTs). Are there "modules" inside these models that correspond to artistic concepts? Can we trace which layers encode color grading, or spatial rhythm, or emotional tone?
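To make the mechanism concrete, here's a hedged sketch of attaching LoRA adapters to only a chosen subset of the SDXL UNet's attention projections with diffusers and peft, leaving everything else frozen. The targeted block and the rank are illustrative assumptions on my part, not the specific layers or settings the paper identifies.

```python
# Sketch: layer-specific LoRA fine-tuning on SDXL.
# Assumption: up_blocks.0 is just an illustrative target; training would
# reveal what that block actually learns (style-like vs. content-like).
import torch
from diffusers import StableDiffusionXLPipeline
from peft import LoraConfig

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# Target only the to_q/to_k/to_v/to_out projections inside up_blocks.0.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    init_lora_weights="gaussian",
    target_modules=r"up_blocks\.0\..*attn\d\.to_(q|k|v|out\.0)",
)
pipe.unet.add_adapter(lora_config)

# Only the injected LoRA parameters are trainable; the base UNet stays frozen.
trainable = [n for n, p in pipe.unet.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable LoRA tensors, e.g. {trainable[0]}")
```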
Human Vision and Interpretability: A Parallel
I find this kind of layer-wise specialization fascinating because it parallels how the human visual system works.
- The retina captures light and detail
- Cones specialize in color
- Higher brain regions interpret structure and semantics
If someone suffers a stroke in the occipital lobe, they may still receive visual input, but lose the ability to process what they're seeing.
It reminds me that perception isn't just about what's in front of us; it's also shaped by memory, attention, and learned priors. The same is true for generative models.
What is a vibe?
A distinctive feeling or quality that can be sensed—but not easily defined.
In visual art—especially in 2D media like painting or photography—artists use techniques like color theory, rhythm, composition, balance, form, and texture to create emotional effects. Yet the final experience is always subjective. There's a gap between the artist's intention and the viewer's perception.
That ambiguity is kind of the point. But it also makes controlling a model's output harder.
The Problem I See
As a visual thinker, I've always struggled to express emotion purely through words, especially during my teenage years as an immigrant, when language itself felt limiting. I've long believed that text alone can't fully convey mood.
Modern text-to-image models ask users to verbalize their visual intent. That's a mismatch.
There's a cold-start problem baked into the process: you imagine an image in your head, but the moment you translate it into words, you've already distorted it. You've lost something.
Even with tools like ControlNet, the gap between what you intend and what you get remains wide.
So maybe it's time to ask:
Can we create new interfaces—or new embeddings—that allow people to search, steer, and create visually, without needing to describe everything in language first?
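One speculative sketch of what that could look like, building on the CLIP index from the earlier example: start from an example image's embedding, nudge it toward a second reference image, and retrieve neighbors of the blend. The helper and its blend weight are hypothetical, just one way to "steer" visually without writing a prompt.

```python
# Speculative sketch: visual steering in embedding space, no text prompt.
# Assumes unit-normalized CLIP image embeddings like those produced by
# embed_images() and the `index` matrix in the first sketch.
import torch

def steer_and_search(anchor_emb, reference_emb, index, weight=0.3, top_k=3):
    """Blend two (1, d) image embeddings and rank the (n, d) index by cosine similarity.

    `weight` acts as a dial: 0 stays at the anchor image, 1 moves fully
    toward the reference image.
    """
    blended = (1 - weight) * anchor_emb + weight * reference_emb
    blended = blended / blended.norm(dim=-1, keepdim=True)
    scores = (blended @ index.T).squeeze(0)
    top_k = min(top_k, index.shape[0])
    return scores.topk(top_k).indices.tolist()
```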