Chaehong Lee is a product-minded AI engineer based in NYC, with a background in computer vision, LLMs, and generative models. She specializes in transforming state-of-the-art research—like function-calling agents, Stable Diffusion pipelines, and edge-native vision systems—into polished, real-time products that feel intuitive and expressive to use.

Currently at Meta Reality Labs, and formerly at Microsoft Mixed Reality, she combines technical depth with strong systems thinking, shipping experience, and visual design instincts. Her recent work spans sparse autoencoder interpretability, multimodal prompt interfaces, and agentic tool-calling.

Open to AI-first teams and startups—especially in NYC, SF, or remote—building next-gen tools at the intersection of usability, creativity, and machine intelligence. Also open to software solutions architect consulting opportunities where strategic technical insight and user-centered design can bring research to life.

CV
Education
  • Washington University in St. Louis
    B.S. in Computer Science
    ↳ Double Major in Applied Mathematics
    ↳ Minor in Communication Design
  • Washington University in St. Louis
    M.S. in Computer Vision
    ↳ Sparse to Dense Optical Flow with Deep Neural Networks
Experience
  • Meta
    New York, NY
    Software Engineer (2025-)
  • Microsoft
    Seattle, WA
    Research Engineer (2020-2024)
last updated 06.15.25
Diffusion Models: Continuing Studies from a Former Art Student
On understanding how machines perceive vibes, aesthetics, and meaning

Lately, I've been thinking about how images are represented in feature space:
Where in a neural network does the vibe of an image actually reside?

This started while playing with same.energy, a visual search engine powered by CLIP embeddings. Compared to Pinterest, its results feel... more semantically fixated. Less stylistic drift, more concept lock-in. And that got me wondering:
Why does one model generalize toward style, and another toward meaning?

Is it how CLIP encodes images? Or something about the final similarity search?
This small difference pulled me deeper into the mechanics of visual representation in diffusion models—and eventually into studies of human vision, LoRA fine-tuning, and latent space disentanglement.

Inspiration

Same.energy is a visual search engine that uses CLIP and vector similarity search to find and display images based on learned embeddings of images and text.
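To ground that, here's a minimal sketch of the kind of retrieval same.energy describes: embed a text query and a pool of candidate images with CLIP, then rank candidates by cosine similarity in the shared embedding space. The checkpoint name and image paths are placeholders; this illustrates the general recipe, not their actual stack.

```python
# Minimal CLIP retrieval sketch: embed a query and candidate images,
# then rank candidates by cosine similarity in the shared embedding space.
# Checkpoint and image paths are placeholder assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["dreamy_art_01.jpg", "dreamy_art_02.jpg", "boat.jpg"]  # placeholders
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)
    text_inputs = processor(text=["dreamy art"], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)

# Cosine similarity = dot product of L2-normalized embeddings.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(-1)

for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```

The interesting part happens before the dot product: two images only end up near each other if CLIP's encoder already placed them near each other.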

How is this different from Pinterest?

Functionally, they're similar: both surface visual results and let users explore related imagery. Pinterest has been evolving its visual search algorithm around object recognition and product discovery since 2014. But after spending time with same.energy—and as an avid Pinterest user—I noticed something different:

Same.energy seems to fixate on semantic meaning. Pinterest tends to generalize better to stylistic variation.

Here's what I mean:

Second-level query for "dreamy art" on Pinterest
Second-level query for "dreamy art" on same.energy

Pinterest returns images with diverse subjects—fish, boats, ballerinas—while same.energy focuses on a single type of composition: mostly a solitary figure. I saw this across many queries.

Of course, Pinterest's model is proprietary, so it's possible their search algorithm explicitly diversifies results at the last stage. But this behavior got me thinking:

How exactly are images embedded and organized in latent space?
And where—if anywhere—does a vibe live inside the model?
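One mundane explanation for the difference is the re-ranking stage rather than the encoder itself. Below is a hedged sketch of maximal marginal relevance (MMR), a standard way a retrieval system can trade query relevance against result diversity; whether Pinterest or same.energy does anything like this is pure speculation on my part.

```python
import numpy as np

def mmr_rerank(query_emb, candidate_embs, k=10, lambda_=0.7):
    """Maximal marginal relevance: greedily pick items similar to the query
    but dissimilar to what has already been selected. lambda_=1.0 is pure
    relevance (concept lock-in); lower values force more spread."""
    # Assumes all embeddings are already L2-normalized, so dot product = cosine.
    relevance = candidate_embs @ query_emb              # (n,)
    selected, remaining = [], list(range(len(candidate_embs)))
    while remaining and len(selected) < k:
        if not selected:
            best = remaining[int(np.argmax(relevance[remaining]))]
        else:
            chosen = candidate_embs[selected]            # (s, d)
            redundancy = (candidate_embs[remaining] @ chosen.T).max(axis=1)
            mmr = lambda_ * relevance[remaining] - (1 - lambda_) * redundancy
            best = remaining[int(np.argmax(mmr))]
        selected.append(best)
        remaining.remove(best)
    return selected
```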

Related Work

Recent research by Frenkel et al. explores a LoRA-based fine-tuning method for disentangling style and semantic content in SDXL. By training lightweight adapters (LoRAs) in specific layers of Stable Diffusion, they found that certain layers specialize in texture and color, while others focus on structure.

Though their experiments focused on SDXL, the finding raises questions for other diffusion architectures too—like DiTs. Are there "modules" inside these models that correspond to artistic concepts? Can we trace which layers encode color grading, or spatial rhythm, or emotional tone?
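As a rough illustration of the mechanism (not their exact recipe), here's what attaching a trainable low-rank adapter to selected linear layers looks like in plain PyTorch. Which layers end up carrying texture and color versus structure is exactly what experiments like theirs probe; the layer filter below is hypothetical.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: W x + (B A) x * scale."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # only the adapter is trained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # B=0 -> no-op at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

def add_lora_to(module: nn.Module, name_filter):
    """Replace selected nn.Linear submodules (e.g. only certain attention blocks)
    with LoRA-wrapped versions, so adapters can target specific layers."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name_filter(name):
            setattr(module, name, LoRALinear(child))
        else:
            add_lora_to(child, name_filter)
```

Freezing the base weights and training only the adapters in, say, later attention blocks is the kind of layer-wise probe that lets you ask where color grading lives versus spatial structure.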

Human Vision and Interpretability: A Parallel

I find this kind of layer-wise specialization fascinating because it parallels how the human visual system works.

  • The retina captures light and detail
  • Cones specialize in color
  • Higher brain regions interpret structure and semantics

If someone suffers a stroke in the occipital lobe, they may still receive visual input, but lose the ability to process what they're seeing.

A simulation of vision loss from an occipital lobe stroke

It reminds me that perception isn't just about what's in front of us—it's also shaped by memory, attention, and learned priors. Something similar is true for generative models.

What is a vibe?


A distinctive feeling or quality that can be sensed—but not easily defined.

In visual art—especially in 2D media like painting or photography—artists use techniques like color theory, rhythm, composition, balance, form, and texture to create emotional effects. Yet the final experience is always subjective. There's a gap between the artist's intention and the viewer's perception.

That ambiguity is kind of the point. But it also makes controlling a model's output harder.

The Problem I See

As a visual thinker, I've always struggled to express emotion purely through words—especially during my teenage years as a new immigrant, when language itself felt limiting. I've long believed that text alone can't fully convey mood.

Modern text-to-image models ask users to verbalize their visual intent. That's a mismatch.
There's a cold-start problem baked into the process: you imagine an image in your head, but the moment you translate it into words, you've already distorted it. You've lost something.

Even with tools like ControlNet, the gap between what you intend and what you get remains wide.
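For reference, this is roughly what steering with ControlNet looks like using the diffusers library: a Canny edge map pins down composition, but the mood still has to be squeezed into the prompt. Checkpoint names, paths, and the prompt here are placeholder examples.

```python
# A hedged sketch of ControlNet conditioning with diffusers: the edge map fixes
# composition, but the "vibe" still has to be verbalized in the prompt.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Build a Canny edge map from a reference image (placeholder path).
gray = np.array(Image.open("reference_photo.png").convert("L"))
edges = cv2.Canny(gray, 100, 200)
edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe(
    prompt="dreamy art, solitary figure, soft pastel light",
    image=edge_map,
    num_inference_steps=30,
).images[0]
image.save("controlnet_output.png")
```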

So maybe it's time to ask:

Can we create new interfaces—or new embeddings—that allow people to search, steer, and create visually, without needing to describe everything in language first?
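One small version of that idea: let a handful of reference images be the query. Averaging their CLIP image embeddings gives a "mood board" vector you can search with, no prompt required. This is a thought experiment sketched under assumptions (placeholder paths, the same public CLIP checkpoint as above), not a description of any existing product.

```python
# "Mood board as query": average a few reference images' CLIP embeddings and
# rank a candidate pool against that vector instead of a text prompt.
# All paths and the checkpoint name are placeholder assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p) for p in paths]
    with torch.no_grad():
        feats = model.get_image_features(**processor(images=images, return_tensors="pt"))
    return feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize for cosine similarity

candidate_paths = ["pool_001.jpg", "pool_002.jpg", "pool_003.jpg"]
reference_paths = ["ref_fog.jpg", "ref_film_grain.jpg", "ref_blue_hour.jpg"]

candidates = embed(candidate_paths)
mood = embed(reference_paths).mean(dim=0)
mood = mood / mood.norm()                              # re-normalize the averaged vector

scores = candidates @ mood                             # cosine similarity per candidate
for path, score in sorted(zip(candidate_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {path}")
```

The same vector could just as plausibly steer generation, for example as a starting point for interpolation in embedding space.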

Pattern Studies

In this study, I explore the potential of AI-assisted design in transforming natural forms and original artwork into complex, aesthetically pleasing patterns suitable for fashion applications. The process begins with either a hand-drawn sketch or a photograph of a natural object, which serves as the seed for Stable Diffusion models.
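Concretely, that seeding step can be sketched with diffusers' img2img pipeline: the drawing or photo anchors the composition, the prompt nudges it toward a repeatable textile motif, and strength controls how far the model drifts from the seed. Checkpoint, paths, prompt, and strength here are placeholder assumptions rather than exact settings.

```python
# Seeding a pattern from a hand-drawn sketch or photo with Stable Diffusion
# img2img. Checkpoint name, file paths, prompt, and strength are placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

seed_image = Image.open("leaf_sketch.png").convert("RGB").resize((512, 512))

pattern = pipe(
    prompt="seamless botanical textile pattern, flowing linework, muted earth tones",
    image=seed_image,
    strength=0.6,          # lower = stay closer to the seed drawing
    guidance_scale=7.5,
    num_inference_steps=40,
).images[0]
pattern.save("pattern_candidate.png")
```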