FluxSpace: A Conversation with Yusuf Dalva on Semantic Image Editing
With Yusuf Dalva, we explore how FluxSpace, a technology based on transformers and rectified flow, is redefining image editing with semantic precision, opening new possibilities for creativity and research
DukeRem, 21 January 2025

Today’s interview is highly technical but also incredibly fascinating, as it might reveal the future of image generation through AI algorithms.

Imagine a world where it’s possible to modify images with precision and realism simply by using detailed textual descriptions: this is exactly what FluxSpace aims to achieve.

We had the pleasure of speaking with Yusuf Dalva, a PhD candidate at Virginia Tech, who has conducted groundbreaking research on FluxSpace. This technology combines transformers with rectified flow to deliver state-of-the-art results in semantic image editing. Yusuf will guide us through the foundations of this technology, explaining how the features of transformers can be harnessed to modify images with unprecedented control, without requiring additional training.

In this interview, we’ll cover topics such as:

  • The differences between rectified flow models and traditional diffusion models;
  • How FluxSpace enables detailed and disentangled edits on both real and synthetic images;
  • The practical applications of this technology, from video generation to creative image customization;
  • The ethical implications of such advanced editing capabilities.

 

Enjoy the read, and don’t forget to share this article across your networks using the buttons below! Thank you.

 

Introduction and Background
Q: Could you introduce yourself and share a bit about your academic background, research interests, and what led you to the development of FluxSpace?
A: I am Yusuf Dalva, and I am a second-year PhD student at Virginia Tech. I have been involved in research on generative models since the beginning of my graduate studies. I started by studying GANs and their latent space representations and then switched to diffusion-based approaches. My research mainly targets the representations learned by such generative models and how they can be utilized to facilitate visual applications. In the context of FluxSpace, we wanted to investigate how we can control the outputs of flow-matching transformer models (Flux in our case), which was the motivating factor for the project.

 

Conceptual Foundations
Q: What inspired the focus on disentangled semantic editing in rectified flow transformers, and how does FluxSpace address challenges faced by other image editing models?
A: The main motivation was to utilize the generative capabilities of rectified flow transformers. By developing an editing method on top of state-of-the-art models, we wanted to enable image editing with superior visual performance and higher realism. In comparison to other methods, which try to perform editing by manipulating the noise predictions at the output level, we take a different route and manipulate the attention features in order to modify the predicted noise. Since we use Flux as the generative model for our approach, we wanted to leverage the model's pure transformer architecture and its attention layers, which enabled us to achieve precise control over the generated images (covering both stylization and semantics).

Q: Can you briefly explain what rectified flow models are and how they differ from typical diffusion models like Stable Diffusion or plain Flux?
A: Compared to diffusion-based approaches like Stable Diffusion, rectified-flow models generalize the diffusion process by introducing a slightly different optimization objective. Instead of predicting the noise at every generation step, rectified-flow models predict the velocity that transports the noisy latent toward an image. Follow-up studies on training rectified flow models have shown that this velocity-prediction objective can be reformulated as a noise-prediction objective, which is why it can be viewed as a generalization of the diffusion process. In FluxSpace, we use Flux as the generative model, which is itself trained as a rectified flow model.
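
For readers who want the math, here is a minimal sketch of the standard rectified-flow (flow-matching) objective; the exact training recipe behind Flux may differ in details such as timestep weighting.

```latex
% Linear interpolation between a data sample x_0 and Gaussian noise x_1:
x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1]
% The network v_\theta regresses the constant velocity x_1 - x_0:
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, x_1,\, t}\left[\,\lVert v_\theta(x_t, t) - (x_1 - x_0) \rVert_2^2\,\right]
```

Sampling then integrates the learned velocity field from pure noise (t = 1) back to an image (t = 0), typically with a small number of ODE steps.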

 

Technical Innovations
Q: FluxSpace emphasizes disentangled editing. How does your method achieve fine-grained and coarse-level edits without additional training?
A: We paid close attention to the architectural decisions made for Flux. The key observation was that, since Flux has a transformer-based architecture rather than any U-Net-like design, the attention layers should be the component that determines the content. And since there are no residual connections between different transformer blocks, we could reason about each block in isolation in terms of content creation. Because the Flux model already has a semantic understanding of the edit we want to make via text prompts, we were able to both edit the content and keep the edits disentangled, without any additional training.
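
To make the "edit inside the block" idea concrete, here is a minimal PyTorch-style sketch of intervening on a block's attention output through a forward hook. It is a toy illustration of the general mechanism, not the FluxSpace code; the module, the edit direction, and the scale are all hypothetical.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for a transformer block whose attention output we intercept."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        return self.mlp(attn_out)

def make_edit_hook(direction, scale):
    # Shift the attention output along a (hypothetical) semantic direction
    # before the rest of the block sees it.
    def hook(module, inputs, output):
        attn_out, weights = output
        return attn_out + scale * direction, weights
    return hook

block = ToyBlock()
direction = torch.randn(1, 1, 64)              # placeholder for a text-derived direction
handle = block.attn.register_forward_hook(make_edit_hook(direction, scale=0.5))
tokens = torch.randn(2, 16, 64)                # (batch, tokens, dim)
edited = block(tokens)                         # attention output is edited in flight
handle.remove()
```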

Q: What role do the attention layers and joint transformer blocks play in enabling semantic manipulations in FluxSpace?
A: The architecture of Flux is based on a multi-modal diffusion transformer. When the design of the model is inspected further, it consists of dual-stream (joint) transformer blocks and single-stream transformer blocks, all of which include attention layers. The difference is that the dual-stream blocks apply separate transformations to the text and image features. We interpreted the attention layers in these joint transformer blocks as the layers where the semantic content is gradually generated while the text and image features are aligned. For that reason, we perform the semantic manipulations in these blocks and gradually build up the semantic context.
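
As a rough illustration of what "dual-stream with joint attention" means, the sketch below projects text and image tokens with separate weights and then lets them attend to each other in a single attention call. Dimensions, module names, and the absence of normalization or modulation are simplifications on my part, not Flux's actual block layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    """Toy dual-stream attention: per-stream projections, joint attention."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv_txt = nn.Linear(dim, 3 * dim)   # text stream has its own weights
        self.qkv_img = nn.Linear(dim, 3 * dim)   # image stream has its own weights

    def forward(self, txt, img):
        # Project each stream separately, then concatenate along the token axis.
        q_t, k_t, v_t = self.qkv_txt(txt).chunk(3, dim=-1)
        q_i, k_i, v_i = self.qkv_img(img).chunk(3, dim=-1)
        q = torch.cat([q_t, q_i], dim=1)
        k = torch.cat([k_t, k_i], dim=1)
        v = torch.cat([v_t, v_i], dim=1)
        b, n, d = q.shape
        split = lambda t: t.view(b, n, self.heads, d // self.heads).transpose(1, 2)
        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, n, d)
        n_txt = txt.shape[1]
        return out[:, :n_txt], out[:, n_txt:]    # split back into text / image tokens

txt_tokens = torch.randn(1, 8, 64)
img_tokens = torch.randn(1, 16, 64)
txt_out, img_out = JointAttention()(txt_tokens, img_tokens)
```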

Q: Could you elaborate on the linear editing scheme introduced in FluxSpace and its significance in precise attribute-specific modifications?
A: We see the linearity assumption as an essential piece for controlling the strength of the applied edit. During the editing process, we assume that the attention features are added to the noisy latent linearly, following the attention computation. To benefit from this linearity for controllable editing strength, we introduce an editing scheme based on a linear direction in the attention layer outputs. And since we rely on linear directions on the attention outputs, we define those directions via an orthogonal projection of the attention features, which improves disentanglement in editing.
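
Here is a hedged sketch of what a linear, projection-based edit on attention outputs could look like, in the spirit of the description above. The tensors and the exact projection are illustrative assumptions, not the FluxSpace implementation.

```python
import torch

def orthogonal_edit(a_base, a_edit, scale):
    """Move a_base toward a_edit using only the component of a_edit that is
    orthogonal to a_base (per token), so the original content is better preserved."""
    coeff = (a_edit * a_base).sum(-1, keepdim=True) / \
            a_base.pow(2).sum(-1, keepdim=True).clamp_min(1e-8)
    a_orth = a_edit - coeff * a_base          # remove the component parallel to a_base
    return a_base + scale * a_orth            # linear edit with controllable strength

a_base = torch.randn(1, 16, 64)   # attention output for the original prompt
a_edit = torch.randn(1, 16, 64)   # attention output for the editing prompt
edited = orthogonal_edit(a_base, a_edit, scale=0.8)
```

Because the edit is a linear offset, the `scale` parameter directly controls how strongly the attribute is applied.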

Q: How does the masking mechanism contribute to the disentanglement of edits in complex image structures?
A: We designed the masking approach as a refinement over the applied edits. Even though we operate on a very strong foundation model (Flux), we see that the attention layer outputs contain leaks in the attention maps that affect disentanglement negatively. To prevent this effect, we utilize a self-supervised masking approach to eliminate these leaking attention pixels, which can be identified by their low activation values. As an example, if I want to perform a smile edit, this masking focuses the edit on the pixels near the mouth of the edited person, while the remaining pixels are suppressed by the mask threshold.
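
A minimal sketch of that idea, assuming the edit signal's per-token magnitude is a usable proxy for relevance; the normalization and threshold below are illustrative, not the paper's exact mask.

```python
import torch

def masked_edit(a_base, edit_delta, threshold=0.6):
    """Apply an edit only where its signal is strong; suppress it elsewhere."""
    strength = edit_delta.norm(dim=-1, keepdim=True)        # per-token edit magnitude
    strength = (strength - strength.min()) / (strength.max() - strength.min() + 1e-8)
    mask = (strength > threshold).float()                   # ~1 near the edited region, 0 for leaks
    return a_base + mask * edit_delta

a_base = torch.randn(1, 16, 64)
edit_delta = torch.randn(1, 16, 64)   # e.g. the scaled orthogonal component from above
out = masked_edit(a_base, edit_delta)
```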

 

Applications
Q: What are some practical applications or industries where you see FluxSpace having the most impact?
A: I believe this approach can further improve customizable content generation and broaden the spectrum of images that can be generated. The primary application we targeted in this study was image editing, so I believe customizable image generation will improve further. At the same time, since we present our findings on attention layers and transformer blocks, I believe FluxSpace can also be extended to other models, such as video generation models. In addition, I believe our research can bring a new perspective to work investigating the attention features of transformer-based models.

Q: How does FluxSpace handle real-world image editing compared to synthetic ones? Are there limitations or challenges specific to real-world data?
A: Since Flux is a relatively new approach, a robust inversion method does not exist yet; it is an active research area. Considering this, we mainly focused on images generated by Flux rather than real images. However, I believe that once research on inversion for such models matures, our work will become more transferable to the task of real image editing. As an example of the current limitations for real image editing, we present some examples with RF-Inversion where semantic edits do transfer, but are not as successful as on synthetic images.

 

Comparisons with Other Models
Q: In your experiments, FluxSpace outperformed other state-of-the-art methods like LEDITS++ and TurboEdit. What were the key factors that contributed to its superior performance?
A: I believe there are two key factors behind this performance improvement. The first is the power of the foundation model used: Flux in our case, versus SDXL for LEDITS++ and SDXL-Turbo for TurboEdit. On top of that, we target a class of representations that sits one level below what the competing approaches use: the other methods focus on the output noise predictions, while we focus on how those noise predictions are constructed by the attention layers. But I think the primary difference comes from the model being used.

Q: FluxSpace integrates pooled and token-wise text embeddings. How does this compare to the methods used in traditional CLIP or T5-based systems?
A: I can say that we use both CLIP embeddings and T5 embeddings during editing. In the architecture of Flux, the pooled embedding from CLIP is used for feature modulation, and the T5 embeddings are used directly in the attention calculation. When we investigated these different embeddings and how they affect the edit, we realized that they work together, but CLIP is more involved in coarse edits while the T5 embeddings are more involved in fine-grained edits, due to the computations each is involved in. I can also say that this joint structure is what enables us to distinguish between coarse and fine edits.
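
To illustrate the two pathways described here, the sketch below feeds a single pooled text vector through an AdaLN-style scale/shift modulation while token-wise text embeddings join the attention sequence. All names and dimensions are illustrative placeholders, not Flux's actual layer layout.

```python
import torch
import torch.nn as nn

dim = 64
img_tokens  = torch.randn(1, 16, dim)
t5_tokens   = torch.randn(1, 8, dim)     # token-wise text embeddings
clip_pooled = torch.randn(1, dim)        # single pooled text vector

# Pathway 1: pooled embedding -> per-channel scale and shift (coarse control).
to_mod = nn.Linear(dim, 2 * dim)
scale, shift = to_mod(clip_pooled).chunk(2, dim=-1)
img_modulated = img_tokens * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Pathway 2: token-wise embeddings enter the attention sequence (fine-grained control).
joint_sequence = torch.cat([t5_tokens, img_modulated], dim=1)
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
out, _ = attn(joint_sequence, joint_sequence, joint_sequence)
```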

 

User Interaction
Q: How intuitive is the user interface for individuals without a technical background? Could non-experts use FluxSpace for creative tasks?
A: We designed the editing signal to be as simple as possible: one only needs to enter an editing prompt to modify the generated output. The only challenging aspect would be understanding the hyperparameters used in the approach. While this can be a bit problematic at first, once the user gets used to the effect of each hyperparameter, we believe content editing will be straightforward even for non-technical users.

Q: Have you received feedback from artists, designers, or other non-technical users? If so, how has it influenced your research or development?
A: Not yet; we have not consulted such groups for feedback. However, our intention is to make the approach as robust as possible and then ask artists how this work can be extended to assist their creative efforts. In our lab, we always try to look at a project through a user-centered lens, where one priority is to effectively address a real problem faced by a target audience. This work will be no different: we are excited to iterate on this project so that it becomes helpful to the creative community.

 

Ethical Considerations
Q: FluxSpace has powerful editing capabilities. How do you address ethical concerns, such as the potential misuse for deepfakes or misleading media?
A: In this study, we work to explore the editing capabilities of Flux and to customize the images generated by the model with concepts defined by text prompts. As we do not work on images outside the distribution of the model, we do not introduce any new concerns regarding misleading media generation. In addition, we focus on generated images rather than real images. However, we believe such ethical considerations will play a major role in the future, considering that our approach has the potential to be extended to such applications.

Q: Are there safeguards or technical measures in place to prevent unauthorized or unethical use of this technology?
A: I believe that text prompts give us great creative capability, but at the same time they enable applications whose outputs must be regulated to some extent. In FluxSpace, we do not focus on such preventive measures, but I personally believe they should be implemented at the product level.

 

Future Developments
Q: What are the next steps for FluxSpace? Are there plans to extend its capabilities beyond image editing into areas like video or 3D modeling?
A: To be honest, there are many possibilities when it comes to extensions. As a priority, we would like to enhance the robustness of the approach so that it becomes less sensitive to the introduced hyperparameters. Domain-wise, we would like to investigate transformer-based video generation models and see how our findings translate to them. I personally believe that video generation models involve more complex representations than image generation models, and by investigating internal structures like the attention layers we can uncover insights that go beyond the image domain. This can eventually lead to new applications, of course.

Q: How do you see FluxSpace evolving with advancements in generative AI and transformer architectures?
A: I think that as long as transformer-based architectures remain the dominant approach, our findings will have value in the research community. That said, I always believe in constant iteration in research. I certainly expect iterations on our work, which will improve the value of our proposed idea in the long run.

 

Collaborations and Open Source
Q: You’ve made the FluxSpace implementation public. How has the open-source community contributed to its development or application?
A: During development, we iterated on the approach in closed source and later decided to make it public. Even though we have not yet had the chance to improve the approach with feedback from the open-source community, we certainly would like to do so in future iterations.

Q: Are there any collaborations or partnerships you’re currently pursuing to expand FluxSpace’s reach or capabilities?
A: We initiated this project as an internal effort in our lab (GEMLAB at Virginia Tech). But I believe that every new contributor brings a certain amount of potential in terms of iteration, so we prefer to keep an open mind about future collaborations and partnerships.

 

Closing Reflections
Q: What advice would you give to young researchers entering the field of generative AI, particularly those interested in the intersection of technology and creativity?
A: The first piece of advice I would give is to keep an open mind. All of these generative models give us very useful tools along with interesting research questions. I think people should motivate themselves to pursue a particular question they have and should be ready to invest the time and effort it requires. We are currently at a very valuable point in the intersection of generative AI and creativity, and even someone who is a novice in the area can make a significant contribution by asking the right questions. So, I would advise them to keep an open mind and not be afraid to pursue their curiosity.

Q: As a final note, feel free to share anything you’d like to communicate to the Turtle’s AI community—be it artists, AI enthusiasts, or casual readers. Is there a message, insight, or thought you’d like to leave with us?
A: Firstly, I would like to thank the community for giving us a stage to talk about our research. We also appreciate every individual involved in this community. In the era of generative AI, we have experienced very rapid improvements, and much of that is thanks to communities like this one: they help us as researchers identify the needs and weaknesses of current approaches and initiate development efforts. I hope communities such as the one built by Turtle's AI will continue to live on, so that we can keep up this two-way feedback in the future.

 

Thank you so much, Yusuf. We’ll reach out to you in the future to discuss the developments of your research!