Microsoft teases Visual ChatGPT | Turtles AI

DukeRem, 11 March 2023
A new system called Visual ChatGPT combines the conversational skills of ChatGPT with the image processing capabilities of Visual Foundation Models (VFMs). ChatGPT is limited to processing language, whereas VFMs specialize in understanding and generating complex images. Visual ChatGPT lets users interact with ChatGPT through images, posing complex visual questions and requesting corrected results. It also incorporates a Prompt Manager that specifies the input-output formats of VFMs and handles the histories, priorities, and conflicts of different models. The system opens up opportunities to investigate the visual roles of ChatGPT with the help of VFMs, and it is available for public use on GitHub.
Microsoft has announced Visual ChatGPT, a system designed to incorporate different Visual Foundation Models and to change the way a conversational AI interacts with images. Despite its impressive conversational competency and reasoning capabilities, ChatGPT cannot process or generate images from the visual world, which is a significant limitation. Meanwhile, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, excel at understanding and generating images but are limited to specific tasks with one-round, fixed inputs and outputs.
To address these limitations, Microsoft has built Visual ChatGPT, which enables users to interact with ChatGPT by sending and receiving not only text but also images. Users can pose complex visual questions or give visual editing instructions that require several AI models to collaborate over multiple steps. They can also provide feedback and ask for corrected results.
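To make the Prompt Manager idea more concrete, here is a minimal Python sketch of how the input-output formats of VFMs might be written out as plain text for a language model to read. It is an illustration only, not the code from the Visual ChatGPT repository: the `ToolSpec` class, the `build_tool_prompt` function, and the two tool descriptions are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical text description of one Visual Foundation Model."""
    name: str
    usage: str    # when the language model should pick this tool
    inputs: str   # what the tool expects
    outputs: str  # what the tool returns


def build_tool_prompt(tools: List[ToolSpec]) -> str:
    """Render every tool as one line of text that can be placed in a prompt."""
    lines = ["You can use the following tools:"]
    for tool in tools:
        lines.append(
            f"> {tool.name}: {tool.usage} "
            f"Input: {tool.inputs}. Output: {tool.outputs}."
        )
    return "\n".join(lines)


# Illustrative tool list; names and wording are assumptions, not the real prompts.
TOOLS = [
    ToolSpec(
        name="Image Captioning",
        usage="Useful when you want to know what is in an image.",
        inputs="an image path",
        outputs="a text description",
    ),
    ToolSpec(
        name="Text-to-Image",
        usage="Useful when you want to generate an image from a description.",
        inputs="a text description",
        outputs="an image path",
    ),
]

prompt = build_tool_prompt(TOOLS)
print(prompt)
```

Describing each model in text is what lets a language-only system choose among image models, and it also hints at why prompt length matters: every additional tool adds another line to the prompt.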
Microsoft has designed a series of prompts that inject visual model information into ChatGPT, taking into account models with multiple inputs and outputs as well as models that require visual feedback. According to Microsoft, experiments show that Visual ChatGPT is a significant step forward in investigating the visual roles of ChatGPT with the help of Visual Foundation Models. The system is publicly available at https://github.com/microsoft/visual-chatgpt.
The architecture of Visual ChatGPT is straightforward. The system incorporates ChatGPT, and the Prompt Manager bridges the gap between ChatGPT and the different Visual Foundation Models. A user uploads an image and enters a complex language instruction; with the help of the Prompt Manager, Visual ChatGPT then starts a chain of execution of the relevant Visual Foundation Models. Finally, when Visual ChatGPT receives the corresponding hint from the Prompt Manager, it ends the execution pipeline and shows the final result.
Visual ChatGPT still has limitations that must be considered. It depends on ChatGPT and the VFMs, so its performance relies heavily on the accuracy and effectiveness of those models. It requires heavy prompt engineering to describe VFMs in language, which is time-consuming and demands expertise in both computer vision and natural language processing. Its real-time capabilities are limited: to handle a specific task it may invoke multiple VFMs, resulting in slower response times. The maximum token length in ChatGPT may limit the number of foundation models that can be used. Finally, there are security and privacy concerns when plugging and unplugging foundation models, particularly for remote models accessed via APIs. Despite these limitations, Visual ChatGPT has demonstrated great potential and competence for different tasks.
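The chain of execution described above can be sketched as a simple loop. This is a toy illustration under stated assumptions, not Microsoft's implementation: `stub_planner` is a hypothetical keyword matcher standing in for ChatGPT, the two "models" are stub functions that only rename a file path, and the `max_steps` cap is an added safeguard that the article does not mention.

```python
from typing import Callable, Dict, List, Optional, Tuple


# Stub VFMs (hypothetical): each takes an image path and returns a new one.
def remove_object(image: str) -> str:
    return image.replace(".png", "_removed.png")


def to_sketch(image: str) -> str:
    return image.replace(".png", "_sketch.png")


VFMS: Dict[str, Callable[[str], str]] = {
    "remove_object": remove_object,
    "to_sketch": to_sketch,
}

# Keyword -> tool name. A real system would let the language model decide.
KEYWORDS: Dict[str, str] = {"remove": "remove_object", "sketch": "to_sketch"}


def stub_planner(instruction: str, history: List[str]) -> Optional[str]:
    """Stand-in for ChatGPT: pick the next tool, or None when no tool is left."""
    text = instruction.lower()
    for keyword, tool in KEYWORDS.items():
        if keyword in text and tool not in history:
            return tool
    return None


def run_chain(instruction: str, image: str, max_steps: int = 5) -> Tuple[str, List[str]]:
    """Call tools one after another until the planner signals it is done."""
    history: List[str] = []
    for _ in range(max_steps):
        tool = stub_planner(instruction, history)
        if tool is None:              # planner says no more tools are needed
            break
        image = VFMS[tool](image)     # output of one step feeds the next
        history.append(tool)
    return image, history


result, steps = run_chain(
    "Remove the cat, then turn the picture into a sketch", "input.png"
)
print(result)  # -> input_removed_sketch.png
print(steps)   # -> ['remove_object', 'to_sketch']
```

Each pass through the loop is a separate model call, which is one way to see why chaining several VFMs makes responses slower than a single purpose-built model. A real system would replace the stub planner with ChatGPT and the stub functions with actual models.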
Even so, some generation results can be unsatisfactory because of VFM failures and prompt instability, so a self-correction module is needed to check that execution results are consistent with human intentions. Such self-correction behaviour can make the model's reasoning more complex and significantly increase inference time, an issue that is to be addressed in the future.
All things considered, Visual ChatGPT offers a new way for AI models to handle complex visual tasks. The system has been designed to overcome the limitations of both ChatGPT and Visual Foundation Models, enabling users to interact through both images and language. With the significant advancements in AI technology, the future looks promising for the industry.