Jianmin Bao’s research while affiliated with Microsoft and other places

Publications (98)


MageBench: Bridging Large Multimodal Models to Agents
  • Preprint

December 2024

Miaosen Zhang · Qi Dai · Yifan Yang · [...] · Baining Guo

LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents, which demand strong reasoning and planning abilities. Nevertheless, existing benchmarks mostly assess their reasoning abilities in the language part, where the chain-of-thought is entirely composed of text. We consider the scenario where visual signals are continuously updated and required along the decision-making process. Such a vision-in-the-chain reasoning paradigm is more aligned with the needs of multimodal agents, yet it is rarely evaluated. In this paper, we introduce MageBench, a reasoning-capability-oriented multimodal agent benchmark that, while having lightweight environments, poses significant reasoning challenges and holds substantial practical value. The benchmark currently includes three types of environments: WebUI, Sokoban, and Football, comprising a total of 483 different scenarios. It thoroughly validates the agent's knowledge and engineering capabilities, visual intelligence, and interaction skills. The results show that only a few product-level models perform better than random acting, and all of them fall far short of human-level performance. More specifically, we find that current models severely lack the ability to modify their planning based on visual feedback, as well as visual imagination, interleaved image-text long-context handling, and other abilities. We hope that our work will provide optimization directions for LMMs from the perspective of acting as agents. We release our code and data at https://github.com/microsoft/MageBench.
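For illustration, the vision-in-the-chain setting described in the abstract can be pictured as an observe-think-act loop in which a fresh image is injected at every step and the full interleaved history is passed back to the model. The sketch below is a generic stand-in under assumed interfaces (the `lmm.plan` method and the environment API are hypothetical), not the MageBench code.

```python
# Illustrative sketch of a "vision-in-the-chain" agent loop.
# The LMM and environment interfaces are hypothetical stand-ins.

from dataclasses import dataclass
from typing import Any, List


@dataclass
class Step:
    image: Any    # screenshot / rendered frame observed at this step
    thought: str  # model's textual reasoning for this step
    action: str   # action chosen by the model


class VisionChainAgent:
    """Keeps an interleaved image-text history and re-plans after every new frame."""

    def __init__(self, lmm):
        self.lmm = lmm                  # any LMM with a chat-style planning interface
        self.history: List[Step] = []

    def act(self, image) -> str:
        # The prompt interleaves all previous (image, thought, action) triples
        # with the newest observation, so planning can react to visual feedback.
        context = [(s.image, s.thought, s.action) for s in self.history]
        thought, action = self.lmm.plan(context=context, new_image=image)
        self.history.append(Step(image, thought, action))
        return action


def run_episode(env, agent: VisionChainAgent, max_steps: int = 50) -> float:
    obs = env.reset()                   # obs is an image (e.g., a WebUI screenshot)
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```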


Figure 4. Real-world evaluation environments of Realman robot (left) and Franka robot (right).
Figure V. Visual examples of each task on the Google robot driven by our model.
Figure VI. Visual examples of each task on the WidowX robot driven by our model.
Real-world evaluation with the Realman Robot across three tasks. All models are pre-trained on OXE and then fine-tuned on our collected data.
Real-world generalization evaluation with the Realman Robot on unseen tables with additional unseen distractors.


CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
  • Preprint
  • File available

November 2024

·

9 Reads

The advancement of large Vision-Language-Action (VLA) models has significantly improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. While existing VLAs adapted from pretrained large Vision-Language Models (VLMs) have demonstrated promising generalizability, their task performance is still unsatisfactory, as indicated by low task success rates in different environments. In this paper, we present a new advanced VLA architecture derived from VLMs. Unlike previous works that directly repurpose a VLM for action prediction by simple action quantization, we propose a componentized VLA architecture that has a specialized action module conditioned on the VLM output. We systematically study the design of the action module and demonstrate the strong performance enhancement from diffusion action transformers for action sequence modeling, as well as their favorable scaling behaviors. We also conduct comprehensive experiments and ablation studies to evaluate the efficacy of our models with varied designs. The evaluation on 5 robot embodiments in simulation and the real world shows that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds. It exceeds the average success rates of OpenVLA, which has a similar model size (7B) to ours, by over 35% in simulated evaluation and 55% in real robot experiments. It also outperforms the large RT-2-X model (55B) by 18% absolute success rate in simulation. Code and models can be found on our project page (https://cogact.github.io/).
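As a rough picture of the "specialized action module conditioned on VLM output" idea, the sketch below pairs a VLM-produced cognition feature with a small diffusion transformer that denoises an action chunk. Module names, dimensions, and the conditioning scheme are illustrative assumptions, not the released CogACT architecture.

```python
# Sketch of a componentized VLA: a VLM backbone produces a cognition feature,
# and a diffusion action transformer denoises an action sequence conditioned
# on it. Shapes and module names are illustrative assumptions.

import torch
import torch.nn as nn


class DiffusionActionHead(nn.Module):
    def __init__(self, action_dim=7, horizon=16, cond_dim=4096, width=512, depth=6):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, width)
        self.action_in = nn.Linear(action_dim, width)
        self.time_emb = nn.Embedding(1000, width)          # diffusion timestep embedding
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.action_out = nn.Linear(width, action_dim)

    def forward(self, noisy_actions, t, cognition):
        # noisy_actions: (B, horizon, action_dim); cognition: (B, cond_dim)
        cond = (self.cond_proj(cognition) + self.time_emb(t)).unsqueeze(1)    # (B, 1, width)
        x = torch.cat([cond, self.action_in(noisy_actions)], dim=1)
        x = self.blocks(x)
        return self.action_out(x[:, 1:])                   # noise prediction per action step


# Usage sketch: `vlm_feature` would come from the (frozen or fine-tuned) VLM.
head = DiffusionActionHead()
vlm_feature = torch.randn(2, 4096)
noisy = torch.randn(2, 16, 7)
t = torch.randint(0, 1000, (2,))
pred_noise = head(noisy, t, vlm_feature)                   # (2, 16, 7)
```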


REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents

November 2024

·

2 Reads

Commercial video generation models have exhibited realistic, high-fidelity results but are still restricted to limited access. One crucial obstacle for large-scale applications is the expensive training and inference cost. In this paper, we argue that videos contain much more redundant information than images and thus can be encoded with very few motion latents based on a content image. Towards this goal, we design an image-conditioned VAE to encode a video into an extremely compressed motion latent space. This magic Reducio charm enables a 64× reduction of latents compared to a common 2D VAE, without sacrificing quality. Training diffusion models on such a compact representation easily allows for generating 1K-resolution videos. We then adopt a two-stage video generation paradigm, which performs text-to-image and text-image-to-video sequentially. Extensive experiments show that our Reducio-DiT achieves strong performance in evaluation, though trained with limited GPU resources. More importantly, our method significantly boosts the efficiency of video LDMs in both training and inference. We train Reducio-DiT in around 3.2K training hours in total and generate a 16-frame 1024×1024 video clip within 15.5 seconds on a single A100 GPU. Code is released at https://github.com/microsoft/Reducio-VAE.
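To make the 64× figure concrete, here is back-of-envelope arithmetic showing how such a reduction in latent count can arise when an image-conditioned motion latent is downsampled much more aggressively than a standard 8× 2D VAE. The specific 4×4 spatial and 4× temporal factors below are illustrative assumptions, not the paper's exact factorization.

```python
# Back-of-envelope latent count comparison (illustrative factors; the exact
# downsampling factorization used by Reducio may differ).

frames, height, width = 16, 1024, 1024

# A common 2D image VAE keeps 8x spatial downsampling per frame.
latents_2d_vae = frames * (height // 8) * (width // 8)

# An image-conditioned video VAE can push much harder on the *motion* latent,
# e.g. an additional 4x4 spatial and 4x temporal reduction on top of the 8x8,
# because the content image already carries the appearance.
latents_motion = (frames // 4) * (height // 32) * (width // 32)

print(latents_2d_vae // latents_motion)   # -> 64
```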




SynChart: Synthesizing Charts from Language Models

September 2024

·

44 Reads

With the release of GPT-4V(O), its use in generating pseudo labels for multi-modality tasks has gained significant popularity. However, it remains unclear how to build such advanced models from their base large language models (LLMs). This work explores the potential of using LLMs alone for data generation and develops competitive multi-modality models focused on chart understanding. We construct a large-scale chart dataset, SynChart, which contains approximately 4 million diverse chart images with over 75 million dense annotations, including data tables, code, descriptions, and question-answer sets. We trained a 4.2B chart-expert model using this dataset and achieved near-GPT-4O performance on the ChartQA task, surpassing GPT-4V.
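An LLM-only chart-data pipeline of the kind the abstract describes (table → plotting code → rendered image → dense annotations) could look roughly like the sketch below. The prompts, the JSON contract, and the render step are all hypothetical choices made for illustration; this is not the SynChart pipeline.

```python
# Rough sketch of an LLM-only chart-data pipeline: the LLM writes a data table
# and plotting code, the code is rendered to an image, and the same LLM writes
# dense annotations (description, QA pairs). All helper names are hypothetical.

import json
import subprocess
from pathlib import Path


def generate_chart_sample(llm, topic: str, out_dir: Path) -> dict:
    out_dir.mkdir(parents=True, exist_ok=True)

    # 1) Ask the LLM for a data table and matplotlib code (as JSON).
    spec = json.loads(llm(
        f"Return JSON with keys 'table' and 'matplotlib_code' for a chart about {topic}. "
        f"The code must save the figure to 'chart.png'."))

    # 2) Render the chart by executing the generated plotting code.
    (out_dir / "plot.py").write_text(spec["matplotlib_code"])
    subprocess.run(["python", "plot.py"], cwd=out_dir, check=True)

    # 3) Ask the LLM for dense annotations grounded in the table it produced.
    annotations = json.loads(llm(
        f"Given this table {spec['table']}, write a description and "
        f"5 question-answer pairs, as JSON."))

    return {"table": spec["table"], "code": spec["matplotlib_code"],
            "image": str(out_dir / "chart.png"), "annotations": annotations}
```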


Figure 1: Toy illustration of the blocksparse attention in phi-3-small with 2 local blocks and vertical stride of 3. The table shows the Keys/values a query token in block 8 attended to. Blue=local blocks, orange=remote/vertical blocks, gray=blocks skipped.
Figure 3: Scaling law close to the "Data Optimal Regime" (from left to right: phi-1.5, phi-2, phi-3-mini, phi-3-small) versus Llama-2 family of models (7B, 13B, 34B, 70B) that were trained on the same fixed data. We plot the log of MMLU error versus the log of model size.
Figure 7: The demo case shows Phi-3.5-Vision's capability in natural image understanding and reasoning.
Figure 8: Comparison of categorized RAI performance of Phi-3.5-Vision with and without the safety post-training on the VLGuard (left) and Internal (right) benchmark, respectively. It clearly indicates that safety post-training can enhance the RAI performance across nearly all the RAI categories.
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

August 2024

·

24 Reads

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16×3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and is on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.
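The Figure 1 caption above describes a block-sparse attention pattern with 2 local blocks and a vertical stride of 3. As an illustration only, the sketch below builds such a block-level causal mask; the block size, indexing convention, and the actual phi-3-small kernel are not reproduced here.

```python
# Sketch of a block-sparse attention pattern: each query block attends to its
# most recent `local_blocks` blocks plus every `vert_stride`-th earlier block.
# Illustrative only; conventions may differ from the phi-3-small kernel.

import numpy as np


def blocksparse_mask(num_blocks: int, local_blocks: int = 2, vert_stride: int = 3):
    """True where a query block may attend to a key block (causal)."""
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    for q in range(num_blocks):
        for k in range(q + 1):                  # causal: only past/current blocks
            local = q - k < local_blocks        # the most recent local blocks
            vertical = k % vert_stride == 0     # strided "remote"/vertical blocks
            mask[q, k] = local or vertical
    return mask


print(blocksparse_mask(8).astype(int))          # e.g., row 7 shows what block 8 attends to
```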





Citations (51)


... The advent of Generative Adversarial Networks (GANs) improved the synthesis of new glyphs from limited examples and the separation of style and content in font design, enhancing flexibility and realism [2,14,15,22,35,43]. While innovations with diffusion models [25,36,39] and related models [1,12] show promise in artistic text rendering, they still struggle with semantic confusion and style inconsistency. Our work differs from previous research: 1) Rather than stylizing an existing text glyph image, we generate artistic text images based on the text rendering capabilities of the DiT model (Flux [1]). ...

Reference:

FonTS: Text Rendering with Typography and Style Controls
FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation
  • Citing Chapter
  • November 2024

... Recent advancements in video generation techniques have yielded impressive results, particularly in creating short, visually appealing clips [6,8,9,28,69]. These advancements have been powered by increasingly sophisticated generative models, ranging from diffusion models [6,29,49,56] to auto-regressive models [20,39,62,65], supported by large-scale datasets [31,50,51]. These methods have enabled the generation of high-quality, realistic short videos. ...

ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models
  • Citing Conference Paper
  • June 2024

... and score distillation based methods. Instruction-based methods [1,11,23,31,50] typically require an instruction editing dataset to train the diffusion model. Blending-based methods [6,22,27,52] merge the source and target prompts to guide the editing process, while attention-based methods [2,15,18,46] inject the attention feature of the source image. ...

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
  • Citing Conference Paper
  • June 2024

... FateZero [43] preserves motion and structural information by storing comprehensive attention maps at each stage of the inversion process, which are then fused during editing. Some other methods use auxiliary sequences as correspondences, such as flow [6,22,33], depth [8,9,25,34], or edge [70,76] maps, to model temporal motion. For instance, FLATTEN [6] integrates optical flow into diffusion models through flow-guided attention, aligning patches from different frames along consistent flow paths. ...

CCEdit: Creative and Controllable Video Editing via Diffusion Models
  • Citing Conference Paper
  • June 2024

... Prior to our work, many efforts have been made to turn an autoregressive architecture into a generalist model that can handle various visual tasks [2,17,20,22,25,38,56,61,74,82], such as visual question answering, image completion, and semantic segmentation. However, in-context learning for few-shot image manipulation with autoregressive models is still an understudied problem. ...

Towards More Unified In-Context Visual Understanding
  • Citing Conference Paper
  • June 2024

... Diffusion Training and Inference-Time Sampling We use the discrete-time diffusion framework proposed by Ho et al. [17], employing 1,000 timesteps. To stabilize training and improve performance, we incorporate additional strategies: min-SNR reweighting [50], v-diffusion [51,52], self-conditioning [53,54], a sigmoid noise schedule [42], and exponential moving average (EMA) decay. Ablation results are shown in Table 1. ...

Efficient Diffusion Training via Min-SNR Weighting Strategy
  • Citing Conference Paper
  • October 2023
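As background for the min-SNR reweighting [50] mentioned in the excerpt above, here is a minimal sketch of the loss weight for an epsilon-prediction objective, using the commonly reported default gamma = 5; the exact form for each prediction parameterization is given in the cited paper.

```python
# Sketch of min-SNR loss weighting for a discrete-time diffusion model with an
# epsilon-prediction objective (gamma = 5 is a commonly used default).

import torch


def min_snr_weight(alphas_cumprod: torch.Tensor, t: torch.Tensor, gamma: float = 5.0):
    """Per-sample loss weight min(SNR_t, gamma) / SNR_t for epsilon prediction."""
    snr = alphas_cumprod[t] / (1.0 - alphas_cumprod[t])   # SNR_t = alpha_bar_t / (1 - alpha_bar_t)
    return torch.clamp(snr, max=gamma) / snr


# Usage inside a training step (eps_pred / noise come from the model and sampler):
# loss = (min_snr_weight(alphas_cumprod, t)
#         * ((eps_pred - noise) ** 2).mean(dim=(1, 2, 3))).mean()
```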

... question answering. Vision-Language Models (VLMs) like CLIP [64], Gemini [75], and LLaVA [45], have proven highly effective in training on large amounts of noisy image-text data, considerably improving our understanding of visual content through natural language, and bridging the gap between textual annotations and visual data, with broad applicability [14,44,57,63,65,66,80,81,85]. However, these models [40,45] face limitations in contexts like museums, which require a detailed and interdisciplinary understanding of a long tail of objects, and prediction of structured attributes such as age, origin, material, and cultural relevance [7,55,62]. ...

Improving CLIP Fine-tuning Performance
  • Citing Conference Paper
  • October 2023

... While existing detection methods have demonstrated notable successes, they typically encounter challenges in generalizing to images produced by previously unseen generative models (Wang et al., 2023a). One promising avenue to enhance the robustness of detection capabilities involves constructing more extensive training datasets by accumulating a diverse array of natural and synthetic images. ...

DIRE for Diffusion-Generated Image Detection
  • Citing Conference Paper
  • October 2023

... Several prior methods [1,9,29,45,47,53,57] have been introduced that directly generate 3D representations encoding both geometry and texture, e.g., implicit fields [9], point clouds [29], and triplanes [53]. This line of work typically has to preprocess the source 3D data (meshes or multiview images) into the target representations for generative models in a lossy fashion, hence limiting their quality and scalability. ...

RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion

... , M } for M different class names. Recent works [12,25,49] provide a framework to map these similarities to semantic segmentation outputs. To examine the raw alignment of local image tokens v_loc with the corresponding input texts, we perform semantic segmentation following [49] without post-processing or segmentation-specific modifications. ...

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
  • Citing Conference Paper
  • June 2023
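To illustrate the excerpt above, the following is a minimal sketch of mapping local image-token/text similarities to a segmentation map without post-processing; the square-patch assumption, the bilinear upsampling, and the tensor shapes are illustrative choices, not the MaskCLIP implementation.

```python
# Sketch: turn per-patch image-token / class-text similarities into a
# segmentation map (no post-processing). Shapes are illustrative.

import torch
import torch.nn.functional as F


def tokens_to_segmentation(v_loc, text_emb, out_hw):
    """
    v_loc:    (B, N, D) local image tokens (N = h*w patches), L2-normalized
    text_emb: (M, D) text embeddings, one per class name, L2-normalized
    out_hw:   (H, W) output resolution
    """
    B, N, D = v_loc.shape
    h = w = int(N ** 0.5)                                   # assumes a square patch grid
    logits = v_loc @ text_emb.t()                           # (B, N, M) cosine similarities
    logits = logits.transpose(1, 2).reshape(B, -1, h, w)    # (B, M, h, w)
    logits = F.interpolate(logits, size=out_hw, mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)                             # (B, H, W) predicted class indices
```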