Microsoft Corporation

VinVL: Advancing the state of the art for vision-language models

January 15, 2021 at 11:40 pm IST

Humans understand the world by perceiving and fusing information from multiple channels, such as images viewed by the eyes, voices heard by the ears, and other forms of sensory input. One of the core aspirations in AI is to develop algorithms that endow computers with a similar ability: to learn effectively from multimodal data, such as vision-language data, to make sense of the world around us. For example, vision-language (VL) systems allow searching for the images relevant to a text query (or vice versa) and describing the content of an image in natural language.

As illustrated in Figure 1, a typical VL system uses a modular architecture with two modules to achieve VL understanding:

An image encoding module, also known as a visual feature extractor, is implemented using convolutional neural network (CNN) models to generate feature maps of the input image. A CNN-based object detection model trained on the Visual Genome (VG) dataset was the most popular choice before our work.
A vision-language fusion module maps the encoded image and text into vectors in the same semantic space so that their semantic similarity can be computed as the cosine similarity of their vectors. The module is typically implemented using a Transformer-based model, such as OSCAR.
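The similarity computation described for the fusion module can be sketched in a few lines. This is a minimal illustration, assuming the image and text have already been mapped to vectors in a shared space; the function names, the three-dimensional vectors, and the image ids are made up for the example and do not come from OSCAR or VinVL.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_images(text_vec, image_vecs):
    """Return image ids ordered from most to least similar to the text vector."""
    scored = [
        (cosine_similarity(text_vec, vec), image_id)
        for image_id, vec in image_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]


# Toy embeddings, invented for illustration only.
text_query = [0.9, 0.1, 0.0]
images = {
    "dog_on_grass": [0.8, 0.2, 0.1],
    "city_skyline": [0.0, 0.3, 0.9],
    "cat_on_sofa": [0.5, 0.5, 0.2],
}
ranking = rank_images(text_query, images)
```

In a real system the vectors would come from the trained encoders, and the same ranking function would serve both text-to-image and image-to-text retrieval by swapping which side is the query.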

Recently, vision-language pretraining (VLP) has made great progress in improving the vision-language fusion module by pretraining it on a large-scale paired image-text corpus. The most representative approach is to train large Transformer-based models on massive image-text pair data in a self-supervised manner, for example, by predicting masked elements from their context. The pretrained vision-language fusion model can then be fine-tuned to adapt to various downstream vision-language tasks. However, existing VLP methods treat the image encoding module as a black box and have left the visual features unimproved since the development of the classical bottom-up region features in 2017, even though there has been much research progress on image encoding and object detection.
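To make the masked-prediction objective concrete, the sketch below prepares one training example by hiding some tokens and recording what the model would be asked to recover. It is an assumption-laden illustration: the `[MASK]` placeholder, the 15% masking rate, and the function name are conventions borrowed from masked language modeling in general, not details confirmed for OSCAR or VinVL.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string, assumed for illustration


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly hide tokens and return (masked_tokens, targets).

    targets maps each masked position to the original token, which is
    what a model would be trained to predict from the remaining context.
    """
    rng = random.Random(seed)
    masked = []
    targets = {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


caption = "a dog is running on the grass".split()
masked_caption, targets = mask_tokens(caption, mask_prob=0.15, seed=0)
```

In a vision-language setting the context used to predict a masked word would also include the image region features, which is what lets the model learn alignments between words and visual content.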

Here, we introduce recent Microsoft work on improving the image encoding module. Researchers from Microsoft have developed a new object-attribute detection model for image encoding, dubbed VinVL (Visual features in Vision-Language), and performed a comprehensive empirical study to show that visual features matter significantly in VL models. Combining VinVL with state-of-the-art VL fusion modules such as OSCAR and VIVO, the Microsoft VL system sets a new state of the art on all seven major VL benchmarks, achieving the top position on the most competitive VL leaderboards, including Visual Question Answering (VQA), Microsoft COCO Image Captioning, and Novel Object Captioning (nocaps). Most notably, the Microsoft VL system significantly surpasses human performance on the nocaps leaderboard in terms of CIDEr (92.5 vs. 85.3).

Microsoft will release the VinVL model and the source code to the public. Please refer to the research paper and GitHub repository. In addition, VinVL is being integrated into Azure Cognitive Services, powering a wide range of multimodal scenarios (such as Seeing AI, Image Captioning in Office and LinkedIn, and others) to benefit millions of users through the Microsoft AI at Scale initiative.

VinVL: A generic object-attribute detection model

Unlike classical computer vision tasks such as object detection, VL tasks require understanding more diverse visual concepts and aligning them with the corresponding concepts in the text modality. On the one hand, the most popular object detection benchmarks (such as COCO, Open Images, and Objects365) contain annotations for up to 600 object classes, mainly focusing on objects with a well-defined shape (such as car, person) but missing visual objects that occupy amorphous regions (such as grass, sky), which are typically useful for describing an image. The limited and biased object classes make these object detection datasets insufficient for training VL understanding models that are useful in real-world applications. On the other hand, although the VG dataset has annotations for more diverse and unbiased object and attribute classes, it contains only 110,000 images and is statistically too small to learn a reliable image encoding model.

To train our object-attribute detection model for VL tasks, we constructed a large object detection dataset containing 2.49M images for 1,848 object classes and 524 attribute classes by merging four public object detection datasets: COCO, Open Images, Objects365, and VG. As most of these datasets do not have attribute annotations, we adopted a pretraining and fine-tuning strategy to build our object-attribute detection model. We first pretrained an object detection model on the merged dataset, and then fine-tuned the model with an additional attribute branch on VG, making it capable of detecting both objects and attributes. The resulting object-attribute detection model is a Faster-RCNN model with 152 convolutional layers and 133M parameters, which is the largest image encoding model for VL tasks reported to date.
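One practical step in merging detection datasets is giving every class a single id across all sources. The sketch below shows one simple way to do that. It is hypothetical: the text above does not describe how class names were reconciled, so the lowercase-and-strip normalization, the function names, and the toy class lists are all assumptions made for illustration.

```python
def normalize(name):
    """Assumed normalization: trim whitespace and lowercase the class name."""
    return name.strip().lower()


def build_unified_vocabulary(datasets):
    """Map every distinct normalized class name to one shared integer id."""
    names = set()
    for class_names in datasets.values():
        for name in class_names:
            names.add(normalize(name))
    # Sorting makes the id assignment deterministic across runs.
    return {name: idx for idx, name in enumerate(sorted(names))}


def remap_annotations(annotations, vocabulary):
    """Rewrite (image_id, class_name) pairs as (image_id, unified_class_id)."""
    return [
        (image_id, vocabulary[normalize(name)])
        for image_id, name in annotations
    ]


# Toy class lists, invented for illustration only.
toy_datasets = {
    "dataset_a": ["Person", "car", "dog"],
    "dataset_b": ["person", "grass", "sky"],
}
vocab = build_unified_vocabulary(toy_datasets)
remapped = remap_annotations([("img1", "Person"), ("img2", "sky")], vocab)
```

Here the two toy sources share "person", so the merged vocabulary has five classes rather than six. A real merge would also need to handle synonyms and class imbalance, which this sketch does not attempt.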

Our object-attribute detection model can detect 1,594 object classes and 524 visual attributes. As a result, the model can detect and encode nearly all the semantically meaningful regions in an input image, according to our experiments. As illustrated in Figure 2, compared with detections of a classical object detection model (left), our model (right) can detect more visual objects and attributes in an image and encode them with richer visual features, which are crucial for a wide range of VL tasks.

State-of-the-art performance on vision-language tasks

Since the image encoding module is fundamental to VL systems, as illustrated in Figure 1, our new image encoding can be used together with many existing VL fusion modules to improve performance on VL tasks. For example, as reported in Table 1, by simply replacing the visual features produced by the popular bottom-up model with the ones produced by our model, while keeping the VL fusion module (for example, OSCAR and VIVO) intact [1], we observe significant improvement on all seven established VL tasks, often outperforming previous SoTA models by a large margin.

[1] Note that we still perform training for the VL fusion module, but use the same model architecture, training data, and training recipe.

To account for parameter efficiency, we compare models of different sizes in Table 2. Our base model outperforms previous large models on most tasks, indicating that with better image encoding the VL fusion module can be much more parameter-efficient.

Our new VL models, which use the new object-attribute detection model as the image encoding module and OSCAR as the VL fusion module, sit comfortably atop several AI benchmarks as of December 31, 2020, including Visual Question Answering (VQA), Microsoft COCO Image Captioning, and Novel Object Captioning (nocaps). Most notably, our VL model's performance on nocaps substantially surpasses human performance in terms of CIDEr (92.5 vs. 85.3). On the GQA benchmark, our model is also the first VL model to outperform NSM, which contains sophisticated reasoning components deliberately designed for that specific task.

Looking forward

VinVL has demonstrated great potential in improving image encoding for VL understanding. Our newly developed image encoding model can benefit a wide range of VL tasks, as illustrated by examples in this paper. Despite the promising results we obtained, such as surpassing human performance on image captioning benchmarks, our model has by no means reached human-level intelligence in VL understanding. Interesting directions for future work include (1) further scaling up object-attribute detection pretraining by leveraging massive image classification and tagging data, and (2) extending the methods of cross-modal VL representation learning to build perception-grounded language models that can ground visual concepts in natural language, and vice versa, as humans do.

Acknowledgments: This research was conducted by Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Additional thanks go to the Microsoft Research Service Engineering Group for providing computing resources for large-scale modeling. The baseline models used in our experiments are based on the open-source code released in the GitHub repository; we thank all the authors who made their code public, which tremendously accelerated our progress.

Disclaimer

Microsoft Corporation published this content on 15 January 2021 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 15 January 2021 18:09:01 UTC
