Download

Introduction

This dataset provides 484 GAN-edited COCO val2014 images for studying the interpretability of Visual Question Answering (VQA) models. For each image, a human annotator looks at the original image and a natural-language question about it from the VQA dataset, and edits the image so that answering the question consistently on the original and edited images is challenging.

Example question: What sport is shown? (Figure: counterfactual image on the left, original image on the right.)

The dataset contains 4 types of image edits: 1) inpainting a box region, 2) inpainting the background while keeping a box foreground, 3) turning the image black-and-white, and 4) zooming into a part of the original image.

For inpainting, we used a modified DeepFillv2 (arXiv) inpainter, available at https://github.com/zzzace2000/generative_inpainting.
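Edit types 3 and 4 do not need a generative model, so they are easy to illustrate. The sketch below is not the tool used to build the dataset. It assumes that "black-and-white" means a grayscale conversion (here with the common ITU-R BT.601 luma weights) and that zooming means cropping a box and resizing it back to the original size (here with nearest-neighbor sampling). The function names and the list-of-rows image representation are hypothetical, chosen only to keep the example dependency-free.

```python
# Illustrative only: pure-Python versions of edit types 3 and 4.
# An image is a list of rows; each pixel is an (r, g, b) tuple of 0-255 ints.


def to_black_and_white(image):
    """Replace each pixel by its luma (assumed BT.601 weights), keeping 3 channels."""
    out = []
    for row in image:
        new_row = []
        for r, g, b in row:
            y = round(0.299 * r + 0.587 * g + 0.114 * b)
            new_row.append((y, y, y))
        out.append(new_row)
    return out


def zoom(image, left, top, right, bottom):
    """Crop the box [left, right) x [top, bottom) and scale it back to the
    original image size using nearest-neighbor sampling."""
    height, width = len(image), len(image[0])
    box_w, box_h = right - left, bottom - top
    if box_w <= 0 or box_h <= 0:
        raise ValueError("box must have positive width and height")
    return [
        [image[top + (y * box_h) // height][left + (x * box_w) // width]
         for x in range(width)]
        for y in range(height)
    ]
```

In practice an image library would handle both operations; the point here is only to show what the two edits do to the pixels.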

Images

We provide the GAN-edited images and the original COCO images in ./counterfactual/{id}.png and ./original/{id}.png, respectively. The mapping between image ids in our dataset and COCO ids is available in the annotations.
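Given this layout, edited and original images can be paired by file name. The following is a minimal sketch that assumes the archive has been extracted so that the counterfactual and original folders sit under a common root directory; the helper name list_image_pairs is hypothetical.

```python
from pathlib import Path


def list_image_pairs(root):
    """Return {id: (counterfactual_path, original_path)} for every id
    that has a .png file in both folders."""
    root = Path(root)
    edited = {p.stem: p for p in (root / "counterfactual").glob("*.png")}
    originals = {p.stem: p for p in (root / "original").glob("*.png")}
    shared = sorted(edited.keys() & originals.keys())
    return {image_id: (edited[image_id], originals[image_id]) for image_id in shared}
```

Calling list_image_pairs(".") from the dataset root would then give one entry per image id. Ids are kept as strings because they are taken from the file names.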

Annotations

We provide the URLs of the original COCO images, the original VQA questions, and the image editing annotations in annotations.json. At the moment we do not have VQA answers for the edited images, so the edited images cannot be used to evaluate the consistency of VQA models. Nevertheless, they may help as visualizations that explain a VQA model’s consistency to a human user.
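The exact schema of annotations.json is not described here, so the sketch below deliberately assumes no field names. It only loads the file with the standard json module and reports the top-level structure, which is a reasonable first step before writing code against specific keys. The helper names load_annotations and summarize are hypothetical.

```python
import json


def load_annotations(path="annotations.json"):
    """Parse the annotation file and return the resulting Python object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize(data):
    """Describe the top-level structure without assuming any field names."""
    if isinstance(data, dict):
        keys = sorted(data)[:5]  # a few sample keys
    elif isinstance(data, list) and data and isinstance(data[0], dict):
        keys = sorted(data[0])  # keys of the first record
    else:
        keys = []
    size = len(data) if isinstance(data, (dict, list)) else None
    return {"type": type(data).__name__, "size": size, "keys": keys}
```

Running summarize(load_annotations()) from the dataset root shows whether the file is a list of records or a dictionary, and which keys to look for when mapping dataset ids to COCO ids.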

References

If you use our dataset as part of published research, please cite the following paper:

@article{Alipour_2021,
	doi = {10.22541/au.162464875.59047443/v1},
	url = {https://doi.org/10.22541%2Fau.162464875.59047443%2Fv1},
	year = 2021,
	month = {jun},
	publisher = {Authorea, Inc.},
	author = {Kamran Alipour and Arijit Ray and Xiao Lin and Michael Cogswell and Jurgen Schulze and Yi Yao and Giedrius Burachas},
	title = {Improving Users{\textquotesingle} Mental Model with Attention-directed Counterfactual Edits}
}