Can Vision Transformers Learn without Natural Images?

Authors

Kodai Nakashima

National Institute of Advanced Industrial Science and Technology (AIST)

Hirokatsu Kataoka

National Institute of Advanced Industrial Science and Technology (AIST)

Asato Matsumoto

National Institute of Advanced Industrial Science and Technology (AIST)

Kenji Iwata

National Institute of Advanced Industrial Science and Technology (AIST)

Nakamasa Inoue

Tokyo Institute of Technology

Yutaka Satoh

National Institute of Advanced Industrial Science and Technology (AIST)

Proceedings:

Proceedings of the AAAI Conference on Artificial Intelligence, 36

Issue:

No. 2: AAAI-22 Technical Tracks 2

Track:

AAAI Technical Track on Computer Vision II

Abstract:

Is it possible to complete Vision Transformer (ViT) pre-training without natural images and human-annotated labels? This question has become increasingly relevant in recent months: current ViT pre-training tends to rely heavily on a large number of natural images and human-annotated labels, and the use of natural images has raised problems related to privacy violation, inadequate fairness protection, and the need for labor-intensive annotations. In this paper, we experimentally verify that the results of the formula-driven supervised learning (FDSL) framework are comparable with, and can even partially outperform, sophisticated self-supervised learning (SSL) methods like SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. We also consider ways to reorganize FractalDB generation based on our tentative conclusion that there is room for configuration improvements in the iterated function system (IFS) parameter settings of such databases. Moreover, we show that while ViTs pre-trained without natural images produce visualizations that are somewhat different from ImageNet pre-trained ViTs, they can still interpret natural image datasets to a large extent. Finally, in experiments using the CIFAR-10 dataset, we show that our model achieved an accuracy of 97.8%, which is comparable to the 97.4% achieved with SimCLRv2 and the 98.0% achieved with ImageNet pre-training.

DOI:

10.1609/aaai.v36i2.20094
