Proceedings:
No. 1: AAAI-19, IAAI-19, EAAI-20
Volume
Issue:
Proceedings of the AAAI Conference on Artificial Intelligence, 33
Track:
Student Abstract Track
Abstract:
Most advances in deep learning (DL) have occurred in domains such as computer vision and natural language processing, where abundant training data is available. A major obstacle to leveraging DL techniques for malware analysis is the lack of sufficiently large labeled datasets. In this paper, we take the first steps towards building a model that can synthesize a labeled dataset of malware images using a generative adversarial network (GAN). Such a model can be used to perform data augmentation for training a classifier. Furthermore, the model can be shared publicly so that the community can reap the benefits of the dataset without the original dataset being shared. First, we show the underlying idiosyncrasies of malware images and why existing data augmentation techniques, as well as traditional GAN training, fail to produce quality artificial samples. Next, we propose a new method for training GANs in which we explicitly embed prior domain knowledge about the dataset into the training procedure. We show improvements in training stability and sample quality, assessed on different metrics. Our experiments show substantial improvement over baselines and promise for using such a generative model in malware visualization systems.
DOI:
10.1609/aaai.v33i01.330110033