Image saliency detection has recently witnessed rapid progress due to deep neural networks. However, there still exist many important problems in the existing deep learning based methods. Pixel-wise convolutional neural network (CNN) methods suffer from blurry boundaries due to the convolutional and pooling operations. While region-based deep learning methods lack spatial consistency since they deal with each region independently. In this paper, we propose a novel salient object detection framework using a superpixelwise variational autoencoder (SuperVAE) network. We first use VAE to model the image background and then separate salient objects from the background through the reconstruction residuals. To better capture semantic and spatial contexts information, we also propose a perceptual loss to take advantage from deep pre-trained CNNs to train our SuperVAE network. Without the supervision of mask-level annotated data, our method generates high quality saliency results which can better preserve object boundaries and maintain the spatial consistency. Extensive experiments on five wildly-used benchmark datasets show that the proposed method achieves superior or competitive performance compared to other algorithms including the very recent state-of-the-art supervised methods.