Image-based survival prediction models can facilitate doctors in diagnosing and treating cancer patients. With the advance of digital pathology technologies, the big whole slide images (WSIs) provide increasing resolution and more details for diagnosis. However, the gigabyte-size WSIs would make most models computationally infeasible. To this end, instead of using the complete WSIs, most of existing models only use a pre-selected subset of key patches or patch clusters as input, which might fail to completely capture the patient's tumor morphology. In this work, we aim to develop a novel survival analysis model to fully utilize the complete WSI information. We show that the use of a Vision Transformer (ViT) backbone, together with convolution operations involved in it, is an effective framework to improve the prediction performance. Additionally, we present a post-hoc explainable method to identify the most salient patches and distinct morphology features, making the model more faithful and the results easier to comprehend by human users. Evaluations on two large cancer datasets show that our proposed model is more effective and has better interpretability for survival prediction.