We develop a Multimodal Spatiotemporal Neural Fusion network for MTL (MSNF-MTCL) to predict 5 important students' retention risks: future dropout, next semester dropout, type of dropout, duration of dropout and cause of dropout. First, we develop a general purpose multi-modal neural fusion network model MSNF for learning students' academic information representation by fusing spatial and temporal unstructured advising notes with spatiotemporal structured data. MSNF combines a Bidirectional Encoder Representations from Transformers (BERT)-based document embedding framework to represent each advising note, Long-Short Term Memory (LSTM) network to model temporal advising note embeddings, LSTM network to model students' temporal performance variables and students' static demographics altogether. The final fused representation from MSNF has been utilized on a Multi-Task Cascade Learning (MTCL) model towards building MSNF-MTCL for predicting 5 student retention risks. We evaluate MSNF-MTCL on a large educational database consists of 36,445 college students over 18 years period of time that provides promising performances comparing with the nearest state-of-art models. Additionally, we test the fairness of such model given the existence of biases.