It is important for task-oriented dialogue systems to discover the dialogue structure (i.e. the general dialogue flow) from dialogue corpora automatically. Previous work models dialogue structure by extracting latent states for each utterance first and then calculating the transition probabilities among states. These two-stage methods ignore the contextual information when calculating the probabilities, which makes the transitions between the states ambiguous. This paper proposes a conversational graph (CG) to represent deterministic dialogue structure where nodes and edges represent the utterance and context information respectively. An unsupervised Edge-Enhanced Graph Auto-Encoder (EGAE) architecture is designed to model local-contextual and global-structural information for conversational graph learning. Furthermore, a self-supervised objective is introduced with the response selection task to guide the unsupervised learning of the dialogue structure. Experimental results on several public datasets demonstrate that the novel model outperforms several alternatives in aggregating utterances with similar semantics. The effectiveness of the learned dialogue structured is also verified by more than 5% joint accuracy improvement in the downstream task of low resource dialogue state tracking.