This paper presents a robust multi-view method for tracking people in 3D scene. Our method distinguishes itself from previous works in two aspects. Firstly, we define a set of binary spatial relationships for individual subjects or pairs of subjects that appear at the same time, e.g. being left or right, being closer or further to the camera, etc. These binary relationships directly reflect relative positions of subjects in 3D scene and thus should be persisted during inference. Secondly, we introduce an unified probabilistic framework to exploit binary spatial constraints for simultaneous 3D localization and cross-view human tracking. We develop a cluster Markov Chain Monte Carlo method to search the optimal solution. We evaluate our method on both public video benchmarks and newly built multi-view video dataset. Results with comparisons showed that our method could achieve state-of-the-art tracking results and meter-level 3D localization on challenging videos.