Depth estimation in real-world applications requires precise responses to fast motion and challenging lighting conditions. Event cameras use bio-inspired event-driven sensors that provide instantaneous and asynchronous information of pixel-level log intensity changes, which makes them suitable for depth estimation in such challenging conditions. However, as the event cameras primarily provide asynchronous and spatially sparse event data, it is hard to provide accurate dense disparity map in stereo event camera setups - especially in estimating disparities on local structures or edges. In this study, we develop a novel deep event stereo network that reconstructs spatial intensity image features from embedded event streams and leverages the event features using the reconstructed image features to compute dense disparity maps. To this end, we propose a novel event-to-image translation network with a cross-semantic attention mechanism that calculates the global semantic context of the event features for the intensity image reconstruction. In addition, a feature aggregation module is developed for accurate disparity estimation, which modulates the event features with the reconstructed image features by a stacked dilated spatially-adaptive denormalization mechanism. Experimental results reveal that our method can outperform the state-of-the-art methods by significant margins both in quantitative and qualitative measures.