Information on social media spreads through an underlying diffusion network that connects people with common interests and opinions. This diffusion network often comprises multiple layers, each capturing the spreading dynamics of a certain type of information characterized by, for example, topic, language, or attitude. Researchers have previously proposed methods to infer these underlying multilayer diffusion networks from observed spreading patterns, but little is known about how well these methods perform across the range of realistic spreading data. In this paper, we conduct an extensive series of synthetic data experiments to systematically analyze the performance of the multilayer diffusion network inference framework under varied network structures (e.g., density, number of layers) and information diffusion settings (e.g., cascade size, layer mixing) that are designed to mimic real-world spreading on social media. Our results show extreme variation in the performance of the inference framework: notably, it achieves much higher accuracy when inferring a denser diffusion network, while it fails to decompose the diffusion network correctly when most cascades in the data reach only a limited audience. By demonstrating the conditions under which the inference accuracy is extremely low, our paper highlights the need to carefully evaluate the applicability of the inference before running it on real data. Practically, our results serve as a reference for this evaluation, and our publicly available implementation, which outperforms previous implementations in accuracy, supports further testing under customized settings.