Recommender systems are influential for many internet applications. As the size of the dataset provided for a recommendation model grows rapidly, how to utilize such amount of data effectively matters a lot. For a typical Click-Through-Rate(CTR) prediction model, the amount of daily samples can probably be up to hundreds of terabytes, which reaches dozens of petabytes at an extreme-scale when we take several days into consideration. Such data makes it essential to train the model parallelly and continuously. Traditional asynchronous stochastic gradient descent (ASGD) and its variants are proved efficient but often suffer from stale gradients. Hence, the model convergence tends to be worse as more workers are used. Moreover, the existing adaptive optimizers, which are friendly to sparse data, stagger in long-term training due to the significant imbalance between new and accumulated gradients. To address the challenges posed by extreme-scale data, we propose: 1) Staleness normalization and data normalization to eliminate the turbulence of stale gradients when training asynchronously in hundreds and thousands of workers; 2) SWAP, a novel framework for adaptive optimizers to balance the new and historical gradients by taking sampling period into consideration. We implement these approaches in TensorFlow and apply them to CTR tasks in real-world e- commerce scenarios. Experiments show that the number of workers in asynchronous training can be extended to 3000 with guaranteed convergence, and the final AUC is improved by more than 5 percentage.