Deep learning has become increasingly popular in computer vision, often attaining near or above human-level performance on various vision tasks. However, recent work has also demonstrated that deep neural networks are highly vulnerable to adversarial examples: inputs that closely resemble the original data but fool the model into assigning them to the wrong class. In this work, we propose a novel method for training deep learning models on an image classification task. Instead of the conventional cross-entropy loss, we train our model with a reward-based optimization function similar to the vanilla policy gradient method in reinforcement learning. An empirical evaluation on the CIFAR-10 dataset shows that, under adversarial training, our method outperforms the same model architecture trained with the cross-entropy loss. Our method also generalizes better from training data to test data: the difference between train accuracy and test accuracy stays below 2% most of the time, whereas for the cross-entropy model it stays above 2% most of the time.
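To make the idea of a reward-based objective concrete, the following is a minimal sketch of a REINFORCE-style loss for a single classification example, shown next to cross-entropy for comparison. It is not the exact formulation used in this work: the reward values (+1 for a correct class, -1 for a wrong one), the sampling of the predicted class from the softmax output, and all function names are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_class(probs, rng):
    """Sample a class index from the predicted distribution (the 'action')."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reward_loss(logits, label, action, r_correct=1.0, r_wrong=-1.0):
    """Policy-gradient-style loss for one example: -reward * log pi(action | x).

    The +1 / -1 reward values are assumed here, not taken from the paper.
    """
    probs = softmax(logits)
    reward = r_correct if action == label else r_wrong
    return -reward * math.log(probs[action])


def cross_entropy_loss(logits, label):
    """Conventional cross-entropy loss, for comparison."""
    return -math.log(softmax(logits)[label])


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 0.5, -1.0]  # hypothetical model outputs for 3 classes
    label = 0
    action = sample_class(softmax(logits), rng)
    print("sampled class:", action)
    print("reward-based loss:", reward_loss(logits, label, action))
    print("cross-entropy loss:", cross_entropy_loss(logits, label))
```

Under these assumed rewards, the loss coincides with cross-entropy whenever the sampled class equals the true label. When a wrong class is sampled, the negative reward pushes probability mass away from that class instead.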