Attention-based approaches have proven effective in image captioning. Attention can be applied to text, known as semantic attention, or to the image, known as spatial attention. We chose to implement the latter, since the main difficulty in image captioning is detecting the objects in an image properly. In this work, we develop an approach that extracts features from images using two different convolutional neural networks and combines these features with an attention model to generate captions with an RNN. We adopted Xception and InceptionV3 as our CNNs and GRU as our RNN. We evaluated our proposed model on the Flickr8k dataset translated into Bengali, so that captions can be generated in Bengali using visual attention.
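The spatial-attention step described above can be sketched as follows. This is a minimal illustration (not the paper's actual code), assuming a Bahdanau-style additive attention in which the decoder's GRU hidden state scores each CNN region feature; the shapes, weight names (`W1`, `W2`, `v`), and random inputs are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the region scores
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(features, hidden, W1, W2, v):
    """features: (L, D) CNN region features; hidden: (H,) decoder GRU state.

    Returns a context vector (weighted sum of region features) and the
    attention weights over the L image regions.
    """
    # score_i = v . tanh(W1 f_i + W2 h)  -- additive (Bahdanau-style) scoring
    scores = np.tanh(features @ W1.T + hidden @ W2.T) @ v  # (L,)
    alpha = softmax(scores)                                # (L,) attention weights
    context = alpha @ features                             # (D,) context vector
    return context, alpha

# Illustrative dimensions: 64 image regions, 256-d CNN features,
# 128-d GRU state, 100-d attention space (all assumed, not from the paper).
rng = np.random.default_rng(0)
L, D, H, A = 64, 256, 128, 100
features = rng.standard_normal((L, D))
hidden = rng.standard_normal(H)
W1 = rng.standard_normal((A, D))
W2 = rng.standard_normal((A, H))
v = rng.standard_normal(A)

context, alpha = additive_attention(features, hidden, W1, W2, v)
print(context.shape)  # (256,)
```

At each decoding step, the GRU would consume this context vector together with the previous word's embedding to predict the next Bengali word; the weights `alpha` indicate which image regions the model attends to.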