Visible light communication (VLC) is a promising candidate for achieving low cost, high data rates, and massive coverage in future wireless networks. Supporting multimedia or broadcasting services first requires determining the user's position. However, the traditional received signal strength (RSS) fingerprint-based positioning approach suffers from random signal fluctuation, which significantly limits positioning performance. To tackle this problem, this paper presents a deep learning scheme that combines a regression deep neural network (DNN) with a convolutional auto-encoder (CAE) to provide robust indoor positioning. Instead of positioning from individual fluctuating RSS readings, the proposed method takes as input an RSS temporal image (RTI) composed of a set of consecutive RSS readings. By exploiting the spatial and temporal dependencies in the RTI data with a stacked denoising CAE, the proposed network is expected to learn more complex and consistent features from the fluctuating RSS readings and to produce better location estimates. Simulation results show that the proposed method provides more stable positioning results from fluctuating RSS data than a traditional DNN-based method.
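As an illustration of the RTI input described above, the following minimal sketch stacks consecutive RSS readings into two-dimensional arrays with a sliding window. The layout (rows as time steps, columns as transmitters), the window length, the function name `build_rti`, and the sample values are all assumptions made for illustration; the abstract does not specify how the RTI is arranged.

```python
from typing import List, Sequence


def build_rti(readings: Sequence[Sequence[float]], window: int) -> List[List[List[float]]]:
    """Stack `window` consecutive RSS vectors into 2-D images.

    Assumed layout: each image has one row per time step and one
    column per transmitter. Returns one image per window position.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    images = []
    # Slide the window one reading at a time over the sequence.
    for start in range(len(readings) - window + 1):
        image = [list(readings[t]) for t in range(start, start + window)]
        images.append(image)
    return images


# Hypothetical RSS readings from 3 transmitters at 4 consecutive time steps.
rss = [
    [0.9, 0.5, 0.2],
    [1.0, 0.4, 0.3],
    [0.8, 0.6, 0.2],
    [0.9, 0.5, 0.1],
]
rtis = build_rti(rss, window=3)
```

With four readings and a window of three, this yields two overlapping images, each of which could then be passed to a convolutional model in place of a single RSS vector.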