Spatiotemporal Fusion of Land Surface Temperature Based on a Convolutional Neural Network

Due to the tradeoff between spatial and temporal resolutions commonly encountered in remote sensing, no single satellite sensor can provide fine spatial resolution land surface temperature (LST) products with frequent coverage. This situation greatly limits applications that require LST data with fine spatiotemporal resolution. Here, a deep learning-based spatiotemporal temperature fusion network (STTFN) method for the generation of fine spatiotemporal resolution LST products is proposed. In STTFN, a multiscale fusion convolutional neural network is employed to build the complex nonlinear relationship between input and output LSTs. Thus, unlike other LST spatiotemporal fusion approaches, STTFN is able to form the potentially complicated relationships through the use of training data, without manually designed mathematical rules, making it more flexible and intelligent than other methods. In addition, two target fine spatial resolution LST images are predicted and then integrated by a spatiotemporal-consistency (STC)-weighting function to take advantage of the STC of LST data. A set of analyses using two real LST data sets obtained from Landsat and the moderate resolution imaging spectroradiometer (MODIS) was undertaken to evaluate the ability of STTFN to generate fine spatiotemporal resolution LST products. The results show that, compared with three classic fusion methods [the enhanced spatial and temporal adaptive reflectance fusion model (ESTARFM), the spatiotemporal integrated temperature fusion model (STITFM), and the two-stream convolutional neural network for spatiotemporal image fusion (StfNet)], the proposed network produced the most accurate outputs [average root mean square error (RMSE) < 1.40 °C and average structural similarity (SSIM) > 0.971].

SATELLITE-DERIVED land surface temperature (LST) is of considerable importance to a diverse array of studies including environmental and climatic change [1], [2], land and water energy exchange with the atmosphere [3], [4], and ecological processes [5], [6]. In such studies, LST products with fine spatiotemporal resolution are often desired [7]. For instance, as a key variable in studies of the urban heat island (UHI), LST must be consistent with simulated hourly energy consumption data from urban buildings to evaluate the UHI impact on energy use [8]. However, due to technological and budget limitations, currently available satellite LST products do not have both fine spatial and fine temporal resolutions. Specifically, LST products with fine spatial resolution inevitably have coarse temporal resolution and vice versa [9]. This situation arises mainly from the tradeoff between the spatial and temporal resolutions of satellite sensors and has greatly limited the potential of satellite LST products in various applications.
A variety of methods have been proposed for the generation of fine spatiotemporal resolution LST products [9]-[15]. Due to the advantages arising from the concurrent use of neighboring spatiotemporal change information, spatiotemporal fusion methods have been widely adopted [11], [16], [17]. Spatiotemporal fusion methods were originally used for the generation of fine spatiotemporal resolution reflectance imagery using images acquired from different satellite sensors with complementary spatial and temporal characteristics [18]-[20]. Example methods include the multisensor multiresolution technique (MMT) [21], the spatial and temporal adaptive reflectance fusion model (STARFM) [22], an enhanced version of STARFM (ESTARFM) [23], and the flexible spatiotemporal data fusion (FSDAF) [24]. Given that both LST and reflectance are continuous land surface variables, some research has adopted such spatiotemporal fusion methods to estimate fine spatiotemporal LST products directly. For example, Liu and Weng [25] generated a series of Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER)-like LST products using STARFM for community health research, and Ma et al. [26] fused moderate resolution imaging spectroradiometer (MODIS) and Landsat LST data using ESTARFM to generate a Landsat-like LST product, which was required to estimate surface evaporation. Nevertheless, some characteristics of LST differ from those of reflectance. For example, reflectance often varies slowly with time, while LST can change rapidly [27]. Consequently, spatiotemporal fusion methods appropriate for the generation of reflectance imagery may not always be appropriate for the generation of LST products.
Various means have been proposed to enhance the ability of spatiotemporal fusion methods to generate fine spatial resolution LST products. They include considering the correlation between pixels in LST products, introducing temporal change models of LST into the analysis, and increasing the number of satellite sensors. For example, Huang et al. [28] improved the weight function in STARFM with the aid of bilateral filtering to generate fine spatiotemporal resolution LST products of urban regions, and Wu et al. [29] used a variation-based constrained model to produce accurate Landsat-like LST products. By adding the annual temperature cycle (ATC) to ESTARFM, Weng et al. [30] put forward the spatiotemporal adaptive data fusion algorithm for temperature mapping (SADFAT) to predict thermal radiance and LST data. To produce diurnal LST products at the Landsat scale, Wu et al. [11] designed a spatiotemporal integrated temperature fusion model (STITFM) for the estimation of fine temporal and spatial resolution LST products, by integrating data from arbitrary satellite sensors, including multiscale polar-orbiting and geostationary satellites. Recently, several hybrid methods have been presented to take full advantage of different techniques and thus further enhance spatiotemporal fusion. For instance, Quan et al. [16] designed a unified framework to blend spatiotemporal temperatures (BLEST) derived from Landsat, MODIS, and geostationary satellites based on the integration of temporal interpolation, spatial downscaling, and weight function-based fusion. Alternatively, to enhance the spatial resolution of temporally dense LST data series, Xia et al. [17] proposed a weighted combination of kernel-driven and fusion-based methods (CKFM), which can simultaneously inherit the advantages and overcome the shortcomings of each method.
Although great progress has been made, spatiotemporal LST fusion remains a challenge because the relations between input and output LST may not be appropriately specified. For example, in some conventional LST fusion techniques [11], [28], [29], it is assumed that the output LST product can be expressed as a linear combination of the inputs, which may not be appropriate if LST changes nonlinearly over time. Although some research has addressed the effect of temporal variations of LST and employed mathematical or physical models [16], [30], these theoretical models are not always suitable because LST can be a highly changeable variable. For instance, in BLEST and SADFAT, ATC and diurnal temperature cycle (DTC) models based on prior knowledge are applied to describe the annual and diurnal changes of LST. While the ATC and DTC models can to some extent reflect the temporal variation of LSTs, they may perform poorly when abrupt changes in LST occur and when some latent relations between the input and output LST products do not conform to the prior knowledge [31]. Furthermore, these methods are based on original low-level features such as image texture and location [32]. Therefore, they may fail to differentiate between objects and thus apply the same model parameters to various objects whose LSTs may change differently.
Inspired by the powerful nonlinear representation ability of the convolutional neural network (CNN) [33], several CNN-based spatiotemporal reflectance fusion methods have recently been proposed. A five-layer CNN was, for example, put forward by Song et al. [34] to express the nonlinear relationships between reflectances estimated from MODIS and Landsat data. Alternatively, Tan et al. [35] designed a deep convolutional spatiotemporal fusion network (DCSTFN) to obtain high spatiotemporal resolution images directly. In addition, Liu et al. [36] presented a two-stream CNN for spatiotemporal image fusion (StfNet) that accommodated temporal dependency. CNN-based fusion methods utilize end-to-end CNNs to automatically form the nonlinear relationship between observed pairs of coarse and fine spatial resolution images and use this to predict the target fine spatial resolution image. Because high-level features (e.g., edges of objects and their mutual relations) [37], which contain abundant semantic information, can be extracted from the input data through a CNN, the established relations between input and output are more realistic; thus, more favorable results can be acquired. However, there are also some limitations when such methods are adopted for the generation of LST products. For example, both the five-layer CNN and DCSTFN lose some significant fine spatial information in the prediction, and StfNet is a shallow network that has limited ability to form the potentially complex nonlinear relationship between the input and output LSTs. Therefore, for the spatiotemporal fusion of LST products, a CNN with enhanced capacity for nonlinear representation would be of considerable value.
This article proposes a novel spatiotemporal temperature fusion network (STTFN) method for the accurate generation of fine spatiotemporal resolution LST products. For enhanced modeling of the potentially complex nonlinear relationships between input and output LSTs, STTFN uses a multiscale fusion CNN. The multiscale fusion CNN contains three parts: 1) super-resolution of the temporal change between the coarse spatial resolution LSTs at the target and neighboring dates; 2) high-level feature abstraction of the neighboring fine spatial resolution data; and 3) integration of the extracted multiscale features. Furthermore, to take advantage of the temporal consistency (TC) commonly observed in a series of LST images, STTFN uses two fine-coarse spatial resolution LST image pairs observed before and after the target date to train two multiscale fusion CNNs and then combines their predictions using a spatiotemporal-consistency (STC)-weighting function. Considering the special characteristics of temperature, STTFN differs from existing deep learning-based spatiotemporal fusion methods in four ways: 1) in the super-resolution process, a fine spatial resolution image is used to provide fine spatial texture information; 2) residual learning modules are applied to preserve features lost in convolution and to fuse features at different scales; 3) in the training process, the Huber loss function is used to decrease the negative influence of noise and outliers; and 4) in the combination, an STC-weighting strategy that considers STC is employed to derive the final result. Experimental results showed that the proposed method was effective and performed better than the comparator methods in generating fine spatial resolution LSTs.
The rest of this article is organized as follows. The proposed STTFN fusion method is presented in Section II. Then, the performance of the proposed method is validated in Section III, which is followed by a discussion in Section IV and the conclusion in Section V.

II. METHODOLOGY
Landsat and MODIS LST data were used as fine and coarse spatial resolution data, respectively. The Landsat LST data have a pixel size of 30 m, and the MODIS LST data have a pixel size of 1000 m. The aim was to obtain a pair of fine and coarse LST images predating the target date, for which a coarse MODIS image was available, together with another pair of fine and coarse images postdating the target date. Then, from one MODIS LST image at the target date and two pairs of Landsat and MODIS LST images at the neighboring dates, a Landsat-like LST image was predicted. For simplicity, the target date is denoted as $t_2$, and the dates predating and postdating $t_2$ are denoted as $t_1$ and $t_3$, respectively. Accordingly, the MODIS LST image at $t_2$ is denoted as $M_2$, and the Landsat-MODIS LST image pairs at dates $t_1$ and $t_3$ are denoted as $L_1$ and $M_1$ and $L_3$ and $M_3$, respectively. As shown in Fig. 1, the proposed STTFN generates a fine spatial resolution LST image via a three-stage process: 1) forward and backward model training, 2) forward and backward prediction, and 3) combination of the predicted Landsat-like LST images to yield a final fine spatial resolution predicted image for $t_2$.
The input fine and coarse spatial resolution LST images are first preprocessed. Then, a forward multiscale fusion CNN, which blends $L_1$, $M_1$, and $M_3$ to predict $L_3^{1\to3}$, is gradually learned and optimized; the superscript $1\to3$ indicates forward modeling based on the image pairs at $t_1$ and $t_3$. In addition, an optimally trained backward multiscale fusion CNN can be obtained by using $L_3$, $M_3$, and $M_1$ to predict $L_1^{3\to1}$; the superscript highlights backward modeling. In the prediction stage, two predicted fine spatial resolution LST images, expressed as $L_2^{1\to2}$ and $L_2^{3\to2}$, are generated by the optimal forward and backward multiscale fusion CNNs, respectively. Then, $L_2^{1\to2}$ and $L_2^{3\to2}$ are combined via an STC-weighting function. The following subsections explain the proposed STTFN more fully.

A. Layers Used in the Multiscale Fusion CNN
The most commonly used layer in the multiscale fusion CNN is a convolution layer, which is used to extract and fuse high-level feature maps. The inputs to the $i$th convolution layer are the feature maps $Y_{i-1}$ from the previous layer (comprising latent features extracted from the input LSTs). To realize the feature extraction and fusion tasks, a convolution layer applies $k$ filters of size $r \times r \times c$ (where $c$ is the number of channels) sliding through the input feature maps with a step size $s$, which is denoted here as "conv $r \times r$, $k$". Zero values were padded at the edges of the feature maps to ensure that the output size equaled that of the input. The output feature maps $Y_i$ were calculated as $Y_i = W_i * Y_{i-1} + b_i$, where $W_i$ denotes the filter parameters and $b_i$ is the bias parameter. The parameters are updated and optimized in the training stage. To improve the speed and performance of the network, a batch normalization layer [38] can be added after each convolution layer. Finally, behind the batch normalization layer, a rectified linear unit (ReLU) activation layer ($\max(0, \cdot)$) [39] was adopted to ensure that the output is a nonlinear combination of the inputs. Table I lists the detailed composition of the different layers of the multiscale fusion CNN.
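As a concrete illustration, a minimal sketch of the "conv $r \times r$, $k$" building block is given below in TensorFlow/Keras (the framework used for the experiments in Section III-E); the function name and default arguments are illustrative assumptions, not the authors' code.

    import tensorflow as tf

    def conv_block(x, k, r=3, s=1, use_bn=True):
        """conv r x r with k filters: zero-padded convolution, then batch normalization and ReLU."""
        x = tf.keras.layers.Conv2D(filters=k, kernel_size=r, strides=s,
                                   padding="same")(x)  # "same" zero padding keeps the output size equal to the input
        if use_bn:
            x = tf.keras.layers.BatchNormalization()(x)  # improves speed and stability of training
        return tf.keras.layers.ReLU()(x)  # nonlinearity max(0, .)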

B. Architecture of the Multiscale Fusion CNN
The architecture of the multiscale fusion CNN is illustrated using the prediction of $L_2^{1\to2}$ from $L_1$, $M_1$, and $M_2$ as an example (Fig. 2). The multiscale fusion CNN is a fully convolutional network, which enables an end-to-end mapping from the three input LST images to an output LST image. It first extracts high-level features of $L_1$ and, at the same time, super resolves the LST change image between $M_2$ and $M_1$; it then fuses the extracted feature maps and retrieves the fine spatial resolution result. Therefore, the whole architecture of the multiscale fusion CNN contains three major parts, termed here the extraction-net, the super-resolution-net, and the integration-net. The first part is the extraction-net, a two-layer convolution network employed to extract the high-level features of $L_1$. The derived high-level features provide abundant fine spatial pattern information for the prediction.
The second part is the super-resolution-net. To obtain high-level features with abundant spatial pattern information on the temporal changes between $M_2$ and $M_1$, the change image is first concatenated with $L_1$ and then put into the super-resolution module (Fig. 2). The focus in the fusion process is on locating the temporal change between the target and neighboring dates. Unlike DCSTFN [35], the multiscale fusion CNN acquires the change image first and then uses it as the input to the super-resolution-net. This makes the subsequent network pay more attention to the locations of change and simultaneously reduces its computational burden. Through the super-resolution module, the coarse spatial resolution pixels of the temporal change LST image are disaggregated into fine spatial resolution pixels, which offer more spatial pattern information. Note that the concatenation of $L_1$ in the super-resolution is of primary importance, since it provides fine-scale information for the disaggregation [40]. To implement the super-resolution, a modified version of the state-of-the-art super-resolution model "wide activation for efficient and accurate image super-resolution" (WDSR) [41] was used, referred to here as slim-WDSR (Fig. 2). The core of slim-WDSR is five convolution layers. Through these layers, slim-WDSR can extract high-level features of the coarse spatial resolution LST image and map them to the features of the fine spatial resolution LST image. However, in the feature extraction and mapping process, some low-level features may be lost [42]. To address this deficiency, two local residual learning connections were applied in slim-WDSR, since residual learning can retain shallow features and prevent misconvergence by adding low-level features to the extracted high-level features [43].
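The wide-activation residual pattern of slim-WDSR can be sketched as follows. The channel count and expansion factor are assumptions for illustration (Table I gives the actual layer configuration); the point is the local residual connection that adds the block input back to the extracted features.

    def wdsr_block(x, n_feats=32, expansion=4):
        """One WDSR-style block: expand channels before the ReLU (wide activation), contract, add skip."""
        skip = x
        y = tf.keras.layers.Conv2D(n_feats * expansion, 3, padding="same")(x)
        y = tf.keras.layers.ReLU()(y)                      # wide activation
        y = tf.keras.layers.Conv2D(n_feats, 3, padding="same")(y)
        return tf.keras.layers.Add()([skip, y])            # local residual learning retains shallow features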
The third part is the integration-net, which is made up of three stacked convolutional layers. Through the integration-net, the generated feature maps are integrated and used to estimate the target LST image gradually. Specifically, the feature maps from the extraction-net and super-resolution-net are first blended. The blended feature maps are composed of various layers of feature maps, in which different layers contain different information crucial for estimating the fine spatial resolution LST image; for example, one layer may contain edge information of the involved objects, while another is likely to be made up of location information. Therefore, the integration-net is used to integrate the various types of information and convert the blended feature maps to the target fine LST image. As aforementioned, low-level features may partly disappear because of the convolutions; thus, the initial fused image is added at the end of the whole net, which is termed global residual learning. By using local and global residual learning, low- and high-level features at different scales of the input can be fully utilized.
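Putting the three parts together, a hedged sketch of the whole multiscale fusion CNN follows, reusing the conv_block and wdsr_block helpers sketched above. The layer widths and depths, and the choice of the sum of $L_1$ and the change image as the initial fused image for global residual learning, are assumptions; Fig. 2 and Table I define the actual architecture.

    def build_multiscale_fusion_cnn(h=None, w=None):
        L1 = tf.keras.Input((h, w, 1), name="L1")           # fine LST at t1
        dM = tf.keras.Input((h, w, 1), name="dM")           # upsampled coarse change image
                                                            # (e.g., M3 - M1 in training, M2 - M1 in prediction)
        # Extraction-net: two convolution layers on the fine image L1.
        f = conv_block(L1, 32)
        f = conv_block(f, 32)

        # Super-resolution-net (slim-WDSR): change image concatenated with L1.
        g = tf.keras.layers.Concatenate()([dM, L1])
        g = tf.keras.layers.Conv2D(32, 3, padding="same")(g)
        g = wdsr_block(g)
        g = wdsr_block(g)                                   # two local residual learnings

        # Integration-net: three stacked convolution layers fuse both streams.
        y = tf.keras.layers.Concatenate()([f, g])
        y = conv_block(y, 32)
        y = conv_block(y, 16)
        y = tf.keras.layers.Conv2D(1, 3, padding="same")(y)

        # Global residual learning: add an initial fused image at the end so that
        # low-level features lost in the convolutions are restored (assumed L1 + dM).
        init = tf.keras.layers.Add()([L1, dM])
        out = tf.keras.layers.Add()([y, init])
        return tf.keras.Model([L1, dM], out)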

C. Training and Prediction
Two multiscale fusion CNNs, termed the forward multiscale fusion CNN and the backward multiscale fusion CNN, are trained. In other studies [34]-[36], the mean square error (MSE) was used as the loss function in training. Here, to reduce the effects of noise and outliers, e.g., those caused by unidentified cloud in the input images, the Huber loss [44] was adopted as the loss function; it is more robust to outliers than the MSE loss. Denoting the prediction error at pixel $i$ as $e_i$ (for the forward network, $e_i = F(L_1, M_1, M_3; w_1)_i - (L_3)_i$), the loss of pixel $i$ is

$\ell(e_i) = \frac{1}{2} e_i^2$ if $|e_i| \le \delta$, and $\ell(e_i) = \delta |e_i| - \frac{1}{2}\delta^2$ otherwise, (2)

where $\delta$ is a hyperparameter that determines which formula is used; here, $\delta$ was empirically set to 1.0. $F(\cdot)$ denotes the multiscale fusion network, $N$ is the number of pixels in the input LST imagery (the total loss is the mean of $\ell(e_i)$ over the $N$ pixels), and $w_1$ and $w_2$ denote the forward and backward network parameters to be updated during the training process, respectively. In the training stage, the two networks are optimized separately.
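For reference, the Huber loss with $\delta = 1.0$ can be written in TensorFlow either via the built-in tf.keras.losses.Huber or in the explicit piecewise form of (2); this is a sketch, not the authors' training code.

    huber = tf.keras.losses.Huber(delta=1.0)      # built-in equivalent

    def huber_loss(y_true, y_pred, delta=1.0):
        e = tf.abs(y_true - y_pred)
        quadratic = 0.5 * tf.square(e)            # used where |e| <= delta, like MSE
        linear = delta * e - 0.5 * delta ** 2     # used where |e| > delta; grows linearly, so outliers weigh less
        return tf.reduce_mean(tf.where(e <= delta, quadratic, linear))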
For optimization, the weights of the proposed network were initialized to small random values drawn from a Gaussian distribution with zero mean and a standard deviation of $1 \times 10^{-3}$. Adaptive moment estimation (Adam) with standard backpropagation [45] was applied to minimize the loss and update the network weights until convergence (i.e., until the losses of STTFN no longer changed significantly); values of $\beta_1 = 0.9$ and $\beta_2 = 0.999$ were set for Adam. The learning rate $\alpha$ was initialized as $1 \times 10^{-4}$ and multiplied by a decay factor of 0.1 every 10 epochs to shrink the search range of the parameters.
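These settings can be expressed in Keras as below; steps_per_epoch is a placeholder that depends on the patch count and batch size, the latter of which is not stated in this section.

    steps_per_epoch = 128  # placeholder: patch count / batch size (assumption)

    w_init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=1e-3)  # zero-mean Gaussian init, std 1e-3

    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-4,           # initial alpha
        decay_steps=10 * steps_per_epoch,     # every 10 epochs...
        decay_rate=0.1,                       # ...multiply the learning rate by 0.1
        staircase=True)
    optimizer = tf.keras.optimizers.Adam(learning_rate=schedule, beta_1=0.9, beta_2=0.999)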
In the prediction stage, two fine spatial resolution images, $L_2^{1\to2}$ and $L_2^{3\to2}$, at the target date $t_2$ are produced by the two trained networks. $L_2^{1\to2}$ is generated from $M_1$, $L_1$, and $M_2$ with the trained forward multiscale fusion CNN (Fig. 2), while $L_2^{3\to2}$ is obtained from $M_3$, $L_3$, and $M_2$ via the backward multiscale fusion CNN. Then, the two predicted fine spatial resolution LST images are combined to produce the final predicted LST image for the target date

$\hat{L}_2 = p_1 L_2^{1\to2} + p_3 L_2^{3\to2}$

where $p_1$ and $p_3$ are the weighting parameters for $L_2^{1\to2}$ and $L_2^{3\to2}$, respectively. To determine $p_1$ and $p_3$ in the combination, the spatial consistency between the two predicted fine spatial resolution images and the corresponding coarse spatial resolution LST image was considered, and a novel weighting strategy, termed here STC-weighting, was proposed. It differs from previous fusion methods [34], [36], in which the temporal change between the coarse spatial resolution images at the neighboring dates ($M_1$ and $M_3$) and the target date ($M_2$) was used to calculate the weights (termed here TC-weighting). In STC-weighting, the weight magnitude depends on the difference between each predicted fine spatial resolution image and the coarse spatial resolution image at the target date, i.e., a smaller difference results in a higher weight. Specifically, the differences between $M_2$ and $L_2^{1\to2}$ and between $M_2$ and $L_2^{3\to2}$ are used to calculate the weighting parameters.
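A hedged NumPy sketch of the STC-weighting combination follows. The exact weight formula is not reproduced above; the normalized inverse-difference form below is an assumption consistent with the stated rule that a smaller difference from $M_2$ yields a higher weight.

    import numpy as np

    def stc_weighting(L12, L32, M2_up):
        """L12, L32: forward and backward fine predictions; M2_up: M2 interpolated to the fine grid."""
        d1 = np.abs(M2_up - L12).mean()          # spatial consistency of the forward prediction
        d3 = np.abs(M2_up - L32).mean()          # spatial consistency of the backward prediction
        p1 = (1.0 / d1) / (1.0 / d1 + 1.0 / d3)  # smaller difference, higher weight (assumed form)
        p3 = 1.0 - p1
        return p1 * L12 + p3 * L32               # final fine spatial resolution LST at t2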

III. EXPERIMENTS

A. Study Areas and Data
Two study areas were used here. The first was located in Évora, Portugal [Fig. 3(a)], a large region of expansive plains covered mainly by Holm and Cork Oak trees. The second was in the Shengjin Lake nature reserve, China [Fig. 3(b)]. The land cover mosaic in this study area was more complex than in Évora and, in addition, the lake's water level fluctuated greatly between seasons and hence its extent was variable [46].
The MODIS/Terra LST and Emissivity Daily L3 Global 1-km SIN Grid product (MOD11A1, Collection 6) was obtained from the website of the Level-1 and Atmosphere Archive and Distribution System Distributed Active Archive Center (https://ladsweb.modaps.eosdis.nasa.gov/); the RMSEs of the MOD11A1 product are within 2 °C for most land covers [47], [48]. Landsat LST data were derived from imagery acquired by the Landsat TM/ETM+ thermal bands, provided at a 30-m pixel size, which was downloaded from the United States Geological Survey (USGS) Earth Explorer (http://earthexplorer.usgs.gov/). Data gaps caused by the ETM+ SLC-off problem were filled by a linear interpolation algorithm. Landsat LSTs were estimated with a single-channel algorithm, whose accuracy is reportedly within 1.5 °C [49].
For Évora, 18 cloud-free Landsat-MODIS LST image pairs acquired between January 2010 and October 2011 were obtained (Table II). These image pairs each covered an area of 39 km × 39 km; hence, each Landsat LST product comprised 1300 pixels × 1300 pixels, while each MODIS LST product comprised 39 pixels × 39 pixels. The data acquisition dates covered almost every month of the year and, therefore, captured phenological change.
For the Shengjin Lake nature reserve study area, five cloud-free Landsat-MODIS LST image pairs, covering an area of 24 km × 24 km, were acquired. Each Landsat LST product comprised 800 pixels × 800 pixels, and each MODIS LST product comprised 24 pixels × 24 pixels. The rise and fall of the lake water level during the study period caused changes in the areal extent of the water body.
In the training process, according to the structure of the multiscale fusion CNN, the input coarse resolution images are not the original MODIS data but interpolated images that match the size of the Landsat data; bilinear interpolation was used to interpolate the MODIS data. The original input Landsat and interpolated MODIS LST images were cropped into image patches of 40 pixels × 40 pixels. Considering the different image sizes and land cover types, and via a series of experiments, the cropping strides were set to 20 and 15 for Évora and Shengjin Lake, respectively. Accordingly, totals of 4096 and 2704 image patches from each original input LST image were used to train the network for Évora and Shengjin Lake, respectively. These image patches were randomly chosen during training to ensure the convergence and generalization ability of the network.
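This preprocessing can be sketched as follows; the function names and shapes are illustrative. With a 1300 × 1300 image, a 40-pixel patch, and a stride of 20, crop_patches yields the 4096 patches per image reported for Évora.

    import numpy as np
    import tensorflow as tf

    def upsample_modis(modis, target_hw):
        """Bilinearly interpolate a coarse MODIS LST image to the Landsat image size."""
        x = modis[np.newaxis, :, :, np.newaxis].astype("float32")
        return tf.image.resize(x, target_hw, method="bilinear")[0, :, :, 0].numpy()

    def crop_patches(img, patch=40, stride=20):
        """Slide a patch x patch window with the given stride over a 2-D LST image."""
        h, w = img.shape
        return np.array([img[i:i + patch, j:j + patch]
                         for i in range(0, h - patch + 1, stride)
                         for j in range(0, w - patch + 1, stride)])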

B. Comparator Methods
For an informative assessment, three classic fusion methods were used for comparison: ESTARFM [23], STITFM [11], and StfNet [36]. These methods all assume that images from different satellite sensors observed on the same date are comparable and correlated. ESTARFM is a widely adopted spatiotemporal fusion method that uses images from two satellite sensors as input [23], while STITFM is a modified version of ESTARFM that accommodates input images from arbitrary satellite sensors [11]. Both methods obtain spatial variation from the input fine spatial resolution images and acquire temporal change from the input fine temporal resolution images through linear weight functions [51]. StfNet applies a two-stream CNN to build the nonlinear relations between input and output images and predicts a fine spatiotemporal resolution image by incorporating temporal information into the fine spatial resolution image series [36]. Here, the inputs of these methods are a coarse spatial resolution image at the target date and two pairs of coarse and fine images predating and postdating the target date.
For quantitative evaluation, the absolute error (AE), root mean square error (RMSE), and structural similarity (SSIM) were calculated using the real fine spatial resolution image at the target date as the reference. AE denotes the per-pixel difference between the real and predicted images, RMSE represents the overall difference between the real and predicted images, and SSIM reflects the correlation of spatial details between them. The most accurate prediction will have a low AE and RMSE as well as a high SSIM.
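These metrics can be computed, for example, with NumPy and scikit-image (an illustrative sketch, not the evaluation code used in the experiments).

    import numpy as np
    from skimage.metrics import structural_similarity

    def evaluate(pred, ref):
        ae = np.abs(pred - ref)                     # per-pixel absolute error, in degrees C
        rmse = np.sqrt(np.mean((pred - ref) ** 2))  # overall difference
        ssim = structural_similarity(ref, pred,
                                     data_range=ref.max() - ref.min())  # spatial-detail similarity
        return ae, rmse, ssim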

C. Validity of STC-Weighting and Huber Loss
The performance of the STC-weighting and TC-weighting-based combinations was compared (Fig. 4). The two schemes yielded similar results on most dates, while TC-weighting performed poorly on October 11, 2010. This may be because spatial consistency is not considered in TC-weighting. Three subsets of the Landsat LST images on September 9, 2010, October 11, 2010, and November 4, 2010 are shown in Fig. 5. There were many anomalous values in the gap-filled Landsat ETM+ SLC-off LST image of November 4, 2010, resulting in a large difference in spatial patterns between the Landsat LSTs of October 11, 2010 and November 4, 2010. However, as indicated in Table III, the difference in MODIS LSTs between the target date (October 11, 2010) and the postdating date (November 4, 2010) is smaller than that between the target date and the predating date (September 9, 2010). Therefore, in the TC-weighting-based combination, the backward prediction based on the image pair of November 4, 2010 obtained a higher weight due to the smaller difference in MODIS LSTs with the target date. The error of the backward prediction is much larger than that of the forward prediction based on the image pair of September 9, 2010 (Fig. 6); this spatial inconsistency resulted in large errors in the TC-weighting-based combination. In contrast, the STC-weighting-based combination considered the STC between the predicted results and the actual MODIS data at the target date and, thus, yielded a more robust result. In addition, the final results obtained by combining the forward and backward predictions through STC-weighting were better than either the forward or the backward prediction alone (Fig. 7) due to the consideration of STC.

Adopting the Huber loss as the loss function to optimize the network is of considerable importance for decreasing the negative influence of abnormal values in the input LST images. The lower RMSEs of the predictions for the Évora area with the Huber loss compared with the MSE loss (Fig. 8) indicate that more favorable results can be derived with the Huber loss.

D. Evaluations Based on Actual Data
In the prediction stage, all Landsat-MODIS pairs were arranged in numeric order as defined in Table II. A Landsat-like LST image on the target date was predicted using the corresponding MODIS LST image and two Landsat-MODIS LST image pairs. The acquisition dates of the two Landsat-MODIS LST image pairs were those immediately predating and postdating the target date. For example, Landsat-MODIS LST image pairs 1 and 3 and MODIS LST image 2 were used to predict a Landsat-like LST image for the date of MODIS image 2. Therefore, excluding the first and last LST image pairs, 16 Landsat-like LST images were predicted for the Évora area and three for the Shengjin Lake reserve area. Because the multiscale fusion CNN in the STTFN is a fully convolutional network, it can theoretically process images of arbitrary size. Therefore, the original LST images, without cropping, were entered into the network to yield the predicted image. The accuracy of the prediction was assessed relative to the actual Landsat image for the target date.
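Because the builder sketched in Section II accepts unspecified spatial dimensions, whole-scene prediction can be written as below; L1_full and dM_full stand for the preprocessed full-scene input arrays and are assumptions.

    model = build_multiscale_fusion_cnn(None, None)   # fully convolutional: size fixed only at call time
    # L1_full, dM_full: arrays of shape (1, H, W, 1), e.g., H = W = 1300 for an Evora scene
    L2_pred = model.predict([L1_full, dM_full])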
For the Évora study area, the quantitative evaluations of the fusion results of ESTARFM, STITFM, StfNet, and STTFN are summarized in Table IV. It is evident that the proposed STTFN produced the highest SSIMs on all prediction dates except March 9, 2010 and August 24, 2010. In addition, with the exception of three dates (March 9, 2010, June 21, 2010, and March 20, 2011), the RMSEs of the predictions from STTFN were all lower than those from ESTARFM, STITFM, and StfNet. Finally, STTFN produced the most accurate predictions in terms of the average RMSE and SSIM.
For June 21, 2010, the RMSE values computed for the predictions from ESTARFM, StfNet, and STTFN were large (all > 4.0 °C), much larger than those on other dates, while the RMSE for STITFM was relatively small. This might be because of the large discrepancy between the MODIS and Landsat LSTs on June 21, 2010 (Fig. 9); the overall differences between the Landsat and MODIS images on May 20, 2010 and August 24, 2010 were much smaller than that on June 21, 2010. This caused the relationships formed by ESTARFM, StfNet, and STTFN between the neighboring dates to be inappropriate for predicting the fine spatial resolution image on the target date. For STITFM, the fine spatial resolution prediction is based mainly on adding the differences between the Landsat and MODIS images on the neighboring dates to the MODIS image on the target date [11], rather than on forming relations between the images, and it thus produced a relatively accurate result.

Table V summarizes the quantitative assessment of the output from the different algorithms applied to the Shengjin Lake area. Here, the predictions from STTFN had the highest SSIMs on all dates. The RMSEs of STTFN on December 8, 2016 and February 26, 2017 were lower than those of the other methods. The only date on which STTFN did not yield the lowest RMSE was March 14, 2017, for which ESTARFM had a very slightly lower RMSE. The average results of the four methods also indicate that STTFN generally produced the most accurate predictions.
The AE distributions of image pixels for the Évora study area [Fig. 10(a)] and the Shengjin Lake study area [Fig. 10(b)] on all prediction dates also show that STTFN performed better in both study areas (almost 60% of AEs < 1.0 °C). ESTARFM has an AE distribution comparable to that of STTFN for the Shengjin Lake study area [Fig. 10(b)], while it did not perform well in the Évora study area (less than 40% of AEs < 1.0 °C) [Fig. 10(a)]. ESTARFM, STITFM, and StfNet have similar AE distributions in the Évora study area, but they perform differently in the Shengjin Lake area.
The results in Tables IV and V indicate that the predictions of the proposed STTFN were closest to the actual data and retained the most structural similarity. Fig. 10 shows that STTFN not only yielded the most accurate results but also had the most stable performance. This is because the proposed STTFN established a realistic nonlinear relationship between the input and output LSTs, whereas ESTARFM and STITFM are limited to linear relationships. Although StfNet builds a nonlinear relationship between the input and output LSTs, it is a shallow network, which is unable to adequately represent complex nonlinear relationships. In addition, CNNs are able to extract high-level features that contain abundant semantic information (e.g., edges of objects in images and their mutual relations); thus, the average SSIMs of STTFN and StfNet, which both use CNNs, were higher than those of the other two methods in the two study areas.

To illustrate the quality of the spatial representation of LST in the various fusion results, the predictions from the four methods, the actual Landsat LST images, and the error maps between the four predictions and the corresponding actual LST images on August 24, 2010 ($T_a$ and $d_a$) and June 24, 2011 ($T_b$ and $d_b$) for Évora and on February 26, 2017 ($T_c$ and $d_c$) for the Shengjin Lake reserve are shown in Fig. 11. It is evident that, on August 24, 2010 and June 24, 2011, large parts of the predicted LST images obtained from STITFM and ESTARFM are higher than the actual values, especially for STITFM (Fig. 11). The predictions from ESTARFM are visually close to the actual LST image for February 26, 2017, but there are also some considerably higher values (red rectangle in Fig. 11). The predicted image from StfNet for August 24, 2010 was visually good but too smooth, while that for June 24, 2011 showed some over-prediction. The smoothness problem also occurred for February 26, 2017, for which large areas show predicted values lower than the actual values. Overall, the predicted LST images generated by STTFN are most consistent with the actual LST images, which agrees with the SSIMs in Tables IV and V.

For ESTARFM and STITFM, each predicted pixel is a weighted combination of the inputs. Therefore, if there are any abnormal values in the input LST images, such as local inconsistencies in the gap-filled Landsat ETM+ SLC-off images, the fusion product will contain corresponding abnormal values, as in the ESTARFM and STITFM outputs in Fig. 12(g) and (h). With STTFN, the predicted image is the combination of the two predicted outcomes from the learned forward and backward multiscale fusion CNNs. Therefore, if abnormal values occur in only one pair of neighboring images, this pair is given a low weight in the abnormal regions and hence has little negative influence. The core procedure of StfNet is similar to that of STTFN, so it also has few abnormal values in the corresponding region. Similarly, when anomalous values occurred in the gap-filled Landsat ETM+ SLC-off LST image, the results of ESTARFM and STITFM were greatly affected, e.g., the road in the red bold box disappeared [Fig. 13(a) and (b)], while the effect on the predictions from StfNet and STTFN was relatively small (Fig. 13).

E. Computation Efficiency
Using a 1300 pixels × 1300 pixels area extracted from the Landsat LST images as an example, the computational efficiency of the different fusion methods was compared. All experiments were carried out on the same computer, equipped with a 3.4-GHz Intel Core i7-6700 processor with 24 GB of RAM and an NVIDIA 1080Ti GPU with 11 GB of RAM. StfNet and STTFN were implemented using the TensorFlow framework with GPU acceleration, while ESTARFM and STITFM were tested in CPU mode. The comparison of computational efficiency for the different methods is shown in Table VI. It is evident that ESTARFM was the most time-consuming method. The time cost of STTFN was close to that of STITFM and higher than that of StfNet. For STTFN and StfNet, most of the computational time is spent on training; StfNet is a shallow network with three layers, which requires less training time. In general, CNN-based methods are more time-consuming than traditional methods because the training process is slow. However, STTFN employs the Adam algorithm to update the parameters of the network, which converges much faster than the stochastic gradient descent (SGD) algorithm. Furthermore, batch normalization layers are used to overcome overfitting and speed up the training process. Thus, the total time of STTFN is less than that of ESTARFM and close to that of STITFM.

IV. DISCUSSION
An STTFN was proposed to predict fine spatiotemporal resolution Landsat-like LST images from MODIS and Landsat observations. The effectiveness of the proposed network was assessed using data from two study areas: Évora, Portugal and the Shengjin Lake reserve, China. The enhanced performance of STTFN relative to the other methods arises primarily from its ability to accommodate nonlinear relations between the input and output LST data and from a specific weight function-based combination. STTFN is a multiscale fusion CNN-based method that achieves high-level feature extraction and fusion at different levels. The slim-WDSR super resolves the temporal change image between the two input coarse spatial resolution LST images and, to provide fine spatial detail during the super-resolution analysis, the temporally neighboring fine spatial resolution LST image is concatenated with the change image. This allows fine spatial resolution change information to be represented. Some low-level features may be lost in the convolution process of a CNN. To solve this problem, STTFN uses two local and one global residual learning processes, which retain shallow features. Therefore, STTFN can derive high- and low-level features at the same time. Finally, a key component of STTFN is the use of the STC-weighting function. In previous spatiotemporal fusion methods [34], [36], the temporal changes between the coarse spatial resolution LST images at the neighboring and target dates are employed to calculate the weight parameters. However, the STC-weighting function also accommodates the spatial consistency between the two predicted fine spatial resolution images and the corresponding coarse image. Note that, although the performance of STTFN was validated in only two study areas here, it can be applied globally if a sufficient training data set is available.
There are, however, some limitations to STTFN. First, the performance of STTFN relies greatly on the training samples, an inherent problem of CNNs. Therefore, LST changes that are not contained in the training data may not be predicted accurately. The incorporation of additional inputs (e.g., in situ LST) in combination with other models (e.g., radiative transfer models) may help to solve this problem. Second, since the overpass times of MODIS and Landsat are similar (as indicated in Table II, the time differences between them are all within half an hour), the data were not corrected for possible inconsistencies; however, the results from the combination of forward and backward predictions would not be ideal if the MODIS and Landsat LSTs did not match at the target date. Time normalization methods have been applied to eliminate the time inconsistency of MODIS LST products [52]; we will try to integrate time normalization into the CNN in a further study. In addition, there may be large discrepancies between MODIS and Landsat LSTs for some image pairs because they were retrieved by different algorithms. These discrepancies can have large negative impacts on the fusion results (e.g., the result on June 21, 2010). Therefore, the accuracy of the input LST products is also an important factor for the accuracy of STTFN. Third, two pairs of temporally close LST images are required to train the network and predict the result, and such pairs are sometimes difficult to obtain due to cloud cover. This problem could be reduced by using methods to fill missing values in remotely sensed LST series [53]. Finally, some customized parameters in STTFN (e.g., the cropping size and stride of the training image patches and the learning and decay rates of Adam) must be set in advance; they were set via experience and a series of experiments. The final experiments showed that STTFN was able to generate promising results for different data sets with these values, but better parameter values may exist that would yield more accurate results; how to acquire the best values of such parameters automatically is an active topic in deep learning and needs further study.

V. CONCLUSION
The proposed STTFN was used successfully to generate fine spatial resolution LST images. STTFN specified the nonlinear relations between the input and output LSTs based on an integration of features extracted at different levels and fusion through a specially designed convolutional neural network. The main novelties of STTFN are that: 1) it more fully uses the low- and high-level features of the input LSTs through residual learning units and a super-resolution module, which enables spatial information at different scales to be obtained, and it has an enhanced capacity to accommodate potentially complicated nonlinear relations between input and output data; 2) it uses the Huber loss as the loss function in training, which is robust to outliers and, hence, yields enhanced outputs; and 3) it preserves the STC of LST images through STC-weighting, which considerably enhances output quality. The proposed method was tested on actual data for two study areas and compared with three classic fusion methods (ESTARFM, STITFM, and StfNet). The results indicated that STTFN is able to produce accurate and stable fine spatiotemporal resolution LST images. Moreover, it also performs well in terms of computational efficiency. STTFN is designed to yield more accurate LST products with fine spatiotemporal resolution and, therefore, to support the monitoring of diurnal land-surface and ecological dynamics. Future improvements may include developing strategies to tackle the missing-value problem, analyzing network parameters, and introducing physical models.
Zhixiang Yin received the M.S. degree in geodesy and surveying engineering from Wuhan University, Wuhan, China, in 2015. He is currently pursuing the Ph.D. degree in physical geography with the Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, Wuhan, China.
He is also a Lecturer with the School of Resources and Environmental Engineering, Anhui University, Hefei, China. His research interests include deep learning and remotely sensed image fusion and super resolution.