Model-driven Learning for Generic MIMO Downlink Beamforming With Uplink Channel Information


Abstract—Accurate downlink channel information is crucial to the beamforming design, but it is difficult to obtain in practice. This paper investigates a deep learning-based optimization approach for downlink beamforming to maximize the system sum rate when only uplink channel information is available. Our main contribution is a model-driven learning technique that exploits the structure of the optimal downlink beamforming to design an effective hybrid learning strategy aimed at maximizing the sum rate performance. This is achieved by jointly considering the learning performance of the downlink channel, the power, and the sum rate in the training stage. The proposed approach applies to generic cases in which the uplink channel information is available, but its relation to the downlink channel is unknown, and it does not require explicit downlink channel estimation. We further extend the developed technique to massive multiple-input multiple-output scenarios and achieve a distributed learning strategy for multicell systems without inter-cell signalling overhead. Simulation results verify that our proposed method provides performance close to that of state-of-the-art numerical algorithms with perfect downlink channel information and significantly outperforms existing data-driven methods in terms of the sum rate.

I. INTRODUCTION
Beamforming is an important multi-antenna technique for managing interference and improving the capacity of multiuser wireless communication systems. Most beamforming design problems are nonconvex, so early efforts to optimize beamforming mainly relied on numerical algorithms [1], [2], which incur high complexity in practical implementation. Recently, deep learning has been recognized as a new "learn to optimize" approach to beamforming design [3]-[7]. The deep learning approach significantly reduces the optimization complexity, and the resulting beamforming solution can potentially be implemented in real time.
The optimization and performance of beamforming critically depend on the availability of perfect channel state information (CSI) at the transmitter. Existing deep learning based beamforming solutions mainly assume the availability of perfect CSI. In the downlink, perfect CSI is difficult to obtain at the base station (BS) for several reasons. In time division duplex (TDD) systems, channel reciprocity is normally assumed, so the downlink CSI can be estimated from the pilots sent by the mobile users in the uplink. However, in practice, channel reciprocity does not hold because the analog radio front-ends at the BS and the mobile users exhibit non-reciprocity due to the non-identical behavior of the individual transmit and receive chains [8]. In frequency division duplex (FDD) systems, for the BS to obtain the CSI, the BS first transmits pilots to the users for downlink channel estimation, and the users then feed back the estimated downlink channel to the BS. As a result, this channel acquisition process incurs a large overhead and reduces the effective system spectral efficiency.
There have been significant efforts in obtaining the CSI using deep learning techniques. CsiNet was developed in [9] to learn CSI sensing and recovery in FDD-based massive multiple-input multiple-output (MIMO) systems using the channel structure. A learned denoising-based approximate message passing neural network was proposed in [10] for beamspace channel estimation in millimeter-wave (mmWave) massive MIMO systems. Based on the fact that the uplink and downlink channels share the same propagation environment, a deep neural network for channel calibration between the two directions was designed in [11] for a generic massive MIMO system. A sparse complex-valued neural network was introduced in [12] to approximate the uplink-to-downlink mapping function in FDD massive MIMO systems. Convolutional neural networks and generative adversarial networks were used in [13] to infer the downlink CSI by observing the uplink CSI. The feasibility of channel mapping in space and frequency was demonstrated in [14], where the channels at one set of antennas using one frequency band can be learned from the channels at another set of antennas that use a different frequency band. A comprehensive joint channel estimation and feedback framework based on deep learning was proposed in [15], which realizes the estimation, compression, and reconstruction of downlink channels in FDD massive MIMO systems. There is an emerging line of studies that aims to learn the beamforming solutions rather than improve the CSI accuracy, which is close to the main idea of this work. For instance, the work in [16] proposed a deep learning based CSI feedback framework for beamforming design in FDD massive MIMO systems to maximize the beamforming performance gain.
A deep neural network using unsupervised training was proposed in [17] to map the received uplink pilots to the beamforming matrix at the BS for intelligent reflecting surface-assisted systems, assuming uplink-downlink channel reciprocity. A channel sensing and hybrid precoding design was proposed in [18], which uses the received pilots without an intermediate channel estimation step for TDD massive MIMO systems.
While existing deep learning approaches have achieved success in the individual tasks of channel estimation and beamforming design, they normally require massive amounts of data and computational resources, and their simple combination does not guarantee satisfactory end results. This is due to their inherent limitation of being data-driven and model-agnostic. The optimization of each separate task focuses only on its own objective, assuming the other is ideal. In reality, the optimization in each task introduces some error or imperfection, so the overall performance can deteriorate when they are simply put together.
A promising direction to remedy this problem is model-driven deep learning, which combines the data-driven approach with the underlying domain knowledge, mathematical models, and problem structures to achieve better inference with less data. Recent advances in model-driven deep learning approaches in physical layer communications were discussed in [19]. In our previous works [7], [20], we proposed model-driven neural network designs for beamforming optimization by exploiting the problem structure. Deep neural networks that adopt the algorithmic structure and constraints of adaptive signal processing techniques were proposed in [21]; they can efficiently learn to perform fast, high-quality ultrasound beamforming using very little training data. Model-driven learning is a new concept that can be broadly applied in engineering design, and a comprehensive review of leading approaches for combining model-based algorithms with deep learning can be found in [22], with detailed signal processing and communications oriented examples.
In this paper, we aim to design a model-driven deep learning approach to jointly tackle the challenge of channel estimation and optimize the downlink beamforming to maximize the sum rate of generic multiuser MIMO systems. Different from the literature, we assume that only information about the uplink channel is available, without explicit knowledge of the downlink CSI, and that the relation between the downlink and uplink channels is unknown. The uplink channel information can take the form of either perfect CSI or received pilots. This method alleviates the burden of channel estimation at the user side, reduces the feedback overhead, and is flexible enough to be used in both TDD and FDD systems, and in massive MIMO and multicell settings. Especially for FDD systems, the proposed approach makes it possible to learn the downlink beamforming directly, without sending downlink pilots, uplink feedback, or explicit channel estimation. The novelty of our work is two-fold: first, we propose to optimize the beamforming in order to maximize the end performance and therefore bypass the explicit intermediate channel estimation step; second, we introduce a model-driven deep learning-based approach. Compared to existing data-driven beamforming learning, our proposed approach specifies the most appropriate features to be learned, with improved inference and end performance. Our main contributions are summarized as follows:
• We exploit the algorithmic structure of beamforming solutions as the useful model information, and propose a hybrid method for joint learning of the downlink channel and optimization of beamforming to guarantee the sum rate performance. Specifically, we design a neural network consisting of two subnets that learn the downlink channel and the auxiliary power vector, respectively, from which the downlink beamforming solution can be constructed. The overall loss function is hybrid, chosen as a weighted sum of the loss functions of channel and power learning using supervised training, and of the sum rate using unsupervised training.
• We investigate techniques to further reduce the problem dimension and achieve near-optimal low-complexity learning, by using zero-forcing (ZF) beamforming in the loss function, which is especially appealing for massive MIMO systems.
• We extend the proposed method to multicell massive MIMO systems, in which the BS in each cell is able to learn the beamforming solution in a distributed manner without signalling exchange with other cells. To the best of our knowledge, this is the first distributed learning solution for the optimization of multicell beamforming.
• Extensive simulations are carried out to evaluate the performance of the proposed algorithms, which show that the proposed algorithm can achieve a sum rate close to that of the weighted minimum mean squared error (WMMSE) algorithm [2], and significantly outperforms existing learning methods.
The remainder of this paper is organized as follows. Section II introduces the system model and the problem formulation. The uplink-to-downlink channel mapping is discussed in Section III. The model-driven hybrid learning approach for a general downlink is proposed in Section IV. Section V presents techniques to further reduce the training complexity for single-cell massive MIMO systems. Section VI extends the result to allow distributed learning of the beamforming solution in a multicell massive MIMO scenario. Simulation results and conclusions are given in Section VII and Section VIII, respectively.
Notations: Boldface lower case and capital letters represent column vectors and matrices, respectively. The notations $\mathbf{A}^H$ and $\|\mathbf{A}\|$ denote the conjugate transpose and the Frobenius norm of a complex matrix $\mathbf{A}$, respectively. $\mathbb{C}$ denotes the complex field. The operator $\mathcal{CN}(\mathbf{m}, \boldsymbol{\Sigma})$ represents a complex Gaussian vector with mean $\mathbf{m}$ and covariance matrix $\boldsymbol{\Sigma}$. $\mathbf{I}_N$ denotes an $N \times N$ identity matrix. $\mathbb{E}[\cdot]$ denotes the expectation of a random variable.
II. SYSTEM MODEL AND PROBLEM FORMULATION

We start with a single-cell multi-input single-output (MISO) downlink system where a BS with $N$ antennas serves $K$ single-antenna users. The received signal at user $k$ can be written as
$$y_k = \mathbf{h}_{d,k}^H \sum_{j=1}^{K} \mathbf{w}_j s_j + n_k, \quad (1)$$
where $\mathbf{h}_{d,k} \in \mathbb{C}^{N \times 1}$ denotes the downlink channel vector from the BS to user $k$, and $\mathbf{w}_k$ and $s_k$ denote the transmit beamforming vector and the information signal for user $k$ with normalized power, respectively. $n_k$ is the additive white Gaussian noise with zero mean and variance $\sigma_0^2$. The beamforming matrix is $\mathbf{W} = [\mathbf{w}_1, \cdots, \mathbf{w}_K]$ and we collect the downlink CSI into $\mathbf{H}_d = [\mathbf{h}_{d,1}, \cdots, \mathbf{h}_{d,K}]$. The received SINR at user $k$ is expressed as
$$\Gamma_k = \frac{|\mathbf{h}_{d,k}^H \mathbf{w}_k|^2}{\sum_{j \neq k} |\mathbf{h}_{d,k}^H \mathbf{w}_j|^2 + \sigma_0^2}. \quad (2)$$
The sum rate is then written as $R_{\mathrm{sum}} = \sum_{k=1}^{K} \log_2(1 + \Gamma_k)$. Based on the above model, the sum rate maximization problem under the total transmit power constraint can be formulated as
$$\max_{\mathbf{W}} \ R_{\mathrm{sum}} \quad \mathrm{s.t.} \ \sum_{k=1}^{K} \|\mathbf{w}_k\|^2 \leq P. \quad (3)$$
The extended system models for massive MIMO and multicell scenarios will be discussed in Sections V and VI, respectively, and specific techniques to reduce the training complexity and enable distributed learning will also be introduced. When the downlink CSI $\mathbf{H}_d$ is available, there exist numerical algorithms, such as the WMMSE algorithm [2], that can find a locally optimal beamforming solution of problem (3). In our recent work [7], we proposed a deep learning method to solve this problem with perfect downlink CSI. However, without the downlink CSI $\mathbf{H}_d$, existing numerical or deep learning algorithms cannot be applied. Therefore, we focus on the design of downlink beamforming algorithms when only the uplink channel information is available, either in the form of perfect CSI $\mathbf{H}_u$ or the received pilot signal.
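As a concrete illustration of the quantities in (1)-(3), the per-user SINR and the sum rate can be computed in a few lines of NumPy. This is a minimal sketch; the function name and the symbol choices are ours, not from the paper.

```python
import numpy as np

def sinr_and_sum_rate(H, W, noise_var):
    """Per-user SINR and sum rate for a MISO downlink.

    H: (N, K) complex downlink channel matrix, column k is h_{d,k}.
    W: (N, K) complex beamforming matrix, column k is w_k.
    noise_var: noise variance sigma_0^2.
    """
    # G[k, j] = |h_{d,k}^H w_j|^2 for all user/beam pairs
    G = np.abs(H.conj().T @ W) ** 2
    signal = np.diag(G)
    interference = G.sum(axis=1) - signal
    sinr = signal / (interference + noise_var)
    sum_rate = np.sum(np.log2(1.0 + sinr))
    return sinr, sum_rate
```

For instance, with orthogonal channels and matched unit-power beams, each user sees no interference and its SINR reduces to the per-user SNR.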
III. UPLINK TO DOWNLINK CHANNEL MAPPING

In this paper, we rely on the uplink channel information to infer the downlink channel and optimize the downlink beamforming. We therefore assume that there exists a deterministic and unique mapping from the uplink channel to the downlink channel; its explicit form is unknown but can be learned by a deep neural network. This assumption is based on the fact that a wireless channel between a transmitter-receiver pair is determined by the positions of the transmitter-receiver pair, the antennas, the carrier frequency, and the environment in which the signals propagate, including the objects within it and their materials and shapes. Because both uplink and downlink channels share the same propagation environment, given the positions of the transceiver pair, there is an intrinsic mapping between the uplink and downlink channels. Below we give details for the FDD and TDD cases, respectively.
• In an FDD system, consider the single-antenna uplink channel $h_u$ and downlink channel $h_d$ that operate at frequencies $f_u$ and $f_d$, respectively. Assuming there are $N_p$ distinct propagation paths in the environment, the uplink channel can be written as
$$h_u = \sum_{p=1}^{N_p} \alpha_p e^{-j(2\pi f_u \tau_p - \phi_p)}, \quad (4)$$
where $\alpha_p$ is the path attenuation, $\tau_p$ is the path delay, and $\phi_p$ is a frequency-independent phase shift that captures the reflection and attenuation effects of the signal along path $p$. The path attenuation depends on the distance between the transceiver pair, their antenna gains, the carrier frequency, and the environment; the phase shift depends on the scattering; and the path delay depends on the propagation distance. Therefore, when the environment and other factors are unchanged, there is a deterministic mapping from the positions to the channel [14]. Next we look at the mapping from the channel to the positions. Although this mapping may not always be unique, it is unique with high probability in many practical wireless communication scenarios, especially as the number of antennas increases, which is widely exploited in wireless fingerprinting [25] and positioning [26]. In other words, the mapping between positions and channel can be assumed bijective, and so can the mapping between the uplink and downlink channels.
• In a TDD system, channel reciprocity is usually assumed, but in reality the analog radio front-ends at different wireless nodes, such as the BS and the mobile users, exhibit non-reciprocity due to the non-identical behavior of the individual transmit and receive chains. This is caused by mismatches in the frequency responses of the radio front-ends at both the BS and user sides between the transmit and receive modes, and by differences in the mutual coupling of the BS antenna units and the associated RF transceivers in the transmit and receive modes [27], [28]. Specifically, consider the channels between a BS with $N$ antennas and a single-antenna user in linear TDD systems. The uplink channel $\mathbf{h}_u \in \mathbb{C}^{N \times 1}$ and the downlink channel $\mathbf{h}_d \in \mathbb{C}^{N \times 1}$ can be written as [8], [11]
$$\mathbf{h}_u = b_t \mathbf{R} \mathbf{h}, \quad \mathbf{h}_d = b_r \mathbf{T} \mathbf{h}, \quad (5)$$
where $\mathbf{h} \in \mathbb{C}^{N \times 1}$ is the physical reciprocal channel, and $b_t$ and $b_r$ are the frequency responses at the user side in the transmitting and receiving modes, respectively. Denote $\mathbf{L} \in \mathbb{C}^{N \times N}$ as the frequency-response matrix and $\mathbf{M} \in \mathbb{C}^{N \times N}$ as the mutual coupling matrix of the BS; then $\mathbf{R} = \mathbf{L}_r \mathbf{M}_r$ and $\mathbf{T} = \mathbf{M}_t \mathbf{L}_t$, where the subscripts $t$ and $r$ denote the transmitting and receiving modes, respectively. The frequency-response matrix $\mathbf{L}$ is diagonal, but the mutual coupling matrix $\mathbf{M}$ is not. In general, $\mathbf{M}_t \neq \mathbf{M}_r$, $\mathbf{L}_t \neq \mathbf{L}_r$, $b_t \neq b_r$, so the uplink and downlink channels are non-reciprocal, and their relation can be described as
$$\mathbf{h}_d = \frac{b_r}{b_t} \mathbf{T} \mathbf{R}^{-1} \mathbf{h}_u, \quad (6)$$
which is a deterministic and unique mapping. Once the bijective mapping between the uplink and downlink channels is established, it can be learned using deep neural networks based on the universal approximation theorem [29]. Note that in the above we have adopted explicit parametric modelling of the uplink and downlink channels with simplifying assumptions (e.g., not all hardware impairment sources, such as power amplifier distortion, phase noise and quantization noise, are considered), but in practice such parametric models may not be accurate.
In this paper we do not assume any specific parametric channel models in our theoretical development and instead we use the model-driven learning approach to learn the best mapping of the uplink to downlink channel in order to maximize the end performance.

IV. THE PROPOSED GENERAL ALGORITHM FRAMEWORK
Since we consider a generic system, which could be either TDD or FDD, no exact theoretical characterization of the mapping between the uplink and downlink channels is available beyond the assumption that it is bijective. The most viable way to obtain the downlink CSI without user feedback is therefore to learn it from data first, based on which the downlink beamforming is then learned or optimized to maximize the sum rate. This is the traditional method that treats the channel learning and the end performance optimization separately. The main drawback of such separate learning is that the explicit channel learning process does not take into account the ultimate objective of maximizing the sum rate, and it also causes error propagation when optimizing the beamforming. In this paper, we use a deep learning approach to solve problem (3) directly from the uplink channel information. We still include learning the downlink channel from the uplink channel information, but it is only an intermediate step, and the focus is not to achieve the best channel learning performance. The key idea of our proposed method is to exploit the optimal structure of the beamforming solution as the useful model information, which then guides the design of a highly efficient neural network to solve (3) with the assistance of the learned downlink CSI. More details are given below.

A. Structure of beamforming
According to [23], the optimal downlink beamforming vectors that maximize the sum rate possess the structure
$$\mathbf{w}_k = \sqrt{p_k} \frac{\left(\mathbf{I}_N + \sum_{j=1}^{K} \frac{q_j}{\sigma_0^2} \mathbf{h}_{d,j} \mathbf{h}_{d,j}^H\right)^{-1} \mathbf{h}_{d,k}}{\left\|\left(\mathbf{I}_N + \sum_{j=1}^{K} \frac{q_j}{\sigma_0^2} \mathbf{h}_{d,j} \mathbf{h}_{d,j}^H\right)^{-1} \mathbf{h}_{d,k}\right\|}, \quad (7)$$
where $p_k$ and $q_k$ are positive parameters satisfying $\sum_{k=1}^{K} p_k = \sum_{k=1}^{K} q_k = P$. The parameter vector $\mathbf{p} = [p_1, \ldots, p_K]$ represents the downlink power allocation. $\mathbf{q} = [q_1, \ldots, q_K]$ is an auxiliary vector variable that is useful for determining the direction of the beamforming. Because it satisfies the same total power constraint as the downlink power vector $\mathbf{p}$, $\mathbf{q}$ can be interpreted as a virtual uplink power vector. The advantage of this representation is that the power vector $[\mathbf{p}; \mathbf{q}]$ can be regarded as the key feature of the beamforming solution. Instead of learning the high-dimensional beamforming matrix $\mathbf{W}$ directly, (7) allows us to learn the low-dimensional feature $[\mathbf{p}; \mathbf{q}]$, which greatly improves the learning efficiency and accuracy, and reduces the training complexity.
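A beamforming matrix can be recovered from the feature $[\mathbf{p}; \mathbf{q}]$ via this structure with a single shared matrix inverse. The following NumPy sketch (function and variable names are ours) illustrates the construction; note $\mathbf{H}\,\mathrm{diag}(q_j/\sigma_0^2)\,\mathbf{H}^H = \sum_j (q_j/\sigma_0^2)\mathbf{h}_j\mathbf{h}_j^H$.

```python
import numpy as np

def beamforming_from_powers(H, p, q, noise_var):
    """Recover W from the power feature [p; q] using the optimal structure.

    H: (N, K) downlink channel matrix; p, q: length-K power vectors.
    Columns of the result are w_k = sqrt(p_k) * A^{-1} h_k / ||A^{-1} h_k||.
    """
    N, K = H.shape
    # A = I + sum_j (q_j / sigma^2) h_j h_j^H, shared by all users
    A = np.eye(N) + (H * (q / noise_var)) @ H.conj().T
    dirs = np.linalg.solve(A, H)                           # A^{-1} h_k for all k
    dirs = dirs / np.linalg.norm(dirs, axis=0, keepdims=True)
    return dirs * np.sqrt(p)                               # scale unit directions by sqrt(p_k)
```

Since each column is a unit vector scaled by $\sqrt{p_k}$, the total power constraint $\|\mathbf{W}\|^2 = \sum_k p_k$ is satisfied by construction.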

B. Key modules of the proposed neural network
Based on the expression in (7), we propose a neural network to jointly learn the downlink channel and the power feature vector with the end objective of maximizing the system sum rate, as illustrated in Fig. 1. The neural network takes the uplink channel information as input, and its output is the beamforming matrix $\mathbf{W}$. The input can be either the perfect uplink CSI $\mathbf{H}_u$, with a dimension of $2NK$ obtained by stacking the real and imaginary parts, or the uplink pilot signal, which will be discussed later in this section. The proposed neural network structure consists of the following three modules:
• CSI-Net. This sub-net aims to learn the downlink channel $\mathbf{H}_d$ from the uplink channel, and its output is separated into the real part $\Re(\mathbf{H}_d)$ and the imaginary part $\Im(\mathbf{H}_d)$.
When the input is the uplink CSI, it performs uplink-downlink calibration for TDD systems, while for FDD systems it maps the channel in the uplink band to the channel in the downlink band. It is not necessary to specify a TDD or FDD system, as this module learns the downlink channel automatically in either case. When the input is the uplink pilot signal, it additionally estimates or refines the uplink channel, but this is embedded in the module implicitly. Suppose we have a channel training dataset containing $S$ uplink-downlink CSI pairs. Given that the predicted output for the $s$-th sample of the CSI-Net is $\hat{\mathbf{H}}_d^{(s)}$ and the target result in the training dataset is $\mathbf{H}_d^{(s)}$, the mean squared error (MSE)-based loss function of the CSI-Net is defined as
$$L_{\mathrm{CSI}} = \frac{1}{S} \sum_{s=1}^{S} \left\|\hat{\mathbf{H}}_d^{(s)} - \mathbf{H}_d^{(s)}\right\|^2. \quad (8)$$
The structure of the CSI-Net depends on the specific system of interest; in this paper, we adopt fully connected layers, and details will be given in Section VII. Note that for the uplink CSI input, when the channels for different users are uncorrelated and their statistics are similar, we can use the same CSI-Net to learn the downlink channel in a single-user manner. This effectively increases the amount of training data by a factor of $K$ while reducing the complexity of the neural networks.
• Power-Net. This sub-net aims to learn the concatenated power vector $[\mathbf{p}; \mathbf{q}]$, which is the key feature of the beamforming solution. In the literature, there is no method available that can find the optimal $p_k^*$ and $q_k^*$ in (7) to maximize the sum rate with reasonable complexity. The WMMSE algorithm [2] is a well-known iterative method for finding locally optimal solutions. It ensures the continuity of the mapping from the channel to the solution, which can be learned by a neural network. Therefore, we generate samples of the power allocation vectors $\mathbf{p}$ and $\mathbf{q}$ for training by using the WMMSE algorithm.
Supervised learning with the following MSE-based loss function is used to train the Power-Net:
$$L_{\mathrm{PW}} = \frac{1}{S} \sum_{s=1}^{S} \left( \left\|\hat{\mathbf{p}}^{(s)} - \mathbf{p}^{(s)}\right\|^2 + \left\|\hat{\mathbf{q}}^{(s)} - \mathbf{q}^{(s)}\right\|^2 \right), \quad (9)$$
where $\mathbf{p}^{(s)}$ and $\mathbf{q}^{(s)}$ are the $s$-th training samples of the downlink and uplink power vectors in the power training database obtained from the WMMSE algorithm, respectively, and $\hat{\mathbf{p}}^{(s)}$ and $\hat{\mathbf{q}}^{(s)}$ are the predicted results of the Power-Net. Similar to the CSI-Net, the neural network structure is system-dependent; in this paper, we adopt fully connected layers or convolutional neural network (CNN) layers, which will be specified in the simulation results of Section VII.
• Beamforming Recovery Module. This module has two functions. First, it recovers the downlink beamforming matrix from the downlink channel output of the CSI-Net and the uplink and downlink power output of the Power-Net, using the structure specified in (7); there is no parameter to optimize in this module. Second, this module is used to calculate the sum rate for unsupervised learning, which then forms part of the overall loss function for effective hybrid training, as described in the next subsection.

C. Hybrid training
To train the proposed neural network, we construct the overall loss function as the weighted sum of the losses of the CSI-Net, the Power-Net, and the sum rate, i.e.,
$$L = \lambda_{\mathrm{CSI}} L_{\mathrm{CSI}} + \lambda_{\mathrm{PW}} L_{\mathrm{PW}} + \lambda_{R} L_{R}, \quad (10)$$
where $L_R = -R_{\mathrm{sum}}$ and $\lambda_i$, $i \in \{\mathrm{CSI}, \mathrm{PW}, R\}$, is the weight of each loss component. $L_{\mathrm{CSI}}$ and $L_{\mathrm{PW}}$ are obtained by supervised learning from the CSI-Net and the Power-Net, respectively, while $R_{\mathrm{sum}}$ is calculated based on (2) using the beamforming matrix obtained from the Beamforming Recovery Module and the downlink channel learned by the CSI-Net; overall, the training adopts a hybrid supervised and unsupervised approach. Note that both the channel learning and the power learning are auxiliary in our proposed algorithm; the focus of the overall learning is to maximize the sum rate, not to achieve the best learning performance for the downlink channel matrices and power vectors. The incorporation of the sum rate into the loss function is important because it ensures that the training of the neural network is guided by the end performance; this is in stark contrast to the separate training approach, which only focuses on learning the channel or the power vectors and cannot guarantee the sum rate performance. In addition, the inclusion of $L_{\mathrm{CSI}}$ and $L_{\mathrm{PW}}$ is also important because the downlink channel is unknown and it is difficult to learn the overall mapping from the uplink channel directly to the downlink beamforming, which leads to unsatisfactory training performance. We illustrate the advantage of the proposed algorithm over the supervised learning ($\lambda_R = 0$) and unsupervised learning ($\lambda_{\mathrm{CSI}} = \lambda_{\mathrm{PW}} = 0$) methods in Fig. 2, where the uplink CSI follows a given distribution and the elements of the downlink CSI are related to the corresponding elements of the uplink CSI by a fixed mapping. As can be seen, the proposed algorithm outperforms the supervised learning method, and the performance gap increases as the number of antennas grows.
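The hybrid objective in (10) can be sketched in a few lines of NumPy. This is an illustrative stand-in for the training loss (the function signature and weight names are ours); in an actual training loop the same computation would be expressed in an autodiff framework.

```python
import numpy as np

def hybrid_loss(H_hat, H_true, pq_hat, pq_true, sum_rate, weights):
    """Weighted hybrid loss: supervised CSI/power MSEs plus unsupervised -sum_rate.

    H_hat/H_true: predicted and target downlink channels (complex arrays).
    pq_hat/pq_true: predicted and target concatenated power vectors [p; q].
    sum_rate: sum rate computed from the recovered beamforming matrix.
    weights: (lam_csi, lam_pw, lam_rate).
    """
    lam_csi, lam_pw, lam_rate = weights
    loss_csi = np.mean(np.abs(H_hat - H_true) ** 2)   # supervised channel loss
    loss_pw = np.mean((pq_hat - pq_true) ** 2)        # supervised power loss
    loss_rate = -sum_rate                             # unsupervised: maximize sum rate
    return lam_csi * loss_csi + lam_pw * loss_pw + lam_rate * loss_rate
```

Setting `lam_rate = 0` recovers purely supervised training, and `lam_csi = lam_pw = 0` recovers purely unsupervised training, the two baselines compared in Fig. 2.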
When $N = K = 10$, the proposed algorithm achieves about 10% higher sum rate than the supervised learning method, while the unsupervised learning method cannot achieve satisfactory sum rate performance, as explained above. The results show that, in sharp contrast to the case where perfect CSI is available, existing supervised and unsupervised learning methods cannot achieve satisfactory performance when the CSI needs to be learned.

D. Learning from the uplink pilot signal
In this subsection, we introduce the preprocessing used when the uplink channel information is in the form of the received pilot signal. Assume that the $K$ users transmit $L$ pilot symbols, and that the pilot signals sent by all users are collected in $\mathbf{X} \in \mathbb{C}^{K \times L}$. Then the received pilot signal in the uplink, $\mathbf{Y} \in \mathbb{C}^{N \times L}$, can be written as
$$\mathbf{Y} = \mathbf{H}_u \mathbf{X} + \mathbf{N}, \quad (11)$$
where $\mathbf{N} \in \mathbb{C}^{N \times L}$ is the received noise whose elements have zero mean and variance $\sigma_n^2$. In order to recover the uplink channel from the pilot signal, we choose the pilot $\mathbf{X}$ to be a sub-matrix of a scaled discrete Fourier transform (DFT) matrix of dimension $\max(K, L) \times \max(K, L)$ (multiplied by the square root of the transmit power), which has orthogonal columns and rows and whose elements have unit amplitude. For our proposed algorithm, instead of using the received pilot $\mathbf{Y}$, we use the least-square version $\tilde{\mathbf{Y}} \in \mathbb{C}^{N \times K}$ as the input to the neural network:
$$\tilde{\mathbf{Y}} = \mathbf{Y} \mathbf{X}^H. \quad (12)$$
Clearly, when $L \geq K$, $\mathbf{X} \mathbf{X}^H \propto \mathbf{I}_K$, which means that in this case the pilot sequences of different users are orthogonal; when $L < K$, there exists pilot contamination among users, which degrades the performance of estimating the uplink channel. Note that in our proposed approach we do not need to estimate the uplink channel explicitly; we learn the downlink beamforming from the uplink pilot directly.
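The pilot construction and least-square preprocessing can be sketched as follows. This is a minimal illustration under our own normalization choice (we divide by the pilot energy so that, in the noiseless case with $L \geq K$, the preprocessed input equals the uplink channel); the function names are ours.

```python
import numpy as np

def dft_pilots(K, L, pt):
    """Pilot matrix X (K x L): K rows of the L x L DFT matrix, scaled to power pt.

    Entries have amplitude sqrt(pt); for L >= K the rows are orthogonal,
    so X X^H = pt * L * I_K.
    """
    n = np.arange(L)
    F = np.exp(-2j * np.pi * np.outer(n, n) / L)   # unnormalized DFT, unit-amplitude entries
    return np.sqrt(pt) * F[:K, :]

def ls_input(Y, X, pt):
    """Least-square preprocessed network input, Y_tilde = Y X^H / (pt * L)."""
    L = X.shape[1]
    return Y @ X.conj().T / (pt * L)
```

With orthogonal pilots and no noise, `ls_input` exactly recovers $\mathbf{H}_u$; with noise it returns $\mathbf{H}_u$ plus a filtered noise term, which the network processes directly.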
For comparison, the traditional separate approach would first estimate the uplink channel, e.g., using a linear minimum mean-squared error (MMSE) estimator, and then use the estimate as the input of the neural network. Suppose the linear MMSE estimator takes the form
$$\hat{\mathbf{H}}_u = \mathbf{Y} \mathbf{R} + \mathbf{B}, \quad (13)$$
where $\mathbf{R}$ and $\mathbf{B}$ are the weighting coefficients. The corresponding channel estimate is obtained by solving the following optimization problem:
$$\min_{\mathbf{R}, \mathbf{B}} \ \mathbb{E}\left[\left\|\hat{\mathbf{H}}_u - \mathbf{H}_u\right\|^2\right]. \quad (14)$$
For zero-mean channels with uncorrelated, unit-variance entries, the optimal solution is $\mathbf{B} = \mathbf{0}$ and $\mathbf{R} = \mathbf{X}^H (\mathbf{X} \mathbf{X}^H + \sigma_n^2 \mathbf{I}_K)^{-1}$, so the linear MMSE estimate of $\mathbf{H}_u$ is given by
$$\hat{\mathbf{H}}_u = \mathbf{Y} \mathbf{X}^H \left(\mathbf{X} \mathbf{X}^H + \sigma_n^2 \mathbf{I}_K\right)^{-1}. \quad (15)$$

V. SINGLE-CELL MASSIVE MIMO DOWNLINK

The proposed algorithm framework in Section IV is general, so it can be applied to the massive MIMO downlink straightforwardly. However, the large number of antennas may introduce a high computational complexity, especially in the calculation of the loss function during training. In this section, we propose two techniques that can reduce the complexity of calculating the sum rate loss in the training process for massive MIMO systems, without compromising the end performance, especially when $N \gg K$.

A. Massive MIMO channel model
We assume, without loss of generality, that the BS is equipped with a uniform linear array (ULA) of $N$ antennas. The channel between the BS and a user (the user index is omitted for simplicity) that consists of $N_p$ paths is modelled as
$$\mathbf{h} = \sum_{p=1}^{N_p} \alpha_p e^{-j(2\pi f \tau_p - \phi_p)} \mathbf{a}(\theta_p), \quad (16)$$
where $\alpha_p$, $\tau_p$, $\phi_p$, and $\theta_p$ are the attenuation, the path delay, the phase shift, and the angle of arrival (AoA, for the uplink) or the angle of departure (AoD, for the downlink) of the $p$-th path, respectively. Moreover, $\mathbf{a}(\theta) \in \mathbb{C}^{N \times 1}$ is the array response vector defined as
$$\mathbf{a}(\theta) = \left[1, e^{j 2\pi \frac{d}{\lambda} \sin\theta}, \ldots, e^{j 2\pi \frac{(N-1)d}{\lambda} \sin\theta}\right]^T, \quad (17)$$
where $d$ is the antenna spacing and $\lambda$ is the wavelength. From the structure of the optimal beamforming (7), we can see that it involves the inversion of an $N \times N$ matrix. This causes a high complexity for massive MIMO systems, because the calculation of the sum rate in the loss function (10) requires the construction of the optimal beamforming according to (7), which in turn requires the inversion of an $N \times N$ matrix.
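The ULA response and sum-of-paths channel model can be sketched as follows (a minimal NumPy illustration; the function names, the half-wavelength default spacing, and the parameter layout are ours):

```python
import numpy as np

def ula_response(N, theta, d_over_lambda=0.5):
    """ULA array response a(theta): [1, e^{j2pi(d/lambda)sin(theta)}, ...]."""
    n = np.arange(N)
    return np.exp(2j * np.pi * d_over_lambda * n * np.sin(theta))

def multipath_channel(N, alphas, delays, phases, thetas, freq):
    """Sum-of-paths channel h = sum_p alpha_p e^{-j(2 pi f tau_p - phi_p)} a(theta_p)."""
    h = np.zeros(N, dtype=complex)
    for alpha, tau, phi, theta in zip(alphas, delays, phases, thetas):
        h += alpha * np.exp(-1j * (2 * np.pi * freq * tau - phi)) * ula_response(N, theta)
    return h
```

For a single broadside path (theta = 0) with unit gain, zero delay and zero phase, the channel reduces to the all-ones steering vector, which is a quick sanity check on the model.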

B. Dimension reduction
In the first technique, we aim to reduce the dimension of the matrix to be inverted when calculating the optimal beamforming using (7). The following proposition, adapted from results on interference channels [30], [31], is useful.

Proposition 1: Suppose $N \geq K$, the users' channels are linearly independent, and $\mathbf{h}_{d,i}^H \mathbf{h}_{d,j} \neq 0, \forall i \neq j$. Then, if $\mathbf{w}_k$ is a beamforming vector for user $k$ that corresponds to a rate point on the Pareto boundary, there exist complex numbers $\{\xi_{k,j}\}_{j=1}^{K}$ such that $\mathbf{w}_k = \sum_{j=1}^{K} \xi_{k,j} \mathbf{h}_{d,j}$. It can be proved using the same method as in [30], and the proof is therefore omitted.
Recall that $\mathbf{H}_d = [\mathbf{h}_{d,1}, \cdots, \mathbf{h}_{d,K}]$, and define $\boldsymbol{\xi}_k = [\xi_{k,1}, \cdots, \xi_{k,K}]^T \in \mathbb{C}^{K \times 1}$. Substituting $\mathbf{w}_k = \mathbf{H}_d \boldsymbol{\xi}_k$, the SINR expression becomes
$$\Gamma_k = \frac{|\mathbf{h}_{d,k}^H \mathbf{H}_d \boldsymbol{\xi}_k|^2}{\sum_{j \neq k} |\mathbf{h}_{d,k}^H \mathbf{H}_d \boldsymbol{\xi}_j|^2 + \sigma_0^2}.$$
Define the eigenvalue decomposition $\mathbf{H}_d^H \mathbf{H}_d = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^H$, where $\mathbf{U} \in \mathbb{C}^{K \times K}$ is the unitary eigenvector matrix and $\boldsymbol{\Lambda} \in \mathbb{C}^{K \times K}$ is the diagonal eigenvalue matrix. Now define the new beamforming vector $\mathbf{v}_k = \boldsymbol{\Lambda}^{1/2} \mathbf{U}^H \boldsymbol{\xi}_k$ and the new channel vector $\mathbf{g}_k = \boldsymbol{\Lambda}^{-1/2} \mathbf{U}^H \mathbf{H}_d^H \mathbf{h}_{d,k}$. Then the sum rate maximization can be written equivalently as
$$\max_{\{\mathbf{v}_k\}} \ \sum_{k=1}^{K} \log_2\left(1 + \frac{|\mathbf{g}_k^H \mathbf{v}_k|^2}{\sum_{j \neq k} |\mathbf{g}_k^H \mathbf{v}_j|^2 + \sigma_0^2}\right) \quad \mathrm{s.t.} \ \sum_{k=1}^{K} \|\mathbf{v}_k\|^2 \leq P. \quad (22)$$
We can see from the new formulation (22) that the size of the new channel matrix $[\mathbf{g}_1, \cdots, \mathbf{g}_K]$ reduces from $N \times K$ to $K \times K$. The structure of the optimal beamforming is revised to $\mathbf{v}_k^* = \sqrt{p_k} \frac{(\mathbf{I}_K + \sum_{j=1}^{K} \frac{q_j}{\sigma_0^2} \mathbf{g}_j \mathbf{g}_j^H)^{-1} \mathbf{g}_k}{\|(\mathbf{I}_K + \sum_{j=1}^{K} \frac{q_j}{\sigma_0^2} \mathbf{g}_j \mathbf{g}_j^H)^{-1} \mathbf{g}_k\|}$, and as a result the size of the matrix inversion is reduced from $N \times N$ to $K \times K$. Note that although the above dimension reduction technique reduces the dimension of the matrix inversion significantly, it involves extra matrix multiplications and an eigenvalue decomposition when constructing the new channel vectors $\{\mathbf{g}_k\}$. It was shown in [32] that standard linear algebra operations for a square matrix of dimension $n \times n$, including matrix inversion and eigenvalue decomposition, have the same time complexity as matrix multiplication, for which the Coppersmith-Winograd algorithm achieves a complexity of $\mathcal{O}(n^{2.376})$ [34]. Therefore, the benefit of the dimension reduction technique on the overall training time is only significant when $N \gg K$, as verified by the simulation results in Section VII.
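The reason the reduced problem is equivalent is that the transformation preserves all channel inner products, and hence every SINR value. A small NumPy sketch of the transformation, with a check of this property (the function name is ours):

```python
import numpy as np

def reduced_channels(H):
    """Map (N, K) channels to (K, K) effective channels g_k = Lambda^{-1/2} U^H H^H h_k.

    Uses the eigendecomposition H^H H = U diag(Lam) U^H; assumes H has full
    column rank so Lam has no zero entries.
    """
    Lam, U = np.linalg.eigh(H.conj().T @ H)
    T = np.diag(Lam ** -0.5) @ U.conj().T @ H.conj().T   # (K, N) reduction operator
    return T @ H                                          # column k is g_k
```

Since $\mathbf{g}_i^H \mathbf{g}_j = \mathbf{h}_{d,i}^H \mathbf{h}_{d,j}$ for all pairs, any beamformer expressed in the reduced space attains exactly the same SINRs and powers as its full-dimensional counterpart.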

C. ZF Beamforming in the Loss Function
Another technique to reduce the complexity of the matrix inversion in (7) when calculating the sum rate in the loss function is to use the ZF beamforming in (24), i.e., $\mathbf{W} = \alpha \mathbf{H}_d (\mathbf{H}_d^H \mathbf{H}_d)^{-1}$, where $\alpha$ is chosen to satisfy the power constraint, i.e., $\|\mathbf{W}\|_F^2 = P$. It has been shown in the seminal work on massive MIMO [33] that ZF beamforming is near-optimal when the number of antennas is large. From the complexity viewpoint, the advantage of the ZF beamforming is that it only involves the inversion of $\mathbf{H}_d^H \mathbf{H}_d$, which is a $K \times K$ matrix, and thus reduces the complexity of the matrix inversion in (7).
Similar to the dimension reduction technique, the advantage of the ZF beamforming is more prominent when $K \ll N_t$ and diminishes as $K$ increases, but its performance remains close to the WMMSE solution as long as $N_t/K \gg 1$. These properties are verified by the simulation results in Section VII.
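A minimal sketch of the ZF beamformer in (24), assuming the standard form $\mathbf{W} = \alpha \mathbf{H}_d (\mathbf{H}_d^H \mathbf{H}_d)^{-1}$ with $\alpha$ meeting the total power constraint; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def zf_beamforming(H_d, P):
    """Zero-forcing beamformer scaled to the total power constraint.

    H_d : (N_t, K) downlink channel matrix; P : total transmit power.
    Only the K x K matrix H_d^H H_d is inverted, which is cheap
    when K << N_t.
    """
    W = H_d @ np.linalg.inv(H_d.conj().T @ H_d)  # pseudo-inverse directions
    alpha = np.sqrt(P) / np.linalg.norm(W)       # so that ||alpha*W||_F^2 = P
    return alpha * W
```

By construction $\mathbf{H}_d^H \mathbf{W} = \alpha \mathbf{I}$, so all inter-user interference terms vanish, which is why this beamformer is near-optimal in the large-antenna regime.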

VI. MULTICELL MASSIVE MIMO DOWNLINK
In this section, we apply the proposed model-based learning to multicell massive MIMO systems; an example with seven cells is illustrated in Fig. 3. Specifically, we first introduce the distributed multicell massive MIMO downlink beamforming, which requires no signaling or data exchange between cells, then describe the uplink channel estimation via pilots, followed by the proposed model-based learning method.

A. Distributed Optimization of Beamforming in Multicell Massive MIMO Downlink
Consider an $L$-cell massive MIMO system, where in each cell a BS with $N_t$ antennas serves $K_l$ single-antenna users, $k = 1, \ldots, K_l$. Denote the beamforming vector for the $k$-th user in the $l$-th cell as $\mathbf{w}_{l,k} \in \mathbb{C}^{N_t \times 1}$; then the received scalar signal in the $l$-th cell is expressed as in (25), where $\mathbf{h}_{j,l,k} \in \mathbb{C}^{N_t \times 1}$ is the downlink channel from the BS in the $j$-th cell to the $k$-th user in the $l$-th cell, and $s_{l,k}$ is the signal for the $k$-th user in the $l$-th cell. The SINR of the $k$-th user in the $l$-th cell is then written as in (26). Based on (26), calculating the SINR of the $k$-th user in the $l$-th cell requires not only the beamforming vector $\mathbf{w}_{l,k}$ and the downlink channel $\mathbf{h}_{l,l,k}$, but also knowledge of every possible interfering cell $j = 1, \ldots, L$, $j \neq l$: the interfering BSs' downlink channels to the considered user, $\mathbf{h}_{j,l,k}$, and all users' beamforming vectors in those interfering cells, $\mathbf{w}_{j,i}$ with $i = 1, \ldots, K_j$. This means that to optimize the SINR in any single cell, the beamforming solution of this cell is coupled with the solutions of all other cells. Traditional methods, such as coordinated transmission, treat the multicell system as one large cell and use ZF beamforming to cancel the interference term in (26). These methods require cooperation between cells, such as the exchange of CSI and a centralized joint optimization of the beamforming vectors, which results in a high signaling overhead [35].
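To make the coupling in (26) concrete, the following sketch computes the SINR of one user; the nested-list channel layout `H[j][l][k]` is our own assumption for illustration. Note the loop over all other cells and users: the SINR of a single user depends on every other cell's beamformers, which is exactly the coupling discussed above.

```python
import numpy as np

def sinr(l, k, H, W, sigma2):
    """Downlink SINR of user (l, k), following the form of (26).

    H[j][l][k] : (N_t,) channel from BS j to user k of cell l
                 (nested-list layout assumed for illustration).
    W[j][i]    : (N_t,) beamformer of user i in cell j.
    """
    signal = abs(np.vdot(H[l][l][k], W[l][k])) ** 2
    interference = sum(
        abs(np.vdot(H[j][l][k], W[j][i])) ** 2
        for j in range(len(W)) for i in range(len(W[j]))
        if (j, i) != (l, k))
    return signal / (interference + sigma2)
```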
In order to decouple the beamforming between different cells, the signal to leakage plus noise ratio (SLNR) is used in this paper as an alternative, where the SLNR of the $k$-th user in the $l$-th cell is defined as in (27) [36]. Note that the key difference between the SLNR in (27) and the SINR in (26) is the interference term in the denominator: the SLNR in (27) accounts for the interference power 'leaked' by the beamformer $\mathbf{w}_{l,k}$ of the $k$-th user in the $l$-th cell to the users of all other cells, instead of the interference received by that user as in (26). Traditionally the data rate is defined based on the SINR in (26), but this requires cooperation between cells to estimate a user's rate performance. Since the SLNR definition in (27) shares a similar form with the SINR definition in (26), here we define an approximate data rate based on the SLNR of the $k$-th user in the $l$-th cell as in (28), where the subscript "SLNR" differentiates it from the original data rate definition, and the approximation is based on the high SNR assumption.
With the definition of SLNR in (27), the optimization of the beamforming vectors $\mathbf{w}_{l,k}$ for the $l$-th cell depends only on the downlink channels from the BS in the $l$-th cell to the users in all cells, which can be estimated via the uplink. More importantly, since the required channel information can be obtained by the BS of a single cell, the use of SLNR instead of SINR decouples the beamforming across cells. Therefore the multicell sum rate maximization problem can be decoupled into per-cell sum rate maximization problems, each addressed at its own cell as in (29), where $P$ is the transmit power limit of each BS. To solve (29), the beamforming vector is rewritten as $\mathbf{w}_{l,k} = \sqrt{p_{l,k}} \mathbf{u}_{l,k}$, where $\mathbf{u}_{l,k}$ satisfies $\|\mathbf{u}_{l,k}\| = 1$ for all $k = 1, \ldots, K_l$ and represents the direction of the beamforming vector, while the power is characterized by $p_{l,k}$. Then with (28), the objective function of (29) can be rewritten as in (30), and the per-cell sum rate maximization problem in (29) becomes (31). Problem (31) is non-convex, so we propose to use alternating optimization to solve it. Specifically, with a given power allocation $\{p_{l,k}\}$, the optimal beamforming direction $\mathbf{u}_{l,k}$ is given in closed form by (32). When the beamforming directions $\{\mathbf{u}_{l,k}\}$ are fixed, the objective function of (31) is a posynomial with respect to $\{p_{l,k}\}$, so (31) can be solved via geometric programming [37]. Since the objective function is nondecreasing at each step, this alternating optimization algorithm converges. We therefore use it to generate the downlink power solutions as the labelled data in the training process.
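For a fixed power allocation, the SLNR-maximizing direction in (32) admits the well-known closed form $\mathbf{u} \propto (\sigma^2 \mathbf{I} + \sum \mathbf{h}\mathbf{h}^H)^{-1}\mathbf{h}$, i.e., the dominant generalized eigenvector of the signal and leakage matrices. The sketch below (illustrative names, not the paper's code) computes it with a single matrix solve.

```python
import numpy as np

def slnr_direction(h, H_leak, sigma2):
    """Unit-norm beamforming direction maximizing the SLNR.

    h      : (N_t,) channel to the served user.
    H_leak : (N_t, M) channels to the users of all other cells.
    The maximizer of the generalized Rayleigh quotient is
    u proportional to (sigma2*I + H_leak H_leak^H)^{-1} h.
    """
    A = sigma2 * np.eye(h.shape[0]) + H_leak @ H_leak.conj().T
    u = np.linalg.solve(A, h)
    return u / np.linalg.norm(u)

def slnr(u, h, H_leak, sigma2):
    """SLNR of a unit-norm direction u, matching the form of (27)."""
    leak = np.linalg.norm(H_leak.conj().T @ u) ** 2
    return abs(np.vdot(h, u)) ** 2 / (sigma2 + leak)
```

Because the SLNR is a scale-invariant generalized Rayleigh quotient, normalizing the solution to unit norm loses nothing, and no other unit-norm direction can achieve a higher SLNR.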
When $N_t > \sum_{l=1}^{L} K_l$, the dimension reduction technique in Section V.B can also be applied to the multicell scenario to reduce the training complexity.
Note that the optimal beamforming based on SLNR involves only the downlink channels from the BS in a single cell to the users in all cells. This enables distributed beamforming at each cell with no inter-cell cooperation, which helps to reduce the signaling overhead in the multicell scenario. More importantly, it allows the proposed learning-based method to be decoupled across cells and to remain applicable when the system scales up.

B. Learning From Uplink Pilots
The optimization based on SLNR requires the BS in each cell to learn the downlink channels to the users in all cells from the uplink channel information, which can be achieved by estimation from uplink pilots.
Consider the $l$-th cell surrounded by $L-1$ neighbouring cells. Assume that each user in the multicell system transmits pilot symbols of length $\tau$. Then for the BS in the $l$-th cell, the received uplink pilot signal $\mathbf{Y}_l \in \mathbb{C}^{N_t \times \tau}$ is written as in (33), where $\mathbf{H}_{u,l}$ denotes the uplink channels from all users to the BS in the $l$-th cell, $\mathbf{N}_l \in \mathbb{C}^{N_t \times \tau}$ is the received noise whose elements have zero mean and variance $\sigma^2$, and $\mathbf{X} \in \mathbb{C}^{\sum_{l=1}^{L} K_l \times \tau}$ contains the pilots sent by all users. Similar to the single-cell scenario, the pilot matrix $\mathbf{X}$ adopts a sub-matrix of a DFT matrix of dimension $\max\{\sum_{l=1}^{L} K_l, \tau\} \times \max\{\sum_{l=1}^{L} K_l, \tau\}$, scaled by the square root of the transmit power. Then the least squares estimate of the uplink channel $\tilde{\mathbf{Y}}_l \in \mathbb{C}^{N_t \times \sum_{l=1}^{L} K_l}$ is used as the input of the neural network, and can be obtained from the received pilot signal $\mathbf{Y}_l$ as in (34). Clearly, to avoid pilot contamination between users, the pilot length must be no less than the total number of users, i.e., $\tau \geq \sum_{l=1}^{L} K_l$. For the traditional separate approach, the uplink channel is estimated using the linear MMSE method.
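The least squares step in (34) can be sketched as follows; the pilot construction and names are illustrative, assuming orthogonal rows drawn from a scaled unitary DFT matrix as described above.

```python
import numpy as np

def ls_uplink_estimate(Y, X):
    """Least-squares channel estimate from uplink pilots, as in (34).

    Y : (N_t, tau) received pilot block, Y = H X + N.
    X : (M, tau) pilot matrix, one row per user, tau >= M so that
        X X^H is invertible; with orthogonal DFT pilots X X^H = p*I.
    """
    return Y @ X.conj().T @ np.linalg.inv(X @ X.conj().T)

# illustrative pilots: M rows of a scaled unitary DFT matrix
M, tau, Nt, p = 7, 8, 16, 2.0
F = np.fft.fft(np.eye(tau)) / np.sqrt(tau)    # unitary DFT matrix
X = np.sqrt(p) * F[:M, :]
```

When $\tau \geq M$ the DFT rows are orthogonal and the estimate reduces to a simple correlation with the pilots; when $\tau < M$ some users share non-orthogonal pilots and the estimate is contaminated, which is the effect observed in the simulations.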

C. The Proposed Distributed Learning
In this subsection, we adapt the proposed model-based learning approach of Section IV to the multicell scenario. We use a different SLNR beamforming structure based on (32), given in (35). From (35), we can see that in order to construct the multicell beamforming, we need the downlink channel information and the downlink power allocation. Therefore, we can still use the hybrid loss function (10), rewritten in (36), except that the loss of the Power-Net only involves the downlink power of the $l$-th cell. Note that the proposed method based on SLNR and learning from uplink pilots can be generalized to larger systems. The above analysis is based on the $l$-th cell, which treats all other cells as interfering cells. Under the homogeneity assumption that conditions are similar in every cell of the multicell massive MIMO system, the analysis for the $l$-th cell applies to any cell. This means that the trained neural network's parameters are applicable to all cells in a distributed manner, as demonstrated by the simulation results in Section VII.C.

VII. SIMULATION RESULTS
In this section, numerical simulations are carried out to evaluate the performance of the proposed algorithms. For the proposed model-driven solution, when calculating the sum rate in the loss function, both the original algorithm that uses the beamforming solution in (7) and the variant that uses the simplified ZF beamforming in (24) are included. For comparison, we consider the following benchmark algorithms:
• The WMMSE solution: obtained by the iterative algorithm proposed in [2] assuming the downlink channel is available; it therefore serves as a performance upper bound.
• The learned channel and beamforming solution: obtained by first learning the downlink channel via supervised learning, then using the learned downlink channel to infer the beamforming solution via unsupervised training.
• The learned channel and ZF beamforming: the benchmark for the single-cell scenario. It first learns the downlink channel via supervised learning as above, and then constructs a ZF beamformer (24) from the learned channel.
• The learned channel and SLNR beamforming: the benchmark for the multicell scenario. It first learns the downlink channel via supervised learning as above, and then constructs an SLNR beamformer (32) from the learned channel.
Note that the SLNR-based approximate data rate is used only for neural network training, while the sum rate performance is calculated based on the actual SINR defined in (26). The closed-form, non-iterative ZF and SLNR beamforming solutions are chosen as benchmarks to ensure a low complexity comparable to our proposed algorithm. In the following, we present the simulation results and analysis for three scenarios: single-cell small-scale MIMO, single-cell massive MIMO, and multicell massive MIMO systems, as well as the generalization results.

A. Small-scale MIMO Scenario
In this scenario, we consider a TDD downlink system in which one BS has equal numbers of transmit antennas and users, i.e., $N_t = K$. We assume the uplink channel elements follow an independent Rayleigh distribution with zero mean and unit variance. Following the result in [11], the relation between the downlink channel and the uplink channel due to the radio front-end mismatch is modelled as in (37), where the unitary matrix $\Phi$ and $\delta \sim \mathcal{CN}(0, 1)$ model the mismatches in the frequency responses of the BS and the user sides, respectively. This mapping will be learned by the CSI-Net. Suppose the learned downlink channel is $\hat{\mathbf{H}}_d$; the learning performance is characterized by the normalized MSE (NMSE), defined as the average of $\|\hat{\mathbf{H}}_d - \mathbf{H}_d\|_F^2 / \|\mathbf{H}_d\|_F^2$. The structure and hyper-parameters of the proposed neural network are as follows. The CSI-Net uses four fully connected layers, each with $4KN_t$ neurons and the 'tanh' activation function; no activation function is employed at the output layer. The Power-Net employs four fully connected layers, each with $4KN_t$ neurons, the 'relu' activation function and batch normalization; its output layer uses 'softmax' as the activation function. The beamforming learning for the 'Learned Channel and Beamforming' solution uses the same neural network as the CSI-Net, except that batch normalization is included at each layer. $10^6$ training samples are used for this scenario. We also provide a brief complexity analysis for the online prediction. For a fully connected layer, suppose the input dimension is $n$ and the number of neurons in the hidden layer is $m$; then the numbers of multiplication and addition operations are both equal to $nm$. Therefore the overall neural network has an approximate complexity of $O(K^2 N_t^2)$ for the online prediction.
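The NMSE metric described above is straightforward to compute per sample; this is a minimal sketch with an illustrative name.

```python
import numpy as np

def nmse(H_hat, H):
    """Per-sample normalized MSE: ||H_hat - H||_F^2 / ||H||_F^2.

    In the experiments this quantity would be averaged over the test
    set, and is often reported in dB as 10*log10(nmse).
    """
    return np.linalg.norm(H_hat - H) ** 2 / np.linalg.norm(H) ** 2
```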
We first show the sum rate results of various algorithms in Fig. 4(a), when the transmit power is 20 dB and the number of users/BS antennas varies from 2 to 8. We assume that perfect uplink CSI $\mathbf{H}_u$ is available. It can be seen that the proposed solution achieves a sum rate close to that of the WMMSE algorithm and significantly outperforms the benchmark schemes. The performance of both benchmark methods that first learn the downlink channel is not satisfactory: the learned channel and beamforming scheme achieves the worst performance, while the one using ZF beamforming achieves a higher rate that is still much lower than the proposed solutions. To investigate the reason, we plot the downlink channel learning results in Fig. 4(b). The benchmark schemes that explicitly learn the channel first outperform the proposed solution in terms of achieving a lower estimation NMSE; however, their end performance, i.e., the sum rate, is much lower. This confirms the advantage of the proposed model-driven learning with hybrid training, which better exploits the available uplink channel information to optimize the end performance; the traditional approach that separately learns the channel and then designs the beamforming is not adequate since it is purely data-driven.

B. Single-cell Massive MIMO Scenario
In this scenario, we adopt the channel model specified in Section V for an FDD massive MIMO system and assume $N_t = 64$. The uplink and downlink operate at 2.5 GHz and 2.4 GHz, respectively. The antenna spacing is half the wavelength of the downlink signal. Since the DoA or AoD is limited within a certain region around its mean angle $\bar{\theta}$, it is modelled as $\theta \in [\bar{\theta} - \Delta\theta/2, \bar{\theta} + \Delta\theta/2]$ with $-\frac{\pi}{6} \leq \Delta\theta \leq \frac{\pi}{6}$. The path delay and the phase shift are uniformly distributed in the ranges $[0, 10^{-4}]$ and $[-\pi, \pi)$, respectively. We assume the downlink channel attenuation $\alpha_{d,p}$ of the $p$-th path follows an independent Rayleigh distribution with zero mean and unit variance. Given the ULA channel model in (18), the nonlinear relation between the uplink channel $\mathbf{h}_u$ and the downlink channel $\mathbf{h}_d$ is characterized by (40), where the unitary matrix $\Phi$ and $\delta \sim \mathcal{CN}(0, 1)$ model the mismatches in the frequency responses of the BS and the user sides, respectively; this mapping will be learned by the CSI-Net.
The structure and hyper-parameters of the proposed neural network are as follows. The CSI-Net contains three fully connected layers, each with $2KN_t$ neurons and the 'tanh' activation function. The Power-Net uses two 1-D convolutional layers with filter sizes of 16 and 8, batch normalization, the 'relu' activation function and a dropout rate of 0.3, followed by a fully connected layer with 256 neurons and the 'relu' activation function, and an output layer that uses 'softmax' as the activation function. We provide a brief complexity analysis for the online prediction. For 1-D convolutional layers, suppose there are $c_l$ kernels of size $s_l$ in the $l$-th convolutional layer and the input dimension is $d \times c_{l-1}$ (padding is added such that the first input and output dimensions remain the same, $d$, across layers); then the output dimension is $d \times c_l$, and the numbers of multiplication and addition operations are both equal to $d s_l c_{l-1} c_l$. Therefore for the Power-Net, the complexity of the online prediction is approximately $O(d \sum_l s_l c_{l-1} c_l)$, while the overall complexity is still $O(K^2 N_t^2)$ considering the CSI-Net.
The sum rate results of various algorithms are shown in Fig. 5(a), when the transmit power is 10 dB and the number of users varies from 2 to 10. It can be seen that the proposed solution achieves a sum rate very close to the WMMSE solution, and the use of ZF beamforming in the loss function incurs almost no performance loss. The proposed solutions outperform the learned channel and ZF beamforming solution by about 10%, although the latter performs much better than in the small-scale MIMO scenario. The learned channel and beamforming solution is still the worst because it uses a data-driven approach without taking the overall design objective into account when learning the downlink channel.
Next, we evaluate the required training time of the proposed algorithms when using the low-complexity implementation discussed in Section V. The training time of the proposed original algorithm, denoted by $T_0$, and the percentages of the time required by the reduced dimension and ZF beamforming techniques relative to $T_0$ are shown in Table I. It is obvious that when the number of users is small, the reduced dimension of the matrix inversion in the two low-complexity schemes leads to a much shorter training time. However, as the number of users increases, the gain of the low-complexity schemes diminishes. This is because the reduced dimension scheme involves an extra eigenvalue decomposition and extra matrix multiplications. In addition, as the number of users increases, the training time becomes dominated by the width of the fully connected layers, which is $2KN_t$, so the time saved on matrix inversion becomes insignificant.
The sum rate performance versus the number of pilots is shown in Fig. 5(b) when the number of users is $K = 6$. Similar to Fig. 5(a), it is confirmed again that the use of ZF beamforming in the loss function achieves almost the same performance as the original algorithm, and both are close to the WMMSE solution when the number of pilot symbols satisfies $\tau \geq 6$. The learned channel and ZF beamforming achieves good performance in this scenario, in contrast to the small-scale scenario. There is still a significant gap between the proposed solutions and the learned channel and beamforming solution, which verifies the superior performance of the proposed model-driven approach.

C. Multicell Massive MIMO Scenario
In the multicell massive MIMO scenario, we consider an FDD system in which the uplink and the downlink operate at different frequencies of 2.5 GHz and 2.4 GHz, respectively. In the simulations, a multicell system with $L = 7$ cells is considered, where each cell has one active user, as illustrated in Fig. 3. The total transmit power of each BS is 10 dBm, the bandwidth is 20 MHz and the noise power spectral density is -174 dBm/Hz. The radius of each cell is 200 m, and the users are randomly located in each cell following a uniform distribution. The minimum distance between each user and its serving BS is 10 m.
The channel attenuation includes both small and large scale fading effects. The relation between the small scale channel attenuations is the same as (40) in the single-cell scenario. The large scale fading $\beta_{\rm PL}$ (measured in dB; the uplink/downlink subscript is omitted) between the BS and the user is given by a log-distance model, where $\beta_0$ is the reference path loss gain measured in dB, which includes the effect of the central frequency, $\eta$ is related to the path loss exponent, and $d$ is the distance between the BS and the user measured in kilometers. In the simulation, the uplink and downlink path loss gains are specified accordingly. In the multicell massive MIMO scenario, the number of transmit antennas is one of the key factors that determines the users' end performance, so we first evaluate the sum rate of all users versus different numbers of antennas in Fig. 6(a). As seen from Fig. 6(a), the proposed solution achieves a sum rate close to that of the WMMSE solution and outperforms all benchmark algorithms, while the comparison among the benchmark algorithms follows a similar trend as in the single-cell scenario. It is also noticed that the proposed solution based on the ZF loss function performs close to the proposed solution based on the SLNR loss function, and the gap generally reduces as the number of antennas increases. It is also clear that the data-driven approaches are worse than the model-driven approaches, while the learned channel and beamforming solution is still the worst.
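The large scale fading model above can be sketched as follows, assuming the standard log-distance form $\mathrm{PL}(d) = \beta_0 + 10\eta \log_{10}(d)$; the constants in any call are placeholders, not the paper's values.

```python
import math

def path_loss_db(d_km, beta0_db, eta):
    """Log-distance path loss in dB (assumed form of the model above):
    PL(d) = beta0 + 10*eta*log10(d), with d in kilometers and beta0
    absorbing the carrier-frequency-dependent reference loss.
    """
    return beta0_db + 10.0 * eta * math.log10(d_km)
```

At the reference distance of 1 km the loss equals $\beta_0$, and each decade of distance adds $10\eta$ dB, which is how the uplink and downlink gains differ only through their frequency-dependent $\beta_0$ terms.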
Next we investigate the sum rate performance versus different numbers of pilot symbols in Fig. 6(b), when $N_t = 64$ and the uplink channels are estimated via the pilots. It can be seen that the performance of all algorithms improves as the number of pilots increases, but there is a significant gap from the WMMSE solution when $\tau < 7$. When $\tau \geq 7$, the proposed solutions with SLNR and ZF loss functions achieve performance close to the WMMSE solution and significantly outperform the other benchmark schemes. This observation is expected, as there are a total of seven users in the considered scenario: when the number of pilots is less than seven, some users must use non-orthogonal pilots, which deteriorates the accuracy of the uplink channel recovered from the pilots.

D. Generalization
In this subsection, we study the generalization performance of the proposed joint training method in the multicell massive MIMO scenario when $N_t = 32$. The model under test is trained with a total transmit power of 10 dBm for the originally considered 7-cell scenario with no Doppler effect. First, we evaluate the performance of the trained model under different levels of transmit power in Fig. 7(a). The inference of this model still achieves a performance close to the WMMSE solution when the transmit power is near 10 dBm, but the gap grows as the testing power deviates further. This shows the proposed method can generalize to scenarios with a small variation of the testing power. Secondly, we consider systems with different numbers of cells, with the results presented in Fig. 7(b). Both proposed solutions achieve a sum rate close to the WMMSE solution under different cell numbers, which confirms that the proposed distributed learning solution generalizes well as the number of cells increases. Finally, we consider user mobility characterized by the Doppler effect in Fig. 8. Note that in practical systems, the Doppler effect may be estimated and compensated at the receiver side [38]; here we assume it is not compensated, so it affects the estimated CSI and the sum rate performance. We consider maximum Doppler frequencies in the range of 10 to 150 Hz, corresponding to speeds from 1 to 20 m/s. Fig. 8 shows that the sum rate of the proposed solutions degrades moderately compared to the WMMSE solution as the Doppler frequency increases, but remains significantly higher than that of the benchmark solutions.

VIII. CONCLUSIONS
In this paper, we have proposed a new downlink beamforming optimization algorithm that maximizes the sum rate using deep learning when only the uplink channel information is available and its mapping to the downlink channel is unknown. We introduced a model-driven learning approach that exploits the structure of the optimal beamforming solution to facilitate an effective neural network design. The proposed approach was extended to massive MIMO and distributed multicell scenarios. Simulation results demonstrated that our proposed algorithm approaches the performance of the conventional WMMSE algorithm and achieves a much higher sum rate than the benchmark schemes. These results show the importance of model-driven and holistic learning approaches to optimize downlink beamforming in practical systems.

APPENDIX

By setting the first-order derivative of (46) to zero, the optimal $\mathbf{R}^*$ is given by (47). The optimal $\mathbf{B}^*$ and the linear MMSE estimate $\hat{\mathbf{H}}$ can be obtained by substituting (47) into (45) and (13), respectively.