A Classification-Regression Deep Learning Model for People Counting

Abstract: In this paper, we construct a multi-task deep learning model that simultaneously predicts the number of people and the level of crowd density. Motivated by the success of "ambiguous labelling" in age estimation, we apply the same strategy to the people counting problem. We show that this is a reasonable strategy because people counting is similar to age estimation. Ambiguous labelling also allows us to augment the training dataset, which is a desirable property when training a deep learning model. In a series of experiments, we show that the "ambiguous labelling" strategy not only improves the performance of deep learning but also enhances the prediction ability of traditional computer vision methods such as the Random Projection Forest with hand-crafted features.


I. INTRODUCTION
In many application scenarios, there is a need to count the number of people in a scene. For example, in public spaces such as airports and railway stations, knowing the number of people present can help manage the space better and ensure public security. With the widespread installation of visual surveillance cameras in such public spaces, it is possible to count people automatically by analyzing surveillance videos. Using computer vision and machine learning techniques for people counting has therefore attracted a lot of interest in the literature. However, like many computer vision applications, people counting in video is a very challenging problem.
In recent years, deep neural networks have emerged as a powerful technique for many computer vision problems. In this paper, we are inspired by the strong performance of deep learning on various vision tasks [1], [2], [3] and apply it to extract deep features for the crowd counting problem. In previous work, several kinds of deep learning models have been proposed to address the people counting problem. Zhang et al. [4] and Wang et al. [5] construct deep networks that directly output the people number. Some later works [6], [7] apply deep networks to produce a density map instead of the people number to achieve better performance. The density map presents the positions of human heads and can thus provide the people number. However, such methods require head positions to be labelled when constructing training datasets, which limits their scalability to real-world applications. In addition, occlusion is a severe problem in crowd counting. In high-density crowds, it is difficult for a human annotator to label accurate head positions and provide reliable people numbers for the training datasets.
Inspired by the success of the multi-task deep learning method [8], we propose a classification-regression deep learning model that takes the whole surveillance image as input and outputs both the number of people and the level of crowd density. We show that such a multi-task network structure learns a more discriminative feature representation than a network that outputs only the people number, because the density-level task provides a coarse count that is less affected by variation in image scale. The work of [8] simultaneously produces a density map and a 10-way crowd count classification. Our method differs in that it predicts the people number instead of producing a density map. Apart from the aforementioned reasons, directly predicting the people number requires fewer computational resources, since producing a density map is usually based on a convolutional layer with a filter size of 1 × 1 that maps the feature map to the density map. In contrast, our method achieves performance comparable to that of [8] using only one fully-connected layer after the base network.
To address the occlusion problem, we also adopt a strategy called "ambiguous labelling". Ambiguous labelling was first applied to the age estimation problem [9], [10], [11], since faces of neighbouring ages usually present similar image features. In that work, the authors assign ambiguous labels to input face images and treat the problem as a classification task. We reason that the same strategy can be applied to the crowd counting problem. One reason is that people counting is similar to age estimation: for instance, an image of 500 people is similar to an image of 510 people. Another reason is that people counting datasets are usually small, which is not sufficient to train a deep learning model with a large number of parameters. Ambiguous labelling allows us to create several people number labels for each input image and thereby augment the training dataset for the deep learning model. We provide a detailed analysis in Section III. In the experiments, we show that this method is effective not only for the deep learning model, but also for traditional computer vision methods such as the random projection forest model [12].

II. RELATED WORK
The crowd counting task was initially solved by detection. Different kinds of features are used to detect the bodies of pedestrians, including motion features [13], histogram-of-gradients [14] and Bayesian model-based segmentation [15]. However, occlusion becomes a serious problem when these methods are applied to high-density crowds. Part-based detection methods were then developed to solve this problem [16], [17]. These methods usually take a long time to count people since they have to exhaustively scan each frame of the video with the trained detector. Another approach is to cluster trajectories that have coherent motion and then use the number of clusters to estimate the number of moving pedestrians [18], [19]. One problem of the clustering method is that it provides accurate results only when reliable trajectories can be extracted. Thus, this approach cannot handle occlusion or low video frame rates, both of which break the feature tracks. Foroughi et al. [20] treat people counting as a classification problem. They apply sparse representation to capture the hidden structure and semantic information in the image data, and further reduce the feature dimension by random projection. However, one serious problem with the classification method is that if any label (i.e. a number of people) in the testing set is not included in the training set, the method cannot achieve high accuracy. Their algorithm therefore requires a large training set that covers almost all the possible situations in the testing set.
A more suitable approach to solving the aforementioned problems is to count by regression. Low-level features are first extracted and then mapped to the people number by a regression model. As this kind of approach does not need to detect and track each individual person, it has relatively low computational cost and demonstrates promising results on the occlusion problem. A variety of features have been used by previous works to estimate the crowd density, such as total area [21], [22], edge count [23], [24] and texture features [25]. Chan et al. [26] take perspective distortion into account and experiment with additional features such as the Minkowski fractal dimension to estimate the irregularity of edges.
The traditional approaches suffer from two main problems. Firstly, they rely heavily on background segmentation techniques to remove noise. Secondly, an unavoidable step in the traditional approaches is the extraction of hand-crafted features. However, designing hand-crafted features is not easy, and it is usually difficult to find an optimal hand-crafted feature representation. The deep learning approach addresses both problems. It does not need background segmentation to pre-process images, and it is able to count people from different perspectives [6]. Another advantage is that deep learning can be constructed as an end-to-end model, which takes the whole image as input and outputs the people number or the head positions. This means that feature design is not a necessary step when applying deep learning.
Several previous works apply deep learning to the problem of people counting. Initially, the deep learning framework was usually employed to directly output the people number. Zhang et al. [4] propose a Convolutional Neural Network (CNN) based framework to extract deep features of the crowd scene and use a data-driven method to fine-tune the CNN model to the target scene. Wang et al. [5] also construct a deep network in order to estimate extremely dense crowds. Marsden et al. [27] apply a scale-aware deep learning model with a single-column fully convolutional network that takes multiple scales of the image as input in the prediction stage. Each scale of the image produces a people number, and the final count is the average of these estimates.
Apart from directly predicting the people number, another way to apply deep learning is to generate a density map and then obtain the people number from it. Zhang et al. [6] first developed this method. They convolve a labelled image with a Gaussian kernel and then compute the people number by summing the pixel values. Several subsequent works also produce density maps with deep learning. Boominathan et al. [28] combine one deep network and one shallow network to predict a density map for a given crowd image. Sindagi et al. [8] propose a cascaded deep network structure that simultaneously classifies the crowd into different levels and produces a density map. However, the density map approach requires the head positions to be labelled for the whole dataset, which is a time-consuming process for high-density crowds or large-scale datasets.

III. APPLICATION OF AMBIGUOUS LABELS TO PEOPLE COUNTING
Here we explain the rationale for applying the "ambiguous labelling" strategy to the people counting problem.
Firstly, we show that the people counting problem is similar to the age estimation problem. Fig. 1 presents a typical case in people counting. The ground-truth number for Fig. 1(a) is 26 persons, while the person number in Fig. 1(b) is 31. Although the people numbers are different, the major contents of both images are very close. This is confirmed by the traditional features extracted from both images: the two main features (segment area and perimeter-area) employed by the previous work [26] are almost the same. Looking into the details of both images, three minor differences lead to the different person numbers: (1) In the red bounding box, a woman is pushing a stroller with a baby, but the baby's body is small in the image. (2) In the green bounding box, a walking woman is occluded by an obstruction, and only part of her body is shown. (3) In the yellow bounding box, three persons' heads appear in the image, but only a part of each head can be seen. Thus, similar image features do not always correspond to the same person number. The same holds in the age estimation problem, where neighbouring ages may present similar image features. This is the main reason that we can assign several labels to each input image, as done in the previous age estimation work.
Secondly, the "ambiguous labelling" strategy enables us to create an augmented training dataset for the deep learning model. As insufficient training data can lead to over-fitting, a desirable training dataset should have multiple images for each label. However, the mainstream people counting datasets (the UCSD and Mall datasets) contain a limited number of images for each people number. We can therefore improve the predictive ability of the model by enlarging the training dataset. By assigning several labels to each image in the training dataset, we obtain a training dataset much larger than the original one. This means that for each specific people number (training label), we can find a variety of crowd scenes (training images) in the training dataset. The deep learning model can thus learn more discriminative features from a sufficient number of training images.

IV. LABEL AMBIGUITY CONSTRUCTION
In this section, we introduce our method to model the randomness of the people number and thus to create ambiguous labels for each input surveillance image. For each scalar-valued people number label l ∈ R of the input image, we seek a label distribution that satisfies two criteria: (1) the ground-truth value should have the highest probability of being assigned to the image; and (2) the farther a label is from the ground truth, the lower the probability with which it should be assigned to the image. In this paper, we adopt the Gaussian distribution to model the ambiguous labels for each surveillance image, as shown in Fig. 2, with the mean value µ equal to the ground-truth value. The standard deviation σ of the Gaussian distribution is generally unknown, but the method works well when σ is carefully chosen [11]. We empirically set σ to 2 in the experiments. From the constructed Gaussian distribution, we randomly sample M labels for each input image. As occlusion usually appears in relatively high-density crowds, we only apply the "ambiguous labelling" strategy to images with a people number over 15.
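The sampling procedure described above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' implementation: the function name `sample_ambiguous_labels` is hypothetical, the default M = 5 follows the experiment setup, and keeping the samples real-valued and clamping them at zero are assumptions that the text does not specify.

```python
import random


def sample_ambiguous_labels(count, m=5, sigma=2.0, threshold=15, rng=None):
    """Return label candidates for one image whose ground-truth count is `count`.

    Images with at most `threshold` people keep their single ground-truth label.
    Denser images get `m` labels drawn from a Gaussian N(count, sigma^2).
    """
    rng = rng or random.Random()
    if count <= threshold:
        return [float(count)]
    # Assumption: sampled labels stay real-valued and are clamped at zero.
    return [max(0.0, rng.gauss(count, sigma)) for _ in range(m)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(sample_ambiguous_labels(10, rng=rng))  # sparse scene: one label
print(sample_ambiguous_labels(26, rng=rng))  # dense scene: five labels near 26
```

Pairing each sampled label with the same image yields up to M training pairs per dense image, which is how the strategy enlarges the training set.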

V. DEEP CLASSIFICATION-REGRESSION LEARNING MODEL
In this paper, we do not use the ResNet [1] or VGG [2] deep learning models as the base convolutional network. The reason is that crowd counting datasets are relatively small (usually around 2000 images), which is not sufficient to train a ResNet or VGG network with a large number of parameters. For the crowd counting problem, we instead construct the convolutional network from a custom network structure, as shown in Fig. 3. We build the multi-task deep learning model by connecting two parallel sub-networks to the base convolutional network. One sub-network predicts the people number and the other estimates the crowd density level.
The people counting branch consists of one fully-connected layer with 256 neurons, with the Rectified Linear Unit (ReLU) as the activation function. This branch produces the predicted people number $\hat{l}_k$ for the input image $x_k$ with label $l_k$, and we use the Mean Squared Error (MSE) as the objective function for this branch:
$$L_{reg} = \frac{1}{K} \sum_{k=1}^{K} (\hat{l}_k - l_k)^2 \qquad (1)$$
where $K$ is the number of training samples.
The classification branch aims to classify the input image into one of the density levels. We create classification labels for each dataset with an interval of 10 people. For instance, if the maximum people number in the training dataset is 100, then we create 11 labels for the dataset. Level-1 density refers to a people number of 0 to 10, and level-2 refers to a people number of 11 to 20. The remaining levels are defined in the same manner, with level-11 referring to a people number above 100. The classification branch also contains a fully-connected layer with 256 neurons and the ReLU activation function. We use the softmax function as the classifier and the cross-entropy error as the loss function:
$$L_{cls} = -\sum_{c} p_c \log q_c \qquad (2)$$
where $p$ is the ground-truth distribution over the density levels, and $q$ is the vector of class probabilities produced by the softmax classifier. The total loss for the whole deep learning model is then a weighted combination of the two branch losses (Eq. 3), where λ is the weighting factor.
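The density-level labelling and the combined objective can be illustrated with a short standard-library Python sketch. The function names are hypothetical, and two points are assumptions rather than statements from the text: the weighting factor λ is applied to the classification term (total = MSE + λ · cross-entropy), and the cross-entropy is averaged over the batch. With a one-hot ground-truth distribution, the cross-entropy reduces to the negative log-probability of the true level.

```python
import math


def density_level(count, interval=10, max_train_count=100):
    """Map a people number to a 1-based density level (0-10 -> 1, 11-20 -> 2, ...)."""
    num_levels = math.ceil(max_train_count / interval) + 1  # last level: above max
    level = max(1, math.ceil(count / interval))
    return min(level, num_levels)


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(true_level, logits):
    """Cross-entropy against a one-hot target at the 1-based `true_level`."""
    return -math.log(softmax(logits)[true_level - 1])


def multitask_loss(pred_counts, true_counts, logits_batch, lam=2.0):
    """Assumed form of the total loss: MSE + lam * mean cross-entropy."""
    n = len(true_counts)
    mse = sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts)) / n
    ce = sum(
        cross_entropy(density_level(t), z)
        for t, z in zip(true_counts, logits_batch)
    ) / n
    return mse + lam * ce


print([density_level(c) for c in (0, 10, 11, 20, 100, 101)])
print(multitask_loss([25.0], [26.0], [[0.0] * 11]))
```

If λ instead weights the regression term in the original formulation, only the last line of `multitask_loss` would change.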

A. Experiment Setup
For the parameter settings, we initialize the weights of the whole deep network from a Gaussian distribution with zero mean and a standard deviation of 0.01, and set the biases to zero. We empirically set λ = 2 in Eq. 3. We then optimize the network by Stochastic Gradient Descent (SGD) with a learning rate of 0.01 and a mini-batch size of 128. In the experiments, the network usually converges in around 30 epochs. When creating ambiguous labels for each dataset, we randomly sample M = 5 labels from the Gaussian distribution for each image.
Fig. 2. The process of assigning ambiguous labels to an input image. The ground-truth value of the input image is regarded as the mean value µ of the Gaussian distribution. We randomly sample M = 5 labels for each image.
We conduct all the experiments on the UCSD pedestrian dataset [26] and the Mall dataset [29], which are two well-known datasets for evaluating people counting algorithms. Both datasets contain 2000 frames captured by a stationary camcorder, from an outdoor and an indoor scene respectively. Example images from the two datasets are shown in Fig. 4.
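The training settings above (Gaussian initialization with standard deviation 0.01, zero biases, SGD with learning rate 0.01, mini-batches of 128) can be sketched framework-free as follows. The helper names are hypothetical and the layer sizes in the usage lines are placeholders; plain SGD without momentum or weight decay is an assumption, since the text does not mention either.

```python
import random


def init_fc_layer(n_in, n_out, rng, std=0.01):
    """Weights drawn from N(0, std^2); biases set to zero."""
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def sgd_update(values, grads, lr=0.01):
    """One plain SGD step: value <- value - lr * gradient."""
    return [v - lr * g for v, g in zip(values, grads)]


def minibatches(items, batch_size=128):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


rng = random.Random(0)
weights, biases = init_fc_layer(n_in=8, n_out=256, rng=rng)  # placeholder sizes
print(len(weights), len(weights[0]), biases[:3])
print(len(list(minibatches(list(range(800))))))  # batches per epoch for 800 images
```

In practice the training list would be shuffled before each epoch and would contain the augmented (image, label) pairs produced by ambiguous labelling.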

We split the datasets as in previous work: in the UCSD dataset, frames 601-1400 are used for training; in the Mall dataset, the first 800 frames are used. The remaining frames in each dataset are used for testing. Two evaluation metrics are used for numerical testing and comparison with state-of-the-art algorithms. The first is the mean absolute error (MAE), which measures the average absolute error over the testing frames:
$$MAE = \frac{1}{N} \sum_{i=1}^{N} |m_i - \hat{m}_i|$$
where $N$ is the total number of test images, $m_i$ is the ground truth for the $i$th test image, and $\hat{m}_i$ is the corresponding prediction. The second is the mean squared error (MSE), which measures the average squared error:
$$MSE = \frac{1}{N} \sum_{i=1}^{N} (m_i - \hat{m}_i)^2$$
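As a concrete illustration of the split and the two metrics, the following standard-library Python sketch may help. The function names are hypothetical, and 1-based frame numbering is an assumption made so that "frames 601-1400" maps directly onto the indices.

```python
def split_frames(dataset, num_frames=2000):
    """Return (train, test) frame numbers, assuming frames are numbered from 1."""
    if dataset == "ucsd":
        train = list(range(601, 1401))  # frames 601-1400
    elif dataset == "mall":
        train = list(range(1, 801))     # first 800 frames
    else:
        raise ValueError("unknown dataset: " + dataset)
    train_set = set(train)
    test = [f for f in range(1, num_frames + 1) if f not in train_set]
    return train, test


def mean_absolute_error(truth, pred):
    """MAE: average of |m_i - m_hat_i| over the test images."""
    return sum(abs(m - p) for m, p in zip(truth, pred)) / len(truth)


def mean_squared_error(truth, pred):
    """MSE: average of (m_i - m_hat_i)^2 over the test images."""
    return sum((m - p) ** 2 for m, p in zip(truth, pred)) / len(truth)


train, test = split_frames("ucsd")
print(len(train), len(test))
print(mean_absolute_error([10, 20], [12, 17]), mean_squared_error([10, 20], [12, 17]))
```

Both splits leave 1200 of the 2000 frames for testing, so the two datasets are evaluated on test sets of equal size.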

B. Comparing with Hand-crafted Features
In the first experiment, we compare our deep learning method with traditional computer vision methods that employ hand-crafted features, including our random projection forest. Table I presents the results of this experiment. It can be seen that our deep learning method significantly outperforms the traditional methods. We also conducted an experiment on the Random Projection Forest (RPF) [12] with different kinds of features: the same hand-crafted features (hf) as [26], the deep features from the FC layer in the regression branch (fc1), and the deep features from the FC layer in the classification branch (fc2). The deep features from the deep learning model are more discriminative than the hand-crafted features, and the features from fc1 are better than those from fc2, because the regression branch captures the crowd density in finer detail than the classification branch.
We also compare our method with CNN-based approaches, including Zhang et al. [4], Kumagai et al. [31], Sam et al. [32], and Sheng et al. [33].

C. Comparing with CNN-based Approaches
From Table II we can see that our CNN method achieves the best performance on the UCSD dataset with MAE as the evaluation criterion, and slightly worse performance than Sam et al. [32] on MSE. On the Mall dataset, our deep learning approach provides performance comparable to the other CNN methods. Compared with the other approaches, the classification branch in our model provides a coarse estimate of the people density, which is less influenced by variation in perspective and image scale.

D. Evaluation of Ambiguous Labelling
We then conduct an experiment to evaluate the effectiveness of the ambiguous labelling strategy. We apply ambiguous labelling to both the deep learning model and the random projection forest model. From Table III we can see that ambiguous labelling improves the performance of both models by providing a larger training dataset. This confirms that ambiguous labelling is effective not only for age estimation, as shown in previous work, but also for crowd estimation.
As we propose a multi-task deep learning model, it is also necessary to evaluate the contribution of the classification branch. We compare two models: the full model and the model without the classification branch. From Fig. 5 we can see that including the classification branch is necessary. The classification branch provides a coarse count that is less influenced by the image scale and the variation in perspective. Accordingly, the full model with two branches performs much better than the model without the classification branch on both datasets.
One inevitable problem when applying a deep learning model is the size of the dataset. An insufficient training dataset leads to over-fitting and reduces the generalization ability of the model.

F. Evaluation of the Influence of Dataset Size
In this experiment, we modify the training dataset to evaluate the influence of dataset size. When testing on the UCSD dataset, we also include the whole Mall dataset in the training dataset. When testing on the Mall dataset, we add the whole UCSD dataset to the training dataset. This results in four kinds of model. From Table IV we can see that as the training data grows, the performance of the deep learning model improves as well. This supports the assumption that a larger dataset leads to better performance when applying a deep learning model.

VII. CONCLUSION
In this paper, we have constructed a multi-task deep learning model for the crowd estimation problem. We show that the deep learning method outperforms previous computer vision methods based on hand-crafted features. Apart from employing deep features, we propose an ambiguous labelling method that creates several labels for each input image. The experimental results confirm the effectiveness of the ambiguous labelling method, which increases the performance of both the deep learning method and our previous random projection forest method.