Crop classification by support vector machine with intelligently selected training data for an operational application

The accuracy of a supervised classification is dependent to a large extent on the training data used. The aim in training is often to capture a large training set to fully describe the classes spectrally, commonly with the requirements of a conventional statistical classifier in mind. However, it is not always necessary to provide a complete description of the classes, especially if using a support vector machine (SVM) as the classifier. An SVM seeks to fit an optimal hyperplane between the classes and uses only some of the training samples that lie at the edge of the class distributions in feature space (support vectors). This should allow the definition of the most informative training samples prior to the analysis. An approach to identify informative training samples was demonstrated for the classification of agricultural classes in south‐western part of Punjab state, India. A small, intelligently selected, training dataset was acquired in the field with the aid of ancillary information. This dataset contained the data from training sites that were predicted before the classification to be amongst the most informative for an SVM classification. The intelligent training collection scheme yielded a classification of comparable accuracy, ∼91%, to one derived using a larger training set acquired by a conventional approach. Moreover, from inspection of the training sets it was apparent that the intelligently defined training set contained a greater proportion of support vectors (0.70), useful training sites, than that acquired by the conventional approach (0.41). By focusing on the most informative training samples, the intelligent scheme required less investment in training than the conventional approach and its adoption would have reduced the total financial outlay in classification production and evaluation by ∼26%. 
Additionally, the analysis highlighted the possibility to further reduce the training set size without any significant negative impact on classification accuracy.


Introduction
The availability of accurate and up-to-date land-cover maps is crucial for many applications. Compared to conventional methods of surveying, remote sensing can be an efficient and accurate tool for the provision of land cover information at frequent intervals over large areas. Despite the considerable potential of remote sensing as a source of land cover information, many problems are encountered in its use, not least that the accuracy of the derived land cover information may often be viewed as insufficient by the user community (Wilkinson 1996, Foody 2002, 2007). Many factors are responsible for this situation, including the nature of the classes being studied, the properties of the sensing system used to acquire the imagery and the classification techniques used to extract thematic information from the imagery (Foody 2002, Pal and Mather 2003).
Supervised classification is widely used for the extraction of land cover information from remotely sensed data. Supervised classification comprises three stages: training, allocation and testing. In the training stage, areas of known ground identity (class membership) are typically identified on the image. The spectral response of the training areas may be used to generate descriptive statistics for the land cover classes to inform the second, class allocation, stage of the classification. Finally, the accuracy of the classification is evaluated in the testing stage, usually with a sample of cases not used in training the classifier. The value of the output generated by a supervised classification is typically a function of its accuracy. The accuracy of supervised classification is dependent on the first two stages of the classification, over which the analyst has considerable control. Consequently, considerable effort has been directed at these stages, often with the overall aim of increasing the accuracy of classification. Much research has, for example, focused on the allocation stage, with particular regard to the classifiers used (Foody and Mathur 2004a). Achieving an optimal classification is, however, a challenging and open problem (Ho et al. 1994).
In addition to the classifier adopted, the accuracy of a supervised classification is dependent to a large extent on the quality of the training data used. Indeed, the nature of the training stage can have a larger impact on classification accuracy than the classification technique used (Hixson et al. 1980, Campbell 2002). This situation has prompted research on issues related to the design of the training stage of a supervised image classification. This research has addressed issues including those connected with the sampling design (Campbell 2002, Chen and Stow 2002), the size of the training set (Congalton 1991, Foody and Mathur 2004a) and the composition of the training set (Foody et al. 1995), as well as issues such as the spacing of training samples and the time of sampling with respect to that of image acquisition. However, most attention has focused on the size of the training set, that is, the number of samples used for training the classifier. This issue has frequently attracted attention because of the costs in terms of time and finance involved in the acquisition of a large training set (Buchheim and Lillesand 1989, Jackson and Landgrebe 2001).
The design of the training stage often appears to have been guided by a classical statistical view of the classification process. Statistical classifiers such as the widely used maximum likelihood (ML) classification are based on statistical descriptions of the classes generated from the training data. Such classifiers require a complete description of each class in feature space. For this, a large training set is often required. Additionally, the acquisition of training data from a wide range of geographical locations is encouraged to help capture and represent the full spectral variability of the classes.
Many recommendations have been made about the size of the training set required for an analysis, typically based on the classical statistical view of the classification process. For example, the literature often suggests that the size of the training set required is a function of the number of spectral wavebands used and advises that a sample comprising at least 30 times the number of discriminatory wavebands used in the analysis is acquired (Mather 2004). In general, studies have shown that classification accuracy tends to be positively related to training set size (Zhuang et al. 1994, Foody et al. 1995, Pal and Mather 2003, Foody and Mathur 2004a). Conventional training data acquisition schemes, therefore, aim to capture a large training set spread all over the study area. However, such recommendations are general and are often made without any regard to the study area, the complexity of the classes therein, the classifier to be used or the aim of the analysis (Foody et al. 2006). Different classifiers applied to the same dataset often produce dissimilar allocations even if using the same training data (Huang et al. 2002, Foody and Mathur 2004a). This can be attributed to the way the classifiers use the training data and how they partition the feature space. For example, a parametric classifier such as the ML classification is based on an assumed model and, therefore, often requires a large training sample to ensure that the statistical parameters are able to completely describe the classes. However, non-parametric classifiers such as decision trees and neural networks are not based on any parametric model but use the training data directly for training. Foody (1999) has shown that with a multi-layer perceptron neural network, the training samples that lie at the edge of a class distribution and between the distributions of two or more classes in feature space are the most informative for an accurate classification.
This indicates that some training samples are more useful than others for this type of classifier. Variation in training sample importance may allow a classification to be undertaken with a small sample without negative impact on classification accuracy if the most useful training samples are used (Foody and Mathur 2004b). Given that an objective in classification is often to achieve a high accuracy with, if possible, a small number of training samples in order to make the classification process as useful and economical as possible the ability to identify the most useful training sites would, therefore, be advantageous. The desire is, therefore, to find an approach to classification that allows accurate classification from small training sets. One attractive classifier for this application is a support vector machine (SVM).
SVM classifications may be more accurate than widely used alternatives such as classification by ML, decision tree and neural network (Huang et al. 2002, Foody and Mathur 2004a, Melgani and Bruzzone 2004). Typically, the SVM classification aims to fit an optimal separating hyperplane (OSH) between classes by focusing on the training samples that lie at the edge of the class distributions, the support vectors. The OSH is a hyperplane oriented in space such that it is placed at maximum distance between the two classes. That is, the approach maximizes the margin between the classes. It is because of this orientation that an SVM is expected to generalize more accurately on unseen cases than classifiers that aim to minimize the training error, such as neural networks. As with a neural network, however, each training sample is not of equal value and those lying near the hyperplanes are most informative for SVM classification (Foody and Mathur 2004b). Thus with an SVM, the desire need not be to obtain as large a training sample as possible but one that contains the most useful training cases. With SVM classification, only some of the training samples that lie at the edge of the class distributions in feature space (the support vectors) are needed in the establishment of the decision surface. Training data other than support vectors can effectively be discarded without compromising the accuracy of the classification. Thus, the accuracy of an SVM classification depends not so much on the size of the input training data but more on the location of the training data in feature space. Moreover, since the computation of the decision surface is not dependent on the dimensionality of the data, an SVM can accurately classify data in high dimensional space with a limited number of training data and overcome the problem of the Hughes phenomenon (Pal and Mather 2004).
Consequently, it may also be unnecessary to undertake a feature-reduction analysis (Melgani and Bruzzone 2004), although this sometimes can be useful (Neumann et al. 2005). The ability to use small training sets that provide appropriate support vectors has important benefits, notably the potential for savings in training data acquisition. To realise the potential to acquire a small training set that would provide appropriate support vectors directly from field work requires a means of identifying the useful training sites in advance of the class allocation stage of the classification (Foody and Mathur 2004b). One approach to identifying training data that would provide potential support vectors is to identify the extremities of the class distributions in feature space with the aid of knowledge on the variables controlling the spectral response of the classes. The principles of this intelligent approach to training have been identified (Foody and Mathur 2004b) but here they are applied to a real operational application scenario.
This paper aims to evaluate a procedure devised to intelligently capture a small training set that would provide appropriate support vectors for an SVM classification directly from the field on the basis of ancillary information. This approach is applied to an operational agricultural application scenario in which an aim is to accurately classify crops to aid production management.

SVM classification
SVMs were originally designed for binary classification. A large number of candidate classifiers can be defined to separate two classes but there is only one that provides the maximum margin between the two classes, and this is termed the optimal separating hyperplane (OSH). Due to its definition, this classifier is expected to generalize more accurately on unseen cases than other classifiers.
A detailed mathematical explanation of SVM can be found in Vapnik (1995). Here only some of the main features are discussed. The OSH can be formulated by focusing on the training samples that lie at the edge of the class distribution in feature space. These training cases are the support vectors which are fundamental to classification by an SVM. The OSH can be defined as $f(x) = w \cdot x + b$, where the parameter $w$ determines the orientation of the hyperplane in space and $b$ defines the bias, the distance of the hyperplane from the origin (figure 1). When it is not possible to define the hyperplane by linear equations, the data may be mapped into a high dimensional space through some non-linear mapping which has the effect of spreading the distribution of the data points in a way that facilitates the fitting of a linear hyperplane. With this, the classification decision function becomes

$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{r} \alpha_i y_i k(x, x_i) + b\right) \qquad (1)$$

where for each of the $r$ training cases there is a vector, $x_i$, that represents the spectral response of the case together with a definition of class membership, $y_i$; $\alpha_i$, $i = 1, \ldots, r$, are Lagrange multipliers and $k(x, x_i)$ is a kernel function. The magnitude of $\alpha_i$ is determined by the parameter $C$ and lies on a scale of 0 to $C$ (Belousov et al. 2002). Further details on SVM classification for both the linearly separable and linearly non-separable situations in a remote sensing context are given in the literature (e.g. Huang et al. 2002, Melgani and Bruzzone 2004). The support vectors are those training samples $x_i$ for which $\alpha_i > 0$. Training cases other than support vectors ($\alpha_i = 0$) do not contribute to the formulation of the classifier (equation (1)) and are, therefore, irrelevant. Such training cases may be removed from the training set without compromising the accuracy of the classification. Thus, it is possible to derive an accurate classification from an SVM trained with only a small training set.
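The claim that training cases with a zero Lagrange multiplier can be dropped without changing the classifier can be checked numerically. The following is a minimal sketch, assuming the standard kernel form of the decision function (a weighted sum of kernel evaluations plus a bias) and a Gaussian kernel. The two-band samples, the multipliers, the bias and the kernel width are all hypothetical values chosen for illustration; they are not taken from the study, and the helper names are not part of any library.

```python
import math

def gaussian_kernel(x, z, gamma):
    """Gaussian kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, samples, labels, alphas, b, gamma):
    """Sum of alpha_i * y_i * k(x, x_i) over the training cases, plus the bias b."""
    return sum(
        a * y * gaussian_kernel(x, xi, gamma)
        for xi, y, a in zip(samples, labels, alphas)
    ) + b

# Hypothetical two-band training cases (illustrative values only).
samples = [(0.10, 0.60), (0.15, 0.55), (0.40, 0.30), (0.45, 0.20)]
labels = [+1, +1, -1, -1]
alphas = [0.0, 1.0, 1.0, 0.0]  # assumed multipliers: only the middle two cases are support vectors
b, gamma = 0.0, 5.0            # assumed bias and kernel width

query = (0.25, 0.45)
full = decision_value(query, samples, labels, alphas, b, gamma)

# Keep only the support vectors (alpha_i > 0) and evaluate the same query again.
kept = [(xi, y, a) for xi, y, a in zip(samples, labels, alphas) if a > 0]
sv_samples, sv_labels, sv_alphas = (list(t) for t in zip(*kept))
reduced = decision_value(query, sv_samples, sv_labels, sv_alphas, b, gamma)

print(math.isclose(full, reduced))
```

Because every dropped case is multiplied by a zero multiplier, the two decision values agree, so the class allocated to the query case is the same with the full and the reduced training set.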
Although initially designed for binary classification, the basic SVM approach can be extended for the multi-class classification task that is common in remote sensing applications. The two main approaches to multi-class classification with SVMs are the one-against-all and one-against-one strategies (Gualtieri and Cromp 1998, Huang et al. 2002). These approaches split the multi-class problem into a set of binary problems, enabling the basic binary approach of SVM to be utilized to yield a multi-class classification. A more appropriate approach for multi-class classification, which is also less computationally demanding, may be to consider all classes at one time, yielding a multi-class SVM (Hsu and Lin 2002). One means to achieve this, which is similar in basis to the 'one-against-all' approach, is by solving a single optimization problem. The work described hereafter is focused around one such multi-class SVM.
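To make the difference between the two decomposition strategies concrete, the sketch below enumerates the binary problems each one would create. It assumes the five classes used later in this study; the variable names are illustrative only.

```python
from itertools import combinations

classes = ["cotton", "basmati rice", "local rice", "built-up land", "sand"]

# One-against-all: one binary problem per class (that class versus the rest).
one_against_all = [(c, "rest") for c in classes]

# One-against-one: one binary problem per unordered pair of classes.
one_against_one = list(combinations(classes, 2))

print(len(one_against_all), len(one_against_one))  # 5 10
```

With k classes the one-against-all strategy needs k binary classifiers and the one-against-one strategy needs k(k-1)/2, which is part of the motivation for treating all classes in a single optimization problem.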

Data and methods
The study area comprised the south-western part of Punjab state, India. Indian Remote Sensing Satellite (IRS-1D) data with a spatial resolution of approximately 24 m, acquired by the LISS-III sensor on 22 September 2003, were used. The remotely sensed data were acquired in red (0.62-0.68 µm), near infrared (NIR) (0.77-0.86 µm) and middle infrared (MWIR) (1.55-1.75 µm) wavebands. The ground data on class membership were collected by visiting the field immediately prior to image acquisition during the period 15 to 21 September 2003. Attention focused on the dominant agricultural and non-agricultural classes in the study area. The agricultural classes were cotton, basmati rice and a local variety of rice while the non-agricultural classes were built-up land and sand. Two sets of ground data were acquired. One set was acquired following a conventional approach adopted in operational applications while the other was acquired with an intelligent scheme designed to select the most useful training data for an SVM classification. Throughout, emphasis was on the acquisition of information on crops. Note that the selection of training sites, including the prediction of the most useful training sites, was made in advance of the classification analysis, with the training site locations visited in the field before the image had been acquired.
A range of conventional approaches to training data acquisition can be identified, differing mainly in the detailed nature of the sampling design used. The scheme adopted here was based on a stratified random, by class, sampling design similar to that used operationally for mapping in this region within the Crop Acreage and Production Estimation (CAPE) project (Yadav et al. 1995). This approach helps ensure that even relatively rare classes, such as sand and basmati rice within the study area, are adequately sampled. The sampling process was based on a grid of 500 m × 500 m cells overlaid upon a map of the study area. The function of the grid was to ensure that the training data were acquired over a large geographical area, a feature often perceived as being desirable as it helps ensure class variability is captured. Grid cells were selected at random and visited on the ground in order to locate homogeneous sites for training purposes. From each selected grid cell, one pixel was selected from the homogeneous sites of the classes visited. In total, 180 cells were selected for each of the five classes under study. For each class, the 180 pixels acquired by this approach were divided randomly into training and testing sets. The training set, therefore, comprised 90 pixels of each class and a total of 450 pixels (figure 2). Note that the basis of selecting 90 pixels per class for the training set was the widely promoted recommendation that a training set should comprise at least 30 times the number of discriminatory wavebands to be used in the classification analysis. The derived training set should have captured the spectral variability of the classes and provide a sample that yields an accurate and unbiased description that gives a full characterisation of the classes.
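The sample sizes of the conventional scheme follow directly from the figures quoted above. The sketch below reproduces that arithmetic and the random division of each class into training and testing halves. The pixel identifiers and the fixed random seed are assumptions made purely for illustration; the actual grid cells and the randomization used in the study are not reproduced here.

```python
import random

N_BANDS = 3             # red, NIR and middle infrared wavebands
SAMPLES_PER_BAND = 30   # rule-of-thumb multiplier cited in the text
CELLS_PER_CLASS = 180   # grid cells (one pixel each) selected per class

classes = ["cotton", "basmati rice", "local rice", "built-up land", "sand"]
train_per_class = SAMPLES_PER_BAND * N_BANDS  # 30 x 3 = 90

rng = random.Random(0)  # fixed seed, assumed here only for repeatability

training, testing = {}, {}
for name in classes:
    pixel_ids = list(range(CELLS_PER_CLASS))  # hypothetical pixel identifiers
    rng.shuffle(pixel_ids)
    training[name] = pixel_ids[:train_per_class]
    testing[name] = pixel_ids[train_per_class:]

total_training = sum(len(ids) for ids in training.values())
print(train_per_class, total_training)  # 90 450
```

Each class therefore contributes 90 training and 90 testing pixels, with no pixel appearing in both sets.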
Figure 2. Distribution of the training data for the five classes in the feature space.
It is not, however, always necessary to have training statistics that provide a complete and representative description of the classes, especially if using a classifier such as an SVM. For classification by an SVM, only the training samples that are support vectors, which lie on part of the edge of the class distribution in feature space, are required; all other training samples provide no contribution to the classification analysis and can be discarded without impacting on the accuracy of the classification (Foody and Mathur 2004b). To capture border training data that would act as support vectors it would be expected that attention should focus on the extremities of the class spectral responses. With the intelligent training scheme, training data were, therefore, acquired from sites with relatively extreme spectral responses (potential border training samples) that should provide appropriate support vectors. This approach was adopted for the agricultural classes. The approach requires a basic understanding of the variables affecting the spectral response of a crop (Foody and Mathur 2004b). It is expected that these include factors related to the growth stage of the crop, the soil background and the water status of the training sites. For instance, a healthy crop generally has very high NIR reflectance and very low red reflectance, while a matured crop has comparatively low NIR reflectance and high red reflectance (Curran 1980). The differences between the growth stages are also clearly detectable in the field (figure 3), making it possible to identify candidate support vectors during the pre-classification fieldwork. Similarly, variations in moisture content, resulting from differences in crop maturity or from proximity to a water body such as a canal from which water may seep, influence the spectral response. Additionally, variables such as soil type also influence the spectral response of the crops, especially if canopy cover is incomplete and the soil is exposed to the sky, although an indirect effect through plant growth and condition is also important (Curran 1980). On the basis of this type of knowledge on the various factors believed to influence the spectral response of the crops, one may be able to predict sites that may furnish appropriate support vectors.
That is, one may predict the most informative training sites for the classification.
Figure 3. Variation in crop properties evident in the field that may aid the selection of support vectors prior to the classification. Note the marked difference between the relatively young (foreground) and mature (background) local rice crops which helps fieldworkers locate suitable training sites.
Here, fieldwork was aided by knowledge gained about waterlogged areas from local newspaper reports and by valuable information about the spatial distribution of crops, with regard to their type and stage of maturity, from discussion with local agricultural departments within the study area and farmers at the field level. So, the intelligent selection of training cases of the crops was directed by information on soil background properties, water status and crop growth stage. The effect of each variable on the spectral response may vary between the crops. For example, at the time of data acquisition an incomplete canopy was observed for just the cotton, limiting the direct effect of variation in soil background to this crop only. Similarly, the only crop exhibiting marked variation in growth stage was the local rice crop, which occurred at stages varying from young and healthy through to a very matured condition (figure 4). Variation in water condition was linked to seepage from water bodies and specific watering activities and could affect all of the crops located in close proximity to the water network. In total, 80 training cases were identified in the field for use in the intelligently defined training set. These cases comprised 30 cases each of cotton and local rice and 20 cases of basmati rice. These sample sizes were arbitrarily selected but are considerably smaller than the 90 cases per class acquired with the conventional scheme for training data acquisition. The location of the non-agricultural classes, built-up land and sand, in feature space indicated that spectral confusion between them and the agricultural classes was unlikely. Since interest was focused mainly on the crops, from which the non-agricultural classes were distinctive, a sample of 25 cases each of built-up land and sand was acquired for use in the intelligently defined sample. This number of cases was arbitrarily defined but is, again, considerably smaller than the 90 cases of each of these classes contained in the conventionally defined training set. Later research with the same datasets indicated that substantially smaller training samples could have been used for these classes without significant negative impact on classification accuracy (Foody et al. 2006).
The training sets derived from the conventional and intelligent data acquisition schemes were used to drive SVM classifications.
The multi-class SVM approach to classification with a Gaussian kernel was used in all analyses (Hsu and Lin 2002, Foody and Mathur 2004a). The parameters of the SVM were optimized for each analysis using a fivefold cross-validation approach. With the conventionally defined training set, the parameters C and γ (the latter controlling the width of the kernel) were set at 0.25 and 0.005, respectively. With the intelligently defined training set, the C and γ parameters were set at 1.0 and 0.000625, respectively.
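The fivefold cross-validation used for parameter selection can be sketched as follows. This is only an illustration of how the training cases might be partitioned; the function name, the random seed and the interleaved fold assignment are assumptions, and the candidate grid of parameter values searched in the study is not given in the text, so it is not reproduced here.

```python
import random

def five_fold_splits(n_cases, n_folds=5, seed=0):
    """Return (train_indices, validation_indices) pairs for n_folds-fold cross-validation."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, held_out))
    return splits

# 130 cases: the size of the intelligently defined training set (80 crop + 50 non-crop).
splits = five_fold_splits(130)
print([len(validation) for _, validation in splits])  # [26, 26, 26, 26, 26]
```

For each candidate pair of parameter values, a classifier would be trained on the four retained folds and scored on the held-out fold, and the pair with the highest mean score over the five splits would be kept.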
The accuracy of each classification was assessed using the same testing set. Confusion matrices were constructed for each classification to summarize the allocations made and the accuracy, expressed in terms of the percentage of cases correctly allocated, of each classification determined. In this research, the aim was to determine if the intelligent training based approach could be used to derive a classification of accuracy comparable to that derived from the conventionally trained classification analysis. For the intelligent training approach to be of value, therefore, there should be no significant difference between the accuracy of the classification derived from the conventionally trained and the intelligently trained classifications. That is, accuracy of the classification derived from the use of the small intelligently selected training set should be comparable to that from the larger conventionally selected training set. Since each accuracy statement derived provides only an estimate of classification accuracy, it is inappropriate to simply compare the magnitude of the estimates directly in order to determine if the classifications differed in accuracy. The statistical significance of a difference in classification accuracy has often been evaluated by a comparison of kappa coefficients with respect to their estimated variances (e.g. Congalton and Mead 1983). This approach assumes that independent testing sets were used in evaluating the classifications to be compared and so is inappropriate in this study in which a single testing set was used. Instead, the classification accuracy statements derived from the analyses using the training sets acquired by the conventional and intelligent training data collection schemes were compared in a rigorous fashion that accommodated for the related nature of the samples using a McNemar test (Foody 2004). 
This test is based on confusion matrices that are 2 × 2 in dimension and show the level of inter-classifier agreement in correct and incorrect allocations. The McNemar test is based upon the standardized normal test statistic,

$$Z = \frac{f_{12} - f_{21}}{\sqrt{f_{12} + f_{21}}}$$

in which $f_{ij}$ indicates the frequency of allocations lying in element $i,j$ of the 2 × 2 confusion matrix. The test is, therefore, focused on the cases correctly classified by one classifier but misclassified by the other. With this test, two classifications may be considered to be of different accuracy at the 95% level of confidence if $|Z| > 1.96$. Thus, if the conventional and intelligent training data acquisition schemes yielded classifications that were not significantly different, $|Z| < 1.96$.
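The McNemar comparison can be expressed in a few lines. The sketch below assumes the usual form of the statistic, the difference between the two discordant counts divided by the square root of their sum. The counts used are hypothetical and serve only to show the calculation; the discordant frequencies of the study are not quoted in the text.

```python
import math

def mcnemar_z(f12, f21):
    """Standardized normal McNemar statistic from the two discordant counts."""
    if f12 + f21 == 0:
        raise ValueError("no discordant cases; the statistic is undefined")
    return (f12 - f21) / math.sqrt(f12 + f21)

# Hypothetical discordant counts, chosen only for illustration.
f12 = 11  # cases correct with the first classifier but wrong with the second
f21 = 5   # cases wrong with the first classifier but correct with the second

z = mcnemar_z(f12, f21)
significant = abs(z) > 1.96  # two-sided test at the 95% level of confidence
print(round(z, 2), significant)  # 1.5 False
```

Only the cases on which the two classifications disagree enter the statistic, which is why the test is suited to classifications evaluated with the same testing set.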

Results and discussion
The conventionally defined training set was used to derive a classification with an accuracy of 92.00% (table 1). This result highlights that SVM may be used to derive very accurate classifications. The considerably smaller, intelligently defined, training set was, however, used to derive a classification with an accuracy of 90.66%, only marginally, and insignificantly (Z = 1.50), less accurate than that derived from the conventional approach (table 2). Thus, by focusing the training data acquisition process on the sites believed to have a high expectation of forming support vectors it was possible to use a small training set without any significant negative impact on classification accuracy. The 1.34% decrease in accuracy was achieved with a decrease in training set size from 450 (90 of each class) to 130 pixels (80 from the agricultural and 50 from the non-agricultural classes). Moreover, the reduction in training set size offers savings to the analyst. Savings in the time needed to acquire the training set as well as in costs of transportation and support of the team of fieldworkers would all be expected to result from a decrease in training set size. As a guide to the size of the savings achievable, the conventional training data acquisition scheme followed required a total journey of ∼1700 km by the fieldworkers but only ∼1040 km was required for the acquisition of the intelligently defined training set. The adoption of the intelligent training scheme would, therefore, have reduced the travel distance by 38.8% and would be associated with large savings in fuel and vehicle use. These travel cost and associated savings contributed to an estimated ∼26% financial saving on the total cost of producing and evaluating the crop classification through the adoption of the intelligent rather than conventional training data collection scheme (Mathur 2005).
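The headline figures in this paragraph can be re-derived from the quantities it quotes. The short sketch below does so; the percentage reduction in training set size is simple arithmetic on the stated sample sizes and is not a value reported in the text.

```python
conventional_size = 90 * 5   # 90 pixels for each of five classes
intelligent_size = 80 + 50   # agricultural plus non-agricultural cases
size_reduction_pct = 100 * (conventional_size - intelligent_size) / conventional_size

conventional_km = 1700       # approximate travel distance, conventional scheme
intelligent_km = 1040        # approximate travel distance, intelligent scheme
travel_reduction_pct = 100 * (conventional_km - intelligent_km) / conventional_km

accuracy_drop = 92.00 - 90.66  # difference in percentage accuracy

print(round(size_reduction_pct, 1), round(travel_reduction_pct, 1), round(accuracy_drop, 2))
# 71.1 38.8 1.34
```

The travel saving of 38.8% matches the value quoted above, and the same arithmetic shows that the intelligent scheme used a training set roughly 71% smaller than the conventional one.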
Attention in defining the intelligent training set was focused mainly on the identification of the most useful training sites for the discrimination of the three agricultural crop classes. It was evident that 70% of the intelligently defined training samples were support vectors. Within the conventionally defined training set the proportion of support vectors was lower, with only 41.5% of the crop training samples being support vectors (table 3). The fieldwork, therefore, appeared to direct the training data acquisition process to the most informative locations. Moreover, from inspection of the support vectors and their relation to ancillary knowledge it may be possible to further refine the process. For example, inspection of the α values for the support vectors determined for the cotton crop showed that these had come predominantly from waterlogged areas. This is not surprising as waterlogging will reduce the reflectance of the crop. Consequently, such sites would be expected to have a reflectance closer to that of the local rice crop, the spectrally closest class, than cotton on dry soil (figure 3). Indeed the only support vector for the cotton crop located on dry soil had the lowest α value, 0.6051, while all others had the maximum value of 1.0, and so actually contributed very little to the analysis. Dropping this one training case from the dataset had no impact on the class allocations as exactly the same classification of the testing set was derived. Thus in future, it may be possible to extract training sites for cotton simply from waterlogged regions and ignore the impact of other variables. It would also be possible to intelligently select training samples from the non-agricultural classes. This would further reduce the training set size and cost of training data acquisition without loss of classification accuracy. Other approaches to reducing the size of the training set may also be appropriate (Foody et al. 2006).
Although the intelligent training approach requires some knowledge-based input to aid prediction of the most useful sites for training data acquisition, a high degree of expertise may not be necessary. Indeed, this study is based on the impact of well-known variables such as water content, growth stage and soil background on the spectral response of vegetation. Moreover, data on these variables can be derived relatively easily from a range of information sources (e.g. maps) or be observed readily in the field (see figure 3). Consequently, the approach can be used to reduce the costs of training data acquisition programmes without placing major demands on those undertaking the work and without significant negative impact on classification accuracy.

Conclusions
The desire in training a supervised classifier has often been to derive an accurate and complete description of the spectral response of all of the classes in the study area. To achieve a complete description of each class in feature space, a large training set is typically required. Although this may be appropriate for some classifications, it is not always necessary to have training statistics that provide a complete and representative description of the classes, especially if using a non-parametric classifier such as an SVM. For SVM classification, training samples are not equally important: those lying near the edge of the class distributions in feature space and facing the distributions of other classes, the support vectors, are the most important in the fitting of the decision boundaries between the classes. Thus, if there is some ancillary information that can be used to direct the selection of training sites to regions from which the most informative training samples, the support vectors, can be derived, it may be possible to acquire a small intelligently selected training set that can be used to accurately classify the data. This study showed that ancillary information on crop status and the background properties of training sites (soil type and water content) can be employed as part of the training data acquisition process in order to identify the most informative training samples, the support vectors. Critically, the location of the most useful sites for training data acquisition may be predicted in advance of the classification analysis (note that the training sites were defined prior to image acquisition). Relative to the use of a conventional scheme, the resulting classification was 1.34% less accurate, at 90.66%, but its training set involved substantially less effort and could be acquired at less expense.
The intelligent scheme was able to essentially maintain accuracy because it focused attention on the most informative training samples. Note, for example, that the training set defined intelligently contained a considerably larger proportion of support vectors than that acquired by the conventional method. Moreover, analysis of the contribution made by each training sample to the classification showed that the intelligent training data acquisition scheme could be further refined. For example, inspection of the α values for the training cases highlighted that the most informative training samples for the cotton class were located near water bodies. Future classifications could, therefore, direct training data acquisition activities for the cotton crop to regions near water bodies and ignore other factors that influence the spectral response.