Deep Fuzzy Tree for Large-Scale Hierarchical Visual Classification

Deep learning models often use a flat softmax layer to classify samples after feature extraction in visual classification tasks. However, it is hard to identify the true label among massive classes with a single flat decision. In this scenario, hierarchical classification has proven to be an effective solution and can be used to replace the softmax layer. A key issue of hierarchical classification is constructing a good label structure, which is very significant for classification performance. Several works have been proposed to address this issue, but they have limitations and are mostly designed heuristically. In this article, inspired by fuzzy rough set theory, we propose a deep fuzzy tree model that learns a better tree structure and classifiers for hierarchical classification with a theoretical guarantee. Experimental results show the effectiveness and efficiency of the proposed model on various visual classification datasets.


I. INTRODUCTION
IN RECENT years, deep learning has received widespread attention and achieved remarkable performance in visual classification tasks [1], [2]. With the tremendous growth of data, the number of labels in these tasks has been growing fast, which makes it more difficult to recognize the true class from thousands of candidates directly. Fortunately, a hierarchical structure often exists among these massive labels, which helps organize data with huge numbers of classes efficiently [3]. With a hierarchical structure, a difficult classification task is divided into several easier subtasks and hence gets solved more effectively. Babbar et al. [4] and Partalas et al. [5] compare hierarchical and flat classification, and conclude theoretically that hierarchical classification performs better on such difficult tasks. Currently, most deep learning architectures use a softmax layer to classify samples [6]. Although this obtains good performance in many types of visual classification tasks, it is still challenging to deal with large-scale tasks with massive labels. Inspired by [4] and [5], we believe that replacing softmax with a deep hierarchical tree system for classification can solve the problem effectively. In this scenario, a key problem is to construct a good label structure, which has been shown to significantly affect the performance of hierarchical classification [7], [8].
Many efforts have been dedicated to building a good structure for visual classification tasks. First, expert-designed ontologies are widely used for large-scale visual classification. They often take the form of a semantic structure, such as WordNet for ImageNet [9], and reflect relations independent of the data. However, such ontologies depend heavily on expert knowledge and are not always available. More importantly, the feature information of the data is often ignored [7], and classification accuracy decreases if the semantic hierarchy is inconsistent with the feature space of the data [10].
To solve this problem, data-driven hierarchical structure learning methods have been designed in many works. Some researchers build a tree structure by recursively assigning the classes of greatest confusion into one group [11]-[13]. Specifically, a 1-vs-all SVM is trained first and then used to obtain a confusion matrix, to which a hierarchical clustering method is applied. This approach has proven effective, but its performance heavily depends on the reliability of the SVM, which is particularly challenged by difficult tasks such as those with massive labels or unbalanced data.
Alternatively, several works have proposed to construct the class affinity matrix by measuring interclass similarity or distance without training a classifier, e.g., Fan et al. [8]. Although improvements are obtained on some visual classification tasks, these algorithms depend on restrictive assumptions. For example, [14] assumes the data distribution is ball-shaped, [7] requires the data to respect Gaussian distributions, and [8] needs data without much noise. More significantly, they are designed heuristically rather than derived theoretically for hierarchical classification, and hence cannot measure interclass similarity properly.

Fig. 1. Framework of deep fuzzy tree. In the training phase, the deep features of training samples are first extracted; interclass similarities are then measured and the similarity matrix is created in step A for tree construction; and a tree structure with base fuzzy rough classifiers is built by hierarchical community detection in step B. In the test phase, after the deep feature extraction step, test samples are assigned by fuzzy rough classifiers at each node of the tree from the root to the leaves (step C).
In this article, we propose a Deep Fuzzy Tree (DFT) model to solve large-scale classification tasks with massive labels by replacing the softmax layer of a deep learning network with a newly designed hierarchical classification method (see Fig. 1). We aim to learn a better tree structure and classifiers for hierarchical classification by measuring interclass similarity more appropriately. As fuzzy rough sets have shown their importance in several classification tasks [16]-[18], we leverage the theory of the lower approximation and design a new dual fuzzy interclass similarity, used both to measure interclass similarity and to set the base classifiers of the tree with a theoretical guarantee. Then, we recursively detect communities with high visual similarity using a community detection method and obtain a hierarchical tree structure. To deal with large-scale tasks, fast adaptation is designed with the help of vector quantization. Experimental results show that the proposed DFT model learns a more reasonable tree structure, which further improves the performance of deep learning on hierarchical classification tasks.
The main contributions of this article are summarized as follows.
1) A new DFT framework is proposed to construct the tree structure and learn the base classifiers for large-scale hierarchical visual classification tasks. We verify its effectiveness and efficiency on various visual classification datasets in comparison with several state-of-the-art algorithms.
2) A new interclass similarity measurement, the dual fuzzy interclass similarity, is proposed and used for both tree learning and base classifier setting. We theoretically prove that the new measurement leads to a tighter generalization error bound for hierarchical classification.

The rest of this article is organized as follows. Section II reviews the preliminary knowledge of class hierarchies and hierarchical classification. Section III describes the details of the proposed deep fuzzy tree algorithm. Section IV presents the theoretical results of the proposed model with respect to the error bound of hierarchical classification. Experimental results on various datasets are presented in Section V. Finally, Section VI concludes this article.

II. PRELIMINARIES

A. Class Hierarchy
Class hierarchy organizes classes into a hierarchical structure whose granularities range from coarse-grained to fine-grained. There are two kinds of structures in class hierarchy: trees and directed acyclic graphs (DAGs). We focus on tree structures in this article since trees are the most common and widely used.
A tree hierarchy organizes the class labels into a tree-like structure to represent a kind of "IS-A" relationship between labels [19]. Specifically, Kosmopoulos et al. point out that the properties of the "IS-A" relationship can be described as asymmetry, antireflexivity, and transitivity [20]. We define a tree T as a triple (V, E, ≺) with a set of nodes V = {v_1, v_2, ..., v_n}, where E represents the set of edges between nodes in adjacent levels, and ≺ represents the parent-child relationship between nodes connected by edges (the "IS-A" relationship), formulated as follows: for all v_i, v_j, v_k ∈ V,

1) asymmetry: if v_i ≺ v_j, then v_j ⊀ v_i;
2) antireflexivity: v_i ⊀ v_i;
3) transitivity: if v_i ≺ v_j and v_j ≺ v_k, then v_i ≺ v_k.

Generally, there are several types of nodes in a tree hierarchy. For node v_i:
1) its parent node is denoted by P_i;
2) its child nodes are denoted by C_i, and |C_i| is the number of child nodes of v_i;
3) its ancestor nodes are denoted by Ω(v_i), and |Ω(v_i)| is the number of ancestor nodes of v_i;
4) its sibling nodes are denoted by Ψ(v_i) and have the same parent node as v_i;
5) its leaf nodes are denoted by Le(v_i), and |Le(v_i)| is the number of leaf nodes of v_i.

Specially, L denotes the leaf nodes of the tree, and |L| denotes the number of all leaf nodes. A toy example of the hierarchy can be seen in Fig. 2.

Fig. 2. Toy example of a tree hierarchy. For node v_2, its parent node is P_2 = v_1, its child nodes are C_2 = {v_4, v_5}, and its corresponding leaf nodes are Le(v_2) = {v_4, v_5}. In a hierarchical classification process, a sample x starts at the root node v_1 and is classified to the child node with the highest confidence score among all child nodes, S^x_2 and S^x_3 (assume S^x_3 > S^x_2). Proceeding recursively, the prediction for x is node v_6 in this example (assume S^x_6 > S^x_7).
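For concreteness, the node relations above can be captured by a small data structure. The following is a minimal sketch (not from the original paper), assuming a plain parent/children representation; the names TreeNode, ancestors, siblings, and leaves are our own.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """A node v_i of a tree hierarchy T = (V, E, <)."""
    name: str
    parent: "TreeNode | None" = None                           # P_i
    children: "list[TreeNode]" = field(default_factory=list)   # C_i

    def ancestors(self):
        """Omega(v_i): all nodes on the path from v_i up to the root."""
        node, result = self.parent, []
        while node is not None:
            result.append(node)
            node = node.parent
        return result

    def siblings(self):
        """Psi(v_i): nodes sharing the same parent as v_i."""
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

    def leaves(self):
        """Le(v_i): leaf nodes of the subtree rooted at v_i."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]
```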

B. Hierarchical Classification
Given a tree hierarchy, a classification task can proceed from the root to the leaves in a top-down manner. The most widely used and classical model is the Pachinko Machine [21], which classifies a sample starting from the root and recursively choosing the child class with the highest confidence until a leaf class is reached.
Specifically, let x = (x_1, x_2, ..., x_j, ..., x_m) be a sample, W_{v_i} be a trained classifier (the base classifier) at node v_i, and S^x_{v_i} be the confidence score given by W_{v_i}. For x, the algorithm chooses C_r as the label at node v_i, where S^x_{C_r} = max(S^x_{C_i}). See the toy example in Fig. 2. Assume a sample x starts at the root node v_1; it is first assigned to node v_3 if S^x_3 > S^x_2. Then, it reaches the leaf node v_6 if S^x_6 > S^x_7.
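The top-down decision rule can be sketched in a few lines. In the following hypothetical snippet, score(node, x) stands in for the confidence S^x_v produced by the base classifier W_v at each node; any base classifier can be plugged in.

```python
def pachinko_predict(root, x, score):
    """Pachinko Machine: descend from the root, always taking the child
    with the highest confidence, until a leaf class is reached."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: score(c, x))
    return node
```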

III. DEEP FUZZY TREE LEARNING
In this section, we introduce how to learn the tree structure and set the base classifiers for the deep learning model. In order to build a good tree structure, a proper interclass similarity measurement is first designed and then applied for hierarchical community detection to form a tree. Given the tree structure, the base classifiers can be set on different nodes of the tree.

A. Measuring Interclass Similarity With Fuzzy Lower Approximation
As shown in previous works, it is reasonable that classes with high visual similarity are grouped into one super-class [7], [14].
The key point lies in a proper measurement of the similarity between different classes. Given an interclass similarity matrix, clustering or community detection methods can be applied to build a tree. However, the measurements designed in previous works mostly rely on restrictive assumptions, as discussed in Section I. Inspired by the theory of fuzzy rough sets, we propose a new measurement that better describes the similarity relations between different classes.
Given a universe U representing a group of objects, let R be a fuzzy similarity relation on U generated by a feature set B. Assume that R(x, y) monotonically decreases with the distance between x and y, and that ∀x, y, z ∈ U, R has the following properties:

1) reflexivity: R(x, x) = 1;
2) symmetry: R(x, y) = R(y, x);
3) T-transitivity: T(R(x, y), R(y, z)) ≤ R(x, z).

For any subset X ⊆ U, the fuzzy lower approximation operator is defined as

(R_S ↓ X)(x) = inf_{y∈U} S(N(R(x, y)), X(y)),    (1)

where S represents a fuzzy triangular conorm (T-conorm) and N is a negator; the standard negator is defined as N(a) = 1 − a. As proved by Moser [22], [23], kernel functions with the property of reflexivity also have the T_cos-transitive property. Hence, given U and a kernel function k with reflexivity and T_cos-transitivity, the fuzzy lower approximation of a fuzzy subset d_i ⊆ U can be formulated as

(k_S ↓ d_i)(x) = inf_{y∈U} S(N(k(x, y)), d_i(y)).    (2)

If we use the Gaussian kernel k(x, y) = exp(−‖x − y‖²/σ) to extract the relations in the fuzzy rough calculations, (2) is transformed into

(k_S ↓ d_i)(x) = inf_{y∈U−d_i} √(1 − k(x, y)²).    (3)

Equation (3) shows that the lower approximation membership degree of x to d_i depends on the nearest sample with a different label, since it searches for the nearest sample in the space U − d_i. Intuitively, the fuzzy lower approximation describes the degree of certainty that a sample x belongs to a specific class d_i. By considering the fuzzy lower approximation of all samples in d_i, the interclass similarity can be measured appropriately.
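As an illustration, the membership of (3) can be computed as follows. This is a sketch under our reading of (3) (the √(1 − k²) form of kernelized fuzzy rough sets); sigma denotes the Gaussian kernel width.

```python
import numpy as np

def gaussian_kernel(x, Y, sigma):
    """k(x, y) = exp(-||x - y||^2 / sigma) for each row y of Y."""
    return np.exp(-np.sum((Y - x) ** 2, axis=1) / sigma)

def lower_approximation(x, X_other, sigma):
    """Membership of x to the fuzzy lower approximation of its class,
    following eq. (3): an infimum over the samples X_other of the *other*
    classes, so the value is governed by the nearest sample with a
    different label."""
    k = gaussian_kernel(x, X_other, sigma)
    return float(np.min(np.sqrt(1.0 - k ** 2)))
```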

Definition 3.1: Given a set of samples U with classes d_i and d_j, the fuzzy interclass similarity (FIS) from d_i to d_j is defined as

FIS(d_i, d_j) = (1/|d_i|) Σ_{x∈d_i} inf_{y∈d_j} N(k(x, y)),    (4)

where k(x, y) is the kernel function. Commonly, we use the Gaussian kernel in the computation, so (4) turns into

FIS(d_i, d_j) = (1/|d_i|) Σ_{x∈d_i} inf_{y∈d_j} √(1 − k(x, y)²).    (5)

Moreover, since a distance or similarity measurement should intuitively be symmetric, we define the dual fuzzy interclass similarity (DFIS) as follows.

Definition 3.2: DFIS is the average of the bidirectional fuzzy interclass similarities:

DFIS(d_i, d_j) = (FIS(d_i, d_j) + FIS(d_j, d_i)) / 2.    (6)

The similarity of each class pair is calculated according to (6), and then the similarity matrix over all labels is obtained. Moreover, we prove theoretically that the proposed interclass similarity yields an upper bound on the generalization error of hierarchical classification (shown in Section IV).
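Reusing lower_approximation from the sketch above, the DFIS matrix of (6) can be assembled as follows; classes is assumed to be a list of per-class feature arrays.

```python
def fis(X_i, X_j, sigma):
    """Directional fuzzy interclass similarity, eq. (5): the average
    lower-approximation membership of the samples of d_i against d_j."""
    return float(np.mean([lower_approximation(x, X_j, sigma) for x in X_i]))

def dfis_matrix(classes, sigma):
    """Symmetric DFIS matrix over per-class sample arrays, eq. (6).
    The diagonal is left at zero (no self-loops in the class graph)."""
    n = len(classes)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            S[i, j] = S[j, i] = 0.5 * (fis(classes[i], classes[j], sigma)
                                       + fis(classes[j], classes[i], sigma))
    return S
```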

B. Tree Construction With Hierarchical Community Detection
With the proposed dual fuzzy interclass similarity, the interclass similarity matrix is appropriately computed. Then, the tree structure is obtained by recursively grouping the more similar nodes into granules. Inspired by community detection in social networks, which reveals the organization of people, and by its application to confusion graphs proposed by Jin et al. [24], we apply an adaptive modularity-based community detection algorithm hierarchically to explore the communities among the large number of classes. In contrast to the conventional use of spectral clustering, adaptive modularity community detection automatically finds a locally optimal solution to the community modularity, without processing the similarity matrix to meet the requirements of spectral clustering.
We utilize the fast community detection algorithm proposed by Blondel et al. [25] and extend it to hierarchical community detection. For the flat version of community detection, the key point is to compute the modularity contributed by the kth community. In our hierarchical scenario, we compute the modularity Q^v at node v of the tree structure:

Q^v = (1/2m) Σ_{i,j} [S_{i,j} − k_i k_j/(2m)] δ(c_i, c_j),

where S_{i,j} is the similarity degree of the edge between class i and class j, k_i = Σ_j S_{i,j} is the sum of the similarity degrees of the edges attached to vertex class i, c_i is the community to which vertex class i is assigned, δ(p, q) equals 1 if p = q and 0 otherwise, and m = (1/2) Σ_{i,j} S_{i,j}. Then, at each node v we follow the method of [25], which alternates local modularity optimization with community aggregation, and apply the algorithm recursively to build the tree structure. In this way, the number of communities at each node of the tree is computed automatically, which removes the need to set the node numbers and tree depth as parameters, as in [7], [14].
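A sketch of the recursive construction is given below. It leans on louvain_communities from networkx (available since networkx 2.8) as an off-the-shelf implementation of the Blondel et al. method; the nested-list output and the min_split guard are our own simplifications.

```python
import networkx as nx

def build_tree(class_ids, S, min_split=2):
    """Recursively apply Louvain community detection to the DFIS matrix S,
    one tree level per recursion; leaves of the nested list are class ids."""
    if len(class_ids) <= min_split:
        return list(class_ids)
    G = nx.Graph()
    for a in range(len(class_ids)):
        for b in range(a + 1, len(class_ids)):
            i, j = class_ids[a], class_ids[b]
            G.add_edge(i, j, weight=S[i, j])
    parts = nx.community.louvain_communities(G, weight="weight", seed=0)
    if len(parts) <= 1:  # no further community structure at this node
        return list(class_ids)
    return [build_tree(sorted(p), S, min_split) for p in parts]
```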
However, the aforementioned hierarchical community detection algorithm may sometimes generate a tree structure with a few singleton nodes that have only one child node. We delete these redundant nodes from the constructed tree by using

Φ(v, μ(π(v))) = { π(v),     if μ(π(v)) = 1
               { π(π(v)),  if μ(π(v)) = 0,

where π(v) is the parent node of the current node v, and μ(π(v)) = 1 if the number of child nodes of π(v) is greater than 1, and 0 otherwise. For Φ(α, β), if β = 1, there are no singleton nodes to be removed. Otherwise, we set π(v) ← π(π(v)); in this case, we need to search for the next parent node in the ancestor node set Ω(v) of node v.
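In node form, the pruning rule amounts to contracting every edge whose upper endpoint has a single child, which is what the following sketch does with the hypothetical TreeNode structure from Section II.

```python
def prune_singletons(node):
    """Splice out nodes that have exactly one child, reattaching the child
    to the nearest ancestor with more than one child; returns the
    (possibly new) subtree root."""
    while len(node.children) == 1:
        only = node.children[0]
        only.parent = node.parent
        if node.parent is not None:
            node.parent.children[node.parent.children.index(node)] = only
        node = only
    for child in list(node.children):
        prune_singletons(child)
    return node
```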

C. Classification With Node-Wise Tree Classifier
Given a learned tree structure T, a test sample needs to be assigned to one of the leaf nodes in the classification task. We adopt the framework of the classical Pachinko Machine to classify samples starting from the root node until a leaf node is reached. Consistent with the theory of the fuzzy lower approximation used in measuring the interclass similarity, we use the fuzzy rough classifier (FRC) proposed by An et al. [26] as the base classifier, which assigns a sample to the candidate class to whose fuzzy lower approximation the sample has the largest membership. In other words, we utilize the lower approximation of samples first to measure the distance between classes appropriately, and then to assign test samples to the right candidate class, which helps improve the overall performance.
Moreover, using the FRC as the base classifier of the tree significantly reduces the training time of the overall algorithm, since it is a lazy classifier that does not need to be trained in advance of the test phase.
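One plausible realization of this node-wise rule, reusing lower_approximation from the sketch in Section III-A, is shown below; candidates maps each child of the current node to the training samples of the leaf classes under it, and no per-node training is required.

```python
def frc_assign(x, candidates, sigma):
    """Fuzzy rough classifier at a single tree node: assign x to the child
    to whose fuzzy lower approximation it has the largest membership."""
    scores = {}
    for child in candidates:
        # samples of all competing children play the role of U - d_i in eq. (3)
        others = np.vstack([X for c, X in candidates.items() if c is not child])
        scores[child] = lower_approximation(x, others, sigma)
    return max(scores, key=scores.get)
```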

D. Fast Adaptation to Large-Scale Tasks
Constructing the affinity matrix with all samples is time-consuming, especially for large-scale tasks. Therefore, in this article, fast adaptation is proposed to solve this problem. To reduce the number of samples, vector quantization is used to generate a few representative points for each class. Although this method reduces the complexity remarkably, it loses little performance, as shown in our experiments. We briefly introduce vector quantization in the following.

Recall that x ∈ X^d is a d-dimensional sample vector. We aim to obtain a reconstruction vector q_i ∈ X^C (1 ≤ C ≤ d) through q = Q(x), where Q(·) is the quantization operator. When x is quantized as q, a distortion measure d(·, ·) can be defined between x and q. The overall average distortion over m samples is written as

D = (1/m) Σ_{i=1}^m d(x_i, Q(x_i)).

The original space is split into C cells, and each cell C_i is associated with a reconstruction vector q_i.

We use the mean-square error (MSE), d(x, q) = (1/N) Σ_{k=1}^N ‖x_k − q_k‖², as the distortion measure, and this process turns into the well-known K-means algorithm. Representative points r_{C_i} are generated for each class C_i by setting a proportion η of the samples in the class. Note that at least one representative point is ensured for each class.
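A minimal sketch of this step, using scikit-learn's KMeans as the vector quantizer; eta is the proportion η of samples kept per class, and the max(1, ...) guard enforces at least one representative point.

```python
from sklearn.cluster import KMeans

def representative_points(X_class, eta=0.05):
    """Compress one class to max(1, eta * n) K-means centroids, which serve
    as its representative points when building the DFIS matrix."""
    n_points = max(1, int(eta * len(X_class)))
    km = KMeans(n_clusters=n_points, n_init=10, random_state=0).fit(X_class)
    return km.cluster_centers_
```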
With the aforementioned parts, we summarize the proposed deep fuzzy tree model. In the training phase, deep features are first extracted and the interclass similarity is measured based on the features of training samples, and then hierarchical community detection is applied to build a tree structure (see Algorithm 1). In the test phase, after feature extraction, test samples are assigned by the fuzzy rough set classifier at each node of the tree structure (see Algorithm 2).
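Stitching the sketches above together gives a rough picture of the training phase (Algorithms 1 and 2 themselves are not reproduced here); prediction then walks the resulting tree with pachinko_predict and frc_assign.

```python
def dft_train(features_by_class, sigma, eta=0.05):
    """Training-phase sketch (cf. Algorithm 1): compress each class by
    vector quantization, compute the DFIS matrix, and grow the label tree
    by hierarchical community detection."""
    reps = [representative_points(X, eta) for X in features_by_class]
    S = dfis_matrix(reps, sigma)
    return build_tree(list(range(len(reps))), S)
```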

IV. THEORETICAL RESULTS
To examine the rationality and properties of the proposed model, we analyze the generalization error bound of hierarchical classification and prove theoretically, in the following proposition, that the designed fuzzy interclass similarity helps improve the performance of hierarchical classification.

Proposition 1: Let S = {(x_1, y_1), ..., (x_m, y_m)} be a set of samples drawn independent and identically distributed (i.i.d.) from a probability distribution D over X × Y. Let Λ(x_i, y) be the membership of x_i to class y, and Φ(Λ) be the kernel function for Λ. Given a tree structure T, the empirical data-dependent error of hierarchical classification Ĥerr_S(f) is upper bounded by the fuzzy interclass similarity K(S).

Proof: The Rademacher complexity is widely used to analyze the generalization error bound of a classification problem [27]. We utilize the theorem presented by Shawe-Taylor and Cristianini [28] for the Rademacher data-dependent generalization bound:

Herr_D(f) ≤ Ĥerr_S(f) + R̂_m(F) + 3√(ln(2/δ)/(2m)),

where F is a set of hypotheses, Herr_D(f) = E_D[f] represents the generalization error of a hierarchical classifier, Ĥerr_S(f) represents its empirical error, and the bound holds with probability at least 1 − δ.

In the scenario of hierarchical classification, we define f ∈ F in terms of Λ(x, y), the membership of x to class y, and Φ(Λ), the kernel function for Λ.

Babbar et al. [4] presented the generalization error bound of hierarchical classification with kernel classifiers. Inspired by their work, we focus on the Rademacher complexity term to explore the influence of the interclass measurement:

R̂_m(F) = E_σ [ sup_{f∈F} (2/m) Σ_{i=1}^m σ_i f(x_i) ],    (11)

where the σ_i are Rademacher variables, i.e., independent uniform random variables taking values in {−1, +1}. Here, we utilize the definition of Babbar et al. [4] for hierarchical classification, which develops the multiclass margin of [29] as

c(x, y) = Λ(x, y) − max_{y'≠y} Λ(x, y').

Then, (11) can be transformed with the construction of c into (13), where i = 1, 2, ..., m, and (13) can be relaxed according to Jensen's inequality into (14). Taking the membership function as the lower approximation and the kernel function as the Gaussian kernel, inequality (14) yields inequality (15), which is expressed through the fuzzy lower approximation between each sample x_i of the ground-truth node v and the most similar other node v' with respect to x_i.

Inequality (15) shows that for each sample x_i ∈ S, the generalization upper bound is influenced by the fuzzy interclass similarity between nodes v and v'. For each node in the tree, the error upper bound becomes smaller as the fuzzy interclass similarity between the ground-truth node and its sibling nodes becomes larger. In the proposed deep fuzzy tree model, we realize inequality (15) by clustering the classes based on the interclass matrix with respect to the fuzzy interclass similarity, so as to maintain large interclass similarity at each nonleaf node. The proposition also demonstrates that setting the FRC as the base classifier can further improve classification performance, together with the tree structure built upon the fuzzy interclass similarity.

V. EXPERIMENTS

A. Dataset and Implementations
We perform experiments on various datasets (see Table I). These datasets are all organized with a semantic tree structure that shows the hierarchical relations between classes.
1) PASCAL VOC [30]: a visual object classes dataset that is a benchmark in visual object category recognition and detection. It has 34 828 images in 20 classes.
2) ILSVRC65 [31]: a visual object image dataset that is a subset of ImageNet. It has 17 100 samples in 65 classes.
3) Stanford Cars [32]: a car image dataset aimed at fine-grained classification. It has 16 185 samples in 196 classes.
4) Cifar-100 [33]: an image dataset containing 60 000 samples in 100 classes, with 600 images per class.
5) Caltech256 [34]: an image dataset with various types of classes. It has 30 607 images and 256 class labels.
6) SUN [35]: a scene understanding dataset with 397 kinds of scenes. We modify it by leaving out categories that have more than one parent label and samples with multiple labels. The resulting SUN dataset has 324 classes with at least 100 images per category.
7) ImageNet 1 K [36]: a large-scale image classification dataset containing 1000 categories.

Each dataset is split into a training subset and a test subset of 80% and 20%, respectively. The training subset is used to construct the hierarchical structure, while the test subset is used to obtain the classification results. All reported results are averages over five runs, and all experiments are executed on an Intel Core i7-600 running Windows 8 at 3.40 GHz with 32 GB of memory.

B. Comparison Methods
We compare various algorithms, including tree construction methods, a semantic ontology, and a deep learning baseline.

1) Label Tree (LT) [12]: builds the label tree based on the confusion degree of classification results. It first learns a multiclass SVM and obtains the confusion matrix from its predictions as the affinity matrix. Then, the tree is built through hierarchical spectral clustering.
2) Mean-Vector-Based Tree Learning (MeanVT) [14]: takes the distance between the centers of different classes as the interclass distance and thereby builds the similarity matrix. Then, it uses hierarchical spectral clustering to build the tree according to the similarity matrix.

3) Mean-Variance-Based Visual Tree Construction (MeanVarVT) [7]: constructs a similarity matrix by using the mean vector and the variance vector of each class to measure the distance between different classes. Then, it uses hierarchical spectral clustering to build a tree according to the similarity matrix.
4) Enhanced Visual Tree Construction (EnhancedVT) [8]: first performs active sampling to choose a small subset of samples reflecting the features of the dataset, then applies the Hausdorff distance to construct the affinity matrix. Finally, it utilizes hierarchical spectral clustering to build the label tree structure.
5) Ontology (OTG): the expert-designed semantic tree structure. It reflects the thinking manner of human beings and helps organize the datasets.
6) Standard VGG Net [6]: the conventional VGG deep learning model, which uses a softmax layer to classify samples over all candidate classes in a flat manner.

We use a pretrained VGG-19 Net fine-tuned on the various datasets for deep feature extraction, and replace its softmax layer with the proposed deep fuzzy tree algorithm. For a fair comparison of the quality of the tree structures, we obtain the different interclass relation matrices and apply the same community detection method to build the tree structure for [7], [8], [12], [14].

C. Results on Classification Performance
To investigate the performance of the proposed model, we compare DFT with other state-of-the-art models on six visual datasets, and use the classification accuracy on the original test labels to assess performance. In the proposed DFT, we set the parameter η = 0.05 and choose the best parameter σ from the candidate set {10^−2, 10^−1, ..., 10^6}. The results are shown in Table II.
There are three aspects of these results that merit discussion. First, flat classification versus hierarchical classification. Table II shows that the Standard VGG-19 Net performs well, compared with all hierarchical methods, on datasets without too many labels, such as PASCAL VOC, ILSVRC 65, and Cifar 100. Just as [4] and [5] point out, a flat classifier performs better on easy tasks, while a hierarchical classifier is good at handling difficult tasks by dividing a hard task into many easier subtasks. DFT outperforms the flat VGG-19 Net on most of the datasets with a large number of labels, which verifies the conclusion of [4] and [5].
Second, ontology versus data-driven hierarchy. Table II demonstrates that the ontology-based method performs well on ILSVRC 65, Cifar 100, and SUN in comparison with the data-dependent hierarchical methods except DFT, which suggests that human knowledge is very helpful for determining structure and classification. However, there are large gaps between the ontology-based and data-dependent hierarchical methods on PASCAL VOC and Stanford Cars. This suggests that building the tree structure without data does not work well in all classification tasks, as a semantic gap exists in these tasks that needs to be closed by considering data information.

Third, different hierarchical structure learning algorithms. It can be seen from Table II that the proposed DFT performs better than the other hierarchical methods on all the datasets. MeanVT generally fails to achieve good performance, since a single central point can hardly express all the information in the data and is only effective if the data distributions of all the classes are ball-shaped. MeanVarVT alleviates this problem by using the mean and the variance of the data, thereby improving performance noticeably. However, it still relies on the prior assumption that the data distribution is Gaussian, which is hard to verify for the data distributions of massive classes. The performance of EnhancedVT on datasets with fewer labels is generally better than on datasets with massive labels; its limitation is that the Hausdorff distance is sensitive to abnormal points, which are inevitable in datasets with massive labels. Finally, although the LabelTree algorithm performs relatively well on various datasets, it cannot ensure the reliability of the SVM for all datasets, especially those with huge numbers of labels.
Moreover, to further explore whether the observed differences are statistically significant, the Friedman test [37] for multiple comparisons, together with the Bonferroni-Dunn post hoc test [38] to identify pairwise differences, is applied on all seven datasets. In the Friedman test, given k compared algorithms and N datasets, let r_i^j be the rank of the jth algorithm on the ith dataset, and R_j = (1/N) Σ_{i=1}^N r_i^j be the average rank of algorithm j over all datasets. The null hypothesis of the Friedman test is that all the algorithms are equivalent in terms of classification accuracy. Under the null hypothesis, the Friedman statistic

χ²_F = (12N / (k(k + 1))) [ Σ_j R_j² − k(k + 1)²/4 ]

is distributed according to χ²_F with k − 1 degrees of freedom, and

F_F = ((N − 1) χ²_F) / (N(k − 1) − χ²_F)    (16)

follows a Fisher distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. The average ranks of all the algorithms in terms of classification accuracy are listed in Table II, and the value F_F = 26.649 is computed according to (16).
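The two statistics, and the critical distance used below, are easy to reproduce; the following sketch assumes ranks is an N × k array of per-dataset ranks.

```python
import numpy as np

def friedman_ff(ranks):
    """Friedman chi-square and its F-distributed form F_F, eq. (16);
    `ranks` has one row per dataset and one column per algorithm."""
    N, k = ranks.shape
    R = ranks.mean(axis=0)  # average rank R_j of each algorithm
    chi2 = 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)
    ff = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return chi2, ff

def bonferroni_dunn_cd(q_alpha, k, N):
    """Critical distance CD = q_alpha * sqrt(k(k + 1) / (6N))."""
    return q_alpha * np.sqrt(k * (k + 1) / (6 * N))

# Sanity checks against Section V-C: CD_0.1 ~= 2.76 for k = 7, N = 7,
# and CD_0.1 = 0.8225 for k = 2, N = 4.
```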
With seven algorithms and seven datasets, the critical value of F(6, 36) for α = 0.05 is 2.3638, so the null hypothesis can be rejected. Thus, the algorithms are not all equivalent in terms of classification accuracy, and there exist significant differences between them. The Bonferroni-Dunn post hoc test is then leveraged to detect whether the DFT algorithm is better than the existing ones on all seven datasets. Specifically, the performance of two compared algorithms is significantly different if the distance between their average ranks exceeds the critical distance

CD_α = q_α √(k(k + 1)/(6N)),

where q_α is taken from [38, Table 5]. Note that q_0.1 = 2.394 with k = 7, so CD_0.1 = 2.394 × √((7 × 8)/(6 × 7)) ≈ 2.764. Fig. 3 visually shows the CD diagrams in terms of classification accuracy, in which the lowest (best) ranks appear on the right of the x-axis. The bars show the estimated range of ranks, such that algorithms whose bars do not overlap horizontally are statistically different. From Fig. 3, in terms of classification accuracy, DFT performs statistically better than MeanVT, MeanVarVT, Ontology, and EnhancedVT, but there is no statistical difference among DFT, Standard VGG, and LabelTree. According to the conclusions of [4] and [5] and Table II, DFT is more appropriate than Standard VGG for data with a large number of labels, so we run another Bonferroni-Dunn test of DFT against Standard VGG across the four datasets with many labels, i.e., Stanford Cars, Caltech 256, SUN, and ImageNet 1 K. In this test, q_0.1 = 1.645 with k = 2 leads to CD_0.1 = 0.8225. The result is shown in Fig. 4 and demonstrates a statistical difference between DFT and Standard VGG, which verifies that DFT performs better than Standard VGG on datasets with a large number of labels.

D. Case Study on Learned Tree Structures
To better understand the features of the structures, we analyze and visualize the PASCAL VOC dataset in Fig. 5. Fig. 5(a) is the expert-designed semantic structure, and Fig. 5(b) is the structure learned by DFT. There are some slight differences between the two structures, annotated in red. In the semantic structure, Aeroplane and Bird belong to the superclasses Vehicles and Animal, respectively. However, they are grouped into the same superclass in the structure learned by DFT. Actually, the images of Aeroplane and Bird are visually very similar, and it is more reasonable to group them into one superclass. Similarly, Bus is a fine-grained class of 4-Wheeled Vehicles, whereas Train is not; in contrast, they are assigned to the same superclass in the structure learned by DFT. The images of Train are almost all locomotive images, which strongly resemble the images of Bus, and hence it is more reasonable to group them together.

Fig. 6. Visualization of the learned tree structures. The learned tree structure removes the class Mushroom from the group Fruit and Vegetables, which corrects the unreasonable local parts of the semantic tree structure noted by [33].
Moreover, we visualize the learned tree structure of the Cifar 100 dataset in Fig. 6, which explicitly shows how closely different classes are related. It is worth noting that the learned structure corrects mistakes in the original human-designed structure. As stated in [33], Mushroom was grouped into Fruit and Vegetables for convenience, even though it does not really belong to that group. Our learned tree structure correctly places Mushroom into a separate new group, as shown in Fig. 6.

E. Results on Efficiency of Tree Construction
To explore the efficiency of our algorithm, we run all the hierarchical models, i.e., LabelTree, MeanVT, MeanVarVT, EnhancedVT, and DFT, on all seven datasets and compare the running time of tree construction. The Ontology method uses a predefined tree structure constructed by humans, and Standard VGG does not make use of a tree structure, so they are not included in this part of the experiments. For DFT, we set the parameter η = 0.05. The results are shown in Table III. It can be seen that MeanVT is the most efficient algorithm, while EnhancedVT takes the longest time to build the tree. The efficiency of the proposed DFT generally comes just behind MeanVT and is comparable to MeanVarVT. To further explore the statistical differences, the Friedman test and the Bonferroni-Dunn post hoc test are again applied. In the Friedman test, the null hypothesis is that all the tree learning algorithms are equivalent in terms of run time. According to (16), χ²_F = 22.456 and F_F = 24.303 with five tree learning algorithms and seven datasets. Therefore, we can reject the null hypothesis and conclude that there are significant differences between the algorithms. With the null hypothesis rejected, the Bonferroni-Dunn post hoc test can proceed; in this case, it is used to explore whether the proposed algorithm is statistically better than the others.
The Bonferroni-Dunn post hoc test result is shown in Fig. 7 with the CD diagram in terms of run time. From this, we can conclude that the proposed DFT is comparable with MeanVT and MeanVarVT, with no statistical differences between them. However, LabelTree and EnhancedVT are less efficient than these three algorithms. LabelTree first trains a multiclass SVM and then uses it to obtain the confusion matrix, so its efficiency is heavily influenced by the SVM training. EnhancedVT aims to find the most important samples of the data, requiring a large amount of time to obtain the selected sample set. The reason EnhancedVT is less efficient than LabelTree on some datasets is mainly the sample selection process, which computes multiple features of the data and takes up most of the running time. It is worth noting that LabelTree and EnhancedVT perform well at the cost of decreased efficiency. On the other hand, the MeanVT and MeanVarVT algorithms are very efficient, but their performance is not as good as the other hierarchical methods, since they cannot measure interclass similarity accurately when assuming ball-shaped or Gaussian data distributions. The proposed DFT obtains the best level of classification accuracy (with LabelTree) while its efficiency is comparable with MeanVT, MeanVarVT, and Standard VGG, which indicates that DFT is both effective and efficient.

F. Parameter Analysis
There are two parameters in the proposed model, σ and η. To explore the influence of each parameter, various combinations are examined in this experiment. We choose σ from {10^−2, 10^−1, ..., 10^4} and η from {0.05, 0.15, ..., 0.65}, and investigate in Fig. 8 how much different combinations of σ and η influence classification accuracy.

First, η is the ratio of representative samples in each class; the larger η is, the more of the original information the representative points retain. Fig. 8 shows that the classification accuracy first increases from η = 0.05 and then decreases until η = 0.45. This indicates that even though using more samples is helpful for training the model, inappropriate ratios of representative points can lead to a decrease in performance. The reason may be that the general features of the data are only partly reflected by a few representative points, with the reflection gradually improving until there are sufficient points to describe the overall features; beyond that, a larger value of η can help improve performance since more information of the data is leveraged. Interestingly, η = 0.05 appears to be a good choice for balancing effectiveness and efficiency experimentally, which indicates that a small amount of data can also produce good results if it properly reflects the general information of the dataset.
Second, Fig. 9 shows that the optimal (or near-optimal) value of σ differs across datasets. With a fixed value of η, values of σ in the range [10², 10⁴] achieve good performance on most datasets. Furthermore, we also plot all combinations of η and σ to explore the parameter influence in more detail in Figs. 9-14 of the supplementary material. Generally, we find that the classification accuracy is broadly similar when each parameter takes a value in the recommended interval, but there are significant differences if inappropriate parameter values are selected.

VI. CONCLUSION
In this article, we propose a new DFT framework, aiming to gain the benefit of hierarchical classification to help deep learning solve classification tasks with a large number of labels. With the help of the theory of fuzzy rough sets, the dual fuzzy interclass similarity is designed to learn a better tree structure, with the fuzzy rough classifier set as the base classifier; the measurement is further proved theoretically to be effective for hierarchical classification. By using community detection methods, a tree structure is constructed by hierarchically detecting the most similar communities. To deal with large-scale tasks, fast adaptation is designed using vector quantization. The performance of the proposed DFT algorithm shows its effectiveness and efficiency in comparison with the standard deep learning model and state-of-the-art hierarchical classification models.