Measuring Similarity Between Discontinuous Intervals - Challenges and Solutions

Discontinuous intervals (DIs) arise in a wide range of contexts, from real-world data capture of human opinion to α-cuts of non-convex fuzzy sets. Commonly, for assessing the similarity of DIs, the latter are converted into their continuous form, followed by the application of a continuous interval (CI) compatible similarity measure. While this conversion is efficient, it involves the loss of discontinuity information and thus limits the accuracy of similarity results. Further, most similarity measures, including the most popular ones such as Jaccard and Dice, suffer from aliasing, that is, they are liable to return the same similarity for very different pairs of CIs. To address both of these challenges, this paper proposes a generalized approach for calculating the similarity of DIs which leverages the recently introduced bidirectional subsethood based similarity measure (which avoids aliasing) while accounting for all pairs of the continuous subintervals within the DIs to be compared. We detail the proposed approach and demonstrate its behaviour when applying bidirectional subsethood, Jaccard and Dice as similarity measures, using different pairs of synthetic DIs. The experimental results show that the similarity outputs of the new generalized approach follow intuition for all three similarity measures; however, it is only the proposed integration with the bidirectional subsethood similarity measure which also avoids aliasing for DIs.


I. INTRODUCTION
Interval-valued data is used in many applications to model uncertain and imprecise data in a simple and efficient way. In particular, continuous intervals (CIs), bounded by left and right endpoints [1], are often used. Discontinuous intervals (DIs), each consisting of a sequence of continuous subintervals [2], can arise in many real-world situations, such as hazard detection [3], fusion of sensor data observed in a non-continuous space [4], temporal reasoning [5], [6], and expressing natural language with temporal repetition [7], where the similarity between DIs is often assessed and applied. Moreover, in fuzzy set (FS) theory, the α-cuts of non-convex FSs also result in DIs [8]. In such cases, the similarity between non-convex FSs with the α-plane decomposition depends on the computation of the similarity of DIs, such as proposed in this paper.
Many similarity measures (SMs) have been proposed for CIs, of which Jaccard [9] and Dice [10] are the most popular. However, thus far, there is no specific SM for DIs that directly assesses their similarity. Instead, DIs are commonly converted into their continuous form (CIs) using common approaches such as interval addition [11], interval union [12], or a 'convexify' function [13], [14], and then the respective CI SM is applied to compute the similarity. However, the 'DI to CI' conversion involves the loss of the discontinuity information of the DIs, changing the original meaning of the data and thus affecting the accuracy of the similarity of the DIs. Figure 1 shows an example of this, where we consider two different pairs of DIs. The use of the 'convexify' function converts both cases into the same pair of CIs, as shown in Fig. 2. As a result, the Jaccard and Dice SMs return the same similarity for both pairs of DIs, which goes against intuition with respect to the original DIs. Here, one way to avoid this type of information loss is to consider all possible combinations of the continuous subintervals within the DIs [15].
However, a further problem, namely the aliasing issue with common SMs such as Jaccard and Dice, has recently been identified [16], where the same similarity is returned for very different sets of intervals. A recently introduced SM for CIs using their overlapping ratios [16], also called bidirectional subsethood [17], has been shown to avoid aliasing for CIs.
In this paper, we propose a generalized SM for DIs which combines the bidirectional subsethood based SM [16], [17] with the idea of considering all pairs of continuous subintervals within the DIs. This generalized approach maintains the discontinuity information of DIs and uniformly handles the similarity computation of both CIs and DIs. We explore and contrast the behaviour of the resulting SM with that obtained by employing the well-known Jaccard and Dice SMs as part of the same framework, highlighting that such approaches, while avoiding information loss, still suffer from aliasing. The rest of this paper is organized as follows. In Section II, we present background on CIs and DIs, subsethood, and two common SMs for CIs, along with the bidirectional subsethood based SM [16], [17]. Section III introduces the proposed generalized SM for DIs and discusses its properties. We demonstrate this generalized SM using a set of synthetic examples of DIs and discuss the results in Section IV. Section V concludes the paper along with future work. Table I presents a list of acronyms and notation used in this paper.

II. BACKGROUND
In this section, we first define CIs and DIs, followed by a review of subsethood, as well as the Jaccard and Dice SMs. Finally, we briefly review the bidirectional subsethood based SM for CIs [16], [17].

A. Continuous Intervals
A CI is a set of real numbers characterized by left and right endpoints [1]. Mathematically, it is represented as $a = [a^-, a^+]$ with $a^- < a^+$ [11]. The cardinality, or equivalently, the size or width of a CI $a$ is $|a| = |a^+ - a^-|$ [18]. Three common approaches for representing multiple disjoint CIs with a single CI are:
• Interval addition: If $a$ and $b$ are two bounded, non-empty CIs, then their addition, $a + b = [a^- + b^-, a^+ + b^+]$, is also a bounded, non-empty CI [11].
• Interval union: The union of $a$ and $b$, as defined in [12], results in a single bounded, non-empty CI [12].
• A 'Convexify' function: It takes two CIs $a$ and $b$ as inputs and returns the smallest CI that covers both $a$ and $b$ [13].
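To make the information loss of such conversions concrete, the following minimal Python sketch implements interval addition and the 'convexify' function as described above. The tuple representation, the function names `add` and `convexify`, and the two example DIs are illustrative assumptions and are not taken from the cited works.

```python
from typing import Tuple

Interval = Tuple[float, float]  # closed CI (left endpoint, right endpoint)


def add(a: Interval, b: Interval) -> Interval:
    """Interval addition: [a- + b-, a+ + b+]."""
    return (a[0] + b[0], a[1] + b[1])


def convexify(a: Interval, b: Interval) -> Interval:
    """Smallest CI that covers both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))


# Two hypothetical DIs whose subintervals are separated by gaps of different width.
di_1 = [(1.0, 2.0), (7.0, 8.0)]  # wide gap between the subintervals
di_2 = [(1.0, 4.0), (5.0, 8.0)]  # narrow gap between the subintervals

hull_1 = convexify(di_1[0], di_1[1])
hull_2 = convexify(di_2[0], di_2[1])
print(hull_1, hull_2)         # both are (1.0, 8.0)
print(add(di_1[0], di_1[1]))  # (8.0, 10.0)
```

Both DIs collapse to the same CI, so any CI SM applied afterwards cannot distinguish them, which is the loss of discontinuity information discussed in Section I.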

B. Discontinuous Intervals
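A DI consists of a number of continuous subintervals (i.e., CIs) [2]. Mathematically, it is represented as [4]
$$a = \bigcup_{i=1}^{m} a_i$$
where $a$ is the DI and $m$ is the number of its CIs such that $a_1 < \ldots < a_i < \ldots < a_m$, and $a_i$ is the $i$th CI of $a$ such that $a_i^- < a_i^+$. Alternatively, $a$ can be presented as
$$a = \{[a_1^-, a_1^+], \ldots, [a_i^-, a_i^+], \ldots, [a_m^-, a_m^+]\}.$$

C. Subsethood
Subsethood is a relation that expresses the degree to which one object is a subset of another object. For two crisp sets, $a$ and $b$, the subsethood is [19]
$$S_h(a, b) = \frac{|a \cap b|}{|a|} \tag{1}$$
where $|a \cap b|$ is the cardinality of the intersection of $a$ and $b$, and $|a|$ is the cardinality of $a$. $S_h$ lies between 0 and 1, where $S_h(a, b) = 1$ when $a \subseteq b$ and $S_h(a, b) = 0$ when $a$ and $b$ are disjoint. Equivalently, the subsethood between two CIs $a$ and $b$ can be defined as
$$S_h(a, b) = \frac{|a \cap b|}{|a|} \tag{2}$$
where $|a \cap b|$ is the size of the intersection between $a$ and $b$ and $|a|$ is the size of $a$.
For the FSs $A$ and $B$, the degree of subsethood is [23]
$$S_h(A, B) = \frac{\sum_{i} \min\big(\mu_A(x_i), \mu_B(x_i)\big)}{\sum_{i} \mu_A(x_i)} \tag{3}$$
where $\sum_{i} \min\big(\mu_A(x_i), \mu_B(x_i)\big)$ is a measure of the cardinality of the intersection of the membership functions of $A$ and $B$, and $\sum_{i} \mu_A(x_i)$ is a measure of the cardinality of $A$.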
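As a small illustration of the subsethood of CIs, the sketch below computes the ratio of the intersection size to the size of the first interval. The tuple representation and the helper names `size`, `intersection_size` and `subsethood` are assumptions made for this example only.

```python
from typing import Tuple

Interval = Tuple[float, float]  # closed CI (left endpoint, right endpoint)


def size(a: Interval) -> float:
    """Width |a| of a CI."""
    return a[1] - a[0]


def intersection_size(a: Interval, b: Interval) -> float:
    """Width of the overlap of a and b (0 if they do not overlap)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def subsethood(a: Interval, b: Interval) -> float:
    """S_h(a, b) = |a ∩ b| / |a|; assumes a- < a+ so that |a| > 0."""
    return intersection_size(a, b) / size(a)


a, b = (2.0, 4.0), (0.0, 10.0)
print(subsethood(a, b))                    # 1.0: a lies entirely inside b
print(subsethood(b, a))                    # 0.2: a fifth of b is covered by a
print(subsethood((0.0, 1.0), (2.0, 3.0)))  # 0.0: disjoint CIs
```

The example shows that subsethood is not symmetric, which is why a similarity measure built on it needs to consider both directions.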

D. Jaccard Similarity Measure
The Jaccard SM [9] between sets $a$ and $b$ is defined as the ratio of the cardinality of their intersection to the cardinality of their union,
$$S_J(a, b) = \frac{|a \cap b|}{|a \cup b|}. \tag{4}$$
Beyond sets, the Jaccard SM is used to estimate the similarity of CIs or sets of CIs, as employed for example in data fusion [24], [25], and that of fuzzy sets [26]. For comparing two CIs $a$ and $b$, the Jaccard SM is expressed as
$$S_J(a, b) = \frac{|a \cap b|}{|a \cup b|} \tag{5}$$
where $|a \cap b|$ is the size of the intersection between $a$ and $b$ and $|a \cup b|$ is the size of the interval segment(s) covering them. When $a$ and $b$ overlap completely, $S_J(a, b) = 1$, and when they do not overlap, $S_J(a, b) = 0$. Again, for the FSs $A$ and $B$ on the discrete universe of discourse $X$, the Jaccard similarity is extended as [27]
$$S_J(A, B) = \frac{\sum_{i} \min\big(\mu_A(x_i), \mu_B(x_i)\big)}{\sum_{i} \max\big(\mu_A(x_i), \mu_B(x_i)\big)} \tag{6}$$
where $\mu_A(x_i)$ and $\mu_B(x_i)$ are the membership grades of $x_i$ in $A$ and $B$, respectively. Equation (6) gives 1 for identical FSs and 0 for disjoint FSs. Note that the Jaccard SM has been further extended for interval-valued [28] and type-2 fuzzy sets [29]; however, this is not discussed further here.
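The two Jaccard variants described above can be sketched as follows. The function names `jaccard_ci` and `jaccard_fs`, the tuple and list representations, and the example values are assumptions for illustration; the union size of two CIs is taken as $|a| + |b| - |a \cap b|$, i.e., the total length of the segment(s) covering them.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # closed CI (left endpoint, right endpoint)


def intersection_size(a: Interval, b: Interval) -> float:
    """Width of the overlap of a and b (0 if they do not overlap)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def jaccard_ci(a: Interval, b: Interval) -> float:
    """Jaccard SM for two CIs: |a ∩ b| / |a ∪ b|."""
    inter = intersection_size(a, b)
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union


def jaccard_fs(mu_a: Sequence[float], mu_b: Sequence[float]) -> float:
    """Jaccard SM for two discrete FSs given as membership grades over the
    same universe; assumes at least one grade is non-zero."""
    num = sum(min(x, y) for x, y in zip(mu_a, mu_b))
    den = sum(max(x, y) for x, y in zip(mu_a, mu_b))
    return num / den


print(round(jaccard_ci((0.0, 4.0), (2.0, 6.0)), 4))  # 0.3333
print(jaccard_ci((0.0, 4.0), (0.0, 4.0)))            # 1.0 for identical CIs
print(jaccard_ci((0.0, 1.0), (2.0, 3.0)))            # 0.0 for disjoint CIs
print(round(jaccard_fs([0.2, 0.8, 1.0], [0.4, 0.6, 1.0]), 4))  # 0.8182
```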

E. Dice Similarity Measure
The Dice SM [10] between sets $a$ and $b$ is the ratio of the cardinality of their intersection to the average of their cardinalities, expressed as
$$S_D(a, b) = \frac{2\,|a \cap b|}{|a| + |b|}. \tag{7}$$
In [24], [25], the Dice similarity is used along with the Jaccard similarity for CIs. As for sets, the Dice similarity for CIs $a$ and $b$ is
$$S_D(a, b) = \frac{2\,|a \cap b|}{|a| + |b|}. \tag{8}$$
While less frequently used for FSs than Jaccard, the Dice SM is used, for example, in [30], [31] for trapezoidal FSs in the context of solving multi-criteria decision-making problems.
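A minimal sketch of the Dice SM for CIs, under the same assumed tuple representation as used for illustration elsewhere (the name `dice_ci` is hypothetical):

```python
from typing import Tuple

Interval = Tuple[float, float]  # closed CI (left endpoint, right endpoint)


def intersection_size(a: Interval, b: Interval) -> float:
    """Width of the overlap of a and b (0 if they do not overlap)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def dice_ci(a: Interval, b: Interval) -> float:
    """Dice SM for two CIs: 2|a ∩ b| / (|a| + |b|)."""
    inter = intersection_size(a, b)
    return 2.0 * inter / ((a[1] - a[0]) + (b[1] - b[0]))


print(dice_ci((0.0, 4.0), (2.0, 6.0)))  # 0.5
print(dice_ci((0.0, 4.0), (0.0, 4.0)))  # 1.0 for identical CIs
print(dice_ci((0.0, 1.0), (2.0, 3.0)))  # 0.0 for disjoint CIs
```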

F. Bidirectional Subsethood Based Similarity Measure for Continuous Intervals
A new SM for CIs was introduced in [16], which uses the reciprocal subsethoods [17] or overlapping ratios [16] of a pair of CIs to capture their similarity. This measure for two CIs $a$ and $b$ [16], [17] is
$$S_{S_h}(a, b) = \star\big(S_h(a, b), S_h(b, a)\big) \tag{9}$$
where $\star$ is a t-norm. We can rewrite (9) using (2) as
$$S_{S_h}(a, b) = \star\left(\frac{|a \cap b|}{|a|}, \frac{|a \cap b|}{|b|}\right). \tag{10}$$
This SM directly captures any changes in the size of the CIs and is sensitive to the size of their intersection when one CI is a subset of the other. Further, it is always within [0, 1] and, for the minimum t-norm, is bounded below and above by the Jaccard and Dice SMs, respectively, since $|a \cup b| \ge \max(|a|, |b|) \ge (|a| + |b|)/2$.
In the next section, we introduce a generalized measure in which any of the $S_J$, $S_D$ or $S_{S_h}$ SMs can be applied to estimate the similarity between CIs or DIs while respecting their continuity or discontinuity.
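The following sketch contrasts the three CI measures on two pairs of CIs and illustrates aliasing. It assumes the minimum t-norm as the default; the function names and both example pairs are hypothetical and chosen only so that the pairs share the same intersection size, union size and summed widths.

```python
from typing import Callable, Tuple

Interval = Tuple[float, float]  # closed CI (left endpoint, right endpoint)
TNorm = Callable[[float, float], float]


def size(a: Interval) -> float:
    return a[1] - a[0]


def intersection_size(a: Interval, b: Interval) -> float:
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def jaccard(a: Interval, b: Interval) -> float:
    inter = intersection_size(a, b)
    return inter / (size(a) + size(b) - inter)


def dice(a: Interval, b: Interval) -> float:
    return 2.0 * intersection_size(a, b) / (size(a) + size(b))


def bidirectional_subsethood(a: Interval, b: Interval,
                             t_norm: TNorm = min) -> float:
    """t-norm of the two reciprocal subsethoods S_h(a, b) and S_h(b, a)."""
    inter = intersection_size(a, b)
    return t_norm(inter / size(a), inter / size(b))


pair_1 = ((0.0, 4.0), (0.0, 2.0))  # second CI is a subset of the first
pair_2 = ((0.0, 3.0), (1.0, 4.0))  # partial overlap of equally wide CIs

for a, b in (pair_1, pair_2):
    print(round(jaccard(a, b), 4), round(dice(a, b), 4),
          round(bidirectional_subsethood(a, b), 4))
# prints "0.5 0.6667 0.5" and then "0.5 0.6667 0.6667"
```

Jaccard and Dice each return one value for both pairs, whereas the bidirectional subsethood based SM separates the subset case from the partial-overlap case. In both rows the value of the latter lies between the Jaccard and Dice values.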

III. PROPOSED GENERALIZED SIMILARITY MEASURE FOR DISCONTINUOUS INTERVALS
In this section, we propose a generalized SM for computing the similarity between two DIs by comparing all possible pairs of their continuous subintervals. As stated, this generalized SM is equally applicable to CIs. First, we present the proposed generalization and then demonstrate its major properties. We note that, while the proposed approach is computationally expensive, we focus only on the quality of the resulting similarity assessment in this paper. We have already made progress on making the approach computationally more efficient but, considering the constraints on manuscript size, we will address this in a future publication.

A. Proposed Generalized Similarity Measure
In the proposed generalization, we use the basic notion that, as a DI contains one or more continuous subintervals, comparing two DIs is analogous to systematically comparing their subintervals. With this intent, we first determine all possible pairs of subintervals within the DIs and compute their similarity. We then aggregate all these similarities to determine the overall similarity between the DIs. Equation (11) defines the generalized SM for two DIs $a$ and $b$ with $m$ and $n$ subintervals, respectively:
$$S(a, b) = \frac{1}{\max(m, n)} \sum_{i=1}^{m} \sum_{j=1}^{n} S(a_i, b_j) \tag{11}$$
where $S(a_i, b_j)$ computes the similarity between subintervals $a_i \in a$ and $b_j \in b$ of each pair $\{a_i, b_j\}$ using any of the three SMs ($S_J$, $S_D$, and $S_{S_h}$). Here, $\max(m, n)$ is the maximum number of pairs that can arise from the comparison of a single subinterval. The $\max(m, n)$ operator in the normalization step guarantees a maximum similarity of 1, achieved when two DIs are identical. While other operators could potentially be explored, the $\max(m, n)$ operator provides intuitive behaviour for the similarity measure.
Remark 1. When both DIs consist of only a single CI, the formulation for $S$ in (11) reduces to the original formulation for CIs.
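The aggregation in (11) can be sketched in a few lines of Python. The representation of a DI as a list of `(left, right)` tuples and all function names are assumptions for illustration. The two DIs below are hypothetical and are not taken from Fig. 3; they were merely chosen so that the resulting totals coincide with the values reported in Example 1.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]  # closed CI (left endpoint, right endpoint)
DI = List[Interval]             # ordered, pairwise disjoint subintervals
SM = Callable[[Interval, Interval], float]


def size(a: Interval) -> float:
    return a[1] - a[0]


def intersection_size(a: Interval, b: Interval) -> float:
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def jaccard(a: Interval, b: Interval) -> float:
    inter = intersection_size(a, b)
    return inter / (size(a) + size(b) - inter)


def dice(a: Interval, b: Interval) -> float:
    return 2.0 * intersection_size(a, b) / (size(a) + size(b))


def bidirectional_subsethood(a: Interval, b: Interval) -> float:
    """Bidirectional subsethood with the minimum t-norm."""
    inter = intersection_size(a, b)
    return min(inter / size(a), inter / size(b))


def generalized_similarity(a: DI, b: DI, sm: SM) -> float:
    """Sum the SM over all subinterval pairs and normalize by max(m, n)."""
    total = sum(sm(a_i, b_j) for a_i in a for b_j in b)
    return total / max(len(a), len(b))


# Hypothetical DIs with three and two subintervals, respectively.
a = [(0.0, 2.0), (4.0, 7.0), (9.0, 10.0)]
b = [(1.0, 2.0), (5.0, 6.0)]

for name, sm in (("S_J", jaccard), ("S_D", dice),
                 ("S_Sh", bidirectional_subsethood)):
    print(name, round(generalized_similarity(a, b, sm), 4))
# prints S_J 0.2778, S_D 0.3889 and S_Sh 0.2778

# Remark 1: with single-CI inputs the measure reduces to the plain CI SM.
print(generalized_similarity([(0.0, 4.0)], [(2.0, 6.0)], dice)
      == dice((0.0, 4.0), (2.0, 6.0)))  # True
```

Passing the CI measure as a function argument keeps the aggregation independent of the chosen SM, which mirrors how the same framework is used with all three measures in this paper.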
Example 1. We consider an example in Fig. 3 … disjoint. Figure 3 shows the similarity for all pairs using the $S_J$, $S_D$ and $S_{S_h}$ (with the minimum t-norm) SMs, from which the overall similarity between $a$ and $b$ follows using (11).
(e) All of the $S_J$, $S_D$ and $S_{S_h}$ measures are transitive [16], which implies that the $S$ measure is also transitive.

IV. DEMONSTRATION
This section presents the behaviour of the proposed generalized approach based on the bidirectional subsethood based SM ($S_{S_h}$) along with the Jaccard ($S_J$) and Dice ($S_D$) SMs for DIs. Herein, we conduct two separate sets of experiments with different synthetic examples, each designed to facilitate an intuitive understanding of the behaviour of the approaches.
With the first synthetic dataset, we gradually decrease the overlap between the subintervals of a pair of DIs to see how smoothly the similarity changes from 1 to 0. With the second synthetic dataset, we change the number of subintervals and their degree of overlap. In particular, we expect to see changes in the similarity due to a rise in the number of subintervals and their potential (lack of) overlap. In all experiments, we use the minimum t-norm for the $S_{S_h}$ SM as it is the most common in practice. Further, all of these experiments are implemented in Java on an Intel(R) Core(TM) i3-4005U series based machine running at 1.70 GHz with 8 GB RAM.

A. Synthetic Dataset-1
We consider a number of scenarios for a pair of DIs ($a$ and $b$) where each of them includes two subintervals. In each scenario, we vary the degree of overlap between the subintervals of $a$ and $b$ to explore how the generalized approach ($S$) responds with the respective SMs $S_J$, $S_D$, and $S_{S_h}$ and how smoothly the similarity results change from 1 to 0. We keep $a$ unchanged in all scenarios but shift the subintervals of $b$ consecutively by a factor of 25%. In Fig. 4(b)-(e), we gradually shift the rightmost subinterval [6, 8] of $b$ by a factor of 25% until its only intersection is the right endpoint of the subinterval [6, 8] of $a$. In Fig. 4(f)-(i), we further shift the leftmost subinterval [1, 3] of $b$ by a factor of 25% until its only intersection is the right endpoint of the subinterval [1, 3] of $a$. The results in Fig. 5(b) show that the initial similarity between $a$ and $b$ is 1 from the $S$ measure with all three SMs (as they are identical in Fig. 4(a)). Their similarity gradually decreases to 0.50 when the second subinterval [6, 8] of $b$ is gradually shifted by the factor of 25%. The similarity drops further and gradually reaches 0 when the first subinterval [1, 3] of $b$ is also repeatedly shifted. One would intuitively expect the similarity between the DIs to decrease in proportion to the rate of change in their overlap. Fig. 5(b) shows such a proportionate decline in the similarity results for both the $S_{S_h}$ and $S_D$ SMs, while the $S_J$ SM exhibits a slightly higher than proportionate decline.
Scenario 2.5 - In Fig. 6(e), we add one more subinterval [9, 10] to $a$ as designed in Scenario 2.4 (Fig. 6(d)), thus setting $a$ as [0, [3,7], [9,10], while $b$ remains the same. As this new subinterval [9, 10] of $a$ is disjoint from all subintervals of $b$, adding it should decrease the similarity between $a$ and $b$ as compared to Scenario 2.4 (Fig. 6(d)). The results show that $S$ with all three SMs yields the expected similarity.
Scenario 2.6 - In Fig. 6(f), $a$ remains the same but $b$ is changed by adding one more subinterval [9.7, 10]. Thus, $b$ is now {[0, 2], [5, 9], [9.7, 10]}. The new subinterval of $b$ has a 30% overlap with the subinterval [9, 10] of $a$. Therefore, the overall similarity between $a$ and $b$ is expected to be higher than that of Scenario 2.5 (Fig. 6(e)). Again, we obtain higher similarity results from $S$ with all three SMs.
In summary, $S$ with the $S_{S_h}$ and $S_D$ SMs follows a proportionate decline in the similarity results as we gradually move the subintervals of a pair of DIs from complete overlap to disjoint positions. In contrast, $S$ with the $S_J$ SM yields a slightly higher than proportionate decline in the similarity results. Importantly, $S$ with the $S_J$ and $S_D$ SMs still exhibits aliasing, whereas $S$ with the $S_{S_h}$ SM is sensitive to changes in overlap and thus avoids it.
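The decline summarized above for the first synthetic dataset can be reproduced with a short Python sketch. It is a reconstruction from the textual description only: $a$ = {[1, 3], [6, 8]}, a step of 25% of the subinterval width (0.5), and rightward shifts are assumptions inferred from the text, and the code is not the Java implementation used for the experiments.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]
DI = List[Interval]


def size(a: Interval) -> float:
    return a[1] - a[0]


def intersection_size(a: Interval, b: Interval) -> float:
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def jaccard(a: Interval, b: Interval) -> float:
    inter = intersection_size(a, b)
    return inter / (size(a) + size(b) - inter)


def dice(a: Interval, b: Interval) -> float:
    return 2.0 * intersection_size(a, b) / (size(a) + size(b))


def bidirectional_subsethood(a: Interval, b: Interval) -> float:
    inter = intersection_size(a, b)
    return min(inter / size(a), inter / size(b))  # minimum t-norm


def generalized_similarity(a: DI, b: DI,
                           sm: Callable[[Interval, Interval], float]) -> float:
    return sum(sm(x, y) for x in a for y in b) / max(len(a), len(b))


def shifted(iv: Interval, d: float) -> Interval:
    return (iv[0] + d, iv[1] + d)


a = [(1.0, 3.0), (6.0, 8.0)]
step = 0.25 * 2.0  # 25% of the subinterval width

rows = []
for k in range(9):
    # Steps 1-4 move the rightmost subinterval of b, steps 5-8 the leftmost.
    b = [shifted(a[0], step * max(0, k - 4)), shifted(a[1], step * min(k, 4))]
    rows.append(tuple(round(generalized_similarity(a, b, sm), 4)
                      for sm in (jaccard, dice, bidirectional_subsethood)))
    print(k, b, rows[-1])  # columns: S with S_J, S_D, S_Sh
```

With these assumptions, the Dice and bidirectional subsethood columns fall by 0.125 per step, from 1 through 0.5 to 0, while the Jaccard column drops faster at first (0.8 after the first step).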

V. CONCLUSION
In this paper, we have proposed a generalized approach to computing the similarity of DIs by integrating the bidirectional subsethood based SM [16], [17] with the strategy of considering the similarity of all continuous subinterval combinations within the DIs. The new generalized SM is equally suitable for CIs and DIs. It does not require conversion or approximation of DIs to CIs, thus avoiding changes to the original data. We have compared the performance of the generalized approach using the bidirectional subsethood SM along with the Jaccard and Dice SMs for different synthetic pairs of DIs. The results show intuitive behaviour of the resulting generalized approach, while highlighting that only the use of the recently developed bidirectional subsethood similarity as part of the generalized approach avoids the aliasing issue.
In our generalized approach, we always consider all possible pairs of subintervals. As a result, an increase in the number of subintervals within the DIs leads to an increase in the number of similarity calculations. In particular, where DIs have many or all disjoint subinterval pairs, such a 'brute force' approach results in substantial execution time. To mitigate this, in the future, we will integrate this generalized SM with Allen's theory [5] to reduce the number of similarity calculations and the overall execution time. Further, we plan to use it for assessing the similarity of non-convex FSs. We also aim to apply it in generating data-driven fuzzy measures from DI-valued data [33] and to use it with fuzzy integrals for aggregation.
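As a rough indication of how disjoint pairs could be skipped, the sketch below uses a standard two-pointer sweep over the ordered subintervals. This is not the Allen's-theory-based integration planned above, only one simple alternative under the assumption that the subintervals of each DI are sorted and pairwise disjoint; all names are hypothetical. Since disjoint pairs receive a similarity of 0 under all three SMs, skipping them leaves the result of (11) unchanged.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]
DI = List[Interval]  # sorted, pairwise disjoint subintervals


def bidirectional_subsethood(a: Interval, b: Interval) -> float:
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))  # minimum t-norm


def overlapping_pairs(a: DI, b: DI) -> List[Tuple[int, int]]:
    """Index pairs (i, j) whose subintervals overlap, found in O(m + n) steps."""
    pairs = []
    i = j = 0
    while i < len(a) and j < len(b):
        if min(a[i][1], b[j][1]) > max(a[i][0], b[j][0]):
            pairs.append((i, j))
        # Advance the subinterval that ends first; it cannot overlap later ones.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return pairs


def generalized_similarity_sweep(
        a: DI, b: DI, sm: Callable[[Interval, Interval], float]) -> float:
    """Same value as summing over all pairs, but only overlapping pairs are scored."""
    total = sum(sm(a[i], b[j]) for i, j in overlapping_pairs(a, b))
    return total / max(len(a), len(b))


def generalized_similarity_brute(
        a: DI, b: DI, sm: Callable[[Interval, Interval], float]) -> float:
    return sum(sm(x, y) for x in a for y in b) / max(len(a), len(b))


a = [(0.0, 2.0), (4.0, 7.0), (9.0, 10.0)]
b = [(1.0, 2.0), (5.0, 6.0)]
print(overlapping_pairs(a, b))  # [(0, 0), (1, 1)]
print(generalized_similarity_sweep(a, b, bidirectional_subsethood)
      == generalized_similarity_brute(a, b, bidirectional_subsethood))  # True
```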

Fig. 3: Using the $S_J$, $S_D$, and $S_{S_h}$ SMs, the similarity results of all combinations of subintervals within a pair of DIs, one with three and the other with two subintervals.

With the $S_{S_h}$ SM, the overall similarity between $a$ and $b$ in Example 1 is $S(a, b) = \frac{1}{\max(3, 2)} \times 0.8333 = \frac{1}{3} \times 0.8333 = 0.2778$. In a similar manner, the total similarities between $a$ and $b$ with the $S_J$ and $S_D$ SMs are 0.2778 and 0.3889, respectively.
Theorem 1. The proposed generalized approach with the $S_J$, $S_D$ and $S_{S_h}$ SMs satisfies all common properties of an SM for the DIs $a$, $b$, and $c$ such that:
(a) $0 \le S(a, b) \le 1$ (boundedness);
(b) $S(a, b) = S(b, a)$ (symmetry);
(c) $S(a, b) = 1 \iff a = b$ (reflexivity);
(d) $S(a, b) = 0 \iff a$ and $b$ are disjoint (disjointness);
(e) $S(a, b) \ge S(a, c)$ when $a \subseteq b \subseteq c$ (transitivity).
Proof: Consider $a = \{a_1, \ldots, a_m\}$, $b = \{b_1, \ldots, b_n\}$, and $c = \{c_1, \ldots, c_p\}$.
(a) $S(a, b)$ involves the $S_J$, $S_D$ or $S_{S_h}$ SMs to compute the similarity for all pairs of the subintervals $a_i \in a$ and $b_j \in b$. All of the $S_J$, $S_D$ and $S_{S_h}$ SMs are bounded by 0 and 1 [16], i.e., $0 \le S(a_i, b_j) \le 1$, $\forall a_i, b_j$. Hence, the mean of all such similarities is again within 0 and 1, implying $S(a, b) \in [0, 1]$.
(b) All of the $S_J$, $S_D$ and $S_{S_h}$ measures are symmetric [16], thus making the $S$ measure symmetric too.
(c) If $a = b$, both $a$ and $b$ have an equal number $m$ of subintervals and each $a_i$ is identical to the corresponding $b_i$, i.e., $a_i = b_i$, $1 \le i \le m$. Among all subinterval pairs, $m$ pairs have identical subintervals and the rest have disjoint subintervals. This implies that $m$ pairs receive a similarity of 1 and the rest have a similarity of 0. Hence, the similarity between $a$ and $b$ is $S(a, b) = \frac{1}{\max(m, m)} \times m = \frac{m}{m} = 1$. Thus, $S(a, b) = 1$ means that $a$ and $b$ are identical DIs.
(d) If $a$ and $b$ are disjoint, no subinterval of $a$ overlaps with any subinterval of $b$, i.e., $a_i \cap b_j = \emptyset$, $1 \le i \le m$ and $1 \le j \le n$. In other words, all subinterval pairs consist of disjoint subintervals, thus receiving a similarity of 0. Hence, the similarity between $a$ and $b$ is $S(a, b) = \frac{1}{\max(m, n)} \times 0 = 0$.
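Properties (a)-(d) above can be spot-checked numerically. The sketch below does so for one hypothetical set of DIs and is only a sanity check under assumed names and data, not a substitute for the proof; property (e) is not covered.

```python
import math
from typing import Callable, Dict, List, Tuple

Interval = Tuple[float, float]
DI = List[Interval]
SM = Callable[[Interval, Interval], float]


def size(a: Interval) -> float:
    return a[1] - a[0]


def intersection_size(a: Interval, b: Interval) -> float:
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def jaccard(a: Interval, b: Interval) -> float:
    inter = intersection_size(a, b)
    return inter / (size(a) + size(b) - inter)


def dice(a: Interval, b: Interval) -> float:
    return 2.0 * intersection_size(a, b) / (size(a) + size(b))


def bidirectional_subsethood(a: Interval, b: Interval) -> float:
    inter = intersection_size(a, b)
    return min(inter / size(a), inter / size(b))  # minimum t-norm


def generalized_similarity(a: DI, b: DI, sm: SM) -> float:
    return sum(sm(x, y) for x in a for y in b) / max(len(a), len(b))


def check_properties(a: DI, b: DI, far: DI, sm: SM) -> Dict[str, bool]:
    """Check (a)-(d) for DIs a and b, where `far` is disjoint from a."""
    s_ab = generalized_similarity(a, b, sm)
    return {
        "bounded": 0.0 <= s_ab <= 1.0,
        "symmetric": math.isclose(s_ab, generalized_similarity(b, a, sm)),
        "reflexive": math.isclose(generalized_similarity(a, a, sm), 1.0),
        "disjoint": generalized_similarity(a, far, sm) == 0.0,
    }


a = [(1.0, 3.0), (6.0, 8.0)]
b = [(2.0, 4.0), (6.0, 7.0)]   # partially overlaps a
far = [(10.0, 11.0)]           # disjoint from a

for sm in (jaccard, dice, bidirectional_subsethood):
    print(sm.__name__, check_properties(a, b, far, sm))
```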

Figure 5(a) presents in detail the shifting of the subintervals of $b$ for all scenarios, and Fig. 5(b) graphically exhibits the similarity results using all three SMs.

TABLE I: Acronyms and Notation

Similarity results for the pairs of DIs with an increasing number of subintervals and varying degree of overlap. Note: the SMs $S_J$ and $S_D$ return identical results for Scenarios 2.3 and 2.4, i.e., they are subject to aliasing; only $S_{S_h}$ captures the change in the respective DIs and thus avoids aliasing.