CLUSTERING OF PARAGRAPHS WHEN SUMMARIZING LEGAL CASES
3. METHODS: THE CLUSTERING TECHNIQUES
We employ non-hierarchical clustering methods based on the selection of representative objects to thematically group the paragraphs of the alleged offences and the opinion of the court, and to identify representative paragraphs. The representative paragraphs are extracted; they form the summary of the alleged offences and the opinion of the court.
Cluster analysis is a multivariate statistical technique that automatically generates groups in data; it is a form of unsupervised learning. Non-hierarchical methods partition a set of objects into clusters of similar objects.
Clustering methods based on the selection of representative objects consider possible choices of representative objects and then construct clusters around them. The technique of clustering supposes:
1. an abstract representation of the textual object to be clustered, containing the text features or attributes for the classification;
2. a function that computes the relative importance (weight) of the features;
3. a function that computes a numerical similarity between the representations.
Each paragraph of the text of the alleged offences and opinion of the court is represented as a term vector. The terms (single words) are selected after elimination of stopwords and proper names, and are currently not stemmed. Stopwords are identified as the most frequent words in the corpus of legal cases. Proper names are recognized as capitalized words. The terms of the alleged offences are weighted with the in-paragraph frequency, which is computed as the number of times a term i occurs in the text paragraph.
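The paragraph representation described above can be sketched as follows. This is an illustrative sketch, not the SALOMON implementation: the stopword list, the punctuation stripping, and the function name are placeholder assumptions, and the crude "capitalized word = proper name" heuristic of the source text is reproduced deliberately.

```python
from collections import Counter

def term_vector(paragraph, stopwords):
    """Build an in-paragraph term-frequency vector for one text paragraph.

    Terms are single, unstemmed words; stopwords and capitalized words
    (taken as proper names, mirroring the crude heuristic of the source,
    which also drops sentence-initial capitals) are discarded.
    """
    tokens = paragraph.split()
    terms = [t.strip(".,;:()") for t in tokens]
    terms = [t for t in terms
             if t and t.lower() not in stopwords and not t[0].isupper()]
    return Counter(t.lower() for t in terms)

# Hypothetical mini stopword list; in the system it is derived from the
# most frequent words of the legal case corpus.
stops = {"the", "of", "and", "having", "or"}
vec = term_vector("having imported cannabis, namely cannabis from Maastricht", stops)
# vec["cannabis"] == 2  (the in-paragraph frequency of the term)
```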
Considering the stereotypical way used in describing the crimes committed, less important content words also contribute to identifying redundancy.
Discriminating the terms of the opinion of the court is done with inverse document frequency weights, which are computed before the actual abstracting. Their computation is based upon about 3000 cases and results in a list of term weights. Numbers are not included in the term vectors of the opinion paragraphs. The similarity between two text paragraphs is calculated as the cosine coefficient of their term vector representations V1 and V2 (cf. Jones & Furnas, 1987):

sim(V_1, V_2) = \frac{\sum_{i=1}^{n} v_{1i} v_{2i}}{\sqrt{\sum_{i=1}^{n} v_{1i}^2}\,\sqrt{\sum_{i=1}^{n} v_{2i}^2}}   (1)

where n = the number of distinct terms in the paragraphs to be clustered, and v_{1i}, v_{2i} = the weights of term i in V1 and V2.
In preliminary experiments, the cosine function performed better than the inner product as a similarity coefficient because of its length normalization.
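Equation (1) can be sketched directly over the dictionary-style term vectors; this is a generic cosine implementation, not the system's own code:

```python
import math

def cosine(v1, v2):
    """Cosine coefficient of two term vectors (dicts mapping term -> weight).

    The denominator normalizes for vector length, which is why the cosine
    outperformed the plain inner product on paragraphs of varying length.
    """
    common = set(v1) & set(v2)
    dot = sum(v1[t] * v2[t] for t in common)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(cosine({"cannabis": 2, "import": 1}, {"cannabis": 1, "possession": 1}))
# ≈ 0.632
```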
Clustering methods based on the selection of representative objects consider possible choices of representative objects (also called centrotypes or medoids) and then construct clusters around them. We adapted and further developed clustering algorithms described by Kaufman and Rousseeuw (1990, p. 68 ff.) for use in text-based systems. In the algorithms employed, each object can only belong to one cluster. As in other non-hierarchical methods, these algorithms split a data set of n objects into k clusters.
We implemented the covering clustering algorithm to cluster identical delict description paragraphs of the alleged offences that are disturbed by different facts or have a variant sentence structure, and to eliminate redundant delict descriptions (Figure 1). In this algorithm, possible representative paragraphs (medoids) are considered for a potential grouping, but each paragraph must have at least a given similarity (threshold) with the representative paragraph of its cluster. The objective is to minimize the number of representative paragraphs. The threshold value defines the degree of redundancy allowed and was set after several trials. We added an extra constraint: for a given number of medoids, a best solution is found for which the total (or average) similarity between each non-selected object (paragraph) and its medoid is maximized. We implemented a best solution to this problem with the following algorithm, which considers n! / (k! (n - k)!) possible solutions for each value of k. The number of k values to be tested depends upon how fast an acceptable solution is found.
Covering Algorithm
  define threshold
  init k = 1
  WHILE (k <= n) AND not found acceptable combination
    FOR each possible combination of k medoids (= selected objects)
      FOR each non-selected object
        determine its medoid
      IF combination of medoids is acceptable (= each non-selected object has a
          similarity above the threshold with the medoid of its cluster)
        THEN calculate total similarity of each non-selected object and its medoid
    IF an acceptable combination is found
      THEN select acceptable combination of k medoids for which the total similarity
          of each non-selected object and its medoid is maximized and the algorithm stops
      ELSE increase k with 1
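The covering step can be sketched in Python. The similarity function, threshold, and object representation are placeholders, and the exhaustive search mirrors the n! / (k! (n - k)!) enumeration of the source, so it is only feasible for small paragraph sets:

```python
from itertools import combinations

def covering(objects, sim, threshold):
    """Find a smallest set of medoids such that every non-selected object
    has similarity >= threshold with the medoid of its cluster; among
    acceptable combinations of the same size, maximize the total
    similarity of non-selected objects to their medoids."""
    n = len(objects)
    for k in range(1, n + 1):
        best, best_total = None, -1.0
        for medoids in combinations(range(n), k):
            total, ok = 0.0, True
            for j in range(n):
                if j in medoids:
                    continue
                # Each object's medoid is the most similar selected object.
                s = max(sim(objects[j], objects[m]) for m in medoids)
                if s < threshold:
                    ok = False
                    break
                total += s
            if ok and total > best_total:
                best, best_total = medoids, total
        if best is not None:
            return best
    return tuple(range(n))  # k = n is always acceptable
```

A toy run with a hypothetical numeric similarity: `covering([0, 1, 10, 11], lambda a, b: 1.0 / (1.0 + abs(a - b)), 0.5)` selects one medoid per pair of near-duplicates.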
We implemented the k-medoid method for clustering the paragraphs of the opinion of the court according to theme (Figure 2). The k-medoid method searches for the best possible clustering of a set of objects into k groups.
The optimal solution of this problem is the generation of all possible combinations of k representative paragraphs (medoids) and the choice of the best solution, for which the total (or average) similarity of each non-selected object (paragraph) and its medoid is maximized. We implemented a best solution to this problem with the following algorithm.
k-Medoid Method: Best Solution
  define k
  FOR each possible combination of k medoids (= selected objects)
    FOR each non-selected object
      determine its medoid
    calculate total similarity of each non-selected object and its medoid
  select combination of k medoids for which the total similarity of each
      non-selected object and its medoid is maximized
An optimal solution, which for the chosen k value considers n! / (k! (n - k)!) possible combinations, is only executable for relatively small problems.
We implemented an optimal solution for up to 15 paragraphs to be clustered.
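The exhaustive best solution is compact enough to sketch directly; the index-based similarity function is a placeholder assumption:

```python
from itertools import combinations

def k_medoid_best(n, k, sim):
    """Exhaustive k-medoid clustering: try all C(n, k) combinations of
    medoids and keep the one maximizing the total similarity of each
    non-selected object to its nearest medoid. Only tractable for small
    n (the source caps it at 15 paragraphs)."""
    def score(medoids):
        return sum(max(sim(j, m) for m in medoids)
                   for j in range(n) if j not in medoids)
    return max(combinations(range(n), k), key=score)
```

For example, with four paragraphs forming two tight pairs, `k_medoid_best(4, 2, sim)` returns one medoid drawn from each pair.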
Because the texts of the opinion of the court may contain more than 50 paragraphs, we implemented a good, but not optimal, solution for the k-medoid method. The algorithm can be considered a reallocation algorithm: an initial clustering is improved in consequent steps until a specific criterion is met. The algorithm consists of two phases. First, an initial clustering is performed by successive selection of representative paragraphs (medoids) until k medoids are found (function BUILD). Second, to improve the clustering yielded by BUILD, the set of all pairs of objects (i,h), for which object i has been selected as representative paragraph (medoid) and object h has not, is considered in the search for a better clustering (function SWAP).
As an initial step, the function BUILD selects the most centrally located object of the data set, i.e., the object for which the sum of similarities to all other objects is maximized. This object is the first medoid. In each subsequent step, a new medoid is chosen until k medoids are found. The medoid chosen is the object for which a maximum gain in total (or average) similarities between each non-selected object and its medoid is obtained.
For a given initial clustering, the function SWAP considers each pair of objects (i,h) for which i has been selected as representative object and h has not.
For each pair, the contribution to the clustering is computed when representative object i is replaced by object h. This contribution is positive (an increase in total or average similarity values between each non-selected object and its medoid), negative (a decrease in those values), or zero. The swap pair with the highest contribution is selected. If this contribution is positive, the swapping operation is executed and the whole procedure of calculating the contribution of all possible swapping operations is repeated; otherwise the algorithm stops (no better grouping can be found).
This reallocation algorithm is computationally much less expensive than a best solution to the problem. For long texts, usually a few swapping operations are sufficient to obtain a good clustering.
k-Medoid Method: Good Solution
  define k
  1) BUILD
    select most centrally located object of the data set (sum of similarities
        with all other objects is maximized)
    build a cluster around this medoid
    REPEAT
      FOR each candidate medoid i
        FOR each non-selected object j
          calculate similarity between i and j = s(i,j)
          compare s(i,j) with Sj (similarity between j and the medoid of the
              cluster j currently belongs to)
          IF Sj < s(i,j)
            THEN compute the gain in similarity when moving j
        compute the total gain in similarities by choosing object i (Li)
      choose the best i for which Li is maximized
      build clusters around the medoids
    UNTIL k medoids are found
  2) SWAP
    FOR each pair (i,h) (i is selected, h is not)
      calculate CONTRIBUTION TO THE CLUSTERING
    select pair for which the contribution to the clustering is maximized
    IF contribution is positive
      THEN execute swapping operation with selected pair
          repeat SWAP
      ELSE stop
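The BUILD and SWAP phases can be sketched as follows. This is a simplified PAM-style sketch under assumed index-based similarities: unlike the source, which computes only the similarities affected by a swap, it recomputes the full total for each candidate swap, trading efficiency for brevity:

```python
def build_swap(n, k, sim):
    """Greedy BUILD seeding followed by SWAP reallocation, after the
    k-medoid algorithm of Kaufman & Rousseeuw adapted in the text."""
    def total(medoids):
        return sum(max(sim(j, m) for m in medoids)
                   for j in range(n) if j not in medoids)

    # BUILD: start from the most centrally located object, then greedily
    # add the medoid giving the largest gain in total similarity.
    medoids = [max(range(n), key=lambda i: sum(sim(i, j) for j in range(n)))]
    while len(medoids) < k:
        best = max((i for i in range(n) if i not in medoids),
                   key=lambda i: total(medoids + [i]))
        medoids.append(best)

    # SWAP: try every (medoid i, non-medoid h) exchange; execute the best
    # one as long as it improves the total similarity, then repeat.
    improved = True
    while improved:
        improved = False
        current = total(medoids)
        best_pair, best_gain = None, 0.0
        for i in medoids:
            for h in range(n):
                if h in medoids:
                    continue
                cand = [h if m == i else m for m in medoids]
                gain = total(cand) - current
                if gain > best_gain:
                    best_pair, best_gain = (i, h), gain
        if best_pair is not None:
            i, h = best_pair
            medoids = [h if m == i else m for m in medoids]
            improved = True
    return medoids
```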
The contribution of swapping the pair (i,h) is computed as the total of changes in similarities, when h becomes medoid instead of i. Instead of recalculating the total similarities in this new cluster structure, only those similarities that are affected by the change in cluster structure are computed.
CONTRIBUTION TO THE CLUSTERING
  The changes regarding i and h:
    i becomes a member instead of a medoid: its new medoid is searched and the
        similarity between i and this medoid is added to the contribution
    h becomes a medoid instead of a member: its old medoid is searched and the
        similarity between h and this medoid is subtracted from the contribution
  The changes regarding all other non-selected objects j are added to the contribution:
    IF j is more similar to one of the other medoids than to i or to h
      THEN j does not change position in the cluster structure and the contribution = zero
    ELSE
      IF i was the medoid of the cluster j belongs to
        THEN IF j is closer to h than to its second choice medoid (x)
          THEN j changes from cluster with medoid i to cluster with medoid h: the
              contribution is positive, negative, or zero, depending on the difference
              in similarities between j and h and j and i (sim(j,h) - sim(j,i))
          ELSE j changes from cluster with medoid i to cluster with medoid x
              (x = second choice medoid of j): the contribution is negative or zero,
              depending on the difference in similarity between j and x and j and i
              (sim(j,x) - sim(j,i))
        ELSE IF the similarity between j and h is higher than the similarity between
            j and its current medoid y
          THEN j changes from cluster with medoid y to cluster with medoid h: the
              contribution is always positive and represents the difference in
              similarity between j and h and j and y (sim(j,h) - sim(j,y))
As all combinations of medoids (in the case of an optimal solution) or all potential swapping operations (in the case of a good solution) are considered, the results of the algorithms do not depend on the order of the objects in the input file (except when similarities between objects are tied).
The number of medoids (k) is predefined or is determined as part of the clustering method. In the latter case, employed in SALOMON, possible k values are considered in the search for the best k value. For each object i of the cluster structure, the degree of fitness f(i) of object i to its cluster is computed as the normalized difference between the average similarity of object i to all other objects of its cluster and the similarity of i with its second choice cluster:
f(i) = (a(i) - b(i)) / max(a(i), b(i))   (2)

where:
a(i) = the average similarity of i to all other objects of its cluster;
b(i) = the maximum of the similarities of i with each other cluster to which i does not belong, computed as the average similarity of i with the objects of this cluster, i.e., the similarity of i to its second choice cluster.
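Formula (2) can be sketched over a concrete clustering; the function name and the representation of clusters as lists of objects are illustrative assumptions, and clusters are assumed to contain at least two objects:

```python
def fitness(i, clusters, sim):
    """Degree of fitness f(i) = (a(i) - b(i)) / max(a(i), b(i)), where
    a(i) is the average similarity of i to the other objects of its own
    cluster and b(i) the average similarity of i to its second choice
    (most similar other) cluster."""
    own = next(c for c in clusters if i in c)
    others = [o for o in own if o != i]
    a = sum(sim(i, o) for o in others) / len(others)
    b = max(sum(sim(i, o) for o in c) / len(c)
            for c in clusters if i not in c)
    return (a - b) / max(a, b)
```

Averaging f(i) over all objects gives the clustering-quality score used below to select the best k value.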
PARAGRAPHS OF THE ALLEGED OFFENCES =
import: namely cannabis from The Netherlands (Maastricht) (O.S. 91/2068);
In breach of article 1,2 b (1 and 5) of the Act of 24 February 1921, and of article 1, 3, 11 and 28 of the Royal Decree of 31 December 1930 on Drugs and Narcotics, having imported, possessed, sold or offered for sale narcotics or other psychotropic drugs that may induce dependence and that are enlisted by Royal Decree, for valuable consideration or for free, without preceding license of the Ministry of Public Health, namely ...
possession: several times cannabis, as it turns out from the analysis of exhibits O.S. 90/1571 and 91/2068
possession: several times cannabis

REPRESENTATIVE PARAGRAPHS =
In breach of article 1,2 b (1 and 5) of the Act of 24 February 1921, and of article 1, 3, 11 and 28 of the Royal Decree of 31 December 1930 on Drugs and Narcotics, having imported, possessed, sold or offered for sale narcotics or other psychotropic drugs that may induce dependence and that are enlisted by Royal Decree, for valuable consideration or for free, without preceding license of the Ministry of Public Health, namely ...
possession: several times cannabis
Figure 1. Brief example of the elimination of redundant paragraphs in the alleged offences (translated from Dutch).1
For each possible k value (except for k = 1 or k = n), we compute a best or good clustering, compute the degree of fitness of each object to its cluster, and average these fitness values. The best k value is the one for which the average fitness value is maximized. To test whether k = 1 (in case the best k = 2) or k = n (in case the best k = n - 1) represents a better clustering, we respectively test whether the average similarity between each non-selected object and its medoid increases or whether the average similarity between objects of different clusters decreases. For the former test, we first compute the medoid when k = 1.
The medoid of each cluster, or most centrally located object of the cluster, forms a representative description of each crime or topic treated in the alleged offences or opinion of the court (Figure 2). We assume that a text sentence or paragraph that is closely linked by patterns of content words to a number of other text sentences or paragraphs is informative, and thus is relevant to include in the summary (cf. Prikhod’ko & Skorokhod’ko, 1982).
In addition to the paragraphs, we also extract key terms from clusters of opinion of the court paragraphs that contain more than three objects (Figures 3 and 4). Different methods are possible for key term selection (Jardine &
van Rijsbergen, 1971; Willett, 1980). Currently, we select the two terms with highest weight from the terms of the average vector of the cluster.
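The key term selection just described can be sketched as follows; the example terms are hypothetical, not drawn from actual case clusters:

```python
from collections import Counter

def key_terms(cluster_vectors, n_terms=2):
    """Select the n_terms highest-weighted terms of the average vector of
    a cluster, where each paragraph is a dict mapping term -> weight."""
    avg = Counter()
    for v in cluster_vectors:
        for term, w in v.items():
            avg[term] += w / len(cluster_vectors)
    return [t for t, _ in avg.most_common(n_terms)]

print(key_terms([{"penalty": 3, "fine": 1}, {"penalty": 2, "suspension": 2}]))
# -> ['penalty', 'suspension']
```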
Presently, we limit ourselves to the extraction of information from the case text. No attempt is made to re-edit this information. Given the danger of misinterpreting or misrepresenting the case text, even abstracts of legal cases that are intellectually composed are no more than the extraction of relevant text parts (Uyttendaele et al., 1996, 1998).