Lecture 16 : The Tree Representation of Sets


1 Sets as Trees
      1.1   Perfectly Balanced Trees: left & right branches are same size
      1.2   Well balanced trees
      1.3   AVL trees are adequately balanced
2 Implementing sets as trees.
      2.1   Tree-nodes have four entries
      2.2   Implementing the empty set representing sets as trees
      2.3   We need the height of trees, empty or non-empty.
      2.4   (mk_tree entry left right) makes a tree.
      2.5   Making a balanced tree with make_tree
            ... Rotation is required to balance a tree
      2.6   Implementing set->list representing sets as trees
      2.7   Implementing member_set? representing sets as trees
      2.8   Implementing adjoin representing sets as trees
3 Other representations of sets.
4 Languages are Sets, Parsers are member_set? helpers.

1 Sets as Trees

If we represent a set as a balanced tree we can achieve a significant speed up in evaluating the member_set? and adjoin functions.

We speak of the tree as being composed of nodes, each of which contains an entry which is a member of the set being represented, a left branch and a right branch. A tree is a binary search tree with respect to a given total ordering relation if the entry at any node is greater than all entries occurring in the left branch, and less than all entries occurring in the right branch. This is the first data invariant for our representation of sets as trees.

A tree can also be an empty-tree, which has no entry, left branch or right branch.

The height of a tree is defined as being 0 for the empty tree, and one more than the maximum of the height of the left branch and the height of the right branch for a non-empty tree.

The great advantage of a binary search tree is that, if we are looking for a given entry x in a tree, we can compare x with the entry at the root, y say, to find which sub-tree x must lie in. If x=y then x is in the tree, and we have found it. If x<y then x must lie in the left branch if it is in the tree at all; similarly, if x>y then it must lie in the right branch. If the tree is adequately balanced, then at each stage we roughly halve the size of the set of values in which we are searching, which means that our search for x will terminate in logarithmic time.

1.1 Perfectly Balanced Trees: left & right branches are same size

A tree, such as the one on the left above, for which the left and right branches of all subtrees contain the same number of elements, is said to be perfectly balanced.

In a perfectly balanced tree, each entry is the median of the set of entries in the whole subtree headed by that entry.

Theorem

A perfectly balanced tree with height h contains 2^h - 1 entries.

Proof by induction:

Base case: h=0. A perfectly balanced tree of height 0, that is the empty tree, contains 0 entries. 2^0 - 1 = 1 - 1 = 0. So the result holds.

Inductive step:

Now suppose that for some h, all perfectly balanced trees of height h have 2^h - 1 entries. Consider a perfectly balanced tree of height h+1. It has two sub-trees of height h, each, by the inductive hypothesis, containing 2^h - 1 entries. So the total number of entries for our tree of height h+1 is 2*(2^h - 1) + 1 = 2^(h+1) - 2 + 1 = 2^(h+1) - 1.

Suppose we have a perfectly balanced tree containing n entries. Then n = 2^h - 1, that is, taking logarithms to the base 2, h = log2(n+1). Thus we can get to any entry in worst-case time O(log n), provided we know which branch to take at every node.

1.2 Well balanced trees

However, we may not have a set of exactly 2^h - 1 elements for some h. Such a set cannot be represented by a perfectly balanced tree, but we can limit the amount of unbalance to maintain logarithmic time access.

Any set can be represented by a tree in which the disparity between the number of entries in the left branch and the number in the right branch is never more than 1. The algorithm to do this is obvious enough:
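One way to realise such an algorithm, sketched here under assumptions rather than taken from the lecture, is to list the elements in ascending order, put the median at the root, and build the two branches recursively from the smaller and larger halves. In this sketch a node is a plain three-element list (entry left right), and the names take_n, drop_n, build_balanced and size_of are hypothetical.

```scheme
;; Sketch only: a node is (list entry left right); '() is the empty tree.
(define (take_n l n)                 ; first n elements of l
  (if (= n 0) '() (cons (car l) (take_n (cdr l) (- n 1)))))

(define (drop_n l n)                 ; l without its first n elements
  (if (= n 0) l (drop_n (cdr l) (- n 1))))

(define (build_balanced sorted)      ; sorted: ascending list, no duplicates
  (if (null? sorted)
      '()
      (let* ((n    (length sorted))
             (mid  (quotient n 2))   ; position of the median
             (back (drop_n sorted mid)))
        (list (car back)                              ; median entry
              (build_balanced (take_n sorted mid))    ; smaller elements
              (build_balanced (cdr back))))))         ; larger elements

(define (size_of t)                  ; number of entries in a tree
  (if (null? t) 0 (+ 1 (size_of (cadr t)) (size_of (caddr t)))))
```

At every node the left branch receives floor(n/2) of the n elements and the right branch the remainder less one, so the two sizes never differ by more than 1.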

Let us call such a tree "well-balanced". If we kept our trees well-balanced, then this would give us the shortest worst-case time to find a given element in a member_set? operation.

However, maintaining trees in a well-balanced form is not practicable if we want an efficient implementation of adjoin. Imagine a well-balanced tree

where A and B are subtrees, and size(B) = size(A) + 1. If we now adjoin an element y > x, y < b for all b in B, to the tree, B becomes one larger, but the rearrangement required to maintain the well-balanced condition can be quite expensive.

Consider, for example the tree:

Adjoin 7 - a simple algorithm gives:

Adjust to make it well balanced:

It is clear from this example that we have a significant amount of work to do. The 7 has moved from a tip of the tree to the root-node, while the 6 has moved from the root-node to a tip. This kind of re-arrangement could take place in many circumstances in which an entry which "belongs" between the left and right branches is adjoined to a tree of any depth.

1.3 AVL trees are adequately balanced

So, we need to look for a compromise in our idea of balance that will make the adjoin operation cheaper. If we are less particular about balance, we can adjust balance by a local operation as we rebuild a tree. Let us say that a tree is adequately balanced if the branches of all sub-trees differ in height by no more than one. These are more commonly known as AVL trees.
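The definition can be written down directly as a predicate. The following is a minimal sketch, assuming a node is a plain three-element list (entry left right); the names height_of and adequately_balanced? are hypothetical.

```scheme
;; Sketch only: a node is (list entry left right); '() is the empty tree.
(define (height_of t)                ; 0 for empty, else 1 + max of branches
  (if (null? t)
      0
      (+ 1 (max (height_of (cadr t)) (height_of (caddr t))))))

(define (adequately_balanced? t)     ; the AVL condition at every node
  (or (null? t)
      (and (<= (abs (- (height_of (cadr t)) (height_of (caddr t)))) 1)
           (adequately_balanced? (cadr t))
           (adequately_balanced? (caddr t)))))
```

This version recomputes heights at every level, so it is only suitable for checking small examples. An efficient implementation would store the height in each node instead.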

Consider an adequately balanced tree T, with top-level entry x, left and right sub-trees A and B, which is converted into a tree T' by adjoining an element s > x. The right sub-tree B of T will be replaced in T' by B'. Now if height(B) = height(A) + 1, then T' will be no longer adequately balanced if height(B') > height(B).

However we can restore adequate balance by a local transformation of T', which moves some material into the left branch. This requires us to analyse four distinct cases:

If the tree B' has height 2 greater than A, it must have height at least 2. So we can expand B' as a sub-tree, obtaining the following tree, which is annotated with the heights.

Since the original tree T was adequately balanced, the unbalanced nature of T' must arise from either C or D having height h-2, but not both, since we have only adjoined one element.

CASE 1

If D has height h-2 then C must have height h' = h-3 or h'= h-4, so we can move it to the left branch as follows:

This new tree is adequately balanced, and is a binary search tree, since every entry in the left sub-tree is less than y, every entry in the right sub-tree is greater than y and every entry in C is greater than x and every entry in A is less than x.

We call this operation on a tree a left rotation.

CASE 2

This is symmetric to CASE 1, where the left branch becomes too long. It is cured by a right rotation.

CASE 3

However, if C has height h-2 then D must have height h"', where h"' = h-3 or h"' = h-4. We can split C into E of height h' and F of height h", where h'= h-3 or h'= h-4 and h" = h-3 or h" = h-4.



Here the new tree is adequately balanced, and is a binary search tree, since every entry in the left sub-tree is less than z, every entry in the right sub-tree is greater than z, every entry in A is less than x, every entry in E is greater than x, every entry in F is less than y, every entry in D is greater than y.

This tree-transformation can be achieved by a right rotation followed by a left rotation.

CASE 4

This is the symmetric condition in which the left branch becomes too long.

2 Implementing sets as trees.

Let us now implement sets as AVL trees. We must first design the concrete data-structures to represent the nodes of a tree, and decide how to represent the empty set.

2.1 Tree-nodes have four entries

We will need to be able to decide quickly whether a tree is balanced, so it is convenient to have a "slot" in our representation of a node which holds the height of the tree. Thus a node is represented as a record having components

    entry
    tree_left
    tree_right
    height

We will require that these 4-member records preserve the data-invariant that the contents of the height-slot are actually the height of the tree represented by the node.

We can use the record-class function of UMASS Scheme to create suitable records for our nodes.


(define class_tree (record-class 'tree '(full full full full)))

(define cons_tree   (car class_tree))      ; The constructor for nodes
(define sel_tree    (caddr class_tree))
(define entry       (car sel_tree))        ; (entry tree) gets value at node
(define tree_left   (cadr sel_tree))       ; (tree_left tree) is left branch
(define tree_right   (caddr sel_tree))     ; (tree_right tree) is right branch
(define height_field (cadddr sel_tree))    ; should hold the height.
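The record-class function is specific to UMASS Scheme. In a Scheme that supports R7RS or SRFI 9, a comparable four-slot record could be declared with define-record-type. This is only a sketch of an equivalent, and the names ending in _r are hypothetical, chosen so as not to clash with the definitions above.

```scheme
;; Sketch only, assuming define-record-type (R7RS / SRFI 9) is available.
(define-record-type tree_r
  (cons_tree_r entry left right height)   ; constructor for nodes
  tree_r?                                 ; recogniser for nodes
  (entry  entry_r)                        ; value at the node
  (left   tree_left_r)                    ; left branch
  (right  tree_right_r)                   ; right branch
  (height height_field_r))                ; should hold the height

(define leaf_r (cons_tree_r 5 '() '() 1)) ; a one-entry tree
```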

2.2 Implementing the empty set representing sets as trees

We choose to use the empty-list to represent the empty-set.


(define empty-set '())

We will also use the function:


(define null_set?
   (lambda (s) (equal? s empty-set))
)

2.3 We need the height of trees, empty or non-empty.

However we have to define height as a function which acts on trees, which may be null trees, representing the empty set. We are relying on constraining the way in which we make nodes to ensure that the height_field actually contains the height of the tree.


(define (height tree)
    (if (null_set? tree) 0 (height_field tree)))

2.4 (mk_tree entry left right) makes a tree.

We define the function mk_tree to preserve the data-invariant for height:


(define (mk_tree entry left right)
    (cons_tree entry left right (+ 1 (max (height left) (height right)))))

2.5 Making a balanced tree with make_tree

Now we need a function to measure the degree of balance of a tree:


(define (balance T)
    ; positive when the right branch is deeper, negative when the left is
    (- (height (tree_right T)) (height (tree_left T))))

Given these capabilities, we can define a make_tree function which takes an entry and two AVL trees, one of which may just have had an element adjoined, and makes a new AVL tree by adjusting the balance as discussed above.


(define (make_tree x L R)
    (let* (
         (T (mk_tree x L R))
         (B (balance T))
         )
        (cond
            ( (> B 1)              ; right tree is too deep
             (if (> (balance R) 0)
                 (rotate_left T)   ; CASE 1
                 (rotate_left      ; CASE 3
                     (mk_tree x L (rotate_right R))
                     )
                 )

             )
            ( (< B -1)            ; left tree is too deep
             (if (< (balance L) 0)
                 (rotate_right T)     ; CASE 2
                 (rotate_right        ; CASE 4
                     (mk_tree x (rotate_left L) R))
                 )
             )
            (else T)            ; balance is adequate anyway
            );end cond
        ); end let
    )

... Rotation is required to balance a tree

We can readily define the rotation operations. Let us recall our picture of a tree which is to be rotated left:

This is to be converted into a tree:

and we can do this as follows:


(define (rotate_left T)
    (let (
         (x (entry T))
         (y (entry (tree_right T)))
         (A (tree_left T))
         (C (tree_left (tree_right T)))
         (D (tree_right (tree_right T)))
         ) (mk_tree y (mk_tree x A C) D)
        )
    )

We can use the same pictures to guide our definition of right-rotation.


(define (rotate_right T)
    (let (
         (y (entry T))
         (x (entry (tree_left T)))
         (A (tree_left (tree_left T)))
         (C (tree_right (tree_left T)))
         (D (tree_right T))
         ) (mk_tree x A (mk_tree y C D))
        )
    )
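The effect of the rotations can be checked on two three-element trees. The sketch below is self-contained and does not use the record definitions above: it assumes a node is a plain list (entry left right height), and the names h_of, node, rot_left and rot_right are hypothetical stand-ins for height, mk_tree, rotate_left and rotate_right.

```scheme
;; Sketch only: a node is (list entry left right height); '() is the empty tree.
(define (h_of t) (if (null? t) 0 (cadddr t)))

(define (node e l r)                       ; like mk_tree: caches the height
  (list e l r (+ 1 (max (h_of l) (h_of r)))))

(define (rot_left t)                       ; right child becomes the root
  (let ((r (caddr t)))
    (node (car r) (node (car t) (cadr t) (cadr r)) (caddr r))))

(define (rot_right t)                      ; left child becomes the root
  (let ((l (cadr t)))
    (node (car l) (cadr l) (node (car t) (caddr l) (caddr t)))))

;; CASE 1: the chain 1 -> 2 -> 3 leans right; one left rotation repairs it.
(define chain (node 1 '() (node 2 '() (node 3 '() '()))))
(define case1 (rot_left chain))

;; CASE 3: the zig-zag 1 -> 3 -> 2 needs a right rotation of the right
;; branch followed by a left rotation of the whole tree.
(define zigzag (node 1 '() (node 3 (node 2 '() '()) '())))
(define case3 (rot_left (node 1 '() (rot_right (caddr zigzag)))))
```

Both trees start with height 3 and end with 2 at the root, 1 on the left and 3 on the right, at height 2. A single left rotation applied to zigzag would leave it unbalanced, which is why CASE 3 needs two rotations.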

2.6 Implementing set->list representing sets as trees

Having managed to deal with writing a function for making adequately balanced trees, we can now define our functions to represent sets as trees. The set->list function will require us to walk the tree with an accumulator, so we need an auxiliary function help_stol.


(define (help_stol s acc)
    (if (null_set? s)  acc     ; empty set? use the accumulated elements
        (help_stol             ; collect elements
            (tree_left s)      ; in the left branch
            (cons              ; having already accumulated..
                (entry s)      ; the current entry and ..
                (help_stol                ; all elements in the right branch
                    (tree_right s)
                    acc)
                )
            )
        )   ; end if
    )

Now set->list requires us to call the auxiliary function with a null accumulator.


(define (set->list s)
    (help_stol s '())
    )

2.7 Implementing member_set? representing sets as trees

We can write member_set?:


(define (member_set? x s)
    (cond
        ((null_set? s) #f)         ; nothing belongs to the empty set
        ((= x (entry s)) #t)       ; we have found the entry for x
        ((< x (entry s))           ; is x less than the current entry?
         (member_set? x            ; if so, go down the left branch
             (tree_left s)))
        (else (member_set? x       ; otherwise go down the right branch
                (tree_right s)))
        )
    )

2.8 Implementing adjoin representing sets as trees

We write the adjoin function using make_tree which will maintain balance. Essentially, it rebuilds the tree down a path; to the left of this path every entry is less than x, to the right every entry is greater.


(define (adjoin x s)
    (cond
        ((null_set? s)                   ; to adjoin x to the empty set
         (make_tree x '() '())           ; we make a tree with x as the only
         )                               ; entry. [end of null case]

        ((< x (entry s))                 ; if x is less than the current entry
         (make_tree                      ; we make a balanced tree, starting
             (entry s)                   ; with one whose entry is the current
             (adjoin x (tree_left s))    ; whose left branch has x adjoined
             (tree_right s)              ; and with the same right branch.
             )
         )                               ; end < entry case

        ((> x (entry s))                 ; if x is greater than the current
         (make_tree                      ; entry, we similarly rebuild ...
             (entry s)
             (tree_left s)
             (adjoin x (tree_right s))  ; the right branch
             )
         )                             ; end > entry case
        (else s)                       ; otherwise x is equal to current entry
        )                              ; x is already in the tree - use it
    )

The above adjoin function takes O(log n) time, because we only have to call make_tree at each node down a path in the tree, and make_tree takes constant time, since it only rearranges the nodes of the tree to a depth of 3.
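This bound relies on an adequately balanced tree with n entries having height O(log n), which is used above without proof. The standard argument can be sketched as follows, writing N(h) for the smallest number of entries in an adequately balanced tree of height h, and F_k for the Fibonacci numbers with F_1 = F_2 = 1. This notation is introduced here and is not part of the lecture.

```latex
\begin{align*}
N(0) &= 0, \qquad N(1) = 1, \qquad N(h) = 1 + N(h-1) + N(h-2) \quad (h \ge 2),\\
N(h) &= F_{h+2} - 1 \;\ge\; \varphi^{h} - 1, \qquad \varphi = \tfrac{1+\sqrt{5}}{2},\\
n &\ge N(h) \;\Longrightarrow\; h \;\le\; \log_{\varphi}(n+1) \;\approx\; 1.44\,\log_2(n+1).
\end{align*}
```

So the path followed by member_set? or adjoin visits at most about 1.44 log2(n+1) nodes.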

We can use the generic function for intersection that already exists. This now takes O(n log n) time, because member_set? and adjoin each take O(log n) time.

The rest of the implementation can use the generic functions we defined in the previous lecture.


(define (list->set l)
    (if (null? l) empty-set
        (adjoin (car l) (list->set (cdr l)))
        )
    )


(define (reduce f acc base l)
   (if (null? l)
       base
       (acc (f (car l)) (reduce f acc base (cdr l)))))


(define (intersect s1 s2)
    (reduce
        (lambda (x) x)                   ;f
        (lambda (x s)                    ;acc
            (if (member_set? x s2)
                (adjoin x s)
                s)
            )
        empty-set                        ;base
        (set->list s1)                   ;list
        )
    )


(define (included_in? s1 s2)
    (reduce
        (lambda (x) (member_set? x s2))    ;f
        andf                               ;acc
        #t                                 ;base
        (set->list s1))                    ;list
    )

We need to define andf as a proper function, since and is a special form.


(define (andf b1 b2)
    (and b1 b2)
    )

We can use the generic definition of equal_set? in terms of included_in?:


(define (equal_set? s1 s2)
    (and (included_in? s1 s2) (included_in? s2 s1))
    )

So, let's test out our tree representation of sets.


(test_laws_sets  100)

We can summarise the computational complexity of the chosen functions for given representations of sets as follows:

If you are compiling this whole lecture, we can stop at this point, because what's below doesn't form an integral part of the code embedded in the lecture.


(error "Ignore this - it's just to stop compilation at this point")

3 Other representations of sets.

If we are representing an infinite set the set->list function cannot be implemented. It is possible to represent a countably infinite set as a stream, which can be thought of as an extension of the list concept, with a "lazy cdr" usually called tail.
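A stream of this kind can be sketched with delay and force. This is a minimal illustration assuming those two forms are available (they are part of R7RS); the names evens_from, stream_head, stream_tail and stream_take are hypothetical.

```scheme
;; Sketch only: a stream is a pair whose cdr is a promise for the rest.
(define (evens_from n)               ; the stream n, n+2, n+4, ...
  (cons n (delay (evens_from (+ n 2)))))

(define (stream_head s) (car s))
(define (stream_tail s) (force (cdr s)))   ; the "lazy cdr"

(define (stream_take s k)            ; first k elements as an ordinary list
  (if (= k 0)
      '()
      (cons (stream_head s) (stream_take (stream_tail s) (- k 1)))))
```

For example, (stream_take (evens_from 0) 5) produces the list (0 2 4 6 8), although the stream itself never ends.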

Generally for infinite sets the equal_set? function is hard to implement, and will often be undecidable.

We could define an infinite set by a predicate which recognises whether an object is a member of it. For example the set of even integers could be defined by:


(define (even x)
    (= (remainder x 2) 0)
    )


Given this representation, it is easy to define member_set?


(define (member_set? x s) (s x))


(:- (member_set? 2 even))

Similarly, it is easy to define adjoin (though this is less useful for infinite sets), union and intersection. However, the equal_set? function requires us to determine the equality of two functions, which is known to be undecidable.
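With sets represented as predicates, these operations might be sketched as follows. The names ending in _p are hypothetical, chosen so as not to clash with the definitions above, and triple_p is an extra example set introduced only for illustration.

```scheme
;; Sketch only: a set is a one-argument predicate.
(define (member_p? x s) (s x))

(define (adjoin_p x s)               ; the set s with x added
  (lambda (y) (or (equal? y x) (s y))))

(define (union_p s1 s2)              ; members of either set
  (lambda (y) (or (s1 y) (s2 y))))

(define (intersect_p s1 s2)          ; members of both sets
  (lambda (y) (and (s1 y) (s2 y))))

(define (even_p x) (= (remainder x 2) 0))     ; the even integers
(define (triple_p x) (= (remainder x 3) 0))   ; the multiples of 3
```

Each operation returns a new predicate in constant time. The work is deferred until a membership test is actually made.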

Russell's Paradox, due to Bertrand Russell, shows that allowing a set to be defined just by a predicate is problematic. The main difficulty is that it allows one to have sets that are members of themselves. For example, one might speak of the set of all abstract concepts, which is surely an abstract concept and so is a member of itself. Now let us call a set normal if it is not a member of itself. Is the set of all normal sets a normal set? If it is, then it is not a member of itself, but, being a normal set, it must be a member of the set of all normal sets, that is, of itself: a contradiction. If it is not, then it is a member of itself, and so, like every member of the set of all normal sets, it is normal: again a contradiction.

We can try out this paradox in Scheme! We can define a normal set to be one which is not a member of itself.


(define (normal x)
    (not (member_set? x x)))

Now consider whether the set of all normal sets is normal. If you paste the line below into a file test.scm

    (normal normal)

and execute it you will get

    Error: rle: RECURSION LIMIT (pop_callstack_lim) EXCEEDED

Incidentally, this raises the question of the soundness of the lambda-calculus itself, since the lambda calculus allows us to write dangerous looking formulae like (x x). Is the calculus a formalism that can be given any consistent interpretation? - Scott and Strachey showed that it can be, but the construction is not easy.

4 Languages are Sets, Parsers are member_set? helpers.

A language is a set, usually infinite, of sequences of tokens drawn from an alphabet. A parser for a language is in effect a helper function for the member_set? function: it decides whether a given sequence belongs to the language. It is easy to see how we can implement the union and intersection of languages represented by their parsers.
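As an illustration, here is a minimal sketch in which a language is represented by a recogniser, a function from a list of tokens to a boolean. The example language of balanced 'open and 'close tokens, the finite language short?, and all the names used are assumptions made for this sketch.

```scheme
;; Sketch only: a language is a recogniser taking a list of tokens.
(define (balanced? tokens)           ; balanced sequences of 'open and 'close
  (let loop ((ts tokens) (depth 0))
    (cond ((null? ts) (= depth 0))
          ((eq? (car ts) 'open) (loop (cdr ts) (+ depth 1)))
          ((eq? (car ts) 'close)
           (and (> depth 0) (loop (cdr ts) (- depth 1))))
          (else #f))))               ; any other token is rejected

(define (short? tokens)              ; sequences of at most two tokens
  (<= (length tokens) 2))

(define (union_lang p1 p2)           ; accepted by either recogniser
  (lambda (ts) (or (p1 ts) (p2 ts))))

(define (intersect_lang p1 p2)       ; accepted by both recognisers
  (lambda (ts) (and (p1 ts) (p2 ts))))
```

Union and intersection simply run both recognisers on the same input and combine the answers.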

However the problem of performing the equal_set? computation for languages is much harder. Indeed, for general languages, it is undecidable.