Overview

In this post we take a look at searching in multidimensional data sets. Specifically, we consider k-d trees, a data structure that facilitates such searches, and the k-nearest neighbors algorithm.

In this post, we won't implement a k-d tree. Rather, we keep things simple and just illustrate the concept. A C++ implementation can be found here. Furthermore, a discussion of that implementation can be found in K-d trees Part 2.

K-d trees Part 1

The k-nearest neighbor algorithm is one of the simplest algorithms for performing both classification and regression modeling on data. Indeed, given a data set $D$ and a set of labels, the k-nearest neighbor algorithm does not explicitly train any particular mathematical model on $D$. Instead, it stores $D$ and, every time a query comes in about the class of a point $P$, the algorithm goes over the points in $D$, computes the $k$ nearest points to $P$, and uses a kind of majority vote in order to determine the class that $P$ should belong to. This is incredibly simple, but it means that we need to compare $P$ with every point in $D$. Assuming that the metric we use takes $O(1)$ time, the total time complexity is $O(|D|)$. However, this is just an ideal scenario. For a tuple with $N$ elements, computing a distance requires $N$ operations. Hence, for one query point the complexity is $O(N|D|)$, and for $|D|$ query points we get quadratic behavior, $O(N|D|^2)$ [4].
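To make the brute-force baseline concrete, here is a minimal C++ sketch of k-nearest search over a small 2D data set. The names `Point`, `squared_distance` and `k_nearest_brute_force` are ours, purely for illustration, and are not tied to any particular library. Every query point is compared against all of $D$, which is exactly the $O(N|D|)$ per-query cost discussed above.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <iostream>
#include <vector>

// Dimensionality of the points; 2 is chosen only for this example.
constexpr std::size_t N = 2;
using Point = std::array<double, N>;

// Squared Euclidean distance: O(N) operations per pair of points.
double squared_distance(const Point& a, const Point& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Brute-force k-nearest search: the query is compared against every point
// in D, hence O(N|D|) time for a single query point.
std::vector<std::size_t> k_nearest_brute_force(const std::vector<Point>& D,
                                               const Point& query,
                                               std::size_t k) {
    std::vector<std::size_t> indices(D.size());
    for (std::size_t i = 0; i < D.size(); ++i) indices[i] = i;

    // A full sort is more work than strictly needed, but it keeps the
    // illustration short; the k closest points end up at the front.
    std::sort(indices.begin(), indices.end(),
              [&](std::size_t i, std::size_t j) {
                  return squared_distance(D[i], query) <
                         squared_distance(D[j], query);
              });
    indices.resize(std::min<std::size_t>(k, indices.size()));
    return indices;
}

int main() {
    const std::vector<Point> D = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 1.0}, {2.0, 2.5}};
    const Point P = {2.0, 2.0};
    for (std::size_t idx : k_nearest_brute_force(D, P, 2)) {
        std::cout << "(" << D[idx][0] << ", " << D[idx][1] << ")\n";
    }
    return 0;
}
```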

So how can we improve this time complexity? The problem we face is that of searching in multidimensional data for the nearest neighbor(s) of a generic point $P$ that is possibly not in the data set itself. K-d trees offer a way to solve this problem [3].

Before getting into the details, recall that searching a balanced binary search tree with $|D|$ nodes has complexity $O(\log |D|)$, and this is the motivation behind k-d trees. The only problem seems to be how to do the splitting.

K-d tree details

A k-d tree, or k-dimensional tree, is a space partitioning data structure for organizing points in a k-dimensional space [1]. Specifically, a k-d tree is a binary search tree where every node in the tree represents a k-dimensional point [3]. We will assume that the coordinates of a k-dimensional vector can be compared with each other. A non-leaf node in a k-d tree implicitly generates a splitting hyperplane that divides the space into two parts, known as half-spaces [1]. All points to the left of the splitting hyperplane are represented by the left subtree of that node, and points to the right of the hyperplane are represented by the right subtree [1]. The hyperplane direction is chosen by associating every node in the tree with one of the $k$ dimensions. K-d trees were proposed by Jon Louis Bentley in [5].
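To fix ideas, a node of a k-d tree can be sketched in C++ as follows. This is only an illustrative layout, not taken from the implementation mentioned earlier: each node stores a k-dimensional point, the axis its splitting hyperplane is associated with, and the two subtrees representing the two half-spaces.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// Illustrative k-d tree node for K-dimensional points. A non-leaf node
// implicitly defines the splitting hyperplane point[axis] = constant.
template <std::size_t K>
struct KdNode {
    std::array<double, K> point;       // the k-dimensional point stored here
    std::size_t axis = 0;              // which of the K coordinates we split on
    std::unique_ptr<KdNode<K>> left;   // subtree with points on one side of the hyperplane
    std::unique_ptr<KdNode<K>> right;  // subtree with points on the other side
};
```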

The above may sound a bit technical and difficult to grasp, so let's go over a simple example. In order to be able to visualize what is happening, we will confine ourselves to 2D space, i.e. $k=2$.

K-d tree simple example

The example we will use, as well as all the images below, is taken from [2]. Let's consider the points shown in the figure below.

Figure 1. Data points in a Cartesian 2D space. Image from [3].

The specific coordinates of the points are immaterial here. We want to construct a k-d tree out of these points. There are many different ways to construct k-d trees [1]. This follows from the fact that there are many possible ways to choose the axis-aligned splitting planes.

Nevertheless, the canonical method of k-d tree construction is as follows. As we move down the tree, we cycle through the axes used to select the splitting planes. For our example, the root node could have a $y$-aligned plane, the root's children would both have $x$-aligned planes, the root's grandchildren would again have $y$-aligned planes, and so on. This process is illustrated in the figure above, where we use point $R$ and its $x$-coordinate to determine the first split, and in the two images below.

Figure 2. Splitting a Cartesian 2D space along $x$ and $y$ coordinates. Image from [3].

In figure 1 we choose to split along the $x$-coordinate of point $R$. In figure 2, we continue by choosing point $W$ and its $y$-coordinate, while in figure 3, we choose point $P$ and its $x$-coordinate.

Figure 3. Splitting a Cartesian 2D space along $x$ and $y$ coordinates. Image from [3].

The second point of concern is how to choose the points to insert. Again, in the canonical method, points are inserted by selecting the median of the points being put into the subtree, with respect to their coordinates in the axis being used to create the splitting plane. Of course, the latter assumes that we know all points in advance. This method leads to a balanced k-d tree, in which each leaf node is approximately the same distance from the root [1]. Note, however, that it is not required to select the median point; in that case there is no guarantee that the tree will be balanced [1].

In order to avoid coding a complex $O(n)$ median-finding algorithm, or using an $O(n \log n)$ sort such as heap-sort or merge-sort to sort all $n$ points, a popular practice is to sort a fixed number of randomly selected points and use the median of those points to serve as the splitting plane. In practice, this technique often results in balanced trees [1].
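Putting the two ingredients together, cycling through the axes and splitting at the median, a recursive construction can be sketched as follows. This is only a sketch under the assumption that all points are known in advance; it reuses the illustrative node layout from the earlier sketch, uses `std::nth_element` to place the median along the current axis without fully sorting, and chooses the splitting axis as the depth modulo $k$.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

// Same illustrative node layout as in the earlier sketch.
template <std::size_t K>
struct KdNode {
    std::array<double, K> point;
    std::size_t axis = 0;
    std::unique_ptr<KdNode<K>> left;
    std::unique_ptr<KdNode<K>> right;
};

// Recursively build a (roughly) balanced k-d tree from points[first, last).
// The splitting axis cycles with the depth and the splitting point is the
// median of the range along that axis.
template <std::size_t K>
std::unique_ptr<KdNode<K>> build_kd_tree(std::vector<std::array<double, K>>& points,
                                         std::size_t first, std::size_t last,
                                         std::size_t depth = 0) {
    if (first >= last) return nullptr;

    const std::size_t axis = depth % K;                  // cycle through the axes
    const std::size_t mid = first + (last - first) / 2;  // index of the median

    // Partially reorder the range so the median along 'axis' sits at 'mid'.
    std::nth_element(points.begin() + first, points.begin() + mid,
                     points.begin() + last,
                     [axis](const std::array<double, K>& a,
                            const std::array<double, K>& b) {
                         return a[axis] < b[axis];
                     });

    auto node = std::make_unique<KdNode<K>>();
    node->point = points[mid];
    node->axis = axis;
    node->left = build_kd_tree<K>(points, first, mid, depth + 1);
    node->right = build_kd_tree<K>(points, mid + 1, last, depth + 1);
    return node;
}

int main() {
    // The points below are made up; any set of 2D points would do.
    std::vector<std::array<double, 2>> points = {
        {1.0, 2.0}, {3.0, 4.0}, {5.0, 1.0}, {2.0, 2.5}, {4.0, 3.0}};
    auto root = build_kd_tree<2>(points, 0, points.size());
    return root ? 0 : 1;
}
```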

We now know conceptually how k-d trees work; in a following post, we will implement a simple k-d tree using C++. Let's now turn our attention to k-nearest search.

K-nearest search is the dominant operation behind the k-nearest neighbor algorithm. K-d trees can be used in order to avoid a brute-force search over the entire data set $D$, which would be similar to searching for an element in an unsorted array. K-d trees provide structural information that we can utilize in order to improve the search.

However, $k$-nearest search is a bit involved, so we won't discuss it in this post; you can look it up in [3].

Summary

In this post we have given a brief overview of k-d trees. These are binary search trees that facilitate searching in multidimensional data sets. In general, k-d trees work well for low-to-medium-dimensional spaces but behave poorly in high-dimensional spaces. K-d trees can also be used to speed up the k-means clustering algorithm. An alternative to k-d trees is provided by ball trees, but we will look into that in another post.

References

  1. k-d tree.
  2. k-nearest neighbors algorithm.
  3. Marcello La Rocca, Advanced algorithms and data structures, Manning Publications.
  4. Giuseppe Bonaccorso, Mastering machine learning algorithms, Packt Publications.
  5. Jon Louis Bentley, Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, Vol. 18, Issue 9, pp. 509-517.