This is part of Early Gnutella

Search Result Grouping

Christopher Rohrs

August 24, 2001

This document describes how LimeWire displays search results. Particular emphasis is placed on LimeWire's "smart grouping" algorithm, which places approximately similar search results in little folders. Because the algorithm is somewhat slow, LimeWire uses threads to decouple the backend, grouping, and GUI. Furthermore, LimeWire use a somewhat complicated TreeTableModel to render these folders in the GUI.

The Basic Algorithm

LimeWire defines two results to be similar if they have the same file extension, their file names differ by just a few characters, and their sizes differ by no more than 60k. More formally, file names are considered similar if they have an edit distance of no more than 4 characters or 5% of the file name, whichever is smaller. Calculating the edit distance of two strings can be done with a classic dynamic programming algorithm in O(M^2) time, where M is the length of a file name.

The grouping problem can be formalized as follows: given a list of N search results, find any results that match the N+1'th search results. An obvious way of implementing this is to compare the last result with all N previous results. Unfortunately this algorithm requires O(N^2) comparisons for N results, and each comparison takes O(M^2) time with filenames of length M. Hence the overall running time is O(N^2*M^2). In practice, this algorithm is unusable.

However, with a little cleverness, we can calculate whether two strings have an edit distance less than K in just O(K*M) time. See Chapter 11 of Algorithms on Strings, Trees, and Sequences by Dan Gusfield for a complete discussion. This reduces the overall running time to O(N^2*K*M), where K=4. Furthermore, we can reuse memory for the dynamic programming table, reducing allocation and garbage collection time.

Another optimization is to keep the results sorted by file size. Then we only need to compare a result against the results that are 5% bigger and smaller. While this algorithm still requires O(N^2) comparisons in the worst case, it only requires O(N) comparisons in practice, for a moderately small hidden constant. Assuming the sorted list is a binary tree, inserting the N'th result only takes O(log N) time. Hence the overall running time is O(N*log N*K*M), orders of magnitude faster than the original algorithm. Of course the user doesn't usually want to view the sort by file size. For example, the user may wish to sort by IP address. Hence the grouper's result list is different from the GUI's result list.

Each search result is ultimately stored in the GUI as a single TableLine instance. The result grouping algorithm is implemented in the TableLineGrouper class. It delegates to TableLine.similar(TableLine) to compare search results, which delegates to ApproximateMatcher to compare strings. Abstractly, TableLineGrouper is a list of TableLine's, with methods to match a TableLine against other elements in the list. Note that the TableLineGrouper does not maintain the groups per se; it simply stores the TableLine's of the top-most groups.

The Grouping Pipeline

The algorithm described above is still too slow to call from the backend and too slow to call from GUI event dispatch thread. So we throw another thread at the problem, the ResultGrouperThread. Here's how it works. The backend delivers search results to the GUI via ActivityCallback.handleSearchResult. This method adds the QueryReply to a queue in SearchView and returns immediately. The ResultGrouperThread then consumes each QueryReply from the queue. For each QueryReply Q, it does the following:

  • Find the ResultPanel R corresponding to Q. If no match, the search has been cancelled, so discard Q.
  • Convert Q into a TableLine T.
  • Get R's TableLineGrouper and try to find another TableLine T' similar to T. If no such match, add T to the grouper and set T' to null.
  • Schedule T to be rendered in the GUI as a child of T'. T will be rendered via SearchView.addQueryReply(T, T', R).

In other words, we calculate where a TableLine should be drawn some time before we actually draw it. One downside of this scheme is that there is no way to actually disable the result grouping algorithm! Grouping is done whether or not the user has selected the "Grouped" checkbox. More on this in the last section.

SearchView currently extends StandardSearchView. This is bad design - it should probably delegate instead - and exists only for historical reasons. It is not illustrated below. SearchView deals with the result grouping pipeline, outside the GUI, and StandardSearchView deals with the actual GUI.

ResultPanel Overview

The class diagram for the search view panel is shown below. The search view table is implemented with a single StandardSearchView instance, which holds a reference to multiple ResultPanel's, each representing a single search result tab. Each ResultPanel holds a reference to a JTreeTable and the corresponding ResultPanelModel backing the table. Each ResultPanelModel has a reference to a single root TableLine, which is never displayed. The children of this TableLine, themselves also TableLine's, are shown as results.

Image:ResultGroupingImage002.png

There are three cases when adding a TableLine T to the ResultPanelModel as a child of T', i.e., when calling addLine(T, T'):

  1. Grouping disabled: T'==null. T is added as a child of the root
  2. Parent a leaf: T' has no children. T' is mutated into a leaf node. Then a clone of T' is added as a child of T', along with T.
  3. Parent a non-leaf: T' already has children. T is added to the children of T'.

These cases are illustrated by the object diagrams below. Each box in this diagram represents a single instance of TableLine. Remember that the root TableLine is never displayed.

Image:ResultGroupingImage004.png

Now, here comes a nasty complication. You cannot just add a TableLine to the model and expect it to be rendered automatically; you need to notify any TreeTableModelListeners of ResultPanelModel via its fireTreeNodesInserted and fireTreeNodesChanged methods. This method in turn requires you to pass the index of the child that was modified or inserted. For example, in case (3) of the above diagram, you must notify any TreeTableModelListeners that you inserted a node as the 1st child of the 0th child of the root.

So how do you get the index value of a node? The addTableLine method of ResultPanelModel does not have this value. You could search through the children of the node's parent, but that would be expensive. Alternatively, you could simply tell the TreeTableListener that everything has changed, but that too is expensive. So instead we maintain a "childName" field in each TableLine field according to the following invariant: for all TableLine's T in the model, T is the T.childName'th child of its parent. In the right side of example (3) above, A.childName=0, B.childName=0, and T.childName=1. With this invariant, it is easy to pass the right arguments to fireTreeNodesChanged.

Ungrouping

LimeWire supports the ability to ungroup search results. However, even when you ungroup results, the grouping is still maintained internally; as described above, we always run the result grouping algorithm, so it seems a waste to throw this information away. A corollary is that grouping is actually cheaper than ungrouping!

To see how this works, let’s start with a slightly more detailed view of a grouped result panel. The diagrams above omit two details. First, TableLine's use ArrayList's to store their children. Secondly, ResultPanelModel stores a reference to an undisplayed TableLine called fakeGroupedRoot. fakedGroupedRoot is never equal to the root. During normal operations, its children are aliased to those of the root. This is shown below:

Image:ResultGroupingImage006.png

When we ungroup results, fakeGroupedRoot continues to store the grouped result structure. The leaf children of fakedGroupedRoot are then cloned and stored under the root. In other words, each search result is stored twice, though only the ungrouped results are displayed. This is shown below. Note that TableLine B is not cloned, as it represents multiple nodes in the search results, not a single result.

Image:ResultGroupingImage008.png

While in the ungrouped state, new search results must be added as children of fakedGroupedRoot and the real root. To go back to the grouped state, we just discard the ungrouped TableLine clones.

Addendum: Row and Column Filtering

This part of the document is meant to be read after the above design has been understood. The subject of this addendum is the design changes that have been implemented in the the com.limegroup.gnutella.gui.search package. The purpose of the changes was to add the capability to add nifty new features like the ability to turn columns and rows on and off based on certain criterion (user selection, filtering etc). The ResultPanelModel has been replaced by a new class hierarchy. Before this change was made the ResultPanelModel used to extend AbstractTreeTableModel (which used to implement TreeTableModel). The ResultPanelModel was responsible for grouping, ungrouping, sorting, and returning values of the various cells in the JTreeTable. However with this architecture it was not possible to allow filtering of rows and the ability to only view the columns of ones choice.

The new class hierarchy is illustrated in the figure below. A brief explanation of the classes and their functions follows. For those of you who are confused by the choice of class names, we are using database terminology here. The TableLineSelector has the responsibility of selecting rows to display, while the TableLineProjector has the responsibility of projecting rows that the user has chosen to see.

Image:ResultGroupingImage001.gif

The ResultPanel has a handle to a TableLineSelector, which extends TableLineProjector, which extends ResultPanelModel. The class hierarchy discussed above is the reverse of the order in which delegation takes place. The TableLineSelector is the class to which all the requests are made. The TableLineSelector does the tasks that are relevant to it, and delegates the rest to its super class (TableLineProjector). The TableLineProjector, similarly does the tasks it can/should do, and delegates to it super class (ResultPanelModel).

TableLineSelector: A TableLineSelector may have a filter, which sets the criterion for TableLines that should be displayed in the JTreeTable. The TableLineSelector gets the TableLines which are to be added (to the underlying model of which the TableLineSelector is a part). The TableLineSelector applies the selection criterion to the line. If the line meets the criterion, then it remembers the index of the line in the underlying model (here we are referring to the ResultPanelModel since it the class that remembers all the lines ever added anyway). This arrangement allows for the criterion for display of TableLines to be determined at a later time. The TableLine simply rebuilds its indices based on the new criterion, by getting (all) the lines stored in the ResultPanelModel.

Functions like sorting, grouping, ungrouping are actually done within the ResultPanelModel. If these operations are invoked, the TableLineSelector first asks the ResultPanelModel to do the operation, and then rebuilds its mapping of which lines to hide and which ones to show (this is necessary since the underlying indices have been changed as a result of these operations).

As alluded to above, in addition to the above operations there is one more operation that would require the TableLineSelector's internal mapping to be rebuilt - when a filter is set. Setting a filter implies that the conditions for displaying a TableLine have been changed, consequently we need to rebuild the mapping of TableLines in the TableLineSelector.

TableLineProjector (Abstract): The TableLineProjector is responsible for maintaining information about hidden and visible columns of the model. This allows users to be able selectively hide and show columns they are interested in seeing. The class has the ability to add a schema, this is because schema fields become columns in the view.

The class maintains a list of all the schemas it know about, and a another list of column names (comprising of the normal columns and the display names of the fields for all the schemas it knows about), and yet another list of indices of columns that user has chosen to see.

When a schema is added to the TableLineProjector, the display names of the fields of the schema are added to the list of column names. Similarly, when a user chooses to hide column, the index of that column is removed from the list of selected indices; and when a user chooses to show a column, the index of that column is added to the list of selected indices.

The JTreeTable asks for values in certain cells. The TableLineSelector translates the row number based in its mapping, then the TableLineProjector translates the column number based on it's mapping. And finally asks the underlying data-model (the ResultPanelModel) to get the required value.

ResultPanelModel (Abstract): The structure of this class is not dramatically different than it was before we started making the design changes. The class is responsible for grouping, ungrouping, sorting of the TableLines. It acts the backing store for all the TableLines whether or not they are displayed in the JTreeTable.

The few things that are different about this version of the ResultPanelModel from the previous version are, that the newer version does not

  • fireTreeNodeInserted, fireTreeNodeChanged etc.
  • Deal with row and column specific information.