This is part of Early Gnutella
Keyword Matching
Specifications and Algorithms
Searching the list of files for matches to a query is one of the expensive parts of LimeWire. As of this writing (12/12), it consumes about 87% of the CPU time. A number of improvements are possible, some resulting in weaker policies. These are described below in terms of the difficulty of implementation. In all of the following algorithms, case is ignored. I'll also ignore wildcards; because wildcard searches are so rare, they can be special cased.
Let WORDS(s) be all the words of a string, tokenized by spaces, +'s, etc. Example: WORDS("beatles yellow+submarine") is the set {"beatles", "yellow", "submarine"}. I will make much use of this helper function.
1. Keyword Substring Matching
This is the strongest policy. A query Q matches a file name F if all strings in WORDS(Q) are a substring of F. For example, "ello eatle" will match "beatles yellow+submarine" since both "ello" and "eatle" are substrings of "beatles yellow+submarine".
This is the current policy. It is implemented in a straightforward manner, by searching each file name in a list. The problem is that the time complexity is proportional to the number of files and the length of those file names. If you have 20K files, that's a problem. At the end of this document, I give a faster algorithm.
2. Keyword Subset Matching
This is the weakest policy. A query Q matches a file name F if WORDS(Q) is a subset of WORDS(F). For example "submarine beatles" will match the above file name, but "sub beatles" will not.
This can be implemented efficiently by indexing each file name and storing the keywords in a hash table. Then matching can be done in constant time. The space required is only proportional to the sum of the lengths of all filenames. Space can be compressed further by storing the keywords in a Trie, a tree-like structure discussed below.
3. Keyword Prefix Matching
A query Q matches a file name F if all strings in WORDS(Q) are a prefix of some element of WORDS(F). For example, "sub beatle" will match the above file name, but "sub eatles" will not.
This is best implemented by indexing each file name and storing the keywords in a Trie. This is a tree-like data structure that takes advantage of words with common prefixes. For example, the files "funny mp3 and funny fund" have the keywords "funny", "mp3", and "fund". These would be stored in the Trie as
fun
ny
d
mp3
Searching then proceeds in linear time with respect to the length of the query.
4. Faster Keyword Substring Matching
The same policy described in (1) can be implemented in the same time as (3) by using a [http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Suffix.html | Suffix Tree], but at the cost of more space. Assume we have the files "cat-mp3" and "the-cat". To naively compute the suffix tree, we first compute all the suffixes of each file name:
3
p3
mp3
-mp3
t-mp3
at-mp3
cat-mp3
t
at
cat
-cat
e-cat
he-cat
the-cat
Then we sort them:
-cat
-mp3
3
at
at-mp3
cat
cat-mp3
e-cat
he-cat
mp3
p3
t
t-mp3
the-cat
Finally we observe that many of these suffixes have common prefixes. Taking advantage of this, we compress them into a Trie:
-
cat
mp3
3
at
-mp3
cat
-mp3
e-cat
he-cat
mp3
p3
t
-mp3
the-ca
Now let W be an element of WORDS(Q). Note that W is a substring of F iff W is a prefix of some substring of W. Using this idea, we simply match W against the Trie above. This takes time proportional to the length of W.
The problem is that the suffix tree can be quite large, since representing each file in the tree may take space quadratic with respect to the length of each file name. If many files have overlapping prefixes, e.g. "Funny video" and "Funny movie", this will be ameliorated somewhat. Furthermore, if we weaken the specification slightly, we can reduce the space requirements by only computing the suffix tree on the keywords in each filename, not the full filename.
Christopher Rohrs
12/12/2000

