Reviewer: Mike Wilson
Date: 9-22-2005
How would you rate this paper, relative to others we have read? top 10%
How would you rate your knowledge of the topic of this paper? novice
What problem or issue does the paper address? Why is it important?
The paper is directed at information search methods in very large dataspace, distributed peer-to-peer networks. The problem is important because
What are the main contributions of the paper and why are they important?
This paper creates a technique for clustering documents semantically using an overlay network. That is, overlay networks are created such that a node housing documents has neighbors with similar documents, for any given topic area. Topics are determined by a semantic vector created by examining frequency of words in a document, with heavier weight given to words that are comparatively rare in other documents. While this technique is well known, the authors add an algorithm for applying semantic vectors to information retrieval. Because there are significant problems with the straight-forward application of this algorithm, the authors also provide workarounds for the problem cases, such as dimensionality mismatch between the overlay network and the semantic vectors.
How significant are these contributions relative to previous work?
The general concept is fairly simple and straight-forward. The authors take an existing architectural concept - the Content Addressable Network (CAN) and apply a search technique to it. The result is entirely novel and highly significant - even ground-breaking. Unfortunately, like most truly novel systems, there are many problems not considered. Judged solely on the basis of novelty, the work is highly to be recommended, but it should be borne in mind that we refer to the "bleeding edge" for a reason! A more efficient system can be made by working through the system from scratch and designing out the flaws - not creating a "rolling index" bandaid to patch it, contributing a much larger storage penalty and processing time. (Index publishing and searches each involve O(p) operations, where p is the number of subvectors in the partition.) The authors also fail to investigate a critical concern in peer-to-peer networks: nodes frequently join and leave p2p nets. As it turns out, later work demonstrates that this is actually a key problem for pSearch. Finally, the index storage and processing load for any given node is dependent on a uniform distribution of data. If the distribution is heavily skewed, it is possible that some nodes may carry much more than their fair share. (Also demonstrated by later work.)
Give detailed comments justifying your view of the paper.
This paper is low-hanging fruit - the result of taking a known architecture - the CAN - and applying known techniques to create a search method. More recently, other researchers have taken the system back to basics and created a system that does not suffer from the dimensional mismatch problem by applying dimensional reduction techniques. For a good example, see Semantic Small World: An Overlay Network for Peer-to-Peer Search, with offers direct comparisons to pSearch, with both reduced cost and increased accuracy. It also demonstrates that search failure rates are greatly impacted by node failure in pSearch, and the load distribution problem detailed above. While these comments might be taken as negative, the truth is that the authors present a novel system. No truly novel system comes to fruition without many points of useful optimization. Later work, while pointing out the many flaws, is ultimately just optimization of a new and useful technique.