The application of the set cover problem in network design has two
variants: the facility location problem and the
-median problem.
The facility location problem minimizes the joint cost of server
installation and the cost of connecting each client to its designated
server. This problem has been applied to designing and placing network
concentrators. The
-median problem minimizes the cost of
connections between clients and servers under the constraint that no
more than
servers can be used. Both problems are NP-Hard. The best
known approximation algorithms can achieve constant
ratio [#!facility-const-approx!#,#!guha-improved!#,#!facility-lp!#], if the
connection cost is symmetric and satisfies the triangle inequality.
For arbitrary cost, the worst case bound is
. However,
neither of the problems can be applied directly to the design of
overlay networks, since in the overlay model the communication
channels between clients and servers are over the commodity Internet
and do not incur any cost to service providers. Rather, the major cost
is the number of servers needed to service all clients and the access
bandwidth required at each server's network interface.
Our model of server placement more closely resembles the set cover
problem. The classic greedy algorithm for solving the set cover
problem [#!greedy-setcover-johnson!#,#!greedy-setcover-lovasz!#]
achieves an
performance ratio. In geometric spaces, the
problem is easier. In [#!shifting-lemma!#], Hochbaum proposed a
shifting strategy that gives an
performance ratio.
Unfortunately, the interconnections between networks dictate that the
network propagation delay no longer exhibits the geometric properties
of distance.
References [#!qiu-infocom01!#,#!wcw01!#] studied the problem of placing
cache replicas in the network and formulated it as the
-median
problem: given a specific number of servers, what is the best
placement that achieves the highest average service level to clients,
where service level is indicated by access delay from a client to its
nearest replica. In [#!qiu-infocom01!#], Qiu et al. proposed several
placement strategies including: a greedy strategy that incrementally
places replicas to achieve highest service quality; a hot-spot
strategy that places replicas near the clients that generate the
greatest load. In [#!wcw01!#], the authors also proposed a max degree
strategy by placing replicas in decreasing order of nodes' degrees. By
simulating over several synthetic and real network graphs, they
concluded that the greedy strategy performs remarkably well, achieving
within 1.1 to 1.5 of the lower bound.
Our approach to network design is from a different angle. we are more interested in examining the necessary cost, in this case the number of servers, if we want to provide all clients the guaranteed service. This gives service providers insight into the relation of network cost and the achievable service quality. Then, they can make adjustments to reflect their revenue stream, such as eliminating servers that are only serving small numbers of clients. Contrarily, the work in [#!qiu-infocom01!#,#!wcw01!#] measures the average service quality which masks the number of unhappy customers. Additionally, the performance of our approach, which is the number of required servers, is not susceptible to the cost metric of connections, since we only use it to categorize clients as serviceable or not by a server; while theirs is achieved for a specific cost metric, namely the access delay. Since the connection cost metric depends heavily on the application, it is not known if the same ratio could be achieved.
Another difference is that we model the network geographically and consider server co-locations. As networks overlap geographically, the number of potential server locations is much fewer in number than the number of network nodes that need to be considered. In [#!qiu-infocom01!#,#!wcw01!#], they used network graphs consisting of tens of thousands nodes for router-level graphs and thousands of nodes for AS-level graphs. Consequently, the optimal algorithm based on LP relaxation is too expensive for their models. We think considering the geographical locations of servers is important and corresponds to the current trend in the Internet, where servers are typically located in data centers in different geographic areas. This reduces the problem size and enables us to solve it more optimally. In Section 5.3, we compare the performance of our algorithm both with co-location and without, and show that with co-location we can reduce the number of required servers to approximately half of that with no network co-locations.