Year: 2019, Volume: 1, Issue: 1
First page: 28, Last page: 45
Print ISSN: 0975-8070. Online ISSN: 0975-8089.

Ontological Directory and Directory Load-Balancing for Large-Scale Grids

Li Juan1,*

1Computer Science Department, North Dakota State University, Fargo, USA

*J.Li@ndsu.edu

Abstract

If not properly organized, searching an Internet-scale grid for quality resources is like looking for a needle in a haystack: the space to explore is simply too large. In this paper, we propose an ontology directory that organizes the huge, chaotic grid space into multiple semantics-based sub-spaces. Participants of the grid can browse and search for resources through this directory, and sharing and collaboration can then be confined to the relevant directories. The result is an efficient and scalable grid-based integration and interoperability infrastructure. However, guaranteeing that the system will not be overwhelmed by load imbalance becomes much more significant, especially when factors such as item popularity and skew are taken into consideration. To address this problem, we propose an effective load-balancing solution that takes peer heterogeneity and access popularity into account when determining the load distribution. Our algorithms achieve load balancing by dynamically balancing the query routing load and the query answering load, respectively. Experiments illustrate that our balancing algorithms effectively balance the system load and significantly improve performance.


Keywords

Grid, Ontology, Semantics, Load Balancing.


Introduction

Grid computing [26] is a virtualized distributed computing environment aimed at enabling the sharing of geographically distributed resources. Grid resources have traditionally consisted of dedicated supercomputers, clusters, or storage units. With today's ubiquitous network connections and the growing computational and storage capabilities of modern everyday-use computers, more and more inexpensive PCs and various devices such as PDAs and sensors at the periphery of networks are joining the grid. Grid research is therefore shifting its focus from scientific computing to a pervasive, world-wide resource-sharing infrastructure. However, grid resources and users are geographically distributed and owned by different organizations. The fact that users typically have little or no knowledge of the resources contributed by other participants in the grid poses a significant obstacle to their use.

The provision of an information service, as currently envisaged by the grid community, is a first step towards the discovery of distributed resources. However, a large part of these efforts has been focused on "getting it to work," without directly addressing issues of scalability, reliability, and information quality [17]. For example, classical grids typically use centralized or static hierarchical models to discover resources; the Globus Toolkit [28] is a well-known example. Globus users can obtain a node's resource information by directly querying a server application running on that node, or by querying dedicated information servers that retrieve and publish the resource information of the organization. Although interactions between these information servers are supported, a general-purpose decentralized service-discovery mechanism is still absent. For this reason, building a directory that enables users to efficiently discover resources of interest and to find users with similar interests is a vital part of a grid system. An effective directory enables sharing and collaboration in a timely and reliable manner.

We propose a fully-decentralized ontology directory that defines a hierarchy of categories of a domain of interest, used to represent the semantics of resources. It is like a yellow page or a "rendezvous" point, allowing nodes to find resources and contacts with similar interests. The directory does not need to be predefined; it grows spontaneously as network interests evolve. To avoid the bottleneck caused by a centralized server, we implement the directory with a Distributed Hash Table (DHT) [19, 20, 27]. A hierarchical path identifies the relations among the categories on that path. The system transforms an ontology directory path into a set of numeric keys in a way that preserves the expressiveness of the semi-structured data, facilitates an even data distribution across the network, enables efficient query resolution, and provides a flexible lookup interface.

One major problem of this directory overlay is the load imbalance caused by the inherently uneven hierarchical namespace and highly variable directory popularity. To solve this problem, we propose an effective load-balancing solution that takes node heterogeneity and access popularity into account when determining the load distribution. Our algorithms achieve load balancing by dynamically balancing the query routing load and the query answering load, respectively.

In the rest of this paper, we present the design and implementation of ontological directories in Section 2. Then in Section 3 we describe a comprehensive scheme for solving the load unbalancing problem of the directory overlay. We validate our models using simulation in Section 4. Finally, Section 5 concludes the paper.


Ontological Directory

An open and dynamic grid should allow users to use the grid structure and resources for a wide variety of purposes. Grids will be fully exploited only when people can quickly and conveniently locate desirable resources. In order to organize different interests and facilitate the construction of virtual organizations (VOs) [5], we propose an abstract generic ontological model that guides users in determining the desired ontological properties and choosing the "right" VOs to join. Ontology is defined as "a formal, explicit specification of a shared conceptualization" [28], which refers to the shared understanding of domains of interest. Our ontology model defines the most general categories of existence (e.g., existing item, spatial region, dependent part), which essentially form a hierarchy where each entry corresponds to a categorical domain. Here we provide a formal definition of the ontology directory.

DEFINITION 3.1: An ontology directory is a system D=(L,H,r), which consists of:

  • A lexicon: The lexicon L contains a set of natural language terms.

  • A hierarchy H: Terms in L are taxonomically related by the directed, acyclic, transitive, reflexive relation H (H ⊂ L×L);

  • A root term r ∈ L. For all l ∈ L, it holds: H(l, r).

The ontology directory essentially defines a hierarchy where each node corresponds to a lexical or categorical term. It is almost a rooted tree structure, with only rare nodes having multiple parents. The subordination relationship between nodes is interpreted as the involvement (topic/subtopic) relationship, while the layers of nodes correspond to intuitively perceived levels of topic abstractness. Each node is described by primitives, i.e., generic concepts that subsume other concepts; an example of a primitive is computer, which includes software, hardware, networks, and so forth. The hierarchical relationship, also called the ISA relationship, is transitive, i.e., whatever holds for a more general concept also holds for a more specific concept; e.g., music is a type of art. Our ontology directory is different from global web directories, such as the Google directory, Yahoo directory, and DMOZ [23], because it is not predefined but is created and extended automatically as the network grows and the ontology evolves. Moreover, the ontology directory loosely defines domain categories; it does not expect different communities of users to conform to the same ontology to describe their resources and interests. Therefore, it is based on multiple ontologies as opposed to a global ontology. Fig. 1 shows a fragment of an example ontology directory.
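To make Definition 3.1 concrete, the following minimal Python sketch models an ontology directory as a rooted hierarchy with a transitive, reflexive ISA relation. All names here are our own illustration, not an implementation from this paper:

    # Minimal sketch of Definition 3.1: D = (L, H, r).
    # Class and method names are illustrative, not from the paper.

    class OntologyDirectory:
        def __init__(self, root):
            self.root = root                 # root term r
            self.parents = {root: set()}     # hierarchy H, stored as child -> parents

        def add_term(self, term, parent):
            """Add a lexicon term under an existing parent (topic/subtopic)."""
            if parent not in self.parents:
                raise KeyError("unknown parent term: " + parent)
            self.parents.setdefault(term, set()).add(parent)

        def isa(self, term, ancestor):
            """Transitive, reflexive ISA check over H."""
            if term == ancestor:             # reflexive case
                return True
            return any(self.isa(p, ancestor) for p in self.parents.get(term, ()))

    d = OntologyDirectory("root")
    d.add_term("art", "root")
    d.add_term("music", "art")
    assert d.isa("music", "art")             # music is a type of art
    assert d.isa("music", "root")            # H(l, r) holds for every l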

Distributed directory implementation

To implement the ontology directory in a decentralized manner, we need an efficient and scalable structure to index and look up the hierarchical taxonomy. Many applications, such as Canon [7] or HIERAS [22], use hierarchical DHTs to index hierarchical data; however, they are not applicable to our system. Specifically, in Canon, information is only stored in real nodes (leaf nodes), while all internal nodes are purely virtual; in our system, directory information is stored in real (non-virtual) internal nodes. In addition, it is difficult to implement directory browsing inside the Canon network. HIERAS constructs the multi-level hierarchy by creating a separate DHT overlay for each subdirectory. As mentioned by Artigas et al. [1], "DHTs are mainly designed for very big networks, and the creation of several DHTs assigned to sub-domains or layers can represent a burden for the problem that they aim to solve". The major problem of the multi-level DHT architecture is that maintaining each level of the hierarchy brings extra overhead, which is wasteful for subdirectories that are not large enough to justify their own overlay. In our work, we propose a simple flat DHT structure to index the hierarchical directory. This structure enables economical, flexible, and efficient lookup services.

In this section, we describe techniques for indexing a hierarchical ontology directory in a flat DHT overlay, in particular a Pastry overlay. Our system provides multiple interfaces permitting users to access the directory in different ways. To efficiently index and retrieve the hierarchical ontology directory with a flat DHT structure, we need to extend the basic DHT API. The directory path starting from the root is used to represent the ontology domain (e.g., /computer science/systems/network). A particular directory entry should include contact information for peers in this directory. A direct indexing scheme is to index the full directory path as a key, so that users can locate a VO by providing the full directory path. However, as when navigating a UNIX file system, users rarely input an absolute directory path; rather, they browse directories level by level, selecting the most interesting one at each level. Therefore, it is necessary to provide users with an ontology browsing interface. Moreover, to automatically locate related directories for nodes, we extract key concepts from the joining nodes' ontologies and then use them as keys to locate the right directory domain. Therefore, we should also provide a keyword-based lookup interface.

Consider the ontology model in Fig.2. It consists of taxonomy paths:

/computer science

/computer science/systems

/computer science/hardware

/computer science/systems/network

/computer science/systems/processor

/computer science/systems/network/architecture

/computer science/systems/network/protocol

Some domains may relate to keywords. For example, the keywords cluster, grid, and P2P relate to the taxonomy path /computer science/systems/network/architecture, and the keywords protocol, TCP, and IP relate to /computer science/systems/network/protocol.

For each path and keyword, a hash value (key) is computed in Pastry using the SHA-1 algorithm. Table 1 shows keys for the taxonomy paths and keywords of the model. To keep the example simple, we use a 4-digit (8-bit) identifier space; in reality a much larger identifier space is used, such as 128 or 160 bits. Each key is assigned to the node nearest to it in the key space. For example, as listed in Table 1, the hashed key of the directory path /computer science/systems is 0230, and the key is stored at node 0213 as shown in Fig. 3, since node 0213's id is numerically closest to the key. Each owner node of a directory key maintains a Least Recently Used (LRU) cache storing contact information of peers interested in this directory. To implement the directory browser's functionality, an overlay node in charge of a directory entry also stores information about that directory's direct children. When the user chooses one directory, Pastry routes to that directory entry and retrieves the child directory information, allowing the directory tree to be extended dynamically while browsing. An overlay node also stores keywords that are hashed to it and links those keywords with the related ontology domains. Fig. 3 shows how the directory model above is stored in an example Pastry network.
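The indexing step described above can be sketched in Python as follows. The helper names and example node ids are our own, and actual SHA-1 output will not reproduce the illustrative keys of Table 1:

    import hashlib

    ID_BITS = 8          # 4 base-4 digits, as in the example; real systems use 128/160 bits
    SPACE = 1 << ID_BITS

    def to_key(s):
        """Hash a directory path or keyword into the small identifier space."""
        return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big") % SPACE

    def to_base4(x):
        """Render an 8-bit id as 4 base-4 digits (the 0230-style ids above)."""
        return "".join(str((x >> s) & 0b11) for s in (6, 4, 2, 0))

    def closest_node(key, node_ids):
        """A key is stored at the node numerically closest in the circular id space."""
        dist = lambda n: min((n - key) % SPACE, (key - n) % SPACE)
        return min(node_ids, key=dist)

    nodes = [to_key("node%d" % i) for i in range(6)]   # 6 illustrative node ids
    k = to_key("/computer science/systems")
    print("key", to_base4(k), "stored at node", to_base4(closest_node(k, nodes)))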

Directory lookup

We provide three kinds of lookup interfaces for users: exact lookups, browser-based lookups, and keyword-based lookups. Nodes can use these three interfaces to locate the VOs they are interested in and join them.

Exact lookups: This is the simplest form of lookup. This type of query contains the complete directory path of the domain of interest, for example "/computer science/systems/network/architecture". The complete directory path is hashed to a key, and a corresponding lookup of the hashed key is executed on the Pastry overlay.

Browser-based lookups: In this case, users do not need to remember the directory path to locate the directory domain of interest. Instead, they can navigate from the root of the hierarchy down to the leaves to reach the directory of interest. A user first uses the root id as the key to locate the node storing the root of the ontology model. Since a node storing a directory entry also stores its next-level children, users can dynamically expand a directory tree node to browse its child branches. After the user chooses a branch of interest, the directory path of that branch is used as the key to look up the next directory level. In this way, the tree is expanded until the user finds the desired directory entries. In practice, the root and top-level categories are widely cached at most nodes in the network; therefore they can be quickly located without going through the overlay.
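A sketch of this browsing loop, assuming a hypothetical dht_get helper that performs the overlay lookup for a directory entry and returns the child names stored with it, and a choose callback standing in for the user's selection:

    def browse(dht_get, choose):
        """Walk the directory level by level.
        dht_get(path) -> list of child names stored with that entry (assumed helper);
        choose(children) -> the branch the user picks, or None to stop."""
        path = "/"                        # start at the root entry
        while True:
            children = dht_get(path)      # one overlay lookup per level
            picked = choose(children)
            if picked is None:
                return path               # user reached the directory of interest
            path = path.rstrip("/") + "/" + picked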

Keyword-based lookups: Users can also specify one or more key concepts of their local ontology and use the concepts as keys to look up the corresponding directory in the overlay. Since the overlay nodes in charge of the keywords keep links to the corresponding directory entries, a keyword-based lookup can be converted into an exact lookup. When a user provides multiple keywords, each of which may correspond to multiple directories, the intersection (or union) of all directories related to these keywords is returned to the user. Domain ontologies and/or external generic ontologies like WordNet [17, 4] can be used for keyword semantic query expansion or keyword conceptual indexing in order to improve retrieval performance.
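A sketch of the keyword-to-directory resolution, again assuming a hypothetical helper that returns the directory paths stored with a hashed keyword:

    def keyword_lookup(dht_get_keyword, keywords, mode="intersection"):
        """Resolve each keyword to the set of directory paths linked to it,
        then combine the sets. dht_get_keyword is an assumed helper that
        returns the paths stored with a hashed keyword."""
        sets = [set(dht_get_keyword(k)) for k in keywords]
        if not sets:
            return set()
        combine = set.intersection if mode == "intersection" else set.union
        return combine(*sets)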

Directory overlay maintenance: The directory overlay nodes are also user nodes. We exploit the heterogeneity of grid nodes and promote the stable and powerful ones to join the directory overlay. Excluding ephemeral nodes from the directory overlay avoids unnecessary maintenance costs. The maintenance of the directory overlay mainly consists of adding new directory entries; we assume deletions and updates do not occur frequently. When a newly joining node cannot find its category of interest, it may apply to create a new category. If the application is approved by the authoritative organizations in the grid, the node creates this category by hashing the directory path to an overlay node and informing the parent node to add this entry. It then hashes each of the main key concepts in its ontology to the overlay network. A node joins the directory overlay only when three conditions are satisfied: (1) it satisfies the capacity requirements, i.e., it is powerful enough; (2) it has been stable for a threshold time period; and (3) the directory load balancing algorithm (explained in the next section) requires it to do so. These conditions can be summarized as a predicate, sketched below.
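In the sketch, the thresholds are deployment-specific placeholders, not values from the paper:

    def may_join_overlay(capacity, min_capacity, uptime, uptime_threshold,
                         balancer_requests_join):
        """A node is promoted into the directory overlay only when all three
        conditions from the text hold (thresholds are illustrative)."""
        return (capacity >= min_capacity           # (1) powerful enough
                and uptime >= uptime_threshold     # (2) stable long enough
                and balancer_requests_join)        # (3) the balancer asks for it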


Directory Overlay Load Balancing

The directory overlay uses a DHT to distribute directories randomly among peers, but the consistent hashing used by the DHT may cause some peers to store O(log N) times as many objects as the average peer [20]. In addition, peer capacities, such as computational power, storage capacity, and network bandwidth, differ widely among peers. Even with a uniform workload distribution, serious load imbalance may therefore occur. Further imbalance may arise from the non-uniform distribution of directory entries in the identifier space. The situation is even worse for hierarchical systems such as our ontological directory hierarchy: servers hosting nodes at the top of the hierarchy incur disproportionately more load than servers hosting leaf nodes. Last but not least, directory queries tend to be skewed, i.e., certain directories are far more popular than others. Heavy lookup traffic is experienced at the peers responsible for popular objects, as well as at the intermediate nodes on the lookup paths to those peers. When subsequent tasks are then obliviously assigned to an already overloaded node, the average response time increases drastically. We aim to balance the highly unbalanced load caused by skewed directory distribution through a comprehensive balancing mechanism, which includes an adaptive load redistribution scheme as well as a dynamic routing table reconfiguration scheme.

Existing balancing strategies

There have been many load balancing schemes proposed for DHT-based systems. Roughly, we divide them into four categories:

The virtual server approach [8, 11, 12] addresses the imbalance of the key distribution caused by the hash function. Each physical node instantiates O(log N) virtual servers with random IDs that act as peers in the DHT, which reduces the load imbalance to a constant factor. To address peer heterogeneity, each node creates a number of virtual servers proportional to its capacity. Unfortunately, the use of virtual servers greatly increases the amount of routing metadata needed on each peer and causes more maintenance overhead; in addition, the number of hops per lookup (and the corresponding latency) increases as well. Moreover, this approach does not take object popularity into account.

Unlike the virtual server approach, the dynamic ID approach uses just a single ID per node [12, 16, 18, 21]. The load of a peer can be adjusted by choosing a suitable ID in the namespace. However, all such solutions require IDs to be reassigned to maintain load balance as nodes dynamically join and leave the system, resulting in high overhead, since reassignment involves transferring objects and updating overlay links.

The third class of approaches uses multiple hash functions to balance the load. In the power-of-two-choices load balancing algorithm, a node uses multiple hashes to generate a set of IDs and, at join time, selects an ID so as to minimize the discrepancy between capacity and load for itself and for the nodes affected by its joining. While such a strategy is simple and efficient, it increases the computational overhead of publishing and retrieving content, since multiple hash functions must be computed each time; in addition, the allocation is static and does not adapt when the workload distribution shifts.

The last category of balancing schemes relies on caching and replication [9, 19, 20]. Hotspots and dynamic streams are handled by using caches to store popular objects in the network, and lookups are considered resolved whenever cache hits occur along the path. Pastry [19] and Chord [20] replicate an object on the k servers whose identifiers are closest to the object key in the namespace to improve availability, which also helps balance the load of a popular topic. Unfortunately, the last few hops of a lookup are precisely the ones that can least be optimized [6]. Moreover, since the query load is dynamic, a fixed number of replicas does not work well: if the number is set too high, resources may be wasted; if it is set too low, the replicas may not suffice to support a high query load.

The most significant load imbalance problem of our directory overlay is caused by skewed directory popularity, so we focus on this problem. Our load-balancing solution belongs partially to the last category of the aforementioned schemes: it replicates and dynamically reconfigures the routing table to balance the heterogeneous request load. The existing approaches, especially caching, are orthogonal to our solution.

Adaptive load balancing scheme

In this section, we detail our load balancing scheme, focusing on the imbalance caused by heterogeneous directory popularity. We propose a comprehensive load balancing strategy that addresses this problem by dynamically redistributing the load of hot spots to "cold spots". In particular, we distinguish two types of load: query answering load and query forwarding load (query load and routing load for short). To balance these two kinds of load, we propose three strategies: (1) adaptive object replication, which targets the query load, and (2) adaptive routing replication and (3) dynamic routing table reconfiguration, both of which target the routing load. Each node analyzes the main cause of its overloading and applies the appropriate balancing algorithm to correct its situation. The scheme is generic enough to solve the load balancing problem of general DHT applications. In the rest of this section, we use the terms peer and node interchangeably to refer to a node of the directory overlay.

Load metric

Our load balancing scheme relies on a load metric to gauge the activity of each peer and make the necessary adjustments. Each peer p in the network has a capacity C for serving requests, which corresponds to the maximum amount of load that it can support; in this paper, capacity is derived from the maximum number of queries that can be routed, answered, or queued per second by the peer. We assume that any arriving traffic that can be neither processed nor queued by the peer is dropped, and that nodes define their capacities consistently via a globally agreed-upon metric.

At any given time, the load of peer p is defined as the number of requests received per unit of time. We focus on two kinds of requests: query routing requests and query answering requests. Upon receiving a routing request, the peer checks its routing table and forwards the query to the next hop. If it receives a query answering request (meaning that it stores the hashed key locally), it serves that request according to the application's needs; for our directory overlay, this may include retrieving sub-directories or registering peers to VOs, while for other applications it may include answering a complex query, transferring a file, and so on. The current load value L of a node is defined in Equation (1) as the sum of its current query load and its current routing load:

L = L_query + L_routing    (1)

Both kinds of load can be represented by the number of requests received per unit time. Assuming a unit load l, where each query answering request creates a units of load and each routing request creates b units of load, (1) can be converted to (2):

L = (a * N_q + b * N_r) * l    (2)

where N_q is the number of query answering requests per unit time and N_r is the number of routing requests per unit time.

For any given peer p, we also define an overloading threshold To, which represents the point beyond which additional workload placed on the peer induces overloading and triggers load redistribution for p. This value can be expressed as a fraction of the peer's capacity, e.g., To = 0.8C, meaning that p is considered overloaded when it reaches 80% of its capacity. We also introduce another load threshold, Ts, which represents the "safe" workload level for a peer: a peer agrees to accept redistributed load from an overloaded peer only when its own load is below Ts, e.g., Ts = 0.6C. The goal of load redistribution is to bring the workload of all participating peers below their respective Ts, in order to guarantee that none of them will be overloaded again soon after the redistribution.
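Putting the load metric and the two thresholds together, a minimal sketch (function names are ours; the a:b = 5:1 weighting matches the simulation settings used later in the paper):

    def current_load(n_query, n_routing, a=5, b=1, unit=1.0):
        """Equation (2): weighted request counts per unit time."""
        return (a * n_query + b * n_routing) * unit

    def is_overloaded(load, capacity, t_o=0.8):
        """To = 0.8C: crossing this triggers load redistribution."""
        return load > t_o * capacity

    def accepts_shed_load(load, capacity, t_s=0.6):
        """Ts = 0.6C: only peers below this agree to take shed load."""
        return load < t_s * capacity

    load = current_load(n_query=40, n_routing=100)   # 40*5 + 100*1 = 300
    print(is_overloaded(load, capacity=350), accepts_shed_load(load, capacity=600))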

The adaptive object replication algorithm

Nodes storing very popular objects are susceptible to being overwhelmed by external requests for those objects. In this case, redistributing load by shedding objects and keys to other nodes does not guarantee any noticeable improvement, since even a single very popular key can overload a node. We therefore use a replication-based method to relieve the load of overwhelmed nodes: by replicating the popular keys of overloaded nodes to lightly loaded nodes, we help balance the network load. While the idea of balancing by replication is not itself new, the when, where, and how we propose are. Specifically: when does replication occur? Where do we find candidates to help an encumbered node? And how are the consequences of the redistribution announced to the rest of the system?

When: Each peer periodically checks its current load via the load metric described above. If the load is above the overloading threshold (i.e., L > To), and the overloading is caused mainly by query load, the peer picks a lightly loaded node to which it replicates its keys, thereby sharing the load. When more than one peer is responsible for a popular key, each responsible peer handles only part of the load, reducing the chance of overloading.

Where: Upon detecting that it has crossed the overload threshold, a node issues a replica discovery query to the network, broadcast (with a limited number of steps) down the DHT broadcast tree rooted at the querying node. Any lightly loaded node (defined previously as one with current load L < Ts) on the tree path replies with its load information. Once enough responses have been received, the overloaded node begins transferring its keys and objects to these candidates, creating replicas of itself.

How: Once replicas are created, information about the existence of these new replicas must be disseminated. For prefix-based DHTs like Pastry or Tapestry, the replica information is updated at all the peers in the original peer's neighborhood set, leaf set, and routing table. Those nodes in turn update their own state based on the information received. As with the node joining process, the total cost of the replica update, in terms of the number of messages exchanged, is O(log_{2^b} N). Similarly, for Chord-based DHTs, the replica information is updated at the fingers and predecessors of the related nodes to reflect the addition of this replica, requiring O(log^2 N) messages. This process can be carried out asynchronously, since the peers in the routing table already have a pointer to the original peer, and asynchronous updates will not affect the correctness of the system. When a query needs to be forwarded toward a popular key, neighboring nodes can now pick peers in a round-robin fashion from the list of available peers holding the key; queries for the hot key are thus partitioned among the multiple peers storing it. When a popular key later becomes unpopular, the replica nodes can simply discard the replicated keys, using access history to gauge the popularity of the replica.
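The round-robin step at a forwarding neighbor might look like the following sketch (an illustrative structure, not the paper's implementation):

    import itertools

    class ReplicaSet:
        """A neighbor keeps the list of peers holding a hot key and rotates
        through it when forwarding, partitioning the queries for the key."""
        def __init__(self, holders):
            self._cycle = itertools.cycle(holders)

        def next_hop(self):
            return next(self._cycle)

    replicas = ReplicaSet(["0200", "1102"])            # original node plus its replica
    assert replicas.next_hop() != replicas.next_hop()  # successive queries alternate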

Adaptive routing replication algorithm

Replicating popular keys relieves the query answering load of the nodes responsible for those keys. However, another major source of workload in DHT overlays is relaying queries among nodes: a node may be overwhelmed simply by the traffic of forwarding incoming routing queries. For example, the last-hop neighbors of a popular key can be overloaded by forwarding queries to the popular node. While this problem is partially mitigated by the aforementioned replication of popular keys, which disperses the traffic, it is not eliminated, since certain nodes may still effectively function as traffic hubs for popular sections of the network. To address this problem, we propose a balancing scheme that actively redistributes the routing load of an overloaded node by duplicating its routing table to other nodes. When a node is overloaded by routing load, it picks a lightly loaded node to replicate its routing table, so that the replica can share its routing load.

To let replicas share the responsibility of routing, their information must be propagated to other related nodes in the network. For Chord-based DHTs, the replica information is updated at the fingers and predecessors of the related nodes to reflect the addition of this replica; Fig. 4 shows the pseudocode of the replica propagation algorithm. The total cost of the replica update, in terms of the number of messages exchanged, is O(log^2 N). Similarly, for prefix-based DHTs like Pastry or Tapestry, the replica information is updated at all the peers in the original peer's leaf set and routing table. Those nodes in turn update their respective routing tables by adding a replica entry alongside the entry of the original node, so that future queries can be routed to either the original node or the new node, all the while maintaining routing correctness. This process requires O(log_{2^b} N) messages and can be carried out asynchronously, since the peers in the routing table already have a pointer to the original peer, and asynchronous updates will not affect the correctness of the system. Besides load balancing, this replication approach also improves routing resilience in the face of network failures.
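Since the pseudocode itself appears in Fig. 4, the following is only a simplified sketch of the propagation idea, using dict-based stand-ins for Chord's finger and predecessor state (all names are ours):

    def propagate_replica(original, replica, fingers, predecessors):
        """Every node whose finger table points at the overloaded original
        also records the replica, as do the original's predecessors.
        `fingers` maps node -> list of (slot, target) entries."""
        notified = set()
        for node, table in fingers.items():
            extra = [(slot, replica) for slot, target in table if target == original]
            if extra:
                table.extend(extra)     # replica becomes an alternate next hop
                notified.add(node)
        notified.update(predecessors.get(original, ()))
        return notified                 # O(log^2 N) such messages in Chord

    fingers = {"n1": [(1, "n9")], "n2": [(3, "n9"), (5, "n4")]}
    print(propagate_replica("n9", "n9r", fingers, {"n9": ["n8"]}))  # {'n1','n2','n8'}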

Fig. 5 shows an example of the Pastry structure with replicated routing tables. A query for item ID 0221, which is actually served by node 0222, is initiated at node 3012. According to its routing table, node 3012 chooses 0021 as the next hop. Node 0021 determines that node 0200 is the right node to forward the query to. Since node 0200 has a replica at node 1102, node 0021 may instead choose 1102 as the next hop. When the query arrives at 1102, it uses the duplicated routing table of 0200 to serve the query and sends it on to the destination node 0222. When node 0200 is exposed to a high load, its replicas thus absorb some of the traffic, preventing overload.

Dynamic routing load adjusting algorithm

In addition to the use of replication, another way to balance the routing load is to dynamically reconfigure the routing table. In the previously described methods, an overloaded node actively redistributes its own load; in cases where external policies or the network environment prevent this redistribution, replacing routing table entries can still relieve highly loaded nodes.

This algorithm is tailored specifically for DHTs like Pastry or Tapestry. In those systems, many nodes with the same prefix can potentially fill a given entry of a node's routing table; the one actually in the table is the one "closest" to the node according to topological proximity. We propose changing the strategy for choosing routing table entries so as to balance routing load (especially to relieve heavily loaded nodes): in lieu of choosing purely by a proximity metric, we choose according to routing load. When an overloaded node receives a query message from a neighbor, it replies with a message indicating its overloaded status. The neighbor, upon receiving this message, will at the earliest opportunity replace the entry of the overloaded node in its routing table with another node having the same prefix. Lightly loaded candidate nodes can be learned from forwarded query messages, which include the IDs of traversed nodes, or by broadcasting a candidate-discovery query as in the aforementioned replication schemes. By doing so, traffic is shed from the overloaded node as long as it is not the actual end target of the query, since the replacement node can direct any queries the original node could have, and forwarding traffic is spread more evenly.
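A sketch of the replacement step, modeling the Pastry routing table as a dictionary keyed by (row, column); the structures and names are illustrative, not Pastry's actual API:

    def relieve_neighbor(routing_table, overloaded_id, candidates):
        """Replace the entry of an overloaded neighbor with a lightly loaded
        node able to serve the same slot. A row-r entry shares the local
        node's r-digit prefix and has the slot's digit at position r, so a
        valid replacement shares the overloaded entry's first r+1 digits."""
        for (row, col), node in routing_table.items():
            if node != overloaded_id:
                continue
            for cand in candidates:       # learned from passing queries
                if cand != overloaded_id and cand[:row + 1] == overloaded_id[:row + 1]:
                    routing_table[(row, col)] = cand
                    return cand
        return None

    table = {(0, 3): "3200"}              # 1st row, 4th column, as in the example below
    relieve_neighbor(table, "3200", ["3012"])
    assert table[(0, 3)] == "3012"        # traffic now flows to the lighter node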

Continuing the example in Fig. 5: in node 1102's routing table, assume that the neighbor node 3200 (1st row, 4th column) is heavily loaded. When a query passes from node 3012 through 0021 and arrives at node 1102, since 3012 shares the same first digit (3) with the overloaded neighbor 3200 in 1102's routing table, the entry for 3200 can be replaced with 3012. This way, traffic toward the heavily loaded 3200 is redirected to the less busy 3012.

Dynamic load splitting algorithm and caching

It is possible for the whole directory overlay to become saturated, with most nodes experiencing high load. In practice, this is detected when overloaded nodes cannot easily find a replication target. When this happens, the previously described algorithms cannot relieve an overloaded node; instead, we need to add new nodes to the overlay to absorb some load. When a new node registers with an overloaded overlay node in charge of a domain, the overloaded node splits its load, letting the new node join the overlay network and take over part of it. The sharing can be implemented by having the new node choose a suitable ID (close to the overloaded node) so as to take over some directories, or by replicating the directories to the new node, depending on the nature of the overloading.

Caching popular directories is another effective way to alleviate the load of a hot spot. Caching speeds up domain resolution for subsequent queries of popular directories while substantially reducing query traffic on the network. Caching also improves system availability by allowing queries to jump over namespace partitions induced by network failures. Our directory overlay uses recursive queries and allows en-route caching of records: after a query has been resolved, all intermediate nodes that forward the result back to the querying node may store a local copy. Subsequent queries for the same content that cross any node holding a cached copy can then be answered immediately, decreasing the number of hops needed to resolve a query. Caching is orthogonal to our load balancing scheme and cannot replace it: because caching is random and ad hoc, it helps redistribute the load of popular nodes, but it cannot guarantee that overloading is avoided.
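A sketch of en-route caching on the return path of a recursive lookup, with assumed per-node cache dictionaries (names are our own):

    def resolve(path_key, route, origin_lookup):
        """Recursive lookup with en-route caching. `route` is the ordered
        list of intermediate nodes' caches (dicts); origin_lookup resolves
        the key at the owner node when no cache on the path holds it."""
        for node_cache in route:
            if path_key in node_cache:       # cache hit along the path
                return node_cache[path_key]
        record = origin_lookup(path_key)     # resolved at the owner node
        for node_cache in route:
            node_cache[path_key] = record    # copies stored on the way back
        return record

    route = [{}, {}]
    resolve("/computer science/systems", route, lambda k: "entry")
    assert all("/computer science/systems" in c for c in route)  # now cached en route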


Experiments

In this section, we examine the performance of our proposed directory overlay through simulation experiments. Since our directory location problem is similar to key lookup in a DHT, we do not report general lookup performance here; rather, we focus on evaluating how our load balancing algorithms and caching improve system performance.

Methodology

Data set: We base our simulation framework on the RDF dump of the open DMOZ directory [24], since it contains realistic data about the content distribution within a large community. DMOZ is an open directory project (ODP) maintained by a large community of volunteers on the Web. Participants manually categorize web pages of general interest into a topic hierarchy: editors contribute links to web pages, define subtopics, and associate related topics with the DMOZ topic pages. This kind of metadata was among the first available on the Web in significant quantities, and it provides hierarchically structured access to high-quality web content. DMOZ is one of the largest efforts to manually annotate web pages, and it exports all of this metadata in RDF format. Over 65,000 editors keep the directory reasonably up to date, and the ODP now provides access to over 4 million web pages in its catalogue. The DMOZ data is available as two large RDF dumps, one for the category hierarchy (structure.rdf.u8.gz) and one for the links within each category (content.rdf.u8.gz). We use the category hierarchy file to simulate the ontology directory. For the topic distribution, we select topics from the first four levels of the DMOZ hierarchy; according to a previous research effort [15], these topics follow a heavy-tailed Zipf popularity distribution.

Query: Since both keyword-based and browser-based queries are eventually converted to directory-path (or sub-path) queries, we generate only directory-path queries in this experiment. Queries are generated by instantiating topics chosen from the set of DMOZ topics. We use both a uniform distribution and Zipf distributions to simulate the query requests.

Topology: The directory overlay is built on Pastry. Each peer in Pastry is assigned a 160-bit identifier, and the unique key of a directory is generated using SHA-1 [3] secure hashes. For a network of N peers, Pastry routes to the peer numerically closest to a given key in fewer than log_{2^b} N steps, where identifiers use a sequence of digits of base 2^b. In our simulation, b = 2.

Other parameters: Each node is randomly assigned a value C representing its capacity (C = 2^i, i ∈ {0, 1, 2, 3, 4}). A node's current load is represented by the number of query forwarding requests and query answering requests it receives per unit time. The two kinds of requests are weighted differently to simulate the different costs they incur. In our experiment, query answering is a simple procedure, so we set the ratio of the weight of the query answering load to that of the query routing load to 5 (i.e., a:b = 5:1 in Equation (2)). Given the lightness of the directory-location process in the current experiment, this is a reasonable setting; for heavier operations, such as file transfers, the ratio would be larger by several orders of magnitude.

The simulation is carried out on an overlay network of 10^3 nodes, with the directory topics randomly distributed among them. Queries are issued with different frequencies and distributions (a uniform distribution, and Zipf distributions with different values of the parameter α, which represents how skewed the distribution is: a larger α indicates greater skew). For our experiments, the overload threshold To of each node was set at 0.8 of its capacity and the safety threshold Ts at 0.6. Each experiment was run ten times with different random seeds, and the reported results are the average of these ten runs.
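For concreteness, Zipf-distributed queries over ranked topics can be generated as in this sketch (our own helper, not the paper's simulator; α = 0 degenerates to the uniform case):

    import random

    def zipf_queries(topics, alpha, n_queries, seed=None):
        """Draw queries over ranked topics with probability proportional
        to 1 / rank^alpha; a fresh seed per run varies the workload."""
        rng = random.Random(seed)
        weights = [1.0 / (rank ** alpha) for rank in range(1, len(topics) + 1)]
        return rng.choices(topics, weights=weights, k=n_queries)

    queries = zipf_queries(["/top/arts", "/top/computers", "/top/science"],
                           alpha=1.25, n_queries=10, seed=1)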

Five different load balancing strategies were evaluated and analyzed: (1) simple Pastry, the basic system with no load balancing (denoted None in the following figures); (2) reconfiguring the routing table (RR); (3) duplicating objects/directories (DO); (4) duplicating the routing table (DR); and (5) integrating all three balancing schemes (All). The performance metric used is the load/capacity ratio.

Results

Effect of query distribution

Fig. 6 shows the effect of the query distribution on node load (without any balancing mechanism), plotting the mean, 1st, and 99th percentiles of the peer workload/capacity ratio. The percentile spread represents the workload variance across peers: the greater the spread, the less evenly the load is distributed. In the experiment, we increase the skew of the query distribution from uniform to Zipf with α = 1.25. Query distribution has a significant impact on peer load: the more skewed the distribution, the more unevenly the load is distributed, and some nodes suffer very high load when queries are sufficiently skewed.

Performance of load balancing schemes under different query distributions

Overloading a node can cause its request queue to overflow, so that new incoming queries are dropped, which in turn degrades system performance. Fig. 7 shows an overview of the fraction of dropped queries under different query distributions and with each of our load balancing schemes. We can clearly see that each of our load balancing algorithms reduces the fraction of dropped queries, thus improving system performance. In particular, algorithm All, which integrates all of the other algorithms presented earlier, performs best at minimizing the query drop rate even under a highly skewed query distribution.

A caveat worth mentioning: in Fig. 7, duplicating routing table entries reduces the number of dropped queries more than duplicating objects does. Note that this depends on the parameters we set, particularly the query-load-to-routing-load ratio (a:b = 5:1): a larger ratio means that query answering is more expensive relative to query forwarding and therefore accounts for more of the total load. Overall, the figure gives a clear indication of the effectiveness of our proposed algorithms. The following is a more in-depth examination of the results of each balancing scheme:

Balancing of routing load

Fig. 8 illustrates the performance of each routing-related balancing algorithm as a function of the query insertion rate (network size 10^3, query distribution Zipf with α = 1). The figure shows percentiles of the routing load in terms of query forwarding requests received. As before, the smaller the spread, the better the load balancing performs. As we increase the query frequency, the variance for all algorithms grows: because the query distribution is skewed, a higher query frequency produces more unbalanced requests, exacerbating the existing imbalance.

While most of the experimental results match our expectations, the routing-table reconfiguration scheme contributes surprisingly little performance gain. We attribute this to the following: (1) prefix requirements for the bottom rows of a node's routing table are more stringent, so replacement candidates for these rows are harder to find, leaving the algorithm unable to adjust this part of the routing efficiently; (2) consequently, the last-hop neighbors of a node cannot find replacements for routes to that node, which means that neighbors (in ID space) of a popular node cannot be effectively relieved.

We can also observe from Fig. 8(d) that integrating all of the schemes achieves performance beyond the sum of their individual benefits. We surmise that this is because, although duplicating objects does not balance routing load directly, it redistributes the load of hot spots, relieving the traffic toward them and thus keeping their neighborhoods from being overloaded with forwarding requests.

Balancing of query answering load

Fig. 9 shows the performance of the adaptive object replication algorithm. We can see that the algorithm effectively relieves the overloaded nodes and balances the load, because hot items are quickly replicated to other nodes in the network.

Balancing of the whole system load

Fig. 10 shows the results of the combined algorithm in balancing the system load. The results clearly indicate a significant improvement in the system load balance.


Conclusion

In this paper, we have presented an ontology-based directory overlay that helps grid users locate desirable information. The ontology hierarchy is indexed in a flat DHT overlay, providing nodes with flexible interfaces for locating domains of interest efficiently and in a decentralized fashion. To overcome the major problem of the DHT-based directory overlay, load imbalance, we have proposed an effective scheme to balance the load of the directory overlay. This scheme enables the system to achieve good load balance even when demand is heavily skewed. Extensive simulation results indicate significant improvements in maintaining a more balanced system, leading to improved scalability and performance.


Figures

Fig. 1: Fragment of an example ontology directory

Fig. 2: Fragment of an ontological directory model

Fig. 3: Storing the ontology model into a Pastry network of 6 nodes in an example 8-bit identifier space

Fig. 4: Algorithm of replica propagation in Chord

Fig. 5: An example of the adaptive routing replication algorithm

Fig. 6: Percentiles of the load/capacity ratio under different query distributions on a network without any balancing algorithm (None)

Fig. 7: Fraction of dropped queries under different query distributions and load balancing schemes

Fig. 8: Percentiles of the routing load (number of routing requests) under different query frequencies

Fig. 9: Percentiles of the query load (number of query answering requests) under different query frequencies

Fig. 10: Percentiles of the system load under different query frequencies

Table

Table 1: Hash keys of the model in Fig. 2 in a sample 4-digit identifier space

Hash key   Directory path / keyword
1211       /computer science
0230       /computer science/systems
3211       /computer science/hardware
2011       /computer science/systems/network
1000       /computer science/systems/processor
1013       /computer science/systems/network/architecture
0012       /computer science/systems/network/protocol
2111       Protocol
0211       TCP
1201       IP
2003       Cluster
0012       Grid
0032       P2P

References

1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.

 