Lesson 5 | The changing attitudes regarding distributed data |
Objective | Explain why Replication and Distribution are now viable Elements of Database Design. |
Replication and Distribution as Elements of Database Design
Prior to the advent of cheap hard disks in the 1980s, an important design requirement for all databases was to minimize the amount of redundant information. All databases were kept in centralized mainframe environments, and distributed processing was very rare. However, once hard disks became cheap enough to permit replicated data, Oracle introduced the concept of snapshots and its first distributed-database tool, called SQL*Net. A snapshot is an Oracle construct whereby remote copies of a table are refreshed from a master table. This allows a table to be replicated on many Oracle databases.
SQL*Net permits geographically distributed databases to be "linked" such that they function as a single database. The first version of SQL*Net was quite primitive when compared to Network Services, but it did have the advantage of being simple and functional.
In the next lesson, we will look at how Oracle implements the features of a distributed database.
Options for Distributing a Database
How should a database be distributed among the sites (or nodes) of a network?
This is an important issue in physical database design, and it was discussed earlier together with an analytical procedure for evaluating alternative distribution strategies. There are four basic strategies for distributing databases:
- Data replication
- Horizontal partitioning
- Vertical partitioning
- Combinations of the above
We will explain and illustrate each of these approaches using relational databases. The same concepts apply (with some variations) for other data models, such as hierarchical and network models. Suppose that a bank has numerous branches located throughout a state. One of the base relations in the bank's database is the Customer relation. For simplicity, the sample data in the relation apply to only two of the branches (Lakeview and Valley). The primary key in this relation is account number (AcctNumber). BranchName is the name of the branch where customers have opened their accounts (and therefore where they presumably perform most of their transactions).
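The first three strategies can be sketched on a small in-memory version of the Customer relation. This is only an illustrative sketch: AcctNumber and BranchName come from the description above, while the other columns (`customer_name`, `balance`), the four sample rows, and the function names are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    acct_number: int    # primary key (AcctNumber in the text)
    customer_name: str  # hypothetical column, not from the text
    branch_name: str    # BranchName in the text
    balance: float      # hypothetical column, not from the text


# Made-up sample rows covering the two branches named in the text.
CUSTOMERS = [
    Customer(200, "Jones", "Lakeview", 1000.0),
    Customer(324, "Smith", "Valley", 250.0),
    Customer(153, "Gray", "Valley", 38.0),
    Customer(426, "Dorman", "Lakeview", 796.0),
]


def replicate(rows, sites):
    """Data replication: every site receives a full copy of the relation."""
    return {site: list(rows) for site in sites}


def horizontal_partition(rows, branch):
    """Horizontal partitioning: keep whole rows, but only those of one branch."""
    return [r for r in rows if r.branch_name == branch]


def vertical_partition(rows, columns):
    """Vertical partitioning: keep all rows, but only some columns.

    The primary key is always included so that the fragments can be
    joined back together later.
    """
    keep = ["acct_number"] + [c for c in columns if c != "acct_number"]
    return [{c: getattr(r, c) for c in keep} for r in rows]


lakeview_rows = horizontal_partition(CUSTOMERS, "Lakeview")
balance_fragment = vertical_partition(CUSTOMERS, ["balance"])
copies = replicate(CUSTOMERS, ["Lakeview", "Valley"])
```

A combined strategy would apply these functions together, for example replicating a vertical fragment to every branch while storing each horizontal fragment only at its own branch.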
Lazy or Asynchronous Replication
Eager Replication update strategies are synchronous, in the sense that they require the atomic updating of some number of copies. Lazy Group Replication and Lazy Master Replication both operate asynchronously. If the users of distributed database systems are willing to pay the price of some inconsistency in exchange for the freedom to do asynchronous updates, they will insist that:
- the degree of inconsistency be bounded precisely, and that
- the system guarantees convergence to standard notions of correctness.
Without such properties, the system in effect becomes partitioned as the replicas diverge more and more from one another (Davidson et al., 1985).
Lazy Group Replication
Lazy Group Replication allows any node to update any local data. When the originating (root) transaction commits, a follow-up transaction is sent to every other node to apply the root transaction's updates to the replicas at the destination node. It is possible for two nodes to update the same object and race each other to install their updates at other nodes. The replication mechanism must detect this and reconcile the two transactions so that their updates are not lost.
Timestamps are commonly used to detect and reconcile lazy-group transactional updates. Each object carries the timestamp of its most recent update. Each replica update carries the new value and is tagged with the old object timestamp, that is, the timestamp the root transaction saw before it made its change. Each node detects incoming replica updates that would overwrite earlier committed updates by testing whether the local replica's timestamp and the update's old timestamp are equal. If they are, the update is safe: the local replica's timestamp advances to the new transaction's timestamp and the object value is updated. If the current timestamp of the local replica does not match the old timestamp seen by the root transaction, then the update may be dangerous. In such cases, the node rejects the incoming transaction and submits it for reconciliation.
The reconciliation process is then responsible for applying all waiting update transactions in their correct time sequence.
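The timestamp test described above can be sketched as follows. This is a minimal illustration, not a full replication protocol: the `ReplicaUpdate` and `Node` names are hypothetical, timestamps are plain integers, and the check is shown for a single object rather than for a whole multi-object transaction.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaUpdate:
    key: str
    old_ts: int        # timestamp the root transaction saw before updating
    new_ts: int        # timestamp of the root transaction itself
    new_value: object


@dataclass
class Node:
    # key -> (timestamp of most recent update, current value)
    store: dict = field(default_factory=dict)
    # rejected updates waiting for the reconciliation process
    to_reconcile: list = field(default_factory=list)

    def apply(self, upd: ReplicaUpdate) -> bool:
        """Install the update if safe, otherwise queue it for reconciliation."""
        local_ts, _ = self.store.get(upd.key, (0, None))
        if local_ts == upd.old_ts:
            # Safe: no other update reached this replica in between.
            self.store[upd.key] = (upd.new_ts, upd.new_value)
            return True
        # Dangerous: installing it would overwrite a committed update.
        self.to_reconcile.append(upd)
        return False


node = Node(store={"acct-200": (1, 1000.0)})
first = node.apply(ReplicaUpdate("acct-200", old_ts=1, new_ts=2, new_value=900.0))
# A second update that was also based on timestamp 1 has lost the race.
second = node.apply(ReplicaUpdate("acct-200", old_ts=1, new_ts=3, new_value=1100.0))
```

In this sketch the second update is not applied; it stays in `to_reconcile` until a reconciliation step decides how to merge it.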
Transactions that would wait in an Eager Replication system face reconciliation in a Lazy Group Replication system. Waits are much more frequent than deadlocks because it takes two waits to make a deadlock.
Lazy Master Replication
Another alternative to Eager Replication is Lazy Master Replication.
This replication method assigns an owner (master) to each object, and the owner stores the object's correct value. Updates are first done by the owner and then propagated to other replicas. When a transaction wants to update an object, it sends a Remote Procedure Call (RPC) to the node owning the object. To achieve serialisability, a read action should send read-lock RPCs to the masters of any objects it reads. After the master transaction commits, the node originating the transaction broadcasts the replica updates to all the slave replicas, sending one slave transaction to each slave node.
Slave updates are time-stamped to ensure that all the replicas converge to the same final state. If the record's timestamp is newer than the replica update's timestamp, the update is stale and can be ignored. Alternatively, each master node sends replica updates to slaves in sequential commit order.
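The stale-update rule can be illustrated with a small sketch. The function name `apply_slave_update`, the integer timestamps, and the dictionary layout are assumptions chosen for the example; the point is only that replicas receiving the same updates in different orders end up in the same state.

```python
def apply_slave_update(replica, key, update_ts, value):
    """Apply a timestamped slave update unless the stored record is newer.

    `replica` maps key -> (record timestamp, value). Returns True if the
    update was installed and False if it was stale and ignored.
    """
    record_ts, _ = replica.get(key, (0, None))
    if record_ts > update_ts:
        return False  # stale: a newer update has already been applied
    replica[key] = (update_ts, value)
    return True


# Three updates committed at the master, in commit order.
updates = [("acct-200", 1, 1000.0), ("acct-200", 2, 900.0), ("acct-200", 3, 950.0)]

slave_a, slave_b = {}, {}
for key, ts, value in updates:            # delivered in commit order
    apply_slave_update(slave_a, key, ts, value)
for key, ts, value in reversed(updates):  # delivered in reverse order
    apply_slave_update(slave_b, key, ts, value)
```

Both slaves hold the value from timestamp 3 at the end, even though `slave_b` saw the updates in reverse order and discarded the two older ones as stale.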