The Recovery Manager of the System R Database ManagerJIM GRAYTandem Computers, 19333 Vallco Parkway, Cupertino, Californta 95014PAUL McJONESXerox Corporatwn, 3333 Coyote Htll Road, Palo Alto, Cahfornia 94304MIKE BLASGEN, BRUCE LINDSAY, RAYMOND LORIE, TOM PRICE,FRANCO PUTZOLU, AND IRVING TRAIGERIBM San Jose Research Laboratory, 5600 Cottle Road, San Jose, Cahfornm 95193The recovery subsystem of an experimental data management system is described andevaluated. The transactmn concept allows application programs to commit, abort, orpartially undo their effects. The DO-UNDO-REDO protocol allows new recoverable typesand operations to be added to the recovery system Apphcation programs can record datam the transaction log to facilitate application-specificrecovery. Transaction undo andredo are based on records kept in a transaction log. The checkpoint mechanism is basedon differential fries (shadows). The recovery log is recorded on disk rather than tape.Keywords and Phrases transactions, database, recovery, reliabilityCR Categortes: 4.33INTRODUCTIONApplication Interface to System RM a k i n g c o m p u t e r s e a s i e r t o u s e is t h e g o a lof most software. Database managements y s t e m s , in p a r t i c u l a r , p r o v i d e a p r o g r a m ming interface to ease the task of writinge l e c t r o n i c b o o k k e e p i n g p r o g r a m s . T h e rec o v e r y m a n a g e r o f s u c h a s y s t e m in t u r neases the task of writing fault-tolerant application programs.S y s t e m R [ASTR76] is a d a t a b a s e s y s t e mwhich supports the relational model ofd a t a . T h e S Q L l a n g u a g e [CHAM76] p r o vides operators that manipulate the database. Typically, a user writes a PL/I orCOBOL program which has imbedded SQLstatements. A collection of such statementsis r e q u i r e d t o m a k e a c o n s i s t e n t t r a n s f o r -mation of the database. To transfer fundsf r o m o n e a c c o u n t t o a n o t h e r , for e x a m p l e ,requires two SQL statements: one to debitt h e f i r s t a c c o u n t a n d o n e t o c r e d i t t h e second account. In addition, the transactionprobably records the transfer in a historyfile for l a t e r r e p o r t i n g a n d f o r a u d i t i n g p u r poses. Figure 1 gives an example of such ap r o g r a m w r i t t e n in p s e u d o - P L / I .The program effects a consistent transformation of the books of a hypotheticalbank. Its actions are either to discover an error, accept the input message, and produce a failure message,or to discover no errors, accept the input message,Permismon to copy without fee all or part of this material is granted provided that the copies are not made or hstnbuted for direct commercial advantage, the ACM copyright notme and the title of the publication and itsdate appear, and notme is given that copying is by pernusmon of the Association for Computing Machinery. Tocopy otherwise, or to republish, reqmres a fee and/or specific permission. 1981 ACM 0010-4892/81/0600-0223 00.75ComputingSurveys,Vol. 13, No. 2, June 1981

224 J i m Gray et al.CONTENTSINTRODUCTIONApphcatlon Interface to System RStructure of System RModel of Failures1. DESCRIPTION OF SYSTEM R RECOVERYMANAGER1 1 What Is a Transaction?1.2 Transactmn Save Points1 3 Summary2 IMPLEMENTATION OF SYSTEM RRECOVERY2.1 Files, Versmns, and Shadows2 2 Logs and the DO, UNDO, REDO Protocol2.3 Commit Processing2.4 Transactmn UNDO2 5 Transaction Save Points2.6 System Configuratmn, Startup and Shutdown2.7 System Checkpoint2.8 System Restart2 9 Medm Failure2 10 Managing the Log2 11 Recovery and Locking3 EVALUATION3 1 Implementation Cost3 2 Execution Cost3.3 I/O Cost3.4 Success Rate3.5 Complexity3.6 Dmk-Based Log3 7 Save Points3 8 Shadows3.9 Message Recovery, an Oversight3 10 New FeaturesACKNOWLEDGMENTSREFERENCESAv debit the source account by AMOUNT, credit the destination account byAMOUNT, record the transaction in a history file,and produce a success message.The programmer who writes such a program ensures its correctness by ensuringthat it performs the desired transformationon both the database state and the outsideworld (via messages). The programmer andthe user both want the execution to be atomic: either all actions are performed(the transaction has an effect) or the results of all actions are undone (the transaction has no effect); durable: once the transaction completes,Computing Surveys, Vol. 13, No 2, June 1981its effects cannot be lost due to computerfailure; consistent: the transaction occurs asthough it had executed on a system whichsequentially executes only one transactionat a time.In order to state this intention, the SQLprogrammer brackets the transformationswith the SQL statements, BEGINTRANSACTION to signal the beginningof the transaction and COMMITT R A N S A C T I O N to signal its completion.If the programmer wants to return to thebeginning of the transaction, the commandR E S T O R E T R A N S A C T I O N will undoall actions since the issuance of the BEGIN TRANSACTION command (seeFigure 1).The System R recovery manager supports these commands and guarantees anatomic, durable execution.System R generally runs several transactions concurrently. The concurrency control mechanism of System R hides suchconcurrency from the programmer by alocking technique [EswA76, GRAY78,NAUM78] and gives the appearance of aconsistent system.Structure of System RSystem R consists of an external layercalled the Research Data System (RDS),and a completely internal layer called theResearch Storage System (RSS) (seeFigure 2).The external layer provides a relationaldata model, and operators thereon. It alsoprovides catalog management, a datadictionary, authorization, and alternateviews of data. The R D S is manipulatedusing the language SQL [CHAM76]. TheSQL compiler maps SQL statements intosequences of R S S calls.The R S S is a nonsymbolic record-at-atime access method. It supports the notionsof file, record type, record instance, fieldwithin record, index (B-tree associativeand sequential access path), parent-childset (an access path supporting theoperations P A R E N T , FIRST CHILD,NEXT SIBLING,PREVIOUS SIBLING with direct pointers), and cursor(which navigates over access paths to locate

The Recovery Manager of the System R Database ManagerFUNDS TRANSFER. PROCEDURE, BEGIN TRANSACTION;ON ERROR DO; RESTORE TRANSACTION,GET I N P U T MESSAGE;P U T MESSAGE ('TRANSFER FAILED');GO TO COMMIT;END;GET I N P U T MESSAGE;EXTRACT A C C O U N T E B I T , A C C O U N T C R E D I T ,AMOUNT FROM MESSAGE, UPDATE ACCOUNTSSET BALANCE ffi BALANCE - A M O U N TW H E R E ACCOUNTS. N U M B E R ACCOUNT DEBIT; UPDATE ACCOUNTSSET BALANCE BALANCE A M O U N TW H E R E ACCOUNTS. N U M B E R A C C O U N T C R E D I T ; INSERT INTO HISTORY DATE, MESSAGE ;P U T MESSAGE ('TRANSFER DONE');COMMIT: COMMIT TRANSACTIONEND;/*/*/*/* 225in case of error */undo all work */reacquire input */report failure *//* get and parse input *//* do debit *//* do credit *//* keep audit trail *//* report success *//* commit updates *//* end of program */Figure 1. A snnple PL/I-SQL program whmh transfers funds from one account to another.Application Programs in P L / I or COBOL, plus SQLResearch Data System (RDS)* Supports the relational data model Supports the relational language SQL Does naming and authorization Compiles SQL statements into RSS call sequencesResearch Storage System (RSS) Provides nonsymbolic record-at-a-time database access Maps records onto operating system files Provides transaction concept (recovery and locking)Operating System Provides file system to manage disks Provides I/O system to manage terminals Provides process structure (multlprogramming)HardwareFigure 2. System R consists of two layers above theoperating system. The RSS provides the transactionconcept, recovery notions, and a record-at-a-time dataaccess method. The RDS accepts application P L / I orCOBOL programs containing SQL statements. Ittranslates them into COBOL or P L / I programs plussubroutines which represent the compilation of theSQL statements into RSS calls.records). Unfortunately, these objects havethe nonstandard names "segment," "relation," "tuple," "field," "image," "link," and"scan" in the System R documentation.The former, more standard, names are usedhere. RSS provides actions to create instances of these objects and to retrieve,modify, and delete them.The RSS support of data is substantiallymore sophisticated than that normallyfound in an access method; it supports variable-length fields, indices on multiple fields,multiple record types per file, interffle andintraffle sets, physical clustering of recordsby attribute, and a catalog describing thedata, which is kept as a file which may bemanipulated like any other data.Another major contribution of the RSSis its support of the notion of transaction,a unit of recovery consisting of an application-specified sequence of RSS actions. Anapplication declares the start of a transaction by issuing a BEGIN action. Thereafterall RSS actions by that application arewithin the scope of that transaction untilthe application issues a COMMIT or anABORT action. The RSS assumes all responsibility for running concurrent transactions and for assuring that each transaction sees a consistent view of the database.The RSS is also responsible for recoveringthe data to their most recent consistentstate in the event of transaction, action,system, or media failure or a user requestto cancel the transaction.Computing Surveys, Vol. 13, No. 2, June 1981

226 Jim Gray et al.A final component of System R is theoperating system. System R runs under theVM/370 [GRAY75] and the MVS operatingsystem on IBM S/370 processors. The System R recovery manager is also part of theSQL/DS product running on DOS/CICS.The operating system provides processes, asimple file system, and terminal management.System R allocates an operating systemprocess for each user to run both the user'sapplication program and the System R database manager. Application programs arewritten in a conventional programming language (e.g., COBOL or PL/I) augmentedwith the SQL language. A SQL preprocessor maps the SQL statements to sequencesof RSS calls. Typically, a single applicationprogram or group of programs (main plussubroutines) constitute a transaction. Inthis paper we ignore the RDS and assumethat application programs, like those produced by the SQL compiler, consist of conventional programs which invoke sequences of RSS operations.Model of FailuresThe recovery manager eases the task ofwriting fault-tolerant programs. It does soby the careful use of redundancy. Choosingappropriate redundancy requires a quantitative model of system failures.In our experience about 97 percent of alltransactions execute successfully. Of theremainder, almost all fail because of incorrect user input or because of user cancellation. Occasionally {much less than 1 percent) transactions are aborted by the system as a result of some overload such asdeadlock. In a typical system running onetransaction per second, transaction undooccurs about twice a minute. Because of itsfrequency, transaction undo must runabout as fast as forward processing of transactions.Every few days the system restarts (following a crash). Almost all crashes are dueto hardware or operating system failures,although System R also initiates crash andrestart whenever it detects damage to itsdata structures. The state of primary memory is lost after a crash. We assume that thestate of the disks (secondary and tertiarystorage) is preserved across crashes, so atComputing Surveys, Vol. 13, No. 2, June 1981Table 1.Frequency and Recovery Time of FailuresRecovery manager trade-offsFaultTransactionabortSystemrestartMedia failureFrequencyRecoverytuneSeveral per unnuteMillisecondsSeveral per monthSecondsSeveral per yearMinutesrestart the most recently committed stateis reconstructed from the surviving diskstate by referencing a log of recent activityto restore the work of committed andaborted transactions. This process completes within a matter of seconds or minutes.Occasionally, the integrity of the diskstate will be lost at restart. This may becaused by hardware failure (disk head crashor disk dropped on the floor) or by softwarefailure (bad data written on a disk page bySystem R or other program). Such eventsare called media failures and initiate areconstruction of the current state from anarchive version (old and undamaged version of the system state) plus a log of activity since that time. This procedure is invoked once or twice a year and is expectedto complete within an hour.If all these recovery procedures fail, theuser will have lost data owing to an unrecoverable failure. We have very limitedstatistics on unrecoverable failures. Thecurrent release of System R has experienced about 25 years of service in a varietyof installations, and to our knowledge almost all unrecoverable failures have resulted from operations errors {e.g., failureto make archive dumps) or from bugs inthe operating system utility for dumpingand restoring disks. The fact that the archive mechanism is only a minor source ofunrecoverable failure probably indicatesthat it is appropriately designed. Table 1summarizes this discussion.If the archive mechanism fails once everyhundred years of operation, and if there are10,000 installations of System R, then it willfail someone once a month. From this perspective, it might be underdesigned.We assume that System R, the operatingsystem, the microcode, and the hardwareall have bugs in them. However, each of

The Recovery Manager of the System R Database Managerthese systems does quite a bit of checkingof its data structures (defensive programming}. We postulate that these errors aredetected and that the system crashes beforethe data are seriously corrupted. If thisassumption is incorrect, then the situationis treated as a media failure. This attitudeassumes that the archive and log mechanism are very reliable and have failuremodes independent of the other parts ofthe system.Some commercial systems are muchmore demanding. They run hundreds oftransactions per second, and because theyhave hundreds of disks, they see disk failures hundreds of times as frequently astypical users of System R {once a weekrather than once a year). They also cannottolerate downtimes exceeding a few minutes. Although the concepts presented inthis paper are applicable to such systems,much more redundancy is needed to meetsuch demands (e.g., duplexed processorsand disks, and utilities which can recoversmall parts of the database without havingto recover it all every time). The recoverymanager presented here is a textbook one,whose basic facilities are only a subset ofthose provided by more sophisticated systems.The transaction model is an unrealizableideal. At best, careful use of redundancyminimizes the probability of unrecoverablefailures and consequent loss of committedupdates. Redundant copies are designed tohave independent failure modes, making itunlikely that all records will be lost at once.However, Murphy's law ensures that allrecovery techniques will sometimes fail. Asseen below, however, System R can tolerateany single failure and can often toleratemultiple failures.1. DESCRIPTION OF SYSTEM R RECOVERYMANAGER1.1 What is a Transaction?The RSS provides actions on the objects itimplements. These actions include operations to create, destroy, manipulate, retrieve, and modify RSS objects (files, record types, record instances, indices, sets,and cursors). Each RSS action is atomic-- 227it either happens or has no effect--andconsistent--if any two actions relate to thesame object, they appear to execute in someserial order. These two qualities are ensured by (1) undoing the partial effects ofany actions which fail and (2) locking necessary RSS resources for the duration ofthe action.RSS actions are rather primitive. In general, functions like "hire an employee" or"make a deposit in an account" requireseveral actions. The user, in mapping abstractions like "employee" or "account"into such a system, must combine severalactions into an atomic transaction. Theclassic example of an atomic transaction isa funds transfer which debits one account,credits another, writes an activity record,and does some terminal input or output.The user of such a transaction wants it tobe an all-or-nothing affair, in that he doesnot want only some of the actions to haveoccurred. If the transaction is correctly implemented, it looks and acts atomic.In a multiuser environment, transactionstake on the additional attribute that anytwo transactions concurrently operating oncommon objects appear to run serially (i.e.,as though there were no concurrency). Thisproperty is called consistency and is handled by the RSS lock subsystem [ESWA76,GRAY76, GRAY78, NAUM78].The application declares a sequence ofactions to be a transaction by beginning thesequence with a BEGIN action and endingit with a COMMIT action. All interveningactions by that application (be it one orseveral processes) are considered to beparts of a single recovery unit. If the application gets into trouble, it may issue theABORT action which undoes all actions inthe transaction. Further, the system mayunilaterally abort in-progress transactionsin case of an authorization violation, resource limit, deadlock, system shutdown, orcrash. Figure 3 shows the three possibleoutcomes--commit, abort, or system abort i o n - o f a transaction, and Figure 4 showsthe outcomes of five sample transactions inthe event of a system crash.If a transaction either aborts or isaborted, the system must undo all actionsof that transaction. Once a transaction commits, however, its updates and messages toComputmg Surveys, Vol. 13, No. 2, June 1981

228 J i m G r a y e t ITEREADWRITECOMMITABORT SYSTEMABORTSTRANSACTIONFigure 3. The three possible destinms of a transaction. commits, aborts, or is aborted.T1T2T3T4T5IIIIIITime) I I SYSTEMCRASHTICKET AGENTAPPLICATION PROGRAMinput message BEGINSAVE {state)(actions to reserve first hop)SAVE (state)new screennext hop * SAVE (hop)(actions to reserve next hop)last hop * SAVE (hop)(actions to reserve last hop)prmted ticketCOMMIT (reservation)Figure 5. A multihop airlines reservation transactionusing save points. If the application program or ticketagent detects an error, the transaction can undo to aprevious save point and continue forward from theremaking a multihop airline reservation involving several ticket agent interactions-one per hop. The application program estabhshes a save point before and after eachinteraction, with the save point data including the current state of the program variables and the ticket agent screen. If theticket agent makes an error or if a flight isthe external world must persist--its effectsunavailable, the agent or program can backmust be durable. The system will "rememup to the most recent mutually acceptableber" the results of the transaction despiteany subsequent malfunction. Once the sys- save point, thereby minimizing the agent'sretyping. Once the entire ticket is comtem commits to "open the cash drawer" or"retract the reactor rods," it will honor that posed, the transaction commits the datacommitment. The only way to undo the base changes and prints the ticket. Theeffect of a committed transaction is to run program can request a return to save pointa new transaction which compensates for N by issuing the UNDO N action. Of courseall the save points may be washed away bythese effects.a system restart or serious deadlock, butmost error situations can be resolved with1.2 Transaction Save Pointsa minimal loss of the ticket agent's work.The System R save point facility is inThe RSS defines the additional notion oftransaction save point. A save point is a contrast to most systems in which eachfirewall which allows transaction undo to mess