Transactions [Scratchbook] (Article)

Description :: Atomicity, isolation, conflicts, versioning, commit, rollback, retaining, save points, sub-transactions

Transactions are sometimes called "units of work". All operations performed during a transaction will either succeed or fail together. Up until the end of the transaction, the option to abort (rollback, undo) is still available. An error during the commit operation (save) will cause all operations to fail, rather than only the operation which caused the error.

Beyond the basic transaction facilities usually provided (start, commit, rollback), save-points and sub-transactions may also be available. Save-points allow you to undo the last few operations back to a specified point in the sequence of events. Sub-transactions allow you to start a new transaction, with the added possibility of undoing its changes after it commits. A sub-transaction is only fully committed if it, and all of its parents, commit. A sub-transaction is rolled back (undone) if it, or any of its parents, rolls back. Sub-transactions can emulated save-points, but save-points may not be able to emulate sub-transactions. Whereas save-points occur in sequence, sub-transactions may potentially execute concurrently.

There is still some debate as to when, during a transaction, constraints should be validated. At the beginning and end of a transaction, all constraints should be satisfied. During a transaction, it seems reasonable to expect operations to also leave the database in a consistent state. Because there have been cases where constraints could be so draconian as to make it impossible to move between two valid states, even when the transition itself was valid, some have suggested that multiple statements be allowed to be bundled together, similarly to a sub-transaction. The key difference here would be that several statements are treated as one, without (as much) constraint-checking between them. This solution isn't particularly elegant, is too similar to sub-transactions, and still suffers from problems of isolation between statements and transactions. [Note: this is covered again further below. Should be merged.]

Isolation
Read-uncommitted / dirty read: a transaction can see everything being done by other transactions, before it's even been committed. This is generally not a good idea.

Read-committed: a transaction can see changes made by other transactions as those changes become committed. For simple retrieval operations, this may not cause problems. For reports, this may not be such a good option: as the transaction progresses, things may change "under its feet" resulting in mismatched data on reports, etc.

Repeatable-read / snapshot: a transaction can only see what was true when it started, and its own modifications. This is ideal for reporting, as you are guaranteed to see the same data if you look twice (so long as you haven't changed anything yourself.)

Retaining (option): You may also have the option of "commit retaining" and "rollback retaining" allowing a transaction to "continue" from where it was, rather than starting a new one with a new context. This would allow a snapshot transaction to save its changes mid-way, without beginning to see any of the more recent committed changes that occurred in the mean time. "Rollback retaining" would basically allow a transaction to start over in exactly the same state it had the first time.

Note that many of these options are readily available on multi-generational database systems, designed to provide them. Older systems based purely on locking may not provide them at all. Those systems assert that at any given point in time, a variable has exactly one value. They will generally block or fail when reading a variable currently being modified.

Blocking
Most systems will prevent transactions from overwriting each other's changes. Attempting to write over a variable which is currently being modified by another transaction will result in failure, or on some systems, will put your transaction on hold until the other one releases the write-lock by rolling back. If the other transaction commits, most systems will fail your transaction rather than allow you to modify the variable under the wrong assumptions. (Even in read-uncommitted, your operation is probably assuming that the variable holds a value it no longer does: what you last saw is not necessarily still the case, even if you look often.)

When reading a variable that is being modified by another transaction, options differ. Snapshot isolation levels will let you see old versions of the variable, immediately. Other modes may put your transaction on hold, to give you the latest version (once the changes are committed or rolled back), or operate as with snapshots, showing you the latest committed version only.

Some systems may provide the ability to lock every variable as if it had been modified, when it has only been read. These systems generally do not support snapshot isolation modes or historical data: to provide consistency, only one transaction may modify variables at a time, and you cannot modify anything being used by another transaction (even if only for reading.) This protects reading transactions from being forced to see changes from other (writing) transactions, at the cost of concurrency (the ability to work together.)

Concurrency
Generally, two transactions may not modify the same variable at the same time. This generalization, however, isn't accurate. Two transactions may simply not make conflicting changes. A conflict may be as simple as refusing to let two transactions make changes simultaneously, or it may be more flexible. Most systems allow two transactions to insert new rows into a table at the same time, provided the rows do not conflict in terms of uniqueness. A table (relation variable) is just a variable, however, though a complex one (millions of rows, each with data in many columns.) The rule as first stated would be a problem here. Similarly, two transactions may modify different rows of the table at the same time, under most circumstances. Most systems do not allow transactions to modify the same row, however. (A few may provide some concept allowing groups of attributes to be modified as a unit, with each row having several such groups.) Some systems go so far as to only allow one transaction to modify a table at any given time. (This is true in particular for systems designed mostly for reading, with rare modifications.)

As a general rule, however, the conflicting nature of simultaneous changes is decided on a per-case basis, based on rules for the variable (defined by its domain, or possibly elsewhere.) You could define the rules for something as simple as a number to allow two transactions to do something like "a = a + 20" at the same time; assuming they both committed, the final result would be equivalent to doing "a = a + 40", but either (or both) could roll-back, leaving "a" with an appropriate intermediary value. The problem here is expressing "what you mean to do" so as to allow simultaneous changes to be merged successfully. Assuming "a" held the value "13" before two transactions both requested "a = 33", there would be no easy way to tell whether they both meant to increment it, or assign it a new value directly. Could you safely assume they mean to increment? (Likely not.) In particular, versioning algorithms based on differences will have trouble making these decisions for you. To such a system, "a = a + 20" and "a = 33" look the same, because they have the same beginning and end states, and therefore the same difference. The intent, however, is lost and cannot be recreated. If you've ever worked with CVS (concurrent versioning system) software, particularly to track changes during the development of software or documents among multiple contributors, you've likely run into such problems. Most CVS software works best on text files, where changes are detected on a per-line basis. It will attempt to find the difference between your current working draft and the original version you were working from. That difference is then re-applied to whatever current state the document is in, in the central repository. Simple things like two people both appending a new item to a list can cause confusion: the CVS software doesn't know the meaning of the information it's working on, nor what the rules are. Perhaps you were both adding a similar line, and there should only ever be one such line (one or the other of you will have to give up his changes.) Perhaps order matters, and it must know which new item to put first at the end of the existing list. Perhaps order doesn't matter, and you want it to just "take the change" without throwing a fit. But without a good infrastructure to describe these semantics, such a system remains dumb. It creates conflict descriptions in your copy of the files, which you must resolve manually and try again.

Some regard should be given to the importance of the order of requested actions when multiple transactions are allowed to simultaneously request changes to a shared resource. Neither will have seen each others' changes (most likely), so we can't assume the requests were intended to have any order. If we're going to allow two such requests simultaneously, it seems fitting that order shouldn't matter at all. They should be possible to merge without respect to order (in fact, it'd be best if the implementation responsible for deciding whether or not the requests conflict could not access 'order' information at all, to guarantee order-agnosticism.)

Constraints and Transactions
A database depends on constraints to make sure that outside of any transactions (thus at the beginning of any new transactions) the entire database is in a consistent, sane state obeying all business rules applied to it. To do that, constraints must verify that your actions, particularly when you commit, make sense. But they shouldn't just make sure that at the end of your transaction, the state of the database seen inside the transaction is sane: other transactions may have made changes which, when coupled with yours, put the database in a "wrong" state.

There are several types of constraints we should consider, and their place in "time". In all cases, a transaction should not change the database from a good state to a bad state, or to another good state via a change that isn't allowed. There are therefore state constraints and transition constraints. A state constraint verifies that the cumulative sum of your changes results in a sane state. A transition constraint makes sure that an individual change is allowed (such as changing a person's gender -- though in some rare cases, that makes sense.) Furthermore, while transition constraints run within the context of the transaction making the change, state constraints must also look at the current state of other running transactions (and therefore likely future state of the database.) Generally, database systems verify that two transactions will not both insert (simultaneously) rows with the same primary key information (unique column value constraint.) However, few systems (if any) will allow the database to verify directly that two transactions will not insert (simultaneously) hotel reservations for the same room with overlapping (but not necessarily equal) dates. This is a very necessary constraint, however.

Both transition and state constraints at least apply at the end of a transaction (with transition constraints looking at the changes made since the beginning of the transaction.) Verifying them earlier is a "nicety", but also introduces the problem of granularity: should every single operation in a transaction maintain integrity rules? A transaction represents a single atomic operation, and is therefore a single "thing" which is difficult to split into smaller parts. If some operation must be expressed in multiple statements (which, individually, do not satisfy those rules), is there some mechanism to bundle statements? And if so, why are the rules for statements inside the bundle different from the rules for statements outside the bundle?

This leads us to looking at the benefits of sub-transactions (nested transactions) in terms of their ability to express units of change which should be verified against constraints (not just units of change which can be undone, though the ideas are related.) You could in fact "request" that your individual statements be verified earlier than they normally would by nesting them in a transaction which you commit as early as possible, leading to constraint checking. (It's an idea, at least.) Regardless, state constraints "looking into" the state of other transactions should be fair in the arbitration, not looking at changes made by statements that have not yet passed inspection by constraints. First-come, first-served still seems like a good method though. Whoever fails a constraint check first loses. There would therefore be an advantage to getting your changes "in" as soon as possible, to be seen by constraints running against other transactions.

Perhaps more fair would be to verify constraints inside one running transaction based on the current committed state of all other transactions: their changes become visible to the constraint verification mechanism as they commit (which requires them to satisfy all constraints) -- the problem here is one of convenience. It would be preferable for constraints to issue an error only if the last statement made by a client were at fault. This method would however issue constraint errors based on the timing of commit operations from other transactions, which would make debugging (and reporting error messages to users) more difficult. No single cause of failure on the current transaction's part could easily be identified. Furthermore, this requires that every commit operation also re-verify constraints in all other running transactions.

In the least convenient case, constraints are verified only at the end (commit) of a transaction, with the ability to see the committed state of all other transactions. In other words, the constraint would take your changes, add them to all changes committed by other users, and check to make sure the final state of the database is consistent, and accept your commit request on that basis.

As discussed elsewhere, constraints and domains are quite similar. Domains can be viewed as constraints on the possible values which a may be represented in a given variable. The problem here is that no changes to a variable may violate the variable's domain constraint: if it is a set of items (no duplicates), no action (individually or not) may (even temporarily) make it be anything other than a true set, even if by the end of the transaction this has been correct, or the transaction is rolled back. These are non-deferable constraints. Every requested action must respect them.

[split/join semantics, other possibilities with transactions]

Continued at top