The xct_t object uses a dedicated latch to synchronize state changes (e.g., commit) with checkpoints. During destruction, the _core object must be destroyed while the latch is held. The problem is that the code below only does that if the latch is already held by someone at that moment.
if (LATCH_NL != latch().mode())
{
    // Someone is accessing this txn, wait until it finished
    w_rc_t latch_rc = latch().latch_acquire(LATCH_EX, WAIT_FOREVER);
    // Now we can delete the core, no one can acquire latch on this txn after this point
    // since transaction is being destroyed
    if(_core)
        delete _core;
    _core = NULL;
    if (false == latch_rc.is_error())
    {
        if (latch().held_by_me())
            latch().latch_release();
    }
}
This code has two problems. First, the core must always be destroyed, not just when the latch is held by someone else. Skipping the deletion potentially causes a memory leak, and is probably the reason for a SEGFAULT that occurs after running transactions for a few minutes (_core was suddenly getting a NULL value).
Second, once the core is deleted, the latch() method returns null as well, so the latch release in the code above should not work (even though it usually does; I have no idea why).
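A minimal sketch of how both problems could be avoided, under the assumption that the latch can be made a direct member of the transaction object instead of living inside the core. All names here (xct_sketch, core_sketch, has_core) are hypothetical stand-ins, and std::mutex replaces the real latch type purely for illustration; this is not the actual xct_t code.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the core object; counts live instances so
// that a leak would be visible.
struct core_sketch {
    static int live;
    core_sketch()  { ++live; }
    ~core_sketch() { --live; }
};
int core_sketch::live = 0;

// Hypothetical stand-in for xct_t. Assumption: the latch is owned by the
// transaction object itself, so it outlives the core.
class xct_sketch {
public:
    xct_sketch() : _core(new core_sketch) {}

    ~xct_sketch() {
        // Always acquire the latch, whether or not someone else holds it.
        std::lock_guard<std::mutex> guard(_latch);
        // Always destroy the core; deleting a null pointer is a no-op.
        delete _core;
        _core = nullptr;
        // guard releases _latch at the end of this body. _latch is still
        // valid at that point because members are destroyed only after
        // the destructor body has finished.
    }

    bool has_core() const { return _core != nullptr; }

private:
    std::mutex   _latch;  // stand-in for the dedicated latch
    core_sketch* _core;
};
```

With this layout the core is freed unconditionally, and the release never goes through a pointer that was just deleted. It does not address a thread that tries to acquire the latch after the transaction object itself is gone; that would need a separate lifetime guarantee.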
In essence, I believe the whole latching mechanism is faulty. For instance, because deleting the core makes latch() return null, a thread that subsequently tries to acquire the latch (e.g., a checkpoint) will encounter a segmentation fault. I.e., the problem is not necessarily the mutual exclusion, but the order of entering the critical section.
We must think of a better solution for checkpointing the transaction table. Decoupled checkpoints solve that problem in a very simple way and eliminate any need for synchronization. However, it might be useful to come up with a solution for standard (non-decoupled) checkpoints as well.