I'm on ASE 15.0.2 EBF 15959 ESD#6, on Sun Solaris 9. I just had a condition
occur that I don't understand. I hope someone can shed some light.
The database was running fine until a process got into a transaction and
then held a lock which blocked other processes, resulting in a blizzard of
blocking locks. The transaction log did not fill up. I checked syslogshold
and dbcc sqltext, but that didn't give me much useful info. The sqltext did
not appear to be anything that could have caused a runaway transaction. I
forgot to check sysprocesses.physical_io at that time.
Anyway, I killed the process that was indicated in syslogshold, and the
process went into the rollback state. Up until now I think I understand what
is happening. But... The rollback never finished. I checked
sysprocesses.physical_io and it was counting up. I let it go for perhaps
half an hour. I understand this might have been caused by a runaway
transaction (though the transaction log never filled up). I decided to
shutdown and restart the ASE.
I restarted the ASE, and was surprised that it came all the way up with no
delay. I had been expecting that the recovery would take a long time to
complete the huge rollback, but that apparently didn't happen. So now I
don't understand what state that original process was in. Was it really
rolling back? Why did restarting the database help? From now on should I
automatically restart the database when a condition like this occurs? What
else could I have looked at to help me diagnose what was happening, either
before or after the rollback?
Subject: Long rollback
X-Newsreader: Microsoft Outlook Express 6.00.2900.5843
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.5579
X-RFC2646: Format=Flowed; Original
Date: 20 Nov 2009 09:14:03 -0800
X-Trace: forums-1-dub 1258737243 10.22.241.152 (20 Nov 2009 09:14:03 -0800)
X-Original-Trace: 20 Nov 2009 09:14:03 -0800, vip152.sybase.com
Xref: forums-1-dub sybase.public.ase.general:28662
Article PK: 77904
User-Agent: Thunderbird 126.96.36.199 (Windows/20070221)
Subject: Re: Long rollback
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
X-Antivirus: avast! (VPS 091118-0, 11/18/2009), Outbound message
Date: 20 Nov 2009 12:17:02 -0800
X-Trace: forums-1-dub 1258748222 10.22.241.152 (20 Nov 2009 12:17:02 -0800)
X-Original-Trace: 20 Nov 2009 12:17:02 -0800, vip152.sybase.com
Xref: forums-1-dub sybase.public.ase.general:28664
Article PK: 77906
> I checked syslogshold
> and dbcc sqltext, but that didn't give me much useful info. The sqltext did
> not appear to be anything that could have caused a runaway transaction. I
> forgot to check sysprocesses.physical_io at that time.
A transaction can contain multiple DML commands. Any of these commands could be capable of obtaining a lock ... a lock
that will be held until the transaction is closed (commit/rollback) ... a lock that can block other spid's.
'dbcc sqltext' typically shows the (partial) SQL text of a *currently* running query; 'dbcc sqltext' will not show you
the SQL text of old/completed queries that would have run earlier in a transaction.
Point being ... in order to find the query that obtained the blocking locks you'll need a complete history of all DML
commands that made up the transaction. This is typically accomplished via the monSysSQLText (MDA) table, monitor
server, Sybase auditing, a network sniffer, or client-side logging of the DML commands.
> I understand this might have been caused by a runaway
> transaction (though the transaction log never filled up). I decided to
> shutdown and restart the ASE.
The excessive physical IOs may have also been a case of rolling back an old transaction who's log pages had already been
flushed from dataserver cache.
The older the transaction *and* the larger the volume of log activity (eg, by other non-blocked spid's) ... the larger
the volume of log pages that can be flushed (from cache) to disk ... the larger the number of log pages that have to be
re-read from disk in order to complete (commit/rollback) a transaction.
In the case of an 'old' unclosed transaction, there's an additional issue/question you'll want to track down ... why
wasn't the transaction closed out in a timely manner? Was the transaction *really* taking a long time to complete its
DML commands, or did the client/application forget to close (commit/rollback) its transaction in a timely manner?
> I restarted the ASE, and was surprised that it came all the way up with no
> delay. I had been expecting that the recovery would take a long time to
> complete the huge rollback, but that apparently didn't happen. So now I
> don't understand what state that original process was in. Was it really
> rolling back? Why did restarting the database help? From now on should I
> automatically restart the database when a condition like this occurs? What
> else could I have looked at to help me diagnose what was happening, either
> before or after the rollback?
Regarding the 'fast' dataserver reboot, my guess is that the rollback probably finished just prior to the shutdown
command taking effect otherwise ... yes ... you should have seen a longer-than-normal recovery period upon rebooting the
NOTE: Part of the database recovery process (at dataserver startup) includes (re)playing non-checkpointed log records;
this includes messages in the errorlog about the number of log records to be processed, and in the case of a large
volume of log activity you'll also see a countdown of the volume of transactions to be processed (something like '60%
complete', '70% complete', etc). What kind of messages are in the dataserver errorlog regarding the recovery process
for the database in question? If you see a small volume of log pages being processed this would tend to indicate that
the rollback had completed, and been checkpointed, prior to the 'shutdown' taking effect.
As to whether or not you should restart the dataserver when an issue like this comes up ... *shrug* ... that's something
you, your management and your users will have to discuss. I suggest you consider ...
- is 100% of your user base affected by the outage (eg, blocking, long rollback)? if not, shutting down the dataserver
*WILL* affect 100% of your users regardless of whether or not they (the users) were affected by the blocking/long-rollback!
- if the rollback has run for a long time, and does not complete before the dataserver is shutdown, upon reboot you'll
have to wait for the rollback to start over from scratch; net result is that you could be extending the outage quite a
bit beyond what you would have had to endure had you just waited for the rollback to complete
- decide on what is considered 'too long' in terms of blocking AS WELL AS long-running (ie, open) transactions, then
implement monitoring (and resolution?) processes to enforce these limits; objective is to catch long-running (ie, open)
transactions and address them before you get to the point where a rollback takes 'too long' to complete
- have your disk-subsystem folks take a look at dataserver device performance metrics; the objective is to insure you
don't have any unnecessary disk bottlenecks (eg, RAID-5 disk in use for logging) that could slow down the dataserver's
disk IOs (read and write)