Waits on buffer exterminate (Automated Memory Management Issues)

From OraWiki

Jump to: navigation, search

Contents

Author

(c) 2007 by Steven of Oz
Author
NameGerald Roehrbein
httpwww.OraForecast.com
e-mailGerald.Roehrbein@OraForecast.com
Release0.9
Created2009.06.12
Last update2009.06.12
ChangesStarted article
Estimated duration of installation none
LicenseFreeware


Purpose


Description of what happened if an Oracle database is configured using "Automated Memory Management".

Installation

  • None

Todo

  • Nothing

Recommendations

  • Do not use Oracle Automated Memory Management for databases with heavy load without proper sizing


Example(s)


Graph below show what happened when using AMM if a idle system becomes busy:

Image:Amm.jpg

How to create such an image? That's just simple. I've generated six one hour AWR reports between 12:00 (12PM) and 18:00 (06PM) and pasted the buffer cache and shared pool size into an Excel table. Than I've inserted an Excel graph object and configured it to use the table as base data.

Image below shows the wait events between 3PM and 4PM:

Image:Awramm.jpg


Between 3PM and 4PM a simple connect took up to 15 seconds! Systems performance was at low level but CPU usage and PIO too!

Script fsora SYSTEM_EVENT_NON_IDLE shows:

EVENT                                              TOTAL_WAITS TOTAL_TIMEOUTS TIME_WAITED   AVERAGE_WAIT WP
-------------------------------------------------- ----------- -------------- ----------- -------------- ----------
cursor: pin S wait on X                                1291682        1291202     2436010    5,075.02083     .01%
control file parallel write                            1811837              0     1767751         .97567     .01%
enq: TX - row lock contention                             5860           3479     1074207      451.15792     .01%
enq: TX - index contention                                6149           3320     1035716      366.10675     .01%
log file sequential read                                532868              0     1092605        2.05042     .01%
wait for unread message on broadcast channel             80416          34824     3700294       81.16104     .02%
buffer busy waits                                       271812          35122     3520305       14.87306     .02%
write complete waits                                     62237          61490     6093395    8,157.15529     .03%
db file parallel write                                 7027790              0     6680407         .95057     .04%
log file parallel write                               35179515              0    10873553         .30909     .06%
buffer exterminate                                      131618         128177    12754628    3,706.66318     .07%
log file sync                                         27385903          29859    12294591         .44943     .07%
db file scattered read                               167238316              0    20828616         .12454     .12%
db file sequential read                              394791081              0    25699496         .06510     .14%
io done                                               11433955        5616028   522529590       89.81371    2.93%
SQL*Net message from client                          136057869              0  1.7189E+10      126.33955   96.41%

This output shows a totally false configured cache because the averages for "sequential and scattered read" are faster as expected. Normally a "scattered read" should take with 8k db_block_size and DB_FILE_MULTIBLOCK_READ_COUNT=16 roundabout 0.65cs and a healthy "sequential read" should take 0,3cs (as calculated by my scripts). If they are faster than there are a lot of filesystem and storage system cache reads.

The image above shows the averages of "most of the non idle events" since instance startup. This image is useful to understand one of the benefits of AWR or ADDM or the requirement of having a lot of experiences with a special application and the Oracle statistics. The averages do not point direct to the problem. An experienced Oracle DBA would understand the problem without a Statspack, AWR or ADDM report because he have to know the normal output of V$SYSTEM_EVENT. But this is nothing someone can teach.

The cache is too small but the hit ratio is pretty good (caclulated using fsora tps using "V$SYSSTAT"):

=========================================
=== Total IO statistics from V$SYSSTAT
=========================================
Cache hit ratio  :...  99,95
Physical read/s  :....707.53
Consistent get/s :.12915.221

Since my tuning goal is always to have less than 100 PIO's this system shows a lot more PIO's per second in average. Application executes a lot of "short full scans" per second and "full index scans". To increase cache size is not best because a SQL review is really required. To increase cache size will help as a workaround but it is not the solution for the problem. One part of the problem is "literal SQL". The application uses a lot of literal statements. Bad practise.

Especially for systems with a small load it's difficult to size cache and IO correct. So that's a recommendation (< 100 PIO/s) which varies from system to system. For example: If database is for a middleware which runs only a few hours at night than 60 PIO's per second as calculated using V$SYSSTAT could be too much. In such a case it makes sense to use STATSPACK and or AWR/ADDM or to have a database trace for a few days to understand workload characteristics.

My writings are not the "holy bible". Just some ideas because real performance analysis and tuning is a difficult task which requires a lot more know how as you would expect. This is just the reason for me to publish such facts because there is a lot more to know and tune to fix such issues and require to hire a real performance expert and not one of those which do this "en passant".

After disabling AMM performance increases and system behave normal:

Image:amm_no_thx.png

Security related issues

  • None

How to make this feature as secure as possible

  • None.

Source code of this extension

  • See above

Planned enhancements

  • Will follow

Discussions


Images above show that a system having one week or a few days "normal load" will have huge problems if the load changes. At 3PM buffer cache usage increases so AMM decide to grow buffer cache and reduce shared pool size. When this happens a lot of "buffer exterminate" waits occur.

To solve this issue my recommendations are:

  • Change programming strategies. Do not use literal SQL and use bulk operations whenever possible.
  • Create proper indexes for your queries
  • Size SGA correct.
  • Set a usable SGA_MAX_SIZE and SGA_TARGET_SIZE
  • Set a minimum for DB_CACHE_SIZE (and the other caches if used)
  • Set a minimum for SHARED_POOL_SIZE
  • Set a maximum for shared pool using _SHARED_POOL_MAX_SIZE (calculate correct, only used if using auto SGA/ Automated Memory Management)
  • Use a frequently executed script which flushs shared pool (if you do not want to set _SHARED_POOL_MAX_SIZE)

If the traditional sizing was made correct than it is not required to use AMM. The major problem of AMM is that it will allways resize caches and this in some cases not very efficient.

Consider this: You bought a dedicated Computer for a database with an amount of memory. Why should Oracle resize caches? If resize occur than the database will destroy content of the caches and this will reduce the efficiency of caching.

In my opinion AMM is a stupid feature regardless to the systems configuration. If there are a few instances at one servers hardware than you won't get a benefit from AMM because you have to size all instances to use the required memory.

The major reason for this problem are design changes from Oracle 8.x to 10.x related to the SHARED_POOL. In Oracle 8 SHARED_POOL was allocated inside SGA with size of SHARED_POOL_SIZE. But in some cases SHARED_POOL increases and my experience was that Oracle allocated additional memory for the SHARED_POOL outside SGA. For those made an update from 9.x to 10.x the recommendation of Oracle was to calculate the correct SHARED_POOL size because 10G allocate SHARED_POOL now inside the SGA and if there is not enough space than this will give (for example) ORA-4030 errors.

Since there are only a few guys at planet earth able to make a proper sizing most DBA's decided to use AMM instead. This wastes world wide hundreds of millions of dollars because in most cases companys not having performance specialists which really understand the way Oracle works their staff try to compensate the loss of performance with faster CPU's, faster storage and so on.


Whatever. In my opinion this feature is useless and only good for those who want to blame an old DBA as a stupid because "they know some new features" and argue that traditional sizing is expensive not required anymore.

Well look in a mirror. There you (and me too) will see one or more stupids! ;-) Because using Oracle it's possible to make things allways a little bit better!


My job is to make Oracle look as perfect as possible. So we at OraForecast.com recommend only well tested features. Since Oracle have competitors like MS-SQL server, DB/2 and MySQL (and a lot of others) it is required to give only best advice because I've seen some frustrated Companys migrated to other databases because of bad consulting.


Oracle is less expensive and much more powerful as the other databases but it is complex, a DBA have to know what he's doing and data modellers and programmers too! But if a Company pay for an expert than the Company have the right to get really expert knowledge and today there work too many novices in the Oracle community.

History

  • 2009.06.12 Start of documentation

Known bugs

  • Reported Description of problem Fixed Release
  • 2009.06.12 No issues known

Running example

  • Not required
Personal tools