Slow Oracle Database Performance on a NetApp (10g, 11g, 12c) Tips

http://www.netapp.com

If you have a NetApp storage appliance, device, SAN, whatever you want to call it, review this first (NetApp Best Practices for Oracle Databases – published March 2014): http://www.netapp.com/us/media/tr-3633.pdf

The quick and dirty (for those experiencing Production issues right now):

The database-side init.ora:

$> grep filesystemio $ORACLE_HOME/dbs/initSID.ora

*.filesystemio_options='SETALL'

The options for filesystemio_options can be summarized as follows:

  • ASYNCH: Asynchronous I/O. Oracle should submit I/O requests to the operating system for processing. This permits Oracle to carry on with other work rather than waiting for I/O completion and increases parallelization of I/O.
  • DIRECTIO: Direct I/O. Oracle should perform I/O directly against physical files rather than routing I/O through the host operating system cache.
  • NONE: Use synchronous and buffered I/O. In these configurations, the choice between shared and dedicated server processes and the number of dbwriters will become more important.
  • SETALL: Use both asynchronous and direct I/O. (preferred for NetApp)

Note: The filesystemio_options parameter has no effect in DNFS and ASM environments. The use of Direct NFS (DNFS) or Automatic Storage Management (ASM) automatically results in the use of both asynchronous and direct I/O.

However, because you might end up cloning to an environment that doesn’t support DNFS (remember, DNFS is linked into the binaries under $ORACLE_HOME/rdbms/lib), you should have it set to SETALL anyway, so that a fall-back to standard NFS still gets asynchronous and direct I/O.

[This refers to the commands used to enable and disable DNFS: cd $ORACLE_HOME/rdbms/lib ; make -f ins_rdbms.mk dnfs_on (or dnfs_off). You also need /etc/oranfstab created to support this, with its server:, local:, path:, export:, mount: entries, etc…]

If you have 11.2.0.4.x or later (12c), there’s a new DNFS-related init.ora parameter, dnfs_batch_size, that is critical for limiting the packet requests the database issues, so that it does not flood the NFS file server. Without it (i.e. on older versions), DNFS only works well on dedicated DNFS storage, because an Oracle database on modern hardware can easily overwhelm the NetApp’s ability to service TCP packets (it can send upwards of 4,000 requests per second):

$> grep dnfs $ORACLE_HOME/dbs/initSID.ora

*.dnfs_batch_size=128
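Both database-side settings can be checked in one go with a small helper instead of eyeballing grep output. This is only a sketch: the function name init_param is made up for illustration, and it assumes a plain-text init.ora (not a binary spfile) that uses simple name=value or *.name=value lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an Oracle tool): print the value of one parameter
# from a text init.ora, with surrounding quotes stripped. If the parameter
# appears more than once, the last occurrence is printed.
init_param() {
  local file="$1" name="$2"
  awk -F= -v want="$name" '
    {
      key = $1
      gsub(/[ \t]/, "", key)          # drop whitespace around the name
      sub(/^\*\./, "", key)           # drop the "*." all-instances prefix
      if (tolower(key) == tolower(want)) {
        val = $2
        gsub(/[ \t\047"]/, "", val)   # drop whitespace and quote characters
        found = val
      }
    }
    END { if (found != "") print found }
  ' "$file"
}
```

For example, init_param $ORACLE_HOME/dbs/initSID.ora filesystemio_options should print SETALL for the file shown above, and the same call with dnfs_batch_size should print 128. An empty result means the parameter is not set in that file.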

On the Linux OS side:

$> grep tcp /etc/sysctl.conf

sunrpc.tcp_slot_table_entries = 128

$> grep nosharecache /etc/fstab

hostname.corp:/vol/testoracled/d01     /u01/app/Oracle              nfs rw,bg,hard,vers=3,proto=tcp,timeo=600,rsize=65536,wsize=65536,nointr,nosharecache

hostname.corp:/vol/testdata            /u01/app/Oracle/oradata     nfs rw,bg,hard,vers=3,proto=tcp,timeo=600,rsize=65536,wsize=65536,nointr,nosharecache
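To compare another host’s mount against the option set used above, something like the following may help. It is a sketch under assumptions: the function name missing_nfs_opts is invented here, and the list of options it looks for is simply the subset of this post’s fstab entries that matter most for throughput, not a universal recommendation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given a comma-separated NFS option string, print the
# options from this post's fstab entries that are missing (space-separated).
# Prints an empty line when nothing is missing.
missing_nfs_opts() {
  local opts="$1" want missing=""
  for want in hard vers=3 proto=tcp rsize=65536 wsize=65536 nosharecache; do
    case ",$opts," in
      *",$want,"*) ;;                                  # option is present
      *) missing="${missing:+$missing }$want" ;;       # remember what is absent
    esac
  done
  printf '%s\n' "$missing"
}
```

Feeding it the fourth field of an /etc/fstab entry is enough, for example: missing_nfs_opts "$(awk '$2 == "/u01/app/Oracle" {print $4}' /etc/fstab)".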

Incidentally, if you want to measure your I/O rates, I have been using SLOB (Silly Little Oracle Basher – http://kevinclosson.wordpress.com/slob/ ) with one small modification to the iostat command in the runit.sh script to capture the I/O activity on all the shares (iostat -mn instead of -xm):

#      ( iostat -xm 3 > iostat.out 2>&1 ) &
( iostat -mn 3 > iostat.out 2>&1 ) &
misc_pids="${misc_pids} $!"
( vmstat 3 > vmstat.out 2>&1 ) &
misc_pids="${misc_pids} $!"
( mpstat -P ALL 3  > mpstat.out 2>&1) &
misc_pids="${misc_pids} $!"

All of this, on top of the usual OEM recommendations (a bigger log_buffer, managing the SGA size, getting block sizes correct), dropped latency from over 40,000ms to under 80ms under heavy load, and the same device is now producing 12,000 IOPS.

This is an R12 e-Business Suite environment running on an 11.2.0.3.13 (PSU JAN2015) database (5TB).

What triggered this investigation was that the Log Writer process (LGWR) began dumping trace files about 6 months ago with entries that look like:

*** 2014-12-21 19:49:29.827
Warning: log write elapsed time 9988ms, size 1KB
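To see how often this happens and how bad it gets, the warnings can be pulled out of the trace files and paired with their timestamps. The sketch below assumes the two-line layout shown above (a "***" timestamp line followed by the warning); the function name slow_log_writes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list LGWR "log write elapsed time" warnings at or above
# a threshold (in ms), using the timestamp line that precedes each warning.
# Usage: slow_log_writes <min_ms> <trace file>...
slow_log_writes() {
  local min_ms="$1"; shift
  awk -v min="$min_ms" '
    /^\*\*\* / { stamp = $2 " " $3 }            # remember the latest timestamp
    /^Warning: log write elapsed time/ {
      ms = $6
      sub(/ms,?$/, "", ms)                      # "9988ms," becomes "9988"
      if (ms + 0 >= min) print stamp, ms "ms", $NF
    }
  ' "$@"
}
```

Run against the entry above with a 1000ms threshold, it prints one line: the timestamp, the elapsed time, and the write size.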

After investigating all the usual suspects for slow disk, I happened upon the aforementioned NetApp white paper. These slow log writes occur when multiple hosts are competing for resources under a single NetApp appliance environment (multiple enclosures, multiple dedicated or shared aggregates, etc.)

Situation addressed and under control.


Smartphone Tablet Art Controller App – WiFi Digital Photo Frames Managed by Template

Simple concept: we’ve bought those digital photo frames that can take various memory cards and flash drives to display our photos. And some of them have become WiFi enabled so you can load pictures from your favorite online cloud storage (e.g. Photobucket, Flickr, Snapfish, etc.)

But what about an app to manage such frames all around your house (or office, or college, or wherever)?

Start with a basic photo library app that can build normal collections and folders, but extend the functionality to allow multiple digital photo frames (or even Smart TVs with WiFi photo RSS feed capability) to be loaded on-demand with your choice of photos.

Use WiFi-compatible SD cards like these to provide the basic connectivity, but assign each device (which usually ends up with a local IP address) as a controllable frame within the collection application (e.g. Frame 1 (living room), Frame 2 (kitchen), Frame 3 through 5 (hallway), etc.) Now assign those IPs to a template “gallery” for the app to manage the content and placement.

Simple uses might be: changing all the digital frames in your house to display the best photos of your children during Mother’s or Father’s Day.  Load historical photos during national holidays. Celebrate a big birthday with a rolling series of funny or serious This is Your Life photos, all being loaded and timed automatically to change at pre-determined intervals.

More advanced use might be professional gallery management, so you can provide previews of forthcoming gallery openings by using inexpensive 11×14 digital frames to give guests an idea of what’s coming next.  Or artists might even end up programming the templates as interactive media showcases or exhibitions unto themselves.

The smartphone or tablet component (or any touchscreen capability) makes it easier to drag and drop photos to specific frames in the template. Imagine the application having a basic floorplan of your house with the various digital frames in placeholder positions, so you could drag and drop photos into them as collection sets.  And save them.  And load them instantly.

@jhlui1 #DreamBig #ChangeTheWorld

Multi-path Multiplexed Network Protocol (TCP/IP over MMNP) Redundant Connections

Because connectivity is becoming less a convenience and more often a necessity, if not a criticality, there will be a built-in demand for 24×7 connectivity to/from data sources and targets.

In professional audio, wireless mics used to be a particularly problematic technology: while allowing free roaming around the stage, they were subject to drop-outs and interference from multiple sources, causing unacceptable interruptions in the audio signal quality of a performance. The manufacturers got together and created multi-channel multiplexing, which transmits the same signal over multiple channels simultaneously, so that if one channel is interrupted, the other(s) can continue unimpeded and guarantee an interruption-free signal.

Now we need the same thing applied to network technology – in particular, the ever-expanding Internet.  Conventional Transmission Control Protocol/Internet Protocol (TCP/IP) addresses single source and single destination routing.  Each packet of data has sender and receiver information with it, plus a few extra bytes for redundancy and integrity checking, so that the receiver is guaranteed that it receives what was originally sent.

The problem occurs when that primary network connection is lost.  The protocol calls for re-transmit requests and allows for re-tries, but effectively once a connection goes down, it is up to the application to decide how to deal with the disconnection.

The answer may be the same as applied to those wireless microphones.  Imagine two router-connected devices, for example a computer and its internet DSL box.  Usually only one wire connects the two, and if the wire is broken, lost, or disconnected, the transmission halts abruptly.

Now imagine having 2 or 4 Cat-5 cables between the devices, along with a network-layer appliance that takes the original TCP/IP packet from the sender and adds rider packets to it: a path number (e.g. cable-1 to cable-4), plus a timing packet (similar to SMPTE code) that allows the receiver appliance to ensure that packets received out of order, due to differing latency on the paths, are reassembled in the same sequential order in which they were transmitted.

Then run these time-stamped and route-encoded duplicate packets through a standard compression and encryption algorithm to negate the effects of the added time and channel packet overhead.

[Addendum: 22-MAY-2015] Think of this time+route concept as similar to how BitTorrent operates.  There are already companies working on channel aggregation appliances, but usually for combining bandwidth.  This approach is focused on the signal continuity aspect of the channel communication.

Reverse the process at the receiving end, and repeat the algorithm for the reverse-data path.

[Transmitter] -- [data+time+channelID] -- [compression/decompression] => (multiple connection routes) => [resequencer] -- [Receiver]
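The receiving end of that pipeline can be sketched in a few lines of shell, just to make the idea concrete. This is a toy under stated assumptions, not a protocol implementation: packets are modelled as text lines of the form "sequence path-id payload", a sequence number stands in for the timing packet, the name resequence is made up, and the whole input is processed as one batch, whereas a real appliance would work on a sliding window of a live stream.

```shell
#!/usr/bin/env bash
# Toy resequencer. Standard input has one line per received packet copy,
#   "<sequence> <path-id> <payload>"
# in arrival order. The first copy of each sequence number is kept, later
# duplicates from the other paths are dropped, and payloads are emitted in
# sequence order.
resequence() {
  awk '!seen[$1]++' |     # first-arriving copy of each sequence number wins
    sort -n -k1,1 |       # restore the original transmit order
    cut -d' ' -f3-        # strip sequence and path fields, keep the payload
}
```

If cable-1 drops a packet entirely, the copy that travelled over cable-2 still arrives, so the output stays complete and in order as long as at least one path delivers each sequence number.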

Time for some creative geniuses to make this happen, yesterday.  Banks need it. Companies need it. Even the communication carriers need this.

@jhlui1 #DreamBig #ChangeTheWorld

R12 e-Business Suite and OEM Monitoring – Oracle Spins Freezes

Every so often, system load on an e-Business Suite instance ramps up and response time to users starts climbing, often resulting in user observed errors such as:

  • FRM-92100 Your connection to the server has been interrupted
  • FRM-92102 A network error has occurred

    FRM-92102 Forms Error R12 EBS
    That dreaded FRM-92102 / FRM-92100 error causing you to restart your session.

Or sometimes, the screen just freezes (aka spins, stops, is broken, stuck, motionless, looks like a screen saver, can’t do anything, won’t work, froze-up, etc.) and the person has to close their browser, or even shut down their workstation and restart.

It’s simply not doing anything – Nothing to see here, just move your cursor around. And wait… and wait.

Old technology often barks error messages that are unrelated to the actual cause.  If there’s a lot going on with concurrent requests, or interfaces, or analytic extracts running, the front-end response time slows down, sometimes sufficiently to trigger these kinds of Forms errors, even though technically there was no interruption to the network connectivity, either between the hosts or between the workstation and the middle-tier application server.

However, on the database side, the user experience can be seen, although not necessarily in the place you might expect.  OEM introduced its Adaptive Metric Thresholds technology back in OEM 11g, in a slightly different place than in 12c (Oracle Management Server/OMS 12.1.0.4.0).  In OEM 11g, they were a link under the AWR Baseline Reports page.

OEM 11g AWR Baselines Page
See the Baseline Metric Thresholds link at the bottom.

In OEM 12c, you’ll find them under

Targets -> Database -> Performance -> Adaptive Thresholds -> Baseline Metric Thresholds -> Edit Thresholds:

OEM 12c Baseline Metric Thresholds
Where those adaptive metric thresholds moved in 12c.

On this page, when you click into the entries in the list of Baseline Metrics, you can access the trending statistics being gathered for each metric.  Many times this will provide direct insight into how what a user experiences as “the system is frozen” translates into “the back-end database response time is incredibly bad.”

OEM 12c Baseline Metric Response Time per Transaction vs. Baseline
See the spikes around 7AM and 11:30AM? Those are being associated with “System Froze” reports.


In the example here, the database experienced a dramatic slow-down, with response times 5 to 10 times slower than usual, which lasted only a few seconds. But that can be enough to show up in the sessions of many users who might have just kicked off a query, or were trying to save something.  Based upon the information gathered, we set the Warning and Critical thresholds to 1500ms and 2000ms respectively to start sending e-mail alert notifications upon breach of the levels. If the settings are left at “None”, no incident would be raised, and thus, no notification would be sent.
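Outside of OEM, the same two-level idea can be tried against any response-time samples you happen to have. The sketch below is not how OEM evaluates its metrics: the function name classify_response_ms is made up, it only handles whole milliseconds, and treating a value equal to the threshold as a breach is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a response time (whole milliseconds) against
# warning and critical thresholds. The defaults are the 1500ms and 2000ms
# values chosen in the text; ">=" as the breach condition is an assumption.
# Usage: classify_response_ms <ms> [warn_ms] [crit_ms]
classify_response_ms() {
  local ms="$1" warn="${2:-1500}" crit="${3:-2000}"
  if [ "$ms" -ge "$crit" ]; then
    echo CRITICAL
  elif [ "$ms" -ge "$warn" ]; then
    echo WARNING
  else
    echo OK
  fi
}
```

With the defaults, a 300ms sample is reported as OK, while a spike to 2400ms is reported as CRITICAL, which is the level at which the e-mail alerts described above would be most urgent.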

If you’re experiencing odd transient outages or sluggish behavior that defies the normal AWR and ADDM snapshot analysis, go take a look at what OEM has been gathering in the background over time and see if the statistics correlate to any of your issues.  There’s value in that data. Just mine it.

Hellmann’s Gets The Squeeze on Mayo (aka Best Foods)

Kewpie Mayonnaise
Kewpie mayonnaise with sample squeeze top (image courtesy of Spice-World)

We often use Kewpie around the house, mostly because it happens to come in a non-messy squeeze bottle that is convenient to take on picnics and pass around the table at family gatherings.

Best Foods/Hellmann’s Squeeze
Hellmann’s Mayonnaise new squeezable bottle.

Thanks to a free sample from Influenster, here is the latest forthcoming packaging from Hellmann’s (also known on the West Coast as Best Foods): the company has decided to release its quite popular (and lighter than Kewpie) mayonnaise in a new top-down squeeze bottle that is quite nicely designed to sit upside-down without tipping.

The typical spout found on similar ketchup bottles has been made pressure-sensitive, allowing a very narrow (2mm) stream of mayo to be dispensed, controlled and accurate enough to write with (or at least make fancy lines and streams), whether decorative or simply portion-controlling (studies show that using a dab here and there can deliver just as much taste as schmearing the mayo all over everything, yet saves on serving size big time).

It’s refillable and, using a smoother plastic interior, is designed to allow most of the mayo to exit the bottle, leaving less waste behind (Hellmann’s quotes “over 1000 lbs. of mayo are wasted each year just because it’s left behind in bottles and jars”).  Look for it coming soon to a grocery shelf near you.