8/21/2016

Session 1086
AIX Performance Tuning Part 2 – I/O
Jaqui Lynch
Flagship Solutions Group
[email protected]
#ibmedge                                   2016 IBM Corporation

Agenda
• Part 1
  • CPU
  • Memory tuning
  • Starter Set of Tunables
• Part 2
  • I/O
  • Volume Groups and File systems
  • AIO and CIO
• Part 3
  • Network
  • Performance Tools

1086 - Edge 2016 - AIX Performance Tuning Pt 2

I/O

Rough Anatomy of an I/O
• LVM requests a PBUF
  • Pinned memory buffer to hold the I/O request in the LVM layer
• Then placed into an FSBUF – 3 types, also pinned:
  • Filesystem       JFS
  • Client           NFS and VxFS
  • External Pager   JFS2
• If paging, then PSBUFs (also pinned) are needed
  • Used for I/O requests to and from page space
• Then queue the I/O to an hdisk (queue_depth)
• Then queue it to an adapter (num_cmd_elems)
• Adapter queues it to the disk subsystem
• Additionally, every 60 seconds the sync daemon (syncd) runs to flush dirty I/O out to filesystems or page space
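Each pinned-buffer layer above has a matching "blocked" counter in `vmstat -v` (shown later in this deck). A small sketch, assuming a captured sample of that output, that maps each counter back to its layer — the counts are illustrative only:

```shell
# Map "blocked" lines from a captured `vmstat -v` back to the buffer layer.
# The sample counts are illustrative, not from a real system.
layer_for() {
    case "$1" in
        *"paging space"*)       echo "VMM page space (psbufs)" ;;
        *"no pbuf"*)            echo "LVM (pbufs)" ;;
        *"external pager"*)     echo "JFS2 (fsbufs via external pager)" ;;
        *"client file system"*) echo "NFS/VxFS (client fsbufs)" ;;
        *"file system"*)        echo "JFS (fsbufs)" ;;
        *)                      echo "unknown" ;;
    esac
}

printf '%s\n' \
  "1468217 pending disk I/Os blocked with no pbuf" \
  "11173706 paging space I/Os blocked with no psbuf" \
  "2048 file system I/Os blocked with no fsbuf" |
while read -r count rest; do
    printf '%-10s %s\n' "$count" "$(layer_for "$rest")"
done
```

On a real system you would pipe `vmstat -v` itself into the loop instead of the sample lines.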

From: AIX/VIOS Disk and Adapter IO Queue Tuning v1.2 – Dan Braden, July 2014
[diagram of the disk and adapter queue layers – not recoverable from this extract]

IO Wait and why it is not necessarily useful
• SMT2 example for simplicity
• System has 7 threads with work; the 8th has nothing so is not shown
• System has 3 threads blocked (red threads)
• SMT is turned on
• There are 4 threads ready to run, so they get dispatched and each is using 80% user and 20% system
• Metrics would show:
  %user = (0.8 * 4) / 4 = 80%
  %sys  = (0.2 * 4) / 4 = 20%
• Idle will be 0% as no core is waiting to run threads
• IO Wait will be 0% as no core is idle waiting for IO to complete, since something else got dispatched to that core
• SO we have IO wait BUT we don't see it
• Also, if all threads were blocked but there was nothing else to run, then we would see IO wait that is very high
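The %user/%sys arithmetic on this slide can be checked mechanically: utilization is averaged over dispatched threads, so the blocked threads simply vanish from the metric. A throwaway sketch of that averaging, using the slide's numbers:

```shell
# Reproduce the slide's arithmetic: 4 dispatched threads, each 80% user /
# 20% system, averaged over 4 logical CPUs. Blocked threads contribute
# nothing, which is why iowait shows 0% even though 3 threads are blocked.
awk 'BEGIN {
    threads = 4; cpus = 4
    user = 0.8 * threads / cpus * 100
    sys  = 0.2 * threads / cpus * 100
    printf "%%user=%.0f %%sys=%.0f %%idle=0 %%iowait=0\n", user, sys
}'
```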

What is iowait? Lessons to learn
• iowait is a form of idle time
• It is simply the percentage of time the CPU is idle AND there is at least one I/O still in progress (started from that CPU)
• The iowait value seen in the output of commands like vmstat, iostat, and topas is the iowait percentages across all CPUs averaged together
• This can be very misleading!
  • High I/O wait does not mean that there is definitely an I/O bottleneck
  • Zero I/O wait does not mean that there is not an I/O bottleneck
  • A CPU in I/O wait state can still execute threads if there are any runnable threads

Basics
• Data layout will have more impact than most tunables
  • Plan in advance
• Large hdisks are evil
  • I/O performance is about bandwidth and reduced queuing, not size
  • 10 x 50GB or 5 x 100GB hdisks are better than 1 x 500GB
  • Also, larger LUN sizes may mean larger PP sizes, which is not great for lots of little filesystems
  • Need to separate different kinds of data, i.e. logs versus data
• The issue is queue_depth
  • In-process and wait queues for hdisks
  • In-process queue contains up to queue_depth I/Os
  • hdisk driver submits I/Os to the adapter driver
  • Adapter driver also has in-process and wait queues
  • SDD and some other multi-path drivers will not submit more than queue_depth IOs to an hdisk, which can affect performance
  • Adapter driver submits I/Os to the disk subsystem
• Default client qdepth for vSCSI is 3
  • chdev -l hdisk? -a queue_depth=20 (or some good value)
• Default client qdepth for NPIV is set by the multipath driver in the client
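Since vSCSI client hdisks default to queue_depth=3, raising them usually means running chdev across many disks. A hedged sketch that only prints the commands for review — the hdisk names and the value 20 are placeholders; pick a value your storage vendor supports:

```shell
# Print (do not run) chdev commands to raise queue_depth on a list of hdisks.
# -P defers the change until the next reboot; useful when the disk is busy.
QDEPTH=20                           # placeholder value from the slide
for d in hdisk4 hdisk5 hdisk6; do   # placeholder disk names
    echo "chdev -l $d -a queue_depth=$QDEPTH -P"
done
```

Review the printed commands, then paste the ones you want; on a live system you would build the disk list from `lsdev -Cc disk` instead.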

More on queue depth
• Disk and adapter drivers each have a queue to handle I/O
• Queues are split into in-service (aka in-flight) and wait queues
  • IO requests in the in-service queue are sent to storage and the slot is freed when the IO is complete
  • IO requests in the wait queue stay there till an in-service slot is free
• queue_depth is the size of the in-service queue for the hdisk
  • Default for a vSCSI hdisk is 3
  • Default for NPIV or direct attach depends on the HAK (host attach kit) or MPIO drivers used
• num_cmd_elems is the size of the in-service queue for the HBA
• Maximum in-flight IOs submitted to the SAN is the smallest of:
  • Sum of the hdisk queue depths
  • Sum of the HBA num_cmd_elems
  • Maximum in-flight IOs submitted by the application
• For HBAs, num_cmd_elems defaults to 200 typically
  • Max range is 2048 to 4096 depending on the storage vendor
  • As of AIX v7.1 tl2 (or 6.1 tl8) num_cmd_elems is limited to 256 for VFCs – see IBM APAR IV63282

Queue Depth
• Try sar -d, nmon -D, iostat -D
• sar -d 2 6 shows columns: device, %busy, avque, r+w/s, avwait, avserv
• avque
  • Average IOs in the wait queue
  • Waiting to get sent to the disk (the disk's queue is full)
  • Values > 0 indicate increasing queue_depth may help performance
  • Used to mean number of IOs in the disk queue
• avwait
  • Average time waiting in the wait queue (ms)
• avserv
  • Average I/O service time when sent to disk (ms)
• See articles by Dan Braden: IBM Techdocs TD105745 and TD106122
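The "smallest of" rule above is worth sanity-checking against your own configuration. A sketch with made-up counts (4 hdisks at queue_depth 20, 2 HBAs at num_cmd_elems 1024, and a hypothetical application cap):

```shell
# Max in-flight IOs to the SAN = min(sum of hdisk queue_depths,
#                                    sum of HBA num_cmd_elems,
#                                    app's own in-flight limit)
hdisk_qdepths="20 20 20 20"   # hypothetical: 4 hdisks
hba_cmd_elems="1024 1024"     # hypothetical: 2 HBAs
app_limit=500                 # hypothetical application cap

sum() { total=0; for n in "$@"; do total=$((total + n)); done; echo "$total"; }

disks=$(sum $hdisk_qdepths)   # 80
hbas=$(sum $hba_cmd_elems)    # 2048
bound=$disks
[ "$hbas" -lt "$bound" ] && bound=$hbas
[ "$app_limit" -lt "$bound" ] && bound=$app_limit
echo "max in-flight IOs = $bound"   # the hdisk queues are the bottleneck here
```

With these numbers the hdisk queues (sum 80) cap everything, which is the common case the slide warns about.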

iostat -Dl
System configuration: lcpu=32 drives=67 paths=216 vdisks=0
[sample per-disk output for hdisk0 through hdisk29 removed – columns include %tm_act, bps, tps, bread, bwrtn plus the read, write and queue service-time statistics]
• tps – transactions per second (transfers per second to the adapter)
• avgserv – average service time
• avgtime – average time in the wait queue
• avgwqsz – average wait queue size
  • If regularly > 0, increase queue_depth
• avgsqsz – average service queue size (waiting to be sent to disk)
  • Can't be larger than queue_depth for the disk
• servqfull – rate of IOs submitted to a full queue, per second
• Look at iostat -aD for adapter queues
• If avgwqsz > 0 or sqfull is high, then increase queue_depth; also look at avgsqsz
• Per IBM, average IO sizes:
  read  = bread / rps
  write = bwrtn / wps
• Also try iostat -RDTl <interval> <count>
  • iostat -RDTl 30 5 does 5 x 30-second snaps

Adapter Queue Problems
• Look at the BBBF tab in NMON Analyzer or run the fcstat command
• fcstat -D provides better information, including high water marks that can be used in calculations
• Adapter device drivers use DMA for IO
• From fcstat on each fcs – NOTE these are cumulative since boot:
  FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 2567
  No Command Resource Count: 34114051
  These count the number of times since boot that IO was temporarily blocked waiting for resources, e.g. num_cmd_elems too low
• No DMA resource     – adjust max_xfer_size
• No adapter elements – adjust num_cmd_elems
• No command resource – adjust num_cmd_elems
• If using NPIV, make changes to both the VIO and the client, not just the VIO
• Reboot the VIO prior to changing client settings
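The average-IO-size arithmetic above ("read = bread/rps") is handy in awk when staring at iostat -D output. A sketch with invented throughput numbers:

```shell
# Average IO sizes per the slide: read = bread/rps, write = bwrtn/wps.
# bread/bwrtn are bytes per second and rps/wps are IOs per second,
# so the quotient is bytes per IO. The numbers below are invented.
awk 'BEGIN {
    bread = 2400000;  rps = 300   # ~2.4 MB/s over 300 reads/s
    bwrtn = 512000;   wps = 125   # ~0.5 MB/s over 125 writes/s
    printf "avg read size:  %.0f bytes\n", bread / rps
    printf "avg write size: %.0f bytes\n", bwrtn / wps
}'
```

Large averages (128KB+) suggest sequential work that benefits from a bigger max_xfer_size; small averages point at random IO where queue depths matter more.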

Adapter Tuning
fcs0 attributes (lsattr -El fcs0; values omitted where not recoverable):
  bus_intr_lvl     Bus interrupt level                                  False
  bus_io_addr      Bus I/O address                                      False
  bus_mem_addr     Bus memory address                                   False
  init_link        INIT Link flags                                      True
  intr_priority    Interrupt priority                                   False
  lg_term_dma      Long term DMA                                        True
  max_xfer_size    Maximum Transfer Size                                True  (default gives a 16MB DMA area)
  num_cmd_elems    Maximum number of COMMANDS to queue to the adapter   True
  pref_alpa        Preferred AL_PA                                      True
  sw_fc_class  2   FC Class for Fabric                                  True

Changes I often make (test first):
  max_xfer_size   0x200000   Maximum Transfer Size – 128MB DMA area for data I/O
  num_cmd_elems   1024       Maximum number of COMMANDS to queue to the adapter
                             Often I raise this to 2048 – check with your disk vendor
• lg_term_dma is the DMA area for control I/O
• Check these are OK with your disk vendor!!!
  chdev -l fcs0 -a max_xfer_size=0x200000 -a num_cmd_elems=1024 -P
  chdev -l fcs1 -a max_xfer_size=0x200000 -a num_cmd_elems=1024 -P
• At AIX 6.1 TL2 VFCs will always use a 128MB DMA memory area even with the default max_xfer_size – I change it anyway for consistency
• As of AIX v7.1 tl2 (or 6.1 tl8) there is an effective limit of 256 on num_cmd_elems for VFCs – see IBM APAR IV63282
• Remember to make changes to both VIO servers and the client LPARs if using NPIV
  • The VIO server setting must be at least as large as the client setting
• See Dan Braden's Techdoc TD105745 for more on tuning

fcstat -D Output
lsattr -El fcs8
  lg_term_dma    0x800000   Long term DMA                                        True
  max_xfer_size  0x200000   Maximum Transfer Size                                True
  num_cmd_elems  2048       Maximum number of COMMANDS to queue to the adapter   True

fcstat -D fcs8
FIBRE CHANNEL STATISTICS REPORT: fcs8
FC SCSI Adapter Driver Queue Statistics
  High water mark of active commands:  512
  High water mark of pending commands: 104
FC SCSI Adapter Driver Information
  No DMA Resource Count: 0
  No Adapter Elements Count: 13300
  No Command Resource Count: 0
  Adapter Effective max transfer value: 0x200000
(Some lines removed to save space)

Per Dan Braden: set num_cmd_elems to at least high-water active + high-water pending, i.e. 512 + 104 = 616
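Dan Braden's sizing rule on this slide (num_cmd_elems >= high-water active + high-water pending) is easy to script against captured fcstat -D output. A sketch parsing the sample values shown above:

```shell
# Parse the high-water marks out of a captured `fcstat -D fcs8` and apply
# the rule: num_cmd_elems >= active high water + pending high water.
fcstat_sample='High water mark of active commands: 512
High water mark of pending commands: 104'

active=$(echo "$fcstat_sample" | awk '/active commands/ {print $NF}')
pending=$(echo "$fcstat_sample" | awk '/pending commands/ {print $NF}')
echo "set num_cmd_elems to at least $((active + pending))"   # 512 + 104 = 616
```

On a live system, feed `fcstat -D fcs8` straight into the two awk calls instead of the captured sample.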

My VIO Server and NPIV Client Adapter Settings
VIO SERVER
#lsattr -El fcs0
  lg_term_dma    0x800000   Long term DMA                                        True
  max_xfer_size  0x200000   Maximum Transfer Size                                True
  num_cmd_elems  2048       Maximum number of COMMANDS to queue to the adapter   True

NPIV Client (running at defaults before changes)
#lsattr -El fcs0
  lg_term_dma    0x800000   Long term DMA                                        True
  max_xfer_size  0x200000   Maximum Transfer Size                                True
  num_cmd_elems  256        Maximum Number of COMMAND Elements                   True

NOTE: the NPIV client settings must be <= the settings on the VIO
VFCs can't exceed 256 after 7.1 tl2 or 6.1 tl8

Tunables

vmstat -v Output – Not Healthy
        3.0  minperm percentage
       90.0  maxperm percentage
       45.1  numperm percentage
       45.1  numclient percentage
       90.0  maxclient percentage
    1468217  pending disk I/Os blocked with no pbuf                   pbufs (LVM)
   11173706  paging space I/Os blocked with no psbuf                  pagespace (VMM)
       2048  file system I/Os blocked with no fsbuf                   JFS (FS layer)
        238  client file system I/Os blocked with no fsbuf            NFS/VxFS (FS layer)
   39943187  external pager file system I/Os blocked with no fsbuf    JFS2 (FS layer)

• numclient = numperm, so most likely the I/O being done is JFS2 or NFS or VxFS
• Based on the blocked I/Os it is clearly a system using JFS2
• It is also having paging problems
• pbufs also need reviewing

lvmo -a Output
2725270 pending disk I/Os blocked with no pbuf
Sometimes the above line from vmstat -v only includes rootvg, so use lvmo -a to double-check
  vgname = rootvg                      <- this is rootvg
  pv_pbuf_count = 512
  total_vg_pbufs = 1024
  max_vg_pbuf_count = 16384
  pervg_blocked_io_count = 0
  pv_min_pbuf = 512
  global_blocked_io_count = 2725270    <- this is the others

Use lvmo -v xxxxvg -a
For other VGs we see the following in pervg_blocked_io_count:
  VG         blocked    total_vg_pbufs
  nimvg      29         512
  sasvg      2719199    1024
  backupvg   6042       4608

lvmo -v sasvg -o pv_pbuf_count=2048 – do this for each VG affected, NOT GLOBALLY
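Bumping pv_pbuf_count per VG (never globally) is mechanical once you know which VGs show blocked IO. A hedged sketch that prints the lvmo commands for the VGs on this slide — the value 2048 is the slide's example, not a universal answer:

```shell
# Print (do not run) per-VG pbuf increases for VGs that showed blocked IO.
# VG names and the target value come from the slide; verify against your
# own `lvmo -v <vg> -a` output first.
for vg in nimvg sasvg backupvg; do
    echo "lvmo -v $vg -o pv_pbuf_count=2048"
done
```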

Parameter Settings – Summary
PARAMETER                        DEFAULT (AIXv6/v7)                  NEW – SET ALL TO
NETWORK (no)
  rfc1323                        0                                   1
  tcp_sendspace                  16384                               262144 (1Gb)
  tcp_recvspace                  16384                               262144 (1Gb)
  udp_sendspace                  9216                                65536
  udp_recvspace                  42080                               655360
MEMORY (vmo)
  minperm%                       3                                   3
  maxperm%                       90                                  90  (JFS, NFS, VxFS, JFS2)
  maxclient%                     90                                  90  (JFS2, NFS)
  lru_file_repage                0                                   0
  lru_poll_interval              10                                  10
  minfree                        960                                 calculation
  maxfree                        1088                                calculation
  page_steal_method              0/1 (TL-dependent on v6), 1 on v7   1
JFS2 (ioo)
  j2_maxPageReadAhead            128                                 as needed
  j2_dynamicBufferPreallocation  16                                  as needed

Other Interesting Tunables
• These are set as options in /etc/filesystems for the filesystem
• noatime
  • Why write a record every time you read or touch a file?
  • mount command option
  • Use for redo and archive logs
• Release behind (or throw data out of file system cache)
  • rbr – release behind on read
  • rbw – release behind on write
  • rbrw – both
• log=null
• Read the various AIX Difference Guides – search the IBM Redbooks site for: aix AND differences AND guide
• When making changes to /etc/filesystems use chfs to make them stick
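Mount options like noatime and rbrw go into /etc/filesystems via chfs so they survive remounts, as the last bullet says. A sketch that prints the commands — the filesystem paths are placeholders:

```shell
# Print (do not run) chfs commands that persist mount options into
# /etc/filesystems. The filesystem paths below are placeholders.
echo "chfs -a options=noatime /oraredo"   # redo/archive logs: skip atime updates
echo "chfs -a options=rbrw /backups"      # release-behind read+write for backup areas
```

After running the real commands, remount the filesystems (or reboot) for the options to take effect.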

filemon
• Uses trace, so don't forget to STOP the trace
• Can provide the following information:
  • CPU utilization during the trace
  • Most active Files
  • Most active Segments
  • Most active Logical Volumes
  • Most active Physical Volumes
  • Most active Files Process-Wise
  • Most active Files Thread-Wise
• Sample script to run it:
    filemon -v -o abc.filemon.txt -O all -T 210000000
    sleep 60
    trcstop
  OR
    filemon -v -o abc.filemon2.txt -O pv,lv -T 210000000
    sleep 60
    trcstop

filemon -v ... -O pv,lv
Most Active Logical Volumes
  util   KB/s     volume
  0.66   45668.9  /dev/gandalfp_ga71_lv
  0.36   960.7    /dev/gandalfp_ga73_lv
  0.13   20363.1  /dev/misc_gm10_lv
  0.11   571.6    /dev/gandalfp_ga15_lv
  0.08   850.0    /dev/gandalfp_ga10_lv
  0.07   6614.2   /dev/misc_gm15_lv
  0.05   270.9    /dev/misc_gm73_lv
  0.05   695.7    /dev/gandalfp_ga20_lv
  0.05   281.4    /dev/misc_gm72_lv
  0.04   668.7    /dev/misc_gm71_lv
(#rblk and #wblk columns not recoverable from this extract)
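Because filemon rides on trace, forgetting trcstop leaves the trace running and eating disk. A hedged wrapper sketch around the sample script above — run_filemon is a made-up helper name; the trap guarantees trcstop fires even if the script is interrupted:

```shell
# Wrap the slide's filemon script so the trace is always stopped, even on
# Ctrl-C or an error. filemon and trcstop are the real AIX command names.
run_filemon() {
    outfile=$1; seconds=$2
    trap 'trcstop; trap - EXIT INT TERM' EXIT INT TERM
    filemon -v -o "$outfile" -O all -T 210000000 || return 1
    sleep "$seconds"
    trcstop
    trap - EXIT INT TERM
}
# Usage: run_filemon abc.filemon.txt 60
```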

filemon -v ... -O pv,lv
Most Active Physical Volumes
  util   KB/s     volume          description
  0.38   8193.7   /dev/hdisk20    MPIO FC 2145
  0.27   5697.6   /dev/hdisk21    MPIO FC 2145
  0.19   9288.4   /dev/hdisk22    MPIO FC 2145
  0.08   3124.2   /dev/hdisk97    MPIO FC 2145
  0.08   3078.8   /dev/hdisk99    MPIO FC 2145
  0.06   4665.9   /dev/hdisk12    MPIO FC 2145
  0.06   5321.6   /dev/hdisk102   MPIO FC 2145
(#rblk and #wblk columns not recoverable from this extract)

Asynchronous I/O and Concurrent I/O

Async I/O - v5.3
Total number of AIOs in use:
  pstat -a | grep aios | wc -l
Or the new way for POSIX AIOs is:
  ps -k | grep aio | wc -l
  4205
Maximum AIO servers started since boot:
  lsattr -El aio0 -a maxservers
  maxservers 320  MAXIMUM number of servers per cpu  True
NB – maxservers is a per processor setting in AIX 5.3
At AIX v5.3 tl05 this is controlled by the aioo command
Also iostat -A
THIS ALL CHANGES IN AIX V6 – SETTINGS WILL BE UNDER IOO THERE
lsattr -El aio0
  autoconfig  defined  STATE to be configured at system restart  True
  fastpath    enable   State of fast path                        True
  kprocprio   39       Server PRIORITY                           True
  maxreqs     4096     Maximum number of REQUESTS                True
  maxservers  10       MAXIMUM number of servers per cpu         True
  minservers  1        MINIMUM number of servers                 True
AIO is used to improve performance for I/O to raw LVs as well as filesystems.

Async I/O – AIX v6 and v7
• No more smit panels and no AIO servers start at boot
• Kernel extensions are loaded at boot
• AIO servers go away if there is no activity for 300 seconds
• Normally you only need to tune maxreqs
ioo -a -F | more
  aio_active = 0
  aio_maxreqs = 65536
  aio_maxservers = 30
  aio_minservers = 3
  aio_server_inactivity = 300
  posix_aio_active = 0
  posix_aio_maxreqs = 65536
  posix_aio_maxservers = 30
  posix_aio_minservers = 3
  posix_aio_server_inactivity = 300
##Restricted tunables
  aio_fastpath = 1
  aio_fsfastpath = 1
  aio_kprocprio = 39
  aio_multitidsusp = 1
  aio_sample_rate = 5
  aio_samples_per_cycle = 6
  posix_aio_fastpath = 1
  posix_aio_fsfastpath = 1
  posix_aio_kprocprio = 39
  posix_aio_sample_rate = 5
  posix_aio_samples_per_cycle = 6

pstat -a | grep aio
  22 a 1608e   1 1608e   0 0 1   aioPpool
  24 a 1804a   1 1804a   0 0 1   aioLpool
You may see some aioservers on a busy system
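Since maxreqs is normally the only AIO knob that needs touching on v6/v7, here is a sketch that reads the current value from captured `ioo -a` output and prints the persistent change — 131072 is an arbitrary example target, not a recommendation:

```shell
# Read aio_maxreqs from captured `ioo -a` output and print (do not run) the
# persistent tuning command. 131072 is an arbitrary example target value.
ioo_sample='aio_maxreqs = 65536
aio_maxservers = 30
aio_minservers = 3'

current=$(echo "$ioo_sample" | awk -F' = ' '/^aio_maxreqs/ {print $2}')
echo "current aio_maxreqs: $current"
echo "ioo -p -o aio_maxreqs=131072"
```

On a live system, replace the captured sample with `ioo -a` itself; -p makes the change persist across reboots.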

AIO Recommendations
Oracle is now recommending the following as starting points:

               5.3      6.1 or 7 (non-CIO)
  minservers   100      3 (default)
  maxservers   200      200
  maxreqs      16384    65536 (default)

These are per LCPU, so for lcpu=10 and maxservers=100 you get 1000 aioservers
AIO applies to both raw I/O and file systems
Grow maxservers as you need to

iostat -A (async IO)
System configuration: lcpu=16 drives=15
aio: avgc avfc maxg  maxf maxr    avg-cpu: %user %sys %idle %iowait
     150  0    5652  0    12288            21.4  3.3  64.7  10.6

Disks:    %tm_act   Kbps     tps     Kb_read     Kb_wrtn
hdisk6    23.4      1846.1   195.2   381485298   61892856
hdisk5    15.2      1387.4   143.8   304880506   28324064
hdisk9    13.9      1695.9   163.3   373163558   34144512

If maxg is close to maxr or to maxservers, then increase maxreqs or maxservers

Old calculation – no longer recommended:
  minservers = active number of CPUs or 10, whichever is the smaller number
  maxservers = number of disks times 10, divided by the active number of CPUs
  maxreqs    = 4 times the number of disks times the queue depth

***Reboot anytime the AIO server parameters are changed
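The per-LCPU multiplication above trips people up, so it is worth computing: maxservers is per logical CPU, and the old maxreqs formula multiplied disks by queue depth. A sketch using the slide's numbers plus a hypothetical disk count:

```shell
# maxservers is per logical CPU: lcpu=10 with maxservers=100 -> 1000 servers.
lcpu=10; maxservers=100
echo "total aioservers: $((lcpu * maxservers))"        # 1000

# Old (no longer recommended) maxreqs rule: 4 * disks * queue_depth.
disks=15; queue_depth=20                                # hypothetical counts
echo "old-style maxreqs: $((4 * disks * queue_depth))"  # 1200
```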

PROCAIO tab in nmon
[chart removed – the maximum number of aioservers seen was 192 but the average was much less]

DIO and CIO
• DIO = Direct I/O
  • Around since AIX v5.1, also in Linux
  • Used with JFS
  • CIO is built on it
  • Effectively bypasses filesystem caching to bring data directly into application buffers
  • Does not like compressed JFS or BF (lfe) filesystems
    • Performance will suffer due to the requirement for 128KB I/O (after 4MB)
  • Reduces CPU and eliminates the overhead of copying data twice
  • Reads are asynchronous
  • No filesystem readahead
  • No lrud or syncd overhead
  • No double buffering of data
  • Inode locks are still used
  • Benefits heavily random access workloads

DIO and CIO
• CIO = Concurrent I/O – AIX only, not in Linux
  • Only available in JFS2
  • Allows performance close to raw devices
  • Designed for apps (such as RDBs) that enforce write serialization at the application level
  • Allows non-use of inode locks
  • Implies DIO as well
  • Benefits heavy update workloads
  • Speeds up writes significantly
  • Saves memory and CPU for double copies
  • No filesystem readahead
  • No lrud or syncd overhead
  • No double buffering of data
  • Not all apps benefit from CIO and DIO – some are better with filesystem caching and some are safer that way
• When to use it
  • Database DBF files, redo logs, control files and flashback log files
  • Not for Oracle binaries or archive log files
• Can get stats using vmstat -IW flags

DIO/CIO Oracle Specifics
• Use CIO where it will benefit you
  • Do not use it for Oracle binaries
  • Ensure redo logs and control files are in their own filesystems with the correct (512) blocksize
    • Use lsfs -q to check blocksizes
  • I give each instance its own filesystem and their redo logs are also separate
• Leave DISK_ASYNCH_IO=TRUE in Oracle
• Tweak the maxservers AIO settings
• Remember CIO uses DIO under the covers
• If using JFS
  • Do not allocate JFS with BF (LFE)
    • It increases the DIO transfer size from 4K to 128K
    • 2GB is the largest file size
  • Do not use compressed JFS – it defeats DIO
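The 512-byte blocksize for redo/control-file filesystems must be set at creation time with agblksize; it cannot be changed afterwards, so the filesystem has to be built that way. A hedged sketch printing a crfs invocation — the VG name, mount point, and size are invented:

```shell
# Print (do not run) creation of a 512-byte-block JFS2 filesystem for redo
# logs. agblksize can only be set at crfs time; names/sizes are invented.
echo "crfs -v jfs2 -g oravg -m /oraredo -A yes -a size=4G -a agblksize=512"
```

Verify afterwards with lsfs -q, as the next slide shows (look for "block size: 512").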

lsfs -q output
/dev/ga7_ga74_lv -- /ga74  jfs2  264241152  rw  yes  no
  (lv size: 264241152, fs size: 264241152, block size: 4096, sparse files: yes, inline log: no, inline log size: 0, EAformat: v1, Quota: no, DMAPI: no, VIX: no, EFS: no, ISNAPSHOT: no, MAXEXT: 0, MountGuard: no)
/dev/ga7_ga71_lv -- /ga71  jfs2  68157440  rw  yes  no
  (lv size: 68157440, fs size: 68157440, block size: 512, sparse files: yes, inline log: no, inline log size: 0, EAformat: v1, Quota: no, DMAPI: no, VIX: no, EFS: no, ISNAPSHOT: no, MAXEXT: 0, MountGuard: no)

It really helps if you give LVs meaningful names like /dev/lv_prodredo rather than /dev/u99

Telling Oracle to use CIO and AIO
If your Oracle version (10g/11g) supports it, then configure it this way:
• There is no default set in Oracle 10g, so you need to set it
• Configure the Oracle instance to use CIO and AIO in the init.ora (P
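A hedged sketch of the usual init.ora parameters for enabling AIO and CIO on AIX JFS2 — disk_asynch_io and filesystemio_options are standard Oracle initialization parameters (setall enables both async and direct/concurrent I/O); verify the values against your Oracle version's documentation:

```shell
# Write an init.ora fragment that enables AIO + CIO on AIX JFS2.
cat > /tmp/init_aio_cio.ora <<'EOF'
disk_asynch_io=true
filesystemio_options=setall
EOF
cat /tmp/init_aio_cio.ora
```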