Planning and Solving Storage Performance Problems Fast Using ORtera Heuristics

ORtera's purpose is to make your storage planning and configuration process more effective, faster, cheaper, and fun for you.

You want to lower the response time on your system and to know, well in advance, how much headroom you have before a meltdown becomes possible. ORtera Heuristics will guide you with specific metrics, analytics and recommendations, in plain English. Here is how ORtera will help you:

First, install and start ORtera

When ORtera is first started after installation, it performs an initial discovery and a 3-minute monitoring session on the 10 busiest application processes and the filesystems they use.



You may schedule monitoring sessions of greater length on specific processes and filesystems of your choosing.

When ORtera has completed monitoring, the tree nodes for the processes and filesystems that interest you will indicate by color whether ORtera has detected suboptimal conditions. Red means severe performance-degrading conditions have been detected, yellow means minor conditions, green means no such conditions detected, and gray means no I/O detected during the monitoring period. You should open red and yellow nodes to view the Heuristics tab for each.

In this case, Heuristics indicates that I/O fragmentation at the filesystem level is degrading performance. The heuristic suggests configuring the filesystem for a 1-MB I/O size.

ORtera also reveals fragmentation at the RAID level. RAID fragmentation is a function of the stripe unit size, and Heuristics suggests an increase in stripe unit. Sometimes an I/O is divided on purpose, to spread it over multiple LUNs and channels; however, this technique is used mostly for large sequential I/O. For random I/O, it is generally better not to divide it. When dividing the I/O is not intended, and it causes inefficient use of the lower levels, ORtera identifies it as fragmentation.
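
To make the stripe-unit effect concrete, here is a minimal sketch that counts how many stripe units a single I/O touches. The helper name stripe_pieces and the I/O and stripe sizes are hypothetical, chosen only for illustration; this is not how ORtera itself detects RAID fragmentation.

```python
KB = 1024


def stripe_pieces(offset, length, stripe_unit):
    """Count the stripe units touched by one I/O (hypothetical helper).

    offset, length and stripe_unit are all in bytes.
    """
    if length <= 0 or stripe_unit <= 0:
        raise ValueError("length and stripe_unit must be positive")
    first = offset // stripe_unit
    last = (offset + length - 1) // stripe_unit
    return last - first + 1


# Illustrative sizes only: an aligned 64 KB random I/O.
small_unit = stripe_pieces(0, 64 * KB, 16 * KB)    # divided across 16 KB units
large_unit = stripe_pieces(0, 64 * KB, 128 * KB)   # fits inside one 128 KB unit
print(small_unit, large_unit)
```

With the smaller stripe unit the one request becomes several device operations; with the larger unit it stays whole, which is the preferred outcome for random I/O.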

ORtera reveals that less than 40% headroom exists before saturation of the physical layer. The heuristic suggests that some of the workload be diverted.

ORtera reveals a 16-to-1 ratio of I/O size between the filesystem and logical device, and a 2-to-1 ratio between the logical and physical devices. The I/O size chart verifies the fragmentation condition. You can observe it by Ctrl-selecting each layer to view the transformation between layers.
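
The ratios can be read as the average I/O size at one layer divided by the average I/O size at the layer below it. The sketch below assumes hypothetical average sizes, picked only because they reproduce the 16-to-1 and 2-to-1 ratios quoted here; the actual sizes in this session are not stated.

```python
KB = 1024

# Hypothetical average I/O sizes per layer, ordered top to bottom.
avg_io_size = {
    "filesystem": 128 * KB,
    "logical device": 8 * KB,
    "physical device": 4 * KB,
}


def layer_ratios(sizes):
    """Return (upper, lower, ratio) for each adjacent pair of layers."""
    names = list(sizes)
    return [(up, low, sizes[up] / sizes[low]) for up, low in zip(names, names[1:])]


for upper, lower, ratio in layer_ratios(avg_io_size):
    print(f"{upper} -> {lower}: {ratio:g}-to-1")
```

A ratio well above 1 between two layers means each request is being split into many smaller ones on its way down the stack.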

Random reads and writes are being transformed into sequential reads and writes, while 80% of the application I/O latency is spent on random reads. This I/O-Type chart is another unique presentation of ORtera, not easily obtained by any other means.

This chart may appear difficult to interpret at first sight, but the more you use it, the more it reveals about the workload. The breakout of Ops, Data, and Time by access type gives you insight into the balance of the workload and how homogeneous it is.

In this example, you see a large amount of random write (red bars) at the filesystem layer (the top set of bars) converted to sequential write (yellow bars) at the logical device layer (the middle set of bars). There is also random read (blue bars) being converted to sequential read (green bars). Some of the activity at the logical device layer is not application payload, but filesystem metadata, such as access time and file-size updates.

The type chart allows you to make observations regarding relative I/O size by access type balance. For example, 64% of the operations account for 64% of the data (random read at the filesystem level, in this example), or 25% of the operations account for 12% of the data (random read at the logical device level, in this example). Therefore, at the filesystem level, 64% of the data is read, while at the logical layer only 34% is read. The difference is the amount of page-cache hits on read. As you can tell, this chart is information rich, and worth contemplation time.
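
One way to read the Ops and Data shares together is to divide the data share by the operations share for an access type. A result near 1 means that access type uses roughly the average I/O size for the layer; a result below 1 means its I/Os are smaller than average. The function name relative_io_size is hypothetical, and the calculation is an interpretation of the chart, not a metric ORtera is documented to report.

```python
def relative_io_size(ops_share, data_share):
    """Average I/O size of one access type relative to the layer average.

    Both arguments are fractions of the layer total (0.64 means 64%).
    """
    if ops_share <= 0:
        raise ValueError("ops_share must be positive")
    return data_share / ops_share


# Random read shares quoted in the text.
fs_random_read = relative_io_size(0.64, 0.64)       # filesystem level
logical_random_read = relative_io_size(0.25, 0.12)  # logical device level
print(fs_random_read, round(logical_random_read, 2))
```

Here the filesystem-level random reads are average-sized, while the logical-device random reads are about half the average size for that layer.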

As Amdahl's Law teaches, the greatest potential for performance improvement lies where the most time is spent. Here the big time-consumer is the sequential write at the logical layer that has resulted from the I/O fragmentation you observe in this example.

ORtera reveals a 14-to-1 ratio of load level between the filesystem and physical layers. These are discrete load level measurements, not time averages. Load level, as reported by ORtera, is a true measure of the actual number of concurrent I/O threads, and the portion of time spent, void of idle time. The measure of system load level, reported by most other tools, is averaged over time and does not reflect the true demand on the storage system.
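
The difference between a time-averaged load and a load level that excludes idle time can be illustrated with a small sketch. It assumes each I/O is known as a (start, end) interval, and it is only an illustration of the idea; ORtera's actual calculation is not described here, and the function name load_levels is hypothetical.

```python
def load_levels(intervals, window):
    """Return (time_averaged_load, busy_time_load) for a set of I/O intervals.

    intervals: list of (start, end) times for individual I/Os.
    window: length of the whole observation period, in the same units.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))    # an I/O begins
        events.append((end, -1))     # an I/O completes
    events.sort()

    depth = 0      # number of I/Os currently in flight
    prev = 0.0     # time of the previous event
    busy = 0.0     # total time with at least one I/O in flight
    area = 0.0     # integral of depth over time
    for t, delta in events:
        if depth > 0:
            busy += t - prev
            area += depth * (t - prev)
        prev = t
        depth += delta

    time_averaged = area / window
    busy_only = area / busy if busy else 0.0
    return time_averaged, busy_only


# Four concurrent I/Os lasting 1 second, observed over a 10-second window.
averaged, busy_only = load_levels([(0, 1)] * 4, window=10)
print(averaged, busy_only)
```

Averaged over the whole window, the burst looks like a load of less than one thread; measured only while I/O is in flight, the storage system was handling four concurrent threads.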

The physical layer is operating at 42% of capability. Another important perspective, not generally available without ORtera, concerns the instantaneous arrival and completion rates of the workload. Like the ORtera load-level metric, these are void of idle time. The relationship between them offers insight into the saturation point of the storage configuration. Clearly, an arrival rate that exceeds the completion rate cannot be sustained. ORtera uses capability expectation to calculate Capability Busy, which is based on the observed performance of the current configuration for the current workload composition and average load level. This estimates the sustainable rate of the current configuration at 100% capability for the current workload composition and load level distribution.

There is a large variation in response time. A critical metric for quality of service is not just response time, but the standard deviation of response time. When response time is less variable, performance is more consistent. The ORtera metric, instantaneous bandwidth, is also void of idle time. It shows the true demonstrated bandwidth of the configuration.
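
As a small illustration of why the standard deviation matters, the sketch below compares two hypothetical sets of response times (in milliseconds) that share the same mean. The sample values are invented for the example and are not taken from this monitoring session.

```python
from statistics import mean, pstdev

# Hypothetical response times in milliseconds; both sets average 10 ms.
steady = [10, 10, 10, 10]
bursty = [2, 2, 2, 34]

for name, samples in (("steady", steady), ("bursty", bursty)):
    print(f"{name}: mean={mean(samples):.1f} ms, stdev={pstdev(samples):.1f} ms")
```

Both workloads report the same average response time, but only the first delivers consistent service; the second hides an occasional very slow operation behind the average.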

The load distribution is to a single file at the filesystem layer, and four LUNs at the physical layer. A very powerful feature of ORtera, not readily available by other means, is a view of the resources used at each level sorted by Ops, Data and Time. In particular, it shows individual files in the filesystem. There are several heuristics dealing with homogeneity and balance. When these conditions are detected, they can be viewed here by selecting the subject resources.

You follow the Heuristics guidance and reconfigure the filesystem for 1-MB I/O. Given the large increase in I/O size at the filesystem layer, you raise the original heuristic guidance for a 128-KB stripe unit to 512 KB. In this example, the filesystem, kernel and logical volume manager were reconfigured like this:

  • UFS
    • maxcontig (newfs -C) 128 (1 MB)
    • cgsize (newfs -c) 240 (cylinders)
    • maxbpg (tunefs -e) 6400
  • Kernel
    • maxphys 1048576 (1 MB)
    • md_maxphys 1048576 (1 MB)
  • SVM
    • Stripe unit 512 KB
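
The arithmetic behind these settings can be checked with a short sketch. It assumes an 8 KB UFS logical block size, which is not stated in the text, and the helper name maxcontig_for is hypothetical.

```python
KB = 1024
MB = 1024 * KB

BLOCK_SIZE = 8 * KB   # assumption: UFS logical block size of 8 KB
TARGET_IO = 1 * MB    # desired filesystem I/O size


def maxcontig_for(target_io, block_size):
    """Number of contiguous blocks needed to form one I/O of target_io bytes."""
    if target_io % block_size:
        raise ValueError("target I/O size must be a multiple of the block size")
    return target_io // block_size


maxcontig = maxcontig_for(TARGET_IO, BLOCK_SIZE)
maxphys = TARGET_IO                        # kernel transfer limit, in bytes
pieces_at_128k = TARGET_IO // (128 * KB)   # stripe units per I/O at 128 KB
pieces_at_512k = TARGET_IO // (512 * KB)   # stripe units per I/O at 512 KB
print(maxcontig, maxphys, pieces_at_128k, pieces_at_512k)
```

Under that block-size assumption, a maxcontig of 128 yields the 1-MB I/O, the kernel limits must be at least as large so the request is not split, and the larger stripe unit divides each 1-MB I/O into far fewer pieces.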

Having made the changes, you repeat the monitoring session to see the results. Your configuration change has resulted in greatly improved I/O-size transformations. You have achieved the large I/O desired.

Result: Random writes are no longer converted to sequential writes. There is less sequential read as well. The sequential write component was a result of the fragmentation, and was causing inordinate load levels at the lower levels. The reconfiguration has eliminated this artifact.

Result: Load level ratios are now 4-to-1, all the way down to the physical devices. Before the change there was a 14-to-1 ratio of load level. The physical devices were running near the edge of their performance capability; now they are in a comfortable range. Note the third row, Avg. load level.

Result: Capability utilization was reduced from 27% to 20% at the filesystem level, 47% to 10% at the logical device level, and 42% to 15% at the physical device level.
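
To put these reductions side by side, the following sketch computes the drop in percentage points and the relative reduction for each layer from the figures above. The function name utilization_change is hypothetical.

```python
# Capability utilization (%) before and after, as reported in the text.
before = {"filesystem": 27, "logical device": 47, "physical device": 42}
after = {"filesystem": 20, "logical device": 10, "physical device": 15}


def utilization_change(before, after):
    """Return {layer: (points_dropped, relative_reduction)} for each layer."""
    return {
        layer: (before[layer] - after[layer],
                (before[layer] - after[layer]) / before[layer])
        for layer in before
    }


for layer, (points, relative) in utilization_change(before, after).items():
    print(f"{layer}: down {points} points ({relative:.0%} relative)")
```

The largest relative improvement is at the logical device level, where the fragmentation-induced sequential writes had been concentrated.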

The impact on headroom is a tremendous improvement, and can add months, if not years, to the longevity of the configuration, and prevent a sudden meltdown. Operating the physical resource in a comfortable range also reduces variability in performance, improving quality of service overall.

Result: You have greatly improved response time, reduced response time variation, and greatly increased delivered bandwidth at all layers. The fourth row from the top of this table shows that the raw power of the configuration to deliver I/O has been improved almost threefold. The last two rows show that you have dramatically improved response time and variability of response time at all levels of the configuration.

The ORtera Summary Report delivers printer-ready documentation for the system run-book or other logs. It provides an audit trail of the baseline and of changes in performance, for external documentation and for sharing with supervisors or customers.

In this case, only three performance-degrading conditions were present before diagnosis by ORtera and reconfiguration. ORtera Heuristics diagnoses and guides you in correction of 24 performance-degrading conditions:
  • High Sustained Arrival Rate
  • High Average Arrival Rate
  • Filesystem Space Fragmentation
  • RAID I/O Fragmentation
  • Filesystem I/O Fragmentation
  • Sequential I/O Not Coalesced
  • Excessive Read Ahead
  • Unbalanced File I/O Operations
  • Unbalanced File Data
  • Unbalanced File Response Times
  • Unbalanced Device I/O Operations
  • Unbalanced Device Data
  • Unbalanced Device Response Times
  • Marginal Response Times
  • Unacceptable Response Times
  • High Logical Device Peak Load
  • High Physical Device Peak Load
  • Large I/O Size Variations
  • Unbalanced Partition
  • Extremely Large I/O
  • Low Resource Capability Headroom
  • CPU Overload
  • Over-Allocation of Physical Resources
  • Unused Physical Resources

Conclusion

With ORtera, in minutes you have diagnosed your storage bottlenecks and the causes of low resource headroom on your system, making the system responsive for users, and preventing a costly sudden meltdown. In addition, you now have a full understanding of the capabilities and constraints of your storage system, and you know it is performing at its best.

ORtera makes your storage configuration process more effective, faster, cheaper, and fun.

What Users Say

"The visual DTrace for storage"

"Cracks the storage stack"

"Fun"

"Impressive"

"Easy and intuitive"

"Well thought out"

"By far the best I've seen"

"Incorporating a rigorous model and detailed heuristics"

"I would really recommend it"

©2005 ORtera Inc.