TiVo has been able to improve both the performance and reliability of their Perforce installation using the IC Manage Perforce appliance. The appliance consists of a dual redundancy configuration for all hardware components, a Linux kernel coupled with a high performance journaled file system and the IC Manage Sync Engine. The Sync Engine is a highly scalable method for managing file updates to distributed workspaces across the enterprise. This paper describes the overall appliance architecture and documents the performance gains delivered by the system in a real world environment.
TiVo has been using Perforce successfully for six years. Their server supports two hundred and sixty three users and a fairly complex production build system consisting of nearly fifty dedicated build machines. Performance started to become an issue – users were unhappy with the decreased responsiveness when executing common commands. Internal analysis had already identified that the production build system, which ran every fifteen minutes on the dedicated build machines, was presenting a significant load on the Perforce server because of simultaneous sync operations across large file sets with complex mappings. Of the three hundred and fifty Perforce clients used by the production build system, as many as fifty or sixty may simultaneously access the Perforce server.
Perforce Server Environments
IC Manage always looks very carefully at the hardware implementation of the server because this is a first order effect. Perforce database performance is closely related to both random and sequential I/O performance. Random I/O can quickly become the limiting factor as the db files increase in size and the B-trees become increasingly unbalanced. With large db files, operations are typically disk seek-time limited.
Sequential I/O performance also plays an important role as files sizes, counts and directory depths increase.
The existing server was running on a Sun Enterprise 420R running the Sun Solaris 8 operating system. The storage was provided by a Hitachi 9200 Storage Array configured as part of a Storage Area Network (SAN) with a RAID5 architecture. The file system was provided by Veritas to overcome the shortcomings of the native Solaris UFS file system.
Network storage models such as Network Attached Storage (NAS) and SAN continue to be a popular choice for IT professionals in medium to large enterprises. Ease of use, reliability and clustering for data safety are the primary concerns, with performance often taking a back seat in storage infrastructure decisions. Unfortunately, these network storage models continue to move the disk farther and farther away from the application, both through physical wires and complex network stacks. Operating system kernels are typically architected to maximize performance using a cache, main memory and local disk I/O model. Network storage, particularly SAN’s, can provide good I/O bandwidth, though typically at the expense of latency. This increased latency can often present a severe bottleneck to application performance, but is often not considered during the design phase.
Our general recommendation is to follow a direct attach RAID10 model with data replication to network storage to improve reliability and availability. We also recommend the use of commodity x86 hardware due to the impressive Mean Time Between Failure (MTBF) performance of high volume electronics. We also recommend the use of Linux combined with the XFS file system to provide maximum file system scalability, performance and lowest cost.
Customer specific requirements
At TiVo, the IT administrators were dealing with an unusually high failure rate in their server room for no obvious reasons and were therefore particularly sensitive to maintaining a high availability environment. The challenge was to build an affordable, high availability direct attach server with automatic failover capability.
In order to build a high availability direct attached system, we must first consider the hardware failure modes:
- Disk failure
- Controller failure
- Power supply failure
- Server failure
Recovery from disk failure is typically handled by the use of redundant disk arrays, commonly known as RAID. RAID 5 is a very popular scheme due to its low cost. However, it starts with a performance penalty since parity computation can present a bottleneck to disk throughput. In addition, disk failures in the array result in the system operating in a degraded mode, further damaging responsiveness. In our experience, failures typically occur when you least want them and degraded RAID 5 performance at a critical moment can be quite challenging. RAID10 is an expensive scheme, since data is mirrored for each drive in the system, but it offers excellent performance with no performance penalty on disk failure.
Controller failure can often cause serious data corruption, particularly if the controller uses a caching scheme for delayed writes to the disk subsystem. There are two ways to implement hot swap redundant controllers: active-passive and active-active. The active- passive configuration is when one controller is active and handles all of the workload. The passive controller monitors the active controller for failures and takes over the workload in a failure situation. The passive controller should also keep a mirrored copy of the active controller’s cache. This assures that no data in cache is lost when the passive controller takes over. The active- active configuration is when both controllers are active at the same time. They split the workload and maintain mirrored copies of each other’s cache. As can be imagined, active-active, high-performance RAID controllers will out perform an active-passive configuration. However, it should be noted that redundant controllers are a definite performance vs. reliability tradeoff. Cache consistency is typically performed over the drive loop, resulting in reduced performance compared to a single non-redundant controller.
Power supply failure is handled by having multiple, hot swappable power supplies for all components – typically both the server and disk array will have dual power inputs which should be connected to separate power feeds or UPS connections to tolerate failures in the power supply and switching components.
Server failure can be handled in one of two ways. A cold spare approach is the simplest, but requires manual intervention in the event of a problem. For automatic failover, dual servers can be configured to automatically take over from the other in the event of a system failure.
The following system was chosen to replace the existing hardware:
- 2 x Dual Xeon 1U rack mount servers with software RAID1 system disks
- Fiber Channel active-active RAID controller with two connections per controller
- 10K RPM Fiber Channel drives in a RAID10 configuration
- Dual power supplies for all components
- Linux RedHat9, SGI-XFS kernel
- Linux Heartbeat High Availability
Heartbeat is an open source project that implements a heartbeat protocol. That is, messages are sent at regular intervals between machines and if a message is not received from a particular machine then the machine is assumed to have failed and some form of evasive action is taken. In this configuration heartbeat sends heartbeat messages over one serial link and one Ethernet connection. A null modem cable connects the serial ports and an Ethernet loop back connects the Ethernet ports.
When heartbeat is configured, a master node is selected. When heartbeat starts up, this node sets up an interface for a virtual IP address, which will be accessed by external end users. If this node fails then another node in the heartbeat cluster will start up an interface for this IP address and use gratuitous ARP to ensure that this machine receives all traffic bound for this address. This method of fail-over is called IP Address Takeover.
Optionally, if the master node becomes available again, resources will fail-over so the master node once again owns them.
Each virtual IP address is associated with a resource, a program that Heartbeat will start on startup and stop on shutdown. In our configuration, the Perforce server is the only controlled resource.
With large file count databases, we have seen issues with commercial backup clients (Veritas, Legato) consuming large amounts of CPU during backups. Combined with the I/O load presented by such software, performance can often be severely degraded during the backup process. To mitigate this, a checkpoint scheme followed by ‘rsync’ archive copies to a network volume was implemented.
Perforce Server Load Average
The Perforce Server Load Average indicates the number of processes running and ready to run over that minute. It is a measure of how busy the machine is at that point of time. Since there are more processes running than just the Perforce processes, this will indicate how busy the server machine itself really is.
The new server reduced the load average in half. The increased load from 4:00 A.M and 7:00 A.M on the new server is the Perforce server checkpoint and backup. The improved performance of the new hardware allowed us to resume daily checkpoints rather than the weekly checkpoints run on the old server. Load spikes due to the production build system are still visible, but greatly reduced.
p4d Process Count
The p4d process count is an indicator of the number of Perforce commands in progress at a given point in time. Each time a user issues a Perforce command, the server forks a new p4d to process that command. The p4d process runs until the Perforce command completes and the connection to the client closes or times out. This means should a user suspend the Perforce command, the p4d process will continue to run on the server. This can lead to the slow accumulation of p4d processes if users suspend or otherwise fail to exit interactive or buffered Perforce commands.
The key thing to look at is not just the peaks of how many p4d processes are running (which indicates many Perforce commands), but also how long it takes for the number of p4d processes to return to a baseline. This shows how long the Perforce server remained busy handling commands. Unsurprisingly, there are regular peaks in the number of p4d processes running matching the production build system requests every fifteen minutes.
There are still many p4d processes running, however, the overall peaks are generally lower showing that the new server is completing work sooner. There is also a visible peak due to Perforce commands queued while the Perforce database is locked during the daily checkpoint.
Number of Perforce Commands Run
The number of Perforce commands executed indicates the amount of Perforce client activity. The graphs below show both the total number of Perforce commands executed as well as the number of commands executed by regular (e.g. non-build system) users. The numbers do not correlate directly to the number of p4d processes in the previous graph. The following graphs are the number of Perforce commands logged over a one- minute period whereas the number of p4d processes running is a snapshot at a given point in time. However, the number of Perforce commands executed does correlate with the load average in that more expensive commands (those that use more CPU time) will tend to increase the load average. As expected, the production build system generates visible peaks on this graph as well.
Elapsed Time to Run Perforce Commands
The elapsed time to run the most commonly used Perforce commands is the biggest indicator of improvement between the old server and new server. Following are several charts showing the elapsed time to run commonly used Perforce commands on the old server and new server. Figures 13 through 20 are scatter plots of elapsed time of interesting Perforce commands, using a unique symbol for each Perforce command, over two hour and 48 hour time periods. Figures 13 through 16 are times for commands executed by all users. Figures 17 through 20 exclude commands executed by the production build system. Line markers for 30, 60 and 120 seconds have been added as well.
Figures 21 through 24 are bar charts showing average times of selected Perforce commands over a roughly two month period. The most dramatic improvement seen was for the verify command.
The elapsed time of a Perforce command depends on a variety of factors including the size and complexity of the Perforce client view, the load on the Perforce server, the type of Perforce command executed (interactive, buffered, suspended, etc) and whether the command was executed during the daily Perforce server checkpoint. Despite these factors, we believe the charts show a reasonable enough sample of Perforce usage at TiVo to demonstrate the new server is significantly faster when executing the most commonly used Perforce commands.
Overall, most commands run for one to two seconds. A very small percentage takes longer than thirty seconds. The highest times are commands operating on a large number of files (e.g. a large merge between branches) or transferring a large amount of data (e.g. an initial sync for a Perforce client with a large view). We believe the increased network bandwidth on the new server (1 gigabit Ethernet vs. 100 megabit Ethernet) also contributed to the improved times for those Perforce commands.
Build Load Test Comparison
The following table shows the time required to complete a Build Load Job test. The test itself attempts to simulate the production build system sync request that occurs every fifteen minutes. The values plotted are the averages of five runs, and no other activity was occurring on the Perforce server at the time. As can be seen in Figure 25, the new server starts at better than half the time, and only improves from there.
The largest contributor to the load on Perforce is still the production build system. However, the new server has improved a number of things. The commands are processed more quickly, and the result is that while there are still build load spikes, they no longer have as much of an effect.
The new Perforce server has shown itself to be anywhere from, 2-4 times faster than the old server, and as a result, the users are reporting much better performance. In fact they no longer notice (or recognize) the production build system requests which happen every fifteen minutes because they no longer see a significant pause in Perforce processing.
The Sync Engine
The IC Manage Sync Engine is a software package that efficiently handles the automated syncing of workspaces. In addition to delivering a powerful mechanism to manage file updates to distributed workspaces across the enterprise, the Sync Engine automatically utilizes performance optimizations available in Perforce from Release 2003.2 onwards.
However, this optimization requires you to remember the last synchronization point of your workspace in order to realize maximum performance improvements. Without the Sync Engine, implementing this feature requires modification to the production build system to fully leverage this performance advantage, which in many cases is not a feasible option.
The Sync Engine uses a MySQL database to store information about the client specifications and the most recently synced change for each of the clients it monitors. This means that the work of determining which clients need to be synced based on a given change is handled outside of Perforce. This reduces the load on Perforce permitting it to be more responsive to user commands.
Sync operations are either time based or manually requested. In order for changes to be shared across workspaces automatically in a scalable fashion, a better approach than interval based syncing is required. The Sync Engine introduces the concept of Change Based Synchronization, where sync operations are guaranteed to update a workspace, resulting in high database efficiency and minimization of read locks and other Perforce resource contention issues.
This is achieved by using the Perforce change counter and sequence commands to effectively poll for changed data and then match it to appropriate set of workspaces that are interested in this data. Since the counters and sequence are low overhead operations, significant performance and scalability gains can be realized with this approach. With the Sync Engine, only clients that need new files will be activated via a sync.
There are two modes of operation for the Sync engine: Automatic and On-Demand. The updates can be on the entire client or view (partial client) basis.
In the automatic mode, workspaces are configured to receive the changes as they occur. No actions are required from the user. Failure notifications are sent via email and also recorded in a log. The Sync Engine will execute an rsh or ssh shell and run a shell script, on the appropriate host.
The on-demand mode allows the user to decide when to run the updates. Updates are tracked but not scheduled, allowing a scalable query based update scheme.
A Modified Direct Attach Model: The Caching Accelerator
The direct attach model has significant advantages, but can present challenges to IT integration. The redundant controller setup also reduces performance since cache mirroring is performed over the drive loop. In our testing, the redundant controller configurations were unable to exceed 70-90 Mbytes/s write performance. With highly optimized single controller setups, we are able to achieve 195 Mbytes/second write performance. The best way to combine both performance and reliability is to use the server as a caching accelerator and perform write-back operations to the network storage. This setup allows us to achieve maximum performance with non-redundant controllers and hosts and perform near real time Perforce replication to a SAN or NAS backend.
The accelerator consists of a hardware server with a direct attached disk cache and two software components 1) A near real time data replicator and 2) a fader. The replicator performs change-based replication of all data to the network storage of choice. Using Michael Shield’s “p4jrep” solution as a basis, the software performs change-based replication of both metadata and the archive. This scheme allows for incremental data copies and eliminates the need to perform memory intensive rsync operations at fixed intervals. In the event of a failure in the hardware, a Perforce server can be run on any host with access to the network storage. A very simple monitoring program coupled with IP address takeover can now perform automatic failover. System performance is temporarily degraded while the cache hardware is fixed.
The fader implements automated data migration. It ensures that the cache size of the IC Manage Perforce Accelerator can be kept constant, eliminating the requirement to scale the local storage to match the Perforce archive size. It works by replacing less frequently accessed data object from the disk cache with a link to the persistent storage, enabling the cache to maintain high performance. The fader can also operate in a bi-directional mode when dealing with text files that have been migrated.
The hardware solution delivered significant performance gains. Users were happy with overall responsiveness of the server and checkpoints were able to run daily instead of weekly. Due to time constraints, the Sync Engine deployment is not yet complete at TiVo so there is no data available at this time to demonstrate the power of this solution, but integration testing demonstrates significant improvements. The Caching Accelerator extension with the Replicator and Fader can provide a higher performance, lower cost and simpler solution for integrating direct attach hardware into the overall IT infrastructure.