As already shown, scale-out I/O based on peer-to-peer network communication with local file caching significantly improves compute I/O performance and reduces NFS disk storage capacity requirements. But on-premise compute capacity alone cannot keep up with the demands of today's designs and aggressive project schedules. Increasing on-premise capacity can be a time-consuming and expensive proposition, sometimes taking up to six months to bring needed compute capacity online. Design teams want to run hundreds to thousands of jobs immediately, creating pent-up demand for elastic compute power. Taking advantage of the cloud's virtually unlimited compute power, state-of-the-art servers, and networking can enable high-performance elastic compute. However, moving a design workspace of up to hundreds of terabytes to the cloud is not a trivial task.
Chip design and verification workflows are very complex, with multiple interdependent NFS environments usually comprising tens of millions of files and spanning 100+ terabytes to petabytes of storage once all the EDA tools and scripts, foundry PDKs, and managed and unmanaged data are included. Enabling the cloud to run all these workflows and jobs can be daunting without a means to easily and efficiently synchronize these tens of millions of interdependent on-premise files across domains. Tools such as rsync or ftp can be very slow and costly, essentially eliminating any ability to gain fast access to cloud compute resources. The cost of cloud storage can also add up, especially if duplicate copies of the design are maintained in both on-premise and cloud environments. Determining the correct subset of design data to copy to the cloud is extremely hard because of the interdependencies between the data and legacy workflow scripts. More importantly, on-premise EDA tools and workflows are built on an NFS-based shared storage model, in which large numbers of compute nodes share the same data. In contrast, the cloud primarily uses local block storage, which is typically accessed by only one host at a time. This infrastructure disparity can become a major challenge to overcome.
On-demand hybrid cloud bursting enables existing on-premise compute farm workflows to run identically in the cloud, delivering elastic high-performance compute by taking advantage of the cloud's virtually unlimited compute power and the wide availability of local NVMe storage on cloud compute nodes. The ability to automatically run jobs in the cloud using unmodified on-premise workflows preserves the millions of dollars and person-hours already invested in on-premise EDA tools, scripts, and methodologies. Engineering jobs can 'cloud burst' during times of peak demand, providing additional capacity transparently.
A peer-to-peer cache fabric scales out I/O performance by simply adding more compute nodes as peers. Storing only the needed portions of the design workspace in the cache fabric eliminates the need for duplicate NFS storage in the cloud. Once the cloud caches are hot, they can be fully decoupled from the on-premise environment, reducing storage cost and minimizing NFS filer performance bottlenecks. Working at the file-extent level allows ultra-fine-grained data transfer, delivering low-latency bursting: only the bytes that are actually accessed are forwarded into the cloud, and selectively filtered design data deltas are written back to on-premise storage for post-processing if necessary. Not having to duplicate all on-premise data significantly reduces cloud data storage costs, sometimes by as much as 99%. Because only temporary cache storage is used, all data disappears without a trace the moment the environment is shut down after jobs complete, and cloud compute nodes can be shut down upon job completion so there are no bills for idle CPUs.
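The extent-level transfer model described above can be illustrated with a minimal sketch: a cache that serves reads locally and pulls only the missing, fixed-size extents from the on-premise origin. The class name, extent size, and callback interface are illustrative assumptions, not a real product API.

```python
# Minimal sketch of extent-granular caching: only the byte ranges a tool
# actually reads are transferred from the origin (on-premise) store.
# EXTENT size and the fetch callback signature are assumptions.

EXTENT = 1 << 20  # 1 MiB cache granularity (assumed)

class ExtentCache:
    def __init__(self, fetch_extent):
        self.fetch_extent = fetch_extent  # callback: (path, extent_index) -> bytes
        self.cache = {}                   # (path, extent_index) -> cached bytes
        self.bytes_fetched = 0            # total bytes pulled from origin

    def read(self, path, offset, length):
        """Serve a read, pulling only the missing extents from origin."""
        first, last = offset // EXTENT, (offset + length - 1) // EXTENT
        buf = bytearray()
        for idx in range(first, last + 1):
            key = (path, idx)
            if key not in self.cache:          # cold extent: fetch exactly once
                data = self.fetch_extent(path, idx)
                self.cache[key] = data
                self.bytes_fetched += len(data)
            buf += self.cache[key]
        start = offset - first * EXTENT        # trim to the requested byte range
        return bytes(buf[start:start + length])
```

Under this model, a tool that reads a few kilobytes of a multi-gigabyte database transfers only the extents it touches; repeated reads of hot extents generate no further origin traffic.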
A scale-out architecture utilizing local NVMe as a peer-to-peer cache network enables high-performance elastic compute for EDA tool acceleration. Hybrid cloud bursting provides immediate and transparent access to the cloud's virtually unlimited capacity. Virtually projecting on-premise storage into the cloud enables instant bursting, reducing cloud storage cost and eliminating NFS filer bottlenecks.
Accurate Insight into Project Progress
Being able to analyze project progress, identify resource optimization opportunities, predict the tapeout schedule, and prevent IP theft are critical success factors for today's design projects. The standard methods of top-down project schedule tracking and status reporting are reactive: they address a problem only after it becomes obvious, which makes resolution much harder and costlier.
Complex IC projects generate terabytes of design-change and tool-run data per day. Source code edits, verification runs, synthesis, power, timing, place and route, and DRC/LVS tasks all generate large amounts of data that individual engineers use in their day-to-day work, but little to no aggregation or detailed analysis of these results takes place. Aggregating these results and correlating them across project components and domain activities can provide an accurate assessment of project progress. Without this large-scale analysis, engineers and project leads spend many hours compiling and analyzing the state of their project to determine their chances of meeting project objectives, and the existing siloed analysis methodologies based on isolated log data can be very slow and can hide risks in the process.
Beyond tool result logs, design and verification teams today have no access to file activity logs, leaving them without a clear picture of what is happening with their design data. Instrumenting the peer-to-peer local cache network provides detailed log information for every transaction on every file by every tool. This massive amount of activity data can be automatically stored and analyzed to provide accurate design and verification activity analytics and resource utilization metrics, and can also be used for tapeout schedule prediction.
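A sketch of the kind of aggregation such per-file transaction logs enable is shown below. The log record fields (`user`, `tool`, `path`) are assumed for illustration; no fixed schema is defined in the text.

```python
# Hypothetical sketch: rolling up per-file cache-transaction logs into
# simple activity metrics. Record field names are assumptions.
from collections import Counter

def summarize_activity(log_records):
    """Count operations per (user, tool) and unique files touched per user."""
    ops = Counter()
    files_by_user = {}
    for rec in log_records:  # each rec: {"user": ..., "tool": ..., "path": ...}
        ops[(rec["user"], rec["tool"])] += 1
        files_by_user.setdefault(rec["user"], set()).add(rec["path"])
    return ops, {u: len(paths) for u, paths in files_by_user.items()}
```

In practice such summaries would be computed continuously over streaming logs; the same counts feed both utilization dashboards and the anomaly detection discussed later.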
By tracking all key resources (compute, storage, EDA licenses, active users) needed to develop complex IP and SOCs and analyzing every facet of design activity in real time, engineers and managers can see the progress of their project at any level of granularity. Metrics such as the release-readiness state of an IP or SOC, based on verification status, QoR, DRC errors, timing, and bug rates, together with the amount of resources consumed to develop, verify, and integrate IP and SOCs, enable engineers to make real-time data-driven decisions.
For example, applying big data and machine learning techniques to structure this data helps accelerate verification schedules by providing an immediate view of key metrics against customized targets for regression runs, debug convergence, and coverage progress. Customizable, interactive analysis provides reports of verification progress versus plans, teams, and sub-systems, allowing management to focus on project progress and optimize resource allocation. Interconnecting unstructured verification data with structured IP repository design data creates a hybrid database that can be analyzed to identify problem areas and link them to related design activity or changes across the SOC/IP hierarchy. This allows engineering teams and management to identify project bottlenecks and understand their root causes.
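Comparing key metrics against customized targets can be sketched simply. The metric names and thresholds below are assumptions chosen for illustration, not values from the text.

```python
# Illustrative sketch: checking regression pass rate and per-block coverage
# against customized targets. Metric names and thresholds are assumptions.

def verification_status(runs, coverage, targets):
    """Compare pass rate and coverage to targets; list blocks below target."""
    passed = sum(1 for r in runs if r["result"] == "pass")
    pass_rate = passed / len(runs)
    below = [blk for blk, cov in coverage.items() if cov < targets["coverage"]]
    return {
        "pass_rate": pass_rate,
        "meets_pass_target": pass_rate >= targets["pass_rate"],
        "blocks_below_coverage": sorted(below),
    }
```

A dashboard built on such checks lets a verification lead see at a glance which sub-systems are on track and which need attention.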
Further drill-down into the data can reveal key metrics for measuring resource utilization, such as human capital, EDA tool license allocation, and the level of effort required to produce and qualify IP. Based on this information, engineers and managers can prioritize compute and human resources for best results, and the cost-of-development data captured with these techniques can feed into future IP make-versus-buy decisions.
Taking this analysis one step further, predicting tapeout schedules requires accurately aggregating, filtering, and analyzing billions of data points to identify which tasks and design components are on time and which are behind schedule. Every engineering process that consumes or generates data can be a key performance indicator (KPI) for tapeout, and each KPI needs to be extracted, compared with historical patterns, and tuned throughout the design flow to estimate the completion date well in advance. This allows engineers and managers not only to predict the likely tapeout date but also to determine the human, license, and compute resources required to meet the schedule, and to adjust resources early enough to reduce schedule slip.
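The core idea, extrapolating progress trends to a completion date, can be shown with a deliberately simple toy: linear extrapolation of a single KPI such as weekly coverage closure. A real predictor would combine many KPIs with historical patterns; this sketch only conveys the mechanism.

```python
# Toy sketch of schedule prediction: linearly extrapolate one KPI
# (e.g., percent coverage closed per week) to its completion point.
# A production model would fuse many KPIs with historical project data.

def predict_completion_week(history, target=100.0):
    """history: list of (week, kpi_value); returns estimated week KPI hits target."""
    (w0, v0), (w1, v1) = history[0], history[-1]
    rate = (v1 - v0) / (w1 - w0)      # average progress per week
    if rate <= 0:
        return None                    # no measurable progress: cannot predict
    return w1 + (target - v1) / rate
```

Running the same extrapolation per task and taking the latest completion estimate identifies the components most likely to gate tapeout.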
IP is the most valuable asset of today's semiconductor companies, and IP theft is a serious and growing problem that leads to lost competitive advantage, revenue, and profits. IP losses often occur due to an inability to track internal IP data access, insufficient internal controls, and external theft. Stopping potential IP theft before it occurs requires continuous monitoring of user logins, database access, network addresses, and the amount of data being accessed so that unusual patterns or compromised user credentials can be detected in real time. If attempted IP theft or a cyber-attack is detected, file system access should be immediately revoked for the suspicious user accounts without impacting other team members.
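One simple way to detect "unusual patterns" in access volume is to compare each user's current data consumption against their own historical baseline. The 3-sigma rule, field names, and per-day windowing below are assumptions for illustration; real systems would use richer features (logins, addresses, time of day).

```python
# Hedged sketch of access anomaly detection: flag users whose data volume
# in the current window far exceeds their historical baseline.
# The 3-sigma threshold and per-day byte counts are illustrative assumptions.
import statistics

def flag_suspicious_users(history, current, sigmas=3.0):
    """history: user -> past per-day byte counts; current: user -> today's bytes."""
    suspicious = []
    for user, today in current.items():
        past = history.get(user, [])
        if len(past) < 2:
            continue                   # not enough baseline to judge
        mean = statistics.mean(past)
        stdev = statistics.pstdev(past)
        # floor the spread at 1 byte so a perfectly flat baseline still works
        if today > mean + sigmas * max(stdev, 1.0):
            suspicious.append(user)
    return sorted(suspicious)
```

Accounts returned by such a check would be candidates for immediate, targeted revocation of file system access, leaving other team members unaffected.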
Successfully managing the design, verification, and integration complexities associated with the data explosion of the 2020s requires new solutions that can:
- Accelerate EDA application compute cycles and storage I/O operations
- Enable I/O scale-out and hybrid cloud bursting with low latency and a minimal storage footprint
- Perform advanced analytics that accurately measure all important design tasks and resources and enable intelligent project decision making
It starts with the adoption of the Blueprint for EDA Infrastructure, an integrated platform that enables efficient design database management, effective IP development and reuse, EDA job acceleration, cloud adoption and accurate project progress insight.