Case Study

Simon Burke, Xilinx (DAC Panel 2015 edited transcript)

I’m going to talk a little bit about Xilinx and Big Data, mostly on tape out prediction scheduling. I don’t want to minimize the security aspect of it, to me that’s just a slam dunk. It’s not something that’s optional anymore – you just have to do it.

So Xilinx, in case you don’t know, is an FPGA company; I should mention that we are the only tier one FPGA company that’s still independent.

So the real questions when it comes to tape out are: When are we going to tape out? Who’s having problems that we don’t know about yet? There’s always the top-down project schedule that people have. That’s generally reactive, not predictive in nature, so it accommodates problems after they become visible – not before. And then the third thing that usually comes up, from finance, is: Where’s all this extra money going? Who is spending it, and why?

Specifically, which series and which IP design blocks are using extra resources, whether it be licenses or people. We need to be able to come up with an answer to that so that we can either plan better or correct course next time through.

So the standard answer to when it is going to tape out is usually lots of meetings and lots of Excel work. People have weekly meetings, they put stuff into spreadsheets to make graphs, and it all looks good.

The problem is that engineers generally want to clean their laundry in private. A problem isn’t really a problem to an engineer, it’s an “unsolved challenge” – until it’s irrevocable and they have to admit it.

So you tend to find problems late, because engineers feel – perhaps unconsciously – that it’s something they can solve without having to make a big deal about it.

There’s a lot of wisdom from previous projects that people apply; sometimes it applies and sometimes it just doesn’t, and generally you don’t find out which is which until the end. So retrospectively you know, but at the time you don’t. And what you generally get is a reactive system that tells you why tape out was delayed but doesn’t tell you beforehand that it’s going to be delayed.

Now this is a bit of an oversimplification; it’s a much more complex problem than this. And you obviously witness the problems as you go through the project, but generally the current model is a reactive model, not a predictive model. So there are many answers to the problem, but one of the key features is taking the human element out of the raw data collection.

People want to put their bias on it; the way to avoid that is to take the people out, then just collect the raw data and analyze it impartially. This is hard because:

  • There is a lot of raw data
  • You don’t know which data matters and which doesn’t in the beginning
  • Extracting something useful out of a large amount of data can be quite a challenge.

What you want is for your smart people to work out what the useful data means, not spend an awful lot of time trying to extract it. So what’s the answer? What do you do? What you really want to do is build a system to:

  • Collect and store raw data because you don’t know what’s going to matter until the end. And you’ll change your mind as you go through the project as to what actually is important or not.
  • Efficiently process a large amount of data into a couple of small key metrics you can use to predict. You may not know what those metrics are in the beginning, but as you work through, you’ll start to understand what matters and what does not.
  • You can then use your smart people to think about what all those metrics mean and how to assemble them into a prediction.

So the questions you want answered are: When are we really going to tape out? Who’s having problems? And from finance: Where’s all the money going? What that usually translates into is that there are some long-pole processes for tape out, and for each of those there is some data associated with it that’s in your data management system.

You can use that data as a proxy for where those processes are. What you want are metrics that cover each data type: for OpenAccess data you want check-in and check-out times and license usage; for DRC you want the number of errors reported. Each data type has its own metrics that matter, but you can correlate a process with a data type, and therefore with some metrics you can use to track that data type.
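
As a rough illustration of that process-to-data-type-to-metrics mapping – the names below are invented placeholders, not the actual Xilinx or IC Manage schema – you could capture it in something as simple as a small Python table:

    # Hypothetical mapping from long-pole processes to the data type that acts
    # as a proxy for them, and the metrics worth tracking for that data type.
    PROCESS_METRICS = {
        "custom_layout": {
            "data_type": "OpenAccess layout",
            "metrics": ["checkout_time", "checkin_time", "license_hours"],
        },
        "physical_verification": {
            "data_type": "DRC results",
            "metrics": ["error_count", "last_run_date"],
        },
    }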

So you just collect and do a prediction. How hard can that be? A lot of that is very theoretical and high level.

— Xilinx & IC Manage ENVISION, Predictive Analytics & Tapeout Prediction —

What I’m going to show you next is more of a practical example. I think that makes more sense than some high level slides. I’m going to assume that custom layout is my long pole. There are many of them and each technique is different. This is just one example of one metric and one area that you can use to track into a prediction.

So if you assume that custom layout is one of your long poles, what do you need to be able to predict the layout closure time on a project? You need to know how many man-hours people spend on a particular block or IP, plus how long it took on the previous project and how long it’s taking on the current project.

There’s always a process scaling factor when you go from 28 nm to 20 nm to 16 nm. Stuff just gets harder. It takes longer. It takes more run time. It takes more manual effort from the design teams. So there’s a scaling factor for the process node that is relatively consistent, but you need to know what it is.

And then there’s a complexity scaling factor. Designs tend to get more complex with each generation. Some stay the same, some add features; very rarely do they get simpler or easier to do. So there’s also a scaling factor for complexity you must factor in.

Well, the equation is really easy: E = MS squared. How hard could that be? The problem is, if you just apply that blindly or dumbly or generically across the entire design, generally what you get is a very wrong answer. The error bar on it is bigger than the prediction bar, so the window of opportunity from early to late is very big.
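
A minimal sketch of that “E = MS squared” idea – all names and numbers below are made up for illustration, not Xilinx data:

    def predict_effort(prev_man_hours, process_scale, complexity_scale):
        # Predicted man-hours for a block on the new node: previous effort
        # multiplied by the process and complexity scaling factors.
        return prev_man_hours * process_scale * complexity_scale

    # Example: 4,000 man-hours last time, a 1.5x harder process,
    # and a 1.2x more complex design.
    print(predict_effort(4000, 1.5, 1.2))   # -> 7200.0

Applied naively to the whole chip with guessed factors, that single number is exactly the kind of estimate whose error bar swamps the prediction.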

So, if we have a perfect metric system – one that doesn’t exist yet – what you really want to know is: When was this all checked out? When was this all checked in? Who did it? One person editing five cells for an hour counts the same as editing one cell for one hour: it’s the same library, it’s the same IP design block, and it’s the same person. You want to know which IC Manage library it was done in, because those libraries correspond to various IPs, and there can be many libraries in a single IP. You also want to know when the users checked out the license, and when the license got taken back because they weren’t using it. A checked-out license can sit through a long lunch break – that doesn’t count as man-hours.

So really what you want to know is the actual man-hours they spent on the design, not how long they had the tool checked out or how long they were in the office. What this comes down to is accuracy. What you are trying to do is minimize the error bar in the final prediction – and the details really do matter. If you do this in a more coarse and generic manner, the answer you get won’t be very accurate and your prediction won’t be very good. So assuming you have all that raw data, you have the total man-hours per design team. You want to know, for that design team:

  • When did the project start?
  • When did it complete?
  • What are the transition dates from one project to the next, and from one data view to the next?
  • When did they go from RTL to schematics, and from schematics to layout? (for instance)

So what you get is a big set of time-series databases with different intervals – some are days, weeks, months – over a project. Once you’ve got that, you can start predicting, based on the previous project and your scaling factors, what they should be doing on the next project.
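
A toy sketch of collapsing those raw events into a per-team time series – the record layout and numbers here are invented; in practice the records would come from your data management and license logs:

    from collections import defaultdict
    from datetime import date

    # Hypothetical records: (team, block, day, hours of real activity), already
    # derived from check-out/check-in pairs clipped by actual license activity.
    records = [
        ("team_a", "block_x", date(2015, 3, 2), 6.5),
        ("team_a", "block_x", date(2015, 3, 3), 7.0),
        ("team_b", "block_y", date(2015, 3, 2), 4.0),
    ]

    # Collapse the raw events into a weekly man-hour time series per team.
    weekly = defaultdict(float)
    for team, block, day, hours in records:
        week = day.isocalendar()[1]          # ISO week number
        weekly[(team, week)] += hours

    for (team, week), hours in sorted(weekly.items()):
        print(team, "week", week, "man-hours:", hours)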

So what do you do? At the start, you take a wild guess at what the process scaling factor is. The error bar on that is fairly huge, so you don’t tell anybody what your prediction is because it’s not very substantiate-able. But as you go through the process, blocks come clean on your current project.

That will give you some quantifiable metrics as to how much extra effort you needed compared to last time, and the complexity scaling factor for a particular design. So you can take the early blocks that go through and use them to refine your early estimates for some of these scaling factors. Now, obviously, as the project goes on you get more and more data to base that on. You can use those early blocks as the starting point for estimates for your late blocks. So you get to refine the estimates for blocks you haven’t yet started based on the same assumptions as the early blocks.

And those kinds of factors can be different for different styles of blocks. Mixed-signal designs look different from high-speed digital, which looks different from standard place-and-route digital blocks. So you have to group these designs into styles and then apply factors for each style. If instead you apply the same thing across the board, you’ll again get somewhat incorrect answers.

What you want in your metric system is something that makes that easy to do. You can do it all by hand in spreadsheets or Perl scripts or Python scripts, but it’s kind of a pain. What you want is a metric system that just takes all that data and gives you an easy way to get a simple answer out of a rather large amount of data, so you can focus on what the answer means as opposed to how you get the answer in the first place.

It’s got to be easy to generate this data often because as you go through the project you’re going to rerun this on a weekly or even bi-weekly basis, so you can refine your prediction as you go forward. You don’t want to be doing this every three or four months because the gaps between the data are so far apart that you really won’t know what changed in the meantime.

It’s the incremental nature – where you can do this quickly and efficiently – that really helps you. So as I said, at the start of the project you take a wild guess and then you refine it down. The complexity factor is probably the hardest to do. How do you tell how something changed from one project to the next? Let’s break it down. If we’re talking about layout here, layout is generally preceded by schematic, and the schematic is generally preceded by RTL. So you can look at the complexity scaling between the RTL from one project to another, and then use that as the proxy for how the physical design scales.
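
For example, a crude RTL-based proxy might be nothing more than the ratio of some RTL size measure between the two projects (the numbers here are invented):

    def rtl_complexity_proxy(prev_rtl_size, new_rtl_size):
        # Early stand-in for the complexity scaling factor: the ratio of some
        # RTL size metric (lines, instances, ...) between projects. Refine it
        # later from schematics and then from actual layout effort.
        return new_rtl_size / prev_rtl_size

    print(rtl_complexity_proxy(120_000, 150_000))   # -> 1.25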

Likewise, when you go from RTL to schematic you refine your number. So there are other data types you can use as proxies for some of these scaling factors before you actually get there. Generally RTL never gates your chip tape out; it’s always the back-end physical world. You may start late because the RTL isn’t ready, but you can use the RTL as your proxy for how much those things really changed before you get the final answer from the physical layout. So how does it really work? That’s all great in theory, but what do you do? It’s really simple. The hard part is knowing if the answer is right. Coming up with a prediction of how many resources, or when tape out is going to be, is fine.

But if the answer is incorrect, nobody’s going to believe you, and you really only get one shot at that. If you screw it up the first time, you lose credibility and nobody will believe you again. So what’s more important is: How do you know the answer is correct? Starting with something simple, we have a “Design A” in 20 nanometers. It went through, and we know when it started, when it finished, and how many man-hours it took. And we do the same design in 16 nanometers with limited changes.

So the design itself didn’t change; this was just a port to a new process node. From that you can get the process scaling factor of 1.5. And the more early blocks that complete, the better the rest of it is – you can start averaging across blocks.

Now you start with “Design B”, which had lots of design changes, but it’s in the same 20 nm to 16 nm transition.  It took longer, but you know the process scaling factor was 1.5; you already have that from the earlier blocks.

Which means you can go and figure out what the complexity scaling was for that design. You can do that from RTL, you can do it from schematic, or you can do it from real layout, and you can confirm your numbers as you go through. So you can use other design data as a proxy and then refine as you get the block done. Once those early blocks are done, you can collect the complexity factor as well as refine how you estimate these things. Maybe your estimate wasn’t very good. You figure out what you did wrong and correct it next time through.
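
Here is a worked version of that Design A / Design B reasoning, with made-up hours (only the 1.5 process factor comes from the talk):

    # Design A: same design ported 20 nm -> 16 nm with limited changes,
    # so the effort ratio is essentially the process scaling factor alone.
    design_a_20nm_hours = 2000.0
    design_a_16nm_hours = 3000.0
    process_scale = design_a_16nm_hours / design_a_20nm_hours        # -> 1.5

    # Design B: same node transition, but with real design changes.
    # Divide out the known process factor to isolate the complexity factor.
    design_b_20nm_hours = 4000.0
    design_b_16nm_hours = 7200.0
    complexity_scale = design_b_16nm_hours / (design_b_20nm_hours * process_scale)

    print(process_scale, complexity_scale)   # -> 1.5 1.2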

And as I said before, you’ll have to group these things into styles, because a P&R block and a full custom block are not going to have the same kind of scaling factors and metrics. So you want to group your designs into particular styles and have different scaling factors for each style. A different design style might have a different complexity scaling factor and a different process scaling factor to deal with.

The real issue is the blocks that start late in the schedule – those are the ones that will start to gate your tape out schedule. How do you apply factors for those? To first order, the process scaling factor you can just leverage from the early designs, so that one’s easy.

If they have limited design changes, then you know what their resource requirements are going to be.

For other designs that have lots of changes, you have to use schematic or layout or some other metric to estimate what the complexity change was from one design to the next.

But you already have the process scaling factor, so you can combine that with those other metrics to come up with your estimate of the complexity scaling.

So to a first order you can use early data to go predict the later data. That’s really the key to this. Day one on the project, you’re not going to be able to do a very good prediction.  But as you work through the design, you’ll get more and more accurate estimates of various scaling factors from various early blocks that you use to apply to later blocks.

So you get about halfway through the project, and your prediction actually starts making sense. What you’re looking for when you run this every week – or every couple of days, depending on how enthusiastic you are – is that at first the predictions wobble all over the place, and then they settle down to a stable point. When you get that stability, and when you get repeated stability for a period of time, you know your prediction is probably going to be more reasonable, and you can then start estimating what the tape out date is actually going to be.
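
One simple way to detect that settling – purely illustrative, not the Envision implementation – is to check whether the last few weekly predictions stay inside a small band:

    def prediction_is_stable(weekly_predictions, window=6, tolerance_days=7.0):
        # Crude stability check: the last `window` weekly tape-out predictions
        # (expressed as days from some reference date) all fall within
        # +/- tolerance_days of their mean.
        if len(weekly_predictions) < window:
            return False
        recent = weekly_predictions[-window:]
        mean = sum(recent) / window
        return all(abs(p - mean) <= tolerance_days for p in recent)

    # Early in the project the predictions swing wildly; later they settle.
    print(prediction_is_stable([400, 310, 520, 450, 365, 470]))   # False
    print(prediction_is_stable([452, 455, 449, 451, 456, 450]))   # True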

So what does it all mean? For the custom layout blocks we just talked about, you want to know how many man-hours it took to complete each block in the new process node. You know roughly when they should start and end based on the previous projects, because the timing of a block starting and ending is very similar. And if you have infinite resources, or you are not constrained on the resource side, that can help predict what the project’s timeline is going to be and when you would tape out. Resources are never infinite, but you can make that assumption if you want.

And the third question – where did all the resources go? – you can now answer. As you track the man-hours over the project and you track the tool license usage, you can tell finance exactly when each design team started, when it ended, what man-hours they used, what licenses were used, and how much that cost the company.

Now it could be that they needed those tools. It could be that they were doing something very inefficiently and could change their methodology to use less resource going forward. But you know where the money went. It didn’t just pour down the drain or into a big black hole. So finance is happy.

The next question is how you factor in constrained resources. And this really isn’t rocket science. If you take your previous estimates – when things started, when they ended, and how much resource it will take in the new process node – you can stack them all up and figure out when tape out is going to be.

Of course, when you overlay your resource demand on your actual maximum, you find your over-allocated resources. It’s standard project management stuff: you can’t use more people than you’ve actually got in the company. So you start delaying some stuff, because it’s always the stuff that starts second that gets delayed, not the early stuff. You start pushing out later blocks, which are typically going to be your long pole to tape out anyway. And that gives you a new tape out date that management doesn’t really like.
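
A deliberately crude sketch of that stacking step – blocks with estimated effort queued against a fixed team size, so later blocks get pushed out when the team is over-allocated; all numbers are invented, and real project management is obviously more involved:

    # Blocks with estimated person-weeks of effort, scheduled one after another
    # against a fixed team capacity. Later blocks slip to the right.
    blocks = [("early_block", 40), ("mid_block", 60), ("late_block", 80)]
    capacity_per_week = 20   # engineers available

    week = 0
    for name, person_weeks in blocks:
        duration = -(-person_weeks // capacity_per_week)   # ceiling division
        print(f"{name}: weeks {week}..{week + duration - 1}")
        week += duration

    print("predicted tape out at week", week)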

And if you look at the resource profile, you find it can have this kind of lazy uptick, reach a peak, and have a lazy downtick toward tape out. What that means is you’re paying the maximum number of people for the entire generation of the project, but you’re only fully using them for about two months in that whole period. There’s a lot of time on the front end and the back end of the project when they’re not fully utilized. What you really want to do is change that resource profile.

And this is where the analytics start becoming more useful. So we have a profile that looks like [the above]. It’s not very ideal.

What we really want is something that looks like [the above]: use all our resources as much as possible for the longest period of time and have them all finish at the same time. What you really want to do is reduce the resources on some earlier blocks and allocate them to some of the long-pole blocks. And for the ones that aren’t critical, delay their start, so you spread out the resource usage over the duration of the project.

What it gives you is a much more efficient profile. Now this is obviously program management 101 and in theory it’s very easy to do. What you really want are the analytics from the previous project and the current project as you go through to understand where your resource is going.

Who’s using it? Who used it before? Who’s going to use it again? When will it start? So you can preemptively start making these allocations, instead of figuring it out halfway through. The earlier you do this, and the earlier you can figure out what the right answer is, the less dramatic it is for the design team and the more likely you are to hit that tape out date. This is not rocket science, and it’s also not a crystal ball.

This tells you when you should tape out if nothing exceptional happens; but exceptional stuff does happen. You’ve also got to pay attention to surprises, new things, features that come in late, things that this tool obviously isn’t aware of, and factor that in as well. You might have one block that is late because a bunch of new features went in. It won’t tell you that until you get started on the block and get part way through.

So it’s not a crystal ball, it’s a tool to help you to go figure out what the right thing to do is, and if everything goes well, you’ll hit tapeout as efficiently as you can, and you deal with surprises that come up. But it takes you to a point where you can focus on those surprises when they arise, not trying to figure out what everyone else is doing.

So what you really want to do is create preemptive allocations ahead of time so you get the most efficient tape out possible. Yes, there will be some noise in there that you have to figure out as you go along, but that’s really what you want to do: not be reactive to changes but be predictive, and preempt those changes and corrections as much as possible. And this kind of analytics will let you do that.

So what happens when things go wrong? Early scaling factors for process or complexity can be just wrong. I mean, it’s a guess; people make mistakes for honest reasons, and some factors are surprises that are not really well understood at the start of a project. What I have here is a resource profile of a 20 and 16 nanometer project over time. For 20 nanometer, you have a certain ramp over time and then a close-off to tape out. Let’s assume our earlier estimate for the resource increase was 1.2, or 20 percent more. So you go off and hire 20 percent more labor resource, in this case to do your layout.

As you go through the project, you realize that your estimate is wrong. It’s more like a 60 percent uptick, not a 20 percent uptick. It was an estimate, you made a bad call, and you want to correct it. So now you uptick your resource by 60 percent. But just increasing by 60 percent doesn’t solve your problem, because you’re already halfway through the project and you’ve already burned a bunch of time using less resource than you actually needed.

So you have to overcompensate by hiring additional resource based on what’s left to do to keep your tape out on time. You also have the option of delaying tape out, but that’s less popular these days. Once you know what the prediction is and where you’re at, you can start tracking your estimates over time, so you get an early indication that something’s wrong.
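
Here is the arithmetic of why you have to overcompensate, with made-up numbers: if the true uptick is 1.6x but you planned for 1.2x and only notice at the halfway point, applying the corrected factor to your staffing from here on is not enough.

    baseline_hours = 10_000                  # previous-node effort for the work in flight
    planned_total = baseline_hours * 1.2     # 12,000 hours planned
    actual_total = baseline_hours * 1.6      # 16,000 hours really needed

    spent_so_far = planned_total / 2         # halfway through the *plan*: 6,000 hours
    remaining_work = actual_total - spent_so_far    # 10,000 hours still to do
    planned_remaining = planned_total / 2           # only 6,000 hours in the plan

    required_ramp = remaining_work / planned_remaining
    print(required_ramp)   # ~1.67: the rest of the schedule needs ~67% more
                           # staffing than originally planned, not just 33% more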

The earlier you do something about it, the less traumatic it is for the company and the project to fix it. And you will have those occasions. This is a tool not to tell you what’s going to happen, but to tell you when things are going off the rails or not going the way you expect, so you can do something about it earlier.

What does all that mean? It allows you to do dynamic, active tracking of your prediction or problems earlier than most conventional systems.

It’s predictive, not reactive. And it also lets you do some interesting ‘what ifs’ as to how you solve a problem – and generally there’s more than one answer. You can figure out what those various answers are and which one is more appealing to you. It also gives you early indications of any missed estimates, and those corrections, if done earlier, are usually less dramatic for your project than if they’re done later.

So, you can do this all manually. I mean, you can use these tools or you can just do it by yourself. You can write some Python scripts, or you can do it in Excel – there are some wizards in Excel; it’s a great tool for doing this. This stuff is not rocket science. The problem is it always comes down to:

The prediction is only as accurate as the data being used.

So if you start doing this manually, or it’s not fully automated, and you start making simplifying assumptions to make it an easier problem to solve, those simplifying assumptions quickly erode the accuracy of your predictions. What you really want is fine-grained data collection that reduces the number of assumptions you have to make. And you want to be able to rerun it frequently so that you don’t get caught by one-off noise; you want to be able to average this over time and smooth out some of that one-off noise that will come up.
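
A trivial illustration of that averaging over repeated runs – the weekly numbers are invented, with one noisy outlier:

    def smoothed(predictions, window=4):
        # Rolling mean over the last `window` weekly runs, to damp one-off
        # noise before anyone reads too much into a single prediction.
        out = []
        for i in range(len(predictions)):
            chunk = predictions[max(0, i - window + 1): i + 1]
            out.append(sum(chunk) / len(chunk))
        return out

    weekly_runs = [450, 470, 452, 610, 455, 458]   # one noisy outlier at 610
    print(smoothed(weekly_runs))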

You will need to accommodate exceptions. You don’t want a fixed model; you need this to be programmable. You want to be able to access it with a language where you can adjust the data and put project-specific and company-specific assumptions in there. It’s not going to work from just the generic perspective of an EDA vendor. This is your project, and it’s very different from everyone else’s project. You need to be able to tune it for your needs.

And then the last thing is you need to validate this. If you’re expecting to go buy this and have it predict your next project, you’re living in a fantasy land. What you really need to do is run this in parallel with an existing project, then use the prediction to find out how accurate the model is and tune it. Only when you validate it on an existing project can you start looking at the next project and have confidence in the prediction.

We were running this at 28 nanometer and 20 nanometer and 16. We only got reasonable predictions at the 20 node and at the 16 node because we already had a lot of validation in place.

Inaccurate prediction is worse than no prediction.

If you come up and say, “Hey, tape out is going to be delayed by 3 months” – and it isn’t – that wouldn’t be very popular. Even worse, if you say tape out is going to be on this date and it turns out to be 2 months later, that’s even less popular.

It’s got to be right. Too much one way or another is not a good thing.

So what we want is a database that can take in a bunch of data and allow us to do Python-based queries. We like Python, it’s a very flexible language that lets us do a lot without thinking about the coding.

You can build this system yourself – you can go install a machine with Python & Mongo and do it all yourself, of course you can. But why would you do that? Doing Big Data well is not easy. You can do it badly. What you really want is to be able to query that data very quickly and efficiently.
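
For a sense of what Python-based queries over such a store can look like if you did roll your own, here is a generic pymongo illustration – the collection and field names are invented, and this is not the Envision API:

    from pymongo import MongoClient   # assumes a local MongoDB loaded with event data

    client = MongoClient("localhost", 27017)
    events = client["metrics"]["checkin_events"]   # hypothetical collection

    # Total check-ins per design library for one project, busiest libraries first.
    pipeline = [
        {"$match": {"project": "project_x"}},
        {"$group": {"_id": "$library", "checkins": {"$sum": 1}}},
        {"$sort": {"checkins": -1}},
    ]
    for row in events.aggregate(pipeline):
        print(row["_id"], row["checkins"])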

And I don’t want to become Google; I’m not in that business. It doesn’t help my customers for me to understand how Big Data works under the hood. I want to understand what the Big Data means, not how to do it. So why would you buy it from a database vendor? Because they know how to do this. They’ve been doing it for a couple of years and know what matters. They know how to make it efficient. They know how to solve the problem. They might not know how to solve your problem, but they know how to solve the database problem: how to get a lot of data in there and have it queryable efficiently.

So at least from a Xilinx perspective, we made the decision to work with IC Manage to build the system we want rather than build the system ourselves. And with hindsight I’m very happy with that choice. It means we get to focus on our problem, not on how to make the system work. If you want to do it yourself, it’s not rocket science. You can do that.

I just want to point out this is a real tool that you can use. It’s not just slideware. From a collection perspective, there is a lot of data in there. You can extract your metrics, but those metrics are meaningless unless you do something useful with them.

You know your projects, your project is different from my project. I know how to predict my project, I have no idea how to do yours.

This system gives you a way to produce these graphs with a Python-based language underneath, and lets you choose the data and filter it and scale it and factor it the way you want to see it. So it’s not a data-to-graph solution, it’s a data-to-programmable-API-to-graph solution, where you tune what graphs are presented and how the data is presented, so that your PMs, your program managers, and your design leads can look at the data and have it make sense to them, as opposed to having to interpret it after the fact.

So the conclusions? What we were looking for was a system that could collect a very large amount of raw data from our data management system, which has a lot of check-in, check-out, and fine-grained data management data. We also wanted to pull in external sources for various analysis tools and flows, because those also matter: when licenses get checked in and checked out, how many DRC errors we have, whether anybody is even running DRC.

Usually the biggest clue to a problem is the fact that there is no data. They should be running DRC by now and they’re not. Why aren’t they running it? So there are also a lot of external sources we want to put in the same system. And then we wanted custom prediction models. The models are custom to Xilinx. They worked for us. They wouldn’t work for anyone else, probably not even Altera. Every company has its own models and its own way of working. They have their own long poles. It’s got to be tuned to what you do, not just a generic canned solution from an EDA vendor.

We want to know when we’re going to tape out. Obviously that’s a big deal. And it’s not just when we should tape out – you will obviously have problems along the way. If your tape out date is predicted to be too early, nobody believes it, so they don’t work hard towards it, and tape outs tend to slip.

Knowing when you’re going to tape out lets you line up the back-end resources for silicon verification and packaging and QA and all the rest of the processes. You depend on that data a great deal, and you want it to be as accurate as it can be. For us, engaging with a database vendor to make that system work was good business sense. I don’t want to become Google; I just want to use the system to solve my problems. I don’t want to be in the Big Data business.

So the real question is how accurate is it?

At least for Xilinx, I can tell you that we tracked the man-hours using the system, and because they were direct-costed, there was a separate effort on the finance side to track the man-hours independently of IC Manage, using spreadsheets and email and their existing system.

IC Manage [Envision], the system we have, was within two percent of actual over the cost of an entire project, in terms of man hours.

We were 2 percent under – and part of that is because some design teams were still using private libraries for doing layout work. They accounted for that 2–3 percent. Obviously if it’s not in IC Manage, we can’t track it as active work: we can see licenses getting checked out and checked in, but not corresponding to any IC Manage check-in, so we know something’s going on. That’s why we were 2 percent under.

In terms of when we’re taping out, our estimates are typically published around 6 months before tape out on a typical 18-month to 2-year project. That’s about two-thirds to three-quarters of the way through. At that point, six months before tape out, our predictions are within about plus/minus one week. Those are typically not the same as the top-down project management schedules that were in place at the time. And as we go past that six-month point, we’ve been within a two-week window of when we actually tape out. Once again, your mileage will vary. It all comes down to:

  • How you do it
  • What assumptions you make
  • What corners you cut
  • How good you are at tracking your estimates and refining them as you go forward

You can do a bad job and come up with a really bad estimate. One thing we did find was, even the simplest assumptions make a big difference to the prediction and the accuracy of it. The devil is in the details – you really need the details.

                              — AUDIENCE QUESTION AND ANSWER SESSION —

Audience member:
(Question on predictive modeling – not spoken into the microphone)

Simon Burke, Xilinx: The first part of what you do is have a predictive model of what’s normal – what you should expect to see for all your design blocks. Once you get partway through the project and you get some confidence in that model, then you can start looking for places where your design team is starting to depart from the prediction.

Now it could be that they’re right and the predictions wrong or vice-versa, you don’t know. But the important point is you’re looking for those departures so you can go debug and figure them out.

Once you figure out that the team is departing and it’s a real departure, then you can start looking at what changed. Was it more complexity? Is it harder to do? Is there something fundamental about the design that is different? It lets you start tuning the prediction model to accommodate that change and, more importantly, what you really want to do when you see those departures is try to figure out how to apply more resources to that block to get it back on schedule.

To the first order: Why are they late? It doesn’t really matter. The fact that it is late is the important point. And what you typically do – what we have started doing – is cannibalize resources from projects that are on schedule, or blocks that are on schedule, to apply to the ones that are behind, so that they all finish at the same time.

It might seem counter-intuitive, but you know you’re already in a situation where you’re behind schedule. You want to minimize the impact of that. You use it in two ways. One is predictive in terms of what’s normal, and then you start looking for aberrations from normal, and figure out what to do about them.

Audience Member:
I have a question on the scaling factors you used. Is there a way to learn those scaling factors – for example, based on design size, complexity of constraints, vis-à-vis technology, and things like that? How do you get these numbers, 1.2x versus 1.4x?

Simon Burke, Xilinx:  What we’ve experienced is that we can put our designs into certain buckets. There are mixed signal designs, there are P&R designs, and there are more full custom designs. The process scaling factors for those tend to be somewhat different, but very consistent within that sub-type. So figuring out what you have in your chip and how to bucket them is the first step.

The complexity scaling is different in that every design can scale differently. What you’re looking for are pre-triggers, or pre-predictors. If you’re worried about layout resources, you look at the schematic or RTL for the same design to figure out how much it changed from one project to the next.

You use those early estimates for what the complexity scaling would be, and then you refine as you go forward.

So yes, you start off with an estimate which is why you don’t want to predict tape out in the beginning of the project. It’s a nonsense number. What you want to do is, as you go through the project, you’ll see that the variation of those predictions start to settle down. If you do them every week, they’ll be swinging all over the place and then kind of settle down to a common point.

That’s when you kind of know you have the buckets right and you’ve got the predictions right.

And once that variation starts coming down, and your error bar starts to close down, then you can start using it as a predictive model, and start looking for departures from the predictive model and what that means for tape out. It’s work – you don’t just push a button and it happens.

Audience Member: How much data do you need to do what you talked about? And are the models flexible, or do you have to tune them after you go through a couple of cycles to adjust those models for new predictability or new things you’re doing?

Simon Burke, Xilinx: We track every single file check-out and check-in – by whom and when – and every single license check-out and check-in, so it’s a fairly significant amount of data. We do maybe 5,000 check-ins a week, and each of those check-ins can be anywhere from one cell to hundreds of cells. So the volume of data is quite significant from that perspective. In terms of tuning and iterating, what we did was:

  • Went through one project where we just collected data and didn’t even try to predict, because it was pointless.
  • Once you have that first set of data, the second project is where you start predicting. But you keep it private so you can correlate whether your methodology works or not and tune it. The second project tells you whether you can predict the right answer based on what actually happens.
  • Your third project is where it actually starts to be useful, and you can use that data going forward.

So yes, there is a learning and tuning cycle. The key for us was that we wanted to be able to intercept the raw data and present it with a programmable language that we could tune and adjust within the [Envision] system – that’s Python. We can collect all the raw data, collapse it, create metrics, then scale and weight them however we want to come up with that final answer.

Audience Member: How long did it take you to finally get to your finely tuned model?

Simon Burke, Xilinx: The first project was at 28 nm, and we pretty much just collected data on that. The second set of projects was at the 20 nanometer node, where we taped out multiple projects.

  • For the first one, our prediction of tape out was within probably plus/minus two to three weeks.
  • The man-hours resource has always been within about that 2 percent tracking window.
  • For the third project, our internal team stopped tracking man-hours because the answer we got from our system was on the money. They just stopped doing that checking because it was a lot of effort with no value added.
  • The tape out prediction closed down to about a two-week window, so plus/minus one week.

Audience member: Have you applied this to anything other than tape out time to market? Have you applied it to bug tracking, issue tracking, verification, any other part?

Simon Burke, Xilinx:  Our initial focus was really two important metrics:

  • When are we taping out, can we have a more predictive model of that instead of reactive?
  • Man-hours.

We obviously hire a lot of contractors, in addition to full-time employees, and we pay them. That has a big budget impact. So we want to know how many we’re hiring, for how long, where, and who uses them. Those are the initial two metrics we were interested in getting very good, accurate predictions for. We will deploy this into other areas of our design process.

We started this project off to try to predict tape out and where resource was going from a project level. We have to do that at a design level so that the project roll-up kind of makes sense. What we found was a lot of our design teams want to know how much resource they’re using, whether they’re on track, and where things are going. So we were getting a lot of requests from them:

“I don’t care about the rest of the chip just tell me about my stuff”.

And we went down the path of doing a presentation layer. Silicon companies tend not to do those things well. It’s not something we’re either good at or have a lot of interest in, to be honest. So you end up giving them the raw data and spreadsheets and they figure it out.

So we did find these design teams were very interested in being proactive about their own resources and their own tracking and looking for aberrations early for the same reasons we were doing this.

The sooner they know about problems, the sooner they can deal with them. But they want to be able to do specific drill-downs: they just want to know about this IP, or this set of IPs, or this variant, because they own a subset of the data, not the complete set. The charts I presented allow you to do that kind of drill-down. I really care about the project level and how it rolls up. But they want to know:

  • Just for these two IPs, how do I compare my last project to this project?
  • These two IPs are supposed to be the same – how do they compare?
  • I own this one – what is it doing?

So [Envision] lets you do that kind of low-level drill-down into specific data types and specific IPs, and then look at the subset of data just for that stuff. A lot of design teams are using it from that perspective. They want to know their part of the world, not the whole context.

It’s kind of interesting that what came out of the big picture actually turned out to be something for the design teams from a very low-level, specific perspective. I skipped through those charts in the interest of time. (And the data is generic, so when you read it – if you are Altera – don’t try to assume what we’re doing; it’s just made-up data.)

This gives you a feel for the type of reports you can do very efficiently and quickly. And that’s really what’s important. A lot of them are customizable.  Again you can tune the data via Python so your end user doesn’t even have to be a programmer. They just see the reports they want.

###

Simon Burke is a senior architect for Xilinx.
