The quest for known behavior
This post is one I've been meaning to write for a while to explain my personal philosophy about designing, testing, and tooling APIs to provide the best experience for the implementers and users of that API.
In my position on Intel's Linux 3D driver team, I see the way this all plays out from multiple angles. As a member of the Khronos Vulkan working group, I am one of the many spec authors and get my hands dirty with the minutiae of exactly how all the various bits of the API are specified to work. As a driver author, I see how we implement the APIs and all of the various corner cases where things can go wrong. As someone who debugs game issues and communicates with game developers, I see the pain of debugging issues in applications and drivers that cause anything from rendering errors to full system crashes. One of those is obviously worse than the other but neither leads to happy users.
My objective as a spec author and driver developer is to make the Vulkan specification the best it can be and provide the best experience possible for both game developers and the users who enjoy playing their games. So how do we go about accomplishing this?
What is undefined behavior?
Fundamentally, an API specification like the Vulkan specification is a contract between the client and the implementation: if the client does X, Y, and Z, then the implementation will do A, B, and C. The difficult part is what happens when that contract is broken. In Vulkan, any misuse of the API on the part of the client results in what we call "undefined behavior." Here's a short quote from the Vulkan 1.1 specification:
"The core layer assumes applications are using the API correctly. Except as documented elsewhere in the Specification, the behavior of the core layer to an application using the API incorrectly is undefined, and may include program termination."
The consequences of misusing Vulkan are pretty bad. "May include program termination" means that using the API wrong may cause your program to crash or the kernel to decide to kill it. This sits in stark contrast to OpenGL where the worst that happens for most common programming errors is that whatever function you just called harmlessly sets an error code and does nothing. Almost worse than the program crashing is that the undefined behavior may be that it works perfectly and the developer remains blissfully unaware of the problem until someone runs the application on a different Vulkan implementation and it immediately crashes.
How can we avoid undefined behavior?
How can anyone write software against an API that provides no feedback about errors and where the consequences for violating any one of the specification's more than four thousand "valid usage" statements are so dire? For that, we have a set of what we call "validation layers" which do piles of error checking to inform the developer when they are in violation of their side of the API contract. In theory, if the validation layers give the application the green light then it's fulfilling its side of the contract and will get correct rendering.
There is a second issue here which comes from the other side of the API. The specification is a contract and we also have to ensure that the implementation (driver) lives up to its side of the bargain. For that, we have a conformance test suite (CTS) that vendors are required to run and pass before they can claim that what they have is a Vulkan driver. These tests attempt to test a broad cross-section of the API to give some sense of security that the driver is, indeed, implementing it correctly. In theory, if you pass the conformance test suite then any application which uses the Vulkan API correctly will render correctly on your implementation.
Those are both nice theories but we know that theory and practice are often two different things. That only works if both the validation layers and the conformance test suites are perfect. The reality, however, is that not every corner of API validity is covered by validation. On the implementation side, when you consider both software and hardware, the complexity of the implementation is such that perfect test coverage is impossible to achieve.
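To make the validation-layer idea and its limits a little more concrete, here is a minimal sketch in Python. It is a toy model, not Vulkan code: `ToyDriver`, `ToyValidationLayer`, and `create_buffer` are hypothetical names invented for illustration. The two rules are modelled on the kind of "valid usage" statement the specification uses for buffer creation (size must be greater than 0, usage must not be empty), and the layer deliberately checks only one of them.

```python
class ToyDriver:
    """Hypothetical stand-in for an implementation that trusts its caller."""

    def create_buffer(self, size, usage):
        # No checking at all: with invalid input, whatever happens, happens.
        return {"size": size, "usage": set(usage)}


class ToyValidationLayer:
    """Hypothetical layer that sits between the application and the driver."""

    def __init__(self, driver):
        self.driver = driver
        self.messages = []

    def create_buffer(self, size, usage):
        # Rule the layer knows about.
        if size <= 0:
            self.messages.append("create_buffer: size must be greater than 0")
        # Deliberate gap: nothing here checks that `usage` is non-empty,
        # so that kind of misuse passes through without a message.
        return self.driver.create_buffer(size, usage)


layer = ToyValidationLayer(ToyDriver())
layer.create_buffer(0, {"TRANSFER_SRC"})  # invalid, and reported
layer.create_buffer(64, set())            # invalid, but silent
print(layer.messages)
```

The second call is the interesting one: the application gets the green light even though it broke the contract, which is exactly the situation a coverage gap in validation creates.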
I think, by now, I've probably successfully convinced you that making this whole mess work reliably is a hard problem. In fact, it's impossible. Before we get too depressed about the future or lost in the details of validation and conformance testing, let's take a step back and look at the big picture again.
The quest for known behavior
At the end of the day, what do our customers want? In particular, what do the software developers who write applications that use the Vulkan API want? It's really very simple: they want to know that their application will run correctly and perform well on their users' computers. They don't care that every possible theoretical correct Vulkan program runs correctly on implementation A. They also don't really care that Vulkan application B works correctly on every theoretically possible correct Vulkan implementation. They care that their application will run correctly and perform well on their users' computers.
In other words, they want what I call "known behavior". They want to know that their application will run as intended. Ensuring this in general is still an impossible task but keeping the correct perspective helps us prioritize so that we can come as close as possible to the real goal of happy customers. Our goal as spec authors, driver developers, test writers, and validation layer developers should be to ensure that, if an application passes validation, then the developer has a pretty good idea that it will actually work when deployed in the wild.
Where do we go wrong?
The goal I stated in the previous paragraph sounds obvious, but it's amazingly easy to get so caught up in the details that you forget the big picture. Let me give two examples.
First, let's look at the group of CTS tests called dEQP-VK.pipeline.stencil. There were around 16,000 tests in this test group that test every possible combination of depth/stencil image format and stencil pass, stencil fail, and depth fail op in the API. On the face of things, this sounds like fantastic coverage because it covers all combinations of some things. However, when Vulkan 1.0 was released, this was about 10% of the tests in the CTS, the whole lot caught exactly one bug in our driver, and it took three lines of code to fix it. Meanwhile, there was not a single test in the CTS which tested using depth or stencil on a mip level or array slice other than zero, nor were there any tests for different clear operations, nor were there any multisampled depth/stencil tests. So, while we had tens of thousands of stencil tests, they exhaustively covered one tiny corner of the API and left vast swaths completely untested.
A second example is SPIR-V testing. The SPIR-V spec is on the same order of complexity (as far as combinatorial explosions go) as the Vulkan spec itself. It's a very general spec which specifies a binary language for the exchange of shaders between the Vulkan application and driver. Because no one likes to write SPIR-V directly, the choice was made early on to write most CTS shaders in GLSL and use GLSLang to compile them to SPIR-V. We also wrote a few hundred tests directly in SPIR-V to test various control-flow conditions that weren't likely to be generated by GLSLang. The result was that the CTS does a pretty good job of testing that implementations can consume the subset of SPIR-V that's produced by GLSLang. However, as people have started developing SPIR-V compilers for other languages such as HLSL and OpenCL C, we've discovered that driver quality is not so good the moment you step off of the path of what's generated by GLSLang.
The point of those two examples is not to poke fun at any particular person or to make you think that the state of Vulkan testing is bad. Quite the contrary, I feel like the state of Vulkan testing is actually pretty good these days (it was very bad at first) and it only keeps improving. The point is to show how easy it can be to leave giant gaping test coverage holes if you aren't careful.
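The stencil example can be put in rough numbers. The sketch below is only an illustration of how an exhaustive product over a few dimensions balloons while other dimensions never move. The enum value names are real Vulkan ones (with the `VK_` prefixes dropped), but which dimensions are varied and which are pinned is an assumption made here, not something taken from the CTS sources. In particular, including the compare op as a varied dimension is a guess.

```python
from itertools import product

# Real Vulkan enum value names, prefixes dropped for brevity.
STENCIL_FORMATS = ["S8_UINT", "D16_UNORM_S8_UINT",
                   "D24_UNORM_S8_UINT", "D32_SFLOAT_S8_UINT"]
STENCIL_OPS = ["KEEP", "ZERO", "REPLACE", "INCREMENT_AND_CLAMP",
               "DECREMENT_AND_CLAMP", "INVERT",
               "INCREMENT_AND_WRAP", "DECREMENT_AND_WRAP"]
COMPARE_OPS = ["NEVER", "LESS", "EQUAL", "LESS_OR_EQUAL",
               "GREATER", "NOT_EQUAL", "GREATER_OR_EQUAL", "ALWAYS"]

# Assumption: the dimensions an exhaustive stencil group might vary.
VARIED = {
    "format": STENCIL_FORMATS,
    "fail_op": STENCIL_OPS,
    "pass_op": STENCIL_OPS,
    "depth_fail_op": STENCIL_OPS,
    "compare_op": COMPARE_OPS,
}
# Assumption: dimensions that stay at a single value in every case.
PINNED = {"mip_level": 0, "array_layer": 0, "samples": 1}


def exhaustive_cases():
    """Yield one dict per combination of the varied dimensions."""
    names = list(VARIED)
    for values in product(*VARIED.values()):
        case = dict(zip(names, values))
        case.update(PINNED)
        yield case


cases = list(exhaustive_cases())
total = len(cases)
# How many distinct values of each dimension the whole group ever sees.
distinct = {name: len({c[name] for c in cases}) for name in cases[0]}

print(total)     # 16384
print(distinct)  # every pinned dimension shows a count of 1
```

Under these assumptions the group contains 4 x 8 x 8 x 8 x 8 = 16,384 cases, yet mip level, array layer, and sample count are each exercised at exactly one value. Adding more cases along the varied dimensions would never touch them.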
How can we achieve known behavior?
We can't actually get there; not really. However, we can make strides in that direction and we can actually get pretty close if we keep the real goal in focus. How do we do that? There are some basic guiding principles that I use when writing spec, working on the driver, or developing tests to help keep my priorities in order and keep focused on the ultimate goal of happy users:
- Write specifications that are clear and easy to validate. In the Vulkan specification, we make it easy to validate by describing as much of the client side of the contract as we can in terms of simple "valid usage" statements which are straightforward to turn into validation code.
- Keep the API surface small and easy to test. The more different ways you have to do a particular thing, the more different combinations you end up with. For example, it works differently in our implementation when you clear with LOAD_OP_CLEAR vs. LOAD_OP_DONT_CARE followed by vkCmdClearAttachments vs. vkCmdClearColorImage followed by LOAD_OP_LOAD. Throw in multi-sampling, depth, and stencil, and you have a testing nightmare. In the case of clears, all those mechanisms are there for good reasons but they come at the cost of a higher testing burden. When you can make the API simpler, you should as it reduces the testing burden.
- Watch out for edge cases. This applies to all areas:
  - When writing the spec, try to design edge cases out. It can sometimes be tempting to start off with something with lots of edge cases and then try to fix them one by one. Often, it's better to step back and rework the spec or implementation and structure it in such a way that it has fewer edge cases by design.
  - When implementing the API, try to design your software with the right level of generality so there are fewer internal edge cases that need testing. It also often helps to have fewer layers and abstractions that interact in strange ways which can lead to more edge cases.
  - When implementing the API, watch out for edge cases and ensure they are tested. Only someone actively working on our driver would understand all of the different image clearing paths we have and know that separately testing rendering to an image and texturing from an identical image doesn't actually cover the case of rendering, transitioning to SHADER_READ_ONLY_OPTIMAL, and then texturing. Whenever I'm implementing a new feature, I actively pay attention to places where I know it could go wrong and ensure that there are tests in the CTS which test those cases.
  - When writing tests, look for all the non-obvious combinations. It's impossible to test every combination of everything. However, it's better to test a lot of different types of combinations than to exhaustively test one tiny corner. See also my story about the stencil tests.
- Test everything. This really should go without saying, but there's no excuse for having a feature that simply isn't tested at all. It doesn't matter how small it is or how it's classified, or how many people are implementing or using it, it needs to be tested. Our team makes it a policy that nothing lands in our driver without independent tests that can be run in our CI system. It doesn't matter if an application uses the feature successfully so you know it works; the tests need to run in CI.
- Write tests/validation for bugs. Every time you find a bug in an application, it's something the validator didn't catch. Every time you find a bug in a driver, it's something the CTS didn't catch. Take advantage of the opportunity when bugs arise to identify the testing or validation hole which allowed that bug to creep through and fix it.
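One way to act on the advice about testing many different types of combinations is pairwise (all-pairs) selection: instead of the full product, pick a small set of cases in which every pair of values from two different dimensions appears together at least once. The sketch below is a simple greedy version under stated assumptions. The dimensions are hypothetical, with the three clear paths taken from the list above, and `is_valid` is an illustrative filter encoding one real Vulkan rule (multisampled images must have a single mip level). It is not how the CTS selects its tests.

```python
from itertools import combinations, product

# Hypothetical test dimensions for clearing a color image.
DIMENSIONS = {
    "clear_path": ["LOAD_OP_CLEAR",
                   "LOAD_OP_DONT_CARE + vkCmdClearAttachments",
                   "vkCmdClearColorImage + LOAD_OP_LOAD"],
    "samples": [1, 4],
    "mip_level": [0, 1],
    "array_layer": [0, 1],
}


def is_valid(case):
    """Illustrative filter: a multisampled image has only mip level 0."""
    return not (case["samples"] > 1 and case["mip_level"] > 0)


def valid_candidates(dimensions):
    """Every combination of values that passes the validity filter."""
    names = list(dimensions)
    return [c for c in product(*dimensions.values())
            if is_valid(dict(zip(names, c)))]


def pairs_of(case):
    """All (dimension index, value) pairs one case exercises together."""
    return {((i, case[i]), (j, case[j]))
            for i, j in combinations(range(len(case)), 2)}


def pairwise_cases(dimensions):
    """Greedily pick cases until every achievable pair is covered."""
    candidates = valid_candidates(dimensions)
    uncovered = set().union(*(pairs_of(c) for c in candidates))
    chosen = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen


chosen = pairwise_cases(DIMENSIONS)
print(f"{len(chosen)} of {len(valid_candidates(DIMENSIONS))} valid cases selected")
for case in chosen:
    print(dict(zip(DIMENSIONS, case)))
```

Pairwise selection is only a starting point. It says nothing about sequences such as render, transition, then texture, so the implementation-specific edge cases described above still need tests written by someone who knows where the driver's paths diverge.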
Testing and validation aren't and never will be perfect. However, with a little care, we can get pretty close to a state of known behavior. As I said above, the state of Vulkan testing and validation today is miles ahead of where it was two and a half years ago. When our driver first shipped, it passed the entire CTS and couldn't render either of the two available games correctly. Today, users are constantly running random Vulkan applications that we (the driver team) have never seen before with good results. That's known behavior!