The Manifesto. CPU RightMark

The CPU RightMark (CPU RM) benchmark is designed for objective performance measurement of current and future CPUs in computational tasks such as physical process modelling and solving 3D graphics problems. The emphasis is on loading the CPU-RAM link and the FPU/SIMD units.

What is pure performance?

When applications are used to measure processor performance, the results are often affected by other subsystems such as video or storage, and sometimes by the operating system's interaction with peripherals. That is why, even when we compare otherwise identical systems with different processors, it is difficult to obtain a comparative assessment of pure processor performance rather than of overall hardware/software performance.

The CPU RM benchmark eliminates the influence of all subsystems except memory and the CPU-RAM bus. It does so by separately timing those parts of the test application whose running time does not depend on "external" tasks such as video page flipping or hard drive access. Only processor time is counted.
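The idea of timing only the compute parts of a frame can be sketched as follows. This is a minimal illustration, not code from CPU RM: ComputeTimer, compute_frame, present_frame and run_benchmark are hypothetical names, and the sketch uses a wall-clock std::chrono::steady_clock around the compute section as a stand-in for whatever timer the benchmark actually uses.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Accumulates time spent only inside explicitly marked compute sections.
class ComputeTimer {
public:
    using Clock = std::chrono::steady_clock;

    void start() { begin_ = Clock::now(); }
    void stop() { total_ += Clock::now() - begin_; }

    // Total accumulated time in seconds.
    double seconds() const {
        return std::chrono::duration<double>(total_).count();
    }

private:
    Clock::time_point begin_{};
    Clock::duration total_ = Clock::duration::zero();
};

// Hypothetical stand-in for the CPU-bound part of a frame.
double compute_frame(const std::vector<double>& data) {
    double sum = 0.0;
    for (double v : data) sum += v * v;
    return sum;
}

// Hypothetical stand-in for "external" work (page flip, disk access).
void present_frame(double /*result*/) {}

// Runs `frames` iterations and returns the time spent in compute_frame only.
double run_benchmark(std::size_t frames, const std::vector<double>& data) {
    ComputeTimer timer;
    double sink = 0.0;
    for (std::size_t i = 0; i < frames; ++i) {
        timer.start();
        sink += compute_frame(data);  // timed
        timer.stop();
        present_frame(sink);          // excluded from the measurement
    }
    return timer.seconds();
}
```

Calling run_benchmark(100, data) returns only the time accumulated between start() and stop(), so whatever present_frame costs does not enter the result.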

What is the purpose of pure performance?

Why do we need pure performance results if they differ from those obtained in real applications anyway? For example, what is the use of pure performance in graphics applications if overall performance depends mostly on the graphics card? A processor still has to be chosen to match the graphics card. As an example, let's estimate 3D game performance at 1024x768x32bpp and the highest detail level. Assume that systems with 2GHz and 1GHz processors show identical results. We would then choose the 1GHz processor because it is cheaper. But when a new game arrives, the CPU becomes the bottleneck at any resolution, because this game uses, for example, sophisticated object visibility algorithms, dynamic levels of detail and a very complicated, realistic physics model.

Overall system performance is a function of the performance of the processor and of the other subsystems. That is why it is necessary to measure pure processor performance, unaffected by any other system component or software, in order to understand the pros and cons of different processors.

Another example: the performance of a real application may depend on hard drive performance, so that different processors in such a system show approximately equal results. But as RAM capacity grows over time and today's hard drive becomes obsolete, this minor difference becomes a major one.

Furthermore, the relative difference in overall system performance may be smaller than the relative difference in pure processor performance, but it cannot be larger.

For example, identical systems with different processors will show approximately equal results in a graphics application. But when new drivers are released that fix certain bugs or provide greater AGP throughput, graphics subsystem performance in this application will increase, and the difference between the processors will become much more noticeable.
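The argument above can be checked with a simple model. The sketch below assumes that total run time is the sum of a CPU-bound part and the time spent in all other subsystems, and that only the CPU-bound part changes between two processors. Both the additive model and the function names are illustrative assumptions, not part of CPU RM.

```cpp
// Additive model: total time = CPU-bound time + time in other subsystems.
double total_time(double cpu_time, double other_time) {
    return cpu_time + other_time;
}

// Speedup of system B over system A when only the CPU-bound part differs.
// With other_time == 0 this equals the pure processor speedup
// cpu_time_a / cpu_time_b; any positive other_time pulls it towards 1.
double system_speedup(double cpu_time_a, double cpu_time_b, double other_time) {
    return total_time(cpu_time_a, other_time) / total_time(cpu_time_b, other_time);
}
```

For instance, with a CPU-bound part of 10 s on the slower processor and 5 s on the faster one, the pure processor speedup is 2. If the graphics subsystem adds 15 s to both runs, system_speedup(10, 5, 15) gives 25 / 20 = 1.25. As the other subsystems get faster and that 15 s shrinks, the ratio approaches 2 but never exceeds it.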

But how can we measure pure processor performance in graphics applications if they are crammed with graphics driver calls? A synthetic test showing how many multiplications, additions and other arithmetic operations per second a processor can perform won't do. It does not account for the branching of real tasks, branch prediction or pipeline optimizations. Besides, one would also need to determine the instruction mix of the application.

But in fact, any 3D graphics application, be it a 3D modeller or the Quake engine, executes a geometry task. Most likely, some elements of it, e.g. triangle texturing, will be offloaded to a 3D accelerator. However, graphics optimization algorithms (which avoid overloading the accelerator with invisible objects, etc.) are quite complicated and require considerable CPU performance.

Interestingly, many AI, route search and other optimization algorithms are largely geometrical as well.

Most scientific tasks (geometry, statistics, modelling) ultimately break down into simple operations: in particular, scalar vector multiplication, vector norm calculation, and matrix multiplication and addition, interleaved with algorithmic branching. These operations are repeated constantly in any 3D graphics application. Moreover, they produce most of the computational load.
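The operations listed above are small enough to sketch in a few lines. The code below is a generic illustration and is not taken from the CPU RM source. The type and function names are assumptions, and "scalar vector multiplication" is shown in both possible readings (scaling a vector and the dot product). The back-face test at the end stands in for the branching that alternates with the arithmetic.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

// Multiplies every component of a vector by a scalar.
Vec3 scale(double s, const Vec3& v) {
    return {s * v[0], s * v[1], s * v[2]};
}

// Scalar (dot) product of two vectors.
double dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Euclidean norm of a vector.
double norm(const Vec3& v) {
    return std::sqrt(dot(v, v));
}

// 3x3 matrix product c = a * b.
Mat3 multiply(const Mat3& a, const Mat3& b) {
    Mat3 c{};  // value-initialized to zeros
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            for (std::size_t k = 0; k < 3; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Typical branch in a culling step: a surface is a candidate for
// drawing only if its normal points towards the viewer.
bool faces_viewer(const Vec3& normal, const Vec3& to_viewer) {
    return dot(normal, to_viewer) > 0.0;
}
```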

CPU RightMark measures the execution time of geometrical calculations, excluding graphics driver calls. The VirtualRay graphics engine used for visualization is implemented entirely in software, uses no 3D accelerator hardware features, and is based on ray tracing, so it does not perform the triangle texturing operations that are usually offloaded to a 3D accelerator. CPU RM carries out only the culling/sorting calculations typical of geometrical applications.

Correct performance measurement of the latest processors requires a test that supports all their innovations, such as additional instructions and architectural features. Otherwise, it cannot adequately measure performance in new applications. CPU RightMark meets all these requirements, and thanks to its open-source nature the benchmark can be modified at any time to support new processor features.

CPU RM has two components: physical model calculation and scene rendering. Each component has variants optimized for different processor instruction sets. Physical model calculation has two versions, one using SSE2 and the other the FPU, since the calculation operates on values of type double.
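To illustrate what two variants of the same double-precision calculation can look like, here is a hedged sketch of one hypothetical physics step, a position update pos[i] += dt * vel[i], written once as a plain scalar loop and once with SSE2 intrinsics that process two doubles per instruction. The function names and the choice of step are assumptions for illustration and are not taken from the CPU RM source. The SSE2 path falls back to the scalar loop when SSE2 is not available at compile time.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>  // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Scalar variant: one double per iteration.
void integrate_scalar(double dt, const double* vel, double* pos, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) pos[i] += dt * vel[i];
}

// SSE2 variant: two doubles per iteration, with a scalar tail for odd n.
void integrate_sse2(double dt, const double* vel, double* pos, std::size_t n) {
#if SKETCH_HAVE_SSE2
    const __m128d vdt = _mm_set1_pd(dt);  // broadcast dt to both lanes
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(vel + i);  // unaligned loads keep the
        __m128d p = _mm_loadu_pd(pos + i);  // sketch free of alignment rules
        p = _mm_add_pd(p, _mm_mul_pd(vdt, v));
        _mm_storeu_pd(pos + i, p);
    }
    for (; i < n; ++i) pos[i] += dt * vel[i];  // remaining element, if any
#else
    integrate_scalar(dt, vel, pos, n);  // no SSE2 on this target
#endif
}
```

Note that on x86-64 targets compilers typically translate the scalar loop into scalar SSE2 instructions rather than x87 FPU code, so reproducing a true FPU variant would require a 32-bit x87 build.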

The rendering unit also has two parts: preliminary scene calculation and tracing of the rays that correspond to the pixels of the resulting image. The ray tracing part is written in assembly language and uses SSE optimizations with hardly any FPU instructions. The preliminary scene calculation part is written in C++ and features complicated algorithms. This separation follows the trends in how realtime rendering tasks are implemented.

Measuring the performance of different parts of the application separately makes it possible to estimate CPU performance in various applications. It also makes it possible to test the quality of an instruction set implementation (e.g. SIMD).

The test is reasonably optimized for each instruction set. The optimization is not exhaustive; it is what was achieved in a real development process over a reasonable period of time. The part of the code that uses FPU instructions is compiled with MS VC++ 6.0, the most popular compiler for Win32 games. The part that uses SSE2 instructions is compiled with Intel C++ 5.0 with full optimization, because Intel can be expected to offer the most effective code generation tool for its own processors.

Thanks to high-precision measurement, it takes less than a minute to obtain stable, repeatable results. The CPU RM benchmark produces results proportional to the CPU clock rate (with a small correction for memory), which is what we are aiming for. Because the test application's data is mostly cached effectively, memory performance does not affect the results critically. This makes it possible to test processor performance without considering memory efficiency, which is useful because memory types change.

The test source code is freely available to everybody. Moreover, we welcome your ideas for improving and developing this test, and we are always eager to discuss them and help you implement them.

Make RIGHT things! Join us!