In this post I’ll talk more about the upcoming software renderer, including progress, performance and future plans. Note that this work is happening in parallel with the version 0.20 work, so the software renderer won’t be available until after the next build.
As I’ve been working on the next release, I occasionally fiddle with the software renderer. Currently I’m writing it in a separate test app, to avoid integration issues until after version 0.20 is released. Until recently, screenshots would have been really boring since it was all boilerplate code, but I’ll go ahead and show the first screenshot now:
Statistics for images shown:
Resolution: 960×600.
Current features:
- 32 bit renderer (8 bit will also be supported with all the palette effects)
- Perspective correct texturing
- 16 bit Z-Buffer (will probably be upgraded to 32 bit)
- Sub-pixel/sub-texel accuracy (no wobbling edges or textures)
- Per polygon color, used for lighting (supports RGB, only intensity shown here)
- Z-Fogging using a fog table.
- Clipping.
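As a rough illustration of what perspective correct texturing means, here is a minimal sketch (not the renderer's actual code). The names `SpanEnd`, `mix`, `affineU` and `perspectiveU` are made up for this example. The idea is that `u/z` and `1/z` vary linearly across the screen, so those are interpolated and then divided to recover `u`:

```cpp
// Hypothetical span endpoint: a texture coordinate u and a view-space depth z.
struct SpanEnd { float u; float z; };

// Plain linear interpolation between a and b, t in [0, 1].
inline float mix(float a, float b, float t) { return a + (b - a) * t; }

// Affine interpolation: cheap, but textures appear to swim under perspective.
inline float affineU(const SpanEnd& a, const SpanEnd& b, float t) {
    return mix(a.u, b.u, t);
}

// Perspective correct interpolation: interpolate u/z and 1/z linearly in
// screen space, then divide to get u back.
inline float perspectiveU(const SpanEnd& a, const SpanEnd& b, float t) {
    const float invZ   = mix(1.0f / a.z, 1.0f / b.z, t);
    const float uOverZ = mix(a.u / a.z, b.u / b.z, t);
    return uOverZ / invZ;
}
```

For a span running from `u = 0` at depth 1 to `u = 1` at depth 3, the screen-space midpoint maps to `u = 0.25` rather than the affine `0.5`, because the far half of the span covers more texture per pixel.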
And of course it supports 2D blits (including arbitrary scaling):
Currently performance isn’t where I want it to be for multiple reasons:
- It’s using GDI for the final blit which can be slow. Obviously it should use DirectDraw and/or Direct 3D when available (just for the final blit though).
- The bottleneck is the actual rasterization. In the future this will be optimized using SSE/SSE2 when available, but this won’t occur until it has all the base features.
- SSE/SSE2 optimizations for other parts of the pipeline as needed (transform, clipping, etc.).
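To give a feel for what an SSE path looks like, here is a small sketch that computes the depth of four neighbouring pixels on a scanline in one go, with a scalar fallback. It is only an illustration under assumptions: `stepDepth4` and the `RENDERER_HAVE_SSE` macro are hypothetical names, and the check here happens at compile time, whereas using SSE only "when available" on the user's machine would also need a runtime CPU check that is not shown.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define RENDERER_HAVE_SSE 1   // hypothetical macro for this sketch
#else
#define RENDERER_HAVE_SSE 0
#endif

// Computes out[i] = z0 + dz * i for i = 0..3, i.e. the interpolated depth
// of four consecutive pixels along a scanline.
inline void stepDepth4(float z0, float dz, float out[4]) {
#if RENDERER_HAVE_SSE
    // _mm_set_ps takes lanes high to low, so lane 0 holds 0.0f.
    const __m128 lane = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
    const __m128 z = _mm_add_ps(_mm_set1_ps(z0),
                                _mm_mul_ps(_mm_set1_ps(dz), lane));
    _mm_storeu_ps(out, z);    // unaligned store, out needs no special alignment
#else
    // Scalar fallback producing the same four values.
    for (int i = 0; i < 4; ++i) out[i] = z0 + dz * static_cast<float>(i);
#endif
}
```

Calling `stepDepth4(1.0f, 0.5f, out)` fills `out` with 1.0, 1.5, 2.0 and 2.5 on either path.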
Some of the additional features required before integration (post-0.20):
- Dynamic lights (using the proper radial attenuation).
- Mipmapping (probably with mip selection done per scanline).
- Various pipeline improvements, backface culling and so on.
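Per scanline mip selection could look roughly like the sketch below. This is an assumption about one possible approach, not the planned implementation: `selectMipLevel` and its `texelsPerPixel` input (an estimate of how many texels one screen pixel covers, computed once per scanline) are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

// Picks a mip level from the texel-to-pixel ratio estimated for a scanline.
// Level 0 is the full-size texture; each level halves the resolution.
inline int selectMipLevel(float texelsPerPixel, int mipCount) {
    // Magnified or 1:1 mapping: use the full-size texture.
    if (texelsPerPixel <= 1.0f) return 0;
    // Each doubling of the ratio moves one level down the mip chain.
    const int level = static_cast<int>(std::floor(std::log2(texelsPerPixel)));
    // Clamp to the smallest mip that actually exists.
    return std::min(level, mipCount - 1);
}
```

Doing this once per scanline instead of once per pixel keeps the logarithm out of the inner loop, at the cost of a visible level change between scanlines on steeply angled surfaces.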
Finally I plan on adding hierarchical z-buffer based occlusion culling – at the batch, polygon and maybe even span level in order to reduce overdraw to near zero. This, combined with an optimized rasterizer, should allow the software renderer to run at very high resolutions on modern CPUs. This is more practical with software rendering (or at least simpler to get right) due to the lack of latency and also the fact that it handles small batches very well, which means that rendering can be sorted from front to back at a finer granularity without hurting batch performance or state switching performance.
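The basic idea behind a hierarchical z-buffer can be sketched with just two levels: the full-resolution depth buffer and a coarse level that stores the farthest depth in each tile. Everything in this example is an assumption made for illustration (the `HiZ` name, the 8×8 tile size, smaller z meaning nearer, and rebuilding the coarse level from scratch rather than updating it incrementally):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical two-level depth hierarchy over a 16 bit z-buffer.
// Convention assumed here: smaller z is nearer to the camera.
struct HiZ {
    static const int kTile = 8;          // tile size in pixels (assumed)
    int tilesX, tilesY;
    std::vector<std::uint16_t> tileMaxZ; // farthest depth stored per tile

    // Builds the coarse level from a full-resolution depth buffer.
    HiZ(const std::vector<std::uint16_t>& depth, int width, int height)
        : tilesX((width + kTile - 1) / kTile),
          tilesY((height + kTile - 1) / kTile),
          tileMaxZ(static_cast<std::size_t>(tilesX) * tilesY, 0) {
        for (int y = 0; y < height; ++y)
            for (int x = 0; x < width; ++x) {
                std::uint16_t& t = tileMaxZ[(y / kTile) * tilesX + (x / kTile)];
                t = std::max(t, depth[y * width + x]);
            }
    }

    // True when a screen rectangle (inclusive bounds, already clipped to the
    // screen) whose nearest depth is nearestZ is at or behind everything
    // already drawn in every tile it touches, so it can be skipped.
    bool isOccluded(int x0, int y0, int x1, int y1, std::uint16_t nearestZ) const {
        for (int ty = y0 / kTile; ty <= y1 / kTile; ++ty)
            for (int tx = x0 / kTile; tx <= x1 / kTile; ++tx)
                if (nearestZ < tileMaxZ[ty * tilesX + tx]) return false;
        return true;
    }
};
```

The same test works for a batch's bounding rectangle, a single polygon or a span, which is why drawing front to back pays off: the earlier nearby geometry fills in the coarse level and later, farther geometry gets rejected after checking only a handful of tiles.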
Ultimately the goal is 60fps at 1024×768 on a 1 GHz CPU. Once the final blit is fixed (i.e. not taking 11ms at 1680×1050 because of GDI), I’ll be close to that goal, but I still have to make it faster using the above methods in order to leave time for gameplay too (collision detection, AI, combat, etc.).
SSE will only be used when available, potentially widening the CPU support if you run at low resolutions (e.g. 320×200 or 640×480).
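To put those numbers in perspective, the arithmetic can be written out as a small sketch (the helper names `frameBudgetMs` and `cyclesPerPixel` are made up for this example):

```cpp
// Milliseconds available per frame at a target frame rate.
constexpr double frameBudgetMs(double fps) { return 1000.0 / fps; }

// Average CPU cycles available per screen pixel per frame, ignoring
// overdraw and everything that is not rendering.
constexpr double cyclesPerPixel(double clockHz, int width, int height, double fps) {
    return clockHz / (static_cast<double>(width) * height * fps);
}
```

At 60fps the whole frame has about 16.7ms, and a 1 GHz CPU at 1024×768 gets roughly 21 cycles per pixel, before any overdraw or gameplay code is accounted for.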
Hopefully this will allow anyone with a semi-modern computer (i.e. a computer purchased within the last 11 years) to play DaggerXL at good framerates, at least with the basic feature set. In addition the hardware renderer will go through another round of optimizations, as well as general program speed improvements and loading improvements upon start up.


April 4, 2011 at 12:49 pm
GDI should not be a cause of low performance if you use SetDIBits(). I got over 1000 fps when blitting 800×600 images.
SSE and SSE2 optimizations are usually applied at the compilation stage, so the additional gains will be lower than expected.
April 4, 2011 at 1:30 pm
I’ve seen significant savings from manual use of SSE in the past. Compilers do a decent job but there are certain optimizations they just will not do. Of course I’m referring to the inner-most loops where performance is critical. In addition, with the use of intrinsics, the compiler is still capable of performing whole program optimizations and proper register allocation, giving you the best of both worlds.
Of course this doesn’t mean that it’s easy. Manual SSE usage, via inline assembly or intrinsics, doesn’t magically make the code faster than what the compiler can produce.
As for GDI – yes it’s fine at lower resolutions and even at higher resolutions it’s ok… but it still takes too much time. In addition the performance varies with OS and drivers. For example, on my Vista system GDI performs quite well, whereas XP has much worse performance. However, Direct Draw / DirectX blitting performance is pretty even across those systems.
April 4, 2011 at 3:25 pm
Actually VC++ won’t ever generate a single SSE instruction from your code no matter what options you select. Those options just toggle use of SSE instructions in the runtime. The only way to ever get this to happen is to do it yourself. With other compilers it is generally true as well, or the case where it does is so narrow you will never reach it.
April 4, 2011 at 2:24 pm
I felt somewhat bad…
At not knowing some of the lingo you were using in earlier updates, but this one takes the cake. I’ve got no idea what you’re talking about! In any case, it’s still exciting to see that you’ve got time again for working on the project.
April 4, 2011 at 3:24 pm
Sorry about that.
What would you guys like to see from these blog posts?
April 7, 2011 at 8:10 pm
I actually enjoy the technical discussion. These posts are a good way to get some exposure on what it takes to build a custom game engine.
April 4, 2011 at 4:22 pm
I like that you do more technical posts once in a while. It’s hard to find decent blogs from developers who aren’t either raging assholes or boring business devs.
April 4, 2011 at 5:43 pm
I say that as both a boring business dev and an asshole.
April 4, 2011 at 7:17 pm
If the intent is to eventually port this, wouldn’t it be better to use OpenGL or something instead of Direct 3D?
April 5, 2011 at 12:22 am
Simply put, I prefer Direct3D. In addition some GPUs have under-performing OpenGL drivers on Windows, so this will allow it to run better on more hardware. That said, there will also be an OpenGL renderer and of course the software renderer discussed above. Honestly, the easiest route would have been to just stick with OpenGL only – since the bad OpenGL drivers aren’t that bad anymore (performance wouldn’t be as good but probably acceptable) but since this is a hobby project that isn’t the route I decided to take.
April 4, 2011 at 9:36 pm
I’m not much of a programmer, but even I know what a massive undertaking writing a software renderer is. You didn’t even need to do this. You could have just said, “Modern GPUs only. Tough luck, everyone else.” But you didn’t. This is truly a testament to you as a programmer, and in the end those with less than stellar PCs will thank you a thousand times over.
April 4, 2011 at 11:05 pm
I really like the technical posts. Although I don’t “get” all the super technical stuff, I do get the gist of it and it’s like a peek down a road I didn’t take (but could have). More importantly, it shows the sheer thought and complexity a project such as this entails. It’s a nice reminder for those that may be impatient
April 5, 2011 at 12:55 pm
It’s really nice to see you think about low-end computers! And of course even better to see that you’re working steadily on the project, which is a TERRIFIC project
Waiting for 0.2 like a maniac >.<
April 7, 2011 at 11:24 am
Hey, finally a dev thinks about slow HTPCs.
Some people still have VIA C3/C7 CPUs.
HL2 deprecated the software renderer and OpenGL too, what a bad idea!
April 7, 2011 at 3:02 pm
hello lucius! please, could you give just a rough (however inexact) estimate of the time needed to finish version 1.0?
although i very much appreciate your work on the remake of the most atmospheric game ever, i’m an older man and a bit worried about not living to see the complete remake of the jewel.
:¨c)
April 8, 2011 at 6:16 am
That’s a gloomy thing to read my friend… Hang on!
April 7, 2011 at 6:20 pm
Excellent news.
April 10, 2011 at 9:41 pm
Not sure how much good SSE will do you.
I would guess most of the cycles are spent doing scanlines. If you interpolate floats, FIST:ing will eat the cycles.
If you use a fixed point representation, you must write your own division routine in asm.
At least VS2010 vomits a horrible standard AllDiv implementation for a 64 bit variable divided by a 32 bit one on x86.
Not to mention the fact that it refuses to inline it even with /Ox.
You could get around it by going X64, but the “low end hardware” idealism would go straight out of the window.
April 12, 2011 at 11:25 pm
Truly spectacular work! But please don’t forsake the highly anticipated Darkxl beta!