Tuesday, November 10, 2009

My new GPU accelerated NES emulator

So I finally set about dumping the tile rendering and sprite rendering tasks for my emulator onto the GPU, and ran into some interesting hurdles along the way.

I've been laying the groundwork for this for a while, rearranging how CHR ROM/RAM work with an eye toward offloading them to the GPU.

For tiles, the basic routine is:

The emulator sends CHR ROM/RAM to the video card as a texture.

The emulator creates a 'pixel information map' which it sends to the video card, also as a texture.

The pixel information map is essentially a texture in which each texel represents an onscreen pixel, but rather than send the decoded color data, I fill the values with the information I will need to properly decode the pixel. It's a sort of deferred shading for NES. A pixel information "pixel" has 16 bit components (d3d surface format R16G16B16A16_UNorm) and looks like this:

a: ppuByte0/ppuByte1 - 8 bits each. I pack the current values of the bytes representing these two registers.
r: CurrentXScroll, CurrentVScroll - 8 bits each
g: CurrentPaletteID - 8 bits for palette ID, which identifies which shadowed palette this pixel is associated with (more on this later), 2 bits to identify the current nametable (0-3), 6 bits for future expansion
b: CurrentBankSwitchID - I allow this to have the full 16 bits.
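The packing above can be sketched like so (a minimal Python illustration; the function and parameter names are my own, not from the emulator):

```python
def pack_pixel_info(ppu_byte0, ppu_byte1, x_scroll, v_scroll,
                    palette_id, nametable, bank_switch_id):
    """Pack per-pixel decode state into one R16G16B16A16 texel."""
    a = (ppu_byte0 << 8) | ppu_byte1          # two PPU register bytes, 8 bits each
    r = (x_scroll << 8) | v_scroll            # current scroll values, 8 bits each
    g = (palette_id << 8) | (nametable << 6)  # palette ID, nametable (0-3), 6 spare bits
    b = bank_switch_id & 0xFFFF               # bankswitch ID gets the full 16 bits
    return (r, g, b, a)

texel = pack_pixel_info(0x90, 0x1E, 12, 0, 3, 1, 257)
```

On the GPU side the same bit layout is unpacked per pixel after denormalizing each channel back to a 16-bit integer.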

To decode chrRom, I use an array of 16 ints which represent the start position of each 1k segment. As bankswitches happen during frame rendering, I push these arrays into a stack, and it is that stack which is sent to the GPU each frame. This way the chrRom only needs to be sent once, and can be 'mapped in' to nametable or pattern table ram virtually, on the graphics device.

There could be hundreds or even thousands of bankswitches during a frame; it all depends on whatever shenanigans the cart's mapper can pull. If the cart switched on every pixel, that would be 61440 entries.
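A sketch of that shadow stack, assuming 16 one-kilobyte segments as described (class and method names are hypothetical):

```python
class BankSwitchShadow:
    """Snapshots the CHR segment mapping every time a bankswitch occurs."""

    def __init__(self):
        self.current = [i * 1024 for i in range(16)]  # start offset of each 1k segment
        self.stack = []                               # one snapshot per bankswitch

    def switch(self, segment, new_offset):
        # Push a copy of the whole mapping so the GPU can resolve any
        # mid-frame state; the returned index goes into the pixel info map.
        self.current[segment] = new_offset
        self.stack.append(list(self.current))
        return len(self.stack) - 1

shadow = BankSwitchShadow()
bank_id = shadow.switch(4, 0x2000)
```

At the end of the frame, `stack` is uploaded as a texture; the shader uses each pixel's bankswitch ID to pick the right row.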

Palette RAM is shadowed via a similar scheme: every time it is written to during a frame, all 32 entries are pushed onto a stack, and the 'current palette' index goes to the video card along with the palette stack as a texture. This doesn't happen as often as bankswitching, and when it does there's not a lot of writing going on, since the programmer has to disable the NES PPU during rendering to be able to fiddle with palette RAM. It is done in some more advanced games and in a lot of demos, however.
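The palette shadow follows the same pattern (again a hypothetical sketch; 32 entries, 6-bit color values):

```python
class PaletteShadow:
    """Snapshots all 32 palette RAM entries on every mid-frame write."""

    def __init__(self):
        self.current = [0] * 32              # NES palette RAM: 32 entries
        self.stack = [list(self.current)]    # snapshot 0 is the frame-start state

    def write(self, addr, value):
        self.current[addr & 0x1F] = value & 0x3F  # 6-bit NES color index
        self.stack.append(list(self.current))
        return len(self.stack) - 1           # current palette ID for the pixel map

pal = PaletteShadow()
palette_id = pal.write(0x00, 0x22)
```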

Nothing happens for tiles in the vertex shader.

The pixel shader's algorithm is like so:
The pixelInfo, paletteShadows, and bankSwitchShadow textures must all be sampled in point mode! Interpolating user-specific data like that is not going to work.

1) Fetch and decode the pixel information texel. Remember your data is normalized into the 0-to-1 space, so you need to make it an integer again.

2) Check ppuByte1 bit 3; if it is clear, set the pixel index to 0.

3) Otherwise, get the current nametable, apply the X and Y scroll values, and calculate the tile index.

4) Fetch the two bytes representing the tile's line for this Y coordinate, and extract from each byte the bit for this X position; these form the pixel's lower two bits.

5) Fetch the associated attribute byte, shift it left by 2, and OR it with the pixel.

6) I now have a 4-bit pixel representing an entry in the NES's internal palette. Look that up in the palette texture, based on its current palette ID.

7) The result is now an HSV value in the NES's own special format, which is converted to RGB using a function I wrote called DecodePixel: it decodes the NES's HSV information, gets its color information, converts it into YIQ format, and then finally to RGB.
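Steps 2 through 6 can be sketched CPU-side like this (the shader does the same work per pixel). The array layouts and names here are my own illustration, and the attribute-quadrant selection spells out the full NES behavior that step 5 summarizes:

```python
def decode_tile_pixel(x, y, ppu_byte1, x_scroll, y_scroll,
                      nametable, chr_rom, attributes, pattern_base=0):
    """Return the 4-bit palette index for the background pixel at (x, y)."""
    if not (ppu_byte1 & 0x08):              # step 2: background disabled
        return 0
    sx, sy = (x + x_scroll) & 0xFF, (y + y_scroll) & 0xFF
    tile = nametable[(sy // 8) * 32 + (sx // 8)]            # step 3: tile index
    lo = chr_rom[pattern_base + tile * 16 + (sy % 8)]       # step 4: bitplane 0
    hi = chr_rom[pattern_base + tile * 16 + (sy % 8) + 8]   #         bitplane 1
    bit = 7 - (sx % 8)
    pixel = ((lo >> bit) & 1) | (((hi >> bit) & 1) << 1)    # lower two bits
    attr = attributes[(sy // 32) * 8 + (sx // 32)]          # step 5: attribute byte
    shift = ((sy // 16) & 1) * 4 + ((sx // 16) & 1) * 2     # pick the quadrant
    return pixel | (((attr >> shift) & 3) << 2)             # step 6: 4-bit index
```

Step 7's lookup then maps this 4-bit value through the shadowed palette for the pixel's palette ID.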

This last step does what the NES does, though it may seem redundant when a simple palette lookup would get you to the same place. There are good reasons to let it walk through the color spaces, however. For one, this allows me to tweak color, tint, saturation, brightness, and contrast, just like I would on a real old-timey TV set. It also allows any other image-processing effects to take place in these spaces. Bloom and lots of popular modern-day effects start with a luminance map, and I'm essentially getting one for free.
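The final leg of that color-space walk looks roughly like this, using the standard NTSC YIQ-to-RGB matrix (the NES-specific hue/luma decode that produces Y, I, and Q is omitted, and the coefficients vary slightly between references):

```python
def yiq_to_rgb(y, i, q):
    """Convert a YIQ triple to clamped RGB using approximate NTSC coefficients."""
    r = y + 0.956 * i + 0.621 * q
    g = y - 0.272 * i - 0.647 * q
    b = y - 1.106 * i + 1.703 * q
    return tuple(min(1.0, max(0.0, c)) for c in (r, g, b))
```

The TV-style knobs fall out naturally in this space: rotate (I, Q) for tint, scale them for saturation, and scale Y for brightness, before converting to RGB.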


Note on performance: tiles aren't too bad, but there is a lot of texture sampling going on: 1 for the pixel information, 2 for the pattern entries, 1 for the attribute byte, and 1 for the current palette. The only flow control needed is the test for whether tiles are to be drawn at all, and the 5 texture samples aren't a big deal at all.


Sprites are a whole other story, and I'll tell it a whole other time. In short, a per-pixel approach gives me no choice but to evaluate all 64 sprites to find the one visible at that pixel. That's up to 128 texture fetches for the pattern entries alone, to say nothing of another 64 for the attribute bytes. It can be done, but for frig's sakes this isn't Crysis, and I want to leave some GPU left over for the purposes of pure, raw zazz. Not only is this a perf problem, there's also no way for it to be technically correct, as there's no way to "only evaluate 8 sprites per scanline". The best I could do is evaluate 8 per pixel, but the pixel next to it could have a whole other 8 sprites.

I'm still working on the sprites routine, and I'll follow up with that once it works.

Sprite and tile pixel fetching were the two biggest heavy hitters in my emulation, so I'm eager to get this working so I can slice and dice out all of the legacy code and see how much more badass it is to use 320 vector processors instead of one dumpy old Pentium 4.
