Indie developer of small games featuring small animals, small spaceships and big fun!
12 Jan 2021
Link (seems better for mobile)
Demo showing the combination of the stars and fade effect I used in Pico Space.
[UPDATE 2021-4-12: 8bit and 16bit cached modes (explanation below and in code), some other small tweaks]
[UPDATE 2021-5-6: added interleaved 8bit mode]
The stars are just simple particles that have x,y,z coordinates.
In this demo I use a couple of sin functions to give them some movement combined with a divide by the z coord for a bit of parallax. In game, I feed in the player's position.
Then I clamp the resulting x,y values to the screen with the modulus operator so they're always visible (%128). It does mean that the same stars go past constantly, but otherwise I was processing a lot of particles that don't get seen very often (not aiming for realism here).
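The projection and wrap described above can be sketched outside PICO-8. This is a minimal Python model, not the cart's code: `star_screen_pos` and the `cam_x`/`cam_y` parameters are names assumed for illustration, standing in for the sin offsets in the demo or the player's position in game.

```python
def star_screen_pos(star, cam_x, cam_y):
    """Project a star with x, y, z coordinates onto a 128x128 screen.

    Dividing by z gives parallax (near stars move further per camera unit);
    the modulus wraps the result so every star is always on screen.
    """
    sx = ((star["x"] - cam_x) / star["z"]) % 128
    sy = ((star["y"] - cam_y) / star["z"]) % 128
    return sx, sy


if __name__ == "__main__":
    near = {"x": 1000, "y": 2000, "z": 1}
    far = {"x": 1000, "y": 2000, "z": 10}
    for cam in (0, 10):
        print(cam, star_screen_pos(near, cam, 0), star_screen_pos(far, cam, 0))
```

Moving the camera by 10 units shifts the near star by 10 pixels and the far star by only 1, which is the parallax effect.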
The fade works by mapping the colour of every pixel on the screen to another colour that tends towards a target, e.g. 1 (dark blue) -> 0 (black).
You can use a similar mapping with the pal(x,1) function to do fades to black between screens etc. but that fades everything including anything else drawn that frame.
In this demo I process the pixels already in screen memory so that the screen is faded by a step, then draw fresh stars on top of that.
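To see how a mapping "tends to a target", here is a small Python sketch that follows one colour through repeated fade steps. `FADE_MAP` is copied from the first entry of `g_scpal_map` in the cart; the helper name `fade_chain` is an assumption made for this example.

```python
# First mapping from the demo's g_scpal_map: index = current colour,
# value = the colour it becomes after one fade step. 0 (black) is the target.
FADE_MAP = [0, 0, 1, 1, 2, 1, 13, 10, 2, 4, 9, 3, 15, 5, 4, 1]


def fade_chain(colour, mapping=FADE_MAP):
    """Return the colours visited until the mapping stops changing."""
    chain = [colour]
    # The length guard stops a mapping that contains a cycle from looping forever.
    while mapping[chain[-1]] != chain[-1] and len(chain) <= len(mapping):
        chain.append(mapping[chain[-1]])
    return chain


if __name__ == "__main__":
    for c in range(16):
        print(c, fade_chain(c))
```

With this mapping every colour ends at 0, and the longest route (starting from white, 7) takes six steps.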
It's pretty expensive to do the whole screen like this (IIRC about 90% of performance at 60fps) so I've set it up to do every fourth scan line, starting from a different point each frame. Effectively a quarter of the screen is faded at a time. It takes 4 frames to fade the whole screen one step.
I initially tried fading in quarter strips top to bottom, but the tearing on bigger objects like planets looked pretty bad.
Using the order 0,2,1,3 for the scanlines does some rough dithering to make the effect look a bit more uniform. A random value flr(rnd(4)) works quite well too, but is messier looking.
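A quick way to check the scanline schedule is to model it in Python. This sketch assumes the layout used by `scr_fade` in the cart (screen memory at 0x6000, 64 bytes per row); `rows_for_frame` and `row_address` are names invented for the example.

```python
DITH = [0, 2, 1, 3]  # start row for frames 0, 1, 2, 3


def rows_for_frame(p):
    """Rows faded on frame p: every fourth row, starting at DITH[p % 4]."""
    return list(range(DITH[p % 4], 128, 4))


def row_address(row):
    """Address of the first byte of a screen row (64 bytes per row)."""
    return 0x6000 + row * 64


if __name__ == "__main__":
    seen = []
    for p in range(4):
        rows = rows_for_frame(p)
        print(p, rows[:4], hex(row_address(rows[0])))
        seen += rows
    print(sorted(seen) == list(range(128)))
```

Over any four consecutive frames each of the 128 rows is visited exactly once, 32 rows per frame.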
Since I found using poke4 to work on 8 pixels at a time was fastest (not surprising really) dithering horizontally is limited and isn't in the demo. Nevertheless, I keep meaning to try a "Z" pattern i.e.
0000000011111111
2222222233333333
I'm concerned it might cost too much more in performance/tokens for too little visual improvement.
Of course, as soon as I write about it the old subconscious starts working away and it takes 5 minutes to implement just that - a reverse N pattern as it turns out. Same performance, same tokens. See the new cart.
The effect works fine by extracting each pixel's colour value via shifting and masking then dumping the mapped values back onto the screen, but it's still pretty performance heavy.
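The shift-and-mask approach can be modelled in Python by treating the 32-bit value read from the screen as a plain unsigned integer holding eight 4-bit pixels. This is a sketch of the idea only: the cart does the same thing in a single unrolled expression, and `fade_word` is a name assumed here.

```python
# First mapping from the demo's g_scpal_map (colour -> faded colour).
FADE_MAP = [0, 0, 1, 1, 2, 1, 13, 10, 2, 4, 9, 3, 15, 5, 4, 1]


def fade_word(v, mapping=FADE_MAP):
    """Fade all eight 4-bit pixels packed into the 32-bit value v."""
    out = 0
    for shift in range(0, 32, 4):
        pixel = (v >> shift) & 0xF       # extract one pixel
        out |= mapping[pixel] << shift   # put the mapped pixel back in place
    return out


if __name__ == "__main__":
    print(hex(fade_word(0x71717171)))
```

Eight shifts, eight masks and eight table lookups per 32-bit value is where the cost comes from.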
When I was writing PICO Space I'd read a few times that procedurally generated content used a lot of memory so I didn't want to try anything like the following, but now I have a much better idea of the game's memory requirements I thought I'd give it a go.
Pixels in PICO-8's screen are determined by a 4-bit value, but peeking and poking only works with 8-bit granularity at best i.e. a pair of pixels or more at once. The mappings I have contain 16 values for each possible colour of a pixel.
Considering pairs of pixels instead of single pixels, there are 16 * 16 = 256 possible combinations of colours that need to be mapped. Why not store a table with each of these values - it can't be that large, right?
Turns out it isn't, especially when compared to the 2MB of space Lua is given in PICO-8. In fact the demo seems to use only about 2K or so (a lot more than the 256 bytes it should take in theory, but still pretty small).
This means that a lot of masking and shifting isn't as necessary inside the inner loop. It even takes fewer tokens. The performance improvement is enough that half or even all of the screen being processed per frame isn't too bad.
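As a rough illustration of the byte-pair table, here is a Python sketch that builds the 256 entries once and then fades a buffer with one lookup per byte. It mirrors what `update_8bit_map` does in the cart; `PAIR_MAP` and `fade_bytes` are names assumed for the example.

```python
# First mapping from the demo's g_scpal_map (colour -> faded colour).
FADE_MAP = [0, 0, 1, 1, 2, 1, 13, 10, 2, 4, 9, 3, 15, 5, 4, 1]

# One entry per possible byte: both nibbles (pixels) are mapped up front.
PAIR_MAP = [(FADE_MAP[b >> 4] << 4) | FADE_MAP[b & 0xF] for b in range(256)]


def fade_bytes(screen):
    """Fade every pixel pair in a bytes-like buffer with one lookup per byte."""
    return bytes(PAIR_MAP[b] for b in screen)


if __name__ == "__main__":
    print(fade_bytes(bytes([0x71, 0xFF, 0x00])).hex())
```

The inner loop no longer shifts or masks anything; all of that work moved into building the table.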
The next step was obviously to try mapping 4 pixels at a time using 16-bit values.
This would need a table of 16^4 = 65536 entries which isn't very big for a modern machine, but is pushing it pretty far for PICO-8. It's possible - take a look at the code. It also takes up a lot more memory: about 1200KB it seems. That's well over half of the total space available and for my purposes in PICO Space is enough to give me sporadic out of memory errors as it stands (PICO Space takes about 600-900KB depending on the size of the current galaxy and how much is going on in it at any particular moment). For other games it may be absolutely fine and it's tempting since there's about a 2x speed-up compared to my original implementation of the effect using this technique.
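The 16-bit table is the same idea with four pixels per entry. A Python sketch follows, with one difference from the cart worth flagging: Python indexes the table 0..65535, whereas PICO-8's signed numbers mean the cart indexes it from 0x8000 (-32768) to 0x7fff. `fade_quad` and `QUAD_MAP` are assumed names.

```python
# First mapping from the demo's g_scpal_map (colour -> faded colour).
FADE_MAP = [0, 0, 1, 1, 2, 1, 13, 10, 2, 4, 9, 3, 15, 5, 4, 1]


def fade_quad(v, mapping=FADE_MAP):
    """Fade the four 4-bit pixels packed into the 16-bit value v."""
    return sum(mapping[(v >> s) & 0xF] << s for s in (0, 4, 8, 12))


# 16^4 = 65536 entries, one per possible group of four pixels.
QUAD_MAP = [fade_quad(v) for v in range(16 ** 4)]

if __name__ == "__main__":
    print(len(QUAD_MAP), hex(QUAD_MAP[0x7171]))
```

Each lookup now handles four pixels, so the inner loop does half as many reads and writes as the 8-bit version.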
PICO-8's number format is 16bit.16bit fixed point so every value I've been storing so far is actually 32 bits in size whether I use all of those bits or not. Why not use them all?
Storing mappings for 8 pixels isn't going to work: 16^8 = 4,294,967,296 - a bit too much for PICO-8.
Instead, the last implementation that I've tried (so far) stores two 16-bit values in each number in the cache table so that the same number of mapping values as in the previous section takes half the entries and hence half the space. In the code, the lower (fractional) 16 bits take the even values; the upper (integer) 16 bits the odd values.
This brings the memory usage down to about 600KB or so, which is fairly reasonable.
Unfortunately, the two mapping values packed into a single PICO-8 table value need to be unpacked to be used in the inner loop of the effect. By the time shifts and masks are applied to do this I couldn't get the performance to really be any better than the original effect (without any caching of values), never mind faster than the other cached value versions.
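The packing and the unpacking cost can be seen in a small Python model. It treats each PICO-8 number as a 32-bit unsigned integer and follows the layout in the cart's `update_16bp_map`, where the even entry is shifted down into the fractional half and the odd entry stays in the integer half. `build_packed` and `lookup_packed` are names assumed for this sketch.

```python
# First mapping from the demo's g_scpal_map (colour -> faded colour).
FADE_MAP = [0, 0, 1, 1, 2, 1, 13, 10, 2, 4, 9, 3, 15, 5, 4, 1]

# Unpacked 16-bit table: one entry per group of four pixels.
QUAD_MAP = [sum(FADE_MAP[(v >> s) & 0xF] << s for s in (0, 4, 8, 12))
            for v in range(65536)]


def build_packed(quad_map):
    """Pack entries 2k (low 16 bits) and 2k+1 (high 16 bits) into one value."""
    return [(quad_map[k + 1] << 16) | quad_map[k]
            for k in range(0, len(quad_map), 2)]


def lookup_packed(packed, v):
    """Fetch the mapping for v; note the extra branch, shift and mask."""
    entry = packed[v >> 1]
    return entry & 0xFFFF if v & 1 == 0 else entry >> 16


if __name__ == "__main__":
    packed = build_packed(QUAD_MAP)
    print(len(packed), hex(lookup_packed(packed, 0x7171)))
```

The table has half as many entries, but every lookup pays for a parity test plus a shift or mask, which is the overhead that cancels out the gain.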
Up until this point I'd only considered making the effect faster and not "better". Two horizontally adjacent pixels are represented by each byte in the screen so one of the first compromises I'd made was to assume I couldn't fade these separately per frame and so fading the whole screen over four frames was done with chunks of at least two horizontally adjacent pixels at a time.
Since the 8bit cache version uses so little memory, is faster and deals with all combinations of two pixels both fading on the same iteration it struck me that there wouldn't be much cost to keeping two caches of 8bit values, one with the left side pixel faded, one with the right and swapping which cache is used per frame. When combined with alternating which rows are processed, this allows a dither pattern that works on a block of 2x2 pixels - no more horizontal chunking:
01
23
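Here is a Python sketch of the two-cache idea, modelled on `update_8bit_map2` and `scr_fade_8bit_inter` in the cart. It assumes PICO-8's screen layout, where the low nibble of each byte is the left pixel; `LEFT_MAP`, `RIGHT_MAP` and `schedule` are names made up for the example.

```python
# First mapping from the demo's g_scpal_map (colour -> faded colour).
FADE_MAP = [0, 0, 1, 1, 2, 1, 13, 10, 2, 4, 9, 3, 15, 5, 4, 1]

# Fade only the left pixel (low nibble), keep the right one untouched.
LEFT_MAP = [(b & 0xF0) | FADE_MAP[b & 0xF] for b in range(256)]
# Fade only the right pixel (high nibble), keep the left one untouched.
RIGHT_MAP = [(FADE_MAP[b >> 4] << 4) | (b & 0xF) for b in range(256)]


def schedule(p):
    """Which row parity and which side of each byte is faded on frame p."""
    return p % 2, "left" if p & 2 == 0 else "right"


if __name__ == "__main__":
    for p in range(4):
        print(p, schedule(p))
    print(hex(LEFT_MAP[0x71]), hex(RIGHT_MAP[0x71]))
```

Across four frames every pixel of each 2x2 block is faded exactly once, and applying both tables to a byte gives the same result as the single pair table.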
-- fading stars
-- by drakeblue

function _init()
 pal(15,140,1) -- mid blue instead of light peach
 g_scpal_map={
  -- map of every colour to a darker colour
  {[0]=0,unpack(split'0,1,1,2,1,13,10,2,4,9,3,15,5,4,1')},
  -- could equally map to lighter colour to fade to white
  -- or redder colour etc.
  {[0]=1,unpack(split'15,8,11,8,13,7,7,9,10,7,10,6,6,7,12')},
  {[0]=2,unpack(split'2,8,5,8,4,9,10,8,8,9,10,13,2,8,13')},
  -- map white/black or whatever your "target" colour is to another colour
  -- and it gets a bit trippy
  {[0]=1,unpack(split'15,8,11,8,13,7,0,9,10,7,10,7,6,7,12')},
  {[0]=7,unpack(split'0,1,1,2,1,13,6,2,4,9,3,15,5,4,1')},
 }
 g_dith={[0]=0,2,1,3}
 -- generate some stars with 3d coords
 g_stars={}
 srand(1)
 for i=1,500 do
  add(g_stars,{x=rnd(4096),y=rnd(4096),z=rnd(30)+0.1,c=ceil(rnd(15))})
 end
 g_sys_p,g_show_ui=0,1
 g_fade_types={scr_fade,scr_fade_z,clear,blank,
  scr_fade_8bit,scr_fade_8bit_bytez,scr_fade_8bit_half,scr_fade_8bit_all,
  scr_fade_16bit,scr_fade_16bit_half,scr_fade_16bit_all,
  scr_fade_16bp,scr_fade_16bp_half,scr_fade_16bp_all,
  scr_fade_8bit_inter}
 g_fade_type_names={"scr_fade","scr_fade_z","cls","none",
  "scr_fade_8b","scr_fade_8b_bz","scr_fade_8b_half","scr_fade_8b_all",
  "scr_fade_16b","scr_fade_16b_half","scr_fade_16b_all",
  "scr_fade_16bp","scr_fade_16bp_half","scr_fade_16bp_all",
  "scr_fade_8bit_inter"}
 g_fade=15
 g_map=1
 init_maps()
end

-- does nothing
function blank()
end

function clear()
 cls()
end

---------------------------------------------------------
-- sets up a pre-computed mapping of all possible pairs
-- of pixels to mapped pixels so that a byte can be processed
-- at a time. Removes the need for masking values retrieved
-- from memory
-- takes up v little memory approx 2k
function update_8bit_map()
 g_8bit_map={}
 for i=0,255 do
  g_8bit_map[i]=g_scpal_map[g_map][i>>>4&0xf]*16+g_scpal_map[g_map][i&0xf]
 end
end

function update_8bit_map2()
 g_8bit_map0={}
 g_8bit_map1={}
 for i=0,255 do
  -- g_8bit_map0[i]=g_scpal_map[g_map][i>>>4&0xf]*16+g_scpal_map[g_map][i&0xf]
  -- g_8bit_map1[i]=g_scpal_map[g_map][i>>>4&0xf]*16+g_scpal_map[g_map][i&0xf]
  g_8bit_map0[i]=(i&0xf0)+g_scpal_map[g_map][i&0xf]
  g_8bit_map1[i]=g_scpal_map[g_map][i\16&0xf]*16+(i&0xf)
 end
end

---------------------------------------------------------
-- sets up a pre-computed mapping of all possible quadruplets
-- of pixels to mapped pixels so that 2 bytes can be processed
-- at a time. Removes the need for masking values retrieved
-- from memory
-- takes up a lot of memory approx 1200k
function update_16bit_map()
 g_16bit_map={}
 for i=0x8000,0x7fff do
  g_16bit_map[i]=g_scpal_map[g_map][i>>>8&0xf]*256+g_scpal_map[g_map][i>>>12&0xf]*4096+
   g_scpal_map[g_map][i>>>4&0xf]*16+g_scpal_map[g_map][i&0xf]
 end
end

------------------------------------------------------------------------------------------
-- prt with colour 0 (black) outline
function prt_out(s,x,y,c)
 print(s,x-1,y,0)
 print(s,x+1,y)
 print(s,x,y-1)
 print(s,x,y+1)
 return print(s,x,y,c)
end

function init_maps()
 if g_fade>14 then
  g_8bit_map=nil
  g_16bp_map=nil
  g_16bit_map=nil
  update_8bit_map2()
 elseif g_fade>11 then
  g_8bit_map=nil
  g_8bit_map0=nil
  g_8bit_map1=nil
  g_16bit_map=nil
  update_16bp_map()
 elseif g_fade>8 then
  g_16bp_map=nil
  g_8bit_map=nil
  g_8bit_map0=nil
  g_8bit_map1=nil
  update_16bit_map()
 elseif g_fade>4 then
  g_16bit_map=nil
  g_16bp_map=nil
  g_8bit_map0=nil
  g_8bit_map1=nil
  update_8bit_map()
 else
  g_8bit_map=nil
  g_8bit_map0=nil
  g_8bit_map1=nil
  g_16bit_map=nil
  g_16bp_map=nil
 end
end

---------------------------------------------------------
-- sets up a pre-computed mapping of all possible quadruplets
-- of pixels to mapped pixels so that 2 bytes can be processed
-- at a time. Removes the need for masking values retrieved
-- from memory, but packs values so needs to be unpacked again
-- takes up half the memory of prev: approx 600k
function update_16bp_map()
 g_16bp_map={}
 local val
 for i=0x8000,0x7fff do
  local pack=g_scpal_map[g_map][i>>>8&0xf]*256+g_scpal_map[g_map][i>>>12&0xf]*4096+
   g_scpal_map[g_map][i>>>4&0xf]*16+g_scpal_map[g_map][i&0xf]
  if i&1==0 then
   val=pack>>>16
  else
   g_16bp_map[i\2]=pack+val
  end
 end
end

function _update60()
end

function _draw()
 g_sys_p+=1 -- value to feed animation and fade function.
 -- in game, i use the player's position to transform
 -- the star's positions for drawing
 if btnp(🅾️) then
  g_fade=(g_fade%#g_fade_types)+1
  init_maps()
 end
 if btnp(❎) then
  g_map=(g_map%#g_scpal_map)+1
  init_maps()
 end
 g_fade_types[g_fade](g_sys_p)
 --scr_fade(flr(rnd(4))) -- fun too
 -- switch between single pixels exclusively and some crosses
 if btnp'1' then
  g_points=nil
 elseif btnp'0' then
  g_points=1
 end
 -- switch overlay on and off
 if btnp'2' then
  g_show_ui=1
 elseif btnp'3' then
  g_show_ui=nil
 end
 -- draw stars
 local snx,sny=sin(g_sys_p/1280)*550,sin(g_sys_p/2560)*710
 for i,s in pairs(g_stars) do
  if g_points then
   pset((s.x-snx)/s.z%128,(s.y-sny)/s.z%128,s.c)
  else
   circfill((s.x-snx)/s.z%128,(s.y-sny)/s.z%128,s.c%2,s.c)
  end
 end
 -- show some stats, current algorithm and mapping
 if g_show_ui then
  prt_out("mem:"..stat(0).." cpu:"..stat(1)..":"..stat(2),0,0,12)
  prt_out("🅾️change algo:"..g_fade_type_names[g_fade].."\n❎change palette map",1,116,12)
 end
end

-- fades a quarter of the lines on the screen at a time
-- scan line by scan line using mapping above.
-- which line is dictated by p.
-- takes quite a chunk of performance
-- even only doing a quarter of the screen at a time.
function scr_fade(p)
 local dith={[0]=0,2,1,3} -- try to mix up lines a bit
 -- local tables seem to be faster.
 -- change start line based on dith value
 local m,d=g_scpal_map[g_map],0x6000+(dith[p%4]<<6)
 -- for a quarter of the 128 lines on the screen
 for j=0,31 do
  local j8=j<<8 -- saves a token
  -- for every 4bytes of this line
  for a=d+j8,d+j8+60,4 do
   -- grab existing value
   local v=$(a)
   -- map every pixel's colour to another one
   -- shift and mask 4bit pixel in 32bit value to just 4bit value
   -- to allow look up in map then shift back
   -- need logical shift >>> since don't want to consider sign
   poke4(a,m[v&0xf]|m[(v>>>4)&0xf]<<4|m[(v>>>8)&0xf]<<8|m[(v>>>12)&0xf]<<12
    |m[(v<<16)&0xf]>>>16|m[(v<<12)&0xf]>>>12|m[(v<<8)&0xf]>>>8|m[(v<<4)&0xf]>>>4)
  end
 end
end

-- fades a quarter of the lines on the screen at a time
-- following a Z pattern
-- which line is dictated by p.
-- takes quite a chunk of performance
-- even only doing a quarter of the screen at a time.
function scr_fade_z(p)
 local dith={[0]=0,64,4,68} -- backwards N pattern actually
 -- local tables seem to be faster.
 -- change start line based on dith value
 local m,d=g_scpal_map[g_map],0x6000+(dith[p%4])
 -- for half of the 128 lines on the screen
 for j=0,63 do
  local j8=j<<7 -- saves a token
  -- for every second 4bytes of this line
  for a=d+j8,d+j8+56,8 do
   -- grab existing value
   local v=$a
   -- map every pixel's colour to another one
   -- shift and mask 4bit pixel in 32bit value to just 4bit value
   -- to allow look up in map then shift back
   -- need logical shift >>> since don't want to consider sign
   poke4(a,m[v&0xf]|m[(v>>>4)&0xf]<<4|m[(v>>>8)&0xf]<<8|m[(v>>>12)&0xf]<<12
    |m[(v<<16)&0xf]>>>16|m[(v<<12)&0xf]>>>12|m[(v<<8)&0xf]>>>8|m[(v<<4)&0xf]>>>4)
  end
 end
end

-------------------------------------------------------------
-- 8 bit
-- sacrifice a little bit (2-3k) of lua ram
-- for performance

-- fades a quarter of the lines on the screen at a time
-- following a Z pattern
-- which line is dictated by p.
-- uses precomputed table with pairs of values
-- takes quite a bit less performance because
-- there's no need for pixel swizzling
function scr_fade_8bit(p)
 local dith={[0]=0,64,4,68} -- backwards N pattern actually
 -- local tables seem to be faster.
 -- change start line based on dith value
 local m,d=g_8bit_map,0x6000+(dith[p%4])
 -- for half of the 128 lines on the screen
 for j=0,0x1f80,128 do
  -- for every second 4bytes of this line
  for a=d+j,d+j+56,8 do
   -- map every pair of pixels to a mapped pair
   -- 4 bytes at a time
   -- shorter and quicker
   poke(a,m[@a],m[@(a+1)],m[@(a+2)],m[@(a+3)])
  end
 end
end

-- fades a quarter of the lines on the screen at a time
-- following a Z pattern
-- which line is dictated by p.
-- uses precomputed table with pairs of values
-- takes quite a bit less performance because
-- there's no need for pixel swizzling
function scr_fade_8bit_bytez(p)
 local dith={[0]=0,64,1,65} -- backwards N pattern actually
 -- local tables seem to be faster.
 -- change start line based on dith value
 local m,d=g_8bit_map,0x6000+(dith[p%4])
 -- for half of the 128 lines on the screen
 for j=0,0x1f80,128 do
  -- for every second 4bytes of this line
  for a=d+j,d+j+56,8 do
   -- map every pair of pixels to a mapped pair
   -- 4 bytes at a time
   -- shorter and quicker
   poke(a,m[@a])
   poke(a+2,m[@(a+2)])
   poke(a+4,m[@(a+4)])
   poke(a+6,m[@(a+6)])
  end
 end
end

------------------------------------------------------------------------------------------
-- fades a quarter of the screen at a time
-- scan line by scan line, left pixel then right pixel byte by byte
-- which line, side of pair is dictated by p
function scr_fade_8bit_inter(p)
 -- local tables seem to be faster.
 -- change start line based on oddness value
 local d,m=0x6000+p%2*64,p&2==0 and g_8bit_map0 or g_8bit_map1
 -- for half of the 128 lines on the screen
 for j=0,0x1f80,128 do
  -- for every 4bytes of this line
  for a=d+j,j+d+60,4 do
   -- map every pair of pixels to a mapped pair
   -- 4 bytes at a time
   -- shorter and quicker
   poke(a,m[@a],m[@(a+1)],m[@(a+2)],m[@(a+3)])
  end
 end
end

-- fades half of the lines on the screen at a time
-- which line is dictated by p.
-- uses precomputed table with pairs of values
-- takes about the same performance as doing a quarter
-- of the screen because there's no pixel swizzling.
-- effect is less noticeable
function scr_fade_8bit_half(p)
 -- local tables seem to be faster.
 -- change start line based on oddness value
 local m,d=g_8bit_map,0x6000+(p&1)*64
 -- for half of the 128 lines on the screen
 for j=0,0x1f80,128 do
  -- for every 4bytes of this line
  for a=d+j,j+d+60,4 do
   -- map every pair of pixels to a mapped pair
   -- 4 bytes at a time
   -- shorter and quicker
   poke(a,m[@a],m[@(a+1)],m[@(a+2)],m[@(a+3)])
  end
 end
end

-- fades all of the screen at a time
-- uses precomputed table with pairs of values
-- takes about half of total perf at 60 fps.
-- might be okay for some games e.g.
-- might be okay if 30fps is target.
-- effect is much less noticeable
function scr_fade_8bit_all()
 -- local tables seem to be faster.
 -- change start line based on oddness value
 local m=g_8bit_map
 -- for all of the 128 lines on the screen
 for j=0,0x1fc0,64 do
  -- for every 4bytes of this line
  for a=0x6000+j,j+0x603c,4 do
   -- map every pair of pixels to a mapped pair
   -- 4 bytes at a time
   -- shorter and quicker
   poke(a,m[@a],m[@(a+1)],m[@(a+2)],m[@(a+3)])
  end
 end
end

-------------------------------------------------------------
-- 16 bit
-- sacrifice a lot (approx 1200k) of lua ram
-- for even more performance
-- if your game doesn't use lua memory much
-- then this may be the way to go

-- fades a quarter of the lines on the screen at a time
-- following a Z pattern
-- which line is dictated by p.
-- uses precomputed table with quads of values
-- takes quite a lot less performance because
-- there's no need for pixel swizzling and half the read/writes
function scr_fade_16bit(p)
 local dith={[0]=0,64,4,68} -- backwards N pattern actually
 -- local tables seem to be faster.
 -- change start line based on dith value
 local m,d=g_16bit_map,0x6000+(dith[p%4])
 -- for half of the 128 lines on the screen
 for j=0,63 do
  local j8=j<<7 -- saves a token
  -- for every second 4bytes of this line
  for a=d+j8,d+j8+56,8 do
   -- map every pair of pixels to a mapped pair
   -- 4 bytes at a time
   -- shorter and quicker
   poke2(a,m[%a],m[%(a+2)])
  end
 end
end

-- fades half of the lines on the screen at a time
-- which line is dictated by p.
-- uses precomputed table with pairs of values
-- takes quite a bit less performance because
-- there's no need for pixel swizzling and half the read/writes
-- effect is less noticeable
function scr_fade_16bit_half(p)
 -- local tables seem to be faster.
 -- change start line based on oddness value
 local m,d=g_16bit_map,0x6000+(p&1)*64
 -- for half of the 128 lines on the screen
 for j=0,63 do
  local j8=j<<7 -- saves a token
  -- for every 4bytes of this line
  for a=d+j8,j8+d+60,4 do
   -- map every pair of pixels to a mapped pair
   -- 4 bytes at a time
   -- shorter and quicker
   poke2(a,m[%a],m[%(a+2)])
  end
 end
end

-- fades all of the screen at a time
-- uses precomputed table with pairs of values
-- takes a little more perf than initial version
-- effect is much less noticeable
function scr_fade_16bit_all()
 -- local tables seem to be faster.
 -- change start line based on oddness value
 local m=g_16bit_map
 -- for all of the 128 lines on the screen
 for j=0,127 do
  local j8=j<<6 -- saves a token
  -- for every 4bytes of this line
  for a=0x6000+j8,j8+0x603c,4 do
   -- map every pair of pixels to a mapped pair
   -- 4 bytes at a time
   -- shorter and quicker
   poke2(a,m[%a],m[%(a+2)])
  end
 end
end

-------------------------------------------------------------
-- 16 bit packed
-- sacrifice a bit less (approx 600k) of lua ram
-- but because of the unpacking needed it's not actually v fast.
-- unless there's a way to solve that this is pretty useless

-- fades a quarter of the lines on the screen at a time
-- following a Z pattern
-- which line is dictated by p.
-- uses precomputed table with quads of values
function scr_fade_16bp(p)
 local dith={[0]=0,64,4,68} -- backwards N pattern actually
 -- local tables seem to be faster.
 -- change start line based on dith value
 local m,d=g_16bp_map,0x6000+(dith[p%4])
 -- for half of the 128 lines on the screen
 for j=0,63 do
  local j8=j<<7 -- saves a token
  -- for every second 4bytes of this line
  for a=d+j8,d+j8+56,8 do
   -- map every pair of pixels to a mapped pair
   -- 4 bytes at a time
   -- shorter and quicker
   local v1,v2=%a,%(a+2)
   -- anyone know how to pack/unpack values faster than this?
   -- poke2(a,m[v1\2]<<((v1&1)<<4),m[v2\2]<<((v2&1)<<4))
   -- poke2(a,m[v1\2]<<(v1&1==0 and 16 or 0),m[v2\2]<<(v2&1==0 and 16 or 0))
   poke2(a,v1&1==0 and m[v1\2]<<16 or m[v1\2],v2&1==0 and m[v2\2]<<16 or m[v2\2])
  end
 end
end

-- fades half of the lines on the screen at a time
-- which line is dictated by p.
-- uses precomputed table with pairs of values
-- takes quite a bit less performance because
-- there's no need for pixel swizzling and half the read/writes
-- effect is less noticeable
function scr_fade_16bp_half(p)
 -- local tables seem to be faster.
 -- change start line based on oddness value
 local m,d=g_16bp_map,0x6000+(p&1)*64
 -- for half of the 128 lines on the screen
 for j=0,63 do
  local j8=j<<7 -- saves a token
  -- for every 4bytes of this line
  for a=d+j8,j8+d+60,4 do
   -- map every pair of pixels to a mapped pair
   -- 4 bytes at a time
   -- shorter and quicker
   local v1,v2=%a,%(a+2)
   poke2(a,v1&1==0 and m[v1\2]<<16 or m[v1\2],v2&1==0 and m[v2\2]<<16 or m[v2\2])
  end
 end
end

-- fades all of the screen at a time
-- uses precomputed table with pairs of values
-- takes a little more perf than initial version
-- effect is much less noticeable
function scr_fade_16bp_all()
 -- local tables seem to be faster.
 -- change start line based on oddness value
 local m=g_16bp_map
 -- for all of the 128 lines on the screen
 for j=0,127 do
  local j8=j<<6 -- saves a token
  -- for every 4bytes of this line
  for a=0x6000+j8,j8+0x603c,4 do
   -- map every pair of pixels to a mapped pair
   -- 4 bytes at a time
   -- shorter and quicker
   local v1,v2=%a,%(a+2)
   poke2(a,v1&1==0 and m[v1\2]<<16 or m[v1\2],v2&1==0 and m[v2\2]<<16 or m[v2\2])
  end
 end
end