Improving FPS


#70

The “China J-Links” come in two different generations. This ad does not say whether the firmware can be upgraded to work with Keil MDK 5. That is the version you need, and it is rarer than the one that only works up to MDK 4.


#71

Not 100% sure, but I think you need something that mentions the v9 firmware.

I can also highly recommend the original J-Link EDU version, which I had for a long time. Zero problems, works really well.


#72

OK, just write your suggestion here and I will put it in the PR.


#73

I don’t know what I did, but I had the plasma demo running at 125fps at one point. I lost it though: I was compiling the wrong target, randomly messing about with the code, and confused about why nothing was changing. When I came back to it today, it’s 79fps.

Also I wouldn’t mind having the gb emulator, for novelty value mostly :sunglasses:

[edit] It was @FManga’s code further up: 125fps on the plasma demo. Unrolling the loops gets 140fps.
How do I get rid of the line of noise along the middle of the screen?


#74

Maybe an addition to the next PR so we can have a define in My_settings.h?


#75

@spinal:
I’m not sure the flag would be useful for others; disabling the code is simple enough.
Did you manage to get rid of the noise?
If not, try this:

#define TGL_WR(OP)							\
  *reinterpret_cast< volatile uint32_t *>(0xA0002284) = 1 << LCD_WR_PIN; \
  OP;									\
  *reinterpret_cast< volatile uint32_t *>(0xA0002204) = 1 << LCD_WR_PIN;

 void Pokitto::lcdRefreshMode13(uint8_t * scrbuf, uint16_t* paletteptr, uint8_t offset){
   uint16_t x,y;
   uint32_t scanline[110]; // one converted line of pixels, kept so the same line can be sent twice
   uint8_t *d;
   uint32_t *s;

   write_command_16(0x03); write_data_16(0x1038);
   write_command(0x20); write_data(0);
   write_command(0x21); write_data(0);
   write_command(0x22);
   CLR_CS_SET_CD_RD_WR;
   SET_MASK_P2;

   volatile uint32_t *LCD = reinterpret_cast< volatile uint32_t * >(0xA0002188);

   d = scrbuf;// point to beginning of line in data
   for(y=0;y<88;y++){

     s = scanline;

     for(x=0;x<110;x+=10){
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
       *LCD = *s = paletteptr[(*d + offset)&255]<<3; TGL_WR(s++);TGL_WR(d++);	
     }

     s = scanline;
     
     for(x=0;x<110;x+=10){
       *LCD = *s; TGL_WR(s++);TGL_WR(s);
       *LCD = *s; TGL_WR(s++);TGL_WR(s);       
       *LCD = *s; TGL_WR(s++);TGL_WR(s);
       *LCD = *s; TGL_WR(s++);TGL_WR(s);       
       *LCD = *s; TGL_WR(s++);TGL_WR(s);
       *LCD = *s; TGL_WR(s++);TGL_WR(s);       
       *LCD = *s; TGL_WR(s++);TGL_WR(s);
       *LCD = *s; TGL_WR(s++);TGL_WR(s);       
       *LCD = *s; TGL_WR(s++);TGL_WR(s);
       *LCD = *s; TGL_WR(s++);TGL_WR(s);       
     }
     
   }
   
 }


#76

I was thinking of something like this (though LCD could probably use a better name. Any suggestions?):

// HWLCD.cpp
volatile uint32_t *LCD = reinterpret_cast< volatile uint32_t * >(0xA0002188);

// HWLCD.h

#define TGL_WR(OP)							\
  LPC_GPIO_PORT->SET[LCD_WR_PORT] = 1 << LCD_WR_PIN;			\
  OP;									\
  LPC_GPIO_PORT->CLR[LCD_WR_PORT] = 1 << LCD_WR_PIN;			

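// Variant without an operation. Note: this cannot share the name TGL_WR
// with the function-like macro above if both are kept in the same header.
#define TGL_WR \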
  LPC_GPIO_PORT->SET[LCD_WR_PORT] = 1 << LCD_WR_PIN;			\
  __asm("nop"); \
  LPC_GPIO_PORT->CLR[LCD_WR_PORT] = 1 << LCD_WR_PIN;			

extern volatile uint32_t *LCD;

For some reason, using the LCD pointer is about 7 fps faster than using mbed’s MPIN[2]. Even stranger, this does not seem to affect SET/CLR used in TGL_WR.


#77

Looks fine.


#78

What’s MPIN defined as?

It’s possible that the indexing operation isn’t getting optimised away if it’s volatile,
because doing so might violate some of the guarantees of volatile.


#79

The same way as CLR and SET:

#ifdef __cplusplus
  #define   __I     volatile             /*!< Defines 'read only' permissions                 */
#else
  #define   __I     volatile const       /*!< Defines 'read only' permissions                 */
#endif
#define     __O     volatile             /*!< Defines 'write only' permissions                */
#define     __IO    volatile             /*!< Defines 'read / write' permissions              */


typedef struct {                                    /*!< GPIO_PORT Structure                                                   */
  __IO uint8_t   B[88];                             /*!< Byte pin registers                                                    */
  __I  uint32_t  RESERVED0[42];
  __IO uint32_t  W[88];                             /*!< Word pin registers                                                    */
  __I  uint32_t  RESERVED1[1896];
  __IO uint32_t  DIR[3];                            /*!< Port Direction registers                                              */
  __I  uint32_t  RESERVED2[29];
  __IO uint32_t  MASK[3];                           /*!< Port Mask register                                                    */
  __I  uint32_t  RESERVED3[29];
  __IO uint32_t  PIN[3];                            /*!< Port pin register                                                     */
  __I  uint32_t  RESERVED4[29];
  __IO uint32_t  MPIN[3];                           /*!< Masked port register                                                  */
  __I  uint32_t  RESERVED5[29];
  __IO uint32_t  SET[3];                            /*!< Write: Set port register Read: port output bits                       */
  __I  uint32_t  RESERVED6[29];
  __O  uint32_t  CLR[3];                            /*!< Clear port                                                            */
  __I  uint32_t  RESERVED7[29];
  __O  uint32_t  NOT[3];                            /*!< Toggle port                                                           */
} LPC_GPIO_PORT_Type;

#define LPC_GPIO_PORT                   ((LPC_GPIO_PORT_Type      *) LPC_GPIO_PORT_BASE)
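
For anyone comparing the hard-coded addresses from the earlier posts with this struct, here is a small sketch that checks the offsets at compile time. It assumes the GPIO block sits at 0xA0000000 (the thread never states LPC_GPIO_PORT_BASE, so that value is my assumption), and GpioPortLayout, kAssumedGpioBase and reg_address are names made up for the example. Under that assumption, 0xA0002188 is MPIN[2], 0xA0002204 is SET[1] and 0xA0002284 is CLR[1].

```cpp
#include <cstddef>
#include <cstdint>

// Local copy of the layout quoted above, with __I/__O/__IO expanded to volatile.
struct GpioPortLayout {
    volatile uint8_t  B[88];
    volatile uint32_t RESERVED0[42];
    volatile uint32_t W[88];
    volatile uint32_t RESERVED1[1896];
    volatile uint32_t DIR[3];
    volatile uint32_t RESERVED2[29];
    volatile uint32_t MASK[3];
    volatile uint32_t RESERVED3[29];
    volatile uint32_t PIN[3];
    volatile uint32_t RESERVED4[29];
    volatile uint32_t MPIN[3];
    volatile uint32_t RESERVED5[29];
    volatile uint32_t SET[3];
    volatile uint32_t RESERVED6[29];
    volatile uint32_t CLR[3];
    volatile uint32_t RESERVED7[29];
    volatile uint32_t NOT[3];
};

// Assumption: peripheral base address, not taken from the thread.
constexpr uint32_t kAssumedGpioBase = 0xA0000000u;

// Address of element `port` of a 32-bit register array at `field_offset`.
constexpr uint32_t reg_address(size_t field_offset, unsigned port) {
    return kAssumedGpioBase + static_cast<uint32_t>(field_offset) + 4u * port;
}

static_assert(reg_address(offsetof(GpioPortLayout, MPIN), 2) == 0xA0002188u,
              "the LCD pointer from the earlier posts is MPIN[2]");
static_assert(reg_address(offsetof(GpioPortLayout, SET), 1) == 0xA0002204u,
              "0xA0002204 is SET[1]");
static_assert(reg_address(offsetof(GpioPortLayout, CLR), 1) == 0xA0002284u,
              "0xA0002284 is CLR[1]");
```

If the base address assumption is wrong for your part, the static_asserts fail at compile time instead of silently writing to the wrong register.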


#80

@Hanski @FManga how about some PRs so I can test? I don’t have enough time to work out from your tennis match where the ball is at the moment.


#81

I am just about to make a PR…


#82

Excellent! Thanks. I am waiting.


#83

Pull request done.


#84

Pull request merged.

Does this also include the Mode13 stuff? @spinal, @FManga ?


#85

No, that’s just mode2. I think @spinal is still experimenting with mode13?
I can send a PR for ModeGBC soon, if you’d like.


#86

Yes please if possible.

I am on summer holiday, as you may or may not have noticed from the bunch of newly ported games. I intend to flood the place with new content. I have plenty of time to look at the improvements you guys have come up with.


#87

So… close…


#88

That is awesome!

When I did my version, I used Zboy as a base and remapped the 64kB ROM address calls to static flash (instead of RAM).

I had a feeling the real issue with the speed was some sort of timing glitch and not a lack of CPU.


#89

I saw you were using zboy, so I went with it too. I’m also using flash for ROM, but I think it may be possible to put it in RAM for mbc0 games (that way we’d be able to load games from the SD). I’ve been testing with Tetris, Dr Mario and Kwirk. They’re still a bit slow, but playable. I’m only focusing on mbc0 games, for now.

I uploaded the code in its current/rough state here: https://github.com/felipemanga/PokittoZBoy

At first I also thought timing could be an issue, so I disabled the throttling and didn’t get any improvement. Since I couldn’t find a hardware debugger in Brazil, I modified ProjectABE for profiling. I can now make a “hotspot ranking” like this:

#6ca1 ___ 100%
#6865 ___ 99.48%
#5f8d ___ 99.4%
#b4d6 ___ 73.38%
#b4d8 ___ 73.38%
#b504 ___ 71.88%
#383c ___ 36.82%
#3844 ___ 36.82%
#3848 ___ 36.82%
#3854 ___ 36.82%
...

ModeGBC was the first bottleneck and replacing it with the version I posted previously helped a bit. Then I had to rewrite setPixel, getPixel, and most of DrawBackground. No point in drawing to framebuffer, copying that to PokittoLib’s framebuffer, then copying that to the LCD.

I removed the huge switch from the CPU interpreter and used an array of function pointers, instead. To lower the per-op overhead, I update the CPU more often than the other systems (16:1 was the most I could get away with).
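
To make the idea concrete, here is a minimal sketch of table-based dispatch plus batching. It is not the PokittoZBoy code: Cpu, OpHandler, kOps, cpu_step and cpu_run_batch are hypothetical names, only three opcodes are filled in (0x00 NOP, 0x3C INC A, 0x3D DEC A), and flag updates are left out on purpose.

```cpp
#include <array>
#include <cstdint>

// Hypothetical, heavily reduced CPU state for illustration only.
struct Cpu {
    uint8_t a = 0;                // accumulator
    uint16_t pc = 0;              // program counter
    uint32_t cycles = 0;          // clock cycles executed so far
    const uint8_t* rom = nullptr; // code being executed
};

using OpHandler = void (*)(Cpu&);

static void op_nop(Cpu& cpu)       { cpu.cycles += 4; }
static void op_inc_a(Cpu& cpu)     { ++cpu.a; cpu.cycles += 4; } // flags omitted
static void op_dec_a(Cpu& cpu)     { --cpu.a; cpu.cycles += 4; } // flags omitted
static void op_unhandled(Cpu& cpu) { cpu.cycles += 4; }          // placeholder

// One handler per opcode byte; this replaces a 256-case switch.
static std::array<OpHandler, 256> make_table() {
    std::array<OpHandler, 256> table;
    table.fill(op_unhandled);
    table[0x00] = op_nop;
    table[0x3C] = op_inc_a;
    table[0x3D] = op_dec_a;
    return table;
}

static const std::array<OpHandler, 256> kOps = make_table();

// Fetch one opcode and jump straight to its handler.
inline void cpu_step(Cpu& cpu) {
    const uint8_t opcode = cpu.rom[cpu.pc++];
    kOps[opcode](cpu);
}

// Run `count` instructions back to back and return the cycles they took,
// so the other subsystems can be updated once per batch.
inline uint32_t cpu_run_batch(Cpu& cpu, int count) {
    const uint32_t start = cpu.cycles;
    for (int i = 0; i < count; ++i) {
        cpu_step(cpu);
    }
    return cpu.cycles - start;
}
```

With a batch size of 16, the main loop would call cpu_run_batch(cpu, 16) and then hand the returned cycle count to the video and timer updates once, which is roughly the 16:1 ratio described above.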

The MMU was next: I use SRAM1 as a “RAM palette”, to index addresses into blocks of memory. Reading is now simple enough that the compiler inlines it.
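
One possible reading of the “RAM palette” idea, sketched as a plain page table. This is an assumption about the approach, not the actual implementation: Mmu, mmu_map, mmu_read and the 256-byte page size are all invented for the example, and unmapped pages are simply left as null pointers.

```cpp
#include <cstddef>
#include <cstdint>

constexpr unsigned kPageBits = 8;                     // assumed 256-byte pages
constexpr size_t kPageCount = 0x10000u >> kPageBits;  // 16-bit address space

// One pointer per page, each pointing at the memory block backing that page.
struct Mmu {
    const uint8_t* page[kPageCount] = {};
};

// Map `length` bytes of `block` at page-aligned address `start`.
inline void mmu_map(Mmu& mmu, uint32_t start, uint32_t length, const uint8_t* block) {
    for (uint32_t offset = 0; offset < length; offset += (1u << kPageBits)) {
        mmu.page[(start + offset) >> kPageBits] = block + offset;
    }
}

// A read is one table lookup plus one indexed load, with no range checks.
inline uint8_t mmu_read(const Mmu& mmu, uint16_t addr) {
    return mmu.page[addr >> kPageBits][addr & ((1u << kPageBits) - 1u)];
}
```

Because mmu_read has no branches, it is small enough that a compiler is likely to inline it at every call site.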

The profiler also pointed out things like:
CurLY = VideoClkCounterVBlank / 456
This results in a call to udivsi3, so I replaced it with a fixed point multiply:
CurLY = VideoClkCounterVBlank * (0x1000000 / 456) >> 24
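
Here is a small host-side sketch for checking that replacement against a real division. It assumes the counter is an unsigned 32-bit value in the range 0..70223 (154 lines of 456 clocks); the function names are made up. With the truncated constant 0x1000000 / 456 = 36792, the multiply comes out one line low at exact multiples of 456 (for example 456 gives 0 instead of 1). Whether that matters for the emulator I can't say, but adding 1 to the counter before multiplying matches the division over that whole range in this sketch.

```cpp
#include <cstdint>

constexpr uint32_t kClocksPerLine = 456;
constexpr uint32_t kReciprocal = 0x1000000u / kClocksPerLine; // 36792, truncated

// Reference: the division that ends up as a library call on the target.
inline uint32_t line_exact(uint32_t clk) {
    return clk / kClocksPerLine;
}

// The fixed-point form from the post: multiply by the reciprocal, shift by 24.
inline uint32_t line_truncated(uint32_t clk) {
    return (clk * kReciprocal) >> 24;
}

// Variant with a +1 bias to compensate for the truncated constant.
// Assumption: clk < 70224, so the 32-bit product cannot overflow.
inline uint32_t line_biased(uint32_t clk) {
    return ((clk + 1u) * kReciprocal) >> 24;
}

// Count the inputs in [0, limit) where `approx` disagrees with the division.
inline uint32_t count_mismatches(uint32_t (*approx)(uint32_t), uint32_t limit) {
    uint32_t mismatches = 0;
    for (uint32_t clk = 0; clk < limit; ++clk) {
        if (approx(clk) != line_exact(clk)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Running count_mismatches over the full counter range on a PC is a cheap way to validate any such constant before putting it on the device. Note that the product needs an unsigned 32-bit type: near the top of the range it no longer fits in a signed 32-bit int.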

Now individual opcodes are the bottleneck (#6ca1 is the implementation of LD A,($FF00+n)). This is a good thing, because that means the rest of the emulator is not in the way, and a bad thing because it’s hard to make it any faster.