Optimising an AVR interrupt handler for Genesis controller emulation

Intro
First C-only implementation
"Naked" interrupt with inline assembler
Taking advantage of unused peripheral registers
Lowering interrupt latency by placing code directly in the vector
Preparing in advance for the next transition
The rest and conclusion

Intro

My challenge was to build an SNES controller to Genesis/Megadrive adapter. A standard controller typically uses a multiplexer (a 74157, for instance) able to react very quickly (e.g. within 27 ns) to the selection signal. But I wanted to use my multiuse PCB2 circuit, which does not have a multiplexer, only an ATmega8 clocked at 16 MHz (the maximum). Of course, I wired the selection signal to an external interrupt pin. However, for various reasons, when the interrupt occurs the corresponding code is not executed instantly. This is where things become interesting.

On this page I write about the path I followed until the response time became acceptable and the adapter reliable. But first, here's a quick overview of how the Genesis controller works.

The controller uses a DB9-F style connector and runs on 5 volts. There are 6 output signals (controller to console) and 1 input signal (console to controller) used to select which set of buttons the output signals must report.

DB9 pin	Function
5	+5 volt
8	GND
7	Selection signal (SELECT)

DB9 pin	Function when SELECT==0	Function when SELECT==1
1	D-Pad up	D-Pad up
2	D-Pad down	D-Pad down
3	0	D-Pad left
4	0	D-Pad right
6	Button A	Button B
9	Button START	Button C

For more details, I recommend reading the following: segasix.txt

First C-only implementation

I wired the adapter so that all outputs are on a single AVR port (PORTC) and so that this port's sole use is to control said outputs. This makes it possible to set all the outputs in a single operation (one write) rather than in 3 operations (read, modify, write).

The global variables S0_PC and S1_PC are updated by the main loop following each poll of the SNES controller. S0_PC holds the value to present on PORTC when SELECT is low and S1_PC holds the value to present when SELECT is high.
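As a rough illustration of what the main loop has to compute, here is a minimal sketch that builds the two PORTC values from a button bitmask. The BTN_* flags and the PORTC bit assignment (PC0 to PC5 driving DB9 pins 1, 2, 3, 4, 6 and 9) are assumptions made for this example, not the adapter's actual wiring. Outputs are treated as active low, a pressed button pulling its line to 0.

```c
#include <stdint.h>

/* Hypothetical button flags, 1 = pressed. */
enum {
    BTN_UP = 1 << 0, BTN_DOWN = 1 << 1, BTN_LEFT = 1 << 2, BTN_RIGHT = 1 << 3,
    BTN_A = 1 << 4, BTN_B = 1 << 5, BTN_C = 1 << 6, BTN_START = 1 << 7
};

/* Hypothetical PORTC bit assignment: one bit per DB9 output pin. */
enum {
    OUT_PIN1 = 1 << 0, OUT_PIN2 = 1 << 1, OUT_PIN3 = 1 << 2,
    OUT_PIN4 = 1 << 3, OUT_PIN6 = 1 << 4, OUT_PIN9 = 1 << 5,
    OUT_ALL = 0x3f
};

/* Clear (drive low) the given output bits when the button is pressed. */
static uint8_t drive_low_if(uint8_t v, int pressed, uint8_t out_bits)
{
    return pressed ? (uint8_t)(v & ~out_bits) : v;
}

/* Value for PORTC while SELECT is low: up, down, A, START; pins 3 and 4 at 0. */
static uint8_t compute_s0(uint8_t btn)
{
    uint8_t v = OUT_ALL;                       /* idle lines are high */
    v = drive_low_if(v, btn & BTN_UP,    OUT_PIN1);
    v = drive_low_if(v, btn & BTN_DOWN,  OUT_PIN2);
    v = drive_low_if(v, 1,               OUT_PIN3 | OUT_PIN4);
    v = drive_low_if(v, btn & BTN_A,     OUT_PIN6);
    v = drive_low_if(v, btn & BTN_START, OUT_PIN9);
    return v;
}

/* Value for PORTC while SELECT is high: up, down, left, right, B, C. */
static uint8_t compute_s1(uint8_t btn)
{
    uint8_t v = OUT_ALL;
    v = drive_low_if(v, btn & BTN_UP,    OUT_PIN1);
    v = drive_low_if(v, btn & BTN_DOWN,  OUT_PIN2);
    v = drive_low_if(v, btn & BTN_LEFT,  OUT_PIN3);
    v = drive_low_if(v, btn & BTN_RIGHT, OUT_PIN4);
    v = drive_low_if(v, btn & BTN_B,     OUT_PIN6);
    v = drive_low_if(v, btn & BTN_C,     OUT_PIN9);
    return v;
}
```

With this layout the main loop would simply do S0_PC = compute_s0(buttons) and S1_PC = compute_s1(buttons) after each poll, so the interrupt handler never has to look at individual buttons.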

INT0 is configured to trigger on both rising and falling edges. The interrupt handler is therefore executed each time SELECT changes state.
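For reference, a minimal sketch of that configuration on the ATmega8. The function name is made up for this example, and the bit positions in the host stand-in section are copied from the ATmega8 register description so the snippet also compiles off-target; on the AVR itself they come from avr/io.h.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#else
/* Host stand-ins (assumption: same bit positions as on the ATmega8). */
static volatile uint8_t MCUCR, GICR;
#define ISC00 0
#define ISC01 1
#define INT0  6
#endif

/* Make INT0 fire on any logical change of the pin (ISC01:ISC00 = 0:1),
   then unmask it. Global interrupts must still be enabled with sei(). */
static void int0_any_edge_init(void)
{
    uint8_t mcucr = MCUCR;
    mcucr &= (uint8_t)~(1 << ISC01);
    mcucr |= (uint8_t)(1 << ISC00);
    MCUCR = mcucr;
    GICR |= (uint8_t)(1 << INT0);
}
```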

Here is the simple handler code I began with:

ISR(INT0_vect)
{
	if (PIND & (1<<PIND2)) {
		PORTC = S1_PC;
	} else {
		PORTC = S0_PC;
	}   
}

Reaction time: 1.46 µs. Too slow, does not work!

A quick avr-objdump -d fichier.elf shows us how this was compiled:

0000005e <__vector_1>:
5e:   1f 92           push    r1
60:   0f 92           push    r0
62:   0f b6           in      r0, 0x3f        ; 63
64:   0f 92           push    r0
66:   11 24           eor     r1, r1
68:   8f 93           push    r24
6a:   82 9b           sbis    0x10, 2 ; 16
6c:   03 c0           rjmp    .+6             ; 0x74 <__vector_1+0x16>
6e:   80 91 b3 00     lds     r24, 0x00B3
72:   02 c0           rjmp    .+4             ; 0x78 <__vector_1+0x1a>
74:   80 91 b4 00     lds     r24, 0x00B4
78:   85 bb           out     0x15, r24       ; 21  
7a:   8f 91           pop     r24
7c:   0f 90           pop     r0
7e:   0f be           out     0x3f, r0        ; 63
80:   0f 90           pop     r0
82:   1f 90           pop     r1
84:   18 95           reti

How inefficient! The write to PORTC (out 0x15) is done way too late! Obviously the compiler cannot guess the need for updating PORTC as soon as possible. It also does not worry much about using registers without a reason, which means they must be saved and restored with push/pop. I'm not very impressed by the useless initialisation of r1 (the __zero_reg__) to zero when it is not used at all. But since the eor instruction is used, SREG (0x3f) is changed and must therefore be saved too...

Note that this was compiled with the -Os option. -O3 was not better.

"Naked" interrupt with inline assembler

Here is a new interrupt handler with the ISR_NAKED flag, which prevents the compiler from generating code at the beginning and end of the handler. Saving and restoring registers is now our responsibility. Very good, since it lets us write a very simple handler.

ISR(INT0_vect, ISR_NAKED)
{
	asm volatile(
		"   push r16                    \n"
		"   lds r16, S1_PC              \n"
		"   sbis 0x10, 2    ; PIND2     \n"
		"   lds r16, S0_PC              \n"
		"   out 0x15, r16   ; PORTC     \n"
		"   pop r16                     \n"
		"   reti                        \n"
	::);
}

With this new version, reaction time fell to 960 ns! The adapter began to work, but not reliably. For example, when the jump button was held, the character would jump repeatedly. The timing was probably sitting right at the limit of what the console accepts. I had to do better.

The push and lds instructions use two cycles each. If the full firmware was in assembler, it would be easy to select two registers to hold the S0_PC and S1_PC variables, making it possible to access them using only one cycle with the mov instruction. Moreover, preserving r16 could be done away with by reserving a third register.
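To put numbers on this, here is a small tally of the cycles spent inside the handler up to and including the write to PORTC, for the naked handler above and for the hypothetical all-register variant (mov, sbis, mov, out). The cycle costs are those of the classic AVR core as I understand them from the instruction set documentation. The figures deliberately leave out the interrupt response time and the jump through the vector table, so they are not directly comparable to the measured reaction times.

```c
#include <stdint.h>

/* Instruction costs in CPU cycles (classic AVR core such as the ATmega8). */
enum {
    CYC_PUSH = 2, CYC_LDS = 2, CYC_MOV = 1, CYC_OUT = 1,
    CYC_SBIS_NO_SKIP = 1,   /* bit clear: the next instruction executes   */
    CYC_SBIS_SKIP_1W = 2,   /* bit set: skips a one-word instruction      */
    CYC_SBIS_SKIP_2W = 3    /* bit set: skips a two-word instruction (lds) */
};

/* Naked handler: push, lds, sbis, (lds), out. */
static unsigned naked_cycles(int select_high)
{
    unsigned sel = select_high ? CYC_SBIS_SKIP_2W
                               : CYC_SBIS_NO_SKIP + CYC_LDS;
    return CYC_PUSH + CYC_LDS + sel + CYC_OUT;
}

/* Hypothetical variant with reserved registers: mov, sbis, (mov), out. */
static unsigned register_cycles(int select_high)
{
    unsigned sel = select_high ? CYC_SBIS_SKIP_1W
                               : CYC_SBIS_NO_SKIP + CYC_MOV;
    return CYC_MOV + sel + CYC_OUT;
}

/* At 16 MHz one cycle lasts 62.5 ns. */
static unsigned cycles_to_ns(unsigned cycles)
{
    return cycles * 1000u / 16u;
}
```

Both paths through sbis cost the same, so the handler takes 8 cycles (500 ns) with push and lds versus 4 cycles (250 ns) with reserved registers, whatever the state of SELECT.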

But I want to keep as much C code as possible. It might be tempting to declare a global register variable (e.g. register unsigned char value asm("r3")), but it would then be necessary to make sure libraries or other sources in the same project don't touch the reserved registers. The gcc option -ffixed-3 would be useful for this. But I did not want this project to depend on a specially compiled avr-libc, nor did I want to check by disassembling that the library does not use the registers. (Even if it did work now, you never know with future versions.) So I decided not to take this approach.

Taking advantage of unused peripheral registers

That said, there is another way to access values in a single cycle. Unused peripheral registers can be used if you make sure there won't be side effects. But this depends on which peripherals your project uses.

I decided to use UBRRL (baud rate low byte) to store the S0_PC value, and OCR2 (output compare 2) to store S1_PC. Also, r16 is saved in EEDR (EEPROM data register). Since this project uses neither the UART, nor timer 2, nor the EEPROM, writing to and reading from those peripheral registers has no effect on the program. Now this is a technique I'd be impressed to see a compiler use…
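Seen from C, the arrangement can be sketched as follows. The function names are made up for this example, and handler_model only mirrors the load, skip and overwrite order of the assembler handler so that the logic can be exercised on a PC; it is not a replacement for it. On a host build, plain variables stand in for the I/O registers.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#else
/* Host stand-ins for the I/O registers involved. */
static volatile uint8_t UBRRL, OCR2, PIND, PORTC;
#define PIND2 2
#endif

/* Main-loop side: publish the freshly computed values. */
static void publish_outputs(uint8_t s0, uint8_t s1)
{
    UBRRL = s0;   /* presented while SELECT is low  */
    OCR2  = s1;   /* presented while SELECT is high */
}

/* C model of the handler's selection logic. */
static void handler_model(void)
{
    uint8_t v = OCR2;                 /* in   r16, OCR2                 */
    if (!(PIND & (1 << PIND2)))       /* sbis PIND, 2 (skip if bit set) */
        v = UBRRL;                    /* in   r16, UBRRL                */
    PORTC = v;                        /* out  PORTC, r16                */
}
```

The main loop calls publish_outputs() instead of assigning S0_PC and S1_PC, and nothing else changes.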

Well this gives us the following handler:

ISR(INT0_vect, ISR_NAKED)
{
	asm volatile(
	"   out 0x1D, r16 ; EEDR        \n"
	"   in r16, 0x23    ; OCR2      \n"
	"   sbis 0x10, 2    ; PIND2     \n"
	"   in r16, 0x09    ; UBRRL     \n"
	"   out 0x15, r16   ; PORTC     \n"
	"   in r16, 0x1D    ; EEDR      \n"
	"   reti                        \n"
	::);
}

Since a few cycles were saved, the reaction time is now around 800 ns. And the adapter seems to be reliable. But I think we are still close to the unreliability threshold. No problem since it is still possible to improve!

Lowering interrupt latency by placing code directly in the vector

By default, the interrupt vector table is at flash address 0x0000. When an interrupt handler is implemented, an rjmp instruction is placed at the corresponding offset to jump to the actual interrupt handler code. Here, __vector_1 is the address of our interrupt handler.

00000000 <__vectors>:
0:   12 c0           rjmp    .+36            ; 0x26 <__ctors_end>
2:   2d c0           rjmp    .+90            ; 0x5e <__vector_1>
4:   2b c0           rjmp    .+86            ; 0x5c <__bad_interrupt>
....

This rjmp instruction is wasting 2 cycles before our code is executed. Since I know INT0 is the only interrupt this project uses, I also know I can place the handler code directly in the vector at address 0x0002.

The ATmega8 supports moving the interrupt vectors from address 0x0000 to the start of the bootloader section. The effective address depends on how the "fuses" are configured. In my case, the address is 0x1800 (word address 0xC00).
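A minimal sketch of the relocation itself, assuming the usual ATmega8 sequence: the IVSEL bit of GICR selects the boot section vectors, and it can only be changed shortly after IVCE has been written to one. The function name is made up for this example, and the host stand-in bit positions are taken from the ATmega8 register description. The second write must follow the first within four cycles, which I assume an optimising build achieves for two consecutive stores.

```c
#include <stdint.h>

#ifdef __AVR__
#include <avr/io.h>
#else
/* Host stand-ins (assumption: same bit positions as on the ATmega8). */
static volatile uint8_t GICR;
#define IVCE  0
#define IVSEL 1
#endif

/* Move the interrupt vectors to the start of the boot section. */
static void move_vectors_to_boot(void)
{
    GICR = (uint8_t)(1 << IVCE);    /* unlock the change             */
    GICR = (uint8_t)(1 << IVSEL);   /* select the boot section table */
}
```

Both writes clear the other bits of GICR, including the INT0 enable bit, so in this sketch INT0 has to be enabled after the call.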

I created a .boot section by adding -Wl,--section-start=.boot=0x1800 when linking. The interrupt handler "function" that I will place there will therefore have to be marked with __attribute__((section(".boot"))).

This same "function" will require the "naked" attribute to make sure the compiler does not place code around the inline assembler block. The assembler code is the same as before, except for the two nop instructions used to skip the first vector (reset). Note that defining .boot one word later would make the two nop instructions unnecessary. But I prefer it that way.

void fastint(void) __attribute__((naked)) __attribute__((section(".boot")));

void fastint(void)
{
	asm volatile(
		"   nop\nnop\n                  \n" // VECTOR 1 : RESET
		"   out 0x1D, r16 ; EEDR        \n"
		"   in r16, 0x23    ; OCR2      \n"
		"   sbis 0x10, 2    ; PIND2     \n"
		"   in r16, 0x09    ; UBRRL     \n"
		"   out 0x15, r16   ; PORTC     \n"
		"   in r16, 0x1D    ; EEDR      \n"
		"   reti                        \n"
	::);
}

Reaction time: Around 760 ns. Not bad!

Preparing in advance for the next transition

Updating PORTC in even less time is possible if we already know what the state of the SELECT line will be after the next transition. And we do! The interrupt handler is executed each time the SELECT line changes. If we sample the SELECT line while the interrupt is executing, a value of 0 means the next transition is to 1, and vice versa. I did not realize this before this step! I only did when I began thinking of the 6-button implementation that would follow.

I also began exploiting ICR1L to store the value to put on PORTC on the next transition:

	asm volatile(
	"   nop\nnop\n                  \n" // VECTOR 1 : RESET
	"   out 0x1D, r16 ; EEDR        \n" 

	"   in r16, 0x26    ; ICR1L     \n" 
	"   out 0x15, r16   ; PORTC     \n"
	
	// Prepare ICR1L for the next transition
	"   in r16, 0x09    ; UBRRL     \n"
	"   sbis 0x10, 2    ; PIND2     \n"
	"   in r16, 0x23    ; OCR2      \n"
	"   out 0x26, r16   ; ICR1L     \n"

	"   in r16, 0x1D    ; EEDR      \n"
	"   reti                        \n"
	::);
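The logic of this handler can be checked on a PC with a small model. The struct and function names are made up for this example; the fields simply stand in for the I/O registers used above.

```c
#include <stdint.h>

/* Plain variables standing in for the registers used by the handler. */
struct sim {
    uint8_t s0;     /* UBRRL: value for SELECT low        */
    uint8_t s1;     /* OCR2:  value for SELECT high       */
    uint8_t next;   /* ICR1L: value to put out next time  */
    uint8_t portc;  /* PORTC                              */
};

/* One handler run. select_now is the SELECT level sampled inside the
   handler, that is, the level the line has just reached. */
static void on_select_edge(struct sim *s, int select_now)
{
    s->portc = s->next;                     /* output first, no test needed */
    s->next = select_now ? s->s0 : s->s1;   /* next edge goes the other way */
}
```

Two things show up in the model. ICR1L has to be seeded before the first edge (with the S0 value if SELECT idles high, which is an assumption here). And the value put on PORTC is the one that was prepared at the previous edge, so an update made by the main loop in between only becomes visible one transition later.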

I was looking at the source code above and wondered what I could do about the cycle wasted by saving r16... Then I realized I could use r1, which is also known as __zero_reg__, a register kept at 0 by gcc.

Because this interrupt handler is executed with other interrupts disabled, __zero_reg__ can be freely used, but its value of 0 must be restored before returning. There is no need to save it first, since it should have been zero. However, we must be careful. The clr instruction has an effect on flags, so SREG would need to be saved. Also, loading a 0 with ldi is not possible because this instruction requires a register from r16 and above (__zero_reg__ is r1). So I used an lds to load __zero_reg__ with a zero from memory, which has no effect on the flags.

uint8_t zero = 0;

asm volatile(
	"   nop\nnop\n                  \n" // VECTOR 1 : RESET

	"   in __zero_reg__, 0x26   ; ICR1L     \n" 
	"   out 0x15, __zero_reg__  ; PORTC     \n"

	// Now, let's prepare for the next transition.

	"   in __zero_reg__, 0x09   ; UBRRL     \n"
	"   sbis 0x10, 2    ; PIND2     \n"
	"   in __zero_reg__, 0x23   ; OCR2      \n"
	"   out 0x26, __zero_reg__  ; ICR1L     \n"

	"   lds __zero_reg__, zero      \n"
	"   reti                     	\n"
	::);

Reaction time: From 490 ns to 630 ns.
The graphic on the left represents the final timing. The bottom trace is the SELECT line. The top trace falls to 0 to transmit the state of a depressed START button within an average time of 560 ns. Note that one CPU cycle is 62.5 ns. The jitter is approximately 3 cycles and depends on the moment the falling edge occurs in relation to the CPU clock phase, but also on the instruction currently executed by the main loop (multi-cycle instructions must complete before the interrupt handler runs).

It would still be possible to save one cycle by reserving a register to hold the next PORTC value. The initial in instruction would then not be needed. But at this point, the performance seems high enough.

The rest and conclusion

From this point, I made many changes to have the adapter appear as a 6-button controller to the Genesis console. The code has been changed to put different values on PORTC according to a sequence of SELECT pulses. Conditional access to OCR2 and UBRRL was therefore replaced by memory access. But thanks to the optimisations presented here, reaction time has not increased at all.

If you'd like to see the final code, you may download the project sources through the project page.