One would think that 16-bit, unsigned, binary shifts on a Z80 microprocessor would be as trivial as it comes. But my work on Easter calculations made me realise just how little has been actually written down about optimising Z80 code for even the simplest of cases. So here's a quick guide; alas, probably thirty years too late!
The reason these shifts aren't obvious is because of the inherent asymmetry in the Z80 processor. Although it has a fairly orthogonal 8-bit ALU, its 16-bit (address) pipeline only has a simple adder. Multiplying HL by two (a left shift) is trivial: ADD HL, HL. Shifting right is a whole can of worms; to shift a 16-bit quantity to the right, it is sometimes quicker (and/or shorter) to shift/rotate left and adjust.
The following are routines that I believe are either the fastest or shortest to achieve the shifts. Optimal alternatives are given for preserving other registers, where necessary. Flags on exit are undefined.
Many thanks to Eric Fry for pointing out a couple of mistakes I made on a previous version of this page.
; T-Cycles: 11 ; Bytes: 1 ; Trashed: None ADD HL, HL |
; T-Cycles: 22 ; Bytes: 2 ; Trashed: None ADD HL, HL ADD HL, HL |
; T-Cycles: 33 ; Bytes: 3 ; Trashed: None ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 44 ; Bytes: 4 ; Trashed: None ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 55 ; Bytes: 5 ; Trashed: None ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
This is the first point where we have a meaningful choice: the left-hand code is shorter and does not destroy the original contents of the A register, whereas the right-hand code is faster.
; T-Cycles: 66 ; Bytes: 6 ; Trashed: None ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 52 ; Bytes: 13 ; Trashed: A XOR A SRL H RR L RRA SRL H RR L RRA LD H, L LD L, A |
These three snippets are, respectively: shortest, fastest and fastest without side effects.
; T-Cycles: 77 ; Bytes: 7 ; Trashed: None ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 32 ; Bytes: 8 ; Trashed: A XOR A SRL H RR L RRA LD H, L LD L, A |
; T-Cycles: 35 ; Bytes: 9 ; Trashed: None SRL H RR L LD H, L LD L, 0 RR L |
; T-Cycles: 11 ; Bytes: 3 ; Trashed: None LD H, L LD L, 0 |
Although neither of the following snippets use other registers, one is faster while the other is shorter.
; T-Cycles: 19 ; Bytes: 5 ; Trashed: None SLA L LD H, L LD L, 0 |
; T-Cycles: 22 ; Bytes: 4 ; Trashed: None LD H, L LD L, 0 ADD HL, HL |
; T-Cycles: 23 ; Bytes: 6 ; Trashed: A LD A, L ADD A, A ADD A, A LD H, A LD L, 0 |
; T-Cycles: 33 ; Bytes: 5 ; Trashed: None LD H, L LD L, 0 ADD HL, HL ADD HL, HL |
; T-Cycles: 27 ; Bytes: 7 ; Trashed: None SLA L SLA L LD H, L LD L, 0 |
; T-Cycles: 27 ; Bytes: 7 ; Trashed: A LD A, L ADD A, A ADD A, A ADD A, A LD H, A LD L, 0 |
; T-Cycles: 44 ; Bytes: 6 ; Trashed: None LD H, L LD L, 0 ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 35 ; Bytes: 9 ; Trashed: None SLA L SLA L SLA L LD H, L LD L, 0 |
; T-Cycles: 31 ; Bytes: 8 ; Trashed: A LD A, L ADD A, A ADD A, A ADD A, A ADD A, A LD H, A LD L, 0 |
; T-Cycles: 55 ; Bytes: 7 ; Trashed: None LD H, L LD L, 0 ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 43 ; Bytes: 11 ; Trashed: None SLA L SLA L SLA L SLA L LD H, L LD L, 0 |
; T-Cycles: 34 ; Bytes: 9 ; Trashed: A LD A, L RRCA RRCA RRCA AND 224 LD H, A LD L, 0 |
; T-Cycles: 66 ; Bytes: 8 ; Trashed: None LD H, L LD L, 0 ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 51 ; Bytes: 13 ; Trashed: None SLA L SLA L SLA L SLA L SLA L LD H, L LD L, 0 |
; T-Cycles: 30 ; Bytes: 8 ; Trashed: A LD A, L RRCA RRCA AND 224 LD H, A LD L, 0 |
; T-Cycles: 77 ; Bytes: 9 ; Trashed: None LD H, L LD L, 0 ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 46 ; Bytes: 12 ; Trashed: None LD H, 0 RR L RR H RR L RR H LD L, 0 |
; T-Cycles: 24 ; Bytes: 6 ; Trashed: A XOR A RR L LD L, A RRA LD H, A |
; T-Cycles: 28 ; Bytes: 7 ; Trashed: None RR L LD HL, 0 RR H |
The left shifts range from minima of 11 to 48 T-cycles or between 1 and 8 code bytes.
; T-Cycles: 16 ; Bytes: 4 ; Trashed: None SRL H RR L |
; T-Cycles: 32 ; Bytes: 8 ; Trashed: None SRL H RR L SRL H RR L |
; T-Cycles: 48 ; Bytes: 12 ; Trashed: None SRL H RR L SRL H RR L SRL H RR L |
; T-Cycles: 44 ; Bytes: 11 ; Trashed: A LD A, L SRL H RRA SRL H RRA SRL H RRA LD L, A |
; T-Cycles: 64 ; Bytes: 16 ; Trashed: None SRL H RR L SRL H RR L SRL H RR L SRL H RR L |
; T-Cycles: 56 ; Bytes: 14 ; Trashed: A LD A, L SRL H RRA SRL H RRA SRL H RRA SRL H RRA LD L, A |
; T-Cycles: 72 ; Bytes: 11 ; Trashed: A XOR A ADD HL, HL RLA ADD HL, HL RLA ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A |
; T-Cycles: 57 ; Bytes: 9 ; Trashed: A XOR A ADD HL, HL RLA ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A |
; T-Cycles: 78 ; Bytes: 11 ; Trashed: None PUSH AF XOR A ADD HL, HL RLA ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A POP AF |
; T-Cycles: 42 ; Bytes: 7 ; Trashed: A XOR A ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A |
; T-Cycles: 63 ; Bytes: 9 ; Trashed: None PUSH AF XOR A ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A POP AF |
; T-Cycles: 27 ; Bytes: 5 ; Trashed: A XOR A ADD HL, HL RLA LD L, H LD H, A |
; T-Cycles: 30 ; Bytes: 6 ; Trashed: None ADD HL, HL LD L, H LD H, 0 RL H |
; T-Cycles: 11 ; Bytes: 3 ; Trashed: None LD L, H LD H, 0 |
; T-Cycles: 19 ; Bytes: 5 ; Trashed: None SRL H LD L, H LD H, 0 |
; T-Cycles: 27 ; Bytes: 7 ; Trashed: None SRL H SRL H LD L, H LD H, 0 |
Here, you only waste 1 T-cycle if you want to preserve A.
; T-Cycles: 35 ; Bytes: 9 ; Trashed: None SRL H SRL H SRL H LD L, H LD H, 0 |
; T-Cycles: 34 ; Bytes: 9 ; Trashed: A LD A, H RRA RRA RRA AND 31 LD L, A LD H, 0 |
; T-Cycles: 43 ; Bytes: 11 ; Trashed: None SRL H SRL H SRL H SRL H LD L, H LD H, 0 |
; T-Cycles: 38 ; Bytes: 10 ; Trashed: A LD A, H RRA RRA RRA RRA AND 15 LD L, A LD H, 0 |
; T-Cycles: 51 ; Bytes: 13 ; Trashed: None SRL H SRL H SRL H SRL H SRL H LD L, H LD H, 0 |
; T-Cycles: 34 ; Bytes: 9 ; Trashed: A LD A, H RLA RLA RLA AND 7 LD L, A LD H, 0 |
; T-Cycles: 55 ; Bytes: 9 ; Trashed: None LD L, H LD H, 0 ADD HL, HL ADD HL, HL ADD HL, HL LD L, H LD H, 0 |
; T-Cycles: 30 ; Bytes: 8 ; Trashed: A LD A, H RLA RLA AND 3 LD L, A LD H, 0 |
; T-Cycles: 44 ; Bytes: 8 ; Trashed: None LD L, H LD H, 0 ADD HL, HL ADD HL, HL LD L, H LD H, 0 |
; T-Cycles: 24 ; Bytes: 6 ; Trashed: A XOR A RL H LD H, A RLA LD L, A |
; T-Cycles: 26 ; Bytes: 7 ; Trashed: None RL H LD HL, 0 RL L |
The right shifts range from minima of 11 to 57 T-cycles or between 3 and 11 code bytes.
If you were developing a library to perform arbitrary shifts, you might consider the following:
ShiftRight: ; Assume 0 <= A <= 15 LD (ShiftRightJump-1), A XOR 15 ADD A, A ADD A, A LD (ShiftRightJump+1), A LD A, 0 ; Overwritten operand ShiftRightJump: JR $+2 ; Overwritten operand JP ShiftRight15 NOP JP ShiftRight14 NOP JP ShiftRight13 NOP JP ShiftRight12 NOP JP ShiftRight11 NOP JP ShiftRight10 NOP CP 0 ; Two NOPs ShiftRight9: SRL H ShiftRight8: LD L, H LD H, 0 RET JP ShiftRight7 NOP JP ShiftRight6 NOP JP ShiftRight5 NOP ShiftRight4: SRL H RR L ShiftRight3: SRL H RR L ShiftRight2: SRL H RR L ShiftRight1: SRL H RR L ShiftRight0: RET ShiftNegative: ; Assume -15 <= A <= 0 LD (ShiftNegativeJump-1), A ADD A, A ADD A, A SUB ShiftNegativeJump-ShiftRight0+2 LD (ShiftNegativeJump+1), A LD A, 0 ; Overwritten operand ShiftNegativeJump: JR $+2 ; Overwritten operand ShiftRight15: RL H LD HL, 0 RL L RET ShiftRight14: LD L, H LD H, 0 ADD HL, HL ADD HL, HL LD L, H LD H, 0 RET ShiftRight13: SRL H ShiftRight12: SRL H ShiftRight11: SRL H ShiftRight10: SRL H SRL H LD L, H LD H, 0 RET ShiftRight6: PUSH AF XOR A ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A POP AF RET ShiftRight5: PUSH AF XOR A ADD HL, HL RLA ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A POP AF RET ShiftAny: CP 241 JP NC, ShiftNegative CP 16 JR NC, ShiftRight16 ShiftLeft: ShiftPositive: ; Assume 0 <= A <= 15 LD (ShiftLeftJump-1), A ADD A, A ADD A, A ADD A, A LD (ShiftLeftJump+1), A LD A, 0 ; Overwritten operand ShiftLeftJump: JR $+2 ; Overwritten operand ShiftLeft0: RET ShiftRight7: ADD HL, HL LD L, H LD H, 0 RL H RET ShiftLeft1: ADD HL, HL RET NOP NOP NOP NOP NOP NOP ShiftLeft2: ADD HL, HL ADD HL, HL RET NOP NOP NOP NOP NOP ShiftLeft3: ADD HL, HL ADD HL, HL ADD HL, HL RET ShiftLeft16: ShiftRight16: LD HL, 0 RET ShiftLeft4: ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL RET NOP NOP NOP ShiftLeft5: ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL RET NOP NOP ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL RET NOP ShiftLeft7: SRL L LD H, L LD L, 0 RR L RET ShiftLeft8: LD H, L LD L, 0 RET NOP NOP NOP NOP ShiftLeft9: SLA L LD H, L LD L, 0 RET ShiftLeft11: SLA L ShiftLeft10: SLA L SLA L LD H, L LD L, 0 RET JP ShiftLeft11 NOP NOP NOP NOP NOP ShiftLeft12: LD H, L LD L, 0 ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL RET JP ShiftLeft13 NOP NOP NOP NOP NOP JP ShiftLeft14 NOP NOP NOP NOP NOP ShiftLeft15: RR L LD HL, 0 RR H RET ShiftLeft6: PUSH AF XOR A SRL H RR L RRA SRL H RR L RRA LD H, L LD L, A POP AF RET ShiftLeft13: SLA L SLA L SLA L SLA L SLA L LD H, L LD L, 0 RET ShiftLeft14: LD H, 0 RR L RR H RR L RR H LD L, 0 RET |
This library provides you with a number of useful entry points within its 335 code bytes. However, some do make use of self-modifying code.