One would think that 16-bit, unsigned, binary shifts on a Z80 microprocessor would be as trivial as it comes. But my work on Easter calculations made me realise just how little has been actually written down about optimising Z80 code for even the simplest of cases. So here's a quick guide; alas, probably thirty years too late!
The reason these shifts aren't obvious is because of the inherent asymmetry in the Z80 processor. Although it has a fairly orthogonal 8-bit ALU, its 16-bit (address) pipeline only has a simple adder. Multiplying HL by two (a left shift) is trivial: ADD HL, HL. Shifting right is a whole can of worms; to shift a 16-bit quantity to the right, it is sometimes quicker (and/or shorter) to shift/rotate left and adjust.
The following are routines that I believe are either the fastest or shortest to achieve the shifts. Optimal alternatives are given for preserving other registers, where necessary. Flags on exit are undefined.
Many thanks to Eric Fry for pointing out a couple of mistakes I made on a previous version of this page.
; T-Cycles: 11 ; Bytes: 1 ; Trashed: None ADD HL, HL |
; T-Cycles: 22 ; Bytes: 2 ; Trashed: None ADD HL, HL ADD HL, HL |
; T-Cycles: 33 ; Bytes: 3 ; Trashed: None ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 44 ; Bytes: 4 ; Trashed: None ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 55 ; Bytes: 5 ; Trashed: None ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
This is the first point where we have a meaningful choice: the left-hand code is shorter and does not destroy the original contents of the A register, whereas the right-hand code is faster.
; T-Cycles: 66 ; Bytes: 6 ; Trashed: None ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 52 ; Bytes: 13 ; Trashed: A XOR A SRL H RR L RRA SRL H RR L RRA LD H, L LD L, A |
These three snippets are, respectively: shortest, fastest and fastest without side effects.
; T-Cycles: 77 ; Bytes: 7 ; Trashed: None ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 32 ; Bytes: 8 ; Trashed: A XOR A SRL H RR L RRA LD H, L LD L, A |
; T-Cycles: 35 ; Bytes: 9 ; Trashed: None SRL H RR L LD H, L LD L, 0 RR L |
; T-Cycles: 11 ; Bytes: 3 ; Trashed: None LD H, L LD L, 0 |
Although neither of the following snippets use other registers, one is faster while the other is shorter.
; T-Cycles: 19 ; Bytes: 5 ; Trashed: None SLA L LD H, L LD L, 0 |
; T-Cycles: 22 ; Bytes: 4 ; Trashed: None LD H, L LD L, 0 ADD HL, HL |
; T-Cycles: 23 ; Bytes: 6 ; Trashed: A LD A, L ADD A, A ADD A, A LD H, A LD L, 0 |
; T-Cycles: 33 ; Bytes: 5 ; Trashed: None LD H, L LD L, 0 ADD HL, HL ADD HL, HL |
; T-Cycles: 27 ; Bytes: 7 ; Trashed: None SLA L SLA L LD H, L LD L, 0 |
; T-Cycles: 27 ; Bytes: 7 ; Trashed: A LD A, L ADD A, A ADD A, A ADD A, A LD H, A LD L, 0 |
; T-Cycles: 44 ; Bytes: 6 ; Trashed: None LD H, L LD L, 0 ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 35 ; Bytes: 9 ; Trashed: None SLA L SLA L SLA L LD H, L LD L, 0 |
; T-Cycles: 31 ; Bytes: 8 ; Trashed: A LD A, L ADD A, A ADD A, A ADD A, A ADD A, A LD H, A LD L, 0 |
; T-Cycles: 55 ; Bytes: 7 ; Trashed: None LD H, L LD L, 0 ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 43 ; Bytes: 11 ; Trashed: None SLA L SLA L SLA L SLA L LD H, L LD L, 0 |
; T-Cycles: 34 ; Bytes: 9 ; Trashed: A LD A, L RRCA RRCA RRCA AND 224 LD H, A LD L, 0 |
; T-Cycles: 66 ; Bytes: 8 ; Trashed: None LD H, L LD L, 0 ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 51 ; Bytes: 13 ; Trashed: None SLA L SLA L SLA L SLA L SLA L LD H, L LD L, 0 |
; T-Cycles: 30 ; Bytes: 8 ; Trashed: A LD A, L RRCA RRCA AND 224 LD H, A LD L, 0 |
; T-Cycles: 77 ; Bytes: 9 ; Trashed: None LD H, L LD L, 0 ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL ADD HL, HL |
; T-Cycles: 46 ; Bytes: 12 ; Trashed: None LD H, 0 RR L RR H RR L RR H LD L, 0 |
; T-Cycles: 24 ; Bytes: 6 ; Trashed: A XOR A RR L LD L, A RRA LD H, A |
; T-Cycles: 28 ; Bytes: 7 ; Trashed: None RR L LD HL, 0 RR H |
The left shifts range from minima of 11 to 48 T-cycles or between 1 and 8 code bytes.
; T-Cycles: 16 ; Bytes: 4 ; Trashed: None SRL H RR L |
; T-Cycles: 32 ; Bytes: 8 ; Trashed: None SRL H RR L SRL H RR L |
; T-Cycles: 48 ; Bytes: 12 ; Trashed: None SRL H RR L SRL H RR L SRL H RR L |
; T-Cycles: 44 ; Bytes: 11 ; Trashed: A LD A, L SRL H RRA SRL H RRA SRL H RRA LD L, A |
; T-Cycles: 64 ; Bytes: 16 ; Trashed: None SRL H RR L SRL H RR L SRL H RR L SRL H RR L |
; T-Cycles: 56 ; Bytes: 14 ; Trashed: A LD A, L SRL H RRA SRL H RRA SRL H RRA SRL H RRA LD L, A |
; T-Cycles: 72 ; Bytes: 11 ; Trashed: A XOR A ADD HL, HL RLA ADD HL, HL RLA ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A |
; T-Cycles: 57 ; Bytes: 9 ; Trashed: A XOR A ADD HL, HL RLA ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A |
; T-Cycles: 78 ; Bytes: 11 ; Trashed: None PUSH AF XOR A ADD HL, HL RLA ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A POP AF |
; T-Cycles: 42 ; Bytes: 7 ; Trashed: A XOR A ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A |
; T-Cycles: 63 ; Bytes: 9 ; Trashed: None PUSH AF XOR A ADD HL, HL RLA ADD HL, HL RLA LD L, H LD H, A POP AF |
; T-Cycles: 27 ; Bytes: 5 ; Trashed: A XOR A ADD HL, HL RLA LD L, H LD H, A |
; T-Cycles: 30 ; Bytes: 6 ; Trashed: None ADD HL, HL LD L, H LD H, 0 RL H |
; T-Cycles: 11 ; Bytes: 3 ; Trashed: None LD L, H LD H, 0 |
; T-Cycles: 19 ; Bytes: 5 ; Trashed: None SRL H LD L, H LD H, 0 |
; T-Cycles: 27 ; Bytes: 7 ; Trashed: None SRL H SRL H LD L, H LD H, 0 |
Here, you only waste 1 T-cycle if you want to preserve A.
; T-Cycles: 35 ; Bytes: 9 ; Trashed: None SRL H SRL H SRL H LD L, H LD H, 0 |
; T-Cycles: 34 ; Bytes: 9 ; Trashed: A LD A, H RRA RRA RRA AND 31 LD L, A LD H, 0 |
; T-Cycles: 43 ; Bytes: 11 ; Trashed: None SRL H SRL H SRL H SRL H LD L, H LD H, 0 |
; T-Cycles: 38 ; Bytes: 10 ; Trashed: A LD A, H RRA RRA RRA RRA AND 15 LD L, A LD H, 0 |
; T-Cycles: 51 ; Bytes: 13 ; Trashed: None SRL H SRL H SRL H SRL H SRL H LD L, H LD H, 0 |
; T-Cycles: 34 ; Bytes: 9 ; Trashed: A LD A, H RLA RLA RLA AND 7 LD L, A LD H, 0 |
; T-Cycles: 55 ; Bytes: 9 ; Trashed: None LD L, H LD H, 0 ADD HL, HL ADD HL, HL ADD HL, HL LD L, H LD H, 0 |
; T-Cycles: 30 ; Bytes: 8 ; Trashed: A LD A, H RLA RLA AND 3 LD L, A LD H, 0 |
; T-Cycles: 44 ; Bytes: 8 ; Trashed: None LD L, H LD H, 0 ADD HL, HL ADD HL, HL LD L, H LD H, 0 |
; T-Cycles: 24 ; Bytes: 6 ; Trashed: A XOR A RL H LD H, A RLA LD L, A |
; T-Cycles: 26 ; Bytes: 7 ; Trashed: None RL H LD HL, 0 RL L |
The right shifts range from minima of 11 to 57 T-cycles or between 3 and 11 code bytes.
If you were developing a library to perform arbitrary shifts, you might consider the following:
ShiftRight:
; Assume 0 <= A <= 15
LD (ShiftRightJump-1), A
XOR 15
ADD A, A
ADD A, A
LD (ShiftRightJump+1), A
LD A, 0 ; Overwritten operand
ShiftRightJump:
JR $+2 ; Overwritten operand
JP ShiftRight15
NOP
JP ShiftRight14
NOP
JP ShiftRight13
NOP
JP ShiftRight12
NOP
JP ShiftRight11
NOP
JP ShiftRight10
NOP
CP 0 ; Two NOPs
ShiftRight9:
SRL H
ShiftRight8:
LD L, H
LD H, 0
RET
JP ShiftRight7
NOP
JP ShiftRight6
NOP
JP ShiftRight5
NOP
ShiftRight4:
SRL H
RR L
ShiftRight3:
SRL H
RR L
ShiftRight2:
SRL H
RR L
ShiftRight1:
SRL H
RR L
ShiftRight0:
RET
ShiftNegative:
; Assume -15 <= A <= 0
LD (ShiftNegativeJump-1), A
ADD A, A
ADD A, A
SUB ShiftNegativeJump-ShiftRight0+2
LD (ShiftNegativeJump+1), A
LD A, 0 ; Overwritten operand
ShiftNegativeJump:
JR $+2 ; Overwritten operand
ShiftRight15:
RL H
LD HL, 0
RL L
RET
ShiftRight14:
LD L, H
LD H, 0
ADD HL, HL
ADD HL, HL
LD L, H
LD H, 0
RET
ShiftRight13:
SRL H
ShiftRight12:
SRL H
ShiftRight11:
SRL H
ShiftRight10:
SRL H
SRL H
LD L, H
LD H, 0
RET
ShiftRight6:
PUSH AF
XOR A
ADD HL, HL
RLA
ADD HL, HL
RLA
LD L, H
LD H, A
POP AF
RET
ShiftRight5:
PUSH AF
XOR A
ADD HL, HL
RLA
ADD HL, HL
RLA
ADD HL, HL
RLA
LD L, H
LD H, A
POP AF
RET
ShiftAny:
CP 241
JP NC, ShiftNegative
CP 16
JR NC, ShiftRight16
ShiftLeft:
ShiftPositive:
; Assume 0 <= A <= 15
LD (ShiftLeftJump-1), A
ADD A, A
ADD A, A
ADD A, A
LD (ShiftLeftJump+1), A
LD A, 0 ; Overwritten operand
ShiftLeftJump:
JR $+2 ; Overwritten operand
ShiftLeft0:
RET
ShiftRight7:
ADD HL, HL
LD L, H
LD H, 0
RL H
RET
ShiftLeft1:
ADD HL, HL
RET
NOP
NOP
NOP
NOP
NOP
NOP
ShiftLeft2:
ADD HL, HL
ADD HL, HL
RET
NOP
NOP
NOP
NOP
NOP
ShiftLeft3:
ADD HL, HL
ADD HL, HL
ADD HL, HL
RET
ShiftLeft16:
ShiftRight16:
LD HL, 0
RET
ShiftLeft4:
ADD HL, HL
ADD HL, HL
ADD HL, HL
ADD HL, HL
RET
NOP
NOP
NOP
ShiftLeft5:
ADD HL, HL
ADD HL, HL
ADD HL, HL
ADD HL, HL
ADD HL, HL
RET
NOP
NOP
ADD HL, HL
ADD HL, HL
ADD HL, HL
ADD HL, HL
ADD HL, HL
ADD HL, HL
RET
NOP
ShiftLeft7:
SRL L
LD H, L
LD L, 0
RR L
RET
ShiftLeft8:
LD H, L
LD L, 0
RET
NOP
NOP
NOP
NOP
ShiftLeft9:
SLA L
LD H, L
LD L, 0
RET
ShiftLeft11:
SLA L
ShiftLeft10:
SLA L
SLA L
LD H, L
LD L, 0
RET
JP ShiftLeft11
NOP
NOP
NOP
NOP
NOP
ShiftLeft12:
LD H, L
LD L, 0
ADD HL, HL
ADD HL, HL
ADD HL, HL
ADD HL, HL
RET
JP ShiftLeft13
NOP
NOP
NOP
NOP
NOP
JP ShiftLeft14
NOP
NOP
NOP
NOP
NOP
ShiftLeft15:
RR L
LD HL, 0
RR H
RET
ShiftLeft6:
PUSH AF
XOR A
SRL H
RR L
RRA
SRL H
RR L
RRA
LD H, L
LD L, A
POP AF
RET
ShiftLeft13:
SLA L
SLA L
SLA L
SLA L
SLA L
LD H, L
LD L, 0
RET
ShiftLeft14:
LD H, 0
RR L
RR H
RR L
RR H
LD L, 0
RET |
This library provides you with a number of useful entry points within its 335 code bytes. However, some do make use of self-modifying code.