16-bit Shifts on Z80 - chilliant.com

One would think that 16-bit, unsigned, binary shifts on a Z80 microprocessor would be as trivial as it comes. But my work on Easter calculations made me realise just how little has been actually written down about optimising Z80 code for even the simplest of cases. So here's a quick guide; alas, probably thirty years too late!

The reason these shifts aren't obvious is because of the inherent asymmetry in the Z80 processor. Although it has a fairly orthogonal 8-bit ALU, its 16-bit (address) pipeline only has a simple adder. Multiplying HL by two (a left shift) is trivial: ADD HL, HL. Shifting right is a whole can of worms; to shift a 16-bit quantity to the right, it is sometimes quicker (and/or shorter) to shift/rotate left and adjust.

The following are routines that I believe are either the fastest or shortest to achieve the shifts. Optimal alternatives are given for preserving other registers, where necessary. Flags on exit are undefined.

Many thanks to Eric Fry for pointing out a couple of mistakes I made on a previous version of this page.

Unsigned 16-bit Left Shifts

The following code snippets shift the unsigned value in HL to the left. The least significant bit is set to zero.

Shift Left 1

  ; T-Cycles: 11
  ; Bytes: 1
  ; Trashed: None
  ADD HL, HL

Shift Left 2

  ; T-Cycles: 22
  ; Bytes: 2
  ; Trashed: None
  ADD HL, HL
  ADD HL, HL

Shift Left 3

  ; T-Cycles: 33
  ; Bytes: 3
  ; Trashed: None
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL

Shift Left 4

  ; T-Cycles: 44
  ; Bytes: 4
  ; Trashed: None
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL

Shift Left 5

  ; T-Cycles: 55
  ; Bytes: 5
  ; Trashed: None
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL

Shift Left 6

This is the first point where we have a meaningful choice: the left-hand code is shorter and does not destroy the original contents of the A register, whereas the right-hand code is faster.

  ; T-Cycles: 66
  ; Bytes: 6
  ; Trashed: None
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL

  ; T-Cycles: 52
  ; Bytes: 13
  ; Trashed: A
  XOR A
  SRL H
  RR L
  RRA
  SRL H
  RR L
  RRA
  LD H, L
  LD L, A

Shift Left 7

These three snippets are, respectively: shortest, fastest and fastest without side effects.

  ; T-Cycles: 77
  ; Bytes: 7
  ; Trashed: None
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL

  ; T-Cycles: 32
  ; Bytes: 8
  ; Trashed: A
  XOR A
  SRL H
  RR L
  RRA
  LD H, L
  LD L, A

  ; T-Cycles: 35
  ; Bytes: 9
  ; Trashed: None
  SRL H
  RR L
  LD H, L
  LD L, 0
  RR L

Shift Left 8

  ; T-Cycles: 11
  ; Bytes: 3
  ; Trashed: None
  LD H, L
  LD L, 0

Shift Left 9

Although neither of the following snippets use other registers, one is faster while the other is shorter.

  ; T-Cycles: 19
  ; Bytes: 5
  ; Trashed: None
  SLA L
  LD H, L
  LD L, 0

  ; T-Cycles: 22
  ; Bytes: 4
  ; Trashed: None
  LD H, L
  LD L, 0
  ADD HL, HL

Shift Left 10

  ; T-Cycles: 23
  ; Bytes: 6
  ; Trashed: A
  LD A, L
  ADD A, A
  ADD A, A
  LD H, A
  LD L, 0

  ; T-Cycles: 33
  ; Bytes: 5
  ; Trashed: None
  LD H, L
  LD L, 0
  ADD HL, HL
  ADD HL, HL

  ; T-Cycles: 27
  ; Bytes: 7
  ; Trashed: None
  SLA L
  SLA L
  LD H, L
  LD L, 0

Shift Left 11

  ; T-Cycles: 27
  ; Bytes: 7
  ; Trashed: A
  LD A, L
  ADD A, A
  ADD A, A
  ADD A, A
  LD H, A
  LD L, 0

  ; T-Cycles: 44
  ; Bytes: 6
  ; Trashed: None
  LD H, L
  LD L, 0
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL

  ; T-Cycles: 35
  ; Bytes: 9
  ; Trashed: None
  SLA L
  SLA L
  SLA L
  LD H, L
  LD L, 0

Shift Left 12

  ; T-Cycles: 31
  ; Bytes: 8
  ; Trashed: A
  LD A, L
  ADD A, A
  ADD A, A
  ADD A, A
  ADD A, A
  LD H, A
  LD L, 0

  ; T-Cycles: 55
  ; Bytes: 7
  ; Trashed: None
  LD H, L
  LD L, 0
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL

  ; T-Cycles: 43
  ; Bytes: 11
  ; Trashed: None
  SLA L
  SLA L
  SLA L
  SLA L
  LD H, L
  LD L, 0

Shift Left 13

  ; T-Cycles: 34
  ; Bytes: 9
  ; Trashed: A
  LD A, L
  RRCA
  RRCA
  RRCA
  AND 224
  LD H, A
  LD L, 0

  ; T-Cycles: 66
  ; Bytes: 8
  ; Trashed: None
  LD H, L
  LD L, 0
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL

  ; T-Cycles: 51
  ; Bytes: 13
  ; Trashed: None
  SLA L
  SLA L
  SLA L
  SLA L
  SLA L
  LD H, L
  LD L, 0

Shift Left 14

  ; T-Cycles: 30
  ; Bytes: 8
  ; Trashed: A
  LD A, L
  RRCA
  RRCA
  AND 224
  LD H, A
  LD L, 0

  ; T-Cycles: 77
  ; Bytes: 9
  ; Trashed: None
  LD H, L
  LD L, 0
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL

  ; T-Cycles: 46
  ; Bytes: 12
  ; Trashed: None
  LD H, 0
  RR L
  RR H
  RR L
  RR H
  LD L, 0

Shift Left 15

  ; T-Cycles: 24
  ; Bytes: 6
  ; Trashed: A
  XOR A
  RR L
  LD L, A
  RRA
  LD H, A

  ; T-Cycles: 28
  ; Bytes: 7
  ; Trashed: None
  RR L
  LD HL, 0
  RR H

The left shifts range from minima of 11 to 48 T-cycles or between 1 and 8 code bytes.

Unsigned 16-bit Right Shifts

The following code snippets shift the unsigned value in HL to the right. The most significant bit is set to zero.

Shift Right 1

  ; T-Cycles: 16
  ; Bytes: 4
  ; Trashed: None
  SRL H
  RR L

Shift Right 2

  ; T-Cycles: 32
  ; Bytes: 8
  ; Trashed: None
  SRL H
  RR L
  SRL H
  RR L

Shift Right 3

  ; T-Cycles: 48
  ; Bytes: 12
  ; Trashed: None
  SRL H
  RR L
  SRL H
  RR L
  SRL H
  RR L

  ; T-Cycles: 44
  ; Bytes: 11
  ; Trashed: A
  LD A, L
  SRL H
  RRA
  SRL H
  RRA
  SRL H
  RRA
  LD L, A

Shift Right 4

  ; T-Cycles: 64
  ; Bytes: 16
  ; Trashed: None
  SRL H
  RR L
  SRL H
  RR L
  SRL H
  RR L
  SRL H
  RR L

  ; T-Cycles: 56
  ; Bytes: 14
  ; Trashed: A
  LD A, L
  SRL H
  RRA
  SRL H
  RRA
  SRL H
  RRA
  SRL H
  RRA
  LD L, A

  ; T-Cycles: 72
  ; Bytes: 11
  ; Trashed: A
  XOR A
  ADD HL, HL
  RLA
  ADD HL, HL
  RLA
  ADD HL, HL
  RLA
  ADD HL, HL
  RLA
  LD L, H
  LD H, A

Shift Right 5

  ; T-Cycles: 57
  ; Bytes: 9
  ; Trashed: A
  XOR A
  ADD HL, HL
  RLA
  ADD HL, HL
  RLA
  ADD HL, HL
  RLA
  LD L, H
  LD H, A

  ; T-Cycles: 78
  ; Bytes: 11
  ; Trashed: None
  PUSH AF
  XOR A
  ADD HL, HL
  RLA
  ADD HL, HL
  RLA
  ADD HL, HL
  RLA
  LD L, H
  LD H, A
  POP AF

Shift Right 6

  ; T-Cycles: 42
  ; Bytes: 7
  ; Trashed: A
  XOR A
  ADD HL, HL
  RLA
  ADD HL, HL
  RLA
  LD L, H
  LD H, A

  ; T-Cycles: 63
  ; Bytes: 9
  ; Trashed: None
  PUSH AF
  XOR A
  ADD HL, HL
  RLA
  ADD HL, HL
  RLA
  LD L, H
  LD H, A
  POP AF

Shift Right 7

  ; T-Cycles: 27
  ; Bytes: 5
  ; Trashed: A
  XOR A
  ADD HL, HL
  RLA
  LD L, H
  LD H, A

  ; T-Cycles: 30
  ; Bytes: 6
  ; Trashed: None
  ADD HL, HL
  LD L, H
  LD H, 0
  RL H

Shift Right 8

  ; T-Cycles: 11
  ; Bytes: 3
  ; Trashed: None
  LD L, H
  LD H, 0

Shift Right 9

  ; T-Cycles: 19
  ; Bytes: 5
  ; Trashed: None
  SRL H
  LD L, H
  LD H, 0

Shift Right 10

  ; T-Cycles: 27
  ; Bytes: 7
  ; Trashed: None
  SRL H
  SRL H
  LD L, H
  LD H, 0

Shift Right 11

Here, you only waste 1 T-cycle if you want to preserve A.

  ; T-Cycles: 35
  ; Bytes: 9
  ; Trashed: None
  SRL H
  SRL H
  SRL H
  LD L, H
  LD H, 0

  ; T-Cycles: 34
  ; Bytes: 9
  ; Trashed: A
  LD A, H
  RRA
  RRA
  RRA
  AND 31
  LD L, A
  LD H, 0

Shift Right 12

  ; T-Cycles: 43
  ; Bytes: 11
  ; Trashed: None
  SRL H
  SRL H
  SRL H
  SRL H
  LD L, H
  LD H, 0

  ; T-Cycles: 38
  ; Bytes: 10
  ; Trashed: A
  LD A, H
  RRA
  RRA
  RRA
  RRA
  AND 15
  LD L, A
  LD H, 0

Shift Right 13

  ; T-Cycles: 51
  ; Bytes: 13
  ; Trashed: None
  SRL H
  SRL H
  SRL H
  SRL H
  SRL H
  LD L, H
  LD H, 0

  ; T-Cycles: 34
  ; Bytes: 9
  ; Trashed: A
  LD A, H
  RLA
  RLA
  RLA
  AND 7
  LD L, A
  LD H, 0

  ; T-Cycles: 55
  ; Bytes: 9
  ; Trashed: None
  LD L, H
  LD H, 0
  ADD HL, HL
  ADD HL, HL
  ADD HL, HL
  LD L, H
  LD H, 0

Shift Right 14

  ; T-Cycles: 30
  ; Bytes: 8
  ; Trashed: A
  LD A, H
  RLA
  RLA
  AND 3
  LD L, A
  LD H, 0

  ; T-Cycles: 44
  ; Bytes: 8
  ; Trashed: None
  LD L, H
  LD H, 0
  ADD HL, HL
  ADD HL, HL
  LD L, H
  LD H, 0

Shift Right 15

  ; T-Cycles: 24
  ; Bytes: 6
  ; Trashed: A
  XOR A
  RL H
  LD H, A
  RLA
  LD L, A

  ; T-Cycles: 26
  ; Bytes: 7
  ; Trashed: None
  RL H
  LD HL, 0
  RL L

The right shifts range from minima of 11 to 57 T-cycles or between 3 and 11 code bytes.

Unsigned 16-bit Shift Library

If you were developing a library to perform arbitrary shifts, you might consider the following:

  ShiftRight:
    ; Assume 0 <= A <= 15
    LD (ShiftRightJump-1), A
    XOR 15
    ADD A, A
    ADD A, A
    LD (ShiftRightJump+1), A
    LD A, 0 ; Overwritten operand
  ShiftRightJump:
    JR $+2 ; Overwritten operand
    JP ShiftRight15
    NOP
    JP ShiftRight14
    NOP
    JP ShiftRight13
    NOP
    JP ShiftRight12
    NOP
    JP ShiftRight11
    NOP
    JP ShiftRight10
    NOP
    CP 0 ; Two NOPs
  ShiftRight9:
    SRL H
  ShiftRight8:
    LD L, H
    LD H, 0
    RET
    JP ShiftRight7
    NOP
    JP ShiftRight6
    NOP
    JP ShiftRight5
    NOP
  ShiftRight4:
    SRL H
    RR L
  ShiftRight3:
    SRL H
    RR L
  ShiftRight2:
    SRL H
    RR L
  ShiftRight1:
    SRL H
    RR L
  ShiftRight0:
    RET
  ShiftNegative:
    ; Assume -15 <= A <= 0
    LD (ShiftNegativeJump-1), A
    ADD A, A
    ADD A, A
    SUB ShiftNegativeJump-ShiftRight0+2
    LD (ShiftNegativeJump+1), A
    LD A, 0 ; Overwritten operand
  ShiftNegativeJump:
    JR $+2 ; Overwritten operand
  ShiftRight15:
    RL H
    LD HL, 0
    RL L
    RET
  ShiftRight14:
    LD L, H
    LD H, 0
    ADD HL, HL
    ADD HL, HL
    LD L, H
    LD H, 0
    RET
  ShiftRight13:
    SRL H
  ShiftRight12:
    SRL H
  ShiftRight11:
    SRL H
  ShiftRight10:
    SRL H
    SRL H
    LD L, H
    LD H, 0
    RET
  ShiftRight6:
    PUSH AF
    XOR A
    ADD HL, HL
    RLA
    ADD HL, HL
    RLA
    LD L, H
    LD H, A
    POP AF
    RET
  ShiftRight5:
    PUSH AF
    XOR A
    ADD HL, HL
    RLA
    ADD HL, HL
    RLA
    ADD HL, HL
    RLA
    LD L, H
    LD H, A
    POP AF
    RET
  ShiftAny:
    CP 241
    JP NC, ShiftNegative
    CP 16
    JR NC, ShiftRight16
  ShiftLeft:
  ShiftPositive:
    ; Assume 0 <= A <= 15
    LD (ShiftLeftJump-1), A
    ADD A, A
    ADD A, A
    ADD A, A
    LD (ShiftLeftJump+1), A
    LD A, 0 ; Overwritten operand
  ShiftLeftJump:
    JR $+2 ; Overwritten operand
  ShiftLeft0:
    RET
  ShiftRight7:
    ADD HL, HL
    LD L, H
    LD H, 0
    RL H
    RET
  ShiftLeft1:
    ADD HL, HL
    RET
    NOP
    NOP
    NOP
    NOP
    NOP
    NOP
  ShiftLeft2:
    ADD HL, HL
    ADD HL, HL
    RET
    NOP
    NOP
    NOP
    NOP
    NOP
  ShiftLeft3:
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    RET
  ShiftLeft16:
  ShiftRight16:
    LD HL, 0
    RET
  ShiftLeft4:
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    RET
    NOP
    NOP
    NOP
  ShiftLeft5:
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    RET
    NOP
    NOP
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    RET
    NOP
  ShiftLeft7:
    SRL L
    LD H, L
    LD L, 0
    RR L
    RET
  ShiftLeft8:
    LD H, L
    LD L, 0
    RET
    NOP
    NOP
    NOP
    NOP
  ShiftLeft9:
    SLA L
    LD H, L
    LD L, 0
    RET
  ShiftLeft11:
    SLA L
  ShiftLeft10:
    SLA L
    SLA L
    LD H, L
    LD L, 0
    RET
    JP ShiftLeft11
    NOP
    NOP
    NOP
    NOP
    NOP
  ShiftLeft12:
    LD H, L
    LD L, 0
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    ADD HL, HL
    RET
    JP ShiftLeft13
    NOP
    NOP
    NOP
    NOP
    NOP
    JP ShiftLeft14
    NOP
    NOP
    NOP
    NOP
    NOP
  ShiftLeft15:
    RR L
    LD HL, 0
    RR H
    RET
  ShiftLeft6:
    PUSH AF
    XOR A
    SRL H
    RR L
    RRA
    SRL H
    RR L
    RRA
    LD H, L
    LD L, A
    POP AF
    RET
  ShiftLeft13:
    SLA L
    SLA L
    SLA L
    SLA L
    SLA L
    LD H, L
    LD L, 0
    RET
  ShiftLeft14:
    LD H, 0
    RR L
    RR H
    RR L
    RR H
    LD L, 0
    RET

This library provides you with a number of useful entry points within its 335 code bytes. However, some do make use of self-modifying code.

`ShiftLeft0` to `ShiftLeft16`

These routines shift HL left by the appropriate amount in the fastest way possible, without modifying any other registers (except possibly flags). They take between 10 and 76 T-cycles.

`ShiftRight0` to `ShiftRight16`

These routines shift HL right by the appropriate number of bits in the fastest way possible, without modifying any other registers (except possibly flags). They take between 10 and 90 T-cycles.

`ShiftLeft` and `ShiftRight`

These two routines shift HL left or right by the number of bits specified in A. Register A must be between 0 and 15 inclusive. Only HL is altered. They take, at most, 133 and 150 T-cycles respectively.

`ShiftPositive`

Shifts HL left (positive direction) the number of bits specified in A. Register A must be between 0 and 15 inclusive. Only HL is altered. It takes between 67 and 133 T-cycles.

`ShiftNegative`

Shifts HL right (negative direction) the number of bits specified in A. Register A is treated as an 8-bit signed quantity and must be between -15 (0xF1) and 0 (0x00) inclusive. Only HL is altered. It takes between 70 and 150 T-cycles.

`ShiftAny`

Shifts HL left or right the number of bits specified in A. Register A is treated as an 8-bit signed quantity. If A is outside the range [-15,+15], HL will be set to zero. In all cases, only HL is altered. It takes between 56 and 167 T-cycles.