In response to Paul Messmer's email, I revised the operation counting methodology on my bithacks page to not count large constants as operations. It probably takes more time to load the value from a variable than a large immediate, even when multiple instructions are required to load it, because fetching values from memory has grow slower with faster CPUs clockrates. Since I don't count variables as operations, I decided not to count constants. My initial policy was due to the small number of registers on the popular x86 CPUs and the mulitple instructions required for loading large immediates on MIPS CPUs.
He also recommeded that I count the length of computational dependency chains rather than total operations. For example, in the following, x should take less time to compute than y:
x = a + b + c + d + e + f;While (a+b) is added, a modern superscalar processor could add (c+d) and (e+f) simultaneously. Unfortunately IA32 processors are limited in the number of registers they can use, and compilers for them typically don't bother to emit instructions that can be executed in parallel. I tested this with g++ 3.4 on a Pentium III, an Athlon XP, and an Athlon 64. Only after I tweeked the assembly output for the Athlon 64 (with it running in 64-bit mode) could I get the expected 40% speed-up.
y = ((((a + b) | c) + d) | e) + f;
So before updating the bithacks page to reflect the length of the computational dependency chains, I'll wait until at least half of the CPUs sold are capable of x86-64 (or EM64T, as Intel puts it), and people are typically running them in 64-bit mode (on Windows 64 or Linux 64), and compilers are capable of emitting code that can be executed in parallel.
/* The following uses the Gnu Assembler (gas) syntax, which has the destination register last */
/* Compiler's a+b+c+d+e+f: */
addl %esi, %edi
addl %edx, %edi
addl %ecx, %edi
addl %r8d, %edi
leal (%rdi,%r9), %eax
/* My hand-tweaked a+b+c+d+e+f: */
addl %esi, %edi
addl %ecx, %r8d
addl %edx, %r9d
addl %edi, %r8d
leal (%r8,%r9), %eax
/* Compiler's ((((a+b)|c)+d)|e)+f: */
addl %esi, %edi
orl %edx, %edi
addl %ecx, %edi
orl %r8d, %edi
leal (%rdi,%r9), %eax
Paul Messmer sent me an improvement to my bithack for determining if a word has a zero byte. It performs a quick pretest, which produces false positives on the rare occasions that the first byte is 0x80. All positives are checked using the more reliable method. Because it only takes 3 operations for the fast pretest, it runs faster.
bool mayHaveZeroByte = ((v + 0x7efefeff) ^ ~v) & 0x81010100;
if (mayHaveZeroByte) // or may just have 0x80 in the high byte
{
bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
}
// Return bits from x in even positions and bits from y in the odd
unsigned short
InterleaveBitsFromBytes(unsigned char const x, unsigned char const y)
{
return ((x * 0x0101010101010101ULL & 0x8040201008040201ULL) *
0x0102040810204081ULL >> 49) & 0x5555 |
((y * 0x0101010101010101ULL & 0x8040201008040201ULL) *
0x0102040810204081ULL >> 48) & 0xAAAA;
}
This technique works by using two instances of the same basic pattern transformation, which spreads the bits of x and y out, requiring only an OR operation. The spreading is described below.We want bit pattern abcdefgh to become a0b0c0d0e0f0g0h abcdefgh * 0000000100000001000000010000000100000001000000010000000100000001 = 0x0101010101010101ULL ================================================================== abcdefghabcdefghabcdefghabcdefghabcdefghabcdefghabcdefghabcdefgh & 1000000001000000001000000001000000001000000001000000001000000001 = 0x8040201008040201ULL ================================================================== a00000000b00000000c00000000d00000000e00000000f00000000g00000000h * 0000000100000010000001000000100000010000001000000100000010000001 = 0x0102040810204081ULL ------------------------------------------------------------------ a00000000b00000000c00000000d00000000e00000000f00000000g00000000h + 00b00000000c00000000d00000000e00000000f00000000g00000000h0000000 + 0000c00000000d00000000e00000000f00000000g00000000h00000000000000 + 000000d00000000e00000000f00000000g00000000h000000000000000000000 + 00000000e00000000f00000000g00000000h0000000000000000000000000000 + 0e00000000f00000000g00000000h00000000000000000000000000000000000 + 000f00000000g00000000h000000000000000000000000000000000000000000 + 00000g00000000h0000000000000000000000000000000000000000000000000 + 0000000h00000000000000000000000000000000000000000000000000000000 ================================================================== aebfcgdhebfcgdhe0fcgdhe0f00dhe0f0g0he0f0g0h00f0g0h0000g0h000000h >>1001001 = 49 ================================================================== ?????????????????????????????????????????????????aebfcgdhebfcgdh & 0000000000000000000000000000000000000000000000000101010101010101 = 0x5555 ================================================================== 0000000000000000000000000000000000000000000000000a0b0c0d0e0f0g0hPosted by seander at 03:27 PM | Comments (0)
const float v; // compute c = int(log2(v))
int x = *(int *)&v;
int c = x >> 23;
if (c == 0) {
// handle subnormals
while ((x & 0x400000) == 0) {
x <<= 1;
c--;
}
}
return c - 127;
I didn't like the idea that the code would loop indefinately, so I replaced the loop, which computes the integer log base 2 of the mantissa, with a version that computes it via table lookup; it is slightly modified from the general case, since we need only deal with 23 bits. It should be faster than the code above.
const float v; // find int(log2(v)), where v > 0.0 && finite(v)
const int x = *(int *) &v;
int c = x >> 23; // c gets the result;
if (c)
{
c -= 127;
}
else
{ // subnormal, so recompute using mantissa: c = intlog2(x) - 149;
register unsigned int t; // temporary
// Note that LogTable256 was defined earlier
if (t = x >> 16)
{
c = LogTable256[t] - 133;
}
else
{
c = (t = x >> 8) ? LogTable256[t] - 141 : LogTable256[x] - 149;
}
}
r = (v ^ (v >> (sizeof(int) * 8 - 1))) + ((unsigned) v >> (sizeof(int) * 8 - 1)) for anyone interested in avoiding the patented method for computing the integer absolute value without branching, though his takes another operation. I added my own variation of it, r = (v ^ (v >> (sizeof(int) * 8 - 1))) + (v < 0), which takes as many operations, though gcc generated fewer x86 instructions for it (4) than his (5). The patented version is still the best at 3.
Another variation came to mind, but I didn't think it sufficiently portable:
union {long long ll; int i[2];} t;
t.ll = v; // sign-extend
r = (t.i[0] ^ t.i[1]) - t.i[ BYTE_ORDER == LITTLE_ENDIAN];
I received an email from a Sean A. Irvine in New Zealand earlier this week. He was reading my bithacks page and suggested that I mention sign extending. I guess he uses Java, which doesn't have the bit fields feature that C has. I wrote three entries for sign extending, depending on if you have the bit fields of C and a fixed bit width or want something standard for variable bit widths or want something quick and dirty again for variable bit widths.
#!/bin/awk -f
# makethumbs
# Copyright (c) 2004 Sean Eron Anderson
# Take images filenames input and create thumbnails and output html to show them.
# Stdout will be a HTML table with the thumbnails and links to the full images.
# Each input line will have the image filename followed by an optional
# description, followed by an optional thumbnail size.
# Lines that start with # are comments
# Lines that start with <H followed by a digit 1-9 followed by > are headings, so
# So:
# <H2>\tExample Title
# Simple example: Go to a directory with images and type:
# ls -1 [^_]*.JPG | makethumbs >index.html
BEGIN {
FS = "\t";
intable = 0;
}
/<[Hh][1-9]>\t(.*)/ {
if (intable) {
print "</table>";
}
match($1, /<(.*)>/, a);
printf("<%s>%s</%s>\n\n", a[1], $2, a[1]);
if (intable) {
print "<table>"
}
}
!/^<[Hh][1-9]>\t/ && !/^[#]/ && /.+/ {
if (!intable) {
printf("<table>\n");
intable = 1;
}
size = ($3 == "") ? "x100" : $3;
resizecmd = "test _" $1 " -nt " $1 " || convert -size " size " " $1 " -resize x100 +profile '*' _" $1;
system(resizecmd);
geomcmd = "identify -size 1x1 -verbose "$1" | grep 'Base geom' | sed 's/ Base geometry: //'";
geomcmd | getline geom;
sizecmd = "identify -size 1x1 -verbose "$1" | grep 'Filesize' | sed 's/ Filesize: //'";
sizecmd | getline size;
datecmd = "date -r " $1;
datecmd | getline date;
info = $1 " " size " " geom " " date;
printf(" <tr>\n <td><a href='%s'><img src='_%s' alt='' title='%s'></a></td>\n",
$1, $1, info);
printf(" <td>%s</td>\n </tr>\n", $2);
}
END {
if (intable) {
printf("</table>\n");
}
}
The other day a person named Glenn Slayden sent me a new hack for my bithacks page as well as a correction. The correction was for finding the log base 2 of an N-bit integer in O(lg(N)) operations where the number is known to be a power of 2 (special case below). Some months ago I had optimized the array of constants for the general code above, but forgot to keep the constants for the special case code below, which caused it to fail. It is fixed now. The new hack is for conditionally setting or clearing bits in a word. Suppose you want to set bits, defined by mask m, in a word, w, if flag f is true; if flag f is false you want to clear the bits in w that are defined in m. The hack is to do this without branching: w ^= (-f ^ w) & m. This seems to involve more operations than the branching version: if (f) w |= m; else w &= m. Yet, I did some speed tests on an Athlon XP 2100+ using g++ -O3, and sure enough it does provide a 5-10% improvement.
Last Friday around 1:18 a.m. I received an email in slightly broken English from some random person who had read my bithack page and wanted to know how to swap two bits in a byte. I assume he was a novice programmer, since the obvious solution seemed straight-forward. I emailed him the answer but continued thinking about it for about 5 more hours, and came up with a couple improvements, so I added Swapping individual bits with XOR.
Later I created an ontological tableau of the various operations and solution methods to highlight where the holes are. I should figure out more precisely how many operations things take and add that to the table, and then format the table as a webpage, with links from each cell to the code.