December 31, 2003

Snowfall in Central Park, New York

I was chatting with Dave the other day and he mentioned he was visiting www.tradesports.com during Christmas "dinner."  I had checked it out a few months ago, but not recently, and I found they have a section for betting on the weather, which piqued my curiosity for some reason.  Presently, it has only one sub-category: snowfall in Central Park for various months or the whole season.  I have only briefly been in New York City to change planes, so I'm not exactly an expert on snowfall there, but the folks at NOAA have a nice collection of historical data going back over 133 years.  I imported this data into MS Excel and played with it.

The resulting spreadsheet may be downloaded. It shows the amount of snow that fell in Central Park each month since the season of 1869-1870, the maximum months for each season, the seasonal totals, the monthly averages, etc. There are some columns to the right with zeros (false) and ones (true), which indicate if some event happened for the given season (i.e. row), such as "Was the seasonal snowfall at least 30 inches for 2002-2003?" Below, the probabilities of these events are estimated by averaging the zeros and ones for each column.

I do not have this year's snowfall numbers in the table. But I read elsewhere that December of 2003 has so far seen 19.8 inches, which is well above the average, 5.3", and median, 3.1". The most obvious implication is that the total for the season will probably be in the ballpark of 19.8" plus the mean or median snowfall from January through April, i.e. 19.8"+16.8"=36.6" to 19.8"+21.9"=41.7". I computed the raw probability that the seasonal snowfall would be at least 30", 40", 50", and 60", which are ongoing wagers at tradesports.com. The results were 38.06%, 20.90%, 9.70%, and 2.99%. Next I computed the conditional probability that the seasonal snowfall will be at least 30", 40", 50", and 60", given that at least one month has received 19" or more of snow, and found the conditional probabilities to be 89.66%, 37.61%, 25.49%, and 14.29%. The last traded values at tradesports.com, as of this writing, are 80%, 40%, 23%, and 13%, suggesting that contracts for SNOW.NYC.SEASON+30in should be bought, since there is a difference of nearly 10 percentage points.
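For anyone who would rather not use a spreadsheet, the raw and conditional estimates can be sketched in a few lines of Python. This is only an illustration of the counting involved: the four "seasons" below are made-up numbers, not the NOAA data, and the names prob and cond_prob are my own.

```python
def prob(seasons, event):
    """Fraction of seasons in which the event happened (mean of the 0/1 column)."""
    return sum(1 for s in seasons if event(s)) / len(seasons)

def cond_prob(seasons, event, given):
    """Fraction of seasons with the event, counting only seasons where 'given' holds."""
    subset = [s for s in seasons if given(s)]
    if not subset:
        return None  # the condition never occurred, so there is nothing to estimate
    return prob(subset, event)

# Hypothetical monthly snowfall in inches, one row per season (NOT real data).
seasons = [
    [0.0, 2.0, 5.0, 8.0, 3.0, 0.0],
    [0.5, 20.0, 10.0, 6.0, 1.0, 0.0],
    [0.0, 1.0, 19.5, 8.0, 2.0, 1.5],
    [0.0, 6.0, 10.0, 9.0, 4.0, 0.0],
]

def at_least_30(season):
    return sum(season) >= 30

def some_month_19(season):
    return max(season) >= 19

print(prob(seasons, at_least_30))
print(cond_prob(seasons, at_least_30, some_month_19))
```

With only a handful of seasons satisfying the condition, the conditional estimate rests on a small sample, so it should be read with more caution than the raw one.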

Another apparent implication of the large snowfall in December is that the expected snowfall in April is reduced. I am not sure if this is a mere coincidence of the data or a symptom of some underlying natural phenomenon. My estimate of the probability of April having an inch or more of snow is 12.3%, which is about 8 percentage points below the normal average of 20%, and about 10 below tradesports.com's last traded value of 22%. So I presently recommend selling SNOW.NYC.APR+1.0in.

2004-1-1 17:15
ADDENDUM: I reviewed online information on calculating the statistical significance of a correlation, such as the pages at Vassar by Richard Lowry. Using the CORREL function in MS Excel, I calculated the correlation between the inches of snowfall in December and the snowfall in April to be r=-.12. The probability that a correlation occurred by chance is traditionally found by computing a "t value" from the number of samples, N, and the correlation value, r. This t value may be computed from the equation t = r * sqrt((N-2)/(1-r*r)). Looking this value up in a t-distribution table with N-2 degrees of freedom then gives the probability. A simpler approach is to use web pages instrumented with JavaScript for computing the probability, which is what I did until I realized that MS Excel has a TDIST function. The probability that a correlation of r=-.12 for N=133 samples would randomly occur when there is actually no underlying correlation is unfortunately 16.8% -- well above the typical 5% maximum for significance used in scientific research. However, the correlation of the event "more than an average amount of snowfall in December" with the event "at least 1 inch of snow in April" is r=-.17, and the probability of this happening by chance is only 5%. So don't bet the farm by selling SNOW.NYC.APR+1.0in, but there may be something there. Other months have less correlation with December. Of course, the season total reaching 30, 40, 50, and 60 inches is strongly correlated with a higher than average snowfall in December, as it is with the event that any month has at least 19 inches of snowfall.
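The same calculation can be sketched in Python without a t table. The function names are my own, and instead of Excel's TDIST this sketch integrates the Student's t density numerically with Simpson's rule, so treat it as an approximation under that assumption rather than a replacement for a proper statistics package.

```python
import math

def t_value(r, n):
    """t statistic for a correlation r over n samples: r * sqrt((n-2)/(1-r*r))."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """Two-tailed probability of a value at least as extreme as t.

    Integrates the density from 0 to |t| with Simpson's rule (steps must be
    even), doubles it to get the central mass, and returns what is left over.
    """
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * (h / 3) * total
    return 1 - central

# The two correlations discussed above, with N = 133 seasons.
for r in (-0.12, -0.17):
    t = t_value(r, 133)
    print(r, round(t, 3), round(two_tailed_p(t, 133 - 2), 3))
```

The degrees of freedom passed to two_tailed_p are N-2, matching the (N-2) term in the t equation.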

2004-2-17 22:45
ADDENDUM:
Even though the season is still months away from being over, there has already been well over 30 inches of snow. So if you followed my recommendation to buy SNOW.NYC.SEASON+30in, then you would have made 25 cents per dollar gambled (since at the time, the probability was at 80%, which means 4:1 odds: every 80 cents staked returns a dollar).
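The 25 cents figure is just the payout minus the price, divided by the price. A tiny sketch, assuming a contract that pays 100 when the event happens and ignoring any fees (the function name is my own):

```python
def profit_per_dollar(price_pct, payout_pct=100.0):
    """Profit per dollar staked on a contract bought at price_pct that pays payout_pct."""
    return (payout_pct - price_pct) / price_pct

print(profit_per_dollar(80))
```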

Posted by seander at 04:08 PM | Comments (0)

December 14, 2003

Online auctions

I have finished my Christmas shopping this year and have done it all online;  most of it has yet to arrive, so declaring victory would be premature.  Several items were purchased through auctions on eBay, since I am currently among the salary-challenged.  A couple others were bought on eBay on my father's behalf (as he lacks an eBay account).  Of course, the auctions on eBay show the penultimate bid value (plus epsilon) and highest bidder identity, while the rest is hidden.  However, this system has been evolving into a silent auction.

Silent auctions entail the bidders entering their maximum bids, which remain hidden until the auction ends, and whoever bids the ultimate value pays whatever the penultimate bid was (plus epsilon). On eBay, the equivalent to this is where everyone waits until the last few seconds to place their bid, and thus, anyone who decides to bid more simply because someone else has will not have the opportunity. Obviously, silent auctions benefit buyers more than sellers, who want to drum up the price by increasing the total number of bids, and therefore apparent demand (sellers try to do this using reserve prices).

The practice of entering a bid in the last minute is called “auction sniping,” and while some people (likely adrenalin junkies) snipe manually, others use web services or run programs to automate it. Some of the services available are: www.auctionsniper.com, www.eSnipe.com, www.auctionstealer.com, www.phantombidder.com, and www.vrane.com. Of these, only www.vrane.com is free (provided you bid on only one auction at a time). The others charge either a monthly fee or a percentage (typically 1%) of the auction sale price, assuming you win. I downloaded a program for MS Windows called Auction Sentry, which allows for a free trial period of 10 days, and if I want to use it thereafter, I must buy it for $14.95. I tried it on an auction for a couple 2720 Dictaphone machines for my dad, and it worked great. I might just buy it.

eBay has basically ignored the whole sniping matter so far. But in the future I suspect they may provide more options for auctions to address the frustrations of sellers who want to induce even more demand and bidders who lose at the last second but, in light of this demand, reconsider their maximum bid. In particular, an auction could go into “overtime” if a bid was placed near the end, such that the new end of the auction would be some delta of time beyond the old end time. This may continue for some number of periods, or the delta could exponentially decay (such as delaying 1 hour, then 30 minutes, 15, 7.5, 3.75 …). The auction would thus have a nominal end time and a maximum end time.
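Here is a rough Python sketch of how such an overtime rule might behave. Everything specific in it is my own assumption for illustration: times are in minutes, a bid counts as "near the end" if it lands within a 5-minute window, the first extension is 60 minutes, and each later extension is half the previous one.

```python
def overtime_deltas(first_delta, periods):
    """Extensions for each overtime period, halving every time."""
    return [first_delta / 2 ** k for k in range(periods)]

def final_end_time(nominal_end, bid_times, first_delta=60.0, window=5.0,
                   max_periods=10):
    """End time of the auction after applying overtime for late bids.

    A bid within `window` minutes of the current end pushes the end out by
    the current delta, and the delta is then halved. Bids arriving after
    the current end are ignored because the auction has already closed.
    """
    end = nominal_end
    delta = first_delta
    used = 0
    for t in sorted(bid_times):
        if t > end:
            break
        if end - t <= window and used < max_periods:
            end += delta
            delta /= 2
            used += 1
    return end

print(overtime_deltas(60, 5))
print(final_end_time(1000, [100, 998, 1058, 1200]))
```

Because the deltas halve, their sum never reaches twice the first delta, which is what gives the auction a maximum end time no matter how many snipers show up.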

Posted by seander at 04:52 PM | Comments (0)

Another bithack and a correction

The other day a person named Glenn Slayden sent me a new hack for my bithacks page as well as a correction.  The correction was for finding the log base 2 of an N-bit integer in O(lg(N)) operations where the number is known to be a power of 2 (special case below).  Some months ago I had optimized the array of constants for the general code above, but forgot to keep the constants for the special case code below, which caused it to fail.  It is fixed now.  The new hack is for conditionally setting or clearing bits in a word.  Suppose you want to set bits, defined by mask m, in a word, w, if flag f is true; if flag f is false you want to clear the bits in w that are defined in m.  The hack is to do this without branching: w ^= (-f ^ w) & m.  This seems to involve more operations than the branching version: if (f) w |= m; else w &= ~m.  Yet, I did some speed tests on an Athlon XP 2100+ using g++ -O3, and sure enough it does provide a 5-10% improvement.
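To convince yourself the two forms agree, here is a small check transcribed into Python (the hack itself is meant for C, where the speed difference matters; this sketch only verifies the logic). It assumes f is exactly 0 or 1 and that w and m are non-negative, and the branching version clears bits with w &= ~m. The function names are my own.

```python
def set_or_clear_branching(w, m, f):
    """Set the bits of m in w if f is true, otherwise clear them."""
    if f:
        w |= m
    else:
        w &= ~m
    return w

def set_or_clear_branchless(w, m, f):
    """Same result without a branch; f must be 0 or 1."""
    # -f is all ones when f is 1 and zero when f is 0, so (-f ^ w) & m picks
    # exactly the bits of m that differ from the desired state.
    w ^= (-f ^ w) & m
    return w

# Exhaustive comparison over all 8-bit words and masks.
for w in range(256):
    for m in range(256):
        for f in (0, 1):
            assert set_or_clear_branchless(w, m, f) == set_or_clear_branching(w, m, f)

print(bin(set_or_clear_branchless(0b1010, 0b0110, 1)))
print(bin(set_or_clear_branchless(0b1010, 0b0110, 0)))
```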

Posted by seander at 02:25 PM | Comments (0)

December 08, 2003

Ultimate insomniac

I was reading the Quirkies section of Ananova.com and came across an article describing a 58-year-old Romanian woman, Maria Stelica, who hasn't slept in 8 years.  I'm not sure if I can believe it, since other reported cases of people who could not sleep (due to neurological problems) ended in their deaths after a few months, but the article says doctors have verified her claims and were trying to understand the phenomenon.  When I was an undergraduate at the University of Washington, I recall watching a film during a Psychology class in which a man in Britain was interviewed.  Later in life, he consistently required only 15 minutes of sleep per day to function normally.  I can't find anything about him on the web, but no matter -- Maria is my new ultimate insomniac.

Posted by seander at 10:48 PM | Comments (0)

December 03, 2003

RAID Death and Resurrection

My blog has been down.  The reason is partly that the software RAID on which my directory sat was down, due to a temporary power outage knocking out two drives at once, and partly that I’ve been busy with fixing things.  (RAID level 5 can tolerate at most one drive going bad.)  Both of the drives became out of sync with the third, though one of them was really ok physically.  The other one apparently developed too many bad blocks on a track and the hardware bad block remapping couldn't deal with it, so it appeared to be in a failed state.  The md software RAID package unfortunately can't deal with bad blocks on disks, unlike modern filesystems, such as ext2.  Fortunately I was able to recover the data using a tool called mdadm, which can tell a disk that is out of sync to have the same event count as another, thus appearing to be in sync to md.  For most (99.99999%+) of the data, this works fine, and I was able to tar everything up and copy it to our Windows XP machine.  But that was only the beginning.

Firstly, I felt I needed a solution that would mostly eliminate the power instability, which plagues us biannually, due to tree limbs falling on power lines during the high winds of Fall and Spring. We sometimes suffer brief brownouts, which leave electronic equipment in an unstable state, or power surges, which fry unprotected electronics, or blackouts that last hours. (The last year has been particularly bad -- we lost a high-end editing VCR, a bread machine, several surge protection strips, and a microwave.) So we bought a new Belkin 1200VA UPS (Uninterruptible Power Supply), which immediately supplies power to the networking equipment, Linux machine, and XP machine for 5-10 minutes after an outage begins, and then tells them to shut down gracefully. It also has surge protection and helps filter the signal. It works great, and I wish we had bought one years ago.

Secondly, since bad blocks eventually build up in any working hard drive, a way of remapping the bad blocks was needed for our RAID setup. I believe I found a solution with EVMS (Enterprise Volume Management System), which is an open source project funded by IBM that has a bad-block-remapping "plug-in." It also sports a nice GUI, and provides many other features, such as snapshots, LVM, multipath setups, and clustering. The only drawback is that it hasn't been available very long, so it probably has a few bugs.

To complicate matters, the old system disk we were using was only 10 GB, and after putting on swap space, home directories, and tons of software packages, it was nearly full. I was growing tired of keeping the Red Hat packages up to date, and dealing with the dependencies was a (slow and manual) pain. Perhaps if we paid them it would not be a nuisance, but that seemed unnecessary, since I had been reading that other distributions provided free and nearly automated solutions to keeping packages current. To top it off, Red Hat decided to stop catering to desktop platforms and focus on their server market. So I wanted to give another distro a try. But first we needed more system disk space.

Our Asus A7n8x Deluxe motherboard can support two SATA (Serial Advanced Technology Attachment) drives, so we went to PC-Club in Bellevue and bought a couple 120GB Western Digital SATA drives (WD1200JD) for around $270. Serial-ATA is supposed to eventually replace the widespread Parallel-ATA IDE, and though the cables' diameters and connectors are smaller and more manageable, SATA currently isn't much faster.

After the drives were installed, I spent several days reading about different Linux distributions; I considered Suse, Mandrake, Knoppix, Debian, Linux from Scratch, Gentoo, and others. I decided on Gentoo, since the package management seemed well respected, there were many active developers, it's conducive to a better understanding of how things work "under the hood," and it is source code-based, allowing for custom compilation optimizations for the processor used and hacking of packages' code. If you have a slow processor, you may not appreciate the hours required to compile everything from scratch, but fast modern processors can make quick work of it, and future ones will be even faster.

There are many different Linux kernels available for use with Gentoo. Not all versions of the Linux kernel support SATA, unfortunately. The 2.6 kernel, which is still in beta testing, the -ac (Alan Cox) versions of the 2.4 kernel, and probably a few others support SATA. EVMS version 2, which has the bad block replacement, requires a component called Device Mapper, which is included in the 2.6 kernels but must be installed in 2.4 kernel versions. I didn't have much luck persuading the 2.4 kernel versions to work. Either they had SATA working or EVMS 2, but not both. Only the vanilla 2.6 kernel was readily able to handle both after a few patches. Finally, after several attempts, I have the 2.6 (beta 9) kernel and EVMS 2 working.

The five 120GB disks each have 800MB, 11GB, 50GB, and 50GB partitions. About 2.4GB of swap is spread across three of the 800MB partitions, while the other two are used as a boot partition and a backup boot partition. The OS is on one of the 11GB partitions and backed up to another. Two 11GB partitions are combined as a fast 22GB RAID level 0 (striped) volume for scratch space that is not backed up. As for the 50GB partitions, they are used to create two 200GB volumes, each with RAID level 5. The reason for two rather than a single large volume is to guard against filesystem corruption; files on one volume will be backed up to the other until I’m confident that the Reiser filesystem is stable. For all volumes except one of the big RAIDs and the boot partitions, I am using Reiser 3.6.8, which is fast, but I don’t trust it after reading about others who have been burned by earlier versions of Reiser. The boot partitions are ext3, because I know that our bootloader, Grub, can deal with it. The other big RAID volume is formatted with JFS (IBM’s Journaling File System, ported from AIX), which is good for large files, much like XFS. JFS is still green, and not many people have used it yet, so I don’t trust it either.
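The capacity arithmetic behind that layout can be sketched as follows. The helper names are my own, sizes are in MB, and I assume 1GB = 1000MB to keep the numbers round; RAID 0 simply adds its members together, while RAID 5 gives up one member's worth of space to parity.

```python
def raid0_capacity(member_sizes):
    """Striping: usable space is the number of members times the smallest member."""
    return len(member_sizes) * min(member_sizes)

def raid5_capacity(member_sizes):
    """RAID 5: one member's worth of space goes to parity."""
    return (len(member_sizes) - 1) * min(member_sizes)

DISKS = 5
per_disk_mb = [800, 11000, 50000, 50000]   # partitions on each 120GB disk

used_per_disk = sum(per_disk_mb)           # space allocated on each disk
swap_total = 3 * 800                       # swap on three of the 800MB partitions
scratch = raid0_capacity([11000, 11000])   # the RAID 0 scratch volume
big_volume = raid5_capacity([50000] * DISKS)  # one RAID 5 volume, one partition per disk

print(used_per_disk, swap_total, scratch, big_volume)
```

Each disk has two 50GB partitions, so the same RAID 5 calculation applies twice, once per big volume.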

At this point I'm feeling a combination of relief and paranoia. I'm happy with Gentoo and I'm glad to have things working, but if an IDE controller goes out, leaving the RAID partitions unsynchronized, I don't know if EVMS can deal with it without losing data, and the mdadm tool can't help.

Posted by seander at 09:20 AM | Comments (0)