The Better Half
May 19th, 2010
A cute and easy algorithmic riddle.
You have an array of N bit strings each of length M. You know that there is at least one element that appears more than N/8 times in the array. Using O(M+log(N)) memory and O(NM) time, find such an element.
An Easier Version
Well, its actually almost exactly the same, but solve the above riddle in case there is an element that appears more than N/2 times in the array. I managed to find 3 distinct solutions to this easier variation, but only one of which generalizes easily.
Thanks Nemo for giving me this riddle!
May 20th, 2010 at 10:24 pm
The following simple algorithm works:
- Keep a (priority) queue of N/8, the value of x’s counter at the end is positive.
Now, it can happen that the final queue contains some low-freq elements (e.g., the last element to appear is surely in the queue), but the maximal counter belongs to a high-freq element [proof needed].
May 20th, 2010 at 10:34 pm
Hey! my entire proof is gone! Again.
ALGORITHM
We keep a (priorirty) queue of length at most 8, with strings and positive integer counters. The lowest counter value is called the baseline.
Read a string.
If it’s already in the queue, increase its counter. Otherwise:
{
If the queue has 8 elements, decrease all counters by the baseline and drop all elements with counter value zero.
Add the new element with counter value 1.
}
At the end, report the element of highest counter value.
PROOF
Let X be a string that appeared K times (so other elements appeared N-K times).
We investigate how its queue counter value C evolves.
Every time X appears, C increases by one.
Every time X is dropped, the decrease in C (=baseline) is applied to the other 7 counters too, so we can charge it to 7*baseline appearances of other strings.
All in all, C >= K – (N-K)/7 = (8K-N)/7.
Specifically, if K > N/8 then C is positive at the end and X is a member of the final queue.
Now, it can happen that the final queue contains some low-freq elements (e.g., the last element to appear is surely in the queue), but the maximal counter belongs to a high-freq element [proof needed].
May 22nd, 2010 at 10:26 am
Nice solution! Some remarks:
A queue of size 7 is enough, because when we decrease 1 from all of the elements of the queue, we in effect “remove” 8 strings from the data base: Those that were tracked in the queue, and also the last string read. Then if some element appeared more than 1/8th of the time, he will still be there more than 1/8 of the time after removing 8 different elements from the base, so it will be in the queue at the end of the process.
The claim that the maximal counter belongs to a high-freq element is false: Consider the scenario where the data base has 101 appearances of string ’0′, 100 appearances of strings ’1′,…,’8′, and 2 appearances of string ’9′, and they appear in lexicographical order. Then the queue at the end will have ’0′ with 1 appearance and ’9′ with 2. But ’0′ must still appear in the queue: So if we keep a register for each of the (at most) 8 elements in the final queue, and go over the list one more time, we will be able to determine which of them was really the most common.