Monday, 9 May 2011

Count duplicates in an Integer array–Google Phone Interview Question


Question
This is one of the question asked to me on a phone interview with Google. Funny that it takes lots of time to think and the interviewer even after he gave ample time I just slept down as its 12:00 midnight. :)
This is the question: http://www.careercup.com/question?id=3483977
Find the repeated numbers in a sorted array using less than O(n) time complexity
So, if you have, A[] = {1,1,1,2,2,2,3,3,4}
you should print, 1=>3, 2=>3, 3=>2, 4=>1
Approach
The approach to this problem in O(n) way would be to just keep counting from 0 to n in the array. But the interviewer needs lesser time than that.
I came up with a approach but cannot make that into a working code:
The approach is, since the array is sorted, you can apply a modified version of binary traversal over the array and count the number of entries..
For example,
1  1   1  2   2   2   3   3   4
start        mid            end
If arr[start] == arr[end] then its simply means arr[end] symbol count is end-start+1
If not, keep splitting the array and compare arr[start] and arr[mid], if they are equal arr[mid] count is mid-start+1
similarly, its same for arr[mid+1] and arr[end]
But the important end condition to see here is, set of 3 numbers..
When the number of elements is 3, i.e, end-start = 2. you get following combinations
start     mid          end
3,       3,          3 = All same, so our start –> end takes care of this
2,          3,          3 = mid and end are different!! a shift!!
2,       2,              3 = start and mid are different!! a shift
3,       4,          5 = start, mid, end all are different!!
All the above cases should be handled perfectly to get the correct working code.
Code
 

map<int, int> blist;

 

void count_dup(int* arr, int start, int end)

{

    if(end == 0)

        blist[arr[start]]+=1;

 

    if(start<end) {

        if(arr[start] == arr[end]) {

            blist[arr[end]] += end - start + 1;

            return;

        }

 

        int mid = abs((start+end)/2);

        if(end - start == 2 && arr[start] < arr[mid] && arr[mid] < arr[end]) {

            blist[arr[start]] +=1;

            blist[arr[mid]] +=1;

            blist[arr[end]] +=1;

            return;

        }

 

        if(arr[start] != arr[mid] && start == mid) {

            blist[arr[start]]+=1;

            blist[arr[mid]] += 1;

            return;

        }

 

        if(arr[start] < arr[mid]) {

            count_dup(arr, start, mid);

        } 

        else {

            blist[arr[mid]] += mid - start + 1;

        }

        

        if(arr[mid+1] != arr[end] && mid+1 == end) {

            blist[arr[mid+1]]+=1;

            blist[arr[end]] += 1;

            return;

        }

        

 

        if(arr[mid+1] < arr[end]) {

            count_dup(arr, mid+1, end);

        }

        else {

            blist[arr[end]] += end - mid;

        }

    }

}

 

void main()

{

    int A[] = {1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,7,7,8,9};

    count_dup(A, 0,20);

 

}
  • Note that we only do recursive call if start, mid or mid+1, end is lesser.. That’s the way we do a binary traversal
  • The last case of all 3 different should be handled first itself. In this case, we  add to each start, mid, end +1
  • In cases when, start-end = 1 you either get, start=mid or mid+1=end, these are places where we need to check the other 2 cases.. mid+1&end, start&mid
  • Note that, since we only do, mid+1 and end.. mid & mid+1 get missed. This case is taken care by else part of the two recursions.
  • Worst case!!.. If all the elements in the array are different, it visits each and every node. Omega(n)
  • Best case!!. If all the elements in the array is the same, O(1). Yes, we just use index to calculate the count.
  • Average case!!.. It would be Theta(log n) depending on the input data
UPDATE:The above code is really brute force and not sure whether it works out for all cases and scenarios.

The perfect easy way to code is here: http://codepad.org/p0guN137.

The approach is to form a binary search tree from the sorted array and while forming that, use the hash map to track the duplicates count.  But it is also O(n) since we need to traverse the whole input set to form the tree. We can ignore the recursion when it goes around for duplicate items to get around O(log n). But using this way, it would convince the interviewer a little bit and also as we can ignore recursion in-between it reduces the time to less than O(n).