Thursday, 17 February 2011

Merge sort with k way insertion sort

“I believe no one forgot merge sort!! The first place where we saw about trees and divide-conquer algorithms. Recollect all that, to understand the new method of sorting again using divide and conquer to save space”

See Also

http://analgorithmaday.blogspot.com/2011/01/merge-lists-without-using-extra-space.html

http://analgorithmaday.blogspot.com/search/label/merge

Metaphor

   What if you want to sort a list of sorted list using just a merge? Merge is really important in real world computer systems.. You merge code, you merge songs, you mix videos, you do editing. Its like you a list of sorted things, you need merge all of them and create a single sorted list.

  So, there are two things here you need, you need initial sorted things. How you sort them? you can use any sorting algorithm for this. There is no enforcement over this. So, merge itself becomes a sorting algorithm when you provide a sorted list to it. But what does our merge example given already does? It merges 2 sorted arrays into a single sorted array. Think about an external sorting scenario where you have 9 sorted list. Then you need to merge 9 ways. :)

  Even in code comparison tool, you have 3 way merge where you compare with 3 versions of code.. Even this merge is similar to what we do in a 3-way merge.

Concept

  The main concept here is, prepare a list of sorted arrays. You can create sorted arrays using any method. We will select insertion sort for now. quick sort can be studied next as it is a big topic!!.. :) After you sort the arrays, you get say k sorted arrays. How you merge all of em? As explained in the external sort, take the initial element in each array and do a merge into the sorted list.

The number of elements to take from each array should be configurable. There are many more optimization possible in a k-way merge sort. Using quicksort reduces the time taken, organizing the memory in RAM in the form of a tree reduces comparisons (priority queue usage). But it is expected that, the algorithm takes lots of CPU and much lesser RAM.

Code

void insertion_sort(int *arr, int n)
{
    for(int j=1; j<n;j++)
    {
        int key = arr[j];
        int i=j-1;
        while(i>=0) {
            if(arr[i] > key)
            {
                arr[i+1] = arr[i];
                i = i - 1;
            }
            else
                break;
        }
        arr[i+1] = key;
    }
}
 
// old way 2 way merge
void merge(int *arr, int start, int mid, int end)
{
    // arr = List D = split into List A and List B
    int n1, n2;
 
    n1 = mid - start + 1; // List A[0..n1-1]
    n2 = end - mid; // List B[0..n2-1]
 
    int *la, *lb;
    int i,j,k;
    la = (int*) malloc(sizeof(int)*(n1+1)); 
    lb = (int*) malloc(sizeof(int)*(n2+1));
 
    for(i=0; i<n1; i++)
    {
        if(start == 0)
            la[i] = arr[i];
        else
            la[i] = arr[start+i];
    }
 
    for(j=0; j<n2; j++)
        lb[j] = arr[mid+j+1];
 
    la[n1] = INT_MAX;
    lb[n2] = INT_MAX;
 
    i = 0; j =0;
    for(k = start; k<end+1; k++)
    {
        if(la[i] < lb[j])
        {
            arr[k] = la[i];
            i = i + 1;
        }
        else
        {
            arr[k] = lb[j];
            j = j + 1;
        }
    }
    free(la);
    free(lb);
}
 
const int SPLITS = 20;
 
void kmerge(FILE* out_fp,int segsize, int buf_size)
{
    // SPLITS * buf_size = RAM usage of this method
    /*
    1. read buf_size from each splits into a array of size SPLITS*buf_size
    2. perform the above 2-way merge in-place
    3. write the array inside out_fp
    */
    int* merge_arr = (int*)malloc(sizeof(int)*SPLITS*buf_size);
    int buf_offset=0;
    FILE *tmpfile[SPLITS];
    for(int k=0; k<SPLITS;k++) {
        char* filename = (char*) malloc(105);
        char* number=(char*) malloc(100);
        strcpy_s(filename, 105, "tmp");
        _itoa_s(k, number, 100, 10);
        strcat_s(filename, 105, number);
        errno_t e2 = fopen_s(&tmpfile[k], filename, "r");
    }
 
    while(buf_offset < segsize) {
        int j=0;
 
        // read buf_size from SPLITS files and 
        // put it inside merge_arr
        for(int k=0; k<SPLITS;k++) {
            for(int i=0; i<buf_size; i++) {
                char* buffer= (char*)malloc(100);
                if(!fgets(buffer, 100, tmpfile[k])) 
                    break;
                merge_arr[j]= atoi(buffer);
                j++;
                free(buffer);
            }
        }
 
        // merge the array
        merge(merge_arr, 0, SPLITS*buf_size/2,SPLITS*buf_size);
 
 
        // write into output file
        char* buffer=(char*) malloc(100);
        for(int k =0; k < j; k++) {
            _itoa_s(merge_arr[k], buffer, 100, 10);
            buffer[strlen(buffer)+1] = '\0';
            buffer[strlen(buffer)] = '\n';
            fputs(buffer, out_fp);
        }
        buf_offset+=buf_size;
    }
    for(int k=0; k<SPLITS;k++) {
        fclose(&tmpfile[k]);
    }
 
}
 
void external_sort(FILE* fp, int lineno, FILE* out_file)
{
    int l = (int) ceil((float)lineno/(float)SPLITS);
    int k=0;
    FILE* midfile;
    while(!feof(fp)) {
        int i = 0;
        char* buffer = (char*) malloc(100);
        int* p = (int*) malloc(sizeof(int)*l);
        while(i < l) {
            if(!fgets(buffer, 100, fp)) 
                break;
            p[i]= atoi(buffer);
            i++;
        }
        free(buffer);
        insertion_sort(p, i);
        char* filename = (char*) malloc(105);
        char* number=(char*) malloc(100);
        strcpy_s(filename, 105, "tmp");
        _itoa_s(k, number, 100, 10);
        strcat_s(filename, 105, number);
        errno_t e2 = fopen_s(&midfile, filename, "w");
        buffer=(char*) malloc(100);
        for(int j =0; j < i; j++) {
            _itoa_s(p[j], buffer, 100, 10);
            buffer[strlen(buffer)+1] = '\0';
            buffer[strlen(buffer)] = '\n';
            fputs(buffer, midfile);
        }
        free(buffer);
        free(p);
        buffer = 0;
        p=0;
        fclose(midfile);
        k++;
    }
    kmerge(out_file, l, 500);
}
 
void main()
{
    //519670
    FILE *in, *out;
    errno_t e1 = fopen_s(&in, "sample.txt", "r");
    errno_t e2 = fopen_s(&out, "out_sorted.txt", "w");
    external_sort(in, 519670, out)
    fclose(out);
    fclose(in);
}

The above code reused insertion_sort and merge routine which is already written. Some terminology in code,

SPLITS = number of splits to be done over the big file

seg_size = number of lines in the file/SPLITS

kmerge buf_size = the size to read from each sorted file

This is the diagrammatic representation,

sample.txt 

                    seg_size       seg_size*2….

|--------------------|-----------------|---------------------------------|

|           tmp0       |    tmp1          |    tmp2          | ….

|------------------------------------------------------------------------|

tmp0, tmp1, tmp2..unto tmp[SPLITS] files contains sorted parts of sample.txt.

All tmp files are merged into out_sorted.txt and a single huge sorted list is created.

While merging this is diagrammatic representation,

tmp0                  seg_size         

| -----|---------------|                  

|       |                   |

| -----|---------------|

       buf_size(500)

From all files, tmp0..tmpN, buf_size data is read into an array (RAM) and then merge operation is done over it. This programs RAM usage is not profiled correctly!! but it will be lesser.

Important points

  • You cannot use a single file to do sorting + merging. You need to split into different files and sort it
  • The files created are not deleted. just for the reference. Line number in the file is hardcoded.
  • insertion sort is used which is O(n2).. Should use better sorting based on the number of elements. Heap sort,quick sort are better alternatives.
  • NOT AT ALL TESTED!!!… USE AT UR OWN RISK!!..
  • BETER USE K-WAY SORT FROM THE BELOW LINKS RATHER THAN MY IMPLEMENTATION

Reference & must to use

http://code.google.com/p/kway/

http://cis.stvincent.edu/html/tutorials/swd/extsort/extsort.html