Wednesday, 16 February 2011

External sorting using K-way merge

“The most important problem of all is the resources.. the RAM & CPU!!.. As we know, Algorithms are only for people who don’t know how to buy RAM.. :) This is a problem in which a person just has 100 MB of RAM and wanted to sort a 1 GB file with it. To sort the 1 GB file with just 100 MB, you need to design the algorithm in such a way that, it can be divided & conquered.. This is a classic problem showcased for optimizing memory usage of a system!!”

Metaphor

   This is in no context with practical world.. But a simple optimization problem to handle large number of resources with limited data. So, we will directly jump into the problem and the algorithm

Concept

Reference: Wikipedia

An example of external sorting is the external merge sort algorithm, which sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:

  1. Read 100 MB of the data in main memory and sort by some conventional method, like quicksort.
  2. Write the sorted data to disk.
  3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks, which now need to be merged into one single output file.
  4. Read the first 10 MB of each sorted chunk into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
  5. Perform a 9-way merge and store the result in the output buffer. If the output buffer is full, write it to the final sorted file. If any of the 9 input buffers gets empty, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available.

The important point here is that, after you split the 900 MB data into 100 MB each and sort.. In the 9 sorted lists newly created, we apply a 9-way merge so that we take top 10 MB from each list.. which makes up to 90 MB. The free 10 MB in ram will be used as output write buffer. This 90 MB is merged in-place with merge algorithm.

Code

We will see this code in next article when we do insertion sort with merge operation