Sorting a file with many already-sorted chunks



Ilya
2008-Aug-21, 02:58 PM
What is the best sorting algorithm if the file consists of many chunks of various sizes, where each chunk is already ordered but the chunks themselves are out of order? Example:

38013
38014
38015
33016
33017
33018
44034
44035
44036
44037
44038
44039
44040
44041
44042
44043
44044
44045
44046
44047
34892
34893
34894
34895
34896
34897
34898
34899
34900
34901
34902
34903
34904
34905
34906
38016
38017
38018
38019
38020

Here we have five already-ordered chunks, of lengths 3, 3, 14, 15 and 5. I would think QuickSort is the best in such a situation, but I am not sure.

jokergirl
2008-Aug-21, 03:16 PM
What's the average size of the chunks and the average file size (in comparison to the chunk size)?
And do the chunks overlap, or is every individual entry unique?

I'd have said quicksort initially too, but if the chunk size is non-trivial and there is no overlap, it might pay to identify the chunks first and then sort them by their first item.
Quicksort would break up the chunks really fast, so you'd get little benefit from them...

;)

tdvance
2008-Aug-21, 03:32 PM
In a CS 101 course, I was told that mergesort works well for almost-sorted files, whatever that means precisely. At least, it's something to try. Wikipedia, I'm sure, has that one--it covers several popular sorting algorithms.

Quicksort can be good or bad for this situation depending on the strategy for choosing the pivot point. I do recall that one common way of choosing the pivot gives worst-case O(n^2) behavior on already-sorted files, and presumably many nearly-sorted files would do badly as well. If the pivot is chosen well, I bet quicksort on nearly-sorted files is competitive with about anything else.

I assume your sample is not representative--if the file is really that small, overhead matters more than asymptotic behavior and even bubble sort would work fine. What I've said above, and in fact most analysis of sorting algorithms, applies to larger sets--1,000, 1,000,000, or even more items.

And if what you're sorting really is machine-sized numbers, roughly uniformly distributed, then an "index sort" would be the way to go (or a "bucket sort" combined with one of the sorts mentioned above), or you could even skip the sort and use a hash table.

I just read jokergirl's post--quicksort breaking up the chunks. I didn't think of that, but I'm not sure that a strategy to identify them wouldn't take longer than quicksort breaking them up and then not actually moving anything.
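
A rough sketch of the bucket-sort idea from a couple of paragraphs up, assuming the keys are plain integers that really are roughly uniformly distributed; the function name and bucket count here are just illustrative, nothing standard:

def bucket_sort(values, n_buckets=1024):
    # Spread the values over buckets by range, sort each (small) bucket,
    # and concatenate.  Only worthwhile if the keys are roughly uniform.
    if not values:
        return []
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1   # guard against all-equal input
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        i = min(int((v - lo) / width), n_buckets - 1)
        buckets[i].append(v)
    out = []
    for bucket in buckets:
        bucket.sort()
        out.extend(bucket)
    return out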

tdvance
2008-Aug-21, 03:34 PM
Actually, I misread--each chunk is out of order relative to the other chunks. It may be best to do a linear scan to find the chunk boundaries, then sort the chunks by, say, smallest element, and then do a mergesort or quicksort to finish the job (the latter only if what you say is approximately rather than exactly true of the data).
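
Roughly what that looks like, as a minimal sketch; it assumes the keys are plain integers and that the chunks don't overlap (as in the example data), so concatenating the reordered chunks is enough. The names are just illustrative.

def sort_chunked(values):
    if not values:
        return []

    # Linear scan: a new chunk starts wherever the sequence stops increasing.
    chunks = []
    start = 0
    for i in range(1, len(values)):
        if values[i] < values[i - 1]:
            chunks.append(values[start:i])
            start = i
    chunks.append(values[start:])

    # Sort the chunks by their first (and therefore smallest) element, then
    # concatenate.  This is only correct if the chunks' ranges don't overlap;
    # otherwise a final merge pass is still needed, as described above.
    chunks.sort(key=lambda chunk: chunk[0])
    return [v for chunk in chunks for v in chunk]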

jokergirl
2008-Aug-21, 04:37 PM
That's why I said it depends on how big the chunks are, and how big they are relative to the file size. If they're small enough to actually count as almost-random, there's no point in identifying them.

;)

Ilya
2008-Aug-21, 05:59 PM
Actually, I misread--each chunk is out of order relative to the other chunks. It may be best to do a linear scan to find the chunk boundaries, then sort the chunks by, say, smallest element, and then do a mergesort

That's what I ended up doing. A typical file for my application has 25,000-100,000 entries, and most contiguous chunks are in the 15-50 entry range.

Moose
2008-Aug-21, 06:12 PM
What is the best sorting algorithm if the file consists of many chunks of various sizes, where each chunk is already ordered but the chunks themselves are out of order? Example:

Here we have five already-ordered chunks, of lengths 3, 3, 14, 15 and 5. I would think QuickSort is the best in such a situation, but I am not sure.

No. Mergesort is as good as quicksort in the average case (it comes down to how efficiently you code the steps), but it is much, much better in the case you're considering. Quicksort degrades to O(n^2) in the worst case, and with a naive pivot choice an almost-sorted list is exactly that worst case. Mergesort's worst case is no worse than its average: still O(n log n).

I wouldn't bother trying to find almost-sorted chunks. The programming and testing time probably isn't worth it for the data set you're working with. If n is going to be around 100,000 items, and RAM isn't much of an issue, Mergesort will do the job just fine.

Wiki will have the algorithm, but if you're having trouble with it, let me know and I'll try to help.
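
For what it's worth, here's a bare-bones sketch of the textbook top-down version. It returns a new list rather than sorting in place, so it uses O(n) extra memory.

def merge_sort(items):
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged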

Moose
2008-Aug-21, 06:21 PM
Speaking of which, Ilya: depending on what else (if anything) you're trying to do with that file, it might be sufficient to just pass it through an already-implemented sorter, like a Windows port of the Unix sort command.

jokergirl
2008-Aug-21, 06:47 PM
I don't think Mergesort does exactly what you want it to do.
In this case, finding the end of a chunk should be easy: it's wherever the next value fails y == x+1. The drawback is that you don't do the sorting in the original array, which means it needs more memory. That might be fine, though, since you might not want to overwrite the original anyway.

The BF suggests looking into how defrags are done, similar application :)

;)

Ilya
2008-Aug-21, 07:06 PM
Speaking of which, Ilya: depending on what else (if anything) you're trying to do with that file, it might be sufficient to just pass it through an already-implemented sorter, like a Windows port of the Unix sort command.

That was never an option because a) the file is binary, b) it has no carriage returns, and c) the record key (the numbers 38013, 38014, etc. in the OP) is not even a distinct entity in each record -- it has to be calculated from several bytes scattered within the record.
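
Something along these lines, except that the record size and key-byte offsets below are made up for illustration; the real layout is specific to the file format.

RECORD_SIZE = 64            # made-up record length, for illustration only
KEY_OFFSETS = (3, 17, 40)   # made-up positions of the key bytes

def record_key(record):
    # Combine the scattered key bytes into one integer (high byte first).
    key = 0
    for offset in KEY_OFFSETS:
        key = (key << 8) | record[offset]
    return key

def sort_records(path_in, path_out):
    with open(path_in, "rb") as f:
        data = f.read()
    records = [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]
    records.sort(key=record_key)
    with open(path_out, "wb") as f:
        f.write(b"".join(records))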

In any case, I took care of it.

Moose
2008-Aug-21, 07:14 PM
Okie.

Hang Nail
2008-Aug-21, 07:27 PM
Quick, Merge, and Heap are all n-log-n in the average case, although the constant factor is not necessarily the same.

Quick is N-squared in the worst case - the other two are n-log-n in the average case. I like Heap, because, unlike Merge, it requires no extra memory. None of these really do much to exploit the somewhat-sequenced nature of the existing list, but merge and heap are good sort routines that are always pretty fast, no matter what.
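
To illustrate the no-extra-memory point, a minimal in-place heapsort sketch: everything happens inside the one array, at the cost of not being stable.

def heap_sort(a):
    n = len(a)

    def sift_down(root, end):
        # Push a[root] down until the max-heap property holds in a[:end].
        while 2 * root + 1 < end:
            child = 2 * root + 1
            if child + 1 < end and a[child] < a[child + 1]:
                child += 1
            if a[root] >= a[child]:
                return
            a[root], a[child] = a[child], a[root]
            root = child

    # Build a max-heap, then repeatedly swap the maximum to the end.
    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n)
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(0, end)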

Hang Nail
2008-Aug-21, 07:29 PM
Oh spite. I meant the other two are n-log-n in the worst case, and I didn't see that you already solved the problem :(

Moose
2008-Aug-21, 07:50 PM
Quick is N-squared in the worst case - the other two are n-log-n in the average case. I like Heap, because, unlike Merge, it requires no extra memory.

MergeSort is not that bad if you implement the routine to sort a queue of structures, or just an index, depending on the actual size of your data set. Either way, it's usually not necessary to duplicate your data (except once, in the worst case, if you're rewriting a file); you're just adding one pointer per record.

(Of course, if you're going to bother with building an index, you'd just build one of the b-tree types, then traverse it, using the direct access mode to chop up the file afterwards. Problem solved.)
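
As a rough sketch of the "sort an index, not the data" idea (get_key here just stands in for however the key comes out of a record):

def sorted_order(records, get_key):
    # One integer per record; only the index moves, never the records.
    index = list(range(len(records)))
    index.sort(key=lambda i: get_key(records[i]))
    return index

# Then write the records back out in sorted order without duplicating them:
#   for i in sorted_order(records, record_key):
#       output.write(records[i])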

HenrikOlsen
2008-Aug-21, 08:55 PM
A very important point when deciding on the optimal algorithm is whether the chunks overlap, so that you have to interleave the values from more than one chunk, or whether they are exclusive, so that you basically just have to reorder the chunks based on their first elements.
This wasn't stated in the problem, so it's entirely possible the suggestions given so far were not the optimal solution.

This is noted purely out of curiosity, as I did see that the practical version of the problem has already been dealt with.

mugaliens
2008-Aug-21, 09:15 PM
You're looking for an algorithm that compares element n with element n+1. As long as element n+1 > element n, it figures it's still working through a pre-sorted chunk. When it finds an element n+1 < element n, it marks that as the start of a new chunk, noting the two bordering values in a table. It continues until it finds a value that falls between any previously recorded pair of bordering values, at which point it marks that section. Once the marking is complete, it simply sorts the marked sections, rather than each and every number.

This approach is used to sort data created by different batch jobs, since the results of each batch job are contiguous but the batches may arrive out of order.

jfribrg
2008-Aug-21, 09:35 PM
Quicksort is probably not a good choice because it makes no use of the fact that the list is partially sorted. The same goes for heapsort: even though its worst case is decent, it probably isn't too good for this particular list. If you can easily break up the list, then mergesort is probably the way to go. If you are willing to write a special-purpose sort, you can probably do far better:

Treat each chunk as a single data element, where the value to compare is the smallest value for that chunk. For the list given above, the following 5 chunks are created:

38013-38015
33016-33018
44034-44047
34892-34906
38016-38020

If these chunks are all mutually exclusive (and the ones in the sample data above are), then you can sort them as if each chunk were a single record, and you are done. For the 5 chunks listed above, even a bubble sort would outperform heapsort or quicksort; for a larger list of chunks, one of the other sorts would be acceptable. From an implementation standpoint, you would not sort the chunks themselves, but rather a list of pointers (or references) to the chunks, and copy the chunk data only after the sorting is complete.

The benefit of this approach is that we compare items within the chunks only once (we do that at the beginning, just to identify the chunks), and we copy the data only once. There is the space overhead of creating the list of chunk references, but this is small, assuming there are only a few chunks, most of which have many data items.

If the chunks are not mutually exclusive, then you need an extra step to finish sorting the nearly sorted list. For instance, the following list has two chunks that may have data in the range 44034-44040:

38013-38015
33016-33018
44034-44047
34892-34906
38016-38020
44030-44040

To deal with this situation, after doing the sort I described above, you could do a mergesort or shellsort on the two chunks to finish things off. If there are numerous chunks with this issue, then a shellsort over all of the affected chunks may be cost-effective.
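
A sketch of the overlapping case using a k-way merge instead of a finishing sort: identify the runs as earlier in the thread, then merge them all at once, so each element is copied once and comparisons only happen between the current heads of the chunks. heapq.merge is the Python library routine for merging already-sorted sequences; the run-finding helper is the same idea as in the earlier sketch.

import heapq

def find_chunks(values):
    # Maximal ascending runs, in order of appearance.
    chunks = []
    start = 0
    for i in range(1, len(values)):
        if values[i] < values[i - 1]:
            chunks.append(values[start:i])
            start = i
    if values:
        chunks.append(values[start:])
    return chunks

def sort_by_merging(values):
    # An n-way merge of the already-sorted chunks; works whether or not the
    # chunks' ranges overlap.
    return list(heapq.merge(*find_chunks(values)))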