Question
Asked 21st Dec, 2014
Deleted profile

How can I sort a huge file without using a large amount of memory?

I need C# code or an algorithm for sorting a file that contains student records. I don't want to use all the memory for sorting this file, and I want to sort it record by record. Do you know of similar code or an algorithm?

All Answers (17)

22nd Dec, 2014
Nisansa de Silva
University of Moratuwa
Wouldn't a simple merge sort with threads (limited to a pool size) fulfill your requirement? 
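If the file does fit in memory, a minimal sketch of that idea in C# could use PLINQ, which sorts with a bounded worker pool. The file names and degree of parallelism below are made up for illustration, and note that this only speeds up the sort; it does not reduce memory use.

// A minimal in-memory sketch, assuming one student record per line and
// ordering by the whole line. The sort runs on a bounded pool of workers,
// but the entire file is still loaded, so it must fit in RAM.
using System;
using System.IO;
using System.Linq;

class ParallelSortSketch
{
    static void Main()
    {
        string[] records = File.ReadAllLines("students.txt");   // hypothetical input file

        string[] sorted = records
            .AsParallel()
            .WithDegreeOfParallelism(4)                          // the "pool size" limit
            .OrderBy(r => r, StringComparer.Ordinal)
            .ToArray();

        File.WriteAllLines("students_sorted.txt", sorted);
    }
}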
2 Recommendations
22nd Dec, 2014
Javed Khan
Aga Khan University, Pakistan
Send me the file at my email address, javedak09@gmail.com, and I will write the code for you.
1 Recommendation
22nd Dec, 2014
Sebastian J. I. Herzig
Microsoft
If you are worried about memory consumption and can subdivide your data a priori, then you may want to write an algorithm that first partitions your larger file into smaller files (which can be done in a relatively memory-efficient way using a StreamReader) and then sorts the individual files.
A second option is to use an external sort algorithm. External sort algorithms keep part of the data to be sorted on an external medium - such as a hard disk - and part in memory. There are numerous implementations of such algorithms in C# out there. Here's one based on merge sort: http://www.splinter.com.au/sorting-enormous-files-using-a-c-external-mer/
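As a rough sketch of the partitioning step, assuming one record per line and ordering by the whole line (the chunk size and file names are illustrative, not from any library), something like the following produces the sorted chunk files; a later answer in this thread describes the merge step that recombines them:

// A sketch of the partitioning step, assuming one student record per line
// and ordering by the whole line. Chunk size and file names are illustrative.
using System;
using System.Collections.Generic;
using System.IO;

class ChunkSplitter
{
    // Streams bigFile, buffers up to maxRecordsPerChunk lines, sorts each
    // batch in memory, and writes it out as chunk_0.tmp, chunk_1.tmp, ...
    public static List<string> SplitIntoSortedChunks(string bigFile, int maxRecordsPerChunk)
    {
        var chunkFiles = new List<string>();
        var buffer = new List<string>(maxRecordsPerChunk);

        using (var reader = new StreamReader(bigFile))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                buffer.Add(line);
                if (buffer.Count == maxRecordsPerChunk)
                    FlushChunk(buffer, chunkFiles);
            }
        }
        if (buffer.Count > 0)
            FlushChunk(buffer, chunkFiles);

        return chunkFiles;
    }

    static void FlushChunk(List<string> buffer, List<string> chunkFiles)
    {
        buffer.Sort(StringComparer.Ordinal);                 // in-memory sort of one chunk
        string name = "chunk_" + chunkFiles.Count + ".tmp";
        File.WriteAllLines(name, buffer);
        chunkFiles.Add(name);
        buffer.Clear();
    }
}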
3 Recommendations
22nd Dec, 2014
Hans Henrik Stærfeldt
Intomics A/S
linux> man sort
(The sort implementation already uses clever techniques for exactly this kind of task; see e.g. http://vkundeti.blogspot.dk/2008/03/tech-algorithmic-details-of-unix-sort.html )
4 Recommendations
23rd Dec, 2014
Baltasar García Perez-Schofield
University of Vigo
I think the best strategy would be to divide the data into chunks. For example, say you only want to spend 2 MB of memory. You would read records until that memory is full, sort them, and write them to a data_part.xxx file. When you are done, you only need to open all the .xxx files, read one record at a time from each, and write them in order to the final file. That is how it was done in Cobol, at least.
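A sketch of that merge step in C#, assuming each data_part.xxx file already holds one sorted record per line (the class and method names are mine, not from any library):

// A sketch of the merge phase: keep one open reader per chunk plus its
// current front record, and repeatedly emit the smallest of those records.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ChunkMerger
{
    public static void Merge(IEnumerable<string> chunkFiles, string outputFile)
    {
        var readers = chunkFiles.Select(f => new StreamReader(f)).ToList();
        var current = readers.Select(r => r.ReadLine()).ToList();

        using (var writer = new StreamWriter(outputFile))
        {
            while (true)
            {
                // Find the chunk whose current record sorts first.
                int min = -1;
                for (int i = 0; i < readers.Count; i++)
                {
                    if (current[i] == null) continue;        // this chunk is exhausted
                    if (min == -1 || string.CompareOrdinal(current[i], current[min]) < 0)
                        min = i;
                }
                if (min == -1) break;                        // all chunks exhausted

                writer.WriteLine(current[min]);              // emit the smallest record
                current[min] = readers[min].ReadLine();      // advance that chunk
            }
        }

        foreach (var r in readers) r.Dispose();
    }
}

With many chunk files, replacing the linear scan with a priority queue (min-heap) over the current records keeps the per-record cost logarithmic in the number of chunks.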
2 Recommendations
23rd Dec, 2014
Ezekiel Gordon
KEMRI-Wellcome Trust Research Programme
Are you in a Linux environment? It can be done with a one-line command in an Ubuntu terminal, and it definitely will not eat up your memory, however large the file is. Simply use the cat, grep, and sort Unix utilities.
1 Recommendation
Dear Gordon, please write the command for me, with an example.
23rd Dec, 2014
Ezekiel Gordon
KEMRI-Wellcome Trust Research Programme
cat /home/geoinformatics/Desktop/Portal_research/geoserver_log/*.log.* | grep "BBOX" | awk '{print}' | sort
The above one-line command in the terminal does the following:
  • cat - reads every file at that path whose name matches *.log.*, e.g. server.log.1 or server.log.2
  • grep - searches line by line for a string, e.g. "BBOX" in my case
  • awk - prints the matching lines
  • sort - sorts the result, which you can redirect into a single file if you were looking at numerous files
1 Recommendation
25th Dec, 2014
Mircea Dragan
IBM
For those giving examples from Linux: did you read the requirements? The author needs code in C#, and that has nothing to do with Linux.
1 Recommendation
29th Dec, 2014
Jonathan Schattke
Missouri University of Science and Technology
The general method is to use a routine that uses drive space. Break the large file into subfiles, each small enough to sort in memory (i.e., fill an internal array with n records, using whatever amount of memory you please for the array, sort it with quicksort, then write the sorted array). Then use an interleaving routine to recombine: open all the subfiles, read one record from each, write the record that sorts first to the output and advance that input file, and repeat until all files are exhausted.
Total disk usage is then triple your initial file if you keep all original files until the process is done, double if you delete the initial file after the subfiles are written.
1st Jan, 2015
Claudio Bisegni
INFN - Istituto Nazionale di Fisica Nucleare
I would try using an external index file if the goal is only to read the content in sorted order. In any case, using the index you can simply copy the elements into a new file in sorted order and then delete the unordered one.
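A sketch of that index idea, assuming UTF-8, newline-delimited records keyed by the whole line; only (key, byte offset) pairs are held in memory, so this helps when the keys are much smaller than the records (all names here are illustrative):

// Pass 1 records each line's key and byte offset, the index is sorted in
// memory, and pass 2 copies the records into a new file in key order.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class IndexSorter
{
    public static void SortViaIndex(string inputFile, string outputFile)
    {
        var index = BuildIndex(inputFile);
        index.Sort((a, b) => string.CompareOrdinal(a.Key, b.Key));  // only keys in memory

        using (var fs = File.OpenRead(inputFile))
        using (var reader = new StreamReader(fs))
        using (var writer = new StreamWriter(outputFile))
        {
            foreach (var entry in index)
            {
                fs.Seek(entry.Value, SeekOrigin.Begin);   // jump to the record...
                reader.DiscardBufferedData();             // ...and resync the reader
                writer.WriteLine(reader.ReadLine());
            }
        }
    }

    // Scans the file byte by byte so the offset of each line start is exact.
    static List<KeyValuePair<string, long>> BuildIndex(string path)
    {
        var index = new List<KeyValuePair<string, long>>();
        using (var fs = File.OpenRead(path))
        {
            var lineBytes = new List<byte>();
            long offset = 0, lineStart = 0;
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                offset++;
                if (b == '\n')
                {
                    index.Add(MakeEntry(lineBytes, lineStart));
                    lineBytes.Clear();
                    lineStart = offset;
                }
                else lineBytes.Add((byte)b);
            }
            if (lineBytes.Count > 0)
                index.Add(MakeEntry(lineBytes, lineStart));
        }
        return index;
    }

    static KeyValuePair<string, long> MakeEntry(List<byte> lineBytes, long lineStart)
    {
        string key = Encoding.UTF8.GetString(lineBytes.ToArray()).TrimEnd('\r');
        return new KeyValuePair<string, long>(key, lineStart);
    }
}

Since this sketch keys on the whole line, pass 2 could just write the keys directly; the seek-and-copy shown is what you would need when the key is only one field of a larger record.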
28th Feb, 2015
Saurabh Gayali
Institute of Genomics and Integrative Biology
Why not try MySQL?
11th Jun, 2015
Orlando Eduardo Martínez
University of Havana
The merge sort answers are a good solution to this problem, but the most commonly recommended approach for this kind of problem is a B-tree data structure. Why? With merge sort you can sort the elements, but only that: if you then need to add or delete records, maintaining the order becomes complex and far from optimal. With a B-tree, the order is maintained as you add and delete data.
This is the way many database systems store, maintain, and order data.
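There is no on-disk B-tree in the .NET base library, but as a toy in-memory analogue, SortedDictionary (a balanced search tree) shows the property being described: records stay enumerable in key order through inserts and deletes, at logarithmic cost per update. The sample records below are made up.

// A toy in-memory analogue of the B-tree property: inserts and deletes
// keep the records enumerable in key order without any re-sorting step.
// (A database B-tree offers the same behaviour on disk.)
using System;
using System.Collections.Generic;

class OrderedRecords
{
    static void Main()
    {
        var byId = new SortedDictionary<int, string>();

        byId[42] = "Ada Lovelace";
        byId[7]  = "Alan Turing";
        byId[19] = "Grace Hopper";
        byId.Remove(7);                        // delete without re-sorting
        byId[3]  = "Edsger Dijkstra";          // insert without re-sorting

        foreach (var kv in byId)               // always iterates in key order
            Console.WriteLine(kv.Key + "\t" + kv.Value);
        // Prints keys 3, 19, 42: order maintained through all the updates.
    }
}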
1 Recommendation
22nd Dec, 2015
Petr Voborník
University of Hradec Králové
You can use an external sort method, e.g. one based on the merge sort algorithm.
