Question
Asked 11th Nov, 2013

What is an efficient way of reading a huge text file (a 50 GB file) and processing it in C++?

Reading it line by line takes too much time, and I don't have enough RAM to load the entire file at once. So how can I read the file in chunks, or is there another efficient way? Is there a Linux header that makes this simpler?

All Answers (6)

11th Nov, 2013
Simon Schröder
delta h
Hi Nithin! To handle large text files I suggest you take a look at memory-mapped files. Each operating system provides functions to map chunks of a file into a memory region, and you can always change the mapping to a different part of the file later on.
I don't know the direct way to do this in Linux (there should be a specific header for it). Instead, I once used Boost for this: Boost.Interprocess has support for memory-mapped files.
I copied the corresponding files from Boost to my project because I didn't expect users of the software to have Boost libraries installed. Because of the Boost license you can use those files in any project you like, even commercial ones.
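Since the question asks for a Linux header: the POSIX route is mmap() from <sys/mman.h>. Below is a minimal sketch (not Simon's code) that maps the file one window at a time and scans it; the file name, the 256 MiB window size, and the newline-counting loop are placeholders.

```cpp
// Minimal sketch: map a huge file one window at a time with POSIX mmap()
// and scan each window. "input.txt", the window size, and the
// newline-counting loop are placeholders.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("input.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    const size_t window = 256UL * 1024 * 1024;   // 256 MiB, a multiple of the page size
    size_t line_count = 0;

    for (off_t offset = 0; offset < st.st_size; offset += window) {
        off_t remaining = st.st_size - offset;
        size_t len = remaining < (off_t)window ? (size_t)remaining : window;

        char* data = (char*)mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, offset);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(data, len, MADV_SEQUENTIAL);      // hint: sequential access

        for (size_t i = 0; i < len; ++i)          // placeholder processing
            if (data[i] == '\n') ++line_count;

        munmap(data, len);
    }
    printf("lines: %zu\n", line_count);
    close(fd);
    return 0;
}
```

Note that a record which straddles a window boundary is split across two mappings; counting newlines doesn't care, but a real parser would either overlap the windows slightly or re-map starting from the last complete line.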
If your data is purely numerical, I also suggest converting it to binary on the machine where you are using it (note that binary formats are not necessarily portable between different machines). From my experience, such files usually shrink to about one third of their size (depending on the number of significant digits stored). If you write a dedicated converter from text to binary, you only need to keep one value in RAM at a time. Using a binary format will later minimize the time spent reading in the data, since disk access is very slow.
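As an illustration of the converter mentioned above, here is a minimal sketch that streams whitespace-separated doubles and writes them as raw binary, holding only one value in RAM at a time. The file names are placeholders, and as noted the output is not portable between machines.

```cpp
// Minimal one-pass text-to-binary converter: only one value is held in
// memory at a time. Assumes whitespace-separated doubles; file names are
// placeholders; the output is not portable between machines with a
// different endianness or floating-point layout.
#include <fstream>

int main() {
    std::ifstream in("values.txt");
    std::ofstream out("values.bin", std::ios::binary);
    double v;
    while (in >> v)
        out.write(reinterpret_cast<const char*>(&v), sizeof v);
    return (in.eof() && out.good()) ? 0 : 1;
}
```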
4 Recommendations
13th Nov, 2013
Saurabh Singh
Thomson Reuters
If you are looking to serialize text data to binary, consider Google's protobuf, Thrift, or Avro.
If you want fast compression and decompression, use Google's Snappy.
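For the compression suggestion, a minimal sketch of the Snappy C++ API (assuming the library is installed and linked with -lsnappy); the 1 MiB sample block stands in for a chunk read from the large file.

```cpp
// Minimal sketch of the Snappy C++ API (assumes the library is available;
// link with -lsnappy). The 1 MiB sample block stands in for a chunk of the
// large file.
#include <snappy.h>
#include <cassert>
#include <iostream>
#include <string>

int main() {
    std::string block(1 << 20, 'x');              // placeholder 1 MiB chunk
    std::string compressed, restored;

    snappy::Compress(block.data(), block.size(), &compressed);
    snappy::Uncompress(compressed.data(), compressed.size(), &restored);

    assert(restored == block);
    std::cout << block.size() << " -> " << compressed.size() << " bytes\n";
    return 0;
}
```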
14th Nov, 2013
Dmitry Turyskalievich Madigozhin
Joint Institute for Nuclear Research
I would first divide the input text file into parts (say, 50-100) and then run a separate process for each part in parallel on a PC farm like the CERN one; I think many institutions have similar facilities. The partial results then have to be merged in some way; for example, histograms have to be summed.
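A single-machine analogue of this idea (an adaptation, not the batch-farm setup described above): split the file into byte ranges, let one thread process each range, and sum the per-thread histograms at the end. The file name, thread count, and the line-length histogram are placeholders.

```cpp
// Single-machine sketch of the split-and-merge idea: each thread scans its
// own byte range of the file and fills a private histogram (here, of line
// lengths); the partial histograms are summed at the end. "input.txt" and
// the per-range details are placeholders, not the original batch-farm setup.
#include <array>
#include <fstream>
#include <functional>
#include <string>
#include <thread>
#include <vector>

static const size_t kBuckets = 128;

void process_range(const std::string& path, std::streamoff begin,
                   std::streamoff end, std::array<long, kBuckets>& hist) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(begin);
    std::string line;
    // A range that starts mid-line skips that fragment: the previous worker
    // finishes the line it started, even if it crosses its nominal end.
    if (begin != 0) std::getline(in, line);
    for (;;) {
        std::streamoff pos = in.tellg();
        if (pos < 0 || pos > end) break;          // past this worker's range
        if (!std::getline(in, line)) break;       // end of file
        size_t b = line.size() < kBuckets ? line.size() : kBuckets - 1;
        ++hist[b];
    }
}

int main() {
    const std::string path = "input.txt";
    const unsigned n = std::thread::hardware_concurrency();

    std::ifstream probe(path, std::ios::ate | std::ios::binary);
    const std::streamoff size = probe.tellg();

    std::vector<std::array<long, kBuckets>> partial(n);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i) {
        partial[i].fill(0);
        std::streamoff b = size * i / n, e = size * (i + 1) / n;
        workers.emplace_back(process_range, std::cref(path), b, e,
                             std::ref(partial[i]));
    }
    for (auto& t : workers) t.join();

    std::array<long, kBuckets> total{};           // merge step: sum the histograms
    for (const auto& h : partial)
        for (size_t i = 0; i < kBuckets; ++i) total[i] += h[i];
    return 0;
}
```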
15th Nov, 2013
V. T. Toth
N/A
In all likelihood (especially if you are using standard buffered I/O, e.g., fscanf() or, especially, fread(), and you read the data in reasonably large chunks, at least several kilobytes at once) you are already reading from the disk as fast as physically possible, since the actual read operations are optimized by the operating system. Of course, if your processing takes too much time because of inefficient coding (or simply because there is too much to do!), then it's not an I/O issue after all. If your overhead is a processing overhead, it might help to do the reading in a separate thread, reading as rapidly as possible, and use one or more additional threads to do the actual processing. However, the extra coding (e.g., thread synchronization) and testing required to implement this may not be worth the trouble.
The bottom line, though, is that reading 50 GB will take a fair bit of time no matter how you do it. With a SATA drive supporting 6 Gbit/s, assuming zero overhead and no time wasted on seeks (e.g., you are using a solid-state drive), you could theoretically read 50 GB in about 3.5 minutes. With overhead (including the overhead of SATA encoding), it's more like 5 minutes. With a magnetic hard drive, especially if the file is not contiguous, it takes longer. If the operating system is doing other things in the meantime (which may also involve accessing the drive), you'll probably get closer to the 10-minute mark (still assuming nearly ideal circumstances on a system that's not doing much else). With a first-generation SATA drive (1.5 Gbit/s), or if you are reading the file over a Gigabit Ethernet connection, it will take at least half an hour.
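A minimal sketch of the separate-reader-thread idea: one thread reads fixed-size chunks as fast as it can while another consumes them through a bounded queue. The chunk size, queue depth, file name, and the newline counting that stands in for real processing are all placeholders.

```cpp
// Minimal sketch of a reader thread feeding a worker thread through a
// bounded queue. Chunk size, queue depth, "input.txt", and the newline
// counting are placeholders for the real processing.
#include <condition_variable>
#include <fstream>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main() {
    const size_t chunk_size = 4 * 1024 * 1024;    // 4 MiB per read
    const size_t max_queued = 8;                  // bound memory use

    std::queue<std::vector<char>> queue;
    std::mutex m;
    std::condition_variable not_empty, not_full;
    bool done = false;

    std::thread reader([&] {
        std::ifstream in("input.txt", std::ios::binary);
        while (in) {
            std::vector<char> chunk(chunk_size);
            in.read(chunk.data(), chunk.size());
            chunk.resize(static_cast<size_t>(in.gcount()));
            if (chunk.empty()) break;
            std::unique_lock<std::mutex> lock(m);
            not_full.wait(lock, [&] { return queue.size() < max_queued; });
            queue.push(std::move(chunk));
            not_empty.notify_one();
        }
        std::lock_guard<std::mutex> lock(m);
        done = true;
        not_empty.notify_one();
    });

    long newlines = 0;                            // stand-in for real work
    for (;;) {
        std::vector<char> chunk;
        {
            std::unique_lock<std::mutex> lock(m);
            not_empty.wait(lock, [&] { return !queue.empty() || done; });
            if (queue.empty()) break;             // reader finished and queue drained
            chunk = std::move(queue.front());
            queue.pop();
            not_full.notify_one();
        }
        for (char c : chunk)
            if (c == '\n') ++newlines;
    }
    reader.join();
    std::cout << "lines: " << newlines << "\n";
    return 0;
}
```

This only pays off when the processing is the bottleneck, as the answer says; if the disk is already saturated, the extra thread just adds synchronization overhead.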
2 Recommendations
24th Jul, 2014
Exequiel Manuel Sepúlveda
University of Adelaide
I highly recommend the HDF5 format/library. If you transform your data to HDF5, you can efficiently read it in chunks using the HDF5 library for C/C++.
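A minimal sketch of chunked reading with the HDF5 C++ API, assuming the data has already been converted to a 1-D dataset of doubles; the file name "data.h5", the dataset name "values", and the chunk length are placeholders.

```cpp
// Minimal sketch of chunked reading with the HDF5 C++ API (link with
// -lhdf5_cpp -lhdf5). Assumes the text data has already been converted to a
// 1-D dataset of doubles; "data.h5", "values", and the chunk length are
// placeholders, and the summation stands in for real processing.
#include "H5Cpp.h"
#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    H5::H5File file("data.h5", H5F_ACC_RDONLY);
    H5::DataSet dataset = file.openDataSet("values");
    H5::DataSpace filespace = dataset.getSpace();

    hsize_t total = 0;
    filespace.getSimpleExtentDims(&total);        // number of elements in the dataset

    const hsize_t chunk = 1 << 20;                // read 1M values at a time
    std::vector<double> buffer(chunk);
    double sum = 0.0;                             // placeholder processing

    for (hsize_t offset = 0; offset < total; offset += chunk) {
        hsize_t count = std::min(chunk, total - offset);
        filespace.selectHyperslab(H5S_SELECT_SET, &count, &offset);
        H5::DataSpace memspace(1, &count);
        dataset.read(buffer.data(), H5::PredType::NATIVE_DOUBLE, memspace, filespace);
        for (hsize_t i = 0; i < count; ++i) sum += buffer[i];
    }
    std::cout << "sum = " << sum << "\n";
    return 0;
}
```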
