What is an efficient way to read a huge text file (a 50 GB file) and process it in C++?
Reading it line by line takes too long, and I don't have enough RAM to load the entire file into memory. So how can I read the file in chunks, or is there some other efficient approach? Is there a Linux header that makes this simpler?
Hi Nithin! To handle large text files I suggest you take a look at memory-mapped files. Each operating system provides functions to map chunks of a file into a memory region, and you can always change the mapping to a different part of the file later on.
I don't remember the direct way to do this on Linux (there is a specific header for it; on POSIX systems the mmap() call lives in <sys/mman.h>). Instead, I once used Boost for this: Boost.Interprocess has support for memory-mapped files.
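Something along these lines should work (a minimal sketch of my own, not code from the original post; the file name, the 64 MiB window size, and the empty process_chunk() are placeholders):

```cpp
// Map a huge file in fixed-size windows instead of loading it all into RAM.
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <filesystem>

namespace bip = boost::interprocess;

void process_chunk(const char* data, std::size_t size) {
    // Placeholder: scan, parse, count lines, accumulate statistics, etc.
}

int main() {
    const char* path = "huge.txt";                          // hypothetical input file
    const std::uintmax_t file_size = std::filesystem::file_size(path);
    bip::file_mapping file(path, bip::read_only);

    const std::uintmax_t window = 64ull * 1024 * 1024;      // 64 MiB per mapping; offsets stay page-aligned
    for (std::uintmax_t offset = 0; offset < file_size; offset += window) {
        const std::size_t len =
            static_cast<std::size_t>(std::min(window, file_size - offset));
        bip::mapped_region region(file, bip::read_only, offset, len);
        process_chunk(static_cast<const char*>(region.get_address()), region.get_size());
        // The region is unmapped when it goes out of scope, so memory use stays bounded.
    }
}
```

One thing to watch out for: a text line can straddle two windows, so you either carry the incomplete tail over to the next mapping or overlap the windows slightly.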
I copied the corresponding files from Boost to my project because I didn't expect users of the software to have Boost libraries installed. Because of the Boost license you can use those files in any project you like, even commercial ones.
If your data is purely numerical, I also suggest converting it to binary on the machine where you will use it (note that binary formats are not necessarily portable between different machines). In my experience such files shrink to about one third of the text size (depending on the number of significant digits stored). If you write a dedicated text-to-binary converter, you only need to hold one value in RAM at a time; a sketch of such a converter follows below. Using a binary format later minimizes the time spent reading the data, since disk access is very slow.
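For illustration, a dedicated converter can be as simple as this sketch (my own example, assuming the values are doubles and the files are called data.txt and data.bin):

```cpp
// Stream one value at a time from text to raw binary; memory use stays constant.
#include <fstream>
#include <iostream>

int main() {
    std::ifstream in("data.txt");                    // hypothetical text input
    std::ofstream out("data.bin", std::ios::binary); // binary output
    if (!in || !out) {
        std::cerr << "could not open files\n";
        return 1;
    }

    double value;
    while (in >> value) {
        // Write the raw 8-byte representation; only one value is in RAM at a time.
        out.write(reinterpret_cast<const char*>(&value), sizeof(value));
    }
}
```

Reading the binary file back is then just a block read into an array of double, with no parsing involved.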
I would first divide the input text file into parts (say, 50-100) and then run a separate process for each part in parallel on a PC farm like the one at CERN. I think many institutions have similar facilities. The partial results then have to be merged in some way; for example, histograms have to be summed. A single-machine sketch of the same idea follows below.
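The same split-and-merge idea also works with threads on one machine. Here is a minimal sketch of my own (the file name and the per-part work, which is just counting lines here, are assumptions); the part boundaries are nudged to line starts so no line is split, and the partial counts are summed at the end:

```cpp
// Split a file into byte ranges, process them in parallel, merge the results.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Count lines in [begin, end); begin is assumed to sit at the start of a line.
std::uint64_t count_lines(const std::string& path, std::uint64_t begin, std::uint64_t end) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(begin));
    std::uint64_t count = 0;
    std::string line;
    while (in && static_cast<std::uint64_t>(in.tellg()) < end && std::getline(in, line))
        ++count;
    return count;
}

// Compute part boundaries, moved forward to the next line start.
std::vector<std::uint64_t> part_offsets(const std::string& path, unsigned parts) {
    const std::uint64_t size = std::filesystem::file_size(path);
    std::ifstream in(path, std::ios::binary);
    std::vector<std::uint64_t> offsets{0};
    for (unsigned i = 1; i < parts; ++i) {
        in.clear();
        in.seekg(static_cast<std::streamoff>(size * i / parts));
        std::string skipped;
        std::getline(in, skipped);                   // advance to the next line start
        const std::uint64_t next = in ? static_cast<std::uint64_t>(in.tellg()) : size;
        offsets.push_back(std::min(next, size));
    }
    offsets.push_back(size);
    return offsets;
}

int main() {
    const std::string path = "huge.txt";             // hypothetical input file
    const unsigned hw = std::thread::hardware_concurrency();
    const auto offsets = part_offsets(path, hw ? hw : 4);

    std::vector<std::uint64_t> partial(offsets.size() - 1, 0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i + 1 < offsets.size(); ++i)
        workers.emplace_back([&, i] { partial[i] = count_lines(path, offsets[i], offsets[i + 1]); });
    for (auto& t : workers) t.join();

    std::uint64_t total = 0;
    for (auto c : partial) total += c;               // merge the partial results
    std::cout << "lines: " << total << '\n';
}
```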
In all likelihood (especially if you are using standard buffered I/O such as fscanf(), or better yet fread(), and you read the data in reasonably large chunks, at least several kilobytes at a time) you are already reading from the disk about as fast as physically possible, since the actual read operations are optimized by the operating system. Of course, if your processing takes too much time because of inefficient coding (or simply because there is a lot to do!), then it isn't an I/O issue after all. If the overhead is on the processing side, it might help to do the reading in a separate thread, reading as rapidly as possible, and use one or more additional threads to do the actual processing; a rough sketch of that idea follows below. However, the extra coding (e.g., thread synchronization) and testing required to implement this may not be worth the trouble.
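For illustration, the reader-thread idea can look roughly like this (my own sketch, not the answerer's code; the file name, the 1 MiB block size, and the empty process() are assumptions):

```cpp
// One thread reads fixed-size blocks with fread() while another consumes them.
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<std::vector<char>> blocks;
std::mutex m;
std::condition_variable cv;
bool done = false;

void reader(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (f) {
        std::vector<char> buf(1 << 20);                  // 1 MiB per read
        std::size_t n;
        while ((n = std::fread(buf.data(), 1, buf.size(), f)) > 0) {
            std::lock_guard<std::mutex> lock(m);
            blocks.push(std::vector<char>(buf.begin(), buf.begin() + n));
            cv.notify_one();
        }
        std::fclose(f);
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_one();
}

void process(const std::vector<char>& block) {
    // Placeholder: parse lines, accumulate statistics, etc.
}

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !blocks.empty() || done; });
        if (blocks.empty()) return;                      // reader finished and queue drained
        std::vector<char> block = std::move(blocks.front());
        blocks.pop();
        lock.unlock();
        process(block);                                  // heavy work happens outside the lock
    }
}

int main() {
    std::thread r(reader, "huge.txt");                   // hypothetical file name
    std::thread w(worker);
    r.join();
    w.join();
}
```

A real implementation would probably want a bounded queue so the reader cannot run arbitrarily far ahead, and possibly several worker threads.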
The bottom line, though, is that reading 50 GB takes a fair amount of time no matter how you do it. A SATA link running at 6 Gbit/s carries at most roughly 600 MB/s of payload after encoding overhead, so even with zero seek time (e.g., on a solid-state drive) reading 50 GB takes at least about a minute and a half; with filesystem overhead and a drive that cannot sustain the full link rate, a few minutes is more realistic. With a magnetic hard drive, especially if the file is fragmented, it takes longer, and if the operating system is doing other things in the meantime (which may also involve accessing the drive), you will probably get closer to the ten-minute mark even on a system that isn't doing much else. With a first-generation SATA drive (1.5 Gbit/s), or if you are reading the file over a gigabit Ethernet connection, the transfer alone can easily take ten minutes or more.