Question
Asked 20th Dec, 2014

What is the maximum size of data that is supported by R-datamining?

I need the answer for the R-datamining tool. How much data does it support?

Most recent answer

17th Nov, 2019
Knut Jägersberg
University of Twente
The maximum length of a vector (or data frame) in R is still around 2 billion elements, a hard cap I hit some time ago: https://stackoverflow.com/questions/10640836/max-length-for-a-vector-in-r
Nonetheless, R is a great tool for analyzing medium-sized and big data:
You can either use Spark via SparkR or sparklyr and scale analyses written with R wrappers around Spark.
Another solution is disk.frame, which lets you manipulate data.tables as chunks written to and read from fst files on the hard disk.
Here the main limits are the space on the local hard drive, since disk.frame is currently not designed to work across machines (which could be realized via the future packages in one way or another), and the number of cores, since it runs in parallel.
disk.frame is the fastest and my favorite out-of-core data manipulation solution I have worked with so far. It matches the speed of column databases such as MonetDB or DuckDB, with the advantage that you can simply use the map function to apply arbitrary R functions to any disk.frame that fits on your hard drive.
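To make that concrete, here is a minimal sketch of the disk.frame workflow; the file name big_file.csv and the columns key and value are placeholders, and the exact set of supported verbs depends on the disk.frame version:

    library(disk.frame)
    library(dplyr)

    setup_disk.frame(workers = 4)           # chunk-wise work runs on parallel workers
    options(future.globals.maxSize = Inf)   # allow bigger objects to be shipped to the workers

    # split a large CSV into a folder of fst-compressed chunks on disk
    sales <- csv_to_disk.frame("big_file.csv", outdir = "sales.df")

    # dplyr verbs are applied chunk by chunk; collect() brings only the small result into RAM
    sales %>%
      filter(!is.na(value)) %>%
      group_by(key) %>%
      summarise(total = sum(value)) %>%
      collect()

    # or apply an arbitrary R function to every chunk (each chunk arrives as a data.table)
    collect(cmap(sales, ~ data.frame(rows = nrow(.x))))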
DuckDB is also a nice embedded database (the SQLite of analytics) for simple SQL tasks, and it is very fast on commodity hardware.
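A sketch of that route through the duckdb R package and DBI (file and column names are again placeholders; read_csv_auto is available in recent DuckDB releases):

    library(DBI)

    # DuckDB runs in-process and keeps the database in a single file, much like SQLite
    con <- dbConnect(duckdb::duckdb(), dbdir = "analytics.duckdb")

    # the CSV is scanned by DuckDB itself, so it never has to fit into R's memory at once
    dbExecute(con, "CREATE TABLE sales AS SELECT * FROM read_csv_auto('big_file.csv')")
    dbGetQuery(con, "SELECT key, SUM(value) AS total FROM sales GROUP BY key")

    dbDisconnect(con, shutdown = TRUE)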
R is fun for big data. sparklyr scales endlessly via Google Cloud autoscaling, but I prefer to stay local and natively in R if possible; often Spark is overkill, and running pipelines on clusters that could run fast on a single workstation is not green (see for example https://www.r-bloggers.com/disk-frame-is-epic/).
It is amazing what you can do with disk.frame on an ordinary laptop with a fast SSD.
The cloud is not necessary for medium-sized data, which ranges into the terabytes.
I heard that SparkR can run native R code too. I have not experimented with it, but I would expect it to run into the hard cap of 2 billion records as well; to my knowledge, only using Spark itself (not native R code in Spark), sparklyr, or disk.frame lets you go beyond that.
My current flow: 1. disk.frame; 2. if the data are too large for one machine, then sparklyr in Google Cloud, which automatically manages the Spark cluster (something I don't want to be busy with).
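For step 2, a minimal sparklyr sketch, shown against a local Spark for illustration; the master URL, the file path, and the columns key and value are placeholder assumptions, and on a managed cloud cluster only the connection details change:

    library(sparklyr)
    library(dplyr)

    # a local Spark is enough to prototype; the same code runs unchanged on a cluster
    sc <- spark_connect(master = "local")

    # the data stays inside Spark; R only holds a reference to the table
    sales_tbl <- spark_read_csv(sc, name = "sales", path = "big_file.csv")

    sales_tbl %>%
      group_by(key) %>%
      summarise(total = sum(value, na.rm = TRUE)) %>%
      collect()                             # only the aggregated result comes back into R

    spark_disconnect(sc)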


All Answers (12)

21st Dec, 2014
Hassan Abedi
Norwegian University of Science and Technology
Sorry, I don't get your question very well, but R has no problem handling large data sets per se. I mean, assuming you're running R on an x64 *nix OS, you don't need to worry about the size of your data as long as you've got enough RAM/VMEM for R to work with it.
For more details you can have a look at
22nd Dec, 2014
Mayur Narkhede
Birds Eye Systems Pvt. Ltd, Mumbai
R loads all the data into RAM to perform computations on it, so the maximum size of data that you can handle is determined by how much RAM your system has.
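A rough back-of-the-envelope check along these lines (the row and column counts below are made-up numbers for illustration):

    # estimate the in-memory footprint before loading: a numeric value costs 8 bytes
    n_rows <- 50e6
    n_cols <- 20
    n_rows * n_cols * 8 / 1024^3            # approximate size in GB

    # after loading, object.size() reports what an object actually occupies in RAM
    x <- data.frame(a = rnorm(1e6), b = rnorm(1e6))
    print(object.size(x), units = "MB")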
4 Recommendations
22nd Dec, 2014
Dr. Indrajit Mandal
Rajiv Gandhi Institute of Technology, Bangalore
Dear friend
Greetings
A large amount of data can be handled; a few GB of data, for example.
I hope it helps you.
Best regards
Dr. Indrajit Mandal
1 Recommendation
22nd Dec, 2014
Gerard Tromp
Stellenbosch University
R does have limitations. Currently the compilation uses libraries that are constrained to 32-bit integers. This means that some indices and vectors are limited to the 32-bit signed-integer limit (2^31 - 1, roughly 2 billion elements). It is possible to find that some object (a data frame, say) "runs out of space" even when running R on a powerful large-memory computer.
There are ways around this, including packages that create only meta-objects in memory and use HDF5 or NetCDF file storage for very large objects (GenABEL and SNPRelate are examples). In addition, there are the generic packages bigmemory and ff, which can in some instances provide workarounds for the 32-bit integer limitation.
This is not to say that R isn't a wonderful system, just to be clear that there are limitations.
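As a minimal sketch of that file-backed approach (dimensions and file names are placeholders, and the exact arguments depend on the package versions):

    library(bigmemory)

    # a file-backed big.matrix is memory-mapped from disk, so it can be larger than RAM
    x <- filebacked.big.matrix(nrow = 1e8, ncol = 5, type = "double",
                               backingfile = "big.bin", descriptorfile = "big.desc")
    x[1:5, ] <- rnorm(25)                   # indexed access looks like an ordinary matrix

    library(ff)

    # ff vectors (and ffdf data frames) are likewise backed by files on disk
    y <- ff(vmode = "double", length = 1e8)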
7 Recommendations
22nd Dec, 2014
Hassan Abedi
Norwegian University of Science and Technology
Mr. @Tromp: current versions of R don't have these limitations, I think; also, many of these limitations are system-specific. Nonetheless, one can try ?"Memory-limits" in the R REPL to find out more.
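For reference, a quick way to check this from the console; the 2^52 figure is the documented cap for the "long vectors" introduced in R 3.0.0:

    help("Memory-limits")                   # documents the limits of the current R build

    .Machine$integer.max                    # 2^31 - 1, the old per-vector length cap (~2.1 billion)
    2^52                                    # theoretical length cap for long vectors since R 3.0.0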
22nd Dec, 2014
Gerard Tromp
Stellenbosch University
@Hassan Abedi,
You are correct, I just checked. I used to run into the problem of objects exceeding vector limits all the time for the reason stated above, and now run into a very similar problem in that data sets exceed RAM. I simply assumed it was due to the same root cause. 
Nevertheless, the memory mapping solutions given above are useful when the data sets exceed memory storage. I currently run problems that don't fit into 256 GB RAM; one can throw more memory at the problem, but this can be quite costly.
1 Recommendation
23rd Dec, 2014
Pooja Jardosh
Charotar University of Science and Technology
The R data-mining tool can support files that are gigabytes in size as well.
But execution depends on RAM and processor, because after all the data is going to be processed in memory, not on the hard disk.
R and KNIME are among the most popular data mining tools.
1 Recommendation
26th Dec, 2014
Younos Aboulnaga
University of Waterloo
Gerard Tromp's answer covers the size limitations of R pretty well. I only want to add that, if need be, there are packages on CRAN that wrap a data frame and remove the limitations. Check out the CRAN Task View on High-Performance and Parallel Computing with R.
14th Feb, 2019
Abzetdin Adamov
ADA University
R is not a good choice when it comes to working with truly large-scale data (multiple GB), even if you have a quite powerful computer with decent memory. In this case it makes sense to consider Hive on HDFS...
1 Recommendation
2nd Aug, 2019
Silvia Giulio
Università Degli Studi Roma Tre
Hello. I've understood that R and computer RAM have limitations. But can an R calculation last for days?
