Here the main limits are the space on the local hard drive, since it is currently not designed to work across machines (which could in principle be realized via the future family of packages in one way or another), and the number of cores, since it runs in parallel.
disk.frame is the fastest out-of-core data manipulation solution I have worked with so far, and my favorite. It matches the speed of column databases such as MonetDB or DuckDB, with the advantage that you can simply use the map function to apply arbitrary R functions to any disk.frame that fits on your hard drive.
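For what it is worth, a minimal sketch of that kind of workflow; the file name and columns are made up, and the one-stage group_by assumes a reasonably recent disk.frame version:

# hypothetical large CSV "transactions.csv" with columns customer_id and amount
library(disk.frame)
library(dplyr)

setup_disk.frame(workers = 4)           # one background worker per core
options(future.globals.maxSize = Inf)   # let large objects be shipped to workers

tx <- csv_to_disk.frame("transactions.csv", outdir = "transactions.df")

# dplyr verbs run chunk-wise; collect() brings the (small) result into RAM
tx %>%
  group_by(customer_id) %>%
  summarise(total = sum(amount)) %>%
  collect()

# cmap() applies an arbitrary R function to every chunk
cmap(tx, ~ data.frame(rows_in_chunk = nrow(.x))) %>% collect()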
duckdb is also a nice embedded database (the SQLite of analytics) for simple SQL things, and it is very fast on commodity hardware:
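A minimal sketch of what those simple SQL things can look like from R via DBI; the file "sales.csv" and its columns are hypothetical:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = "analytics.duckdb")

# duckdb can query a CSV (or Parquet) file directly, without loading it into R first
dbGetQuery(con, "
  SELECT region, SUM(revenue) AS revenue
  FROM read_csv_auto('sales.csv')
  GROUP BY region
")

dbDisconnect(con, shutdown = TRUE)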
R is fun for big data. sparklyr scales endlessly via Google Cloud autoscaling, but I prefer to stay local and native R if possible; Spark is often overkill, and running pipelines on clusters that could run quickly on a single workstation is not green (see for example https://www.r-bloggers.com/disk-frame-is-epic/).
It is amazing what you can do with disk.frame on a plain laptop with a fast SSD.
The cloud is not necessary for medium data, which ranges into the terabytes.
I heard that SparkR can run native R code too. I have not experimented with it, but I would expect it to run into the hard cap of 2 billion records as well; to my knowledge, only pure Spark (not native R code inside Spark), sparklyr, or disk.frame let you go beyond that.
My current flow: 1. disk.frame. 2. If the data are too large for one machine, then sparklyr on Google Cloud, which automatically manages the Spark cluster (something I don't want to be busy with).
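For step 2, a rough connection sketch; it assumes the code runs on the master node of a managed Spark cluster on Google Cloud (e.g. Dataproc), and the bucket path and column names are hypothetical:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn")   # the managed cluster provides YARN/Spark

events <- spark_read_parquet(sc, name = "events",
                             path = "gs://my-bucket/events/")

events %>%
  group_by(user_id) %>%
  summarise(n = n()) %>%
  collect()

spark_disconnect(sc)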
R does have limitations. Currently the compilation uses libraries that are constrained to 32-bit integers. This means that some indices and vectors are limited to about 2 billion (2^31 - 1) elements. It is possible to find that some object (a data frame) "runs out of space" even when running R on a powerful large-memory computer.
There are ways around this, including packages that create only meta-objects in memory and use HDF5 or NetCDF file storage for very large objects (GenABEL and SNPRelate are examples). In addition, there are the generic packages bigmemory and ff, which can in some instances provide workarounds for the 32-bit integer limitation.
This is not to say that R isn't a wonderful system, just to be clear that there are limitations.
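To make the memory-mapping route concrete, a minimal bigmemory sketch; the dimensions and file names are illustrative only:

library(bigmemory)

# the data live in "big.bin" on disk; only a small descriptor object sits in RAM
x <- filebacked.big.matrix(nrow = 1e7, ncol = 10, type = "double",
                           backingfile = "big.bin",
                           descriptorfile = "big.desc")

x[1, ] <- rnorm(10)    # reads and writes go through the memory-mapped file
mean(x[, 1])           # pulling one column into RAM is cheap; the whole matrix never is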
Sorry, I don't quite get your question, but R has no problem handling large datasets per se; I mean, assuming you're running R on an x64 *nix OS, you don't need to worry about the size of your data as long as you have enough RAM/virtual memory for R to work with it.
Mr @Tromp: current versions of R don't have these limitations, I think; many of these limitations are also system-specific. Nonetheless, one can try ?"Memory-limits" in the R REPL to find out more.
You are correct, I just checked. I used to run into the problem of objects exceeding vector limits all the time for the reason stated above, and now run into a very similar problem in that data sets exceed RAM. I simply assumed it was due to the same root cause.
Nevertheless, the memory-mapping solutions given above are useful when the data set exceeds available RAM. I currently run problems that don't fit into 256 GB of RAM; one can throw more memory at the problem, but this can be quite costly.
Gerard Tromp's answer covers the size limitations of R pretty well. I only want to add that, if need be, there are packages on CRAN that wrap a data frame and remove the limitations. Check out the CRAN Task View on High-Performance and Parallel Computing with R.
R is not a good choice when it comes to working with truly large-scale data (multi-GB), even if you have a quite powerful computer with decent memory. In that case it makes sense to consider Hive on HDFS...