Beyond RHIPE
RHIPE enabled efficient embarrassingly parallel processing of data too large to fit into RAM. At some point the data size can be reduced enough to be manageable for in-memory computation, which opens up opportunities for more analytics.
I. Spark
https://spark.apache.org/docs/latest/index.html
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
Spark itself is written in Scala, and runs on the Java Virtual Machine (JVM). To run Spark, you need an installation of Java.
Spark's Python and R APIs make it convenient for users in the class. Spark runs on both personal computers and compute clusters (Standalone, Hadoop, and Kubernetes).
As we have seen in these benchmarks: https://h2oai.github.io/db-benchmark/, a single server is not likely the best place for Spark. The distributed-parallel HPC cluster implementation is where we can tap into Spark's strengths. https://spark.apache.org/docs/latest/cluster-overview.html
II. Sparklyr
Maintained by RStudio, although parts of the documentation are not up to date. Sparklyr is an R interface to Apache Spark:
- Interact with Spark using familiar R interfaces, most notably dplyr.
- Gain access to Spark's distributed machine learning libraries, Structured Streaming, and ML Pipelines from R.
- Extend your toolbox by adding XGBoost, MLeap, H2O and Graphframes to your Spark plus R analysis.
- Connect R wherever Spark runs: Hadoop, Mesos, Kubernetes, Standalone, and Livy.
- Run distributed R code inside Spark.
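As a minimal sketch of the dplyr workflow listed above (assuming a local Spark installation from spark_install(); on a cluster only the master argument would change):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark; ordinary dplyr verbs are then
# translated to Spark SQL and executed inside Spark
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()          # bring the (small) summary back into R

spark_disconnect(sc)
```

collect() is the step that pulls results out of Spark into R, so it should only be called once the data has been reduced to an in-memory size.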
We will use the Scholar Gateway in the following demonstration, but you could swap out "scholar" with "brown" or "bell" for your research purposes. Please refer to the cluster guides for more details: https://www.rcac.purdue.edu/knowledge/scholar
https://www.rcac.purdue.edu/compute/scholar
1-2. Launch the Gateway (need Career account username and BoilerKey). Choose Interactive Apps from the top ribbon, then choose Scholar Compute Desktop.
3. Drop down to the scholar (Max 4.0 hours) queue; ask for >2 hours. Launch, then wait for your session to start.
4. Choose …
5. Start a Terminal.
6. Type the following shell commands the first time to get the .Rprofile in your home directory:
cd
curl -#LO https://www.rcac.purdue.edu/files/knowledge/run/examples/apps/r/Rprofile_example
mv -ib Rprofile_example ~/.Rprofile
Alternatively, if you don't mind overwriting your previous .Rprofile:
cd
wget https://www.rcac.purdue.edu/files/knowledge/run/examples/apps/r/Rprofile_example
mv Rprofile_example ~/.Rprofile
7. Start R. You need to load the r module first on any of the RCAC HPC clusters.
module load r
R
Alternatively, if you'd like to use RStudio for the convenience of a GUI and an editor, you also need to load the module:
module load rstudio
rstudio &
The & frees up the terminal for other use.
8. Install sparklyr following Get Started. Note: on the Scholar cluster and the Brown cluster, spark_install() resulted in the installation of Spark 2.4.3.
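To see which Spark versions sparklyr knows about and which are already installed (by default under ~/spark), the sparklyr helpers below can be used; this is a small sketch, and the 2.4.3 pin simply matches the version noted above:

```r
library(sparklyr)

spark_available_versions()   # versions sparklyr can download
spark_installed_versions()   # versions already present locally

# Pin the install explicitly so it matches the cluster notes
spark_install(version = "2.4.3")
```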
9. Follow the Guides.
10. Updated Sparkling Water instructions for H2O models are at H2O: https://www.h2o.ai/products/h2o-sparkling-water/
III. Sparkling Water and H2O-3
1. Choose Sparkling Water for Spark 2.4 (for now). Note that there is no need to download the Sparkling Water package on this page. The installation of RSparkling below will take care of that.
2. Choose RSparkling on the top menu (next to PySparkling) to complete R installation. Note, since Spark 2.4.3 was installed and the AWS S3 repository is readily accessible, the R installation/initiation sequence on this page can be reduced to:
- Install Spark: Technically this step has been done already:
library(sparklyr)
spark_install(version = "2.4.3")
- Install H2O of correct version:
install.packages("h2o", type = "source", repos = "https://h2o-release.s3.amazonaws.com/h2o/rel-zorn/3/R")
- Install RSparkling for Sparkling Water 3.36.0.3-1-2.4 from the S3 repository:
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-2.4/3.36.0.3-1-2.4/R")
- Initialize RSparkling
library(rsparkling)
- Connect to Spark
sc <- spark_connect(master = "local", version = "2.4.3")
- Now, H2OContext is available and we can use any H2O features available in R.
hc <- H2OContext.getOrCreate()
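With the context in hand, Spark tables can be converted into H2O frames and fed to H2O's learners. A hedged sketch continuing from the connection above (asH2OFrame is the RSparkling conversion method; h2o.gbm is just one of the h2o models that could be used here):

```r
library(h2o)

# Copy an R data frame into Spark, then convert it to an H2O frame
mtcars_tbl <- sparklyr::sdf_copy_to(sc, mtcars, overwrite = TRUE)
mtcars_h2o <- hc$asH2OFrame(mtcars_tbl)

# Train a gradient boosting model on the H2O side
fit <- h2o.gbm(x = c("wt", "cyl", "hp"), y = "mpg",
               training_frame = mtcars_h2o)
h2o.performance(fit)
```

The reverse direction, hc$asSparkFrame(), hands an H2O frame back to Spark for further dplyr-style processing.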