Beyond RHIPE
RHIPE enabled efficient embarrassingly parallel processing of data too large to fit into RAM. At some point the data size can be reduced enough to be manageable for in-memory computation, which opens up opportunities for more analytics.
I. Spark
https://spark.apache.org/docs/latest/index.html
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.
Spark itself is written in Scala, and runs on the Java Virtual Machine (JVM). To run Spark, you need an installation of Java.
Spark's Python and R APIs make it convenient for users in the class. Spark runs on both personal computers and compute clusters (Standalone, Hadoop, and Kubernetes).
As we have seen in these benchmarks: https://h2oai.github.io/db-benchmark/, a single server is not likely the best place for Spark. The distributed-parallel HPC cluster implementation is where we can tap into Spark's strengths. https://spark.apache.org/docs/latest/cluster-overview.html
II. Sparklyr
Maintained by RStudio, although parts of the documentation are not up to date. Sparklyr is an R interface to Apache Spark:
- Interact with Spark using familiar R interfaces, most notably dplyr.
- Gain access to Spark's distributed machine learning libraries, Structured Streaming, and ML Pipelines from R.
- Extend your toolbox by adding XGBoost, MLeap, H2O and Graphframes to your Spark plus R analysis.
- Connect R wherever Spark runs: Hadoop, Mesos, Kubernetes, Standalone, and Livy.
- Run distributed R code inside Spark.
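As a minimal sketch of the dplyr workflow listed above (assuming a local Spark installation from spark_install(); on a cluster only the master argument would change):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark; ordinary dplyr verbs are then
# translated to Spark SQL and executed inside Spark
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()          # bring the (small) summary back into R

spark_disconnect(sc)
```

collect() is the step that pulls results out of Spark into R, so it should only be called once the data has been reduced to an in-memory size.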
We will use the Scholar Gateway in the following demonstration, but you could swap out "scholar" with "brown" or "bell" for your research purposes. Please refer to the cluster guides for more details: https://www.rcac.purdue.edu/knowledge/scholar
https://www.rcac.purdue.edu/compute/scholar
1-2. Launch the Gateway (need Career account username and BoilerKey). Choose Interactive Apps from the top ribbon, then choose Scholar Compute Desktop.
3. Drop down to the scholar (Max 4.0 hours) queue; ask for >2 hours. Launch, then wait for your session to start.
4. Choose …
5. Start a Terminal.
6. Type the following shell commands the first time to get the .Rprofile in your home directory:
cd
curl -#LO https://www.rcac.purdue.edu/files/knowledge/run/examples/apps/r/Rprofile_example
mv -ib Rprofile_example ~/.Rprofile
Alternatively, if you don't mind overwriting your previous .Rprofile:
cd
wget https://www.rcac.purdue.edu/files/knowledge/run/examples/apps/r/Rprofile_example
mv Rprofile_example ~/.Rprofile
7. Start R. You need to load the r module first on any of the RCAC HPC clusters.
module load r
R
Alternatively, if you'd like to use RStudio for the convenience of a GUI and an editor, you also need to load the module:
module load rstudio
rstudio &
The & frees up the terminal for other use.
8. Install sparklyr following Get Started. Note: on the Scholar cluster and the Brown cluster, spark_install() resulted in the installation of Spark 2.4.3.
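To see which Spark versions sparklyr knows about and which are already installed (by default under ~/spark), the sparklyr helpers below can be used; this is a small sketch, and the 2.4.3 pin simply matches the version noted above:

```r
library(sparklyr)

spark_available_versions()   # versions sparklyr can download
spark_installed_versions()   # versions already present locally

# Pin the install explicitly so it matches the cluster notes
spark_install(version = "2.4.3")
```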
9. Follow the Guides.
10. Updated Sparkling Water instructions for H2O models are at H2O: https://www.h2o.ai/products/h2o-sparkling-water/
III. Sparkling Water and H2O-3
1. Choose Sparkling Water for Spark 2.4 (for now). Note that there is no need to download the Sparkling Water package on this page. The installation of RSparkling below will take care of that.
2. Choose RSparkling on the top menu (next to PySparkling) to complete R installation. Note, since Spark 2.4.3 was installed and the AWS S3 repository is readily accessible, the R installation/initiation sequence on this page can be reduced to:
- Install Spark: Technically this step has been done already:
library(sparklyr)
spark_install(version = "2.4.3")
- Install H2O of correct version:
install.packages("h2o", type = "source", repos = "https://h2o-release.s3.amazonaws.com/h2o/rel-zorn/3/R")
- Install RSparkling for Sparkling Water 3.36.0.3-1-2.4 from the S3 repository:
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-2.4/3.36.0.3-1-2.4/R")
- Initialize RSparkling
library(rsparkling)
- Connect to Spark
sc <- spark_connect(master = "local", version = "2.4.3")
- Now, H2OContext is available and we can use any H2O features available in R.
hc <- H2OContext.getOrCreate()
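With the context in hand, Spark tables can be converted into H2O frames and fed to H2O's learners. A hedged sketch continuing from the connection above (asH2OFrame is the RSparkling conversion method; h2o.gbm is just one of the h2o models that could be used here):

```r
library(h2o)

# Copy an R data frame into Spark, then convert it to an H2O frame
mtcars_tbl <- sparklyr::sdf_copy_to(sc, mtcars, overwrite = TRUE)
mtcars_h2o <- hc$asH2OFrame(mtcars_tbl)

# Train a gradient boosting model on the H2O side
fit <- h2o.gbm(x = c("wt", "cyl", "hp"), y = "mpg",
               training_frame = mtcars_h2o)
h2o.performance(fit)
```

The reverse direction, hc$asSparkFrame(), hands an H2O frame back to Spark for further dplyr-style processing.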