Tag: apache spark

Init script for Google Dataproc cluster with Apache Spark, Python 3 (miniconda) and some pre-installed libraries for data processing

Open in Bitbucket

Sample commands:

copy this shell script to dataproc init directory:
gsutil cp jupyter-spark.sh gs://dataproc-inits/

start cluster:
gcloud dataproc clusters create jupyter-1 --zone asia-east1-b --master-machine-type n1-standard-2 --master-boot-disk-size 100 --num-workers 3 --worker-machine-type n1-standard-4 --worker-boot-disk-size 50 --project spark-recommendation-engine --initialization-actions gs://dataproc-inits/jupyter-spark.sh --scopes 'https://www.googleapis.com/auth/cloud-platform' --properties spark:spark.executorEnv.PYTHONHASHSEED=0

change number of workers:
gcloud dataproc clusters update jupyter-1 --num-workers 3

initiate ssh channel:
gcloud compute ssh --zone=asia-east1-b --ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" jupyter-1-m

start jupyter session:
chromium-browser --proxy-server="socks5://localhost:1080" --host-resolver-rules="MAP * 0.0.0.0, EXCLUDE localhost" --user-data-dir=/tmp/

Data Mining with Apache Spark, Pandas and IPython (Proof of Concept)

This post is written as an IPython Notebook page, you can continue reading below or open it inside nbviewer.