Init script for a Google Dataproc cluster with Apache Spark, Python 3 (Miniconda), and some pre-installed data-processing libraries.
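A minimal sketch of what such an initialization action might look like. The Miniconda download URL, install path, and package list below are illustrative assumptions, not the contents of the actual script:

```shell
#!/usr/bin/env bash
# Sketch of a Dataproc initialization action: install Miniconda (Python 3)
# plus a few data-processing libraries on every node of the cluster.
# The URL, install path, and package list are assumptions.
set -euxo pipefail

MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
CONDA_DIR="/opt/miniconda3"

# Download and install Miniconda non-interactively.
wget -q "${MINICONDA_URL}" -O /tmp/miniconda.sh
bash /tmp/miniconda.sh -b -p "${CONDA_DIR}"

# Put conda's Python on the PATH for all users (and thus for Spark executors).
echo "export PATH=${CONDA_DIR}/bin:\$PATH" > /etc/profile.d/conda.sh

# Pre-install libraries commonly used for data processing.
"${CONDA_DIR}/bin/pip" install numpy pandas scikit-learn jupyter
```

Dataproc runs the initialization action on each node as it is provisioned, so the script should be idempotent and fail loudly (`set -e`) if an install step breaks.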


Sample commands:

Copy the init shell script to a GCS bucket that Dataproc can read (replace <init-script.sh> with the script's actual filename):
gsutil cp <init-script.sh> gs://dataproc-inits/

Start the cluster (spark.executorEnv.PYTHONHASHSEED=0 makes Python 3 string hashing deterministic across executors, which key-based operations such as reduceByKey rely on):
gcloud dataproc clusters create jupyter-1 \
    --zone asia-east1-b \
    --master-machine-type n1-standard-2 --master-boot-disk-size 100 \
    --num-workers 3 \
    --worker-machine-type n1-standard-4 --worker-boot-disk-size 50 \
    --project spark-recommendation-engine \
    --initialization-actions gs://dataproc-inits/ \
    --scopes '' \
    --properties spark:spark.executorEnv.PYTHONHASHSEED=0

Change the number of workers:
gcloud dataproc clusters update jupyter-1 --num-workers 3

Open an SSH tunnel to the master node (a SOCKS proxy on local port 1080; -N and -n keep the session open without running a remote command):
gcloud compute ssh --zone=asia-east1-b --ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" jupyter-1-m
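Once the tunnel is up, it can be sanity-checked from another terminal before launching a browser. Port 8123 here is an assumption; use whatever port the init script starts Jupyter on:

```shell
# Route an HTTP request through the SOCKS proxy; --socks5-hostname makes
# curl resolve jupyter-1-m on the remote side of the tunnel, where the
# cluster's internal DNS is visible.
curl --socks5-hostname localhost:1080 http://jupyter-1-m:8123
```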

Start a browser session routed through the proxy (all traffic, including DNS lookups, goes via the tunnel, so the master's internal hostname resolves):
chromium-browser --proxy-server="socks5://localhost:1080" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" --user-data-dir=/tmp/