Init script for a Google Cloud Dataproc cluster with Apache Spark, Python 3 (Miniconda), and a set of pre-installed libraries for data processing.


Sample commands:

copy this shell script to the Cloud Storage bucket used for Dataproc initialization actions:
gsutil cp jupyter-spark.sh gs://dataproc-inits/

start cluster:
gcloud dataproc clusters create jupyter-1 \
  --zone asia-east1-b \
  --master-machine-type n1-standard-2 \
  --master-boot-disk-size 100 \
  --num-workers 3 \
  --worker-machine-type n1-standard-4 \
  --worker-boot-disk-size 50 \
  --project spark-recommendation-engine \
  --initialization-actions gs://dataproc-inits/jupyter-spark.sh \
  --scopes 'https://www.googleapis.com/auth/cloud-platform' \
  --properties spark:spark.executorEnv.PYTHONHASHSEED=0
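The spark.executorEnv.PYTHONHASHSEED=0 property is worth a note: Python 3 salts the hash() of strings with a random per-process seed, and Spark partitions keys by hash, so executors running separate interpreters must all agree on the seed or key-based operations (reduceByKey, distinct, etc.) can misbehave. A minimal local sketch of the underlying behavior (no Spark required):

```python
import os
import subprocess
import sys

# Python 3 randomizes hash() for str/bytes per interpreter process unless
# PYTHONHASHSEED is pinned. Spark routes keys to partitions by hash, so every
# executor must use the same seed -- hence the cluster property above.
code = 'print(hash("spark"))'

def hash_in_subprocess(seed):
    # Run hash("spark") in a fresh interpreter with the given seed.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run([sys.executable, "-c", code],
                         env=env, capture_output=True, text=True)
    return out.stdout.strip()

# With a fixed seed, two separate interpreters agree on the hash value,
# which is exactly what Spark executors need:
fixed = [hash_in_subprocess("0") for _ in range(2)]
print(fixed[0] == fixed[1])  # True
```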

change number of workers:
gcloud dataproc clusters update jupyter-1 --num-workers 3

initiate ssh channel:
gcloud compute ssh --zone=asia-east1-b --ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" jupyter-1-m

start jupyter session:
chromium-browser \
  --proxy-server="socks5://localhost:1080" \
  --host-resolver-rules="MAP * 0.0.0.0, EXCLUDE localhost" \
  --user-data-dir=/tmp/
