
An init script for a Google Dataproc cluster with Apache Spark, Python 3 (Miniconda), and a set of pre-installed libraries for data processing.


Sample commands:

Copy the shell script to the Dataproc init bucket:
gsutil cp jupyter-spark.sh gs://dataproc-inits/

Start the cluster:
gcloud dataproc clusters create jupyter-1 \
    --zone asia-east1-b \
    --master-machine-type n1-standard-2 \
    --master-boot-disk-size 100 \
    --num-workers 3 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 50 \
    --project spark-recommendation-engine \
    --initialization-actions gs://dataproc-inits/jupyter-spark.sh \
    --scopes 'https://www.googleapis.com/auth/cloud-platform' \
    --properties spark:spark.executorEnv.PYTHONHASHSEED=0
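The spark:spark.executorEnv.PYTHONHASHSEED=0 property is worth a note: Python 3 randomizes string hashing per process, so executors started with different seeds would hash the same keys differently. A quick local check of the effect (not Dataproc-specific; the helper below is just an illustration):

```python
import os
import subprocess
import sys

def string_hash(seed: str) -> str:
    # Launch a fresh interpreter with PYTHONHASHSEED fixed and hash a string.
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('spark'))"],
        env={**os.environ, "PYTHONHASHSEED": seed},
        capture_output=True, text=True,
    )
    return result.stdout.strip()

# With the seed pinned to 0, every interpreter produces the same hash value,
# so all Spark executors partition keys consistently.
assert string_hash("0") == string_hash("0")

# With random seeding (Python 3's default) each process usually differs:
print(string_hash("random"), string_hash("random"))
```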

Change the number of workers:
gcloud dataproc clusters update jupyter-1 --num-workers 3

Initiate an SSH channel (SOCKS proxy on local port 1080):
gcloud compute ssh --zone=asia-east1-b --ssh-flag="-D 1080" --ssh-flag="-N" --ssh-flag="-n" jupyter-1-m

Start a Jupyter session by launching a browser through the SOCKS proxy:
chromium-browser --proxy-server="socks5://localhost:1080" --host-resolver-rules="MAP *, EXCLUDE localhost" --user-data-dir=/tmp/

Interactive c3.js/d3.js charts inside Jupyter Notebook

This post is written as an IPython Notebook page; you can continue reading below or open it in nbviewer.

Accessing AWS Redshift with Python Pandas via the psycopg2 driver

Data Mining with Apache Spark, Pandas and IPython (Proof of Concept)

This post is written as an IPython Notebook page; you can continue reading below or open it in nbviewer.

table_diff – Python micro library for Auditing Data Changes (drafts)

Bitbucket Repository: bitbucket.org/yurz/table_diff

This script compares two CSV files with the same structure and provides a list of differences: entries that have been removed or added, and per-column changes for each entry.

One possible scenario: you have two copies of a data extract taken at different points in time and want an audit log of all the changes. Some people may ask “why not just use MS Excel vlookups?” – and MS Excel is indeed a great tool, and vlookups can be convenient for comparing one or two columns. But imagine your data set contains many columns, or you want to set up automated regular monitoring and keep a log of changes – that is what this script was designed for.

With some minor amendments the same approach can be used for monitoring changes in database tables or any other tabular data structures.
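The idea described above – loading both extracts into SQLite and diffing them with SQL – can be sketched roughly as follows. This is an illustration of the approach, not table_diff's actual code; the function name and key column are assumptions:

```python
import csv
import sqlite3

def diff_csv(old_path, new_path, key="id"):
    """Sketch: load two same-structure CSVs into in-memory SQLite and
    report removed/added keys plus per-column changes."""
    conn = sqlite3.connect(":memory:")
    for name, path in (("old", old_path), ("new", new_path)):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        header = rows[0]
        cols = ", ".join(f'"{c}"' for c in header)
        conn.execute(f"CREATE TABLE {name} ({cols})")
        placeholders = ", ".join("?" * len(header))
        conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows[1:])

    # Entries present in only one of the two extracts.
    removed = [r[0] for r in conn.execute(
        f'SELECT "{key}" FROM old EXCEPT SELECT "{key}" FROM new')]
    added = [r[0] for r in conn.execute(
        f'SELECT "{key}" FROM new EXCEPT SELECT "{key}" FROM old')]

    # Per-column changes for entries present in both extracts.
    changes = []
    for col in [c for c in header if c != key]:
        changes += conn.execute(
            f'SELECT o."{key}", ?, o."{col}", n."{col}" FROM old o '
            f'JOIN new n ON o."{key}" = n."{key}" WHERE o."{col}" <> n."{col}"',
            (col,),
        ).fetchall()
    return removed, added, changes
```

Each entry in the changes list records the key, the column name, the old value and the new value – enough to build an audit log.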


Sample data: (the sample tables are not preserved in this copy)

Optional Parameters:
sqlite_path – path to the SQLite file to be created (otherwise SQLite runs in memory).
fields_to_check – list of columns to check and report on (by default, every column is checked).
fields_to_ignore – similar to the above, for when it is easier to provide a list of columns to ignore.
keep_tables – if “yes”, the working tables in SQLite are preserved.
diff_csv – name/path of the report file (“diff.csv” by default).
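The interplay between fields_to_check and fields_to_ignore might be resolved along these lines. This helper is hypothetical – a sketch of the likely precedence rules, not the library's actual code:

```python
def resolve_fields(all_fields, fields_to_check=None, fields_to_ignore=None):
    """Decide which columns to compare. An explicit check-list wins;
    otherwise ignored columns are dropped; by default all are checked."""
    if fields_to_check:
        return [f for f in all_fields if f in fields_to_check]
    if fields_to_ignore:
        return [f for f in all_fields if f not in fields_to_ignore]
    return list(all_fields)

# Example: with no parameters, every column is compared.
print(resolve_fields(["id", "name", "qty"]))
```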