Spark Arango Postgres

Recently, we had to build a proof of concept (PoC) using Spark, Cassandra, and ArangoDB for an ML pipeline. Let's see how to set up this toolchain locally and build an ML feature pipeline on top of it.

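Java 8 dependencies

Spark runs on the JVM, so install a JDK first. After the JDK installation, set JAVA_HOME in your environment. Note that default-jdk installs the distribution's default JDK, which on recent Ubuntu releases is newer than Java 8; Spark 3.4 supports Java 8, 11, and 17.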

sudo apt install default-jdk
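
A quick way to confirm the JAVA_HOME step is to inspect the environment from Python before starting Spark. The snippet below is a minimal sketch; the helper name describe_java_setup is hypothetical and not part of any library.

```python
import os
import shutil
from pathlib import Path


def describe_java_setup(env=None):
    """Report whether JAVA_HOME is set, whether it contains bin/java,
    and where (if anywhere) a java binary is found on PATH."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    java_bin = Path(java_home) / "bin" / "java" if java_home else None
    return {
        "java_home": java_home,
        "java_home_valid": bool(java_bin and java_bin.is_file()),
        "java_on_path": shutil.which("java", path=env.get("PATH")),
    }


# Print the report for the current shell environment.
print(describe_java_setup())
```

If java_home_valid comes back False, JAVA_HOME is either unset or does not point at a JDK root directory.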

PySpark + Jupyter + (ArangoDB + Cassandra + Postgres) client local setup

Let's install PySpark and JupyterLab with conda in our local environment.

# Create a dedicated environment and activate it so the installs below land in it
conda create -y --name spark python==3.9.16

conda activate spark

conda install -y -c tallic grpcio

conda install -y -c conda-forge jupyterlab numpy scipy matplotlib scikit-learn pandas pyspark grpcio-status python-arango

conda install -y -c anaconda protobuf grpcio-tools
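
Once the installs finish, it can help to confirm that the main packages are importable from the spark environment. The following is a rough sketch; the list of import names is an assumption based on the packages above (python-arango is imported as arango, grpcio as grpc, scikit-learn as sklearn), and the helper names are hypothetical.

```python
import importlib.util

# Import names assumed for the conda packages installed above.
REQUIRED_MODULES = [
    "pyspark", "arango", "grpc", "google.protobuf",
    "numpy", "scipy", "pandas", "sklearn", "matplotlib", "jupyterlab",
]


def is_importable(name):
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package of a dotted name is missing or broken.
        return False


def missing_modules(names):
    """Return the subset of names that cannot be located."""
    return [name for name in names if not is_importable(name)]


print("missing:", missing_modules(REQUIRED_MODULES))
```

An empty list means every package resolved; anything listed needs to be reinstalled inside the activated environment.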

Spark setup

The cluster below (one master, one worker) is based on the Bitnami Spark docker-compose file: https://github.com/bitnami/containers/blob/main/bitnami/spark/docker-compose.yml

version: '2'

services:
  spark:
    image: docker.io/bitnami/spark:3.4
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8080:8080'
      - '7077:7077'
  spark-worker:
    image: docker.io/bitnami/spark:3.4
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
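
With the cluster up, a PySpark session on the host can point at the master port published above (7077). The sketch below assumes the master is reachable at spark://localhost:7077, that the local pyspark version is compatible with the 3.4 image, and that the worker container can connect back to the driver on the host; none of this is guaranteed by the compose file alone. The function names and the app name are hypothetical.

```python
def spark_conf(master_url="spark://localhost:7077", app_name="feature-pipeline-poc"):
    """Build the Spark settings for the local cluster as a plain dict."""
    return {
        "spark.master": master_url,
        "spark.app.name": app_name,
        # Stay within the 1G / 1 core worker defined in the compose file.
        "spark.executor.memory": "512m",
        "spark.cores.max": "1",
    }


def build_session(conf):
    """Create (or reuse) a SparkSession from a dict of settings."""
    # Imported lazily so the config helper works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example usage (requires the cluster to be running):
# spark = build_session(spark_conf())
# spark.range(5).show()
# spark.stop()
```

Passing "local[*]" as master_url runs the same code without the Docker cluster, which is handy for checking the environment first.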

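ArangoDB

The Python client used here is python-arango: https://github.com/ArangoDB-Community/python-arango

Download https://raw.githubusercontent.com/arangodb/example-datasets/master/RandomUsers/names_1000.json into the host folder that is mounted as /data in the container (/home/ng/arango/dataset in the docker run command below). The container joins a Docker network named spark_network, which must exist before the container starts.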
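
The download step can also be scripted. This is a minimal sketch using only the standard library; the helper name download_dataset is hypothetical, and the destination directory is whatever host folder you mount as /data.

```python
import urllib.request
from pathlib import Path

DATASET_URL = (
    "https://raw.githubusercontent.com/arangodb/example-datasets/"
    "master/RandomUsers/names_1000.json"
)


def download_dataset(url, dest_dir):
    """Download url into dest_dir, keeping the file name from the URL."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(target))
    return target


# Example usage (needs network access):
# download_dataset(DATASET_URL, "/home/ng/arango/dataset")
```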

docker run -it --rm \
    --name arangodb \
    --network spark_network \
    -e ARANGO_ROOT_PASSWORD=openSesame \
    -p 8529:8529 \
    -v /home/ng/arango/data:/var/lib/arangodb3 \
    -v /home/ng/arango/dataset:/data \
    arangodb/arangodb:3.11.2

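# From a second terminal (the container above runs in the foreground), import the dataset into a new users collection
docker exec -it arangodb sh -c "/usr/bin/arangoimport --file /data/names_1000.json --collection=users --create-collection=true --type=json --configuration=/etc/arangodb3/arangoimport.conf --server.password=openSesame"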
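
After the import, the users collection can be read from Python with python-arango. The sketch below assumes the server is reachable on localhost:8529, that the import landed in the default _system database, and that the root password is the one passed to docker run above; the helper names are hypothetical.

```python
def users_query(limit=5):
    """Return the AQL text and bind variables for sampling the users collection."""
    return "FOR u IN users LIMIT @limit RETURN u", {"limit": limit}


def fetch_users(limit=5, hosts="http://localhost:8529", password="openSesame"):
    """Fetch a few imported user documents as a list of dicts."""
    # Imported lazily so the query helper works without python-arango installed.
    from arango import ArangoClient

    client = ArangoClient(hosts=hosts)
    db = client.db("_system", username="root", password=password)
    query, bind_vars = users_query(limit)
    cursor = db.aql.execute(query, bind_vars=bind_vars)
    return list(cursor)


# Example usage (requires the ArangoDB container to be running):
# for user in fetch_users(3):
#     print(user)
```

The returned dicts can then be loaded into pandas or a Spark DataFrame as the starting point for the feature pipeline.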