Recently, we had to build a proof of concept (PoC) using Spark, Cassandra, and ArangoDB for an ML pipeline. Let's see how to set up this toolchain locally and build an ML feature pipeline on top of it.
Java 8 dependencies
Install the JDK with the command below, then set JAVA_HOME in your environment.
    sudo apt install default-jdk
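If you are not sure what to put in JAVA_HOME, a small script can guess it from the `java` binary on the PATH. This is only a sketch: the helper names `java_home_from_binary` and `detect_java_home` are ours, and it assumes the usual layout where the executable lives at `<home>/bin/java`.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def java_home_from_binary(java_binary: str) -> str:
    """Return the directory two levels above a path like <home>/bin/java."""
    # Assumption: the executable sits in <home>/bin. On some Java 8 packages
    # it sits in <home>/jre/bin, in which case this returns <home>/jre.
    return str(Path(java_binary).parent.parent)


def detect_java_home() -> Optional[str]:
    """Return JAVA_HOME if it is set, else guess it from java on the PATH."""
    if os.environ.get("JAVA_HOME"):
        return os.environ["JAVA_HOME"]
    java = shutil.which("java")
    if java is None:
        return None  # no JDK found on the PATH
    # /usr/bin/java is usually a chain of symlinks; resolve it first.
    return java_home_from_binary(os.path.realpath(java))


if __name__ == "__main__":
    print(detect_java_home())
```

The printed path can then be exported as JAVA_HOME in your shell profile.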
PySpark + Jupyter + (ArangoDB + Cassandra + Postgres) client local setup
Let's install PySpark and JupyterLab with conda in a local environment.
    conda create --name spark python==3.9.16
    # activate the new environment so the installs below do not land in base
    conda activate spark
    conda install -y -c tallic grpcio
    conda install -y -c conda-forge jupyterlab numpy scipy matplotlib scikit-learn pandas pyspark grpcio-status python-arango
    conda install -y -c anaconda protobuf grpcio-tools
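To confirm that the environment picked everything up, a quick check like the following can be run inside the activated `spark` environment. It is a rough sketch: `missing_modules` and `EXPECTED_MODULES` are names we made up, and the list uses import names, which differ from a few of the conda package names (scikit-learn imports as `sklearn`, python-arango as `arango`, grpcio as `grpc`).

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names for the packages installed above (not the conda package names).
EXPECTED_MODULES = [
    "pyspark", "jupyterlab", "numpy", "scipy", "matplotlib",
    "sklearn", "pandas", "arango", "grpc",
]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially installed packages.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("missing:", missing_modules(EXPECTED_MODULES))
```

An empty list means every module in the list is importable from the current interpreter.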
Spark setup
The compose file below follows the Bitnami example: https://github.com/bitnami/containers/blob/main/bitnami/spark/docker-compose.yml
    version: '2'

    services:
      spark:
        image: docker.io/bitnami/spark:3.4
        environment:
          - SPARK_MODE=master
          - SPARK_RPC_AUTHENTICATION_ENABLED=no
          - SPARK_RPC_ENCRYPTION_ENABLED=no
          - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
          - SPARK_SSL_ENABLED=no
          - SPARK_USER=spark
        ports:
          - '8080:8080'
          - '7077:7077'
      spark-worker:
        image: docker.io/bitnami/spark:3.4
        environment:
          - SPARK_MODE=worker
          - SPARK_MASTER_URL=spark://spark:7077
          - SPARK_WORKER_MEMORY=1G
          - SPARK_WORKER_CORES=1
          - SPARK_RPC_AUTHENTICATION_ENABLED=no
          - SPARK_RPC_ENCRYPTION_ENABLED=no
          - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
          - SPARK_SSL_ENABLED=no
          - SPARK_USER=spark
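Once the containers are up, it helps to check that the master's ports (7077 for the cluster, 8080 for the web UI) are reachable from the host before pointing PySpark at them. The sketch below assumes the ports are published on localhost as in the compose file; `port_open`, `connect`, and the app name `local-poc` are illustrative names of ours. The driver's PySpark version generally needs to match the cluster's Spark version (3.4 here), so treat the `connect` part as a starting point rather than a guaranteed working setup.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def connect(master_url: str = "spark://localhost:7077"):
    """Create a SparkSession against the standalone master (needs pyspark)."""
    # Imported lazily so the port check works even without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master_url)
        .appName("local-poc")  # hypothetical application name
        .getOrCreate()
    )


if __name__ == "__main__":
    for port in (7077, 8080):
        print(port, "open" if port_open("localhost", port) else "closed")
```

In a notebook, `spark = connect()` followed by `spark.range(5).show()` is a simple smoke test once both ports report open.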
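ArangoDB
Python driver: https://github.com/ArangoDB-Community/python-arango
Download https://raw.githubusercontent.com/arangodb/example-datasets/master/RandomUsers/names_1000.json into the host folder that is mounted as /data in the container (/home/ng/arango/dataset in the command below).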
    # the spark_network network must already exist (docker network create spark_network)
    docker run -it --rm \
      --name arangodb \
      --network spark_network \
      -e ARANGO_ROOT_PASSWORD=openSesame \
      -p 8529:8529 \
      -v /home/ng/arango/data:/var/lib/arangodb3 \
      -v /home/ng/arango/dataset:/data \
      arangodb/arangodb:3.11.2

    # in a second terminal, once the server is up, import the dataset into the users collection
    docker exec -it arangodb sh -c "/usr/bin/arangoimport --file /data/names_1000.json --collection=users --create-collection=true --type=json --configuration=/etc/arangodb3/arangoimport.conf --server.password=openSesame"
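After the import, the `users` collection can be read from Python with python-arango. The following is a minimal sketch, assuming the import landed in the default `_system` database, the root password from the `docker run` command, and port 8529 published on localhost. `flatten` and `fetch_users` are helper names of ours; flattening nested documents into dotted column names is just one convenient way to get rows that fit a pandas or Spark DataFrame.

```python
from typing import Any, Dict, List


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single level with dotted keys."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat


def fetch_users(limit: int = 5) -> List[Dict[str, Any]]:
    """Read a few documents from the users collection as flat rows."""
    # Imported lazily so flatten() is usable without python-arango installed.
    from arango import ArangoClient

    client = ArangoClient(hosts="http://localhost:8529")
    # Assumption: default database and the root password set above.
    db = client.db("_system", username="root", password="openSesame")
    cursor = db.aql.execute(
        "FOR u IN users LIMIT @n RETURN u", bind_vars={"n": limit}
    )
    return [flatten(doc) for doc in cursor]
```

The flat rows returned by `fetch_users()` can be passed to `pandas.DataFrame(...)` as a first step toward feature engineering.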