Apache Nifi on GKE can be a good solution if you want a low-code way to process streaming data. Because GKE is a managed version of Kubernetes, you get a scalable environment and do not need to manage the underlying servers yourself.
Setup of the Apache Nifi cluster
The Apache Nifi on GKE setup can be automated with Terraform. Keeping the deployment in code makes changes easy to manage and to track.
Above you see an example architecture for an Apache Nifi on GKE setup. This setup uses Terraform to deploy the cluster and stores the data processed by Nifi in BigQuery and Cloud Storage. Nifi provides processors that write to both of these data sinks.
A Helm chart is available for the Nifi cluster. It manages all the needed components, such as:
- Zookeeper
- Nifi Registry
- Nifi Nodes
The chart is provided by Cetic and can be installed on your cluster with the following commands.
helm repo add cetic https://cetic.github.io/helm-charts
helm repo update
helm install esb cetic/nifi -f custom_values.yaml
Customizing the Helm Chart
To customize your Apache Nifi on GKE deployment, you can adapt the values.yaml file provided on GitHub. It controls, for example, how many Nifi nodes are deployed and how authentication for the Nifi UI is set up.
One important setting here is enabling Nifi Registry. If it is not enabled and set up, a crash of the cluster might result in you losing your flows.
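As a rough sketch of such a customization, the script below writes a small custom_values.yaml that raises the node count and turns on Nifi Registry. The key names `replicaCount` and `registry.enabled` are assumptions about the chart's values.yaml and may differ between chart versions, so compare them with the chart defaults before installing.

```shell
#!/bin/sh
# Write a minimal override file for the Helm install.
# The key names below are assumptions; check them against the chart defaults with:
#   helm show values cetic/nifi
cat > custom_values.yaml <<'EOF'
# Number of Nifi nodes to deploy (assumed key name).
replicaCount: 2

# Deploy Nifi Registry alongside Nifi (assumed key name).
registry:
  enabled: true
EOF

# Show the result so it can be reviewed before running helm install.
cat custom_values.yaml
```

Only the values listed in this file override the chart defaults, so it can stay short and be kept under version control next to the Terraform code.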
Apache Nifi Registry Setup on GKE
Nifi Registry is an additional Nifi service that provides version control for your Nifi flows. It offers two options for storing the versions: the local file system and Git. The XML configuration of each option in providers.xml is shown below. Configure only one of the two at a time.
<flowPersistenceProvider>
<class>org.apache.nifi.registry.provider.flow.FileSystemFlowPersistenceProvider</class>
<property name="Flow Storage Directory">/opt/nifi-registry/nifi-registry-current/flow_storage/</property>
</flowPersistenceProvider>
<flowPersistenceProvider>
<class>org.apache.nifi.registry.provider.flow.git.GitFlowPersistenceProvider</class>
<property name="Flow Storage Directory">/opt/nifi-registry/nifi-registry-current/git/</property>
<property name="Remote To Push">origin</property>
<property name="Remote Access User">USERNAME</property>
<property name="Remote Access Password">PASSWORD</property>
</flowPersistenceProvider>
In Google Cloud it is also possible to use Cloud Storage as a persistence backend. To set this up, you need to customize the container by adding GCSFuse to it. After that, adapt the start.sh file so that the bucket is mounted on container startup.
echo "Mounting GCS Fuse."
gcsfuse --file-mode=777 --dir-mode=777 nifi-repository /opt/nifi-registry/nifi-registry-current/storage/
echo "Mounting completed."
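If the mount fails, Nifi Registry would start against an empty local directory, so it can help to make the step fail loudly. The following is only a sketch of how that might look in start.sh. It assumes the mount point is /opt/nifi-registry/nifi-registry-current/storage, and the `mount_bucket` function and the `BUCKET`, `MOUNT_DIR`, and `GCSFUSE_BIN` variables are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Guarded variant of the mount step (sketch, not the original start.sh).
# BUCKET and MOUNT_DIR mirror the values used above; GCSFUSE_BIN is a
# hypothetical override that allows a dry run without gcsfuse installed.
BUCKET="${BUCKET:-nifi-repository}"
MOUNT_DIR="${MOUNT_DIR:-/opt/nifi-registry/nifi-registry-current/storage}"
GCSFUSE_BIN="${GCSFUSE_BIN:-gcsfuse}"

mount_bucket() {
    # Make sure the mount point exists before mounting.
    mkdir -p "$MOUNT_DIR" || return 1
    echo "Mounting GCS Fuse."
    if "$GCSFUSE_BIN" --file-mode=777 --dir-mode=777 "$BUCKET" "$MOUNT_DIR"; then
        echo "Mounting completed."
    else
        echo "Mounting failed." >&2
        return 1
    fi
}

# In start.sh, call this before launching Nifi Registry:
#   mount_bucket || exit 1
```

Exiting on failure lets Kubernetes restart the container instead of running a registry that silently stores flows on ephemeral disk.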
Connecting Nifi to Registry
To connect Nifi to a registry, add a registry client in Nifi under “Options” -> “Controller Settings” -> “Registry Clients”. Use the Kubernetes cluster-internal IP of Nifi Registry as its address.
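One possible way to get that address is to read the cluster IP of the registry's Kubernetes Service and turn it into a URL. The helper below is a small sketch. The function name `registry_url` and the service name `esb-registry` are hypothetical, and 18080 is Nifi Registry's default HTTP port, which the Service in your deployment may map to a different port.

```shell
#!/bin/sh
# Build the URL to enter in the "Registry Clients" dialog.
# $1: cluster-internal IP of the registry service
# $2: port (optional, defaults to Nifi Registry's default HTTP port 18080)
registry_url() {
    printf 'http://%s:%s\n' "$1" "${2:-18080}"
}

# Example lookup (needs kubectl access to the cluster). "esb-registry" is a
# placeholder; list the real service names with `kubectl get svc`.
#   ip=$(kubectl get svc esb-registry -o jsonpath='{.spec.clusterIP}')
#   port=$(kubectl get svc esb-registry -o jsonpath='{.spec.ports[0].port}')
#   registry_url "$ip" "$port"

# Example with a made-up IP:
registry_url 10.0.0.5
```

Reading the port from the Service as well avoids guessing whether the chart exposes the registry on its default port.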
Then add a bucket to Nifi Registry. This bucket can then be selected in Nifi when setting up version control and will appear as a directory in the git repository.
To put a flow under version control, it needs to be nested inside a “Process Group”. Once this is done, right-click the “Process Group” and under “Version” click “Start version control”.
More documentation can be found here.
Code examples
You can find examples for the Nifi Registry customization in this snippet.