Version: 3.2

StarRocks with Helm

Goals

The goals of this quickstart are:

  • Deploy the StarRocks Kubernetes Operator and a StarRocks cluster with Helm
  • Configure a password for the StarRocks database user root
  • Provide for high-availability with three FEs and three BEs
  • Store metadata in persistent storage
  • Store data in persistent storage
  • Allow MySQL clients to connect from outside the Kubernetes cluster
  • Allow loading data from outside the Kubernetes cluster using Stream Load
  • Load some public datasets
  • Query the data
tip

The datasets and queries are the same as the ones used in the Basic Quick Start. The main difference here is deploying with Helm and the StarRocks Operator.

The data used is provided by NYC OpenData and the National Centers for Environmental Information.

Both of these datasets are large, and because this tutorial is intended to help you get exposed to working with StarRocks we are not going to load data for the past 120 years. You can run this with a GKE Kubernetes cluster built on three e2-standard-4 machines (or similar) with 80GB disk. For larger deployments, we have other documentation and will provide that later.

There is a lot of information in this document. It presents the step-by-step content first and the technical details at the end, so that it serves these purposes in this order:

  1. Get the system deployed with Helm.
  2. Allow the reader to load data in StarRocks and analyze that data.
  3. Explain the basics of data transformation during loading.

Prerequisites

Kubernetes environment

The Kubernetes environment used while writing this guide consists of three nodes with four vCPUs and 16 GB of RAM each (GCP e2-standard-4 machines). The Kubernetes cluster was deployed with this gcloud command:

tip

This command is for your reference. If you are using AWS, Azure, or any other Kubernetes provider, you will need to modify it for your environment. In Google Cloud you will need to specify your own project and an appropriate location.

gcloud container --project enterprise-demo-422514 \
clusters create ee-docs \
--location=southamerica-west1-b \
--machine-type e2-standard-4 --disk-size 80 --num-nodes 3

Helm

Helm is a package manager for Kubernetes that simplifies the deployment and management of applications. In this lab you will use Helm to deploy the StarRocks Kubernetes Operator and the sample StarRocks cluster.

Install helm

SQL client

You can use the SQL client provided in the Kubernetes environment, or use one on your system. This guide uses the mysql CLI, but many MySQL-compatible clients will work.

curl

curl is used to issue the data load job to StarRocks, and to download the datasets. Check to see if you have it installed by running curl or curl.exe at your OS prompt. If curl is not installed, get curl here.


Terminology

FE

Frontend nodes are responsible for metadata management, client connection management, query planning, and query scheduling. Each FE stores and maintains a complete copy of the metadata in its memory, so every FE can provide the same services.

BE

Backend nodes are responsible for both data storage and executing query plans.


Add the StarRocks Helm chart repo

The Helm Chart contains the definitions of the StarRocks Operator and the custom resource StarRocksCluster.

  1. Add the Helm Chart Repo.

    helm repo add starrocks https://starrocks.github.io/starrocks-kubernetes-operator
  2. Update the Helm Chart Repo to the latest version.

    helm repo update
  3. View the Helm Chart Repo that you added.

    helm search repo starrocks
    NAME                       CHART VERSION   APP VERSION   DESCRIPTION
    starrocks/kube-starrocks   1.9.7           3.2-latest    kube-starrocks includes two subcharts, operator...
    starrocks/operator         1.9.7           1.9.7         A Helm chart for StarRocks operator
    starrocks/starrocks        1.9.7           3.2-latest    A Helm chart for StarRocks cluster
    starrocks/warehouse        1.9.7           3.2-latest    Warehouse is currently a feature of the StarRoc...

Download the data

Download these two datasets to your machine.

New York City crash data

curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/NYPD_Crash_Data.csv

Weather data

curl -O https://raw.githubusercontent.com/StarRocks/demo/master/documentation-samples/quickstart/datasets/72505394728.csv

Create a Helm values file

The goals for this quick start are:

  1. Configure a password for the StarRocks database user root
  2. Provide for high-availability with three FEs and three BEs
  3. Store metadata in persistent storage
  4. Store data in persistent storage
  5. Allow MySQL clients to connect from outside the Kubernetes cluster
  6. Allow loading data from outside the Kubernetes cluster using Stream Load

The Helm chart provides options to satisfy all of these goals, but they are not configured by default. The rest of this section covers the configuration needed to meet all of these goals. A complete values spec will be provided, but first read the details for each of the six sections and then copy the full spec.

1. Password for the database user

This bit of YAML instructs the StarRocks operator to set the password for the database user root to the value of the password key of the Kubernetes secret starrocks-root-pass.

starrocks:
    initPassword:
        enabled: true
        # Set a password secret, for example:
        # kubectl create secret generic starrocks-root-pass --from-literal=password='g()()dpa$$word'
        passwordSecret: starrocks-root-pass
  • Task: Create the Kubernetes secret

    kubectl create secret generic starrocks-root-pass --from-literal=password='g()()dpa$$word'
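
The single quotes around the sample password matter. As a small illustration in plain shell (nothing here is StarRocks-specific, and the variable names are only for this sketch), the lines below show what would happen if the same password were double-quoted: the shell replaces $$ with its own process ID, so a different password would end up in the secret.

```shell
# Single quotes keep every character of the password literal.
literal='g()()dpa$$word'

# Double quotes let the shell expand $$ to the current process ID.
expanded="g()()dpa$$word"

printf 'single-quoted: %s\n' "$literal"
printf 'double-quoted: %s\n' "$expanded"
```

If the two printed values differ on your system, as they should, that is the reason to keep the single quotes whenever you type this password at a shell prompt.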

2. High Availability with 3 FEs and 3 BEs

By setting starrocks.starrocksFESpec.replicas to 3 and starrocks.starrocksBeSpec.replicas to 3, you will have enough FEs and BEs for high availability. Setting the CPU and memory requests low allows the pods to be created in a small Kubernetes environment.

starrocks:
    starrocksFESpec:
        replicas: 3
        resources:
            requests:
                cpu: 1
                memory: 1Gi

    starrocksBeSpec:
        replicas: 3
        resources:
            requests:
                cpu: 1
                memory: 2Gi

3. Store metadata in persistent storage

Setting a value for starrocks.starrocksFESpec.storageSpec.name to anything other than "" causes:

  • Persistent storage to be used
  • the value of starrocks.starrocksFESpec.storageSpec.name to be used as the prefix for all storage volumes for the service.

Setting the value to fe creates these persistent volume claims (PVCs) for FE 0:

  • fe-meta-kube-starrocks-fe-0
  • fe-log-kube-starrocks-fe-0
starrocks:
    starrocksFESpec:
        storageSpec:
            name: fe

4. Store data in persistent storage

Setting a value for starrocks.starrocksBeSpec.storageSpec.name to anything other than "" causes:

  • Persistent storage to be used
  • the value of starrocks.starrocksBeSpec.storageSpec.name to be used as the prefix for all storage volumes for the service.

Setting the value to be creates these persistent volume claims (PVCs) for BE 0:

  • be-data-kube-starrocks-be-0
  • be-log-kube-starrocks-be-0

Setting the storageSize to 15Gi reduces the storage from the default of 1Ti to fit smaller quotas for storage.

starrocks:
    starrocksBeSpec:
        storageSpec:
            name: be
            storageSize: 15Gi

5. LoadBalancer for MySQL clients

By default, access to the FE service is through cluster IPs. To allow external access, service.type is set to LoadBalancer.

starrocks:
    starrocksFESpec:
        service:
            type: LoadBalancer

6. LoadBalancer for external data loading

Stream Load requires external access to both FEs and BEs. The requests are sent to the FE, and then the FE assigns a BE to process the upload. To allow the curl command to be redirected to the BE, the starrocksFeProxySpec needs to be enabled and set to type LoadBalancer.

starrocks:
    starrocksFeProxySpec:
        enabled: true
        service:
            type: LoadBalancer

The complete values file

The above snippets combined provide a full values file. Save this to my-values.yaml:

starrocks:
    initPassword:
        enabled: true
        # Set a password secret, for example:
        # kubectl create secret generic starrocks-root-pass --from-literal=password='g()()dpa$$word'
        passwordSecret: starrocks-root-pass

    starrocksFESpec:
        replicas: 3
        service:
            type: LoadBalancer
        resources:
            requests:
                cpu: 1
                memory: 1Gi
        storageSpec:
            name: fe

    starrocksBeSpec:
        replicas: 3
        resources:
            requests:
                cpu: 1
                memory: 2Gi
        storageSpec:
            name: be
            storageSize: 15Gi

    starrocksFeProxySpec:
        enabled: true
        service:
            type: LoadBalancer

Set the StarRocks root database user password

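To load data from outside of the Kubernetes cluster, the StarRocks database will be exposed externally, so you should set a password for the StarRocks database user root. The operator will apply the password to the FE and BE nodes. If you already created the secret while working through the values file sections, skip this command; kubectl reports an error when the secret already exists.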

kubectl create secret generic starrocks-root-pass --from-literal=password='g()()dpa$$word'
secret/starrocks-root-pass created

Deploy the operator and StarRocks cluster

helm install -f my-values.yaml starrocks starrocks/kube-starrocks
NAME: starrocks
LAST DEPLOYED: Wed Jun 26 20:25:09 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Thank you for installing kube-starrocks-1.9.7 kube-starrocks chart.
It will install both operator and starrocks cluster, please wait for a few minutes for the cluster to be ready.

Please see the values.yaml for more operation information: https://github.com/StarRocks/starrocks-kubernetes-operator/blob/main/helm-charts/charts/kube-starrocks/values.yaml

Check the status of the StarRocks cluster

You can check the progress with these commands:

kubectl --namespace default get starrockscluster -l "cluster=kube-starrocks"
NAME             PHASE         FESTATUS      BESTATUS      CNSTATUS   FEPROXYSTATUS
kube-starrocks   reconciling   reconciling   reconciling              reconciling
kubectl get pods
note

The kube-starrocks-initpwd pod will go through error and CrashLoopBackOff states as it attempts to connect to the FE and BE pods to set the StarRocks root password. You should ignore these errors and wait for a status of Completed for this pod.

NAME                                       READY   STATUS             RESTARTS      AGE
kube-starrocks-be-0                        0/1     Running            0             20s
kube-starrocks-be-1                        0/1     Running            0             20s
kube-starrocks-be-2                        0/1     Running            0             20s
kube-starrocks-fe-0                        1/1     Running            0             66s
kube-starrocks-fe-1                        0/1     Running            0             65s
kube-starrocks-fe-2                        0/1     Running            0             66s
kube-starrocks-fe-proxy-56f8998799-d4qmt   1/1     Running            0             20s
kube-starrocks-initpwd-m84br               0/1     CrashLoopBackOff   3 (50s ago)   92s
kube-starrocks-operator-54ffcf8c5c-xsjc8   1/1     Running            0             92s
kubectl get pvc
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
be-data-kube-starrocks-be-0   Bound    pvc-4ae0c9d8-7f9a-4147-ad74-b22569165448   15Gi       RWO            standard-rwo   <unset>                 82s
be-data-kube-starrocks-be-1   Bound    pvc-28b4dbd1-0c8f-4b06-87e8-edec616cabbc   15Gi       RWO            standard-rwo   <unset>                 82s
be-data-kube-starrocks-be-2   Bound    pvc-c7232ea6-d3d9-42f1-bfc1-024205a17656   15Gi       RWO            standard-rwo   <unset>                 82s
be-log-kube-starrocks-be-0    Bound    pvc-6193c43d-c74f-4d12-afcc-c41ace3d5408   1Gi        RWO            standard-rwo   <unset>                 82s
be-log-kube-starrocks-be-1    Bound    pvc-c01f124a-014a-439a-99a6-6afe95215bf0   1Gi        RWO            standard-rwo   <unset>                 82s
be-log-kube-starrocks-be-2    Bound    pvc-136df15f-4d2e-43bc-a1c0-17227ce3fe6b   1Gi        RWO            standard-rwo   <unset>                 82s
fe-log-kube-starrocks-fe-0    Bound    pvc-7eac524e-d286-4760-b21c-d9b6261d976f   5Gi        RWO            standard-rwo   <unset>                 2m23s
fe-log-kube-starrocks-fe-1    Bound    pvc-38076b78-71e8-4659-b8e7-6751bec663f6   5Gi        RWO            standard-rwo   <unset>                 2m23s
fe-log-kube-starrocks-fe-2    Bound    pvc-4ccfee60-02b7-40ba-a22e-861ea29dac74   5Gi        RWO            standard-rwo   <unset>                 2m23s
fe-meta-kube-starrocks-fe-0   Bound    pvc-5130c9ff-b797-4f79-a1d2-4214af860d70   10Gi       RWO            standard-rwo   <unset>                 2m23s
fe-meta-kube-starrocks-fe-1   Bound    pvc-13545330-63be-42cf-b1ca-3ed6f96a8c98   10Gi       RWO            standard-rwo   <unset>                 2m23s
fe-meta-kube-starrocks-fe-2   Bound    pvc-609cadd4-c7b7-4cf9-84b0-a75678bb3c4d   10Gi       RWO            standard-rwo   <unset>                 2m23s

Verify that the cluster is healthy

tip

These are the same commands as above, but show the desired state.

kubectl --namespace default get starrockscluster -l "cluster=kube-starrocks"
NAME             PHASE     FESTATUS   BESTATUS   CNSTATUS   FEPROXYSTATUS
kube-starrocks   running   running    running               running
kubectl get pods
tip

The system is ready when all of the pods except for kube-starrocks-initpwd show 1/1 in the READY column. The kube-starrocks-initpwd pod should show 0/1 and a STATUS of Completed.

NAME                                       READY   STATUS      RESTARTS   AGE
kube-starrocks-be-0                        1/1     Running     0          57s
kube-starrocks-be-1                        1/1     Running     0          57s
kube-starrocks-be-2                        1/1     Running     0          57s
kube-starrocks-fe-0                        1/1     Running     0          103s
kube-starrocks-fe-1                        1/1     Running     0          102s
kube-starrocks-fe-2                        1/1     Running     0          103s
kube-starrocks-fe-proxy-56f8998799-d4qmt   1/1     Running     0          57s
kube-starrocks-initpwd-m84br               0/1     Completed   4          2m9s
kube-starrocks-operator-54ffcf8c5c-xsjc8   1/1     Running     0          2m9s
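
The readiness rule described above can be checked mechanically instead of by eye. The following is a minimal sketch, not part of the StarRocks tooling: not_ready is a hypothetical helper name, and it assumes the column layout of kubectl get pods shown in this tutorial. To keep the sketch runnable without a cluster, it is fed a few rows copied from the earlier in-progress output.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# print every pod whose READY column is not n/n, skipping the initpwd pod.
not_ready() {
  awk '$1 !~ /initpwd/ { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Sample rows copied from the in-progress output earlier in this tutorial.
sample='kube-starrocks-be-0 0/1 Running 0 20s
kube-starrocks-fe-0 1/1 Running 0 66s
kube-starrocks-initpwd-m84br 0/1 CrashLoopBackOff 3 (50s ago) 92s'

pending=$(printf '%s\n' "$sample" | not_ready)
echo "still starting: $pending"
```

Against a live cluster you would pipe kubectl get pods --no-headers into not_ready; no output means every pod other than kube-starrocks-initpwd is ready.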

The EXTERNAL-IP addresses of the kube-starrocks-fe-service and kube-starrocks-fe-proxy-service services will be used to provide SQL client and Stream Load access from outside the Kubernetes cluster.

kubectl get services
NAME                              TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                                                       AGE
kube-starrocks-be-search          ClusterIP      None             <none>          9050/TCP                                                      78s
kube-starrocks-be-service         ClusterIP      34.118.228.231   <none>          9060/TCP,8040/TCP,9050/TCP,8060/TCP                           78s
kube-starrocks-fe-proxy-service   LoadBalancer   34.118.230.176   34.176.12.205   8080:30241/TCP                                                78s
kube-starrocks-fe-search          ClusterIP      None             <none>          9030/TCP                                                      2m4s
kube-starrocks-fe-service         LoadBalancer   34.118.226.82    34.176.215.97   8030:30620/TCP,9020:32461/TCP,9030:32749/TCP,9010:30911/TCP   2m4s
kubernetes                        ClusterIP      34.118.224.1     <none>          443/TCP                                                       8h
tip

Store the EXTERNAL-IP addresses of the kube-starrocks-fe-service and kube-starrocks-fe-proxy-service services in environment variables so that you have them handy:

export MYSQL_IP=`kubectl get services kube-starrocks-fe-service --output jsonpath='{.status.loadBalancer.ingress[0].ip}'`
export FE_PROXY=`kubectl get services kube-starrocks-fe-proxy-service --output jsonpath='{.status.loadBalancer.ingress[0].ip}'`:8080
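
A load balancer can take a little while to receive its external address, and until it does the jsonpath expression is likely to return an empty string, which would leave MYSQL_IP empty and FE_PROXY set to just :8080. The sketch below is one way to catch that before running mysql or curl. The function name require_address is hypothetical, and the addresses passed to it are the sample values from the service listing above, not values from your cluster.

```shell
# Hypothetical helper: fail loudly when a variable holds no address yet.
# $1 is the variable name, $2 is its current value.
require_address() {
  case "$2" in
    ''|:*)
      echo "$1 has no address yet; wait for the EXTERNAL-IP and export it again" >&2
      return 1
      ;;
  esac
  echo "$1=$2"
}

# Sample values from this tutorial; substitute "$MYSQL_IP" and "$FE_PROXY".
require_address MYSQL_IP "34.176.215.97"
require_address FE_PROXY "34.176.12.205:8080"

# An address that has not been assigned yet is reported as an error.
require_address FE_PROXY ":8080" || echo "FE_PROXY is not ready"
```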

Connect to StarRocks with a SQL client

tip

If you are using a client other than the mysql CLI, open that now.

This command will run the mysql command in a Kubernetes pod:

kubectl exec --stdin --tty kube-starrocks-fe-0 -- \
  mysql -P9030 -h127.0.0.1 -u root --prompt="StarRocks > " -p

If you have the mysql CLI installed locally, you can use it instead of the one in the Kubernetes cluster:

mysql -P9030 -h $MYSQL_IP -u root --prompt="StarRocks > " -p

Create some tables

mysql -P9030 -h $MYSQL_IP -u root --prompt="StarRocks > " -p

Create a database

Type these two lines in at the StarRocks > prompt and press enter after each:

CREATE DATABASE IF NOT EXISTS quickstart;

USE quickstart;

Create two tables

Crashdata

The crash dataset contains many more fields than these; the schema has been trimmed down to include only the fields that might be useful for answering questions about the impact weather has on driving conditions.

CREATE TABLE IF NOT EXISTS crashdata (
    CRASH_DATE DATETIME,
    BOROUGH STRING,
    ZIP_CODE STRING,
    LATITUDE INT,
    LONGITUDE INT,
    LOCATION STRING,
    ON_STREET_NAME STRING,
    CROSS_STREET_NAME STRING,
    OFF_STREET_NAME STRING,
    CONTRIBUTING_FACTOR_VEHICLE_1 STRING,
    CONTRIBUTING_FACTOR_VEHICLE_2 STRING,
    COLLISION_ID INT,
    VEHICLE_TYPE_CODE_1 STRING,
    VEHICLE_TYPE_CODE_2 STRING
);

Weatherdata

Similar to the crash data, the weather dataset has many more columns (a total of 125 columns) and only the ones that are expected to answer the questions are included in the database.

CREATE TABLE IF NOT EXISTS weatherdata (
    DATE DATETIME,
    NAME STRING,
    HourlyDewPointTemperature STRING,
    HourlyDryBulbTemperature STRING,
    HourlyPrecipitation STRING,
    HourlyPresentWeatherType STRING,
    HourlyPressureChange STRING,
    HourlyPressureTendency STRING,
    HourlyRelativeHumidity STRING,
    HourlySkyConditions STRING,
    HourlyVisibility STRING,
    HourlyWetBulbTemperature STRING,
    HourlyWindDirection STRING,
    HourlyWindGustSpeed STRING,
    HourlyWindSpeed STRING
);

Exit from the MySQL client, or open a new shell to run commands at the command line to upload data.

exit

Upload data

There are many ways to load data into StarRocks. For this tutorial, the simplest way is to use curl and StarRocks Stream Load.

Upload the two datasets that you downloaded earlier.

tip

Open a new shell as these curl commands are run at the operating system prompt, not in the mysql client. The commands refer to the datasets that you downloaded, so run them from the directory where you downloaded the files.

Since this is a new shell, run the export commands again:


export MYSQL_IP=`kubectl get services kube-starrocks-fe-service --output jsonpath='{.status.loadBalancer.ingress[0].ip}'`

export FE_PROXY=`kubectl get services kube-starrocks-fe-proxy-service --output jsonpath='{.status.loadBalancer.ingress[0].ip}'`:8080

You will be prompted for a password. Use the password that you added to the Kubernetes secret starrocks-root-pass. If you used the command provided, the password is g()()dpa$$word.

The curl commands look complex, but they are explained in detail at the end of the tutorial. For now, we recommend running the commands and running some SQL to analyze the data, and then reading about the data loading details at the end.

curl --location-trusted -u root             \
-T ./NYPD_Crash_Data.csv \
-H "label:crashdata-0" \
-H "column_separator:," \
-H "skip_header:1" \
-H "enclose:\"" \
-H "max_filter_ratio:1" \
-H "columns:tmp_CRASH_DATE, tmp_CRASH_TIME, CRASH_DATE=str_to_date(concat_ws(' ', tmp_CRASH_DATE, tmp_CRASH_TIME), '%m/%d/%Y %H:%i'),BOROUGH,ZIP_CODE,LATITUDE,LONGITUDE,LOCATION,ON_STREET_NAME,CROSS_STREET_NAME,OFF_STREET_NAME,NUMBER_OF_PERSONS_INJURED,NUMBER_OF_PERSONS_KILLED,NUMBER_OF_PEDESTRIANS_INJURED,NUMBER_OF_PEDESTRIANS_KILLED,NUMBER_OF_CYCLIST_INJURED,NUMBER_OF_CYCLIST_KILLED,NUMBER_OF_MOTORIST_INJURED,NUMBER_OF_MOTORIST_KILLED,CONTRIBUTING_FACTOR_VEHICLE_1,CONTRIBUTING_FACTOR_VEHICLE_2,CONTRIBUTING_FACTOR_VEHICLE_3,CONTRIBUTING_FACTOR_VEHICLE_4,CONTRIBUTING_FACTOR_VEHICLE_5,COLLISION_ID,VEHICLE_TYPE_CODE_1,VEHICLE_TYPE_CODE_2,VEHICLE_TYPE_CODE_3,VEHICLE_TYPE_CODE_4,VEHICLE_TYPE_CODE_5" \
-XPUT http://$FE_PROXY/api/quickstart/crashdata/_stream_load
Enter host password for user 'root':
{
    "TxnId": 2,
    "Label": "crashdata-0",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 423726,
    "NumberLoadedRows": 423725,
    "NumberFilteredRows": 1,
    "NumberUnselectedRows": 0,
    "LoadBytes": 96227746,
    "LoadTimeMs": 2483,
    "BeginTxnTimeMs": 42,
    "StreamLoadPlanTimeMs": 122,
    "ReadDataTimeMs": 1610,
    "WriteDataTimeMs": 2253,
    "CommitAndPublishTimeMs": 65,
    "ErrorURL": "http://kube-starrocks-be-2.kube-starrocks-be-search.default.svc.cluster.local:8040/api/_load_error_log?file=error_log_5149e6f80de42bcb_eab2ea77276de4ba"
}
curl --location-trusted -u root             \
-T ./72505394728.csv \
-H "label:weather-0" \
-H "column_separator:," \
-H "skip_header:1" \
-H "enclose:\"" \
-H "max_filter_ratio:1" \
-H "columns: STATION, DATE, LATITUDE, LONGITUDE, ELEVATION, NAME, REPORT_TYPE, SOURCE, HourlyAltimeterSetting, HourlyDewPointTemperature, HourlyDryBulbTemperature, HourlyPrecipitation, HourlyPresentWeatherType, HourlyPressureChange, HourlyPressureTendency, HourlyRelativeHumidity, HourlySkyConditions, HourlySeaLevelPressure, HourlyStationPressure, HourlyVisibility, HourlyWetBulbTemperature, HourlyWindDirection, HourlyWindGustSpeed, HourlyWindSpeed, Sunrise, Sunset, DailyAverageDewPointTemperature, DailyAverageDryBulbTemperature, DailyAverageRelativeHumidity, DailyAverageSeaLevelPressure, DailyAverageStationPressure, DailyAverageWetBulbTemperature, DailyAverageWindSpeed, DailyCoolingDegreeDays, DailyDepartureFromNormalAverageTemperature, DailyHeatingDegreeDays, DailyMaximumDryBulbTemperature, DailyMinimumDryBulbTemperature, DailyPeakWindDirection, DailyPeakWindSpeed, DailyPrecipitation, DailySnowDepth, DailySnowfall, DailySustainedWindDirection, DailySustainedWindSpeed, DailyWeather, MonthlyAverageRH, MonthlyDaysWithGT001Precip, MonthlyDaysWithGT010Precip, MonthlyDaysWithGT32Temp, MonthlyDaysWithGT90Temp, MonthlyDaysWithLT0Temp, MonthlyDaysWithLT32Temp, MonthlyDepartureFromNormalAverageTemperature, MonthlyDepartureFromNormalCoolingDegreeDays, MonthlyDepartureFromNormalHeatingDegreeDays, MonthlyDepartureFromNormalMaximumTemperature, MonthlyDepartureFromNormalMinimumTemperature, MonthlyDepartureFromNormalPrecipitation, MonthlyDewpointTemperature, MonthlyGreatestPrecip, MonthlyGreatestPrecipDate, MonthlyGreatestSnowDepth, MonthlyGreatestSnowDepthDate, MonthlyGreatestSnowfall, MonthlyGreatestSnowfallDate, MonthlyMaxSeaLevelPressureValue, MonthlyMaxSeaLevelPressureValueDate, MonthlyMaxSeaLevelPressureValueTime, MonthlyMaximumTemperature, MonthlyMeanTemperature, MonthlyMinSeaLevelPressureValue, MonthlyMinSeaLevelPressureValueDate, MonthlyMinSeaLevelPressureValueTime, MonthlyMinimumTemperature, MonthlySeaLevelPressure, MonthlyStationPressure, MonthlyTotalLiquidPrecipitation, MonthlyTotalSnowfall, MonthlyWetBulb, AWND, CDSD, CLDD, DSNW, HDSD, HTDD, NormalsCoolingDegreeDay, NormalsHeatingDegreeDay, ShortDurationEndDate005, ShortDurationEndDate010, ShortDurationEndDate015, ShortDurationEndDate020, ShortDurationEndDate030, ShortDurationEndDate045, ShortDurationEndDate060, ShortDurationEndDate080, ShortDurationEndDate100, ShortDurationEndDate120, ShortDurationEndDate150, ShortDurationEndDate180, ShortDurationPrecipitationValue005, ShortDurationPrecipitationValue010, ShortDurationPrecipitationValue015, ShortDurationPrecipitationValue020, ShortDurationPrecipitationValue030, ShortDurationPrecipitationValue045, ShortDurationPrecipitationValue060, ShortDurationPrecipitationValue080, ShortDurationPrecipitationValue100, ShortDurationPrecipitationValue120, ShortDurationPrecipitationValue150, ShortDurationPrecipitationValue180, REM, BackupDirection, BackupDistance, BackupDistanceUnit, BackupElements, BackupElevation, BackupEquipment, BackupLatitude, BackupLongitude, BackupName, WindEquipmentChangeDate" \
-XPUT http://$FE_PROXY/api/quickstart/weatherdata/_stream_load
Enter host password for user 'root':
{
    "TxnId": 4,
    "Label": "weather-0",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 22931,
    "NumberLoadedRows": 22931,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 15558550,
    "LoadTimeMs": 404,
    "BeginTxnTimeMs": 1,
    "StreamLoadPlanTimeMs": 7,
    "ReadDataTimeMs": 157,
    "WriteDataTimeMs": 372,
    "CommitAndPublishTimeMs": 23
}

Connect with a MySQL client

Connect with a MySQL client if you are not connected. Remember to use the external IP address of the kube-starrocks-fe-service service and the password that you configured in the Kubernetes secret starrocks-root-pass.

mysql -P9030 -h $MYSQL_IP -u root --prompt="StarRocks > " -p

Answer some questions

These queries can be run in your SQL client. All of the queries use the quickstart database.

USE quickstart;

How many crashes are there per hour in NYC?

SELECT COUNT(*),
date_trunc("hour", crashdata.CRASH_DATE) AS Time
FROM crashdata
GROUP BY Time
ORDER BY Time ASC
LIMIT 200;

Here is part of the output. I am looking more closely at January 6th and 7th because they are the Monday and Tuesday of a non-holiday week. New Year's Day is probably not indicative of a normal morning during rush-hour traffic.

|       14 | 2014-01-06 06:00:00 |
|       16 | 2014-01-06 07:00:00 |
|       43 | 2014-01-06 08:00:00 |
|       44 | 2014-01-06 09:00:00 |
|       21 | 2014-01-06 10:00:00 |
|       28 | 2014-01-06 11:00:00 |
|       34 | 2014-01-06 12:00:00 |
|       31 | 2014-01-06 13:00:00 |
|       35 | 2014-01-06 14:00:00 |
|       36 | 2014-01-06 15:00:00 |
|       33 | 2014-01-06 16:00:00 |
|       40 | 2014-01-06 17:00:00 |
|       35 | 2014-01-06 18:00:00 |
|       23 | 2014-01-06 19:00:00 |
|       16 | 2014-01-06 20:00:00 |
|       12 | 2014-01-06 21:00:00 |
|       17 | 2014-01-06 22:00:00 |
|       14 | 2014-01-06 23:00:00 |
|       10 | 2014-01-07 00:00:00 |
|        4 | 2014-01-07 01:00:00 |
|        1 | 2014-01-07 02:00:00 |
|        3 | 2014-01-07 03:00:00 |
|        2 | 2014-01-07 04:00:00 |
|        6 | 2014-01-07 06:00:00 |
|       16 | 2014-01-07 07:00:00 |
|       41 | 2014-01-07 08:00:00 |
|       37 | 2014-01-07 09:00:00 |
|       33 | 2014-01-07 10:00:00 |

There are about 40 accidents per hour during the morning rush on a Monday or Tuesday, and about the same number at 17:00.

What is the average temperature in NYC?

SELECT avg(HourlyDryBulbTemperature),
date_trunc("hour", weatherdata.DATE) AS Time
FROM weatherdata
GROUP BY Time
ORDER BY Time ASC
LIMIT 100;

Output:

Note that this is data from 2014; NYC has not been this cold lately.

+-------------------------------+---------------------+
| avg(HourlyDryBulbTemperature) | Time                |
+-------------------------------+---------------------+
|                            25 | 2014-01-01 00:00:00 |
|                            25 | 2014-01-01 01:00:00 |
|                            24 | 2014-01-01 02:00:00 |
|                            24 | 2014-01-01 03:00:00 |
|                            24 | 2014-01-01 04:00:00 |
|                            24 | 2014-01-01 05:00:00 |
|                            25 | 2014-01-01 06:00:00 |
|                            26 | 2014-01-01 07:00:00 |

Is it safe to drive in NYC when visibility is poor?

Let's look at the number of crashes when visibility is poor (between 0 and 1.0 miles). To answer this question use a JOIN across the two tables on the DATETIME column.

SELECT COUNT(DISTINCT c.COLLISION_ID) AS Crashes,
truncate(avg(w.HourlyDryBulbTemperature), 1) AS Temp_F,
truncate(avg(w.HourlyVisibility), 2) AS Visibility,
max(w.HourlyPrecipitation) AS Precipitation,
date_format((date_trunc("hour", c.CRASH_DATE)), '%d %b %Y %H:%i') AS Hour
FROM crashdata c
LEFT JOIN weatherdata w
ON date_trunc("hour", c.CRASH_DATE)=date_trunc("hour", w.DATE)
WHERE w.HourlyVisibility BETWEEN 0.0 AND 1.0
GROUP BY Hour
ORDER BY Crashes DESC
LIMIT 100;

The highest number of crashes in a single hour during low visibility is 129. There are multiple things to consider:

  • February 3rd 2014 was a Monday
  • 8AM is rush hour
  • It was raining (0.12 inches of precipitation that hour)
  • The temperature was 32 degrees F (the freezing point of water)
  • Visibility was poor at 0.25 miles; normal visibility for NYC is 10 miles
+---------+--------+------------+---------------+-------------------+
| Crashes | Temp_F | Visibility | Precipitation | Hour              |
+---------+--------+------------+---------------+-------------------+
|     129 |     32 |       0.25 | 0.12          | 03 Feb 2014 08:00 |
|     114 |     32 |       0.25 | 0.12          | 03 Feb 2014 09:00 |
|     104 |     23 |       0.33 | 0.03          | 09 Jan 2015 08:00 |
|      96 |   26.3 |       0.33 | 0.07          | 01 Mar 2015 14:00 |
|      95 |     26 |       0.37 | 0.12          | 01 Mar 2015 15:00 |
|      93 |     35 |       0.75 | 0.09          | 18 Jan 2015 09:00 |
|      92 |     31 |       0.25 | 0.12          | 03 Feb 2014 10:00 |
|      87 |   26.8 |        0.5 | 0.09          | 01 Mar 2015 16:00 |
|      85 |     55 |       0.75 | 0.20          | 23 Dec 2015 17:00 |
|      85 |     20 |       0.62 | 0.01          | 06 Jan 2015 11:00 |
|      83 |   19.6 |       0.41 | 0.04          | 05 Mar 2015 13:00 |
|      80 |     20 |       0.37 | 0.02          | 06 Jan 2015 10:00 |
|      76 |   26.5 |       0.25 | 0.06          | 05 Mar 2015 09:00 |
|      71 |     26 |       0.25 | 0.09          | 05 Mar 2015 10:00 |
|      71 |   24.2 |       0.25 | 0.04          | 05 Mar 2015 11:00 |

What about driving in icy conditions?

Water vapor can desublimate to ice at 40 degrees F; this query looks at temperatures between 0 and 40.5 degrees F.

SELECT COUNT(DISTINCT c.COLLISION_ID) AS Crashes,
truncate(avg(w.HourlyDryBulbTemperature), 1) AS Temp_F,
truncate(avg(w.HourlyVisibility), 2) AS Visibility,
max(w.HourlyPrecipitation) AS Precipitation,
date_format((date_trunc("hour", c.CRASH_DATE)), '%d %b %Y %H:%i') AS Hour
FROM crashdata c
LEFT JOIN weatherdata w
ON date_trunc("hour", c.CRASH_DATE)=date_trunc("hour", w.DATE)
WHERE w.HourlyDryBulbTemperature BETWEEN 0.0 AND 40.5
GROUP BY Hour
ORDER BY Crashes DESC
LIMIT 100;

The results for freezing temperatures surprised me a little; I did not expect much traffic on a Sunday morning in the city on a cold January day. A quick look at weather.com showed that there was a big storm with many crashes that day, just like what can be seen in the data.

+---------+--------+------------+---------------+-------------------+
| Crashes | Temp_F | Visibility | Precipitation | Hour              |
+---------+--------+------------+---------------+-------------------+
|     192 |     34 |        1.5 | 0.09          | 18 Jan 2015 08:00 |
|     170 |     21 |       NULL |               | 21 Jan 2014 10:00 |
|     145 |     19 |       NULL |               | 21 Jan 2014 11:00 |
|     138 |   33.5 |          5 | 0.02          | 18 Jan 2015 07:00 |
|     137 |     21 |       NULL |               | 21 Jan 2014 09:00 |
|     129 |     32 |       0.25 | 0.12          | 03 Feb 2014 08:00 |
|     114 |     32 |       0.25 | 0.12          | 03 Feb 2014 09:00 |
|     104 |     23 |        0.7 | 0.04          | 09 Jan 2015 08:00 |
|      98 |     16 |          8 | 0.00          | 06 Mar 2015 08:00 |
|      96 |   26.3 |       0.33 | 0.07          | 01 Mar 2015 14:00 |

Drive carefully!

exit

Cleanup

Run this command if you are finished and would like to remove the StarRocks cluster and the StarRocks operator.

helm delete starrocks

Summary

In this tutorial you:

  • Deployed StarRocks with Helm and the StarRocks Operator
  • Loaded crash data provided by New York City and weather data provided by NOAA
  • Analyzed the data using SQL JOINs to find out that driving in low visibility or icy streets is a bad idea

There is more to learn; we intentionally glossed over the data transformation done during the Stream Load. The details on that are in the notes on the curl commands below.


Notes on the curl commands

StarRocks Stream Load and curl take many arguments. Only the ones used in this tutorial are described here; the rest are linked to in the More information section.

--location-trusted

This configures curl to pass credentials to any redirected URLs.

-u root

The username used to log in to StarRocks.

-T filename

T is for transfer; this is the name of the file to transfer.

label:name-num

The label to associate with this Stream Load job. The label must be unique, so if you run the job multiple times you can add a number and keep incrementing that.
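
Instead of incrementing the number by hand, the label can be generated. This is only a sketch of one possible convention, not something the tutorial's commands rely on: it appends the current time, to the second, to the table name. The variable names table and label are illustrative.

```shell
# Build a label from the table name plus the current time to the second.
table="crashdata"
label="${table}-$(date +%Y%m%d%H%M%S)"

# This is the header value you would pass to curl with -H.
echo "label:${label}"
```

You would then pass -H "label:${label}" in place of the fixed -H "label:crashdata-0". Two runs started within the same second would still produce the same label, so this suits interactive use rather than tight loops.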

column_separator:,

If the file uses a comma as its delimiter, set it as shown above; if it uses a different delimiter, set that delimiter here. Common choices are tab (\t), comma (,), and pipe (|).

skip_header:1

Some CSV files have a single header row with all of the column names listed, and some add a second line with datatypes. Set skip_header to 1 or 2 if you have one or two header lines, and set it to 0 if you have none.

enclose:\"

It is common to enclose strings that contain embedded commas with double-quotes. The sample datasets used in this tutorial have geo locations that contain commas and so the enclose setting is set to \". Remember to escape the " with a \.
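
To see why the enclose setting is needed, the sketch below counts the fields in the truncated sample record from this tutorial in two ways: by splitting on every comma, and by ignoring commas that sit between double quotes. It is written in plain shell and awk purely for illustration and says nothing about how StarRocks parses the file internally.

```shell
# The truncated sample record used in this tutorial.
record='08/05/2014,9:10,BRONX,10469,40.8733019,-73.8536375,"(40.8733019, -73.8536375)",'

# Naive count: every comma is treated as a separator.
naive=$(printf '%s\n' "$record" | awk -F',' '{ print NF }')

# Quote-aware count: commas between double quotes are not separators.
quoted=$(printf '%s\n' "$record" | awk '{
  n = 1; inq = 0
  for (i = 1; i <= length($0); i++) {
    c = substr($0, i, 1)
    if (c == "\"") inq = !inq
    else if (c == "," && !inq) n++
  }
  print n
}')

echo "fields without enclose handling: $naive"
echo "fields with enclose handling:    $quoted"
```

The naive split finds one extra field because it cuts the LOCATION value in two, which would shift every later column by one position.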

max_filter_ratio:1

This allows some errors in the data. Ideally this would be set to 0 and the job would fail with any errors. It is set to 1 to allow all rows to fail during debugging.
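
To judge how far a load is from a stricter setting, you can compute the ratio of filtered rows from the Stream Load response. The sketch below assumes you have the response text in a shell variable; here it holds an abbreviated copy of the crash data response shown earlier. The field helper is hypothetical, and it uses sed only so that the example does not depend on a JSON tool being installed.

```shell
# Abbreviated copy of the crash data response shown earlier in this tutorial.
response='{
"NumberTotalRows": 423726,
"NumberLoadedRows": 423725,
"NumberFilteredRows": 1
}'

# Hypothetical helper: print the numeric value stored under one key.
field() {
  printf '%s\n' "$response" | sed -n "s/.*\"$1\": *\([0-9][0-9]*\).*/\1/p"
}

total=$(field NumberTotalRows)
filtered=$(field NumberFilteredRows)

# Fraction of rows that were filtered out during the load.
ratio=$(awk -v f="$filtered" -v t="$total" 'BEGIN { printf "%.8f", f / t }')
echo "filtered $filtered of $total rows (ratio $ratio)"
```

For this load the ratio is tiny, so a max_filter_ratio far below 1 would presumably still have let the job succeed.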

columns:

The mapping of CSV file columns to StarRocks table columns. You will notice that there are many more columns in the CSV files than columns in the table. Any columns that are not included in the table are skipped.

You will also notice that there is some transformation of data included in the columns: line for the crash dataset. It is very common to find dates and times in CSV files that do not conform to standards. This is the logic for converting the CSV data for the time and date of the crash to a DATETIME type:

The columns line

This is the beginning of one data record. The date is in MM/DD/YYYY format, and the time is HH:MI. Since DATETIME is generally YYYY-MM-DD HH:MI:SS we need to transform this data.

08/05/2014,9:10,BRONX,10469,40.8733019,-73.8536375,"(40.8733019, -73.8536375)",

This is the beginning of the columns: parameter:

-H "columns:tmp_CRASH_DATE, tmp_CRASH_TIME, CRASH_DATE=str_to_date(concat_ws(' ', tmp_CRASH_DATE, tmp_CRASH_TIME), '%m/%d/%Y %H:%i')

This instructs StarRocks to:

  • Assign the content of the first column of the CSV file to tmp_CRASH_DATE
  • Assign the content of the second column of the CSV file to tmp_CRASH_TIME
  • Concatenate tmp_CRASH_DATE and tmp_CRASH_TIME with a space between them using concat_ws()
  • Create a DATETIME from the concatenated string using str_to_date()
  • Store the resulting DATETIME in the column CRASH_DATE
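
The steps above can be mimicked outside StarRocks to see what they do to the sample record. The sketch below is a shell and awk imitation of the two functions, not the StarRocks implementation: it joins the two temporary values with a space, then reorders the pieces into the YYYY-MM-DD HH:MI:SS layout. The variable names mirror the ones in the columns: parameter.

```shell
# Values taken from the first two fields of the sample record.
tmp_CRASH_DATE='08/05/2014'
tmp_CRASH_TIME='9:10'

# concat_ws(' ', ...) joins the two values with a single space.
joined="$tmp_CRASH_DATE $tmp_CRASH_TIME"

# Imitate str_to_date(..., '%m/%d/%Y %H:%i'): split on "/", " " and ":",
# then print year, month, day, hour and minute in DATETIME order.
crash_date=$(printf '%s\n' "$joined" |
  awk -F'[/ :]' '{ printf "%04d-%02d-%02d %02d:%02d:00", $3, $1, $2, $4, $5 }')

echo "$crash_date"
```

The script prints 2014-08-05 09:10:00, which is the shape of value the CRASH_DATE column expects.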

More information

Default values.yaml

Stream Load

The Motor Vehicle Collisions - Crashes dataset is provided by New York City subject to these terms of use and privacy policy.

The Local Climatological Data (LCD) is provided by NOAA with this disclaimer and this privacy policy.

Helm is a package manager for Kubernetes. A Helm Chart is a Helm package and contains all of the resource definitions necessary to run an application on a Kubernetes cluster.

starrocks-kubernetes-operator and kube-starrocks Helm Chart.