Diagnostic Commands

Before troubleshooting, gather diagnostic information:
# Check operator pod status
kubectl get pods -n kestrel-ai -l app=kestrel-operator

# View operator logs
kubectl logs -n kestrel-ai -l app=kestrel-operator --tail=100

# Describe pod for events
kubectl describe pod -n kestrel-ai -l app=kestrel-operator

# Check service account permissions
kubectl auth can-i --list --as=system:serviceaccount:kestrel-ai:kestrel-operator

Common Issues

Operator Not Starting

Pod in CrashLoopBackOff

Symptoms:
  • Pod repeatedly restarts
  • Status shows CrashLoopBackOff
Check logs:
kubectl logs -n kestrel-ai -l app=kestrel-operator --previous
Common causes and solutions:
Error message:
Error: authentication failed: invalid token
Solution:
  1. Verify token is correctly set in values file
  2. Ensure token hasn’t been truncated
  3. Generate a new token from the dashboard if needed
# Check current token (first few characters)
kubectl get secret -n kestrel-ai kestrel-operator -o jsonpath='{.data.token}' | base64 -d | head -c 20
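One quick way to rule out truncation is to decode the full token and check it for stray whitespace; a newline or space in the middle usually means it was mangled during copy/paste. A small sketch (the token value below is a placeholder, not a real token format):

```shell
# Hypothetical sanity check for a decoded token (example value below).
token="kst_example_token_value"   # substitute the full decoded token from the secret

# A valid token should be a single line with no embedded whitespace.
if printf '%s' "$token" | grep -q '[[:space:]]'; then
  echo "token contains whitespace: re-copy it from the dashboard"
else
  echo "token format looks ok (${#token} characters)"
fi
```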
Error message:
Failed to connect to server: connection refused
Solution:
  1. Check network policies or firewalls
  2. Verify egress is allowed to grpc.platform.usekestrel.ai:443
  3. Test connectivity:
kubectl run test-connection --rm -i --tty --image=busybox -- sh
# Inside the pod:
nc -zv grpc.platform.usekestrel.ai 443
Error message:
Error creating informer: forbidden: User "system:serviceaccount:kestrel-ai:kestrel-operator" cannot list resource
Solution:
  1. Verify RBAC is properly configured
  2. Check ClusterRole and ClusterRoleBinding:
kubectl get clusterrole kestrel-operator
kubectl get clusterrolebinding kestrel-operator

Pod in Pending State

Symptoms:
  • Pod stays in Pending status
  • Not scheduled to any node
Solutions:
  1. Check resource availability:
kubectl describe nodes | grep -A 5 "Allocated resources"
  2. Review pod events:
kubectl describe pod -n kestrel-ai -l app=kestrel-operator | grep -A 10 Events
  3. Adjust resource requests if needed:
resources:
  requests:
    cpu: 100m      # Lower CPU request
    memory: 256Mi  # Lower memory request

Connection Issues

Operator Shows as Offline

Symptoms:
  • Dashboard shows the cluster as offline, but the pod is running
Diagnostic steps:
  1. Check operator logs for connection errors:
kubectl logs -n kestrel-ai -l app=kestrel-operator | grep -i "error\|failed\|connection"
  2. Verify gRPC stream health:
kubectl logs -n kestrel-ai -l app=kestrel-operator | grep "stream"
  3. Check liveness probe:
kubectl get pod -n kestrel-ai -l app=kestrel-operator -o json | jq '.items[0].status.conditions'
Common solutions:
  • Restart the operator pod:
kubectl rollout restart deployment/kestrel-operator -n kestrel-ai
  • Verify network connectivity to Kestrel platform
  • Check for proxy or firewall configurations
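If egress goes through a corporate proxy, the operator's container typically needs the standard proxy environment variables set. The values key below (`extraEnv`) is an assumption, not confirmed by the chart; check the chart's values schema for the actual mechanism:

```yaml
# Hypothetical values fragment -- `extraEnv` and the proxy host are illustrative.
operator:
  extraEnv:
    - name: HTTPS_PROXY
      value: "http://proxy.internal.example:3128"
    - name: NO_PROXY
      value: "10.0.0.0/8,.svc,.cluster.local"
```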

Flow Collection Issues

No Flows from Cilium

Symptoms:
  • Operator is connected but no network flows appear
  • Topology map shows no flows
Diagnostic steps:
  1. Verify Hubble is enabled:
cilium hubble status
  2. Check Hubble Relay is running:
kubectl get pods -n kube-system -l k8s-app=hubble-relay
  3. Test Hubble connectivity:
kubectl exec -n kestrel-ai deployment/kestrel-operator -- nc -zv hubble-relay.kube-system.svc.cluster.local 4245
Solutions:
  • Enable Hubble in Cilium:
cilium hubble enable --relay
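If Hubble is enabled but flows still do not appear, confirm the operator is pointed at the correct relay address. The key name below is a hypothetical sketch (verify against the chart's values schema); the address matches the connectivity test above:

```yaml
# Hypothetical values fragment -- the exact key for the relay address
# depends on the chart; the default Hubble Relay address is shown.
operator:
  cilium:
    enabled: true
    hubbleRelayAddress: "hubble-relay.kube-system.svc.cluster.local:4245"
```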

No Flows from Istio

Symptoms:
  • Istio is enabled but no L7 flows appear on the Topology map
  • ALS endpoint not receiving data
Diagnostic steps:
  1. Verify Istio telemetry configuration:
kubectl get telemetry -A
  2. Check Envoy configuration:
kubectl exec -n <namespace> <pod> -c istio-proxy -- curl -s localhost:15000/config_dump | grep access_log
  3. Verify ALS port is accessible:
kubectl get svc -n kestrel-ai kestrel-operator -o yaml | grep -A 5 ports
Solutions:
  • Configure Istio telemetry to send logs to operator:
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: kestrel-als
  namespace: istio-system
spec:
  accessLogging:
  - providers:
    - name: kestrel
  • Ensure operator service exposes ALS port:
operator:
  istio:
    enabled: true
    alsPort: 8080
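Note that the Telemetry resource above references an access-log provider named kestrel; Istio only recognizes that name if it is registered as an extension provider in MeshConfig. A sketch using Istio's `envoyHttpAls` provider type, assuming the default operator Service name and the alsPort from the values above:

```yaml
# MeshConfig fragment (e.g. via IstioOperator or the istio ConfigMap).
# The service name assumes the default kestrel-operator Service in kestrel-ai.
meshConfig:
  extensionProviders:
    - name: kestrel
      envoyHttpAls:
        service: kestrel-operator.kestrel-ai.svc.cluster.local
        port: 8080
```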

Safe-Apply Issues

Permissions Denied

Symptoms:
  • Safe-apply enabled but cannot apply resources
  • Error: “cannot create/update/delete resource”
Solutions:
  1. Verify safe-apply is enabled in both places:
    • Dashboard: Check cluster safe-apply toggle
    • Helm values: operator.safeApply.enabled: true
  2. Check RBAC permissions:
# List operator permissions
kubectl auth can-i create networkpolicies --as=system:serviceaccount:kestrel-ai:kestrel-operator

# Check ClusterRole
kubectl get clusterrole kestrel-operator -o yaml | grep -A 10 rules
  3. Re-deploy with correct permissions:
helm upgrade kestrel-operator \
  oci://ghcr.io/kestrelai/charts/kestrel-operator \
  --namespace kestrel-ai \
  --set operator.safeApply.enabled=true \
  -f values.yaml
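For reference, the ClusterRole must grant write verbs on every resource type safe-apply manages. A minimal illustrative rule for NetworkPolicy (the shipped chart may grant more; extend for other resource types the operator applies):

```yaml
# Illustrative ClusterRole rule -- not the complete shipped role.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kestrel-operator
rules:
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```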

Getting Help

If you can’t resolve an issue:
  1. Check documentation: Review configuration and onboarding guides
  2. Collect diagnostics: Run the support bundle script
  3. Contact support: Email hello@usekestrel.ai with:
    • Cluster name and ID
    • Support bundle
    • Description of the issue
    • Steps to reproduce

FAQ

Can I run more than one operator in a cluster?
No, only one Kestrel operator should run per cluster. Multiple operators would cause conflicts and duplicate data.
How do I change the cluster name?
The cluster name is tied to the authentication token. To change it:
  1. Generate a new token with the desired name
  2. Update your values file with the new token
  3. Upgrade the helm release
What network ports does the operator use?
Outbound:
  • 443/tcp to grpc.platform.usekestrel.ai (gRPC)
  • 4245/tcp to Hubble Relay (if using Cilium)
Inbound:
  • 8080/tcp from Envoy proxies (if using Istio)
  • 8081/tcp for health checks (internal)
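If the cluster enforces NetworkPolicy itself, the ports above translate into a policy along these lines. This is a sketch: it allows the listed ports broadly, and a default-deny egress policy would also need to permit DNS (53/udp) for the operator to resolve the platform endpoint:

```yaml
# Illustrative NetworkPolicy covering the ports listed above.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: kestrel-operator
  namespace: kestrel-ai
spec:
  podSelector:
    matchLabels:
      app: kestrel-operator
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - ports:
        - protocol: TCP
          port: 8080   # Envoy ALS (Istio)
        - protocol: TCP
          port: 8081   # health checks
  egress:
    - ports:
        - protocol: TCP
          port: 443    # gRPC to the Kestrel platform
        - protocol: TCP
          port: 4245   # Hubble Relay (Cilium)
```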
Can I disable flow collection without removing the operator?
Yes, you can disable flows without removing the operator:
operator:
  cilium:
    disableFlows: true
Then upgrade the helm release.

Next Steps