Aiven for Apache Kafka® metrics available via Prometheus
The following list contains only the most common metrics available via Prometheus for an Aiven for Apache Kafka® service.
You can retrieve the complete list of available metrics for your specific service by requesting the Prometheus endpoint, substituting:
- the Aiven project certificate (ca.pem)
- the Prometheus credentials (<PROMETHEUS_USER>:<PROMETHEUS_PASSWORD>)
- the Aiven for Apache Kafka hostname (<KAFKA_HOSTNAME>)
- the Prometheus port (<PROMETHEUS_PORT>)
curl --cacert ca.pem \
--user '<PROMETHEUS_USER>:<PROMETHEUS_PASSWORD>' \
'https://<KAFKA_HOSTNAME>:<PROMETHEUS_PORT>/metrics'
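The response can be long, because each metric name is repeated once for every label combination. Below is a minimal sketch that reduces it to a sorted list of distinct names. list_metric_names is a hypothetical helper. It assumes the endpoint returns the standard Prometheus text exposition format, where comment lines start with # and any labels follow the metric name in braces.

```shell
# list_metric_names (hypothetical helper): read Prometheus text exposition
# on stdin and print each distinct metric name once, sorted.
list_metric_names() {
  # Drop "# HELP" / "# TYPE" comments and blank lines, then cut each sample
  # at the first "{" (start of labels) or space (start of the value).
  grep -v -e '^#' -e '^[[:space:]]*$' | sed 's/[{ ].*//' | sort -u
}

# Example (needs network access and your own values for the placeholders):
#   curl --cacert ca.pem --user '<PROMETHEUS_USER>:<PROMETHEUS_PASSWORD>' \
#     'https://<KAFKA_HOSTNAME>:<PROMETHEUS_PORT>/metrics' | list_metric_names
```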
You can check how to use Prometheus with Aiven in the dedicated document.
CPU utilization
- cpu_usage_guest: CPU time spent running a virtual CPU for guest operating systems.
- cpu_usage_guest_nice: The amount of time the CPU runs a virtual CPU for a guest operating system, which is low-priority and can be interrupted by other processes. This metric is measured in hundredths of a second.
- cpu_usage_idle: Time the CPU spends doing nothing.
- cpu_usage_iowait: Time waiting for I/O to complete.
- cpu_usage_irq: Time servicing interrupts.
- cpu_usage_nice: Time running user-niced processes.
- cpu_usage_softirq: Time servicing softirqs.
- cpu_usage_steal: Time spent in other operating systems when running in a virtualized environment.
- cpu_usage_system: Time spent running system processes.
- cpu_usage_user: Time spent running user processes.
- system_load1: System load average for the last minute.
- system_load15: System load average for the last 15 minutes.
- system_load5: System load average for the last 5 minutes.
- system_n_cpus: Number of CPU cores available.
- system_n_users: Number of users logged in.
- system_uptime: Time for which the system has been up and running.
Disk space utilization
- disk_free: Amount of free disk space.
- disk_inodes_free: Number of free inodes.
- disk_inodes_total: Total number of inodes.
- disk_inodes_used: Number of used inodes.
- disk_total: Total disk space.
- disk_used: Amount of used disk space.
- disk_used_percent: Percentage of disk space used.
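The list has a percentage for disk space (disk_used_percent) but none for inodes. Inode usage per filesystem can be derived as 100 * disk_inodes_used / disk_inodes_total. The helper below is a hypothetical sketch. It assumes each disk series carries a path label identifying the mount point and has no trailing timestamp, so check the label names in your own /metrics output.

```shell
# inode_used_percent (hypothetical helper): read Prometheus text exposition
# on stdin and print "<path> <percent>" per filesystem, computed as
# 100 * disk_inodes_used / disk_inodes_total.
inode_used_percent() {
  awk '
    /^disk_inodes_(used|total)[{ ]/ {
      path = "(none)"
      # Assumed label name: path (value starts after the 7 chars of {path=")
      if (match($0, /[{,]path="[^"]*"/)) path = substr($0, RSTART + 7, RLENGTH - 8)
      name = $1
      sub(/[{].*/, "", name)            # strip labels, keep the metric name
      if (name == "disk_inodes_used") used[path] = $NF
      else total[path] = $NF            # value is assumed to be the last field
    }
    END {
      for (p in total)
        if (total[p] > 0) printf "%s %.1f\n", p, 100 * used[p] / total[p]
    }' | sort
}
```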
Disk input and output
Metrics such as diskio_io_time, diskio_iops_in_progress, etc., offer valuable insights into disk I/O operations. These metrics encompass read/write operations, the duration of these operations, bytes read/written, and more.
- diskio_io_time
- diskio_iops_in_progress
- diskio_merged_reads
- diskio_merged_writes
- diskio_read_bytes
- diskio_read_time
- diskio_reads
- diskio_weighted_io_time
- diskio_write_bytes
- diskio_write_time
- diskio_writes
Garbage collector MXBean
Metrics associated with the java_lang_GarbageCollector provide insights into the JVM's garbage collection process. These metrics encompass details such as the collection count, duration of collections, and more.
- java_lang_GarbageCollector_G1_Young_Generation_CollectionCount: returns the total number of collections that have occurred
- java_lang_GarbageCollector_G1_Young_Generation_CollectionTime: returns the approximate accumulated collection elapsed time in milliseconds
- java_lang_GarbageCollector_G1_Young_Generation_duration
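CollectionCount and CollectionTime are both cumulative. The average time per young-generation collection over an interval is therefore the change in CollectionTime divided by the change in CollectionCount between two scrapes. The following is a minimal sketch with a hypothetical helper that takes the four scraped values as arguments.

```shell
# avg_gc_time_ms COUNT_1 TIME_MS_1 COUNT_2 TIME_MS_2 (hypothetical helper)
# Average milliseconds per collection between two scrapes of
# ..._CollectionCount and ..._CollectionTime.
avg_gc_time_ms() {
  awk -v c1="$1" -v t1="$2" -v c2="$3" -v t2="$4" 'BEGIN {
    n = c2 - c1
    # No new collections in the interval (or a counter reset): no average.
    if (n <= 0) { print "n/a"; exit }
    printf "%.1f\n", (t2 - t1) / n
  }'
}
```

For example, 250 ms of extra collection time across 10 new collections gives an average of 25.0 ms per collection.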
Memory usage
Metrics starting with java_lang_Memory provide insights into the JVM's memory usage, such as committed memory, initial memory, max memory, used memory, etc.
- java_lang_Memory_committed: returns the amount of memory in bytes that is committed for the Java virtual machine to use
- java_lang_Memory_init: returns the amount of memory in bytes that the Java virtual machine initially requests from the operating system for memory management
- java_lang_Memory_max: returns the maximum amount of memory in bytes that can be used for memory management
- java_lang_Memory_used: returns the amount of used memory in bytes
- java_lang_Memory_ObjectPendingFinalizationCount
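A common derived value is memory utilization, calculated as used divided by max. The JVM's MemoryUsage.getMax() returns -1 when no maximum is defined, so the ratio is only meaningful for a positive max. The helper below is a hypothetical sketch. It assumes you have already selected the heap series, because how heap and non-heap values are distinguished in the exported labels may vary. Check your /metrics output.

```shell
# heap_used_percent USED_BYTES MAX_BYTES (hypothetical helper)
heap_used_percent() {
  awk -v used="$1" -v max="$2" 'BEGIN {
    # MemoryUsage.getMax() reports -1 when no maximum is defined,
    # so a percentage only makes sense for a positive max.
    if (max <= 0) { print "undefined"; exit }
    printf "%.1f\n", 100 * used / max
  }'
}
```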
Apache Kafka Connect
The Apache Kafka Connect metrics list is available on the dedicated page.
Apache Kafka broker
Descriptions of the metrics below are available in the Monitoring section of the Apache Kafka documentation.
The metrics with a _Count suffix are cumulative counters for the given
metric, for example,
kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Count.
Note that a metric like
kafka_server_BrokerTopicMetrics_MessagesInPerSec_Count is a cumulative
count of incoming messages despite the PerSec suffix in the metric
name.
To get the per-second rate of change of these _Count metrics, apply a
function such as rate() in PromQL.
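The following hypothetical shell helper illustrates what rate() does with a _Count metric by computing a per-second rate from two samples of the same counter. It is a simplified sketch. Like rate(), it treats a decrease as a counter reset, but it does not extrapolate to the edges of a time window the way Prometheus does.

```shell
# counter_rate V1 T1 V2 T2 (hypothetical helper)
# Per-second rate of a cumulative _Count metric from two samples:
# values V1 and V2 taken at Unix timestamps T1 and T2 (seconds).
counter_rate() {
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" 'BEGIN {
    if (t2 <= t1) { print "counter_rate: T2 must be later than T1" > "/dev/stderr"; exit 1 }
    delta = v2 - v1
    # A drop in value means the counter was reset (for example, after a
    # restart); like PromQL rate(), count the new value from zero.
    if (delta < 0) delta = v2
    printf "%.2f\n", delta / (t2 - t1)
  }'
}
```

In PromQL, the corresponding query is along the lines of rate(kafka_server_BrokerTopicMetrics_MessagesInPerSec_Count[5m]).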
Apache Kafka controller
Metrics named kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_XthPercentile (where X is 50, 75, 95, 98, 99, or 999, the last meaning the 99.9th percentile) represent the time taken for leader elections to complete at various percentiles. They help you understand the distribution of leader election times.
Metrics named kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_ followed by FifteenMinuteRate, FiveMinuteRate, OneMinuteRate, or MeanRate represent the rate of leader elections over different time intervals.
Metrics named kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_ followed by Max, Mean, Min, or StdDev provide statistical measures of leader election times.
Metrics prefixed with kafka_controller_KafkaController_ describe the state of the Kafka controller, such as the number of active brokers, offline partitions, and replicas to delete.
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_50thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_75thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_95thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_98thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_999thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_99thPercentile
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Count: The total number of leader elections.
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_FifteenMinuteRate
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_FiveMinuteRate
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Max
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Mean
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_MeanRate
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_Min
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_OneMinuteRate
- kafka_controller_ControllerStats_LeaderElectionRateAndTimeMs_StdDev
- kafka_controller_ControllerStats_UncleanLeaderElectionsPerSec_Count: Number of times an unclean leader election occurs. Unclean leader elections can lead to data loss.
- kafka_controller_KafkaController_ActiveBrokerCount_Value
- kafka_controller_KafkaController_ActiveControllerCount_Value
- kafka_controller_KafkaController_FencedBrokerCount_Value
- kafka_controller_KafkaController_OfflinePartitionsCount_Value
- kafka_controller_KafkaController_PreferredReplicaImbalanceCount_Value
- kafka_controller_KafkaController_ReplicasIneligibleToDeleteCount_Value
- kafka_controller_KafkaController_ReplicasToDeleteCount_Value
- kafka_controller_KafkaController_TopicsIneligibleToDeleteCount_Value
- kafka_controller_KafkaController_TopicsToDeleteCount_Value
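The Apache Kafka monitoring documentation states that only one broker in a cluster should report an ActiveControllerCount of 1, and that the expected value of OfflinePartitionsCount is 0. The hypothetical helper below sketches that check against scraped output. It assumes the /metrics output of every broker has been concatenated on stdin and that samples carry no trailing timestamp, so the value is the last field.

```shell
# controller_health (hypothetical helper): sum ActiveControllerCount and
# OfflinePartitionsCount over all input samples, print both totals, and
# exit 0 only if exactly one active controller and no offline partitions.
controller_health() {
  awk '
    /^kafka_controller_KafkaController_ActiveControllerCount_Value[{ ]/ { active += $NF }
    /^kafka_controller_KafkaController_OfflinePartitionsCount_Value[{ ]/ { offline += $NF }
    END {
      printf "active_controllers=%d offline_partitions=%d\n", active, offline
      exit !(active == 1 && offline == 0)
    }'
}
```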
Jolokia collector collect time
kafka_jolokia_collector_collect_time: Represents the time taken by the Jolokia collector to collect metrics. Jolokia is a JMX-HTTP bridge, giving an alternative to native JMX access.
Apache Kafka log
Metrics such as kafka_log_LogCleaner_cleaner_recopy_percent_Value and
kafka_log_LogCleanerManager_time_since_last_run_ms_Value describe the
operation of the log cleaner, which compacts the logs of topics that
use log compaction.
Log flush metrics describe log flush operations. Flushing ensures that
data is written from memory to disk. Metrics such as
kafka_log_LogFlushStats_LogFlushRateAndTimeMs_XthPercentile report the
time taken to flush logs at various percentiles.
- kafka_log_LogCleaner_cleaner_recopy_percent_Value
- kafka_log_LogCleanerManager_time_since_last_run_ms_Value
- kafka_log_LogCleaner_max_clean_time_secs_Value
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_50thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_75thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_95thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_98thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_999thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_99thPercentile
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Count
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_FifteenMinuteRate
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_FiveMinuteRate
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Max
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Mean
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_MeanRate
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_Min
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_OneMinuteRate
- kafka_log_LogFlushStats_LogFlushRateAndTimeMs_StdDev
- kafka_log_Log_LogEndOffset_Value
- kafka_log_Log_LogStartOffset_Value
- kafka_log_Log_Size_Value
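kafka_log_Log_Size_Value is reported per partition. The sketch below sums it per topic. It uses a hypothetical helper and assumes each series has a topic label and no trailing timestamp. If the input contains scrapes from several brokers, every replica is counted, so the result reflects disk usage rather than the size of a single copy of the topic.

```shell
# log_size_by_topic (hypothetical helper): read Prometheus text exposition
# on stdin and print "<topic> <bytes>" with kafka_log_Log_Size_Value summed
# over all partitions (and replicas) of each topic.
log_size_by_topic() {
  awk '
    /^kafka_log_Log_Size_Value[{ ]/ {
      topic = "(none)"
      # Assumed label name: topic (value starts after the 8 chars of {topic=")
      if (match($0, /[{,]topic="[^"]*"/)) topic = substr($0, RSTART + 8, RLENGTH - 9)
      bytes[topic] += $NF               # value is assumed to be the last field
    }
    END { for (t in bytes) printf "%s %.0f\n", t, bytes[t] }' | sort
}
```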
Apache Kafka network
Metrics such as kafka_network_RequestMetrics_RequestsPerSec_Count
and kafka_network_RequestMetrics_TotalTimeMs_Mean provide insights
into the requests made to the Kafka brokers.
- kafka_network_RequestChannel_RequestQueueSize_Value
- kafka_network_RequestChannel_ResponseQueueSize_Value
- kafka_network_RequestMetrics_RequestsPerSec_Count
- kafka_network_RequestMetrics_TotalTimeMs_95thPercentile
- kafka_network_RequestMetrics_TotalTimeMs_Count
- kafka_network_RequestMetrics_TotalTimeMs_Mean
- kafka_network_SocketServer_NetworkProcessorAvgIdlePercent_Value
Apache Kafka server
BrokerTopicMetrics metrics provide insights into topic-level
operations, such as bytes in and out and failed fetch and produce
requests.
ReplicaManager metrics, such as
kafka_server_ReplicaManager_LeaderCount_Value, provide insights into
the state of replicas in the Kafka cluster.
If you do not specify the topic label, the query returns both the
combined series for all topics and the series for each individual
topic. To view values for specific topics, use the topic label. To
exclude the combined series and list only the metrics for individual
topics, filter with topic!="".
- kafka_server_BrokerTopicMetrics_BytesInPerSec_Count: Bytes in (from clients), per topic. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_BrokerTopicMetrics_BytesOutPerSec_Count: Bytes out (to clients), per topic. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_BrokerTopicMetrics_BytesRejectedPerSec_Count: Bytes rejected per topic because the record batch size was greater than the max.message.bytes configuration. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_BrokerTopicMetrics_FailedFetchRequestsPerSec_Count: Failed Fetch requests (from clients or followers), per topic. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_BrokerTopicMetrics_FailedProduceRequestsPerSec_Count: Failed Produce requests, per topic. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_BrokerTopicMetrics_FetchMessageConversionsPerSec_Count: Message format conversions for Fetch requests, per topic. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_BrokerTopicMetrics_MessagesInPerSec_Count: Incoming messages, per topic. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_BrokerTopicMetrics_ProduceMessageConversionsPerSec_Count: Message format conversions for Produce requests, per topic. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_BrokerTopicMetrics_ReassignmentBytesInPerSec_Count: Incoming bytes of reassignment traffic.
- kafka_server_BrokerTopicMetrics_ReassignmentBytesOutPerSec_Count: Outgoing bytes of reassignment traffic.
- kafka_server_BrokerTopicMetrics_ReplicationBytesInPerSec_Count: Bytes in (from other brokers), per topic. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_BrokerTopicMetrics_ReplicationBytesOutPerSec_Count: Bytes out (to other brokers), per topic. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_BrokerTopicMetrics_TotalFetchRequestsPerSec_Count: Fetch requests (from clients or followers), per topic. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_BrokerTopicMetrics_TotalProduceRequestsPerSec_Count: Produce requests, per topic. Omitting topic=(...) yields the value for all topics combined.
- kafka_server_DelayedOperationPurgatory_NumDelayedOperations_Value
- kafka_server_DelayedOperationPurgatory_PurgatorySize_Value
- kafka_server_KafkaRequestHandlerPool_RequestHandlerAvgIdlePercent_OneMinuteRate
- kafka_server_KafkaServer_BrokerState_Value
- kafka_server_ReplicaManager_IsrExpandsPerSec_Count
- kafka_server_ReplicaManager_IsrShrinksPerSec_Count
- kafka_server_ReplicaManager_LeaderCount_Value
- kafka_server_ReplicaManager_PartitionCount_Value
- kafka_server_ReplicaManager_UnderMinIsrPartitionCount_Value
- kafka_server_ReplicaManager_UnderReplicatedPartitions_Value
- kafka_server_group_coordinator_metrics_group_completed_rebalance_count
- kafka_server_group_coordinator_metrics_group_completed_rebalance_rate
- kafka_server_group_coordinator_metrics_offset_commit_count
- kafka_server_group_coordinator_metrics_offset_commit_rate
- kafka_server_group_coordinator_metrics_offset_deletion_count
- kafka_server_group_coordinator_metrics_offset_deletion_rate
- kafka_server_group_coordinator_metrics_offset_expiration_count
- kafka_server_group_coordinator_metrics_offset_expiration_rate
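The topic filtering described above can also be applied to raw /metrics output. The hypothetical helper below keeps either the per-topic series of one metric or the combined series. Per-topic series are those with a non-empty topic label, like topic!="" in PromQL. The combined series has no topic label. The helper assumes the label is literally named topic.

```shell
# split_topic_series METRIC MODE (hypothetical helper)
# Read Prometheus text exposition on stdin and print the samples of METRIC
# that carry a non-empty topic label (MODE=per-topic) or that carry none
# (any other MODE, for example all-topics).
split_topic_series() {
  awk -v m="$1" -v mode="$2" '
    # Exact name match: METRIC must be followed by "{" (labels) or a space.
    index($0, m) == 1 && substr($0, length(m) + 1, 1) ~ /[{ ]/ {
      has_topic = (match($0, /[{,]topic="[^"]+"/) > 0)
      if ((mode == "per-topic") == has_topic) print
    }'
}
```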
Kernel
Metrics below, like kernel_boot_time, kernel_context_switches, etc.,
provide insights into the underlying system's kernel operations.
- kernel_boot_time
- kernel_context_switches
- kernel_entropy_avail
- kernel_interrupts
- kernel_processes_forked
Generic memory
Metrics like mem_active, mem_available, etc., provide insights into
the system's memory usage.
- mem_active
- mem_available
- mem_available_percent
- mem_buffered
- mem_cached
- mem_commit_limit
- mem_committed_as
- mem_dirty
- mem_free
- mem_high_free
- mem_high_total
- mem_huge_pages_free
- mem_huge_page_size
- mem_huge_pages_total
- mem_inactive
- mem_low_free
- mem_low_total
- mem_mapped
- mem_page_tables
- mem_shared
- mem_slab
- mem_swap_cached
- mem_swap_free
- mem_swap_total
- mem_total
- mem_used
- mem_used_percent
- mem_vmalloc_chunk
- mem_vmalloc_total
- mem_vmalloc_used
- mem_wired
- mem_write_back
- mem_write_back_tmp
Network
Metrics like net_bytes_recv, net_packets_sent, etc., provide
insights into the system's network operations.
- net_bytes_recv
- net_bytes_sent
- net_drop_in
- net_drop_out
- net_err_in
- net_err_out
- net_icmp_inaddrmaskreps
- net_icmp_inaddrmasks
- net_icmp_incsumerrors
- net_icmp_indestunreachs
- net_icmp_inechoreps
- net_icmp_inechos
- net_icmp_inerrors
- net_icmp_inmsgs
- net_icmp_inparmprobs
- net_icmp_inredirects
- net_icmp_insrcquenchs
- net_icmp_intimeexcds
- net_icmp_intimestampreps
- net_icmp_intimestamps
- net_icmpmsg_intype3
- net_icmpmsg_intype8
- net_icmpmsg_outtype0
- net_icmpmsg_outtype3
- net_icmp_outaddrmaskreps
- net_icmp_outaddrmasks
- net_icmp_outdestunreachs
- net_icmp_outechoreps
- net_icmp_outechos
- net_icmp_outerrors
- net_icmp_outmsgs
- net_icmp_outparmprobs
- net_icmp_outredirects
- net_icmp_outsrcquenchs
- net_icmp_outtimeexcds
- net_icmp_outtimestampreps
- net_icmp_outtimestamps
- net_ip_defaultttl
- net_ip_forwarding
- net_ip_forwdatagrams
- net_ip_fragcreates
- net_ip_fragfails
- net_ip_fragoks
- net_ip_inaddrerrors
- net_ip_indelivers
- net_ip_indiscards
- net_ip_inhdrerrors
- net_ip_inreceives
- net_ip_inunknownprotos
- net_ip_outdiscards
- net_ip_outnoroutes
- net_ip_outrequests
- net_ip_reasmfails
- net_ip_reasmoks
- net_ip_reasmreqds
- net_ip_reasmtimeout
- net_packets_recv
- net_packets_sent
- netstat_tcp_close
- netstat_tcp_close_wait
- netstat_tcp_closing
- netstat_tcp_established
- netstat_tcp_fin_wait1
- netstat_tcp_fin_wait2
- netstat_tcp_last_ack
- netstat_tcp_listen
- netstat_tcp_none
- netstat_tcp_syn_recv
- netstat_tcp_syn_sent
- netstat_tcp_time_wait
- netstat_udp_socket
- net_tcp_activeopens
- net_tcp_attemptfails
- net_tcp_currestab
- net_tcp_estabresets
- net_tcp_incsumerrors
- net_tcp_inerrs
- net_tcp_insegs
- net_tcp_maxconn
- net_tcp_outrsts
- net_tcp_outsegs
- net_tcp_passiveopens
- net_tcp_retranssegs
- net_tcp_rtoalgorithm
- net_tcp_rtomax
- net_tcp_rtomin
- net_udp_ignoredmulti
- net_udp_incsumerrors
- net_udp_indatagrams
- net_udp_inerrors
- net_udplite_ignoredmulti
- net_udplite_incsumerrors
- net_udplite_indatagrams
- net_udplite_inerrors
- net_udplite_noports
- net_udplite_outdatagrams
- net_udplite_rcvbuferrors
- net_udplite_sndbuferrors
- net_udp_noports
- net_udp_outdatagrams
- net_udp_rcvbuferrors
- net_udp_sndbuferrors
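Many of these are cumulative counters. One derived value often used to spot network trouble is the TCP retransmission percentage, calculated as the increase in net_tcp_retranssegs divided by the increase in net_tcp_outsegs over the same interval. The helper below is a hypothetical sketch of that arithmetic for two scrapes. Treat the result as an indicator rather than an exact ratio.

```shell
# tcp_retransmit_percent OUTSEGS_1 RETRANS_1 OUTSEGS_2 RETRANS_2 (hypothetical helper)
# Percentage of retransmitted TCP segments between two scrapes of
# net_tcp_outsegs and net_tcp_retranssegs.
tcp_retransmit_percent() {
  awk -v o1="$1" -v r1="$2" -v o2="$3" -v r2="$4" 'BEGIN {
    sent = o2 - o1
    # No segments sent in the interval (or a counter reset): no ratio.
    if (sent <= 0) { print "n/a"; exit }
    printf "%.2f\n", 100 * (r2 - r1) / sent
  }'
}
```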
Processes
Metrics like processes_running, processes_zombies, etc., provide
insights into the system's process management.
- processes_blocked
- processes_dead
- processes_idle
- processes_paging
- processes_running
- processes_sleeping
- processes_stopped
- processes_total
- processes_total_threads
- processes_unknown
- processes_zombies
Swap usage
Metrics like swap_free, swap_used, etc., provide insights into the
system's swap memory usage.
- swap_free
- swap_in
- swap_out
- swap_total
- swap_used
- swap_used_percent