# Consumer Group Rebalance Protocol (KIP-848)

[KIP-848](https://cwiki.apache.org/confluence/display/KAFKA/KIP-848%3A+The+Next+Generation+of+the+Consumer+Rebalance+Protocol) introduces a next-generation consumer group rebalance protocol that can deliver up to 20x faster rebalances while eliminating stop-the-world pauses. This guide focuses specifically on using this protocol with Karafka - certain KIP-848 features and limitations not relevant to Karafka applications are intentionally omitted.

!!! info "Low-Level Protocol Details"

    Parts of this documentation are based on the [librdkafka KIP-848](https://github.com/confluentinc/librdkafka/blob/v2.12.0/INTRODUCTION.md#next-generation-consumer-group-protocol-kip-848). Big thank you to the librdkafka team for allowing me to build upon their excellent documentation.

    For detailed low-level information about the next generation consumer group protocol, including internal implementation details and protocol specifications, see the librdkafka documentation linked above.

## Consumer Group Rebalance Protocol (KIP-848) / Overview

Traditional consumer rebalancing requires **all** consumers to stop processing during coordination, even if only one consumer joins or leaves the group. KIP-848 solves this by moving coordination logic to the Kafka broker and allowing consumers to continue processing while rebalancing happens incrementally in the background.

**Key Benefits:**

- Rebalances complete up to 20x faster in large consumer groups
- Consumers continue processing messages during rebalancing
- Only affected consumers pause briefly when receiving new partition assignments
- Better isolation when some consumers are slower than others
- Improved operational visibility with server-side coordination

## Consumer Group Rebalance Protocol (KIP-848) / When to use the new protocol

- **With large consumer groups:** If your consumer groups have 10+ consumers managing many partitions, you will see the most dramatic improvements. For example, a group with 10 consumers adding 900 partitions completes rebalancing in 5 seconds instead of 103 seconds.

- **For high-availability applications:** If your application can't afford processing interruptions, you will benefit from continuous message processing during rebalances. Financial services, real-time analytics, and fraud detection systems are ideal candidates.

- **In frequently rebalancing environments:** If you have auto-scaling deployments, Kubernetes with frequent pod restarts, or development environments with continuous deployments, you will experience much less disruption.

- **When scaling partitions dynamically:** If you regularly add partitions and topics to match workload changes, the new protocol will handle these changes more efficiently.

## Consumer Group Rebalance Protocol (KIP-848) / Requirements

### Consumer Group Rebalance Protocol (KIP-848) / Requirements / Broker Requirements

- Apache Kafka 4.0+ or Confluent Platform 8.0+
- KRaft mode (ZooKeeper-based clusters must migrate first)

!!! warning "Avoid Known Broker Bugs"

    Some early Kafka releases that shipped the `consumer` protocol contained critical rebalance bugs. In particular, [KAFKA-19862](https://issues.apache.org/jira/browse/KAFKA-19862) could leave a consumer group stuck in the `CompletingRebalance` state, causing consumers to hang rather than rebalance. Before enabling KIP-848 in production, confirm that your broker version is **at or beyond** the release that fixes the known issues for your platform, and validate on a staging cluster first. If you rely on the Karafka [liveness listener](https://karafka.io/docs/Infrastructure-Deployment.md#liveness), its `stability_ttl` check can detect a group frozen in a non-steady join state caused by such bugs.

!!! warning "Alternative Kafka Protocol Implementations"

    At the time of writing, KIP-848 is not supported by Redpanda or other alternative Kafka protocol implementations. This feature requires Apache Kafka 4.0+ brokers. Check with your broker vendor for KIP-848 support status if not using Apache Kafka.

### Consumer Group Rebalance Protocol (KIP-848) / Requirements / Karafka Requirements

- karafka-rdkafka with librdkafka 2.12.0+
- Karafka 2.4+
- Ruby 3.2+ recommended

!!! note "No Code Changes Required"

    **No application code changes required.** You only need to update configuration.

### Consumer Group Rebalance Protocol (KIP-848) / Requirements / Supported Features

KIP-848 in librdkafka 2.12.0+ supports all major consumer features:

- **Topic subscriptions**: Both explicit topic lists and regular expression (regex) patterns
- **Static group membership**: Using `group.instance.id` for stable member identities
- **Rebalance callbacks**: Incremental assignment and revocation callbacks
- **Manual and automatic offset management**: Both commit modes work as expected
- **Rolling upgrades**: Seamless migration from classic protocol without downtime

Regex subscriptions are supported but behave differently from the classic protocol - see [Regex Subscription Changes](#regex-subscription-changes) for important details.

## Consumer Group Rebalance Protocol (KIP-848) / Configuration

### Consumer Group Rebalance Protocol (KIP-848) / Configuration / Enabling KIP-848

The new protocol is **not** enabled by default. Update your Karafka configuration:

```ruby
class KarafkaApp < Karafka::App
  setup do |config|
    config.kafka = {
      'bootstrap.servers': 'kafka-broker:9092',
      'group.protocol': 'consumer'  # Enable KIP-848
    }
  end
end
```

### Consumer Group Rebalance Protocol (KIP-848) / Configuration / Choosing an assignor

The protocol provides two server-side assignors:

```ruby
config.kafka = {
  'group.protocol': 'consumer',
  'group.remote.assignor': 'uniform'  # Default, recommended for most cases
}
```

- **Uniform assignor** (recommended): Distributes partitions evenly across consumers. Works well for most workloads and provides good balance.

- **Range assignor**: Groups topic partitions together as ranges. Useful when you need related partitions on the same consumer.
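
As a minimal sketch, assuming you want to pick the assignor per environment, the helper below builds the relevant settings and rejects any name other than the two described above. `SERVER_SIDE_ASSIGNORS` and `assignor_settings` are hypothetical names used only for illustration; they are not part of Karafka.

```ruby
# The two server-side assignors described above.
SERVER_SIDE_ASSIGNORS = %w[uniform range].freeze

# Hypothetical helper: builds the settings to merge into `config.kafka`.
def assignor_settings(assignor = 'uniform')
  unless SERVER_SIDE_ASSIGNORS.include?(assignor)
    raise ArgumentError, "unknown assignor: #{assignor.inspect}"
  end

  { 'group.protocol': 'consumer', 'group.remote.assignor': assignor }
end

# Example: opt into the range assignor instead of the default.
range_settings = assignor_settings('range')
```

The resulting hash can be merged with the rest of your `config.kafka` settings. Failing early on an unexpected name keeps a typo from reaching the broker.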

### Consumer Group Rebalance Protocol (KIP-848) / Configuration / Configuration Cleanup

When migrating to KIP-848, remove these classic protocol settings:

```ruby
config.kafka = {
  'group.protocol': 'consumer',

  # Remove these - they cause errors with the new protocol:
  # 'partition.assignment.strategy': 'cooperative-sticky',
  # 'session.timeout.ms': 45000,
  # 'heartbeat.interval.ms': 3000
}
```

!!! warning "Deprecated Properties"

    Session and heartbeat timeouts are now controlled by the broker, not individual consumers. Including any of the following deprecated properties when using `group.protocol=consumer` will cause request rejection:

    - `partition.assignment.strategy`
    - `session.timeout.ms`
    - `heartbeat.interval.ms`
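
The cleanup can also be applied programmatically. The following is a minimal sketch, assuming your settings live in a plain Ruby hash. `CLASSIC_ONLY_KEYS` and `kip848_kafka_config` are hypothetical names, and the key list is taken from this guide (including `group.protocol.type` from the migration checklist), not from any Karafka API.

```ruby
# Classic-protocol settings that this guide says to remove.
CLASSIC_ONLY_KEYS = %w[
  partition.assignment.strategy
  session.timeout.ms
  heartbeat.interval.ms
  group.protocol.type
].freeze

# Hypothetical helper: returns a copy of the settings without classic-only
# keys and with the new protocol enabled. The input hash is not modified.
def kip848_kafka_config(classic_config)
  cleaned = classic_config.reject { |key, _value| CLASSIC_ONLY_KEYS.include?(key.to_s) }
  cleaned.merge({ 'group.protocol': 'consumer' })
end

classic_settings = {
  'bootstrap.servers': 'kafka-broker:9092',
  'partition.assignment.strategy': 'cooperative-sticky',
  'session.timeout.ms': 45_000,
  'heartbeat.interval.ms': 3_000
}

migrated_settings = kip848_kafka_config(classic_settings)
```

The helper assumes symbol keys, as used throughout this guide. If your configuration mixes string and symbol keys, normalize them first so that `group.protocol` does not end up defined twice.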

## Consumer Group Rebalance Protocol (KIP-848) / Migration Guide

### Consumer Group Rebalance Protocol (KIP-848) / Migration Guide / Preparation

Before migrating:

1. Upgrade Kafka brokers to version 4.0+
1. Verify brokers are running in KRaft mode
1. Upgrade all the Karafka ecosystem components to the most recent versions
1. Test the migration in a staging environment first
1. Ensure monitoring tools are ready to track the new protocol

### Consumer Group Rebalance Protocol (KIP-848) / Migration Guide / Rolling Migration

KIP-848 supports live migration without downtime. When the first consumer using the new protocol joins a group, the coordinator will automatically transition the entire group.

1. Update your Karafka configuration to enable `'group.protocol': 'consumer'` and remove deprecated properties.

1. Deploy the updated configuration using a rolling restart:

    - Restart the first consumer instance
    - The group coordinator will transition to the new protocol
    - Continue restarting remaining consumers one at a time
    - Monitor for any errors during the rollout

    !!! warning "Warning"

        Complete the migration within a few hours. Don't leave the group in a mixed state for extended periods.

### Consumer Group Rebalance Protocol (KIP-848) / Migration Guide / Rollback

If issues arise, remove `'group.protocol': 'consumer'` from your configuration and restart consumers. The coordinator will automatically convert back to classic protocol when the last new-protocol consumer leaves.
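
One way to make both the rollout and the rollback a pure deployment setting is to toggle the protocol from the environment. This is a minimal sketch under stated assumptions: `kafka_settings`, `KARAFKA_GROUP_PROTOCOL`, and `KAFKA_BOOTSTRAP_SERVERS` are hypothetical names chosen for illustration, not options recognized by Karafka itself.

```ruby
# Hypothetical helper: builds the `config.kafka` hash and enables KIP-848
# only when KARAFKA_GROUP_PROTOCOL is set to "consumer".
# `env` is injectable so the logic can be exercised without touching ENV.
def kafka_settings(env = ENV)
  settings = {
    'bootstrap.servers': env.fetch('KAFKA_BOOTSTRAP_SERVERS', 'kafka-broker:9092')
  }
  settings[:'group.protocol'] = 'consumer' if env['KARAFKA_GROUP_PROTOCOL'] == 'consumer'
  settings
end

# With the variable set, the new protocol is requested.
new_protocol_settings = kafka_settings({ 'KARAFKA_GROUP_PROTOCOL' => 'consumer' })

# Without it, the key is omitted and the classic protocol stays in use.
classic_protocol_settings = kafka_settings({})
```

Inside the `setup` block you would then assign `config.kafka = kafka_settings`. Rolling back becomes a matter of unsetting the variable and restarting the consumers, although any classic-only settings you removed earlier would need to be restored separately.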

### Consumer Group Rebalance Protocol (KIP-848) / Migration Guide / Migration Checklist

Use this checklist to ensure a smooth migration to KIP-848:

**Prerequisites:**

- [ ] Upgrade Kafka brokers to version 4.0.0+
- [ ] Verify brokers are running in KRaft mode (not ZooKeeper)
- [ ] Upgrade to the latest version of all Karafka ecosystem components

**Configuration Changes:**

- [ ] Set `'group.protocol': 'consumer'` in `config.kafka`
- [ ] Remove `'partition.assignment.strategy'` if present
- [ ] Remove `'session.timeout.ms'` if present
- [ ] Remove `'heartbeat.interval.ms'` if present
- [ ] Remove `'group.protocol.type'` if present

**Code Review (if using regex subscriptions):**

- [ ] Review all regex patterns to ensure they match complete topic names (e.g., use `^topic.*` instead of `^topic`)
- [ ] Test regex patterns against the RE2/J engine behavior (full-match, not partial-match)
- [ ] Update [Routing Patterns](https://karafka.io/docs/Pro-Routing-Patterns.md) regexes if needed (e.g., `pattern(/prefix.*/)` instead of `pattern(/prefix/)`)

**Code Review (if using static membership):**

- [ ] Review static membership usage (`group.instance.id`) and understand new fencing behavior

**Deployment:**

- [ ] Deploy using rolling restart (one consumer instance at a time)
- [ ] Monitor first consumer restart for successful group protocol transition
- [ ] Continue rolling restart across all consumer instances
- [ ] Verify migration with `kafka-consumer-groups.sh --bootstrap-server <broker> --describe --group <group> --state` or using Karafka Web UI
- [ ] Complete migration within a few hours (don't leave in mixed state)

**Post-Migration Validation:**

- [ ] Verify all consumers show in consumer group
- [ ] Check consumer lag is normal
- [ ] Monitor rebalance frequency and duration
- [ ] Watch for new protocol-specific errors in logs
- [ ] Validate offset commits are working correctly

**Rollback Plan (if needed):**

- [ ] Document rollback procedure: remove `'group.protocol': 'consumer'` and restart
- [ ] Understand that rollback triggers another rebalance
- [ ] Prepare monitoring alerts for rollback detection

## Consumer Group Rebalance Protocol (KIP-848) / Karafka-Specific Considerations

### Consumer Group Rebalance Protocol (KIP-848) / Karafka-Specific Considerations / Rebalance Callbacks

Your existing rebalance callbacks will continue working with KIP-848:

```ruby
class EventsConsumer < Karafka::BaseConsumer
  def consume
    # `process` stands in for your own message handling logic
    messages.each { |msg| process(msg) }
  end

  def revoked
    Karafka.logger.info "Partition revoked: #{topic.name}/#{partition}"
    # Cleanup: flush buffers, commit work, etc.
  end

  def shutdown
    # Final cleanup when consumer shuts down
  end
end
```

### Consumer Group Rebalance Protocol (KIP-848) / Karafka-Specific Considerations / Multi-Threading Behavior

Karafka's multi-threaded processing benefits significantly from KIP-848. During rebalances, only threads consuming or processing affected partitions will pause briefly. Other threads will continue processing messages uninterrupted.

[Virtual Partitions](https://karafka.io/docs/Pro-Consumer-Groups-Virtual-Partitions.md) (parallel processing within a partition) will also experience less disruption during rebalances.

## Consumer Group Rebalance Protocol (KIP-848) / Protocol Behavior Differences

KIP-848 introduces several important behavioral changes compared to the classic protocol. Understanding these differences helps avoid surprises during migration and operation.

### Consumer Group Rebalance Protocol (KIP-848) / Protocol Behavior Differences / Session Timeout and Message Fetching

- **KIP-848 Behavior:** When the Group Coordinator becomes unreachable, consumers **will continue fetching and processing messages** but will not be able to commit offsets. The consumer will only be fenced once a heartbeat response is received from the Coordinator indicating the session has expired.

- **Classic Protocol:** Consumers stopped fetching messages when the client-side session timeout expired, even if the broker was unreachable.

- **Implication:** With KIP-848, your consumers will remain productive during temporary coordinator outages. However, offsets for processed messages cannot be committed until coordinator connectivity is restored. Design your consumers to handle duplicate processing if a crash occurs during this window.
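
To illustrate what handling duplicate processing can look like, here is a minimal sketch of an idempotency guard keyed by topic, partition, and offset. `IdempotentProcessor` is a hypothetical class, and the in-memory `Set` is only a stand-in: it would not survive the crash scenario described above, so a real implementation would need a durable store (for example a database table with a unique constraint).

```ruby
require 'set'

# Hypothetical guard that runs a block at most once per message coordinate.
class IdempotentProcessor
  def initialize
    # Stand-in for a durable store of already processed coordinates.
    @seen = Set.new
  end

  # Runs the block unless this coordinate was already processed.
  # Returns true when the block ran and false when it was skipped.
  def process(topic, partition, offset)
    key = [topic, partition, offset]
    return false if @seen.include?(key)

    yield
    # Marked only after the block succeeds, so a failed attempt can be retried.
    @seen.add(key)
    true
  end
end

guard = IdempotentProcessor.new
handled = []

guard.process('events', 0, 42) { handled << 42 }
# A redelivery of the same message after an uncommitted offset is skipped.
guard.process('events', 0, 42) { handled << 42 }
```

Recording the coordinate after the block finishes favors at-least-once semantics: a failure inside the block leaves the message eligible for another attempt.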

### Consumer Group Rebalance Protocol (KIP-848) / Protocol Behavior Differences / Static Group Membership Fencing

- **KIP-848 Behavior:** When a duplicate `group.instance.id` is detected, the **newly joining member** will be fenced with `UNRELEASED_INSTANCE_ID` (fatal error). The existing member will continue operating.

- **Classic Protocol:** The **existing member** was fenced instead, allowing the new member to take over.

!!! warning "Breaking Change: Fencing Behavior Reversal"

    KIP-848 **reverses** static membership fencing behavior compared to the classic protocol. If you rely on static membership (`group.instance.id`), this change can significantly impact your deployment and recovery procedures:

    - **Deployment Impact:** You cannot quickly replace a consumer with the same `group.instance.id` unless the old consumer shuts down cleanly first
    - **Recovery Impact:** After crashes, replacements will be blocked until the broker's session timeout expires (removing the zombie member)
    - **Recommendation:** Ensure robust shutdown hooks and consider whether static membership is necessary for your use case

- **Implication:** This reversal prevents accidental takeovers. Ensure clean consumer shutdown before starting replacements with the same `group.instance.id`. If a consumer crashes without graceful shutdown, the replacement will be blocked until the broker's session timeout expires and removes the existing member.

### Consumer Group Rebalance Protocol (KIP-848) / Protocol Behavior Differences / Regex Subscription Changes

Regex matching in the `consumer` protocol is performed on the broker side, using the **Google RE2/J** regex engine. This differs from the `classic` protocol, where librdkafka and derived clients performed regex evaluation locally using the **libc regex** engine.

As part of adopting the `consumer` protocol, librdkafka (and derived clients) now rely on the broker's RE2/J engine for regex-based subscriptions, effectively replacing the previous `libc`-based matching behavior.

!!! warning "Regex Patterns May Need Updating"

    The RE2/J engine used by the broker requires that regexes match the **complete** topic name, while the `libc` engine used by the classic protocol only checks if the pattern is found within the topic name (including as a prefix). This means patterns that worked with the classic protocol may silently stop matching topics under the consumer protocol.

    **Example:** Given topics `topic-1` and `topic-2`:

    - `^topic` or `^topic*` - matches both topics in the `classic` protocol, but **no partitions are assigned** with the `consumer` protocol
    - `^topic.*` - works correctly with **both** protocols

    Always ensure your regex patterns use explicit wildcards (like `.*`) to match the full topic name.

In Karafka, the [Routing Patterns](https://karafka.io/docs/Pro-Routing-Patterns.md) feature internally prepends `^` to the regex source. This means:

- `pattern(/prefix/)` produces `^prefix` - **fails** under the consumer protocol for topics like `prefix-1`
- `pattern(/prefix.*/)` produces `^prefix.*` - **works** under both protocols

If you are migrating to the consumer protocol and use regex-based subscriptions, review all your patterns to ensure they include explicit wildcards where needed. See the [Routing Patterns](https://karafka.io/docs/Pro-Routing-Patterns.md#regexp-implementation-differences) documentation for more details.
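
The difference between partial and full matching can be checked before migrating. The sketch below approximates both behaviors with Ruby's own regex engine, which is neither RE2/J nor `libc`, so treat it as a first-pass review aid only. All method names are hypothetical, and `karafka_pattern_source` merely emulates the `^` prepending described above rather than calling Karafka.

```ruby
# Returns the pattern source for either a String or a Regexp.
def pattern_source(pattern)
  pattern.is_a?(Regexp) ? pattern.source : pattern.to_s
end

# Classic-style check: the pattern only has to be found within the name.
def partial_match?(pattern, topic_name)
  Regexp.new(pattern_source(pattern)).match?(topic_name)
end

# Consumer-protocol-style check: the pattern must cover the whole name.
def full_match?(pattern, topic_name)
  Regexp.new("\\A(?:#{pattern_source(pattern)})\\z").match?(topic_name)
end

# Emulates the `^` that Routing Patterns prepend to the regex source.
def karafka_pattern_source(regexp)
  "^#{regexp.source}"
end

topics = %w[topic-1 topic-2]

['^topic', '^topic*', '^topic.*'].each do |pattern|
  classic = topics.select { |name| partial_match?(pattern, name) }
  consumer = topics.select { |name| full_match?(pattern, name) }
  puts "#{pattern}: classic=#{classic.inspect} consumer=#{consumer.inspect}"
end
```

Running candidate patterns through both checks against a list of your real topic names highlights those that match under partial semantics but not under full-match semantics, which are the ones to rewrite with an explicit `.*`.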

### Consumer Group Rebalance Protocol (KIP-848) / Protocol Behavior Differences / Unknown and Unauthorized Topics

- **KIP-848 Behavior:**
    - `UNKNOWN_TOPIC_OR_PART` is no longer returned when subscribing to a topic that's missing from the local metadata cache. The subscription proceeds, and the consumer will discover the topic when metadata refreshes.
    - `TOPIC_AUTHORIZATION_FAILED` is reported once per heartbeat or subscription change, even if only one subscribed topic is unauthorized.

- **Classic Protocol:** Errors were reported immediately upon subscription if topics were missing from the local metadata cache.

- **Implication:** Topic discovery is more seamless, but authorization failures may appear less frequently in logs.

## Consumer Group Rebalance Protocol (KIP-848) / Error Handling

KIP-848 introduces new error conditions:

- **STALE_MEMBER_EPOCH:** Consumer's state is behind the coordinator. This will usually resolve automatically within seconds. Alert if errors persist.

- **FENCED_MEMBER_EPOCH:** Consumer must rejoin the group. This indicates serious coordination issues requiring investigation.
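
Alerting only when errors persist needs some notion of a time window. The following is a minimal sketch of a sliding-window counter. `PersistentErrorTracker`, its parameters, and the thresholds are hypothetical and would need tuning for your environment.

```ruby
# Hypothetical tracker that flags an error code once it has occurred at
# least `threshold` times within the last `window_seconds`.
class PersistentErrorTracker
  def initialize(window_seconds: 60, threshold: 5)
    @window_seconds = window_seconds
    @threshold = threshold
    @timestamps = Hash.new { |hash, code| hash[code] = [] }
  end

  # Records one occurrence. `now` is injectable to keep the logic testable.
  # Returns true when the code should be treated as persistent.
  def record(code, now = Time.now.to_f)
    stamps = @timestamps[code]
    stamps << now
    # Drop occurrences that have fallen out of the window.
    stamps.reject! { |stamp| stamp < now - @window_seconds }
    stamps.size >= @threshold
  end
end

tracker = PersistentErrorTracker.new(window_seconds: 10, threshold: 3)

tracker.record('STALE_MEMBER_EPOCH', 0.0)
tracker.record('STALE_MEMBER_EPOCH', 1.0)
# Third occurrence within ten seconds crosses the threshold.
should_alert = tracker.record('STALE_MEMBER_EPOCH', 2.0)
```

In a Karafka application, a tracker like this could be fed from an `error.occurred` instrumentation subscriber. How the error code is extracted from the event is left out here because it depends on your setup.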

## Consumer Group Rebalance Protocol (KIP-848) / Summary

KIP-848 delivers significant improvements in rebalance performance and stability without requiring application code changes. The benefits are most significant for large consumer groups and high-availability applications.

- **Migration Complexity:** Low - configuration changes only, rolling restart supported, rollback possible.

- **Risk Level:** Low with production Kafka 4.0 and librdkafka 2.12.0 releases. Known issues are well-documented with workarounds.

- **Recommendation:** For new deployments on Kafka 4.0+, enable KIP-848 from the start. For existing deployments, test thoroughly in staging before migrating to production.

## Consumer Group Rebalance Protocol (KIP-848) / See Also

- [Routing Patterns](https://karafka.io/docs/Pro-Routing-Patterns.md) - Regex-based dynamic topic routing and regexp engine differences
- [Concurrency and Multithreading](https://karafka.io/docs/Consumer-Groups-Concurrency-and-Multithreading.md) - For understanding how threading interacts with rebalancing
- [Pro Long Running Jobs](https://karafka.io/docs/Pro-Consumer-Groups-Long-Running-Jobs.md) - For handling long-running work during rebalances
- [Deployment](https://karafka.io/docs/Infrastructure-Deployment.md) - For deployment strategies including rolling restarts


---

*Last modified: 2026-07-10 10:21:45*