Cross-DC

Continued

Modes

  • Active/active
  • Active/passive

Active/active

  • Users and client applications send requests to both datacenters
  • After a write in the first DC, the data must be immediately visible for reads in the second DC
  • Supported by the default settings (SYNC backups)
  • Worse performance

Active/passive

  • Users and client applications send requests to one datacenter
  • The second DC is used just as a backup in case the first DC fails
  • Better performance

Communication details

  • A setup with 2 datacenters consists of 5 Infinispan clusters:
    • Keycloak nodes cluster in site1
    • JDG nodes cluster in site1
    • Cluster between JDG nodes on site1 and on site2 (JGroups RELAY2 protocol and backup caches)
    • JDG nodes cluster in site2
    • Keycloak nodes cluster in site2
  • Keycloak sends a message to the JDG server in the same site
  • The JDG server sends it to the JDG server in the other site through RELAY2
  • The JDG server in the second site propagates it to the Keycloak servers in the second site through the HotRod protocol (client listeners)
  • Keycloak servers listen for events through client listeners and take the appropriate actions (invalidate caches, update session caches)
  • Communication between JDG servers from different sites goes through the RELAY2 protocol
  • RELAY2 protocol - configured in the JGroups subsystem on the JDG server side
  • Infinispan caches on the JDG side - "backup" element
  • Keycloak nodes and JDG nodes in the same DC communicate through the HotRod protocol
  • Keycloak Infinispan caches are configured with "remote-store" in standalone-ha.xml
  • The JDG server needs to have caches with the same names configured on its side (see the configuration sketch below)
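
To make this concrete, here is a configuration sketch. The site names ("site1", "site2"), the "clustered" cache container, the "global" bridge channel, and the "remote-cache" outbound binding are illustrative assumptions, not values from a real deployment; the documentation linked below has the authoritative setup.

    <!-- JDG side, JGroups subsystem: RELAY2 bridges the local cluster
         to the other site over a separate "global" channel -->
    <channels default="cluster">
        <channel name="cluster"/>
        <channel name="global" stack="tcp" cluster="global"/>
    </channels>
    <stacks>
        <stack name="udp">
            <!-- ...usual clustering protocols... -->
            <relay site="site1">
                <remote-site name="site2" channel="global"/>
            </relay>
        </stack>
        <!-- the "tcp" stack used by the bridge channel is defined here too -->
    </stacks>

    <!-- JDG side, Infinispan subsystem: each Keycloak cache declares
         a "backup" to the other site -->
    <replicated-cache name="sessions">
        <backups>
            <backup site="site2" strategy="SYNC" failure-policy="FAIL"/>
        </backups>
    </replicated-cache>

    <!-- Keycloak side, standalone-ha.xml: the local cache is backed by a
         remote-store pointing at the JDG server in the same site; the cache
         name must match the cache configured on the JDG server -->
    <replicated-cache name="sessions">
        <remote-store cache="sessions" remote-servers="remote-cache"
                      passivation="false" fetch-state="false" purge="false"
                      preload="false" shared="true"/>
    </replicated-cache>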

Basic setup

  • Documentation: http://www.keycloak.org/docs/latest/server_installation/index.html#setup
  • It's recommended to try this basic setup first to understand cross-DC better

Cross-DC deployment administration

Recommended startup order

  • Replicated databases in both datacenters
  • JDG servers in both datacenters
  • Keycloak servers in both datacenters

Requirements

  • Keycloak requires the database in its DC to be running
  • Keycloak requires (at least one) JDG server in its DC to be running
  • Keycloak doesn't strictly need the JDG server in the other DC to be running, but it's recommended
  • If the JDG server in the other DC is not running, the second datacenter is "offline" from the first datacenter's PoV

Taking site offline

  • Datacenter "site2" is considered offline from the "site1" PoV if:
    • There are no running JDG servers in site2
    • The network between site2 and site1 is broken
  • "Take site offline" = make sure that site1 considers site2 offline
  • Once site1 knows that site2 is offline, it ignores it (stops backing up data to it)

Manually taking site offline

  • Possible on the JDG server side through JMX (jconsole) or through the CLI (see the example below)
  • Refer to the JDG documentation for more details
  • The site needs to be taken offline separately for every cache, or at the CacheManager level
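
For illustration, taking site2 offline for the "sessions" cache from the CLI might look like this (a sketch assuming a JDG 7.x server with the "datagrid-infinispan" subsystem and the "clustered" cache container; check the JDG documentation for the exact paths in your version):

    $ ./bin/jboss-cli.sh --connect --controller=localhost:9990
    # repeat for every backed-up cache (e.g. sessions, offlineSessions,
    # loginFailures, actionTokens, work)
    /subsystem=datagrid-infinispan/cache-container=clustered/replicated-cache=sessions:take-site-offline(site-name=site2)

Through JMX, the corresponding operation is takeSiteOffline on the cache's XSiteAdmin MBean.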

Automatically taking site offline

  • Done through the configuration on the JDG side
  • Element "take-offline" inside "backup" on caches (see the sketch below)
  • With our default configuration, a site is taken offline after it has been unreachable for 60 seconds since the first failed request
  • On the Keycloak side, user requests will be blocked for a few minutes
  • The JDG server logs will contain exceptions about failed backups
  • Possible to reduce the blocking time by switching the backup policy from FAIL to WARN
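
The corresponding piece of JDG cache configuration might look like this (a sketch; the 60-second window mirrors the default described above, while the after-failures value is illustrative):

    <replicated-cache name="sessions">
        <backups>
            <backup site="site2" strategy="SYNC" failure-policy="FAIL">
                <!-- take site2 offline after 3 consecutive failures, but only
                     once 60s have passed since the first failed request -->
                <take-offline after-failures="3" min-wait="60000"/>
            </backup>
        </backups>
    </replicated-cache>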

Taking site online

  • Can be done once the network between the sites is fixed and/or the JDG servers in site2 are started
  • Needs to be done manually by an admin
  • Through JMX or the CLI (see the example below)
  • Other actions that may be needed:
    • State transfer
    • Clearing Keycloak caches
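
For example, bringing site2 back online for the "sessions" cache (same sketch assumptions as the take-site-offline example above):

    /subsystem=datagrid-infinispan/cache-container=clustered/replicated-cache=sessions:bring-site-online(site-name=site2)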

State transfer

  • Needs to be done manually
  • Again through JMX or the CLI on the JDG server side (see below)
  • The admin needs to decide whether to transfer state unidirectionally or bidirectionally
  • Some data may be lost/overwritten
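
A unidirectional transfer pushing site1's state to site2 might look like this (a sketch, same assumptions as the CLI examples above; via JMX this corresponds to the pushState operation on the XSiteAdmin MBean):

    # run on a JDG server in site1, per cache; for a bidirectional transfer,
    # run the mirror command on site2 as well
    /subsystem=datagrid-infinispan/cache-container=clustered/replicated-cache=sessions:push-state(site-name=site2)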

Clearing caches on Keycloak side

  • Needed if some KC entities were updated, but the caches were not invalidated during the outage
  • Can be done on a single KC server on any site -- the invalidation should propagate to all others (see the example below)
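
One way to do it is through the admin REST API (a sketch; the hostname is hypothetical, and it assumes the legacy /auth context path, the "master" realm, and an admin access token already stored in $TOKEN):

    curl -X POST -H "Authorization: Bearer $TOKEN" \
      "https://keycloak.site1.example.com/auth/admin/realms/master/clear-realm-cache"
    curl -X POST -H "Authorization: Bearer $TOKEN" \
      "https://keycloak.site1.example.com/auth/admin/realms/master/clear-user-cache"

The same caches can also be cleared from the admin console (Realm Settings -> Cache).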

Backup policy

  • Configurable on the JDG server side
  • FAIL (default) or WARN (see the snippet below)
  • FAIL propagates backup failures to the caller (the Keycloak server)
  • Keycloak can then retry the operation
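
Switching the policy is a one-attribute change on the "backup" element (same sketch assumptions as the earlier examples):

    <backup site="site2" strategy="SYNC" failure-policy="WARN"/>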

FAIL policy - advantages

  • Consistency of data between sites
  • Correct behaviour if there are concurrent updates of the same entity
    • One of the update operations fails and is retried
  • No lost updates (write skew)
  • Correct behaviour during shorter outages between sites (e.g. a few seconds)
    • Because operations are retried, there won't be lost updates

WARN policy - advantages

  • In case of a real, longer outage (split-brain), the caller won't be blocked for a long time
  • E.g. during a site outage, user logins will be blocked for just 10-30 seconds; with FAIL, 1-3 minutes

Conclusion

WARN is better if you don't need 100% consistency AND you expect frequent split-brains (offline sites)

SYNC or ASYNC backups?

  • ASYNC is sufficient for active/passive mode for all caches (see the per-cache sketch at the end of this section)
  • ASYNC won't notify the caller if the backup to the second site failed

ASYNC for actionTokens cache?

  • ASYNC is useful if the single-use guarantee for tickets isn't strictly needed
  • ASYNC doesn't guarantee single-use of an OAuth2 code, which is REQUIRED by the spec
  • ASYNC - better performance, but worse security

ASYNC for session caches?

  • Sufficient if all user and client requests for a session end up in the same DC
    • The case when all frontend clients use the JavaScript adapter
    • The case when the loadbalancer forwards requests based on location and the apps are available on both sites too
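
Putting the last three sections together, a per-cache strategy choice for an active/passive setup might look like this on the JDG side (a sketch; site and cache names as in the earlier examples):

    <!-- sessions: ASYNC is acceptable when all requests for a given
         session land in the same DC -->
    <replicated-cache name="sessions">
        <backups>
            <backup site="site2" strategy="ASYNC"/>
        </backups>
    </replicated-cache>

    <!-- actionTokens: kept SYNC so that a single-use ticket or OAuth2
         code cannot be redeemed in both sites -->
    <replicated-cache name="actionTokens">
        <backups>
            <backup site="site2" strategy="SYNC" failure-policy="FAIL"/>
        </backups>
    </replicated-cache>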