Install Prometheus:
Follow instructions at http://prometheus.io/docs/introduction/install/
Enable Synapse metrics:
In homeserver.yaml, make sure enable_metrics is set to True.
Enable the /_synapse/metrics Synapse endpoint that Prometheus uses to collect data:
There are two methods of enabling the metrics endpoint in Synapse.
The first serves the metrics as part of the usual web server and
can be enabled by adding the metrics resource to the existing
listener, as in this example:
listeners:
  - port: 8008
    tls: false
    type: http
    x_forwarded: true
    bind_addresses: ['::1', '127.0.0.1']

    resources:
      # added "metrics" in this line
      - names: [client, federation, metrics]
        compress: false
This provides a simple way of adding metrics to your Synapse
installation, and serves them under /_synapse/metrics. If you do not
wish your metrics to be publicly exposed, you will need to either
filter them out at your load balancer, or use the second method.
The second method runs the metrics server on a different port, in a
different thread to Synapse. This can make it more resilient to
heavy load, under which metrics served by the main listener might not
be retrievable, and makes it easier to expose the endpoint to internal
networks only. The metrics are served over HTTP only, at
/_synapse/metrics.
Add a new listener to homeserver.yaml as in this example:
listeners:
  - port: 8008
    tls: false
    type: http
    x_forwarded: true
    bind_addresses: ['::1', '127.0.0.1']

    resources:
      - names: [client, federation]
        compress: false

  # beginning of the new metrics listener
  - port: 9000
    type: metrics
    bind_addresses: ['::1', '127.0.0.1']
Restart Synapse.
Add a Prometheus target for Synapse.
It needs to set the metrics_path to a non-default value (under scrape_configs):
  - job_name: "synapse"
    scrape_interval: 15s
    metrics_path: "/_synapse/metrics"
    static_configs:
      - targets: ["my.server.here:port"]
where my.server.here is the IP address of Synapse, and port is
the listener port configured with the metrics resource.
If your Prometheus is older than 1.5.2, you will need to replace
static_configs in the above with target_groups.
Restart Prometheus.
Consider using the Grafana dashboard and the required recording rules.
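Before relying on dashboards, it can help to confirm that the metrics endpoint set up in the steps above actually returns data. The following is a minimal sketch and not part of Synapse: it fetches the text output of an endpoint and lists the metric names found in it. The helper names fetch_metrics and metric_names are hypothetical, and the parsing is deliberately simplified (it only extracts names, not labels or values).

```python
from urllib.request import urlopen


def fetch_metrics(url, timeout=5.0):
    """Return the raw text served by a metrics endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def metric_names(exposition_text):
    """Collect the metric names found in Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like "name{labels} value" or "name value",
        # so the name ends at the first "{" or whitespace character.
        end = len(line)
        for i, ch in enumerate(line):
            if ch == "{" or ch.isspace():
                end = i
                break
        names.add(line[:end])
    return names
```

Assuming the dedicated listener from the second method, calling metric_names(fetch_metrics("http://127.0.0.1:9000/_synapse/metrics")) should return a non-empty set once Synapse has been restarted; adjust the address and port to match your own listener.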
To monitor a Synapse installation using workers, every worker needs to be monitored independently, in addition to the main homeserver process. This is because workers don't send their metrics to the main homeserver process, but expose them directly (if they are configured to do so).
To allow collecting metrics from a worker, you need to add a
metrics listener to its configuration, by adding the following
under worker_listeners:
 - type: metrics
   bind_address: ''
   port: 9101
The bind_address and port parameters should be set so that
the resulting listener can be reached by Prometheus, and so that they
don't clash with those of an existing worker.
With this example, the worker's metrics would then be available
on http://127.0.0.1:9101.
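When several workers run on one host, it is easy to reuse a metrics port by accident. The sketch below, which is not part of Synapse, checks a list of listener settings for such clashes. The function name find_listener_clashes and the input shape (worker name, bind address, port) are assumptions made for illustration. It also assumes, conservatively, that an empty or wildcard bind address overlaps with every other address on the same port.

```python
from collections import defaultdict

# Addresses assumed to mean "listen on every interface".
WILDCARDS = {"", "0.0.0.0", "::"}


def find_listener_clashes(listeners):
    """Return (worker_a, worker_b, port) for listeners that would collide.

    `listeners` is an iterable of (worker_name, bind_address, port) tuples.
    """
    by_port = defaultdict(list)
    for name, address, port in listeners:
        by_port[port].append((name, address))

    clashes = []
    for port, entries in sorted(by_port.items()):
        # Compare every pair of listeners that share this port.
        for i, (name_a, addr_a) in enumerate(entries):
            for name_b, addr_b in entries[i + 1:]:
                same_address = addr_a == addr_b
                wildcard = addr_a in WILDCARDS or addr_b in WILDCARDS
                if same_address or wildcard:
                    clashes.append((name_a, name_b, port))
    return clashes
```

An empty result means no two listeners in the list share a port on overlapping addresses; it does not check whether the port is already used by an unrelated process.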
Example Prometheus target for Synapse with workers:
  - job_name: "synapse"
    scrape_interval: 15s
    metrics_path: "/_synapse/metrics"
    static_configs:
      - targets: ["my.server.here:port"]
        labels:
          instance: "my.server"
          job: "master"
          index: 1
      - targets: ["my.workerserver.here:port"]
        labels:
          instance: "my.server"
          job: "generic_worker"
          index: 1
      - targets: ["my.workerserver.here:port"]
        labels:
          instance: "my.server"
          job: "generic_worker"
          index: 2
      - targets: ["my.workerserver.here:port"]
        labels:
          instance: "my.server"
          job: "media_repository"
          index: 1
Labels (instance, job, index) can be defined as anything.
The labels are used to group graphs in Grafana.
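With many workers, writing one static_configs entry per process by hand becomes repetitive. The following sketch generates the entries from a list of processes. It is an illustration only: the function names and the example targets are hypothetical, and the output is JSON, which YAML parsers generally accept, so check it against your Prometheus version before using it.

```python
import json


def build_static_configs(instance, processes):
    """Build one static_configs entry per Synapse process.

    `processes` is a list of (target, job, index) tuples, where target is
    a "host:port" string for that process's metrics listener.
    """
    return [
        {
            "targets": [target],
            # Label values are emitted as strings.
            "labels": {"instance": instance, "job": job, "index": str(index)},
        }
        for target, job, index in processes
    ]


def render_scrape_config(instance, processes):
    """Render a scrape_configs list containing a single "synapse" job."""
    job = {
        "job_name": "synapse",
        "scrape_interval": "15s",
        "metrics_path": "/_synapse/metrics",
        "static_configs": build_static_configs(instance, processes),
    }
    return json.dumps([job], indent=2)
```

For example, render_scrape_config("my.server", [("127.0.0.1:9000", "master", 1), ("127.0.0.1:9101", "generic_worker", 1)]) produces a job with two targets that share the instance label and differ in job and index.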
Synapse 1.2 updates the Prometheus metrics to match the naming
convention of the upstream prometheus_client. The old names are
considered deprecated and will be removed in a future version of
Synapse.
The old names will be disabled by default in Synapse v1.71.0 and removed
altogether in Synapse v1.73.0.
New Name | Old Name |
---|---|
python_gc_objects_collected_total | python_gc_objects_collected |
python_gc_objects_uncollectable_total | python_gc_objects_uncollectable |
python_gc_collections_total | python_gc_collections |
process_cpu_seconds_total | process_cpu_seconds |
synapse_federation_client_sent_transactions_total | synapse_federation_client_sent_transactions |
synapse_federation_client_events_processed_total | synapse_federation_client_events_processed |
synapse_event_processing_loop_count_total | synapse_event_processing_loop_count |
synapse_event_processing_loop_room_count_total | synapse_event_processing_loop_room_count |
synapse_util_caches_cache_hits | synapse_util_caches_cache:hits |
synapse_util_caches_cache_size | synapse_util_caches_cache:size |
synapse_util_caches_cache_evicted_size | synapse_util_caches_cache:evicted_size |
synapse_util_caches_cache | synapse_util_caches_cache:total |
synapse_util_caches_response_cache_size | synapse_util_caches_response_cache:size |
synapse_util_caches_response_cache_hits | synapse_util_caches_response_cache:hits |
synapse_util_caches_response_cache_evicted_size | synapse_util_caches_response_cache:evicted_size |
synapse_util_metrics_block_count_total | synapse_util_metrics_block_count |
synapse_util_metrics_block_time_seconds_total | synapse_util_metrics_block_time_seconds |
synapse_util_metrics_block_ru_utime_seconds_total | synapse_util_metrics_block_ru_utime_seconds |
synapse_util_metrics_block_ru_stime_seconds_total | synapse_util_metrics_block_ru_stime_seconds |
synapse_util_metrics_block_db_txn_count_total | synapse_util_metrics_block_db_txn_count |
synapse_util_metrics_block_db_txn_duration_seconds_total | synapse_util_metrics_block_db_txn_duration_seconds |
synapse_util_metrics_block_db_sched_duration_seconds_total | synapse_util_metrics_block_db_sched_duration_seconds |
synapse_background_process_start_count_total | synapse_background_process_start_count |
synapse_background_process_ru_utime_seconds_total | synapse_background_process_ru_utime_seconds |
synapse_background_process_ru_stime_seconds_total | synapse_background_process_ru_stime_seconds |
synapse_background_process_db_txn_count_total | synapse_background_process_db_txn_count |
synapse_background_process_db_txn_duration_seconds_total | synapse_background_process_db_txn_duration_seconds |
synapse_background_process_db_sched_duration_seconds_total | synapse_background_process_db_sched_duration_seconds |
synapse_storage_events_persisted_events_total | synapse_storage_events_persisted_events |
synapse_storage_events_persisted_events_sep_total | synapse_storage_events_persisted_events_sep |
synapse_storage_events_state_delta_total | synapse_storage_events_state_delta |
synapse_storage_events_state_delta_single_event_total | synapse_storage_events_state_delta_single_event |
synapse_storage_events_state_delta_reuse_delta_total | synapse_storage_events_state_delta_reuse_delta |
synapse_federation_server_received_pdus_total | synapse_federation_server_received_pdus |
synapse_federation_server_received_edus_total | synapse_federation_server_received_edus |
synapse_handler_presence_notified_presence_total | synapse_handler_presence_notified_presence |
synapse_handler_presence_federation_presence_out_total | synapse_handler_presence_federation_presence_out |
synapse_handler_presence_presence_updates_total | synapse_handler_presence_presence_updates |
synapse_handler_presence_timers_fired_total | synapse_handler_presence_timers_fired |
synapse_handler_presence_federation_presence_total | synapse_handler_presence_federation_presence |
synapse_handler_presence_bump_active_time_total | synapse_handler_presence_bump_active_time |
synapse_federation_client_sent_edus_total | synapse_federation_client_sent_edus |
synapse_federation_client_sent_pdu_destinations_count_total | synapse_federation_client_sent_pdu_destinations:count |
synapse_federation_client_sent_pdu_destinations_total | synapse_federation_client_sent_pdu_destinations:total |
synapse_handlers_appservice_events_processed_total | synapse_handlers_appservice_events_processed |
synapse_notifier_notified_events_total | synapse_notifier_notified_events |
synapse_push_bulk_push_rule_evaluator_push_rules_invalidation_counter_total | synapse_push_bulk_push_rule_evaluator_push_rules_invalidation_counter |
synapse_push_bulk_push_rule_evaluator_push_rules_state_size_counter_total | synapse_push_bulk_push_rule_evaluator_push_rules_state_size_counter |
synapse_http_httppusher_http_pushes_processed_total | synapse_http_httppusher_http_pushes_processed |
synapse_http_httppusher_http_pushes_failed_total | synapse_http_httppusher_http_pushes_failed |
synapse_http_httppusher_badge_updates_processed_total | synapse_http_httppusher_badge_updates_processed |
synapse_http_httppusher_badge_updates_failed_total | synapse_http_httppusher_badge_updates_failed |
synapse_admin_mau_current | synapse_admin_mau:current |
synapse_admin_mau_max | synapse_admin_mau:max |
synapse_admin_mau_registered_reserved_users | synapse_admin_mau:registered_reserved_users |
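Dashboards and alert rules that still use the old names need to be updated by hand or with a script. The following is a rough sketch of the latter: it replaces whole metric-name tokens in a query string using a few rows of the table above. The function name rename_metrics is hypothetical, the mapping is intentionally incomplete, and the token matching is naive, so it would also rewrite an old name that happens to appear inside a quoted label value.

```python
import re

# A few rows from the table above (old name -> new name); extend as needed.
RENAMES = {
    "process_cpu_seconds": "process_cpu_seconds_total",
    "synapse_util_caches_cache:hits": "synapse_util_caches_cache_hits",
    "synapse_util_caches_cache:total": "synapse_util_caches_cache",
    "synapse_admin_mau:current": "synapse_admin_mau_current",
}

# Metric names may contain letters, digits, underscores and colons.
METRIC_TOKEN = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def rename_metrics(expression, renames=RENAMES):
    """Replace old metric names in a query string with their new names."""
    # Matching whole tokens means an already-renamed metric such as
    # process_cpu_seconds_total is left untouched.
    return METRIC_TOKEN.sub(
        lambda match: renames.get(match.group(0), match.group(0)),
        expression,
    )
```

Because replacement works on complete tokens rather than substrings, running the function twice on the same query gives the same result as running it once.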
The duplicated metrics deprecated in Synapse 0.27.0 have been removed.
All duration-based metrics are now reported in seconds rather than milliseconds. This affects:
msec -> sec metrics |
---|
python_gc_time |
python_twisted_reactor_tick_time |
synapse_storage_query_time |
synapse_storage_schedule_time |
synapse_storage_transaction_time |
Several metrics have been changed to be histograms, which sort entries into buckets and allow better analysis. The following metrics are now histograms:
Altered metrics |
---|
python_gc_time |
python_twisted_reactor_pending_calls |
python_twisted_reactor_tick_time |
synapse_http_server_response_time_seconds |
synapse_storage_query_time |
synapse_storage_schedule_time |
synapse_storage_transaction_time |
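To illustrate what bucketed data makes possible, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation, in the spirit of PromQL's histogram_quantile function. It is a simplified illustration rather than Prometheus's exact algorithm, the function name estimate_quantile is hypothetical, and the bucket values in the usage note are made up.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with the final upper bound being math.inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == math.inf:
                # Cannot interpolate into +Inf; report the last finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

With the made-up buckets [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)], the median falls in the second bucket, between 0.1 and 0.5 seconds. The estimate is only as precise as the bucket boundaries allow.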
Synapse 0.27.0 begins the process of rationalising the duplicate
*:count metrics reported for the resource tracking for code blocks and
HTTP requests.
At the same time, the corresponding *:total metrics are being renamed,
as the :total suffix no longer makes sense in the absence of a
corresponding :count metric.
To enable a graceful migration path, this release just adds new names for the metrics being renamed. A future release will remove the old ones.
The following table shows the new metrics, and the old metrics which they are replacing.
New name | Old name |
---|---|
synapse_util_metrics_block_count | synapse_util_metrics_block_timer:count |
synapse_util_metrics_block_count | synapse_util_metrics_block_ru_utime:count |
synapse_util_metrics_block_count | synapse_util_metrics_block_ru_stime:count |
synapse_util_metrics_block_count | synapse_util_metrics_block_db_txn_count:count |
synapse_util_metrics_block_count | synapse_util_metrics_block_db_txn_duration:count |
synapse_util_metrics_block_time_seconds | synapse_util_metrics_block_timer:total |
synapse_util_metrics_block_ru_utime_seconds | synapse_util_metrics_block_ru_utime:total |
synapse_util_metrics_block_ru_stime_seconds | synapse_util_metrics_block_ru_stime:total |
synapse_util_metrics_block_db_txn_count | synapse_util_metrics_block_db_txn_count:total |
synapse_util_metrics_block_db_txn_duration_seconds | synapse_util_metrics_block_db_txn_duration:total |
synapse_http_server_response_count | synapse_http_server_requests |
synapse_http_server_response_count | synapse_http_server_response_time:count |
synapse_http_server_response_count | synapse_http_server_response_ru_utime:count |
synapse_http_server_response_count | synapse_http_server_response_ru_stime:count |
synapse_http_server_response_count | synapse_http_server_response_db_txn_count:count |
synapse_http_server_response_count | synapse_http_server_response_db_txn_duration:count |
synapse_http_server_response_time_seconds | synapse_http_server_response_time:total |
synapse_http_server_response_ru_utime_seconds | synapse_http_server_response_ru_utime:total |
synapse_http_server_response_ru_stime_seconds | synapse_http_server_response_ru_stime:total |
synapse_http_server_response_db_txn_count | synapse_http_server_response_db_txn_count:total |
synapse_http_server_response_db_txn_duration_seconds | synapse_http_server_response_db_txn_duration:total |
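The count and seconds metrics in the table above are typically used together: dividing the increase in total seconds by the increase in the count gives an average duration over the interval. The following sketch shows that arithmetic for two successive scrapes. The function name mean_duration and the sample values are hypothetical, and it assumes both readings come from series with the same label set.

```python
def mean_duration(earlier, later):
    """Average seconds per event between two scrapes.

    Each argument is a (count, total_seconds) pair, for example read from
    synapse_http_server_response_count and
    synapse_http_server_response_time_seconds.
    """
    count_delta = later[0] - earlier[0]
    seconds_delta = later[1] - earlier[1]
    if count_delta <= 0:
        # No new events, or the counters were reset by a restart.
        return None
    return seconds_delta / count_delta
```

For instance, going from (100, 12.0) to (160, 30.0) means 60 additional responses took 18 seconds in total, so the function returns their average duration in seconds.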
As of Synapse version 0.18.2, the format of the process-wide metrics has been changed to fit Prometheus standard naming conventions. Additionally, the units have been changed from milliseconds to seconds.
New name | Old name |
---|---|
process_cpu_user_seconds_total | process_resource_utime / 1000 |
process_cpu_system_seconds_total | process_resource_stime / 1000 |
process_open_fds (no 'type' label) | process_fds |
The Python-specific counts of garbage collector performance have been renamed.
New name | Old name |
---|---|
python_gc_time | reactor_gc_time |
python_gc_unreachable_total | reactor_gc_unreachable |
python_gc_counts | reactor_gc_counts |
The Twisted-specific reactor metrics have been renamed.
New name | Old name |
---|---|
python_twisted_reactor_pending_calls | reactor_pending_calls |
python_twisted_reactor_tick_time | reactor_tick_time |