The dashboard is intended for administrators and provides an overview of the status and usage of the service. It can be reached at this URL:

https://surfrdm.grafana.net

The dashboard is based on Grafana; the Grafana documentation is available at https://grafana.com. Information is presented in panels containing charts, tables and numerical indicators, grouped in rows: service, data and users. When information is missing, charts are empty or show "broken" lines, while panels with only numerical indicators typically show "N/A". Information about the iRODS configuration is collected every 6 hours, while the uptime statistics are recorded every 5 minutes.

The dashboard is organised in the following pages:

 

Service status 

This is the homepage of the dashboard and shows links to the other pages and some summary statistics:

  • The availability of the service and storage resources: a monitoring process running on the iRODS server checks every 5 minutes whether the server is up and responsive. For the service and each storage resource you will see a green 'Up' or a red 'Down' accordingly (with a grey 'No data' for resources that are not configured for your instance).
  • Used space per storage resource
  • Data integrity metric per storage resource
  • Data duplication metric per storage resource

Service status details ('Overall status' link)

Here you can find more detail about the service status. The service and resource uptime as well as the used space are presented as a timeline for historical reference. 

A set of panels shows how much data is duplicated and how much space the duplicates use. That information is collected once per day. Duplicates are data objects that have the same checksum as another object. The duplicates in this dashboard are computed by treating all the resources as a single namespace, so their number usually differs from the sum of the duplicates per resource.
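The difference between the global count and the per-resource sum can be illustrated with a minimal sketch. The inventory below, the resource names and the counting rule (every object beyond the first with a given checksum counts as one duplicate) are assumptions made for illustration; they are not taken from the dashboard's actual implementation.

```python
from collections import defaultdict

# Hypothetical inventory: (resource name, logical path, checksum).
objects = [
    ("resA", "/zone/home/a.dat", "sha2:aaa"),
    ("resA", "/zone/home/b.dat", "sha2:aaa"),
    ("resA", "/zone/home/c.dat", "sha2:bbb"),
    ("resB", "/zone/home/d.dat", "sha2:bbb"),
    ("resB", "/zone/home/e.dat", "sha2:ccc"),
]


def count_duplicates(records):
    """Count objects that share a checksum with another object.

    Assumed rule: for each checksum, every object beyond the first
    counts as one duplicate.
    """
    per_checksum = defaultdict(int)
    for _resource, _path, checksum in records:
        per_checksum[checksum] += 1
    return sum(n - 1 for n in per_checksum.values())


def count_duplicates_per_resource(records):
    """Apply the same rule separately within each resource."""
    per_resource = defaultdict(list)
    for record in records:
        per_resource[record[0]].append(record)
    return {name: count_duplicates(recs) for name, recs in per_resource.items()}


# All resources treated as a single namespace.
global_duplicates = count_duplicates(objects)
# Each resource considered on its own.
per_resource = count_duplicates_per_resource(objects)

print(global_duplicates)  # 2
print(per_resource)       # {'resA': 1, 'resB': 0}
```

In this example the object with checksum "sha2:bbb" exists once in each resource, so it is a duplicate only when the resources are considered together. The global count is 2, while the per-resource counts add up to 1.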

Service configuration details ('Configuration' link)

Here you can find information related to the iRODS configuration: hostname, ports, zone name, version, storage quotas per resource, and the rule engines that have been installed. 

Service users ('Users' link)

Here you can find information about the users and their activity within the service. This is collected on an hourly basis. 

Service resource ('Storage resources' link)

Here you can find information about the usage of the storage resources over time, a breakdown of objects and collections, and how much space remains in each resource. This information is collected every hour and rounded to two decimal places.

  • The total used space includes everything known by the iRODS server (such as replicas and trashed objects).
  • The free space is calculated per resource and it is the difference between the allocated quota and the used space.
  • A negative free space means that the system is over the quota limit.
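The free space calculation described above can be sketched as follows. The resource names, quotas and usage figures are hypothetical, and the function name is chosen only for this example.

```python
def free_space(quota_bytes, used_bytes):
    """Free space is the allocated quota minus the used space."""
    return quota_bytes - used_bytes


# Hypothetical figures: resource name -> (allocated quota, used space), in bytes.
resources = {
    "resA": (1000, 400),
    "resB": (500, 650),
}

report = {}
for name, (quota, used) in resources.items():
    free = free_space(quota, used)
    # A negative value means the resource is over its quota limit.
    status = "over quota" if free < 0 else "within quota"
    report[name] = (free, status)

print(report)
# {'resA': (600, 'within quota'), 'resB': (-150, 'over quota')}
```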

Service data integrity ('Data integrity' link)

Here you can find information about the integrity of the data and the data backup processes. This information is collected every hour, but some of the metrics are only computed once per day. 

Metrics are grouped per iRODS resource. If a metric shows 0, there is nothing to report and everything is fine. Otherwise, it is useful to check whether its value grows over time or remains stable. If it is stable or shows only small fluctuations, the value is probably due to an error or incident at a specific point in time and is not systematic. If it grows over days or weeks, there may be cause for investigation, because there is probably a systematic issue affecting the integrity of the data.
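The reasoning above can be expressed as a rough rule of thumb. The sketch below is only an illustration: the function name, the comparison of the first and last daily value, and the tolerance parameter are all assumptions, not something the dashboard itself computes.

```python
def classify_trend(daily_values, tolerance=0):
    """Classify a series of daily metric values (oldest first).

    Hypothetical rule: compare the newest value with the oldest one and
    treat a difference of at most `tolerance` as a small fluctuation.
    """
    if not daily_values:
        raise ValueError("at least one value is required")
    if all(value == 0 for value in daily_values):
        return "nothing to report"
    growth = daily_values[-1] - daily_values[0]
    if growth > tolerance:
        return "growing: investigate"
    return "stable: probably a one-off incident"


print(classify_trend([0, 0, 0, 0]))                # nothing to report
print(classify_trend([3, 3, 4, 3], tolerance=1))   # stable: probably a one-off incident
print(classify_trend([1, 4, 9, 15], tolerance=1))  # growing: investigate
```

A suitable tolerance depends on the size of the instance, so the value used here is arbitrary.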

The following are shown:

Missing checksums

This is the number of data objects (files registered in iRODS/Yoda) for which no checksum has been computed. A missing checksum does not imply that the data are corrupted. In most cases, the checksum is computed automatically when the data are uploaded to iRODS/Yoda or copied from one resource to another, so its absence may indicate that something went wrong during the upload or the copy. However, the checksum is not always computed as soon as the file is uploaded or copied. It can take some hours, or even a day, so the number of missing checksums can continue to grow for some hours during an upload and then go down again during the night. This metric is computed every hour.

Intermediate replicas

When iRODS copies data, it uses a locking mechanism to prevent the source or the destination from changing during the copy. Each data object in iRODS has a status. If the copy operation is successful, the status of the copied data is good. If it fails, the copied data is not created at all. Sometimes the copy operation completes but its "finalization" fails, for example because the checksums differ. In those cases iRODS assigns an "intermediate" status to the copied data, which means that it does not know whether the data are complete. Because of the locking mechanism, data in an intermediate status remain locked, so a normal user cannot delete or update them; only an administrator can. This metric shows the number of intermediate replicas (copied objects in an intermediate status) and is computed every hour.

Missing files

A data object consists of two parts: the file, stored on some physical storage such as a filesystem or an object store, and the metadata, which are stored in the iRODS database. This metric shows the number of data objects whose file does not exist on the physical storage. You can still list them in iRODS, through iRODS clients or Yoda, but they do not point to an existing file. This metric is computed every night.

Missing objects

This metric shows the number of files that are stored in a storage space used by iRODS as a resource but are not registered in the iRODS database. It is the complement of the "Missing files" metric: here the files lack the metadata part. This metric is computed every night.
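The relationship between the "Missing files" and "Missing objects" metrics can be pictured as two set differences. The sketch below is purely illustrative: the paths are invented, and comparing two in-memory sets is an assumption about how such a check could work, not a description of how the metrics are actually collected.

```python
# Hypothetical physical paths registered in the iRODS database (the metadata side).
catalog_paths = {"/vault/a.dat", "/vault/b.dat", "/vault/c.dat"}

# Hypothetical physical paths actually present on the storage (the file side).
storage_paths = {"/vault/b.dat", "/vault/c.dat", "/vault/orphan.dat"}

# "Missing files": registered in the database, but no file on the storage.
missing_files = catalog_paths - storage_paths

# "Missing objects": present on the storage, but not registered in the database.
missing_objects = storage_paths - catalog_paths

print(len(missing_files), sorted(missing_files))      # 1 ['/vault/a.dat']
print(len(missing_objects), sorted(missing_objects))  # 1 ['/vault/orphan.dat']
```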

PIT recovery backups

This shows the total amount of data stored as a backup of the user data in the Yoda research area, both as a chart and as a table. It also shows when the last backup was completed and how long it took. The metric is computed every hour, but the backup is executed once per day.

