Synopsis

ParaView provides a client-server mode (called "desktop-delivery mode"), which can be used to take advantage of parallel processing and rendering on Snellius GPU nodes. This allows faster data processing, as multiple CPU cores, and possibly GPUs, can be used.

Here we provide an overview of this mode and how to make use of it on Snellius.

Introduction

In client-server mode one or more ParaView server processes are running on one or more Snellius GPU nodes, with the ParaView GUI acting as a client of those server processes. In client-server mode the ParaView GUI works exactly like the standalone version, but transparently uses multiple server processes to process and visualize the loaded data in parallel. This can provide a massive speed-up, which is especially useful when working with large datasets.

We will first look at how to run ParaView in client-server mode on a single Snellius GPU node, within a remote desktop. This is the easiest form of usage, as all necessary software is available on Snellius and only a VNC client needs to be installed locally. It is also a natural extension of working with the remote desktop environment on Snellius.

Further below, we'll also show how to run the ParaView GUI locally, connecting to a set of ParaView server processes running on Snellius, and how to distribute a set of ParaView server processes over multiple nodes.

 

Walk-through of starting client-server mode

Assuming we have a running remote desktop on a Snellius GPU node, we open a terminal within the remote desktop and load the relevant ParaView module needed to start a set of server processes:
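For example (a sketch; the exact module names and versions below are assumptions based on the 2023 software stack shown elsewhere in this page):

```shell
# Module names are assumptions; run "module avail ParaView"
# to see the exact versions installed.
module load 2023
module load ParaView/5.11.2-foss-2023a-mpi
```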



Next, we start 4 pvserver  processes using mpirun :
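A minimal sketch of the command:

```shell
# Start 4 server processes; the head process (rank 0) listens on
# the default TCP port 11111 for an incoming client connection.
mpirun -np 4 pvserver
```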



The output indicates that the processes started up successfully and that the head process is listening on TCP port 11111 for a client connection.

In a second terminal window we launch the ParaView GUI under VirtualGL, again loading the necessary modules first:
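A sketch of the commands, assuming the same ParaView module as used for the server (vglrun is the standard VirtualGL launcher; a separate VirtualGL module may also need to be loaded):

```shell
module load 2023
module load ParaView/5.11.2-foss-2023a-mpi

# Run the ParaView GUI under VirtualGL for GPU-accelerated rendering
vglrun paraview
```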

The ParaView GUI will now be visible. We can connect to the running server processes through the File → Connect option, or using the icon:

This brings up the Choose Server Configuration dialog:

No configuration entries are listed by default, so we need to add one. This is a one-time action, as the configuration is saved and can be recalled in future sessions.

Click Add Server and use "localhost" (for example) as Name. The other options can stay as they are:

Next, click Configure. This will show a dialog for setting the type of startup, which is Manual in this case:

Click Save, and you should now be back in the Choose Server Configuration dialog, with a new entry called "localhost":

By double-clicking on the "localhost" entry we make the GUI open a connection to the server processes:


From this point on you can load data and build up your visualization as normal.

Client-server mode differences

When in client-server mode the GUI looks almost identical to ParaView running in regular mode, but two things indicate that client-server mode is active:

  • In the Pipeline Browser the connection is shown as cs://localhost:11111, instead of the regular builtin
  • The Memory Inspector (use View → Memory Inspector  if it isn't visible) shows the server connection, including memory statistics on the individual server processes

A more subtle point is that any file I/O operations are done on the server side by the server processes. This is less relevant when running client and server on the same node (as we do here), but in principle the GUI/client can be run locally on your own system, connecting to a set of server processes running on Snellius. ParaView even keeps a separate list of recent files per server you connected to:

This also reveals that when you run ParaView normally, i.e. without explicitly connecting to a ParaView server, you actually are using the so-called "built-in" server by default.

Data distribution and file formats

In order to take advantage of parallel processing in client-server mode the loaded data needs to be distributed over the server processes, with each process operating only on part of the dataset. Depending on the type of data, the file structure and the decomposition used, ParaView may not distribute the data very evenly over the server processes. This can lead to long filter and render times, as only some of the processes end up doing most of the work. In some cases data distribution happens automatically, but might still not be very efficient.

Few of the file format readers included in ParaView support automatic and efficient data distribution, so manually decomposing data into several files might be needed. Below we list a few file format readers that do offer automatic data distribution. See this ParaView page for up-to-date information on file formats and parallel data distribution; most of the details given below on how file formats are handled are based on descriptions taken from that page:

  • The .vti  file format stores regular structured image data, in most cases a 3D grid of point and/or cell values. When loading the dataset in parallel, the structured extent of the dataset is distributed among all processes in a load-balanced way. All processes open the data file, but each process reads only the data in the sub-extent assigned to it. So a single .vti  file containing a large 3D grid dataset will be loaded efficiently into ParaView when using client-server mode, as each subset is loaded only once. See below for an example.
  • The .vts  file format is similar to .vti  but relaxes the point coordinates to support non-rectangular grids (e.g. curvilinear). Reading in parallel is similar to how .vti  is handled.

Example of automatic data distribution

As an example, assume a 128x128x128 grid of byte values stored in a .vti  file (containing a CT-scan of a mummy's head). We load this file into the 4-process client-server setup as shown above, and then apply a contour filter to extract an iso-surface. We can show the (automatic) data distribution by setting the surface coloring to show the vtkProcessId value (range 0-3). Each of the four colors shows the subset of data that is handled by that particular server process:

But note what happens if we load the exact same data but stored in a .vtk  file and once again apply a contour filter:

All data remains on process 0, and processes 1-3 contain none of it (wasting resources), as the .vtk  file reader does not support automatic data distribution.

"Fixing" data distribution with the D3 filter

We can somewhat improve the situation above by applying a D3 filter to the input data. This redistributes the data so that it is no longer only on process 0, but spread over all processes in a more-or-less load-balanced way:

However, the output of the D3 filter is an unstructured grid, which is a very memory-inefficient representation. Also, depending on the type of input data, the redistributed data might be organized in a sub-optimal spatial way, as seen here when operating on polygonal input geometry:

Finally, not all filters will run on an unstructured grid dataset, so further processing of the output of the D3 filter might be limited.

So applying a D3 filter should really be a last-ditch effort. Using a file format that directly supports parallel reading and distribution is much preferred; see the next section.

Parallel file formats

The .vti  file format shown above allows data to be loaded from a single file, which then gets distributed over the server processes after loading. Other file formats work similarly, but depending on the format used this might not lead to a very even (or efficient) data distribution.

To have more predictable results it is usually better to do the data decomposition yourself, at the file level. This means (spatially) decomposing the data into several files beforehand, plus adding a small meta-data file containing information on the data files. This approach also works well if you're already using a set of parallel processes to produce data (e.g. a simulation code): each process can simply write its part of the output data to its own file, and you only need to create an extra meta-data file to tie them together.

Below, we describe several of the file format options available.

Xdmf

The Xdmf file format uses a set of HDF5 files to hold the partitioned data, with a single XML file holding the meta-data. Xdmf supports different types of datasets and (through HDF5) can use data compression for smaller on-disk storage. Xdmf supports not only spatial decomposition, but also decomposition over time and into logical groups. The HDF5 format used in Xdmf to store the actual data is a flexible and efficient binary file format for storing collections of multi-dimensional arrays. For most programming environments a library is available to write HDF5 files (e.g. https://www.h5py.org/).

ParaView has support for reading (and writing) Xdmf files, although it can be a bit picky when the XML metadata and HDF5 file(s) don't match in dimensions and such. The official Xdmf documentation is quite sparse, but a bit more detail can be found here (in the VisIt documentation, an alternative to ParaView).

Below is an example Xdmf XML meta-file, which describes a rectilinear 3D grid with varying spacing in X, Y and Z (fixed over time), and stores a velocity vector and a pressure value per point (time-varying). All data is stored in HDF5 files.

<?xml version="1.0"?>
<!DOCTYPE Xdmf SYSTEM "Xdmf.dtd">
<Xdmf Version="2.0">
  <Domain>
    <Grid Name="RB" GridType="Uniform">
      <Topology TopologyType="3DRectMesh" NumberOfElements="150 150 150"/>
      <Geometry GeometryType="VXVYVZ">
        <DataItem Dimensions="150" NumberType="Float" Precision="4" Format="HDF">
          gridcoords.h5:/x
        </DataItem>
        <DataItem Dimensions="150" NumberType="Float" Precision="4" Format="HDF">
          gridcoords.h5:/y
        </DataItem>
        <DataItem Dimensions="150" NumberType="Float" Precision="4" Format="HDF">
          gridcoords.h5:/z
        </DataItem>
      </Geometry>
      <Attribute Name="velocity" AttributeType="Vector" Center="Node">
        <DataItem Dimensions="150 150 150 3" NumberType="Float" Precision="4" Format="HDF">
          frame0001.h5:/velocity
        </DataItem>
      </Attribute>
      <Attribute Name="Pressure" AttributeType="Scalar" Center="Node">
        <DataItem Dimensions="150 150 150" NumberType="Float" Precision="4" Format="HDF">
          frame0001.h5:/pressure
        </DataItem>
      </Attribute>
      <Time Value="0.1"/>
    </Grid>
  </Domain>
</Xdmf>

Native XML formats

The native XML file formats of ParaView (and VTK) are a set of XML-based file formats for storing the different types of datasets that ParaView supports natively. These formats support reading (and writing) data distributed over several files, in a similar way as with Xdmf. For example, a 3D grid can be stored in a set of .vti  files, each holding part of the grid, together with a single .pvti  file containing meta-information on the part files and the dataset partitioning:

snellius paulm@gcn61 09:01 ~/datasets$ cat split.pvti 
<VTKFile type="PImageData" version="0.1" byte_order="LittleEndian">
  <PImageData WholeExtent="0 20 0 20 0 20" GhostLevel="0" Origin="0 0 0" Spacing="1 1 1">
    <PPointData Scalars="RTData">
      <PDataArray type="Float32" Name="RTData"/>
    </PPointData>
    <Piece Extent="0 20 0 10 0 10" Source="split_0.vti"/>
    <Piece Extent="0 20 10 20 0 10" Source="split_1.vti"/>
    <Piece Extent="0 20 0 10 10 20" Source="split_2.vti"/>
    <Piece Extent="0 20 10 20 10 20" Source="split_3.vti"/>
  </PImageData>
</VTKFile>

When loading the .pvti  file it is opened on all server processes, to read the meta-data. Then the extents of the dataset are distributed among the processes in a load-balanced way. The file reader splits the input extent into non-overlapping sub-extents, minimizing the number of files that need to be read for any sub-extent. Only those .vti  files that contain data within the extent assigned to a particular process are opened and loaded on that process. So a server process might need to read more than one .vti  file in order to read the sub-extent of the full dataset it got assigned, and several server processes might even need to read the same .vti  file (but read different data extents from it).

In principle, this type of data decomposition over files can be used for all types of XML-based formats: ImageData (.vti), PolyData (.vtp), RectilinearGrid (.vtr), StructuredGrid (.vts) and UnstructuredGrid (.vtu), with the corresponding meta-file named .pvti, .pvtp, etc.

Although writing the meta-file is easy, as it is usually small and only contains file references and some meta-data, writing the actual data files in e.g. .vti  format can be a bit more challenging. The format is XML-based and so cannot store binary data directly: the data needs to be stored in a verbose text-based encoding, which might be prohibitive for large datasets. The native XML file formats do support data compression, but this adds to the complexity of writing the data.

One can use VTK (or even ParaView) to write a set of files in one of the partitioned formats. See here for more information. Note that writing a set of partitioned XML files from ParaView requires the client-server setup: the number of parts written will be the same as the number of server processes, and saving can be done simply using File → Save Data.

ParaView PVD

The ParaView PVD file format is an XML-based format that can be used to load partitioned datasets, the same as with the Native XML formats, but it also allows files of different types to be loaded in one go (e.g. a 3D grid in .vti  format, plus a polygonal file in .vtp  format). Furthermore, it allows the creation of time series. The data files that can be referenced from a .pvd  file can be any of the serial XML file formats, but not legacy .vtk  files.

Data processing versus visualization (CPU vs GPU usage)

When inspecting, exploring or visualizing data in ParaView you will usually use one or more filters to process the data and visualize their output. However, most filters in ParaView operate using CPU processing only. In contrast, the visualization step for the data you want to see (in most cases the output of a subset of the filters in your pipeline) is usually GPU-based, in a process called rendering. So depending on your specific ParaView workflow you might need a lot of CPU power (lots of data processing in filters) and/or a lot of GPU power (lots of rendering). The ParaView documentation itself has a note on the balance between these two aspects:

There was once a time when rendering speed was the bottleneck for visualization. That, however, is no longer the case. The time spent in rendering is minimal, especially when compared to the time spent processing filters. The rendering speed can be throttled quite a bit before making a serious impact on visualization performance, even when running interactively. 

This implies that having enough CPU processing power for filter operations is usually the first bottleneck, before GPU rendering power becomes a problem.

With the 4-process client-server setup we've been using so far we've actually only added more CPU power to ParaView. The reason is that we haven't told the server processes how they can access GPUs, so they can only use CPU processing. When we check the Connection Information in Help → About  we can indeed see that software-based rendering (through Mesa, note the OpenGL Vendor Mesa/X.org, OpenGL Renderer llvmpipe) is used by the server processes, instead of hardware-accelerated GPU rendering:

Note that the ParaView GUI (the client) is using GPU-based rendering, as shown on the Client Information tab. We run paraview  under VirtualGL, which gives access to an NVIDIA GPU, so the rendering of the final visual output shown in the GUI is being GPU-accelerated:
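As a quick terminal-side check (a sketch; glxinfo is part of the standard X utilities and may require its own module to be loaded), you can compare the OpenGL renderer string reported with and without VirtualGL:

```shell
# Inside the remote desktop: plain glxinfo should report the software
# renderer (llvmpipe), while running under VirtualGL should report
# the NVIDIA GPU.
glxinfo | grep "OpenGL renderer"
vglrun glxinfo | grep "OpenGL renderer"
```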

One of the nice things about having the option to use CPU-based server processes is that we can scale up much further. A Snellius GPU node contains 72 CPU cores, so in principle we could start 72 pvserver processes. This will help parallel filter processing, although it might leave each process with only a very small amount of data, in which case the communication overhead between all the server processes might exceed the actual filter processing time. As usual, there's a trade-off between the amount of parallelism and the processing speed-up.

Using GPUs with the server processes

Since multiple GPUs are available in a Snellius GPU node we usually want to take advantage of those. For this, we need a slightly more verbose mpirun  line to start the server processes:

mpirun \
    -np 1 pvserver --displays :0.0 \
    :  -np 1 pvserver --displays :0.1 \
    :  -np 1 pvserver --displays :0.2 \
    :  -np 1 pvserver --displays :0.3

This provides X server display numbers to each pvserver  process, allowing them to use GPU-accelerated rendering. Checking the Connection Information in Help → About indeed now shows that OpenGL rendering is provided by the NVIDIA A100s:

Note that the command above assumes 4 server processes, each one using a single GPU (accessed through X server display :0.0 up to :0.3). It's possible to have more processes using the same GPU, by using a higher number of processes in each of the -np X  options. This will increase the rendering load per GPU, but would make more CPU power available for filter operations, while still using GPUs for rendering. You might need to experiment a bit with the number of processes (and number per GPU) to get the optimal results for your workflow.
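For example, a variant of the command above that runs two server processes per GPU (8 processes in total over the four GPUs):

```shell
# Two pvserver processes share each X display (i.e. each GPU),
# doubling the CPU power available for filter processing.
mpirun \
    -np 2 pvserver --displays :0.0 \
    :  -np 2 pvserver --displays :0.1 \
    :  -np 2 pvserver --displays :0.2 \
    :  -np 2 pvserver --displays :0.3
```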


Remote desktop only on an exclusive GPU node

The above mpirun  command examples assume we are using a GPU node exclusively, with all four A100 GPUs at our disposal for the remote desktop setup. We currently have a system configuration limitation on Snellius that interferes with running a remote desktop on a shared node (e.g. using only 1/4th of a node, in effect 1 GPU). We aim to solve this issue in the near future.

Using multiple CPU nodes

For very large datasets, using multiple nodes increases available memory and processing power. As mentioned above, filter operations are (mostly) CPU-based and relatively heavy, while the final rendering of the visualization is fairly light and can usually also be done on CPUs instead of using a GPU. One exception here is volume rendering: this benefits from GPU power, and needs a large number of CPU cores to reach similar performance.

As noted above the ParaView server processes can leverage software-based CPU rendering, when no GPUs are available in a node. This implies that we can also run the server processes on one or more CPU-only nodes on Snellius. This can be interesting for several reasons:

  • Snellius contains many more CPU nodes than GPU nodes and so CPU nodes are usually more readily available through the batch system
  • Access to CPU nodes doesn't require separate permission to use the gpu  or gpu_vis  partitions
  • Full CPU nodes are cheaper in SBUs than GPU nodes in most cases

The ParaView server is a regular MPI-based set of processes, so we can use a job script to start the server on one or more nodes. For this CPU-only usage we need to use the ParaView-server-osmesa  module. Here's an example batch script that uses a single Genoa CPU node to run 192 server processes:

#!/bin/bash
#SBATCH --exclusive
#SBATCH -p genoa
#SBATCH -t 2:00:00
#SBATCH --tasks-per-node=192
#SBATCH -N 1
module load 2023
module load ParaView-server-osmesa/5.11.2-foss-2023a-mpi

# The --force-offscreen-rendering option is needed in this case
srun pvserver --force-offscreen-rendering

Notes:

  • This example script does not contain a loop to restart pvserver in case it crashes, or you accidentally disconnect the GUI. This means that the job will end after such situations and you would need to re-submit it to get a running ParaView server again.
  • You can change this example job script to use fewer tasks, or run on a shared node instead of an exclusive node, etc.
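To address the first note, a minimal sketch of a restart loop that could replace the single srun line in the job script (an untested sketch; the job then keeps a server available until its wall-clock limit is reached):

```shell
# Restart pvserver whenever it exits, e.g. after a client
# disconnect or a crash, until the job's time limit ends the loop.
while true; do
    srun pvserver --force-offscreen-rendering
    sleep 5   # brief pause before restarting
done
```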

Running the ParaView GUI locally

The client-server mode allows you to run the ParaView client (GUI) anywhere you want, as long as it can connect to the main server process. So you could run the GUI on your laptop or PC, connecting to the server processes on Snellius. All data I/O and processing will happen server-side, while the client needs very little CPU and GPU power. This is especially nice as you don't need to transfer any data off of Snellius, which might involve long transfer times and cause local storage challenges.

Compared to running the GUI in a VNC remote desktop (as we've been using so far) running the ParaView GUI locally has some advantages:

  • You interact with the actual ParaView GUI locally, and not a remote desktop containing the GUI. Depending on your local system setup this can work somewhat better in terms of interaction and integration with your native operating system
  • No VNC viewer is needed on your local system, as the ParaView client is run instead and takes care of image transfer from the remote server processes transparently
  • No VNC server on Snellius is needed either, freeing up some resources on the compute node

But there are some possibly negative aspects as well:

  • There is a small cost to pay in added latency, caused by data needing to be transferred over the network between client and server (this will be especially noticeable when accessing the server through high-latency networks like WiFi). In principle this should be similar to working in the remote desktop, as transfer of the desktop image also involves latency, but it might be noticeable in different ways when interacting
  • You need to locally install a version of ParaView that matches the one on Snellius (e.g. ParaView 5.11.2)
  • You need to manually set up an SSH tunnel to the node running the ParaView server head process, as described below

Client installation

As noted above, the ParaView client (i.e. the GUI) needs to be installed on your local machine. Prebuilt binaries are available from the ParaView website; these contain the complete ParaView software suite. The client does not need to have MPI support compiled in, but its major and minor version (not its patch version) must match those of the server used on Snellius. For example, client 5.11.1 will work with server 5.11.0, but not with server 5.12.0. ParaView is particularly picky about this, most likely because the network protocol used is version-specific.

SSH tunnel

The ParaView client (GUI) needs to access TCP port 11111 of the head process of the ParaView server. As the Snellius compute nodes are not directly accessible from the outside world we need to tunnel the client-server connection through a login node. 

Suppose we used the job script shown under Using multiple CPU nodes above, and we get assigned node tcn801:

snellius paulm@int5 09:13 ~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           6052182     genoa pv-singl    paulm  R      32:44      1 tcn801

Now we need to set up an SSH tunnel that starts from TCP port 11111 on your local system and ends up on tcn801. You can then connect the ParaView GUI to localhost instead of the compute node and the tunnel will take care of the rest.

On your local system the command (Linux and macOS) for setting up the tunnel is

ssh -L 11111:<remotehost>:11111 <user>@snellius.surf.nl

where <remotehost>  is tcn801  in this example, and <user> is your login name on Snellius. Note that you need to keep the tunnel running, i.e. keep the terminal window in which you ran the ssh command open. There are ways to run the SSH tunnel as a background process, using the -f option, but we'll not go into that here. Note that for setting up the SSH tunnel under Windows you can use the plink command from PuTTY instead of ssh , the arguments to the command remain the same.

At this point we have the ParaView server processes running on tcn801, and have the SSH tunnel set up to go from our local system to tcn801. The final step now is to connect to the server processes from the ParaView GUI. For this, add a server entry for localhost (if not already there), as shown under Walk-through of starting client-server mode, and then connect to this server.

Server option 1: remote desktop

This is similar to what is described under "Walk-through of starting client-server mode" above, but you run the client locally instead of within the remote desktop:

  1. Start a remote desktop on a GPU node using the vnc_desktop  script from the remote-vis/git  module
  2. Start a set of pvserver  processes using mpirun 
  3. Set up an SSH tunnel to the GPU node running the remote desktop
  4. Start the ParaView client locally
  5. Connect the client to the server

Server option 2: CPU only

This is the same as described under "Using multiple CPU nodes" above.

Server option 3: EGL (not available yet)

EGL is a way to use GPU-based hardware accelerated rendering without having to use an X server. This is mostly a convenience approach, as running an X server involves extra system and security configuration, that can be avoided when using EGL.

Currently, we do not provide an EGL-based ParaView version on Snellius yet.

Remote rendering settings

Depending on the settings used in the ParaView client, it (the GUI) will either receive image output from the server processes, which only has to be displayed locally, or it will receive 3D geometry, which still needs to be rendered locally (i.e. turned into an image using the local GPU). Depending on these settings there is a trade-off between local and remote rendering performance, but letting all processing and rendering happen on the remote nodes is usually the best option for large scenes.

There are a number of settings controlling remote rendering, image transfer and compression, networking, etc. These settings can be found under Edit → Settings, on the Render View tab. Typing "image" in the Search box directly below the Render View tab is the easiest way to filter out the non-relevant settings, leaving those listed below.

Remote Render Threshold

As mentioned above, the actual rendering of the data to visualize (i.e. producing the image visible in the GUI viewport) might still be performed locally on your client machine. The Remote Render Threshold influences where the final geometry is rendered. If the total size of the geometry is below the threshold, the geometry to display is sent to the client and rendering is performed locally. If the size is above the threshold, the geometry is rendered server-side and the resulting image is sent to the client. In both cases the process is transparent to the user.

Setting the threshold to 0 will force remote (i.e. server-side) rendering in all cases. Setting it to a higher value allows local rendering for smaller geometry, but the value is somewhat non-intuitive, as there's no easy insight into the size of the data transferred, so a bit of experimentation with different values is usually needed.

Network connection, image compression, ...

Depending on the type of network connection you are using (mostly at the local end) there are some presets to influence image compression, color depth and other features related to sending the remotely rendered image to the client. ParaView provides a set of presets under Apply presets for ... connection, e.g. Gigabit Ethernet and consumer broadband/DSL. Again, you might have to try out different presets to get the required interaction performance that you like, with your current network setup.

Image reduction factor

Rendering a single frame of the data can take quite a bit of time, usually too much for smooth interaction, which needs roughly 10 frame updates per second or more. Secondly, during interaction, such as rotating and zooming, a lower-quality rendering (in resolution, color depth, etc.) is usually enough to orient the model, after which the full image quality is restored once interaction is done.

ParaView allows one to control the resolution of the image during interaction with the Image Reduction Factor. This is the amount of pixel subsampling used during interaction. Here you can see the subsampling (and color reduction) in action, with left the full-resolution image when not interacting and right the reduced-resolution image shown during interaction:

Verifying remote rendering

To verify that you are indeed using remote rendering you can set the remote rendering threshold to 0 (i.e. always render remotely) and set the Image Reduction Factor (see above) to something like 8 pixels or more. When you interact with the visualization in the client, for example, by rotating the model, the rendering should switch to a low-resolution version. When interaction is done, by releasing the mouse buttons, a full-resolution image will then be computed and displayed. If you indeed see this behaviour then remote rendering is being used.