Many research projects need to conduct interviews and analyze them. More often than not, the content of the interviews is sensitive personal data that must be handled according to the applicable laws and the institute's policies. In practice, this means that there must be no copies of the data on local devices, and the data should only be accessible to active project members. The researchers must have full and exclusive control throughout the lifecycle of the data. The Secure Interviews ecosystem on Research Cloud is built to support researchers in generating and handling interview footage. One part of this support is safe, automatic transcription of the recorded speech.


It starts with SRAM

To define who is entitled to collect and view footage, a collaboration in SRAM is set up. Admins and developers in the collaboration have more access rights. Users should only be added to the groups "src_co_admin" and "src_co_developer" when they are trusted with handling the data outside the safe boundaries of the collaboration.

You do need one or two users in the "src_co_admin" and "src_co_developer" groups to set up the SRAM collaboration for Secure Interviews.

In the SRAM portal, SURF Research Cloud and the "WiLLMa Authentication Service" should be added to this collaboration in its "Applications" tab.

Also, users who are not src_co_admin users should be placed in the group src_no_ext_apps. That way, it will not be possible to extract data with fabricated custom catalog items.

Configuration

In the Research Cloud portal, the collaboration still needs some specific configuration ("Secrets") that is described on the pages of the individual catalog items and components. There is an overview of the configuration at the bottom of this page.

Generating the interview footage

To conduct interviews and collect the footage, a workspace of the "SI - Secure Interviews" catalog item is used. The catalog item is built around the online web-conferencing application Big Blue Button. Big Blue Button is an Open Source product that can be run on premises completely - in this case exclusively in SURF's own HPC Cloud datacenter. Big Blue Button has been designed for education purposes and complies with strict privacy legislation for youth.

Big Blue Button is rather intuitive to use. If you have ever used online conferencing software, you will be up and running in a minute. Big Blue Button has its own system of user accounts. This means that project members who only conduct interviews need nothing more than an account on this Big Blue Button server to do their work. They don't have to be members of the SRAM collaboration.

For all recorded interviews, an automatic transcription will be triggered. The transcription is done by SURF's own WiLLMa service. This service does not store any input data and consequently also does not use the data for re-training. Whatever goes into or out of WiLLMa will be "forgotten" by the service.

The "Secure Interviews" catalog item should be cloned, and its use should be restricted to the project's collaboration. This way, it is possible to control which users can start and access which workspaces.

The technical documentation of this catalog item can be viewed here.

Refer to the README.md there to see how you can start and maintain a workspace derived from "SI - Secure Interviews".

Upload legacy and offline footage

Collecting footage online may not always be possible, and some footage may already exist. To upload existing files and have them processed (transcribed, encrypted, stored) you can use a workspace of the "SI - Media Upload" catalog item. It is a "headless" server that only src_co_admin members of the collaboration can use. With a local sftp client like Cyberduck, they can upload footage.

From that moment on, the footage is safe within the system. Of course, the remaining local copies will still have to be dealt with responsibly.

The "SI - Media Upload" catalog item should be cloned, and its use restricted to the project's collaboration.

The technical documentation of this catalog item can be viewed here.

Storing the footage

All catalog items of the Secure Interviews ecosystem make use of the "Research Drive by Link" component.

It creates a user-independent connection to Research Drive, based on a Research Drive "Public Link".

Actually, the usual, "personal" Research Drive connection is disabled for these workspaces.

The component can be configured through the catalog item to encrypt the stored data. For Secure Interviews, there should be a folder "raw" under the folder that is referenced by the Public Link.

This is where all the recorded and transcribed footage is stored.

Before applying any changes to files, the files should be copied elsewhere to keep the raw data unchanged.
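A working copy can be made with a few lines of Python. The following is a minimal sketch, not part of the Secure Interviews tooling: the function name and the folder layout are assumptions chosen for illustration. It only reads from the raw folder and refuses to overwrite an existing working copy.

```python
import shutil
from pathlib import Path


def make_working_copy(raw_file: Path, work_dir: Path) -> Path:
    """Copy one file from the raw folder into a working folder.

    The raw file is only read. An existing working copy is never
    overwritten, so earlier edits are not lost by accident.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    target = work_dir / raw_file.name
    if target.exists():
        raise FileExistsError(f"working copy already exists: {target}")
    shutil.copy2(raw_file, target)  # copy2 also preserves timestamps
    return target
```

Both arguments are ordinary paths, for example a file under the "raw" folder and a separate working folder next to it.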

Transcription

An extra script on the Secure Interviews server sends the recorded media to SURF's WiLLMa service for speech-to-text transcription.

The resulting *.json file is stored next to the original media file under the "raw" folder on Research Drive.

It can be used to start a Whisper Corrector session in a Media Viewer workspace.

To make the transcription human-readable without further tools, a flat text file (*.txt) is created, too.

It provides time codes and marks uncertain words in the transcription with "[?]".

However, the entire transcription should still be checked for completeness and precision.
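To illustrate how a flat text file with time codes and "[?]" marks can be derived from a transcription, here is a minimal sketch. It is not the script that runs on the Secure Interviews server. The JSON layout (a "segments" list with a "start" time and per-word "probability" values) is an assumption modeled on Whisper's word-level output, and the threshold of 0.5 is an arbitrary example value.

```python
def format_timecode(seconds: float) -> str:
    """Render a number of seconds as HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def transcript_to_text(transcript: dict, min_probability: float = 0.5) -> str:
    """Turn an (assumed) Whisper-style transcript into timecoded text lines."""
    lines = []
    for segment in transcript["segments"]:
        words = []
        for item in segment["words"]:
            word = item["word"].strip()
            # Mark words the model was unsure about.
            if item["probability"] < min_probability:
                word += "[?]"
            words.append(word)
        lines.append(f"[{format_timecode(segment['start'])}] {' '.join(words)}")
    return "\n".join(lines)


# Hypothetical input, only to show the shape of the assumed structure.
example = {
    "segments": [
        {"start": 0.0, "words": [
            {"word": " Hello", "probability": 0.98},
            {"word": " wrld", "probability": 0.31},
        ]},
        {"start": 65.2, "words": [{"word": " Thanks", "probability": 0.91}]},
    ]
}
print(transcript_to_text(example))
# [00:00:00] Hello wrld[?]
# [00:01:05] Thanks
```

A mark only signals low model confidence. Unmarked words can be wrong as well, which is why a full check remains necessary.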

Rules of use

SURF's WiLLMa service is still in the beta phase. The transcription feature is at present based on SURF's own Whisper instance.

Because the service is in beta, a couple of rules and constraints apply that you need to be aware of: WiLLMa Rules of use

Viewing and analysis

It is a challenge to access and evaluate the collected footage without downloads and local copies. This is where the "SI - Media Viewer" workspace comes in.

It is a Windows virtual machine that enables the user to view the footage, create working copies of the data (on Research Drive) and improve the transcriptions.

The workspace can read and decrypt the data from the Research Drive link, but it cannot access the internet. This also prevents downloads and copying to locations outside the collaboration.

To make the correction of transcriptions more efficient, the application "Whisper Corrector" is installed by the Media Viewer catalog item.

There is an online user manual for Whisper Corrector.

The "SI - Media Viewer" catalog item should be cloned, and its use restricted to the project's collaboration.

Data lifecycle

The Secure Interviews system supports projects with handling the data responsibly. However, when the project is finished, the data is going to be published or archived. The "SI - Data Tap" catalog item exists to get the unencrypted data out of the system. Only src_co_admin members can use it.

Just like an SI - Media Viewer workspace, the Data Tap can decrypt the data that is on Research Drive. The difference with the SI - Media Viewer is that the Data Tap does have internet access.

In the profile tab of the Research Cloud portal, the collaboration admin can also specify a Research Drive or WebDav connection for the collaboration. Through this connection, the unencrypted data can be written to Research Drive or another WebDav service.


The Research Drive or WebDav connection can be found in your home directory: ~/researchdrive or ~/webdav, respectively.

You can find the project's collected data in your home directory, too: ~/data/project/...

The admin of the project can also transfer the data, for example to an archive for long-term preservation. Wherever the data goes, other security measures should be in place, because the data is no longer protected by encryption.
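After a transfer it can be useful to confirm that every file arrived intact. The sketch below is one possible way to do that with SHA-256 checksums. It is not part of the Data Tap, and the function names are chosen for illustration only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file below root (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def differing_files(source: Path, destination: Path) -> list[str]:
    """List source files that are missing or different at the destination."""
    src = build_manifest(source)
    dst = build_manifest(destination)
    return sorted(name for name in src if dst.get(name) != src[name])
```

Here, source would be the folder with the project's collected data and destination the copy under ~/researchdrive or ~/webdav. An empty result means all source files were found with identical content.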

The "SI - Data Tap" catalog item can be used directly, without cloning. Only admin users are able to start it or log in to it.

Roles of team members in the workflow

The team consists of

  • admins who administer the team's user accounts on the Big Blue Button server
  • interviewers who plan and conduct interviews using Big Blue Button
  • transcribers who transcribe interviews using a SI - Media Viewer workspace
  • analysts who analyse interviews (text-based and/or by viewing interview footage)
  • data stewards who administrate the data on Research Drive
  • maintainers who maintain the storage and compute infrastructure on SURF Research Cloud


Only the maintainers need to have a little experience with using a Linux terminal.

Of course there will be team members who fulfill more than one of those roles.


Admins

The first "admin" user account is created by a maintainer when setting up the SI - Secure Interviews workspace. This is described in the installation chapter.
Other admin accounts can then be created through the Big Blue Button admin page.

Admin actions are
- Create user accounts for interviewers
- Oversee created video conferencing rooms and recordings
- Change Big Blue Button settings, if required.

Interviewers

A user with admin rights on the Big Blue Button application can create an account for an interviewer.
An initial password can be set and has to be changed by the interviewer after the first login.

Block the launch of unsafe catalog items

Make sure that, in the SRAM collaboration of the project, "mere" interviewers, transcribers and analysts are members of the collaboration's "src_no_ext_app" group.

Only admins, maintainers and data stewards should stay out of this group.

This prevents interviewers from starting a workspace that could circumvent the data encryption and the download block.
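The membership rule can be checked mechanically once the group memberships are known. The sketch below is a minimal illustration, assuming the memberships have been copied by hand from the SRAM portal into a dictionary. The function name, the user names and the extra group "interviewers" are hypothetical, and no SRAM API is used.

```python
def members_missing_block(
    groups: dict[str, set[str]],
    block_group: str,
    exempt_groups: tuple[str, ...] = ("src_co_admin",),
) -> list[str]:
    """Return users who are neither exempt nor in the blocking group."""
    everyone = set().union(*groups.values())
    exempt = set()
    for name in exempt_groups:
        exempt |= groups.get(name, set())
    blocked = groups.get(block_group, set())
    return sorted(everyone - exempt - blocked)


# Hypothetical membership data, typed in by hand.
groups = {
    "src_co_admin": {"alice"},
    "src_no_ext_app": {"bob"},
    "interviewers": {"bob", "carol"},
}
print(members_missing_block(groups, "src_no_ext_app"))
# ['carol']
```

Every user in the result should still be added to the blocking group.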


It is a good idea to have the name of a room refer to the interviewed person or the case-number inside the project. Later in the flow, the room name and the start time and date of the interview will be used to identify the recording(s). The room is temporary and can be deleted manually after use.

Transcribers/Analysts

Anyone who is to view the interview footage must be a member of the SRAM collaboration. They log in to the viewing workspace with their SRAM user name and a time-based password. 

Actions are

- Make or edit transcriptions of conducted interviews
- View the interview footage for research

Viewing interviews for transcription or research does not take place on the interview workspace. The recordings only remain there for a short time. For viewing, another workspace is started on Research Cloud: a workspace derived from "SI - Media Viewer". It retrieves the data from Research Drive.

Data Stewards

The data steward is a power user of SURF's Research Drive service. Their credentials and rights are created and administrated there.

Actions are:
- Create a path where recordings and transcriptions can be written to, from the Big Blue Button server
- Create paths where transcribers and analysts can write their working copies and results to

Maintainers

Maintainers are admin members of the SRAM collaboration.


Responsibilities are:
- Manage collaboration members (maintainers, transcribers and analysts)
- Start and maintain the Big Blue Button server and viewing workspaces.
- Create the initial admin user on the Big Blue Button Server

Configuration of the SRAM-collaboration

To configure the collaboration for Secure Interviews, go to the "Profile" tab of the Research Cloud portal.

Scroll down until you see the collaboration that will be used for the project.

Expand the display of the collaboration and select the "Secrets" tab.

Add the following secrets:

- rd_by_link_encrypt_secret_1: A random string of some reasonable length. A UUID is a good idea.
- rd_by_link_encrypt_secret_2: Another random string.
- rd_by_link_pass: The password of the Public Link that is created in Research Drive.
- rd_by_link_url: The URL of the Public Link, with the random username string removed and the path changed to "public.php".
  Example: https://my_institution.surfsara.nl/public.php/webdav/
- rd_by_link_user: The "random" string that trails the original URL of the Public Link.
- secure_interviews_willma_token: The authentication token as created at https://willma.soil.surf.nl/
  Adding the "WiLLMa Authentication Service" to the collaboration enables you to create a token/API key there.
  Please bear in mind that creating a new token invalidates any token you created earlier.
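The Research Drive related values can be prepared with a short script before entering them in the "Secrets" tab. The following is a sketch under assumptions: it presumes that the Public Link ends in the random string (for example https://my_institution.surfsara.nl/s/AbC123xyz, a made-up link), and the function names are hypothetical. The WiLLMa token is left out because it is created on the WiLLMa site.

```python
import uuid
from urllib.parse import urlsplit


def split_public_link(public_link: str) -> tuple[str, str]:
    """Derive (rd_by_link_url, rd_by_link_user) from a Public Link.

    Assumes the random string is the last path segment of the link.
    """
    parts = urlsplit(public_link)
    user = parts.path.rstrip("/").rsplit("/", 1)[-1]
    url = f"{parts.scheme}://{parts.netloc}/public.php/webdav/"
    return url, user


def draft_secrets(public_link: str, link_password: str) -> dict[str, str]:
    """Collect the Research Drive related secrets in one dictionary."""
    url, user = split_public_link(public_link)
    return {
        # Two independent random strings; UUIDs as suggested above.
        "rd_by_link_encrypt_secret_1": str(uuid.uuid4()),
        "rd_by_link_encrypt_secret_2": str(uuid.uuid4()),
        "rd_by_link_pass": link_password,
        "rd_by_link_url": url,
        "rd_by_link_user": user,
    }
```

Since the encryption secrets are generated anew on every call, the values that were actually entered in the portal should be kept in a safe place rather than regenerated.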

Configuring cloned catalog items for your collaboration

The catalog items "SI - Secure Interviews", "SI - Media Viewer" and "SI - Media Upload" should be cloned and made exclusive catalog items of the project's collaboration.

When cloning, make sure that the parameter "rd_by_link_use_encryption" is set/overwritten to "true", otherwise files on Research Drive will not be encrypted.

If you have never cloned or edited catalog items, please refer to this page.