Yoda hands-on first steps

This tutorial allows you to exercise the typical Research Data Management flows that YODA supports through its web interface.

YODA (from now on, Yoda) is a piece of software that runs on top of iRODS, and it facilitates certain flows when operating with datasets within the Research Data Management (RDM) discipline. The typical situation that you may encounter is that an IT department will offer a Yoda instance for your institution, which you can access through Yoda's web interface.

As a brief summary of Yoda's concepts: Yoda allows multiple users to cooperate on data management in so-called groups. A dataset consists of files brought together within a folder, and it is at this folder level that you attach the metadata. In terms of RDM best practice, that metadata belongs to the dataset; it is an intrinsic part of it. Metadata consists of pieces of information that accompany the data itself, not only to describe the data but also to let others find it again by searching through that information. The main idea behind this cooperation is that scientists can gather data and share it with others, each working at their own pace. Years after a dataset has been put together, somebody else can thus find the data and reuse it, possibly in novel research projects.

Yoda facilitates all this interaction through its web portal. In the following sections we will be showcasing and exercising these flows, at times as though you were a scientist and data steward.

0. Connecting to Yoda

Every participant in the course today will have access to these same folders, so you are encouraged to behave cordially. The intention is that you interact with each other as you naturally would within a research project. If you have any objection to this, please let the course facilitators know now.

Let us enter the Yoda server now.

Open a browser window. Navigate to the Yoda server URL given below.
Log in with your credentials
Click on the Research tab
Verify that you see two folders: research-train-oct23 and research-train-unsuperv-oct23
- These two folders will represent two different groups that you will be working in
- Try to remember that we will start working in the unsupervised group, and only in the last exercise we will move to the other one

Yoda server to work in today:

https://scuba-yoda.irods.surfsara.nl

Credentials will be given to you during the course.

1. Finding and reusing existing data

In this scenario we are going to pretend that you are a researcher who knows a dataset exists in the Vault of the Yoda instance you are working in. You only know some information about that dataset (i.e.: some metadata), but you know you want to use the data in the dataset.

By the end of the exercise you will know how to search for datasets in Yoda and work with the data in a dataset. You will also learn the difference between the Research area and the Vault.

1.1 Preparing a working place in the Research space

For the exercises today, we will be alternating between the Research and the Vault tabs. The idea behind those two tabs is that there is one working area, separated from a frozen area.

The working area is where you perform all the usual data and metadata managing. Think of operations related to adding, removing, modifying or shuffling files around (for the data), and also analogous operations for the metadata. The frozen area is a place to make datasets unmodifiable, sort of snapshots if you will, so that they can be safely used by others.

In Yoda's terms, the working area is the Research tab from the menu, and the frozen area is the Vault.

When you log into the portal of a Yoda instance (or Yoda server, or, from now on, simply Yoda), you are presented with a screen showing a menu bar at the top. That menu bar lets you choose between at least two tabs, namely Research and Vault.

We will now pretend that you are working on a project of your own. You will therefore need to create a folder for that project. Since you ultimately want to work actively with a dataset in this exercise, you will now create that folder in the Research area. Here are the steps you need to follow:

In the Yoda portal, click on the Research tab of the main menu.
You can see a list of folders. These are the groups that you are a member of. From the list of folders that appear, click on the research-train-unsuperv-oct23 folder.
You will create a folder here to represent your project. Do so by clicking on the Create Folder button.
Give the new folder a name similar to "Project X", where X should be something that you will be happy to work with, like "Project Peter" or "Project Flamingos". Please remember what you choose, because the other course attendees will be creating their own folders here too, so you may see a different list every time you look.

You are now set to go! This project folder will be the place where you will import the dataset that you are now going to search for during the rest of this section.

1.2 Searching for a dataset in the Vault

In Yoda you can search in the Vault for datasets that have been placed there by you or others within your same groups. Yoda requires that you choose in which parts of a dataset you want to search, and the options it offers include:

search by file name
search by folder name
search by metadata
search by status

As explained at the beginning, you are now pretending to be a scientist in need of a dataset that you know exists in this Yoda instance. The information you have is scarce: it involves a picture taken over the Gulf of Biscay. That is precisely what you need for your research! Let us find it now.

a) Search by filename

In the Research tab of the Yoda portal, see that there is a search box at the top
Type in the search box a word or words that you think are reasonable for the little information you have about the dataset, such as: "ocean" or "Biscay" or "Atlantic"
Hit "Enter" or click on the magnifying glass button to the right of the search box

The result is probably going to disappoint you: you will not find anything. By default, searching this way performs a "Search by file name" (see that "search by filename" is selected in the drop-down list to the left of the search box). Let us try a different search method.

b) Search by folder

To the left of the search box of the previous search, choose now "Search by folder".
Type in the search box a word or words that you think are reasonable for the little information you have about the dataset, such as: "ocean" or "Biscay" or "Atlantic"
Hit "Enter" or click on the magnifying glass button to the right of the search box

The result is probably going to disappoint you this time as well: you will not find anything. Let us try yet a different search method.

c) Search by metadata

To the left of the search box, choose now "Search by metadata".
Type in the search box a word or words that you think are reasonable for the little information you have about the dataset, such as: "ocean" or "Biscay" or "Atlantic"
Hit "Enter" or click on the magnifying glass button to the right of the search box

Voilà! You should now have at least one result. However, how do you know which one is the right one? You will have to bring it to your working area in order to inspect it.
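To see why the first two searches came back empty, it can help to picture each dataset as a record with file names, a folder name, and free-text metadata. The sketch below is only an illustration with made-up records; it is not how Yoda implements its search, and the names `Dataset` and `search` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A made-up record; real Yoda datasets carry far more information."""
    folder: str
    files: list
    metadata: dict = field(default_factory=dict)


def search(datasets, term, scope):
    """Return the datasets whose chosen scope contains the term (case-insensitive)."""
    term = term.lower()
    hits = []
    for dataset in datasets:
        if scope == "filename":
            haystack = dataset.files
        elif scope == "folder":
            haystack = [dataset.folder]
        elif scope == "metadata":
            haystack = list(dataset.metadata.values())
        else:
            raise ValueError(f"unknown scope: {scope}")
        if any(term in text.lower() for text in haystack):
            hits.append(dataset)
    return hits


# Hypothetical example: nothing in the file or folder name hints at the subject.
catalogue = [
    Dataset(
        folder="fieldwork-2021",
        files=["IMG_0042.jpg"],
        metadata={"title": "Aerial picture", "location": "Gulf of Biscay"},
    ),
]

by_name = search(catalogue, "Biscay", "filename")
by_folder = search(catalogue, "Biscay", "folder")
by_metadata = search(catalogue, "Biscay", "metadata")
print(len(by_name), len(by_folder), len(by_metadata))
```

Only the metadata search matches here, because the descriptive words live in the metadata rather than in the file or folder name.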

1.3 Importing a dataset to the Research space

You are now going to import the dataset you found in the Vault into the project folder that you created in the Research area a few steps ago. Remember? You called it Project <something>.

From the list of results of your search, click on the one that you want to work with (a hint: perhaps the one with the latest modification date, or decide after viewing the contents)
Note that you are now in the Vault tab from the main menu. That is because you are working with a dataset that was brought to the Vault as a way to "share it in an unmodifiable state".
Click on the Metadata button. Can you now answer some of the questions listed at the end of this section? For example: can you now explain why you were not able to find the dataset when searching by file name or by folder, but you were when you searched by metadata?
Click on the "Close" button of the pop-up that is displaying the metadata. You should be seeing the folder contents again.
In order to import the dataset into your Project folder, click on the Actions button now. Then select the option that reads: "Copy datapackage to research space".
A pop-up will appear displaying your groups. Choose the unsupervised folder. Then choose your Project folder within it.
When you have selected your Project folder, click on the button "Copy package to research area".

1.4 Working with the dataset

You have now found and imported a dataset from the Vault into your Project folder. Let us simulate that you reuse the data by looking at the picture!

In the Yoda portal, click on the Research tab from the main menu
Navigate to your Project folder by clicking the unsupervised folder, and then click on your Project folder.
You should now see a new folder in your Project folder, whose name includes a large number between square brackets. This number is a Unix epoch, which you can consider to be a timestamp indicating when you made a copy of the folder. This notation prevents unexpected overwrites when moving datasets around.
Click on the new folder name. You will see it has another folder within it called original, and there is a yoda-metadata[epoch].json file there as well. If you click the Metadata button, it will be empty. How come!? Well, look in the original folder instead. Click, therefore, on the original folder.
You should now see a picture file and a yoda-metadata.json file as well.
Click on the Metadata button now that you are in the original folder. You should now see a lot of metadata fields, and you can even modify them! This metadata should be the same you saw before you imported the dataset into your working area.
You can click on the "Close" button of the form to go back to the list of files.
In order to simulate using the data, you can now click on the three dots to the right of the picture, and select View. That will display the picture in a pop-up. Alternatively, you can click on Download to simulate that you save a dataset onto your laptop's hard drive.
Lastly, select the three dots to the right of the yoda-metadata.json file, and click Download (note that there is no preview for .json files). You can open this .json file on your laptop with your favourite text editor. Can you identify any of the information there? Exactly! It is the same metadata as you see when you click on the Metadata button of the folder. Handy, right? This way you will always have the metadata along with the data, in a machine-workable format!
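The number between square brackets in the copied folder's name, mentioned in the steps above, can be turned into a readable date. This is a minimal sketch, assuming the folder name contains the epoch in seconds between square brackets; the folder name used here is invented for illustration and the helper name `copy_time_from_folder` is hypothetical.

```python
import re
from datetime import datetime, timezone


def copy_time_from_folder(name):
    """Extract the bracketed Unix epoch from a folder name as a UTC datetime.

    Returns None when the name carries no bracketed number.
    """
    match = re.search(r"\[(\d+)\]", name)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


# Hypothetical folder name; yours will differ.
copied_at = copy_time_from_folder("my-dataset[1697790000]")
print(copied_at.isoformat())
```

Comparing the printed date with the moment you clicked "Copy package to research area" is a quick way to confirm which copy you are looking at.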

⟡ ⟡ ⟡

You have now completed this section. Feel free to move on to the next exercise at your own pace, but make sure you have answered the questions below to verify that you have found the intended dataset. More generally, after this exercise you should understand the flow to find and reuse data in Yoda.

Flow

Questions to answer throughout this section:

What is the file name of the picture?
What is the folder name of the picture?
When was the picture taken?
Who took the picture? What is their affiliation?
Which three location tags have been given to the picture?
What does the picture show (i.e.: can you describe what the photograph has captured)?

Food for thought:

What is the name of the root folder of the dataset? Is this folder in any way related to the name of any of the folders you can see in the Research area? How is it related?

2. Unsupervised RDM cycle

In this scenario we are going to pretend that you are a researcher who has been collecting data, and wants to freeze it in the Vault. You will be assuming the role of a seasoned data practitioner who will complete the process on their own, without any data steward's supervision or intervention.

By the end of the exercise you will know how to advertise a dataset in Yoda's Vault, taking care of the metadata and the data.

You can work with any dataset you may have already on your laptop, or you can pretend that you have one of your own by downloading something from the Internet. For example, you can use the Data portal from the Gemeente Amsterdam to search for data that may appeal to you (please verify the dataset's license before you use it!): https://data.amsterdam.nl

2.1 Preparing another working place in the Research space

Remember that you have your own Project folder within the unsupervised folder in the Research space. You may still have the folder with the data that you found in the previous exercise. You will now be working with a different dataset, so it is best to create a new working folder directly in your Project folder.

Please do so now. Give it a suitable name for the dataset you will be working with. We will refer to this new folder as the dataset folder during this exercise. To accomplish this, you can follow steps analogous to those of the previous exercise.

2.2 Filling in the metadata

The dataset's metadata is crucial when you are working within RDM best practices. It will ensure that your dataset is reusable in the future. It is therefore best to start with it, even before the data exists in Yoda. Let us tackle that right now.

In the Yoda interface, navigate to the unsupervised folder, then your Project folder, and then to the dataset folder that you have prepared for this exercise.
Once you are in the dataset folder, click on the Metadata button to start editing the metadata.
Now take all the time you need to think about what is reasonable metadata, and make sure you write plenty of it. Recall the feeling when you were searching for data in the previous exercise.
1. For inspiration, what would have helped you to be more effective in finding the dataset? Apply that now so that others can find your dataset both when they know it is there, and when they do not know it is there. This last case describes a data discovery scenario.
2. If you are working with a dataset which is published somewhere else (e.g.: on the Gemeente Amsterdam data portal), you can draw ideas from the metadata that you already see in that portal.
3. For datasets that involve spatial or temporal information, make sure you fill in appropriate intervals and location descriptors. You may look at the previous exercise's dataset to see how you can include multiple location descriptors.
4. Think of the data policies from your research field or your institution. How could you use the form to add metadata that will fulfill those policies' requirements?
5. Pay special attention to filling in a reasonable value for the Version field of the metadata. We recommend that you enter a number. Remember the value you enter.
Once you are ready, click on the Save button. If the form is still open, you may want to scroll all the way up and then go down slowly while you check for error messages asking you to fill in mandatory fields.
You can now see that there is a file in your folder called yoda-metadata.json. That is where Yoda stores the metadata in a format that you can bring along as a companion to the actual data.
If you have any colleagues in the course, now would be a good moment to ask them to verify your metadata and engage in a little discussion to see if you agree on what you have written.
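Because yoda-metadata.json is plain JSON, you can inspect a downloaded copy with a few lines of code. The following is a minimal sketch, assuming the file was saved next to the script and that its top level is a JSON object; it makes no assumption about which fields your metadata form defines and simply lists whatever it finds. The helper name `summarise_metadata` is hypothetical.

```python
import json
from pathlib import Path


def summarise_metadata(path):
    """Return {field name: type name} for the top-level fields of a metadata file."""
    with open(path, encoding="utf-8") as handle:
        metadata = json.load(handle)
    return {key: type(value).__name__ for key, value in metadata.items()}


# Assumed location of the downloaded file; adjust the path to your own download.
metadata_file = Path("yoda-metadata.json")
if metadata_file.exists():
    for field_name, kind in summarise_metadata(metadata_file).items():
        print(f"{field_name}: {kind}")
else:
    print("Download yoda-metadata.json into this folder first.")
```

Fields reported as `list` are the ones that can hold several values, which is a useful thing to check with a colleague when you review each other's metadata.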

2.3 Uploading data

Now that you have the metadata, you can upload files with actual data from your laptop. For this exercise and simplicity's sake, it will be enough to upload one or two files no larger than a few megabytes as though they are a full dataset; adding more would be overkill today.

In order to upload a file:

In the Yoda interface, navigate to the unsupervised folder, then your Project folder, and then to the dataset folder that you have prepared for this exercise.
Once you are in the dataset folder, click on the Upload button.
Your browser's file exploring dialog will pop up. You can navigate there through your laptop's folders to locate the data files you want to upload. Locate those files.
Double click on the file you want to upload. Yoda will display a progress bar which will be filling up as the file uploads. When the upload is ready, you will see an OK.
Close the progress bar dialog, and you will be back on your dataset folder. You should now see your file listed there.
If you want to play with a multi-file dataset, you can repeat the upload process with a second file.

Now you have made the files available to Yoda in the Research space. You are ready to freeze the dataset and make it available for others to use within this Yoda instance!

2.4 Submitting the dataset to the Vault

Now that you have a dataset which includes its metadata and the data itself, you can initiate the flow that will place the frozen version of the dataset in the Vault. This is going to be an unsupervised process during this exercise, simulating a situation where you are an expert data practitioner. This means that, whenever you start the process, the dataset will reach the Vault directly.

In the Yoda interface, navigate to the unsupervised folder, then your Project folder, and then to the dataset folder that you have prepared for this exercise.
Once you are in the dataset folder, click on the Actions button, and select the Submit option.
If you refresh the page in your browser, you will see that a yellow label appears next to the folder's name, going through a few different status transitions. After a few minutes (depending on the file sizes and how busy the server may be), the yellow status label should read "Secured".

Congratulations! You have just successfully placed a frozen version of your dataset in the Vault.

2.5 Deleting the working copy of the dataset

To simulate a real situation, you can now rely on the Vault to keep your dataset for you, so you can remove it from your working area. If you try to remove the dataset folder directly, Yoda may complain indicating that it is not empty. In that case, you will have to delete the files inside it first.

In the Yoda interface, in the Research tab, navigate to the unsupervised folder, then your Project folder, and then to the dataset folder that you have prepared for this exercise.
You can now click on the triple dots button to the right of a file, and click Delete. You will have to delete all of the files there.
When you are finished deleting all the dataset folder's files, you can move up one folder to reach the Project folder (you can use the breadcrumbs bar above the folder's name).
Once you are in the Project folder, you can click on the three dots button to the right of the dataset folder, and select Delete. Accept the verification step.

Done! The working copy of the dataset is now history. Long live the dataset in the Vault!

2.6 Recovering the dataset from the Vault

Now we are going to pretend that a year has passed since you last worked with your dataset. In the meantime you have decided that you want to add a new file to the dataset describing something related to the procedure (e.g.: a README file). In your view, this is simply a version upgrade, so you should reuse the same original dataset. For that, you will need to make a working copy of the version that you stored in the Vault a year ago.

Bring now a copy of the dataset from the Vault to your Project folder, following the same steps you applied during the previous exercise. You can locate your dataset by searching for it, or by navigating to the Vault tab, then the vault-unsupervised folder, and scrolling through the datasets that may be there.

You will have completed this exercise once you can see a dataset folder within your Project folder in the Research area.

2.7 Modifying the dataset as a new version

Navigate to the dataset folder within your Project folder in the Research area. You will see that if you click the Metadata button the metadata will be empty. Remember: you will have to navigate into the folder called original. Then you can edit the metadata (i.e.: the metadata will be there already). Increase the version number now, and save the changes to the metadata.

Prepare a new README file in your laptop that you want to upload into this dataset folder. Upload it now to the original folder.

2.8 Submitting the new version to the Vault

As you can see you have had to work in the original folder, but that is likely an unsuitable name for any worthy dataset. The proper name will be that of the original dataset folder. Rename now the original folder to that of the dataset folder (you will need to step out of the original folder and use the three dots button next to it to find the Rename option).

Now you can submit to the Vault this new version of your dataset.

You will have completed this exercise once you can see your two dataset folders named the same in the Vault.

⟡ ⟡ ⟡

Well done! You have now completed this section. Feel free to move on to the next exercise at your own pace, but make sure you have considered the questions below to verify that you have understood the unsupervised flow to secure datasets in the Vault.

Flow

Food for thought:

You must have realised by now that proper metadata management is key, but also very difficult to do well. Yoda simplifies this effort a bit by allowing metadata to be added only to folders. Can you think of situations where this approach will feel like a limitation rather than a blessing? How would you tackle those?
Yoda also simplifies metadata management by allowing you to fill in a nice predefined form. Could you think of a need for your institution to customise that form? Or maybe customise the form per research discipline? Can you find something in the Yoda documentation that points to where this could be arranged? (hint: metadata schemas)
When you are defining metadata in the current form, you can probably see that there is a field for tags. What are these useful for?
What is, in your own words, a good definition of the Vault? What is it useful for?

3. Supervised RDM cycle

In this scenario we are going to recreate the same steps as in the previous exercise, but you will be working in a group that requires that a data steward approves your dataset before it is allowed to reach the Vault. You will get the chance to be the scientist, but also the data steward.

By the end of the exercise you will know how the interaction between a scientist and a data steward can lead to a dataset being placed in Yoda's Vault.

3.1 Preparing another working place in the Research space

Follow the steps from the previous two scenarios to create a new Project folder and a new dataset folder in the Research space, but take care to use the folder called research-train-oct23 this time.

3.2 Filling the metadata and uploading data

Follow the steps from the previous scenario to provide metadata for the dataset folder.

Follow the steps from the previous scenario to upload data files to the dataset folder.

3.3 Submitting to Vault

Follow the steps from the previous scenario to submit the dataset folder to the Vault. You will see that the yellow label next to the folder now remains in status Submitted. This is where the data steward comes in.

3.4 Act as a data steward

Please get in touch now with the facilitators. They will give you instructions on how to work (possibly, together with a fellow participant) in order to simulate that you interact with a data steward to:

get your dataset to the Secured status, as expected, and
exercise your data steward role

In short, the steps that you will have to fulfill as a data steward are:

Open the submitted folder from a classmate
Find out the submitter's e-mail address by looking at the provenance information of the submitted folder
Send them an e-mail requesting a specific piece of metadata
Reject the submission
Wait for the submitter to send the submission again
Verify that you now have the expected metadata
Approve the submission
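The exchange between scientist and data steward listed above can be pictured as a small state machine. The sketch below is an illustrative model only: "Submitted" and "Secured" are the labels named in this tutorial, while "Editable" and "Rejected" are assumed names, and the portal may show additional intermediate statuses that are left out here.

```python
# Allowed moves in this simplified model: (current status, action) -> next status.
# "Editable" and "Rejected" are assumed labels, not taken from the Yoda portal.
TRANSITIONS = {
    ("Editable", "submit"): "Submitted",
    ("Submitted", "reject"): "Rejected",
    ("Rejected", "submit"): "Submitted",
    ("Submitted", "approve"): "Secured",
}


def apply_action(status, action):
    """Return the status that follows an action, or raise if it is not allowed."""
    key = (status, action)
    if key not in TRANSITIONS:
        raise ValueError(f"cannot {action} a folder whose status is {status}")
    return TRANSITIONS[key]


def replay(actions, status="Editable"):
    """Apply a sequence of actions and return every status passed through."""
    history = [status]
    for action in actions:
        status = apply_action(status, action)
        history.append(status)
    return history


# The exercise: submit, get rejected, fix the metadata and resubmit, get approved.
exercise = replay(["submit", "reject", "submit", "approve"])
print(" -> ".join(exercise))
```

In this model the steward's actions are the only way out of "Submitted", which is the difference from the unsupervised group, where a submitted folder reaches the Vault without anyone's approval.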

3.5 Verify your dataset is in Vault

After your exchange with the data steward, and once you have their approval, you should see your dataset in the vault-train-oct23 folder. Verify that this is the case.

⟡ ⟡ ⟡

You have now completed this section. Make sure you can answer the questions below to verify that you have understood the supervised flow to secure datasets in the Vault.

Flow

Food for thought:

Now that you have experienced both the unsupervised and the supervised flows, can you see when you would apply each in your institute?
Who would be suitable candidates to be carrying out the task of data steward for the sake of approving?
How is that scalable to cope with all the research data in your institute?
How would you organise Research spaces in your institution's Yoda? Why? Can you think of an alternative organisation of Research spaces?

❦ Epilogue

Well done! You have now finished the exercises for today. You can be proud of having completed some tough work.

If you think anything is still unclear, do not hesitate to contact your facilitators.

For next steps with regards to RDM, iRODS, or other research services, you can:

follow our Yoda + iBridges tutorial
visit our documentation pages:
- Yoda: Yoda Hosting
- iRODS: iRODS
visit our research services web page: https://www.surf.nl/en/research-it
or contact our Servicedesk through the Servicedesk Portal

Thank you for your attention, and we hope to have been of help for you today. ∎

Complementary information:

A handy alternative to the web portal of a Yoda instance is its WebDAV interface: https://scuba-data.irods.surfsara.nl
A feature we have not touched in this hands-on session is Publishing. That is a next step in the flow after the Vault.
Yoda is open source, and you can view its code and advanced documentation on Utrecht University's GitHub: https://github.com/UtrechtUniversity/yoda
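As an illustration of what talking to the WebDAV interface looks like, the sketch below builds a standard WebDAV PROPFIND request, which asks a server to list the contents of a folder. It only constructs the request and does not send it. The assumptions are that the server accepts HTTP Basic authentication and that group folders appear directly under the root URL; both should be checked with your facilitators, and the helper name `build_listing_request` is hypothetical.

```python
import base64
import urllib.request


def build_listing_request(base_url, collection, username, password):
    """Build (but do not send) a WebDAV PROPFIND request that lists a collection."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    url = base_url.rstrip("/") + "/" + collection.strip("/") + "/"
    return urllib.request.Request(
        url,
        method="PROPFIND",
        # Depth 1 asks for the collection itself plus its direct children.
        headers={"Depth": "1", "Authorization": f"Basic {token}"},
    )


request = build_listing_request(
    "https://scuba-data.irods.surfsara.nl",
    "research-train-unsuperv-oct23",  # assumed path layout on the WebDAV server
    "your-username",                  # placeholder credentials
    "your-password",
)
print(request.get_method(), request.full_url)

# To actually send it (needs network access and valid credentials):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

In practice most people mount the WebDAV URL as a network drive from their file manager rather than scripting it; the sketch is only meant to show that the same folders are reachable outside the portal.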
