Creating datasets
Datasets are SQL tables that are stored within the workspace database. Any tabular data, e.g. a CSV file, can be converted to a dataset.
There are multiple benefits to doing this. Metadata can be added to a dataset to ensure that it follows the FAIR data principles. Built-in tools are also available for working with the datasets, such as a point and click SQL for joining tables, rapid data visualisation functionality and a table editor. To find out more about these tools, see articles in the Managing data section.
Creating a new dataset
To create a new table from scratch within your workspace, go to the Add tab and select ‘New Dataset’. Name the new table, click ‘Create dataset’ and an Add Column window will open.
You can either close the window or add columns by typing a column name and its data type.
Converting a CSV to a dataset
Datasets can be easily created within the workspace by converting a CSV file. From the Files tab, highlight a CSV file either from the blob or file store. Select the option 'Convert to dataset' in the sidebar, and a New Dataset tab will open in the workspace.
The top panel in the New Dataset tab gives you the following options to customise how the data is converted to a dataset:
- Data table name: the name of the new dataset.
- Include header row: specifies whether the data file includes a header row. If this checkbox is selected, the items in the first line of the file will be used as column names in the dataset.
- Delimiter: the character used as the delimiter to separate data items in the file. Text qualifier: the character used to distinguish where the text begins and ends. The default is a double quote.
- Null qualifier: the string used to represent null values in the file. Such data items will appear as no value fields in the dataset. Example null qualifiers: NULL, null, 0.
- Encoding: the character encoding used in the file. The default is UTF-8.
At the bottom of the New Dataset tab, the dataset preview shows the first 10 rows of the dataset. The preview automatically updates the data depending on the parameters set.
The panel above the preview lets you edit the header:
-
Column names must satisfy the restrictions outlined below, otherwise, the field will be highlighted in red, indicating an error. Column name requirements:
- must start with a letter
- can only contain lowercase letters, numbers and underscores
- must be less than 65 characters
- cannot be a reserved SQL word (e.g. GROUP)
-
Data types of the columns (selected from a dropdown list). Datasets currently support the following data types: text, integer, decimal, boolean, date, datetime, datetime with time zone, time, time with time zone.
-
The primary key of the table (more information about the primary keys in the next section).
-
Which columns to exclude from the dataset (done by clicking the checkbox beside the column name).
Below is an example of a new dataset conversion from a CSV file. Airport name was selected as the primary key of the table, and iata column was excluded from the dataset.
When happy with the data preview, you can select the ‘Confirm’ button which closes the current tab and opens a new Dataset Preview tab. For larger datasets, you will see a progress bar at the bottom. Once the dataset is fully uploaded, you can access the data as normal. Next, you might also want to add metadata to your newly created dataset.
Primary key
Workspaces do not enforce a Primary Key on all tables which are created using this method. By default, a new column is added to all tables in the New Dataset tab called id. This column is an index which is the default Primary Key for any new table.
The Primary Key can easily be changed at this stage by selecting another column which has no duplicate values in the first 10GB. Note, if another column is selected as the Primary Key, the auto-generated index column is no longer added to the table.
Error report
For all file to dataset conversions, a report is generated. This report is accessible from the Summary tab as well as from the Activity tab, which can be found in the sidebar of any successfully created dataset.
If the workspace could not create the dataset, a report is generated and shown in the Summary tab. This report contains a list of the errors encountered while trying to create the dataset, including a reference to the line or character whenever possible.
The image below shows an example report which indicates that the dataset creation failed because the primary key column contained duplicate values.
Troubleshooting
I can't find the 'Convert to dataset' button
This action is only available for files ending in .csv (or .CSV). Make sure that you are trying to make a conversion from a CSV file.
My dataset is taking a long time to show up
Large datasets can take quite a few minutes to convert, this is longer when you choose to exclude certain columns from their dataset or when multiple users are carrying out the action at once.