My name is Lan Li, and I am a fourth-year PhD student in the School of Information Sciences at the University of Illinois Urbana-Champaign. One work from the Data Curation session particularly impressed me: “Describing Data Transformation Work in a Changing Data Curation Community.”
Data curation is an essential process that ensures the quality and long-term usability of a dataset. It includes activities such as data selection, cleaning, formatting, and preservation. The primary objective of data curation is to maximize the usefulness of the dataset, which is consistent with the FAIR principles.
Toward this goal, I have been studying how to improve the transparency, reusability, and reproducibility of data cleaning work through provenance analysis, in line with the FAIR principles. This approach involves tracing the lineage of the data and metadata transformations that occur during data cleaning. Provenance analysis helps make data transformations in data cleaning more FAIR by providing information on the origin, flow, and transformation of the data.
In data cleaning work, various data transformations are required to make the data fit for use. These transformations range from simple data formatting to complex data merging and restructuring. Because different transformations operate on the dataset in different ways, the extent to which they can be reused, and the difficulty of reusing them, varies. My previous research focused on how data transformations affect the reusability of data cleaning tasks through provenance analysis.
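To make the idea of provenance capture concrete, here is a minimal sketch of recording the lineage of cleaning transformations as they are applied. All names here (`ProvenanceLog`, `apply`, the sample records) are illustrative assumptions of mine, not part of the paper or of any specific provenance tool.

```python
# Minimal sketch of provenance capture for data cleaning steps.
# The class and method names are hypothetical, chosen for illustration.

class ProvenanceLog:
    """Records each transformation applied to a dataset."""

    def __init__(self):
        self.entries = []

    def apply(self, data, operation, description):
        """Apply `operation` to `data` and record what changed."""
        before = len(data)
        result = operation(data)
        self.entries.append({
            "description": description,
            "rows_before": before,
            "rows_after": len(result),
        })
        return result


# Example: two simple cleaning transformations on a list of records.
rows = [{"name": " Ada "}, {"name": ""}, {"name": "Grace"}]
log = ProvenanceLog()

# Formatting transformation: strip whitespace from names.
rows = log.apply(rows,
                 lambda rs: [{"name": r["name"].strip()} for r in rs],
                 "strip whitespace from name")

# Selection transformation: drop records with empty names.
rows = log.apply(rows,
                 lambda rs: [r for r in rs if r["name"]],
                 "drop empty names")

for entry in log.entries:
    print(entry["description"], entry["rows_before"], "->", entry["rows_after"])
```

A log like this is what makes a cleaning workflow transparent and reusable: a later curator can replay or audit each step rather than reverse-engineer the final dataset.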
This study of data transformation at a major data science data archive, by contrast, presented a qualitative analysis of how the extent and manner of data transformation changed over a 16-year period of organizational change. The study analyzed the code in data transformation syntax files, measuring the size, diversity, and breadth of the code and comparing these characteristics across the years.
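In the spirit of that analysis, here is a hedged sketch of measuring the size and diversity of a transformation syntax file. The sample syntax and the command-extraction rule (the first word of each non-comment line) are simplifying assumptions of mine, not the paper's actual method.

```python
# Hypothetical sketch: measure size (lines of code) and diversity
# (distinct commands) of a data transformation syntax file.

def measure_syntax(text):
    """Return line count and the set of distinct commands in a syntax file."""
    lines = [ln.strip() for ln in text.splitlines()]
    # Assumption: lines starting with "*" are comments, as in SPSS-style syntax.
    code_lines = [ln for ln in lines if ln and not ln.startswith("*")]
    commands = {ln.split()[0].upper() for ln in code_lines}
    return {"lines": len(code_lines), "commands": commands}

sample = """\
* Recode missing values.
RECODE income (-9 = SYSMIS).
COMPUTE log_income = LN(income).
RECODE age (0 thru 17 = 1) (18 thru 64 = 2).
"""

stats = measure_syntax(sample)
print(stats["lines"], sorted(stats["commands"]))
```

Applied to each year's syntax files, counts like these are one simple way to track whether transformation code is shrinking (standardization) or drawing on a wider command vocabulary (diversity) over time.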
The study found that during the period of organizational change, the number of lines of code and the number of commands used to transform data decreased significantly, indicating increased standardization and efficiency. However, the diversity of data transformations increased, which may reflect changes in data formats, technologies, or research practices. The reorganization's goals of increased standardization and efficiency were achieved through the adoption of standardized processes. The study highlights the importance of continuously evaluating and improving data curation processes so that datasets remain FAIR and fit for use over the long term. These findings will inform my continued study of reusing data transformations contributed by many people over the years.