Bid
Name of Capital Programme: Repositories and Preservation
Name of Lead Institution: University of Essex
Name of Proposed Project: Data Exchange Tools and Conversion Utilities (DExT)
Name of Project Partners: UK Data Archive
Full Contact Details for Primary Contact:
Name: Louise Corti
Position: Associate Director, UK Data Archive
Email:
Louise Corti
Address: University of Essex Wivenhoe Park, Colchester CO4 3SQ
Tel No: +44 (0)1206 872145
Fax No: +44 (0)1206 872003
Project Start and End Dates:
Length of Project: 12 months
Start date: 11 December 2006
End date: 31 March 2008 (extended)
Total Funding Requested from JISC
£139,421
Funding Broken Down over Project Years:
Year 1 (1 Nov 06 - 31 March 07) £57,640
Year 2 (1 April 07 - 31 March 2008) £81,781
Outline Project Description
This project is exploring the feasibility of developing data exchange models and data conversion tools
for primary research data collected in the course of empirical research. It is developing, refining and testing
models for data exchange for both survey data and qualitative research data based on XML/RDF schema and
developing tools for data import and export. The data formats included are those that are
commonly used in research such as SPSS, Stata, XML, Atlas-ti, MaxQDA and Nvivo. The test data selected
for this project are from the social sciences, but these formats are typically found across all domains
of primary research. A small scale evaluation of the models and tools is being undertaken to inform JISC
of the most viable options for future development in this area.
1. Background
1.1 Data conversion and proprietary data entry and analysis are particularly important and
problematic aspects of data management and curation. This proposal aims to provide researchers
and support staff working with primary research data with a suite of tools that will enable data
to be long-term curated and exchangeable. The project will research, develop and test tools
for contemporary quantitative (statistical) data and qualitative data of the kinds typically used by social
researchers.
1.2 Much important primary research data is created every day in the course of academic and policy
research. While Data Sharing Policies are encouraging sharing and formalised archiving of data, the
ideal life cycle from data creation to re-use remains beset by obstacles. The main issue is the
buy-in to a dedicated analytic strategy and, typically, a particular software package. Over the years
the UK Data Archive has seen a number of such packages quickly become obsolete. To address the problem of
incompatibility between software, various data conversion tools have come onto the market. Examples
for numeric and statistical data are StatTransfer and DBMSCopy, which enable conversion from, say, SPSS to
Stata. Equally, the development of the SPSS portable format (.por) in the 1980s enabled import and export between
the major statistical analytic packages. However, in the qualitative data analysis software field
there are no such inter-software conversion tools. This proposal argues that open data exchange
formats are necessary for maximising the opportunities for data sharing and long-term archiving.
1.3 Legacy formats do and will present problems, a matter recognised for some time by the data
archiving community and by some commercial suppliers. A JISC/NPO study on the preservation of
electronic materials in 1997 gauged the extent of legacy materials sitting in our institutions and found it
significant. Bennett developed a framework of data types and formats and looked at issues affecting
the long-term preservation of digital material and the DCC is playing some part in this endeavour,
but we are still a long way short of actually dealing with legacy on a managed or efficient scale.
Universal exchange formats will help to alleviate the build up of yet more unreadable digital data
in our institutions.
1.4 The Metadata Encoding and Transmission Standard (METS) is becoming an increasingly attractive
standard for dealing with preservation metadata and confirming validation processes, and is JISC's
chosen standard for interoperability. It should therefore be relevant to the development of data conversion
standards and tools. The TNA and UK Data Archive recently undertook a JISC-supported project to report on the Open Archival
Information System (OAIS) Reference Model and METS.
1.5 Outputs from primary research are also linked, either implicitly or explicitly, to
associated research data - for example statistical data, qualitative data, data encoded in a
scientific language, or even graphical data. Links between the output research findings and the
source data upon which they are based are highly desirable for virtual research environments which
are under construction in many institutions at the present time. Such links scarcely exist as yet,
however, and are certainly not currently made from the relatively embryonic institutional output
repositories which are currently under development. The StORe project is addressing some of these
matters. Linking to truly exchangeable data formats that are professionally curated will be of
greater benefit to the research community. The SHERPA project is also using OAIS and METS to
harness institutional repository systems with the AHDS preservation repository to create an
environment that fully addresses all the requirements of the different phases within the life
cycle of digital information.
2. Overview
2.1 This bid is for a project under the Tools and Innovation strand, which aims to develop and pilot
innovative approaches to repository use and digital preservation through the development of new software
and tools. The outputs will also be relevant to (i) the Discovery to Delivery stream, as the proposed
service is based upon common standards for data interoperability, and (ii) Shared Infrastructure Services
for resource discovery, repositories and curation - machine to machine services that support rights,
profiling, terminologies, registries, file format and representation etc.
2.2 The proposal supports the Programme's desire to improve the efficiency and quality of repository
functions, by helping automate the processes of data conversion, and by providing SMART data and tools - in
the form of a universal data exchange format. The work set out will thus contribute to the need for
refinement of the application of standards and specifications for digital repositories and preservation by
building software and tools for both digital repository use and digital preservation. An immediate benefit
will be an increase in productivity for preservation and data sharing services and the enhancement of both the reprocessing
of legacy datasets and the data refreshment element vital to good data preservation practice.
2.3 The proposal builds on development work arising from the existing JISC programmes: Digital
Repository Programme and the Supporting Digital Preservation and Asset Management in Institutions.
Findings from the JISC projects set out in the Background (Section 1) will be utilised where appropriate. It
should be noted that the current projects from these JISC programmes are addressing publications and
theses, learning objects, experimental and geospatial data, but largely do not cover the types
of (widely used) data central to this proposal. The proposal also builds upon other contemporary reports
and investigatory papers into preservation and dissemination issues at academic institutions. The 2005
report on 'Digital Repositories Roadmap: looking forward' called for the provision of a 'solid environment
within which a wide variety of software tools (open source and commercial) and added value services can be
developed' and 'functionality and services that support curation, migration and preservation'.
2.4 The UK Data Archive has already produced three Best Practice Guides for the MRC and a joint ESRC/NERC/BBSRC RELU
Programme in: Data Management - covering data format selection, metadata and documentation standards and
content, version control, access, and authentication, and in Data Format Conversion - covering the issues
involved in data conversion. These provide solid background advice for researchers and centres with data
sharing or archiving commitments, and will be built upon for user guides and documentation purposes for the
conversion tools to be built.
3. Aims and Objectives
3.1 This proposal aims to provide researchers and support staff with standards and associated tools for
data (and metadata) format conversion. Central to these aims are the recognition and use of best
practice in longer-term data management and curation.
3.2 The specific objectives are to:
- research and develop a numeric data exchange standard and conversion tools
- research and develop a qualitative data exchange standard and conversion tools
- test and evaluate these standards and tools
- assess the feasibility of developing a web-based service for data conversion based on these tools sets
4. Overall Approach
4 The kinds of data to be dealt with under the remit of this proposal are data arising from primary
research using typical social research methods and techniques based on fieldwork. These are primarily:
structured social surveys and more in-depth interviews or focus groups, as set out in Appendix B. The
UK Data Archive has preferred formats, and also distinguishes between acquisition, dissemination and preservation
formats. Users make choices about which software to use in their research and require formats that can be
easily read.
4.1 Numeric data exchange and conversion tools (WP2)
4.1.1. Software applications are increasingly designed to provide the researcher with a view of their
data and its 'internal metadata' (variable and code labels, variable formats, etc.) that is divorced from
the software's internal (i.e. underlying) representation, used in conversion. Data creators, data centres
and researchers often need to translate datasets between formats, but lack the tools to do so in an accurate
and automated way. Data/research centres may use proprietary translation software for certain types of
conversion, while individual research projects often rely on the inbuilt import and export functions
of a given software package. Both options tend to be poorly documented and operate on the software's
internal representations of data. A range of subtle but significant conversion errors often occurs as
a result. The major problems that affect data format conversion are:
- rounding/truncation of numeric data
- truncation of textual data
- differences in handling 'internal' metadata (differential label lengths, missing value handling etc.)
- corruption of specially formatted variables (especially date/time variables)
- embedded special characters
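The first two problems in this list can be detected mechanically by comparing each value before and after a round trip through the export format. The Python sketch below is illustrative only: the function names `export_value` and `find_lossy_values`, and the fixed-decimal, fixed-width export they simulate, are assumptions made for the example and do not describe any package named in this proposal.

```python
def export_value(value, decimals, width):
    """Simulate a lossy export: floats get a fixed number of decimal
    places, everything else is cut to a fixed field width."""
    if isinstance(value, float):
        return f"{value:.{decimals}f}"
    return str(value)[:width]


def find_lossy_values(values, decimals=2, width=8):
    """Return (original, exported) pairs that do not survive the round trip."""
    lossy = []
    for value in values:
        exported = export_value(value, decimals, width)
        # Read the exported text back using the original Python type.
        restored = type(value)(exported)
        if restored != value:
            lossy.append((value, exported))
    return lossy
```

Run over a whole column before and after conversion, a check of this kind would flag rounded numbers and truncated strings rather than letting them pass silently.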
4.1.2. For example, the SPSS command 'PRINT FORMATS' is often used to perform data typing upon conversion, yet
it seldom matches the actual data, and this can lead to catastrophic coarsening of data upon conversion or,
conversely, unnecessary inflation of file size. Similarly, one of the ill-documented features of MS Access
is that the export precision can be controlled by the number of decimal places in the 'Regional Options' of
the Windows Control Panel. Lastly, as an example of the fifth problem listed above, embedded characters are an issue in MS
Access, wherein fields may contain characters such as tabs or carriage returns. Unless these characters are
stripped out prior to conversion to delimited text, the data will lose its rectangular structure.
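The embedded-character problem can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the project's tooling: the helper names `clean_cell` and `to_tab_delimited` are hypothetical, and replacing each run of tabs or line breaks with a single space is only one possible cleaning policy.

```python
import re

# Characters that would break a tab-delimited, one-record-per-line file.
_STRUCTURE_BREAKERS = re.compile(r"[\t\r\n]+")


def clean_cell(value):
    """Replace embedded tabs and line breaks in text cells with one space."""
    if isinstance(value, str):
        return _STRUCTURE_BREAKERS.sub(" ", value)
    return value


def to_tab_delimited(rows):
    """Write rows as tab-delimited text, one record per line."""
    lines = ["\t".join(str(clean_cell(cell)) for cell in row) for row in rows]
    return "\n".join(lines) + "\n"
```

Without the cleaning step, a memo field containing a carriage return would split one record across two lines, and a field containing a tab would add an extra column to that record.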
4.1.3. The main aim of this strand is to develop an XML standard for data exchange and curation. This
standard will be defined and expressed as an XML schema. It will store ALL information in a statistical
dataset including the internal metadata (variable and code labels, missing value definitions, variable
level notes, variable formatting, etc.). This would be a logical extension to the Data
Documentation Initiative (DDI) that is used globally to describe social research data collections,
with the DDI acting as the metadata standard for survey datasets and this new standard providing the
standard for the data themselves - and thereby providing a much needed open standard for curating
statistical data resulting from surveys.
4.1.3.1 Specifically the standard would need to take account of:
- numeric data in all relevant storage modes, i.e. integer, real and complex
- numerical data in specific formats, e.g. date/time
- textual (character) data, at a UTF-8 standard
- categorical data (ordered or unordered)
- logical data
- 'null' and undefined data
Also the standard would need to take account of any other feature (data type) which might be
found in a spreadsheet, statistical package, or relational database.
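To make the idea of storing data and internal metadata together more concrete, the following Python sketch serialises a very small dataset to XML. It is a sketch only: the element and attribute names (`dataset`, `variable`, `code`, `missing`, `row`, `value`, `null`) are invented for illustration and are not the schema this project would define, nor are they taken from the DDI.

```python
import xml.etree.ElementTree as ET


def dataset_to_xml(variables, rows):
    """Serialise variable-level metadata and data rows into one XML string."""
    root = ET.Element("dataset")
    var_list = ET.SubElement(root, "variables")
    for var in variables:
        v = ET.SubElement(var_list, "variable",
                          {"name": var["name"], "type": var["type"]})
        ET.SubElement(v, "label").text = var["label"]
        # Code labels for categorical variables.
        for code, text in var.get("codes", {}).items():
            ET.SubElement(v, "code", {"value": str(code)}).text = text
        # User-defined missing value definitions.
        for missing in var.get("missing", []):
            ET.SubElement(v, "missing", {"value": str(missing)})
    data = ET.SubElement(root, "data")
    for row in rows:
        r = ET.SubElement(data, "row")
        for var, value in zip(variables, row):
            cell = ET.SubElement(r, "value", {"var": var["name"]})
            if value is None:
                # Keep 'null' distinct from an empty string or a zero.
                cell.set("null", "true")
            else:
                cell.text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Because labels, codes, missing value definitions and nulls are carried explicitly rather than inferred, an import utility reading such a file would not need to guess at them.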
4.1.4 The resulting standard would be DDI 3 compatible and would need to be able to be used to
generate DDI 3 XML files via XSLT stylesheets. The XML schema would be published on the project and
ESDS web sites. It is important to realise that the standard is only of use to the potential user
community if freely available tools are made available to get data into the curation/interchange
standard from proprietary data formats and also export back out into proprietary formats. Hence
some pilot import and export utilities will be developed to accompany the new standard.
The basic work plan for this development work is set out in Section 11.
4.2 Non-numeric data exchange and conversion tools (WP3)
4.2.1. The majority of social researchers undertaking qualitative research methods are making use of
some form of data management software. This can be MS Word or MS Access but in the past 15 years a
number of dedicated packages have come on the market. These are generically termed Computer
Assisted Qualitative Data Analysis Software (CAQDAS) packages, and include the market leaders NVivo,
Nudist, Atlas-ti and MaxQDA.7 The packages have similar basic functionality that includes:
- structuring work - ability to access all parts of a project immediately
- staying 'close to data' - instant access to source data files (e.g. transcripts)
- exploring data - tools to search text for one word or a phrase
- code and retrieve functionality - create codes and retrieve the coded sections of text
- project management and data organisation
- searching and interrogating the database - search for relationships between codes
- writing tools - memos, comments and annotations
- outputs - reports to view a hard copy or export to another package
4.2.2. CAQDAS formats are proprietary and there are no export or import formats for exchange between packages. Thus once a researcher
uploads data into a particular package, s/he is locked into that particular software and format. The work carried
out within these packages, which is primarily coding and annotating data, is stored as an integral part of the
software's 'project', which can typically be exported as a whole unit. However, it is not possible to share the
added value created within the package. A list of annotations or codes can be exported but the links to
the underlying data cannot. As yet, there are no open source products which can compete with the
functionality of the leading packages, Atlas-ti and QSR NVivo.
4.2.3. A standard format for representing richly encoded qualitative data is necessary because it:
ensures consistency across datasets; supports the development of common web-based publishing and search tools;
and facilitates data interchange and comparison among datasets. Importantly, it could also enable data and
linked products to be imported and exported directly into and out of CAQDAS packages, avoiding the reliance
on just a single product, and offering the opportunity to share analytic workings outside the confines of
any particular software.
4.2.4. The work to be undertaken for this strand is a refinement and joining of two new and roughly
specified models that have been developed by ESDS Qualidata at Essex and ANU in Australia. A draft but
limited formal definition of a common XML vocabulary and Document Type Definition (DTD) based on the Text
Encoding Initiative (TEI) for describing these structures has been prepared by ESDS Qualidata. The
Universities of Melbourne and Queensland have developed a draft Qualitative Data Interchange
Format for e-Social Science (QDIF)10. Both centres have been working closely together
in this very early development phase, but as yet neither has any dedicated funding to work
further on realistic development or testing, so the specification work remains on the back-burner.
4.2.5. In essence, the model aims to specify, in XML or RDF, a data exchange format that can
represent a data collection and retain links between data, annotations and related objects. The model
makes use of the DDI and TEI for metadata, and is compliant with Dublin Core and OAI. The model will be
tested on various data types, including output to a generic archival format and export from (and possibly
import to) two or three of the leading CAQDAS packages. Some basic import/export tools will be
developed. Agreed testers of the models and tools are listed in Appendix C.
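One way to picture a format that retains links between data and annotations is stand-off annotation, in which each code or memo points at a character range in the source text instead of being embedded in it. The Python sketch below is an assumption-laden illustration, not the ESDS Qualidata or QDIF model: the element names (`collection`, `document`, `annotation`, `memo`) and the functions `build_collection` and `coded_segments` are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, text, annotations):
    """Store a transcript once and attach codes to it by character offsets."""
    root = ET.Element("collection")
    doc = ET.SubElement(root, "document", {"id": doc_id})
    doc.text = text
    for number, ann in enumerate(annotations, start=1):
        start, end = ann["start"], ann["end"]
        if not 0 <= start < end <= len(text):
            raise ValueError(f"annotation {number} is outside the document")
        node = ET.SubElement(root, "annotation", {
            "id": f"a{number}",
            "target": doc_id,       # link back to the source document
            "start": str(start),
            "end": str(end),
            "code": ann["code"],
        })
        if ann.get("memo"):
            ET.SubElement(node, "memo").text = ann["memo"]
    return root


def coded_segments(root):
    """Follow each annotation's link and return (code, coded text) pairs."""
    texts = {d.get("id"): d.text or "" for d in root.findall("document")}
    return [
        (a.get("code"),
         texts[a.get("target")][int(a.get("start")):int(a.get("end"))])
        for a in root.findall("annotation")
    ]
```

Because the link is stored as data, a coded segment can be recovered from the exchange file alone, which is the property that an exported list of codes without links does not have.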
The workplan for this development work is very similar to WP2, as set out in Section 11.
5. Project outputs
5.1 The outputs of the project will comprise a suite of tools for achieving best practice in data format
conversions:
- a model for a data exchange format for qualitative data (in the first instance text, from NVivo, Atlas-ti, MaxQDA,
Qualrus and QDAMiner) and a limited set of import/export tools
- import from SPSS and Excel
  - these would use Visual Basic scripts run from within SPSS and Excel
  - these utilities would be available for people to download and execute locally from within the relevant software
- export to SPSS, Stata and SAS
  - this would be via XSLT stylesheets and would provide a delimited text data file and accompanying command file(s) to recreate the data in SPSS/Stata/SAS
  - these stylesheets would be made available for people to download and execute locally (i.e. no software dependence beyond downloading one of the freely available XSLT processors)
- export to DDI 3 metadata files
  - this would be via XSLT stylesheets which would be made available for people to download and execute locally, again with no further software dependencies
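As an illustration of what an accompanying command file might contain, the sketch below generates SPSS syntax that re-attaches variable labels, value labels and missing value definitions to a delimited text file. Note the swap of technique: the proposal specifies XSLT stylesheets, and Python is used here purely to show the kind of output intended. The function names and the dictionary layout of `variables` are assumptions made for the example.

```python
def spss_quote(text):
    """Quote a string for SPSS syntax, doubling any embedded apostrophes."""
    return "'" + text.replace("'", "''") + "'"


def spss_label_syntax(variables):
    """Build SPSS commands that restore labels and missing values."""
    lines = []
    for var in variables:
        name = var["name"]
        lines.append(f"VARIABLE LABELS {name} {spss_quote(var['label'])}.")
        codes = var.get("codes")
        if codes:
            pairs = " ".join(
                f"{code} {spss_quote(text)}" for code, text in codes.items()
            )
            lines.append(f"VALUE LABELS {name} {pairs}.")
        missing = var.get("missing")
        if missing:
            values = ", ".join(str(m) for m in missing)
            lines.append(f"MISSING VALUES {name} ({values}).")
    return "\n".join(lines) + "\n"
```

A complete command file would also need a command to read the delimited data itself, which is omitted here.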
5.2. The experience of the UK Data Archive/ESDS in this line of work and the synergies this proposal has with existing
projects and staff expertise within the organisation can offer great value in terms of promoting and subsequently
maintaining the standard and import/export utilities - not least because it is a standard that UK Data Archive will adopt
for their own preservation and data exchange work.
6. Project Outcomes
6.1 The following outcomes will be sought from the project:
- specification of data exchange models and partner consultation
- pilot conversion tools and testing with partners
- evaluation of performance and results
6.2. Once these demonstrator tools have been shown to work, they should draw attention to the potential for a
fully specified statistical data curation/exchange web service. Users could upload data that had been
extracted from a software package and receive it back in a chosen converted format. This is a logical extension to
the work that would be funded within this bid and it would be a highly beneficial, low-investment tool for
institutional repositories. The short time scale and limited budget for this proposal do not allow for
full exploration of this kind of facility, but the potential will be discussed in the final report.
7. Stakeholder Analysis
Group: Researchers across all domains who undertake data collection and analysis using
digital formats
Interest: More research data that is currently tied up in proprietary or legacy formats will become
available. Data and analysis, rather than simply a single dataset, will be able to be preserved
long-term. Incorrect conversion of survey data will be avoided. Appreciation of metadata will be improved.
Value: Very high
Group: Postgraduate students
Interest: As above
Value: Very high
Group: Undergraduate students
Interest: Low use of primary data in teaching is an issue of concern for some of the
Research Councils. More availability of data will be an asset
Value: Medium
Group: Research funding agencies
Interest: Greater value for money from grants awarded through an increase in the
archiving and availability of primary data. Increased re-use and further analysis of
datasets whose creation they have funded. Greater accountability for public funds
through increase in data usage and citation.
Value: High
Group: Library community
Interest: Increased recognition by academic and research communities of their libraries'
new role in promoting and pointing users to raw data sources for research
Value: Medium
Group: General public
Interest: Typically the general public do not create or analyse research data
Value: Low
8. Risk Analysis
Type: Staff recruitment
Probability: Medium
Action to minimise: Immediate widescale advertisement of positions subject to award letter.
Head-hunting initiated prior to this
Response to worst case: Development timeline reassessed if time permits
Type: Financial mismanagement
Probability: Low
Action to minimise: Part-time Project Manager and PIs with financial responsibility and
project meetings
Response to worst case: Institution liable to underwrite losses. Project halted
Type: Legal problems
Probability: Low
Action to minimise: Project Manager and PIs well-briefed and sensitive to legal environment
Response to worst case: Advice sought from JISC Legal and ultimately from legal personnel at the
University of Essex
Type: Technical problems
Probability: Medium
Action to minimise: Established track record in technical development. PIs to apply rigorous
monitoring
Response to worst case: Technical development strategy reassessed and new staff recruited, if
time and resource permits
Type: Staff retention
Probability: Low
Action to minimise: PIs to monitor
Response to worst case: Development timeline reassessed and new staff recruited, if time and
resource permits
Type: Tester drop out
Probability: Low
Action to minimise: PIs to monitor
Response to worst case: New testers recruited
9. Standards and Technical Developments
9.1 The Project will adhere to the JISC Information Environment Architecture Standards Framework.
Standards and protocols employed will be open wherever possible. These are likely to include Java,
OpenURL, OAI-PMH, SRU/W, XML and SQL. Metadata will conform to Qualified Dublin Core. UK Data Archive uses
the Data Documentation Initiative (DDI) and the Text Encoding Initiative (TEI), which both map onto
Dublin Core but provide more relevant detailed description.
10. Project Partnership
10.1 This is a small project and as such is limited to a single institution. A number of organisations
in England and Wales have agreed to be user testers, as set out in Appendix C.
10.2 The UK Data Archive has been providing high quality social science data to the UK academic
community for over thirty years. In order to continue improving its ability to satisfy the requirements
of the social sciences HE/FE communities, the UK Data Archive has engaged in a sustained, long-term effort to define,
design and implement consistent and wide-ranging curation and preservation strategies, procedures and
formalised services. UK Data Archive is a centre of expertise in data acquisition, preservation, dissemination and
promotion and is curator of the largest collection of digital data in the social sciences and humanities
in the UK. Founded in 1967, it now houses several thousand datasets of interest to researchers in all
sectors and from many different disciplines. UK Data Archive provides resource discovery and support for secondary
use of quantitative and qualitative data in research, teaching and learning. It is a lead partner of the
Economic and Social Data Service (ESDS), funded by the ESRC, the JISC and the University of Essex. The
UK Data Archive also hosts AHDS History11, one of the five Centres of the Arts and Humanities Data Service, and
the Census Registration Service (CRS)12, facilitating access to the census data resources for UK
higher and further education.13
10.3 UK Data Archive contributes to the development of internationally recognised metadata standards (the DDI
standard for statistical metadata) and has led the development of: state-of-the-art distributed
statistical dissemination software that has been widely adopted by the Data Archives community
(the NESSTAR system); a European-wide data portal (the Madiera portal) and the multilingual
social science thesaurus, ELSST (European Language Social Science Thesaurus).
10.4 UK Data Archive is a legally recognised Place of Deposit for TNA for public records and
provides preservation services for other data organisations, supports the National Centre for e-Social
Science (NCeSS)14 and facilitates international data exchange through agreements with other national archives.
10.5 UK Data Archive is involved in Research and Development projects making a significant contribution to
new developments in data preservation and dissemination, metadata standards, software for web
browsing, data discovery and data delivery.15 UK Data Archive has led the initial development of data
exchange formats for both survey and qualitative data and is a project partner in the
JISC StORe (Source To Output Repositories) project16 and the JISC-funded OHPR (Online Historical Reports Project).
11. Work packages
WP1: Project Management
Objective: to ensure that all the workpackages of the project are managed coherently
and that all the project outputs are delivered within the agreed deadlines and budget.
Timeline: 12 months
Project Administrator
- contract management
- financial management
Project Director
- Detailed workplan
- Running project meetings
- Production of bi-monthly progress reports and key reporting to JISC
- Creation and maintenance of project web site
- Management of dissemination in collaboration with WP staff
WP2: Develop technical specification of data exchange models - survey data
Timeline: 9 months
WP3: Develop technical specification of data exchange models - qualitative data
Timeline: 12 months
- Developing the standard - 6 months
- Import/export utilities - 4 months
- Finalising demonstrator tools and promoting the model - 2 months
WP4: Testing and Evaluation
Objective: to ensure that formative evaluation takes place throughout the project and is reported
to JISC, and a summative evaluation is performed to consider recommendations of the project.
Timeline: 12 months (formative); 2 months (summative)
- Limited evaluation of demonstrator system based on high quality user testing
- Report to Project Committee and JISC on the value of the project, with recommendations
for future development work
12. Project Management
12.1 The Principal Investigator and Project Director, Louise Corti, will lead and oversee the project and manage
work packages WP3 and WP4. She will liaise regularly with staff and meet them fortnightly. Her time is being offered at no cost to
this project. Expert consultancy for digital preservation has been built in for Matthew Woollard,
Head of Digital Preservation and Systems for the UK Data Archive. He will lead on WP2, which is to be redefined
in scope given the recent death of the former PI, Alasdair Crockett.
12.2 A small Project Committee will be formed to steer and advise the project. The Project
Committee will be composed of people with data curation or data handling expertise in the UK who
will meet 3-4 times a year (face-to-face, VC or AGN) to consider progress, advise on quality assurance
and provide advice based on leading edge ideas in their fields. They will also help to disseminate the
work of the project, particularly among senior staff in UK universities and research agencies, to
government, and overseas. The Committee will consist of 'research community champions' from the domains
represented in our project, together with senior figures from the worlds of digital libraries and data
curation. The members are:
- UK Data Archive, Ken Miller, Head of Information Development and e-Social Science
- UK Data Archive, Matthew Woollard, Head of Digital Preservation and Systems
- Digital Curation Centre (DCC), Chris Rusbridge
- A representative from the statistical community (tba)
- University of Queensland, Andrew Smith
13. Evaluation
13.1 Formative evaluation will be the responsibility of the PIs. Summative evaluation will
be undertaken in the final two months of the project to assess performance and achievements against
the initial objectives.
14. Quality Assurance
14.1 Quality assurance of the technical workpackages will also be the responsibility of the PIs,
who will set requirements for documentation, usability and accessibility. Advice on all aspects of
quality assurance will be provided by the Project Committee.
15. Dissemination
15.1 Project personnel will look for opportunities to present the work of the project at
conferences and to publish papers in appropriate professional journals - including the academic
journals of our researcher groupings. Wherever possible, academic colleagues will be encouraged
to publicise the work of the project to their peers. The Project Committee will provide strategic-level
dissemination. UK Data Archive will host a project web site that will provide copies of all project publications,
reports and presentations.
16. Exit/Sustainability and IPR
16.1 This project seeks to undertake groundwork preparatory to further infrastructure
development. It will conclude with recommendations for a greater promotional effort towards
uptake of the data standards and a full-scale data conversion facility/service, and the UK Data Archive will
archive the project web site for as long as JISC requires, up to a maximum of five years after
completion of the project.
16.2 Software developed will be made available to the UK HE/FE community on an open source
basis. Where appropriate, all outputs will be made available via SourceForge for further development.
17. Budget
17.1 Staffing
The main costed staffing element of the project covers the programmers and consultants at the
University of Essex. Specifically, Louise Corti will act as Project Director, and be ultimately
responsible for all the elements of the project, and will lead WP3. Her time and consultancy is
offered at no cost. Matthew Woollard will act as overall consultant to the project, contributing
digital preservation expertise. One new post will be sought to undertake the research and programming
work for WP3 and 4. WP2 staffing is being reconsidered to match the scope currently under revision, but
will most likely continue to comprise a consultant and a full-time research officer.