Steve Cassidy: The Australian National Corpus Initiative: Technical and Legal Issues


Presentation slides

Extended abstract PDF

Authors

Michael Haugh (Griffith University), Dennis Alexander (Department of Education, Employment, and Workplace Relations), Linda Barwick (University of Sydney), Denis Burnham (University of Western Sydney), Kate Burridge (Monash University), Steve Cassidy (Macquarie University), Michael Clyne (University of Melbourne/Monash University), Anne Fitzgerald (Queensland University of Technology), Cliff Goddard (University of New England), Jane Hunter (University of Queensland), Bruce Moore (Australian National University), Simon Musgrave (Monash University), Pam Peters (Macquarie University), Roly Sussex (University of Queensland), and Nick Thieberger (University of Melbourne)

Abstract

The Australian National Corpus Initiative involves a concerted push by linguists, applied linguists and language technologists to establish a massive online database of spoken and written language in Australia in all its forms and diversity. However, in order to maximize the potential for the Australian National Corpus to enable data sharing amongst researchers with an interest in Australian languages and society, the complex technical and legal issues that arise when attempting to share (historical) language data need to be addressed.

In this paper we first outline our vision for the Australian National Corpus (AusNC) as an online distributed set of multimodal and multilingual resources. By multimodal it is envisaged that not only plain text, but also visual texts, audio and audiovisual language data will feature in the corpus, while by multilingual it is intended that the corpus incorporate significant historical and prospective collections of English in Australia (including Australian English and migrant Englishes), indigenous languages, community languages, and Australian Sign Language. We then discuss its value as vital e-research infrastructure not only for linguists, but also for language technologists and for the broader Humanities and Social Sciences and informatics research communities who have an interest in Australian society. Next, we outline some of the key technical issues that we have identified, including building an underlying infrastructure for the AusNC, establishing appropriate annotation and metadata standards, meeting the challenges of incorporating legacy language data where media formats and annotation standards vary considerably, and capitalizing on the potential, as well as dealing with the challenges, of mining language data from the Web.

Lastly, we briefly outline some of the attendant legal challenges that arise when sharing language data, including copyright, privacy and moral rights, as well as issues relating to indigenous and ethnic communities.

About the speaker

Steve Cassidy is a computer scientist who has worked in various areas relating to language and cognition over the last 20 years. He completed a PhD in Wellington, New Zealand, on computer models of reading development and then moved to Macquarie University, Sydney, to work in the Speech Hearing and Language Research Centre (SHLRC). At SHLRC he worked on applying statistical models to acoustic phonetics problems and on the development of the Emu Speech Database System. His work on Emu has led to an involvement with groups in the US and Europe who are aiming to define standards for linguistic annotation. Steve is now working in the Computing department at Macquarie, where he is pursuing research on problems relating to the management of language resources and the provision of eResearch infrastructure to the linguistics community.