TaLC 2010

Pre-Conference Workshops


The following four workshops are being offered on Wednesday, June 30th. Each is a three-hour session and will take place in computer labs at the Faculty of Arts, where the main conference is being held. More details will be published here soon.

Lab 1 9.30 to 13.00	Build your own corpus, led by Adam Kilgarriff (United Kingdom)
Lab 2 9.30 to 13.00	Pedagogic corpus tools for language learning – usage, deployment and maintenance, led by María Sanchez-Tornel and Johannes Widmann (Germany)
Lab 1 14.00 to 17.30	Using Online Corpora - Corpus Query Language Explained, led by Jarmila Fictumová and Petr Sudický (Czech Republic)
Lab 2 14.00 to 17.30	Speech Corpus Construction with SpeechIndexer, led by Ulrike Glavitsch and Jozsef Szakos (Switzerland)

Build your own corpus

Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
University of Sussex

For a corpus lesson to work in the classroom, it has to be the right corpus. If the topic is football, then the corpus needs to be about football. Moreover, students’ expectations, from using the web, are that all the data should be there, instantly.

In the workshop we shall use a web tool, WebBootCaT, for producing instant corpora, and explore those corpora in a corpus query tool, the Sketch Engine. Students will have the opportunity to build and explore their own corpus, in the language of their choice. (For some languages – English, French, German, Italian, Spanish – there is also the option of part-of-speech-tagging the corpus.)

We shall also consider how much larger corpora – BNC-sized and beyond – can be developed from the web, and the issues of balance, filtering, and text “cleaning” that this presents. We have recently developed billion-plus word corpora for English, German and Italian and we shall describe the process, the issues raised, and the prospects that they open up.

We shall also explore how we can use corpora effectively, using the CQP query language and developing grammars for identifying the common subjects, objects and prepositions for a verb, the common adjectives and verbs for a noun, and so forth. We shall show how “word sketches” (one-page, corpus-based descriptions of a word’s grammatical and collocational behaviour) were developed and will give all participants the opportunity to develop their own.

Pedagogic corpus tools for language learning – usage, deployment and maintenance

Workshop Leaders: María Sanchez-Tornel, Johannes Widmann

Target audience: Heads of language centres and language units, language teaching professionals in language centres, higher education, and vocational education. The language of the workshop is English.

Prior knowledge required: Participants should have basic ICT skills: familiarity with Windows XP or Vista, file management, a word-processor (e.g. MS Word), navigating with a web browser (e.g. Internet Explorer), controlling a mouse, using the directional keys.

This workshop aims to:

introduce participants to the concept of pedagogic corpora and their contribution to a blended language-learning approach
show participants best-practice examples of integrating corpus activities in the classroom
show options to enrich corpora with task-based activities that can be shared among teachers
demonstrate how our tools can be used to deploy your own corpora at your institution
provide hands-on practice in using the annotation tools of the BACKBONE project
examine and discuss the language-teaching potential of pedagogic corpora in web-based language learning scenarios

Format:
The workshop combines a short introductory presentation with plenty of hands-on practice on individual computers. It will take place in a computer lab equipped with a projection screen and loudspeakers. You will have enough time to try out the search tool and to experiment with the other tools.

Contents: learning how to use the BACKBONE and SACODEYL search tools

the available corpora (English, French, German, Italian, Polish, Romanian, Spanish, Turkish)
the 4 different search modes
the ready-made exercises (communicative, exploratory and focus-on-form) that are available
learning about the tools available to deploy your own corpora
learning how to use the Annotator, a tool for pedagogic corpus creation

Workshop schedule

Introduction: pedagogic corpora
Basics: A blended language-learning approach and the place of pedagogic corpora
Hands-on practice: Using the BACKBONE search tool for language learning
Hands-on practice: Using the tools to create your own corpora (corpus compilation, annotation and pedagogical enrichment)
Questions, feedback

If you have your own texts that you would like to turn into pedagogic corpora, bring your transcripts with you and we can help you get started right in the workshop.

Corpus Query Language Explained

Jarmila Fictumová, Masaryk University, Brno, Czech Republic
Petr Sudický, Masaryk University, Brno, Czech Republic

Keywords: translation, writing, collocations, academic speech and writing, online corpora

“Never before have so many electronic resources been available to support the teaching of English. From a wide variety of online corpora to specialized archives of speech and writing, teachers and students are faced with the challenge of understanding these resources and selecting those appropriate to their purpose.”

Based on Exploring English with Online Corpora, a book by Wendy Anderson and John Corbett, published by Palgrave Macmillan in 2009, the workshop will tackle issues in the use of corpora in teaching English as a foreign language and translating into English, striving to provide guidelines for both teachers and students to help them become autonomous, confident users of the English language.

A course in Eldum (Moodle) will be set up for the participants of the workshop. It will be freely available to use and download before and after the TaLC conference and will provide all the workshop materials, as well as other useful information.

The workshop should provide an introduction to online corpora and a guide to interpreting corpus data, suggest how corpora can be integrated into language and translation courses, and, last but not least, provide a glossary of terms and a list of suggested further resources. No previous knowledge of corpus tools or terminology is required. The online corpora used will include, among others, the following:

British National Corpus (BNC) – used via Bonito, Just the Word, BYU, SARA

New Model Corpus (via Corpus Architect)
British Academic Spoken English Corpus (BASE)
British Academic Written English Corpus (BAWE)
ukWaC British English web corpus (ukWaC)
BYU Corpus of Contemporary American English (COCA)
Michigan Corpus of Academic Spoken English (MICASE)

The purpose of the 3-hour workshop is to provide hands-on experience in using the corpora to find answers to language queries, in particular by using CQL (Corpus Query Language); other options will be mentioned as well.
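
As a rough illustration of what a CQL query expresses, the sketch below re-implements the matching of one simple query, [lemma="make"] [tag="N.*"], in Python. The toy sentence, the Penn-Treebank-style tags and the helper names token_matches and find_matches are assumptions made for this example; they are not part of the Sketch Engine or of any corpus listed above, and real corpora may use different tagsets.

```python
import re

# Toy corpus: each token carries the attributes a CQL query can test.
# The sentence and the tags are illustrative only.
TOKENS = [
    {"word": "She", "lemma": "she", "tag": "PRP"},
    {"word": "made", "lemma": "make", "tag": "VBD"},
    {"word": "progress", "lemma": "progress", "tag": "NN"},
    {"word": "and", "lemma": "and", "tag": "CC"},
    {"word": "makes", "lemma": "make", "tag": "VBZ"},
    {"word": "mistakes", "lemma": "mistake", "tag": "NNS"},
]

# Python rendering of the CQL query  [lemma="make"] [tag="N.*"]
# One dictionary per token position; values are regular expressions.
QUERY = [{"lemma": "make"}, {"tag": "N.*"}]


def token_matches(token, constraint):
    """Return True if every constrained attribute fully matches its regex."""
    return all(re.fullmatch(rx, token[attr]) for attr, rx in constraint.items())


def find_matches(tokens, query):
    """Return the word sequences whose consecutive tokens satisfy the query."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(token_matches(t, c) for t, c in zip(window, query)):
            hits.append(" ".join(t["word"] for t in window))
    return hits


print(find_matches(TOKENS, QUERY))
```

The query finds any form of the verb "make" followed directly by a noun, which is the kind of collocation search a learner might run when unsure whether to write "make" or "do".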

We would like to draw on the section Using Corpus Query Language for complex searches of James Thomas’ website, The Sketch Engine in Practice, and to use samples of students’ work to demonstrate typical problems that students of English as a foreign language may need to deal with. The samples will be taken from students’ writing and theses (BA and MA levels), as well as translations into English (MA level).

The target audience is anybody interested in the practical issues of everyday corpus use in teaching ESL, writing in English and translating into English as a non-native language. Various features of Moodle (glossaries, wikis, tests, and others) will be used in the workshop.

Recommended reading:
Adriano Ferraresi, Eros Zanchetta, Marco Baroni, Silvia Bernardini: Introducing and evaluating ukWaC, a very large web-derived corpus of English
The BAWE Corpus Manual for the project entitled 'An Investigation of Genres of Assessed Writing in British Higher Education', funded by the ESRC

Speech Corpus Construction with SpeechIndexer

This workshop is about the building of speech corpora using the SpeechIndexer software suite. Its outline is as follows:

Indexing of speech files
Speech Concordancing
Translator/interpreter training

The SpeechIndexer software supports speech corpus construction so that speech corpora can be accessed in a way analogous to text collections. The software was originally developed to document aboriginal Formosan languages [1]. The main idea is to correlate segments of speech with the corresponding text segments by means of so-called indices. Indices are stored in a file and constitute the links between the audio and the text. From an indexed text segment it is immediately possible to listen to the corresponding audio segment. In later years, SpeechIndexer was extended by TextBookMaker and TextBookBrowser, which are used to create teaching materials from authentic speech recordings [2]. The software incorporates a pause finder that subdivides a speech file into a sequence of pause and speech segments. This speeds up the indexing of speech files because the pause finder supplies the intonational units. In addition, audio, text, index and segmentation files can be stored under a project folder. All project components can be loaded together, which allows efficient handling of the various corresponding files.
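
The idea of an index as a link between a text segment and an audio segment can be sketched as a small data structure. This is only an illustration in Python: the Index class, its field names and the use of character offsets and seconds are assumptions made for the example and do not describe the actual SpeechIndexer index file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Index:
    """Hypothetical link between a stretch of transcript and a stretch of audio."""
    text_start: int     # character offset where the text segment begins
    text_end: int       # character offset just past the text segment
    audio_start: float  # start of the audio segment, in seconds
    audio_end: float    # end of the audio segment, in seconds


TRANSCRIPT = "Good morning. Welcome to the workshop."

# Invented timings, one index per sentence of the transcript.
INDICES = [
    Index(0, 13, 0.00, 1.20),
    Index(14, 38, 1.65, 4.10),
]


def audio_for_offset(indices, offset):
    """Return (start, end) in seconds for the indexed segment containing offset."""
    for index in indices:
        if index.text_start <= offset < index.text_end:
            return (index.audio_start, index.audio_end)
    return None  # the offset falls outside every indexed segment


# Selecting a position inside "Welcome" yields the audio span to play back.
print(audio_for_offset(INDICES, 20))
```

A lookup of this kind is what makes it possible to go from a selected piece of text straight to the matching stretch of the recording.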

The SpeechConcordancer software was a new addition last year. This application generates concordances across audio archives. The speech files must be indexed and be made available as projects. SpeechConcordancer receives a list of project files and a word list as its input. Upon clicking on a word in the word list, the concordance is computed and displayed to the user. The user may listen to the audio segment of each search result.
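
A concordance across several indexed projects can be pictured roughly as follows. The sketch is written in Python with invented project names, sentences and timings; the PROJECTS structure and the concordance function are assumptions for illustration, not the SpeechConcordancer implementation or its project file format.

```python
import re

# Hypothetical indexed projects:
# project name -> list of (segment text, audio start in s, audio end in s)
PROJECTS = {
    "interview_a": [
        ("We collected the corpus last year.", 0.0, 2.4),
        ("The recordings were indexed by hand.", 2.9, 5.6),
    ],
    "lecture_b": [
        ("A corpus is a collection of texts.", 0.0, 2.8),
    ],
}


def concordance(projects, word):
    """Return every indexed segment, in any project, that contains the word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    hits = []
    for name, segments in projects.items():
        for text, start, end in segments:
            if pattern.search(text):
                # Keep the audio span so the hit can be played back.
                hits.append((name, text, start, end))
    return hits


for name, text, start, end in concordance(PROJECTS, "corpus"):
    print(f"{name} [{start:.1f}-{end:.1f} s] {text}")
```

Each hit keeps its audio span, which corresponds to the ability to listen to the segment behind every search result.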

The most recent initiative in the SpeechIndexer family is the development of a teaching and training tool for interpreters. The new tool is called SpeechInterpreter and makes use of two speech corpora: one in the original language and one in the target language. Both corpora must be indexed, i.e. indices must have been established between their text and speech files. SpeechInterpreter allows the user to create correlations between the indices of pairs of audio files in the original and the target language. These correlations between index files can be saved permanently. Functions for listening to correlated indices and various search options will be provided. For example, interpreting students may step through a pair of indexed speech files (original and interpreted speech) and listen to the correlated indices in each language.

The first part of the workshop will concentrate on creating speech corpora with SpeechIndexer. The participants will learn how to segment audio files, create indices between speech segments and the corresponding texts, mark up a whole speech file and create project files for easier use. In the second part of the workshop, they will be taught to use the elaborate concordancing functions provided by SpeechConcordancer to search across speech archives. The third and final part of the workshop will introduce a prototype of the new SpeechInterpreter software that can be used to teach and train translators and interpreters. The participants will learn how to correlate indices of two different audio files and use the functions to listen to correlated segments. During the workshop the authors will demonstrate their various indexed speech collections.

Workshop participants may use the beta version of the SpeechIndexer software suite, which is freely available at www.speechindexer.ethz.ch. All SpeechIndexer programs are implemented in C#/.NET and run under Windows XP, Vista or Windows 7. The installation guide on the website gives the necessary information on the installation process, and the user’s guide describes the basic SpeechIndexer functions. Participants are invited to bring their own audio and transcript data. For the SpeechIndexer software components to read them, the audio files must be in WAVE format and the texts must be UTF-8 encoded. A set of speech files and their corresponding transcripts that can be used in the workshop will be ready for download here a few weeks before the workshop.
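
Participants who want to check their own data before the workshop could use something like the following Python sketch. It is not part of the SpeechIndexer suite; the function names are assumptions, and the check relies on Python's standard wave module, which mainly handles uncompressed PCM files, so it may reject WAVE files that other software can read.

```python
import os
import tempfile
import wave


def is_wave_file(path):
    """Return True if Python's wave module can open the file as WAVE audio."""
    try:
        with wave.open(path, "rb"):
            return True
    except (wave.Error, EOFError):
        return False


def is_utf8_file(path):
    """Return True if the whole file decodes as UTF-8."""
    try:
        with open(path, "rb") as handle:
            handle.read().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Demonstration with throwaway files in a temporary folder.
workdir = tempfile.mkdtemp()

wav_path = os.path.join(workdir, "sample.wav")
with wave.open(wav_path, "wb") as out:
    out.setnchannels(1)                 # mono
    out.setsampwidth(2)                 # 16-bit samples
    out.setframerate(16000)             # 16 kHz
    out.writeframes(b"\x00\x00" * 160)  # 10 ms of silence

utf8_path = os.path.join(workdir, "transcript_utf8.txt")
with open(utf8_path, "w", encoding="utf-8") as out:
    out.write("Fictumová")

latin1_path = os.path.join(workdir, "transcript_latin1.txt")
with open(latin1_path, "wb") as out:
    out.write("Fictumová".encode("latin-1"))

for path in (wav_path, utf8_path, latin1_path):
    print(os.path.basename(path), is_wave_file(path), is_utf8_file(path))
```

A transcript that fails the UTF-8 check can usually be fixed by re-saving it from a text editor with UTF-8 selected as the encoding.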

[1] J. Szakos, U. Glavitsch. Seamless Speech Indexing and Retrieval: Developing a New Technology for the Documentation and Retrieving of Endangered Formosan Languages. Proc. Intl. Conference on Education and Information Systems: Technologies and Applications (EISTA'04), Orlando, Florida, July 21 - 25, 2004.

[2] J. Szakos, U. Glavitsch, O. Hess. From Speech Corpora to Textbook Generation - Extending software technology to non-European languages. 7th Teaching and Language Corpora Conference (TaLC7), Paris, France, July 2 - 4, 2006.

Quick Contact

James Thomas
Conference organiser
+420 549 49 7614
talc2010@gmail.com

Jiri Salamoun
Conference Secretary
talcinformation@gmail.com

Costin Alexiu
Webmaster
Contact directly in case of technical problems.
alexixalex@gmail.com
