Automatic Linguistic Annotation of Large Scale L2 Databases: The EF-Cambridge Open Language Database (EFCamDat)
Jeroen Geertzen, Theodora Alexopoulou, and Anna Korhonen
University of Cambridge
1. Introduction
Naturalistic learner productions are an important empirical resource for SLA research. Some pioneering works have produced valuable second language (L2) resources supporting SLA research.[1] One common limitation of these resources is the absence of individual longitudinal data for numerous speakers with different backgrounds across the proficiency spectrum, which is vital for understanding the nature of individual variation in longitudinal development.[2]
A second limitation is the relatively restricted amount of data annotated with linguistic information (e.g., lexical, morphosyntactic, or semantic features) to support the investigation of SLA hypotheses and the extraction of developmental patterns for different linguistic phenomena. Where available, annotations tend to be obtained manually, a situation that immediately limits the quantity of data that can be annotated with reasonable human resources and within reasonable time. Natural Language Processing (NLP) tools can provide automatic annotations for parts-of-speech (POS) and syntactic structure and are indeed increasingly applied to learner language in various contexts. Systems in computer-assisted language learning (CALL) have used a parser and other NLP tools to automatically detect learner errors and provide feedback accordingly.[3] Some work has aimed at adapting annotations provided by parsing tools to accurately describe learner syntax (Dickinson & Lee, 2009) or has evaluated parser performance on learner language and the effect of learner errors on the parser. Krivanek and Meurers (2011) compared two parsing methods, one using a hand-crafted lexicon and one trained on a corpus. They found that the former is more successful in recovering the main grammatical dependency relations, whereas the latter is more successful in recovering optional, adjunction relations. Ott and Ziai (2010) evaluated the performance of a dependency parser trained on native German (MaltParser; Nivre et al., 2007) on 106 learner answers to a comprehension task in L2 German. Their study indicates that while some errors can be problematic for the parser (e.g., omission of finite verbs), many others (e.g., wrong word order) can be parsed robustly, resulting in overall high performance scores.
In this paper we have two goals. First, we introduce a new English L2 database, the EF Cambridge Open Language Database, henceforth EFCAMDAT. EFCAMDAT was developed by the Department of Theoretical and Applied Linguistics at the University of Cambridge in collaboration with EF Education First, an international educational organization.
* We are grateful to Detmar Meurers, Henriëtte Hendriks, and Rachel Baker and colleagues for comments and discussion throughout the development of the database and its evaluation. We would also like to thank Dimitris Michelioudakis and Toby Hudson for manual annotation/correction of parsed text, and Laura Rimell for her advice and comments. We also gratefully acknowledge support by the Isaac Newton Trust at Trinity College Cambridge and EF Education First for this research.
[1] See for instance the ESF database, ZISA corpus, FALKO, and The International Corpus of Learner English (ICLE), among others; a comprehensive list of learner corpora around the world is compiled by Sylviane Granger's team at the Centre for English Corpus Linguistics at the Université Catholique de Louvain and can be found at https://www.uclouvain.be/en-cecl-lcworld.html.
[2] There are of course exceptions to this: for instance, the ESF database (Feldweg, 1991) and FLLOC (Myles & Mitchell, 2007) both follow groups of learners over a period.
[3] Such works use early rule-based parsing, as in Barchan, Woodmansee, and Yazdani, 1986, and Jensen, Heidorn, Miller, and Ravin, 1983, or more modern approaches, as in Amaral and Meurers, 2011, and Menzel and Schröder, 1999.
It contains writings submitted to Englishtown, the online school of EF, accessed daily by thousands of learners worldwide. EFCAMDAT stands out for two reasons: its size and its rich individual longitudinal data from learners with a wide variety of L1 backgrounds. The magnitude of EF operations has allowed us to build a resource of considerable size, currently containing half a million scripts from 85K learners and summing to 33 million words. Crucially, the progress of individual learners can be followed over time. As new data come in, the database will continue to grow, allowing investigation of the longitudinal development of larger numbers of learners. EFCAMDAT is an open access resource available to the research community via a web-based interface at http://corpus.mml.cam.ac.uk/efcamdat/, subject to a standard user agreement.
Our second goal is to evaluate parser performance on EFCAMDAT data. Our study provides users of EFCAMDAT with information on the accuracy of the automatically obtained morpho-syntactic annotations that accompany EFCAMDAT data. Parser performance is evaluated using a set of manually annotated EFCAMDAT data. We provide information on the effect of different types of errors on parsing, as well as on aspects of learner language that are challenging for automated linguistic analysis.
This article is structured as follows: section 2 elaborates on the nature and characteristics of the learner
data. The syntactic annotations are evaluated and discussed in section 3. We conclude in section 4.
2. Data characteristics
EFCAMDAT consists of essays submitted to Englishtown, the online school of EF Education First,
by language learners all over the world (Education First, 2012). A full course in Englishtown spans 16
proficiency levels aligned with common standards such as TOEFL, IELTS, and the Common European
Framework of Reference for languages (CEFR) as shown in Table 1.
Table 1: Englishtown skill levels in relation to common standards (indicative).
Englishtown 1-3 4-6 7-9 10-12 13-15 16
Cambridge ESOL - KET PET FCE CAE -
IELTS - <3 4-5 5-6 6-7 >7
TOEFL iBT - - 57-86 87-109 110-120 -
TOEIC Listening & Reading 120-220 225-545 550-780 785-940 945 -
TOEIC Speaking & Writing 40-70 80-110 120-140 150-190 200 -
CEFR A1 A2 B1 B2 C1 C2
Learners are allocated to proficiency levels after a placement test when they start a course at EF[4] or through successful progression through coursework. Each of the 16 levels contains eight lessons, offering a variety of receptive and productive tasks. EFCAMDAT consists of the scripts of the writing tasks at the end of each lesson, on topics like those listed in Table 2.
Table 2: Examples of essay topics at various levels. Level and unit number are separated by a colon.
ID Essay topic ID Essay topic
1:1 Introducing yourself by email 7:1 Giving instructions to play a game
1:3 Writing an online profile 8:2 Reviewing a song for a website
2:1 Describing your favourite day 9:7 Writing an apology email
2:6 Telling someone what you’re doing 11:1 Writing a movie review
2:8 Describing your family’s eating habits 12:1 Turning down an invitation
3:1 Replying to a new penpal 13:4 Giving advice about budgeting
4:1 Writing about what you do 15:1 Covering a news story
6:4 Writing a resume 16:8 Researching a legendary creature
Given 16 proficiency levels and eight units per level, a learner who starts at the first level and completes all 16 proficiency levels would produce 128 different essays. Essays are graded by language teachers; learners may only proceed to the next level upon receiving a passing grade. Teachers provide feedback to learners using a basic set of error markup tags or through free comments on students' writing. Currently, EFCAMDAT contains teacher feedback for 36% of scripts.
[4] Starting students are placed at the first level of each stage: one, four, seven, ten, thirteen, or sixteen.
The data collected for the first release of EFCAMDAT contain 551,036 scripts (with 2,897,788 sentences and 32,980,407 word tokens) written by 84,864 learners. We currently have no information on the L1 backgrounds of learners.[5] Information on nationality is, thus, used as the closest approximation to L1 background. EFCAMDAT contains data from learners of 172 nationalities. Table 3 shows the spread of scripts across the nationalities with most learners.[6]
Table 3: Percentage and number of scripts per nationality of learners.
Nationality Percentage of scripts Number of Scripts
Brazilians 36.9% 187,286
Chinese 18.7% 96,843
Russians 8.5% 44,187
Mexicans 7.9% 41,115
Germans 5.6% 29,192
French 4.3% 22,146
Italians 4.0% 20,934
Saudi Arabians 3.3% 16,858
Taiwanese 2.6% 13,596
Japanese 2.1% 10,672
Few learners complete all of the proficiency levels. For many, the start or end of their interaction with Englishtown fell outside the data collection period for the first release of EFCAMDAT. More generally, many learners only complete portions of the program. Nevertheless, around a third of learners (around 28K) have completed three full levels, corresponding to a minimum of 24 scripts.[7] Only 500 learners have completed every unit from level one to six (accounting for at least 48 scripts).
Characterizing scripts quantitatively is difficult because of the variation across topics and proficiency levels. Texts range from a list of words or a few short sentences to short narratives or articles. As learners become more proficient, they tend to produce longer scripts. On average, scripts contain seven sentences (SD = 3.8). Sample scripts are shown in Figure 1.
3. Syntactic annotations
3.1. Background and motivation
Many learner corpora have been annotated for errors and, more recently, for lemmas, parts of speech (POS), and grammatical relations (Díaz-Negrillo, Meurers, Valera, & Wunsch, 2010; Granger, 2003; Granger, Kraif, Ponton, Antoniadis, & Zampa, 2007; Lüdeling, Walter, Kroymann, & Adolphs, 2005; Meurers, 2009; Nicholls, 2003). Information on POS and grammatical relations allows the investigation of morpho-syntactic and also some semantic patterns. Such annotation typically involves two distinct levels, one providing category information (POS) and a second providing syntactic structure, typically represented by phrase structure trees or grammatical dependencies between pairs of words.
As mentioned in the introduction, NLP tools can provide annotations automatically. One critical question is how learner errors and untypical learner production patterns affect parsing performance. POS taggers rely on a combination of lexical, morphological, and distributional information to provide the most likely POS annotation for a given word. But in learner language these three sources of information may point to different outcomes (Díaz-Negrillo et al., 2010; Meurers, 2009; Ragheb & Dickinson, 2011).
[5] Metadata on the L1 background of learners is being collected for the second release of the database.
[6] Of the 172 nationalities, 28 have over 100 learners, and 38 have over 50 learners.
[7] If learners do not receive a satisfactory score on their writing, they may repeat the task, which means that a learner will have a minimum of eight scripts per level but may have more if they repeat the task.
1. LEARNER 18445817, LEVEL 1, UNIT 1, CHINESE
Hi! Anna,How are you? Thank you to sendmail to me. My name’s
Anfeng.I’m 24 years old.Nice to meet you !I think we are friends
already,I hope we can learn english toghter! Bye! Anfeng.
2. LEARNER 19054879, LEVEL 2, UNIT 1, FRENCH
Hi, my name’s Xavier. My favorite days is saturday. I get up at
9 o’clock. I have a breakfast, I have a shower... Then, I goes
to the market. In the afternoon, I play music or go by bicycle. I
like sunday. And you ?
3. LEARNER 19054879, LEVEL 8, UNIT 2, BRAZILIAN
Home Improvement is a pleasant protest song sung by Josh Woodward.
It’s a simple but realistic song that analyzes how rapid changes
in a town affects the lives of many people in the name of progress.
The high bitter-sweet voice of the singer, the smooth guitar along
with the high pitched resonant drum sound like a moan recalling
the past or an ode to the previous town lifestyle and a protest to
the negative aspects this new prosperous city brought. I really
enjoyed this song.
Figure 1: Three typical scripts, in which learners are asked to introduce themselves (1), describe their
favourite day (2), and review a song for a website (3).
In particular, Díaz-Negrillo et al. (2010) discuss a range of mismatches, for instance between form and distribution (1-a) and between stem and distribution (1-b). In (1-a) the verb want lacks 3rd person agreement. Thus, the distributional information corresponds to a 3rd-person verbal form tag (VBZ), while the morphological information is consistent with a bare verb form tag (VB). Similarly, in (1-b) the stem choice is of category noun, but the morphological and distributional information would indicate verbal tags.
(1) a. ...if he want to know this...
b. ... to be choiced for a job...
(Díaz-Negrillo et al., 2010, ex. 9, 16)
Turning to syntax, learners often make syntactic mistakes, e.g., word order mistakes involving misplaced
adverbs as in it brings rarely such connotations (Granger et al., 2007).
Results discussed in Díaz-Negrillo et al. (2010) and Meurers (2009) indicate that taggers show robustness to many such mismatches and that good accuracy scores can be obtained for POS tagging.
There is, though, the question of whether native annotations are suitable descriptions of learner language.
Ragheb and Dickinson (2011) rightly argue that annotating learner language with existing tools is an
instance of Bley-Vroman’s comparative fallacy. Bley-Vroman (1989) has argued for the need to analyse
learner language in its own right rather than with categories from the target language. In a series of
papers, Ragheb and Dickinson (2011) and Dickinson and Ragheb (2009, 2011) propose an annotation
model that seeks to capture the properties of learner language rather than to superimpose native tags.
They propose a multilayered annotation scheme where distributional and morphological information is
annotated separately. For example, the annotation for (1-a) would just record the mismatch. Under this
view (a good part of) errors are mismatches between stem, morphological, and distributional information.
Errors affect not only
POS tagging but can also reduce the probability of parses (Wagner & Foster, 2009).
At the same time, parsers show robustness to many errors, e.g., word order (Ott & Ziai, 2010).
In the following sections we present results on parser performance for
EFCAMDAT data and discuss
the effect of learner errors on parsing.
3.2. Methodology
To evaluate how parsers for well-formed English perform on learner language, a sample of EFCAMDAT data was annotated with POS tags and syntactic structure. The Penn Treebank tagset (Marcus, Marcinkiewicz, & Santorini, 1993) was used for POS annotation because it is widely used and offers a relatively simple and straightforward set of tags. It comprises 36 POS tags, as listed in the Appendix, and a further 12 tags to mark punctuation and currency symbols. Syntactic structure was annotated in the form of dependency relations between pairs of words, where one word is analyzed as a head (governor) and the other as a dependent (Tesnière & Fourquet, 1959). The main feature of this annotation scheme is the absence of constituency and phrasal nodes. Because of its less hierarchical structure, this framework is particularly suitable for manual annotation. An example is illustrated in Figure 2 and compared with a more standard phrase structure annotation.[8]
Phrase structure:
(S (NP (PRP I))
   (VP (VBP hope)
       (SBAR (S (NP (PRP we))
                (VP (MD can)
                    (VP (VB learn)
                        (NP (NNP English))
                        (ADVP (RB together))))))))

Dependency graph (hope is the root): nsubj(hope, I); ccomp(hope, learn); nsubj(learn, we); aux(learn, can); dobj(learn, English); advmod(learn, together)

Column-based coding:
1  I          PRP  2  nsubj
2  hope       VBP  0  root
3  we         PRP  5  nsubj
4  can        MD   5  aux
5  learn      VB   2  ccomp
6  English    NNP  5  dobj
7  together   RB   5  advmod

Figure 2: The sentence "I hope we can learn English together" annotated with phrase structure (top) and with dependency relations, given as a dependency graph and the corresponding column-based coding.
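In practice, annotations of this kind can be produced automatically by off-the-shelf NLP toolkits. The sketch below is purely illustrative and uses the spaCy library rather than the Stanford parser used for EFCAMDAT; the POS tags are Penn Treebank, but spaCy's dependency label inventory differs slightly from the Stanford Dependency scheme.

```python
# Illustrative sketch only: reproduce the column-based coding of Figure 2 with spaCy.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
# EFCAMDAT itself was annotated with the Stanford parser, not spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I hope we can learn English together")

for token in doc:
    # Head index 0 marks the root, as in the column format of Figure 2.
    head = 0 if token.head is token else token.head.i + 1
    print(token.i + 1, token.text, token.tag_, head, token.dep_.lower())
```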
A set of 1,000 sentences (11,067 word tokens) from EFCAMDAT was pseudo-randomly sampled with equal representation of all 16 proficiency levels and five of the best represented nationalities (i.e., Chinese, Russian, Brazilian, German, and Italian). These sentences were then tagged and parsed using Penn Treebank POS tags and a freely available state-of-the-art parser (the Stanford parser; Klein & Manning, 2003). Two trained linguists manually corrected the automatically annotated evaluation set (500 sentences each). They used the grammatical relations of the Stanford Dependency scheme (De Marneffe & Manning, 2008) and worked with collapsed representations.[9] The annotators marked learner errors and tagging/parsing errors manually on words, using two error tags, L for learner errors and P for POS tagging and parsing errors, and provided a corrected version. Prior to the annotation, both annotators corrected a set of 100 additional sentences with corresponding parses in order to measure inter-annotator agreement. Inter-annotator agreement was 95.3% on the full combination of POS tag, attachment to the head, and relation type. All disagreements were subsequently discussed with a third linguist to reduce incidental mistakes, which increased the final inter-annotator agreement to 97.1%.
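For concreteness, the agreement figure can be read as exact-match accuracy over (POS, head, relation) triples. The following is a minimal sketch under the assumption that each annotator's decisions are available as one such triple per word token (a hypothetical data layout, not the annotation tool actually used):

```python
# Sketch: exact-match inter-annotator agreement over (POS, head, relation) triples.
def agreement(annotator_a, annotator_b):
    """Proportion of tokens on which both annotators agree on all three layers."""
    assert len(annotator_a) == len(annotator_b)
    matches = sum(a == b for a, b in zip(annotator_a, annotator_b))
    return matches / len(annotator_a)

# Toy data for three tokens; the real calibration set contained 100 sentences.
a = [("PRP", 2, "nsubj"), ("VBP", 0, "root"), ("NNP", 2, "dobj")]
b = [("PRP", 2, "nsubj"), ("VBP", 0, "root"), ("NNP", 2, "iobj")]
print(f"agreement = {agreement(a, b):.1%}")  # 66.7% on this toy example
```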
One of our goals was to evaluate the effect of learner errors on automated annotation, since errors are rather frequent. In our sample of 1,000 sentences, 33.8% contain at least one learner error. These learner errors are mostly spelling and capitalization errors, but they also involve morphosyntactic and semantic irregularities or missing words. Some characteristic examples are illustrated in (2) and (3), which show learner errors together with corrections suggested by the annotators.
[8] Our choice, therefore, does not reflect a theoretical preference but is guided by the need to adopt an intuitive and flexible annotation system that manual coders can manage easily.
[9] To simplify patterns for extraction, dependencies involving prepositions and conjuncts, as well as information about the referent of relative clauses, are collapsed to obtain direct dependencies between content words.
(2) a. I often have meetings ar go on business trips
correction: and
b. I finally ger an offer as an secretary in a company
correction: get, a
c. At night i go to bed
correction: I
d. Because to whome could i talk English in Germany?
correction: whom, I
e. I can’t swim, sometimes I like take a walk
correction: to take
f. If you would like to work, like to volunteer, call to number 80000
correction: delete ‘to’
(3) a. To help me with my vocabulary I note down all the words that I cannot recognize in a
notebook which enclosed with a sentence for reference
correction: delete ‘which’ and replace ‘enclosed’ with ‘along’
b. It must be by far the exhilarating experience
correction: the most exhilarating
c. you must can my house soon
correction: be able to
d. to motivate people to reaching aims they would not reach theirselves
correction: reach, themselves
e. I’ll think about change my career
correction: changing
Examples in (2) and (3) give a sense of the range of difficulties. Some errors are more likely to affect the parser than others. For instance, the spelling error in (2-b), where we have ger instead of get, may not be as damaging as the spelling error in (2-a), where instead of and we have ar. This is because it is easier to recover a verbal category than a conjunction based on distributional evidence. Similarly, whome instead of whom in (2-d) may also have an effect on the parser, as it may miss a wh-category. The errors in (2-e-f) could be viewed as subcategorization errors, and it is conceivable that the parser can still recover the correct dependencies.[10]
The effect of an error like (3-a) is rather unpredictable. Such cases illustrate how challenging it can
be to describe some errors. Under one analysis, this sentence may just have a missing ‘is’ (which is
enclosed....). However, our annotator preferred to rephrase. Cases like this require a systematic analysis
of the learner’s productions to establish whether such errors are an instance of a more general pattern
(e.g., auxiliary
BE omission) or not. It is hard to see how such issues can be addressed by error annotation
schemes.
The errors in (3-b-c) are semantic. Though interesting from an SLA perspective, they are unlikely
to affect the parser. The pair in (3-d-e) both involve an erroneous verbal form. But the effect on the
parser can be different. The parser may establish the correct dependency between reaching and motivate
in (3-d). But in (3-e) it may analyze change as a noun rather than a verbal category.
Our data also contain many cases of the mismatches discussed by Díaz-Negrillo et al. (2010), as illustrated in (4).
(4) a. but the young woman was catched at the hair by a passer-by
correction: caught, by
b. please find enclose
correction: enclosed
c. I am also certificated in cardio kickboxing
correction: certified
[10] See Dickinson and Ragheb (2009) for a discussion of annotation issues arising in relation to subcategorization errors in learner language.
Figure 3 illustrates a case where a learner error results in a parsing error. The sentence contains what looks like a preposition in place of an article; the parser analyzes the item as a preposition and postulates the wrong dependency, in which the noun sweater is the object of the preposition. The annotator restores the direct object interpretation for the noun and provides corrections as shown in Figure 3.
[Figure 3: parse of "Sue is wearing in brown sweater and jeans.", tagged Sue/NNP is/VBZ wearing/VBG in/IN brown/JJ sweater/NN and/CC jeans/NNS. The parser treats the erroneous in as a preposition (prep) and attaches sweater as its object (pobj); the manual correction replaces in with the determiner a (DT, det of sweater) and re-attaches sweater as the direct object (dobj) of wearing, coordinated with jeans (cc, conj).]
Figure 3: Example of a parsed sentence with learner and parsing errors, and manual corrections.
To assess the parser’s performance on L2 English and the effect of errors, it is crucial to know if
parsing errors occur in the context of well-formed or ungrammatical sentences. Manual annotation of
learner errors and parsing errors allows us to identify the following three scenarios:
1. learner error without a POS-tagging/parsing error;
2. learner error with a POS-tagging/parsing error;
3. a POS-tagging/parsing error without a learner error.
The first scenario may arise with semantic or morphological errors at word level that are not strong
enough to affect the grammatical rule selection during parsing. Such a situation is depicted by Figure 4,
in which the surface form of what appears to be intended as the verb get is correctly identified as such.
[Figure 4: parse of "I finally ger an offer as an secretary in a company", tagged I/PRP finally/RB ger/VBP an/DT offer/NN as/IN an/DT secretary/NN in/IN a/DT company/NN. Despite the learner errors, ger is tagged as a verb (VBP) and analysed as the root, with I as its subject (nsubj) and offer as its direct object (dobj).]
Figure 4: Learner errors (in black) without a parsing error.
The second scenario is one in which a learner error causes a parsing error, as exemplified in Figure 5.
Both ‘mop’ and ‘de’ are identified as foreign words and interpreted to be compound nouns (see also
Figure 3).
Finally, parsers are not always accurate even with grammatically correct native English. The third scenario, therefore, involves a parsing error without a learner error, as illustrated in Figure 6, where the word laundry is misanalysed as a subject of afternoon rather than an object of the verb does, and afternoon as a clausal complement rather than a modifier of the verb. Our goal here is to evaluate how often each of these scenarios is realized.
Figure 5: Learner error with a parsing error (both in black).
[Figure 6: parse of "Granny does the laundry every Tuesday afternoon", tagged Granny/NNP does/VBZ the/DT laundry/NN every/DT Tuesday/NNP afternoon/NN. The parser correctly analyses Granny as the subject (nsubj) of the root does, but misanalyses laundry as the subject (nsubj) of afternoon and afternoon as a clausal complement (xcomp) of does; Tuesday is attached to afternoon as a noun compound modifier (nn).]
Figure 6: Parsing error (in black) without learner error.
3.3. Results
Parser errors can be measured in various ways. Measurement may focus on errors per sentence or per word. Evaluation can also consider different aspects of annotation: POS tagging, dependency attachment, and the dependency relation label. For instance, Figure 5 contains POS errors for mop and de, which are tagged as foreign words. This tagging error results in an attachment and labelling error, since these two words are misanalysed as dependents of floor in a noun compound dependency. This is a case where a POS error has a knock-on effect on the syntactic annotation. But parsing errors may also appear without a POS error. For instance, in Figure 6 the dependency between the verb and afternoon is mislabelled as XCOMP (clausal complement) instead of modification, but this misanalysis is not triggered by a tagging error.
Parsing performance involving dependency relations is typically evaluated using labelled attachment score (LAS) and unlabelled attachment score (UAS). The former metric is the proportion of word tokens that are assigned both the correct head and the correct dependency relation label. The latter is the proportion of word tokens that are assigned the correct head, regardless of which dependency relation label is assigned. A correct assignment of a dependency relation has to match the (manually) annotated relation exactly.[11] Over all our sentences, the Stanford parser scored 89.6% LAS and 92.1% UAS, as shown in Table 4. This slightly exceeds results on Wall Street Journal (WSJ; Marcus et al., 1993) news text parsing at 84.2% LAS and 87.2% UAS (Cer, Marneffe, Jurafsky, & Manning, 2010). These measures do not include POS tagging. POS tagging also reaches a high accuracy of 96.1%. When POS tagging errors were included in the scoring, this yielded 88.6% LAS and 90.3% UAS.
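For readers who want to compute these scores themselves, here is a minimal sketch; the gold and predicted annotations are represented as hypothetical lists of (head, label) pairs per token, which is an assumption about data layout rather than the evaluation code actually used for EFCAMDAT:

```python
# Sketch: labelled (LAS) and unlabelled (UAS) attachment scores.
# Each annotation is a list of (head_index, relation_label) pairs, one per word token.
def attachment_scores(gold, predicted):
    assert len(gold) == len(predicted)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return correct_both / n, correct_heads / n  # (LAS, UAS)

# Toy example: one wrong label and one wrong head out of seven tokens.
gold = [(2, "nsubj"), (0, "root"), (5, "nsubj"), (5, "aux"), (2, "ccomp"), (5, "dobj"), (5, "advmod")]
pred = [(2, "nsubj"), (0, "root"), (5, "nsubj"), (5, "aux"), (2, "ccomp"), (5, "iobj"), (2, "advmod")]
las, uas = attachment_scores(gold, pred)
print(f"LAS = {las:.1%}, UAS = {uas:.1%}")  # LAS = 71.4%, UAS = 85.7%
```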
To get an idea of how parsing errors are distributed across sentences, accuracies were also computed at sentence level. In our sample of 1,000 sentences, 54.1% contained no labelled attachment error and 63.2% contained no unlabelled attachment error. As for POS tagging, 73.1% of sentences were error free.
The high accuracy scores indicate that the parser can recover the correct syntactic structure despite learner errors, as long as the likelihood of the overall dependency tree is sufficiently high (see Figure 4). To assess how the presence of learner errors affected parsing performance, we split the 1,000 sentences into two sets, the 33.8% that contain at least one learner error and the remaining 66.2% without learner errors, and parsed and evaluated them separately. For convenience, we will refer to these sets as the learner error absent (LA) set and the learner error present (LP) set, respectively. Scores are shown in Table 4. As expected, the difference in parsing accuracy between sentences with and without learner errors is considerable.
[11] A more lenient way of evaluating parsing performance could take into account the hierarchical ordering of the dependency relations in the Stanford typed dependency scheme. Under- or over-specification could then be treated as correct or partially correct. For instance, a passive nominal subject (nsubjpass) that gets labelled as a subject (subj) can be considered less of an error than confusing a direct object (dobj) and a subject.
Table 4: Accuracy scores for sentences with and without learner errors.
Sentences LAS UAS
All 89.6% 92.1%
Without learner error (LA) 92.6% 95.0%
With at least one learner error (LP) 83.8% 87.4%
In the LP set, 593 words were directly associated with a learner error. Of those words, 49.2% received an incorrect POS tag or syntactic dependency assignment. Note that a small percentage of these words would also have received erroneous assignments in the absence of the learner error. To obtain more detailed information on the effect of learner errors, we assessed the nature of the assignment errors. For this purpose, we selected in each set of sentences the words with assignment errors and looked at the error type, which can relate to the POS tag, the head attachment, and the relation label. Table 5 shows the percentages of assignment error types for individual words, with and without learner errors. POS tagging errors in isolation represent 21.6% of erroneous assignments in the LP set. Not surprisingly, in 53.1% of erroneous assignments in this set, a POS error also causes a head and a relation error; head and relation errors on their own occur considerably less frequently. The picture is rather different for the LA set: POS tagging errors in isolation account for only 7.3%, while head+relation errors in the absence of POS tagging errors are the most challenging category for this set, accounting for 41.4% of all assignment errors. In sum, learner errors primarily affect category assignment (POS tagging) and, as a consequence, the assignment of a dependency head and the labelling of the relation. Nonetheless, automatic assignment is still accurate for 50.8% of words in the LP set, indicating robustness to a significant number of learner errors.
Table 5: Error types of assignment errors in the presence of a learner error (based on 292 words of the
LP set) and in the absence of a learner error (based on 572 words of the LA set), where ‘only’ means
that all other aspects are correctly assigned.
Error type LA set LP set
POS only 7.3% 21.6%
Head only 18.7% 2.4%
Relation only 16.3% 3.4%
Head + Relation only 41.4% 8.6%
POS + Head + Relation 9.3% 53.1%
Our next question was whether proficiency interacts with parser performance. Our sample was balanced for proficiency, but it could have been that accuracy scores were significantly lower at lower proficiency levels.[12] We thus calculated scores for levels 1-6, 7-16, and 10-16 separately, as shown in Table 6. Accuracy scores are higher at higher proficiency levels, but the effect seems small.
Table 6: Accuracy scores for different levels of proficiency.
Proficiency level LAS UAS
L1-6 89.0% 91.5%
L7-16 90.2% 92.4%
L10-16 91.4% 93.2%
[12] We would like to thank two anonymous reviewers for raising this point.
human \ machine  acomp advmod amod appos auxpass ccomp conj-and dep dobj  nn nsubj nsubjpass poss prep-in tmod xcomp    #
ROOT                 0      0    0     0       0     0        0  33    0   0     0         0    0       0    0     0   15
advcl                0      0    0     0       0    18        0  40    0   0     0         0    0       0    0     0   22
advmod               0      0    9     0       0     0        0  26    0   0    11         0    0       0    0     0   53
amod                 0      0    0     0       0     0        0   0    0  29    14         0    0       0    0     0   34
ccomp                0      0    0     0       0     0        0  50    0   0     0         0    0       0    0     0   20
conj-and             0      0   11    11       0     9        0  18    0   7     0         0    0       0    0     0   53
cop                  0      0    0     0      25     0        0  16    0   0     0         0    0       0    0     0   24
dep                  0      0    0     0       0     0        0   0    0   0    30         0    0       0    0     0   13
det                  0      0   17     0       0     0       21   0    0   0     0         0    0       0    0     0   23
dobj                 6      4    0     0       0     0        3   7    0   3    37         0    0       0    4     3  107
iobj                 0      0    0     0       0     0        0   0   50   0    50         0    0       0    0     0    8
mark                 0     45    0     0       0     0        0  36    0   0     0         0    0       0    0     0   22
mwe                  0     16    0     0       0     0        0   0    0   0     0         0    0       0    0     0   24
nn                   0      0   29     0       0     0        0   0   12   0     0         0    0       0   12     0   31
nsubj                0      0    0     0       0     0       12   0    0   8     0        12    8       0    0     0   49
prep-to              0      0    0     0       0     0        0   0    0   0     0         0    0       0    0    50   10
rcmod                0      0    0     0       0    50        0  33    0   0     0         0    0       0    0     0   12
ref                  0      0    0     0       0     0        0   0    0   0    71         0    0       0    0     0    7
tmod                 0      0    0     0       0     0        0  47   21   0     0         0    0       0    0     0   19
Figure 7: Error matrix containing percentages of the most frequent errors in assigning dependency relation labels.
As some dependencies are more challenging for the parser to recover than others, we also measured accuracy for individual grammatical dependencies. The evaluation measures used were Precision, Recall, and F1. Recall measures how many of the annotated instances of a specific grammatical relation are correctly recovered by the parser. Precision measures how many of the instances labelled by the parser as being a specific grammatical relation are actually correct. The F1-score is the harmonic mean of precision and recall.[13] The grammatical relations that occur at least five times in the evaluation data are listed in Table 7, together with the evaluation metrics and the absolute frequency.
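Per-relation scoring of this kind can be sketched as follows; the dependency triples below are toy data, and representing a parse as a set of (head, dependent, label) triples is an assumption made for illustration:

```python
# Sketch: per-relation precision, recall, and F1 over (head, dependent, label) triples.
def per_relation_prf(gold, predicted):
    scores = {}
    labels = {label for _, _, label in gold | predicted}
    for label in labels:
        g = {t for t in gold if t[2] == label}
        p = {t for t in predicted if t[2] == label}
        tp = len(g & p)
        prec = tp / len(p) if p else 0.0
        rec = tp / len(g) if g else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[label] = (prec, rec, f1, len(g))
    return scores

# Toy data: the parser labels one direct object as an indirect object.
gold = {(2, 1, "nsubj"), (2, 5, "ccomp"), (5, 6, "dobj")}
pred = {(2, 1, "nsubj"), (2, 5, "ccomp"), (5, 6, "iobj")}
for label, (prec, rec, f1, freq) in sorted(per_relation_prf(gold, pred).items()):
    print(f"{label:8s} P={prec:.2f} R={rec:.2f} F1={f1:.2f} #GRs={freq}")
```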
As can be seen in Table 7, only a few grammatical relations scored below 90% F1. Adjectival complements (acomp, 73.3% F1) suffered from rather low recall, as did clausal subjects (csubj, 57.1% F1) and appositional modifiers (appos, 75.8% F1).
We next examined the most frequent parser ‘confusions’. The erroneous assignment of dependency relation labels is quantified using the error matrix in Figure 7, in which each column represents the label assigned by the parser, while each row represents the actual label. Only the more frequent errors (i.e., occurring at least four times) have been included; together they account for 38% of all parsing errors. The number at the end of each row is the absolute frequency with which the label on that row was erroneously assigned; each cell contains the percentage of those cases in which the label of the corresponding column was assigned instead.
One prominent parsing error was misanalysis of direct objects (107 cases). In 37% of these cases, the direct object was misanalysed as a nominal subject. This case is illustrated in Figure 6, in which the noun laundry is wrongly assigned to the nominal head afternoon as subject. A few confusions were expected, such as noun compounds (NN) with adjectival modifiers (AMOD), or indirect objects (IOBJ) with direct objects (DOBJ) or nominal subjects (NSUBJ), but there are no clear outliers. Only the DEPENDENT relation (DEP) is frequently over-assigned, as it is the most underspecified category in the dependency scheme.
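A row-normalised error matrix of this kind can be derived directly from the gold and predicted labels of the mislabelled tokens. A minimal sketch over hypothetical (gold label, predicted label) pairs, not the actual evaluation data:

```python
# Sketch: building a row-normalised confusion (error) matrix of dependency labels.
# Each pair is (gold_label, predicted_label) for a token whose label the parser got wrong.
from collections import Counter, defaultdict

def error_matrix(label_errors):
    counts = defaultdict(Counter)
    for gold_label, predicted_label in label_errors:
        counts[gold_label][predicted_label] += 1
    matrix = {}
    for gold_label, row in counts.items():
        total = sum(row.values())  # absolute frequency, as in the last column of Figure 7
        matrix[gold_label] = ({pred: 100 * n / total for pred, n in row.items()}, total)
    return matrix

# Toy data: three mislabelled direct objects and one mislabelled temporal modifier.
errors = [("dobj", "nsubj"), ("dobj", "nsubj"), ("dobj", "dep"), ("tmod", "dep")]
for gold_label, (row, total) in error_matrix(errors).items():
    cells = ", ".join(f"{pred}: {pct:.0f}%" for pred, pct in sorted(row.items()))
    print(f"{gold_label} (#{total}): {cells}")
```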
[13] The F1-score combines precision (P) and recall (R), and is defined as 2PR/(P + R).
Table 7: Grammatical relations occurring at least five times. Precision (Prec), recall (Rec), and F1 scores are listed for each grammatical relation, together with its absolute frequency (#GRs).
Relation Description Prec Rec F1 #GRs
root root 95.2 97.2 96.2 999
dep dependent 92.5 65.7 76.8 210
aux auxiliary 98.0 98.9 98.4 646
auxpass passive auxiliary 96.1 96.1 96.1 78
cop copula 96.6 98.7 97.6 313
arg argument
agent agent 100.0 87.5 93.3 8
comp complement
acomp adjectival complement 78.6 68.8 73.3 16
ccomp clausal complement with internal subject 93.0 86.0 89.4 189
xcomp clausal complement with external subject 93.8 94.2 94.0 243
complm complementizer 98.1 100.0 99.0 52
obj object
dobj direct object 93.8 95.7 94.7 694
iobj indirect object 81.8 100.0 90.0 18
pobj object of preposition 100.0 85.2 92.0 27
mark marker (word introducing an advcl) 91.9 97.8 94.8 93
subj subject
nsubj nominal subject 97.7 96.4 97.0 1343
nsubjpass passive nominal subject 95.6 94.2 94.9 69
csubj clausal subject 66.7 50.0 57.1 8
cc coordination 96.4 100.0 98.2 28
conj conjunct 93.2 97.6 95.3 463
expl expletive (expletive “there”) 100.0 100.0 100.0 40
mod modifier
amod adjectival modifier 96.9 95.9 96.4 657
appos appositional modifier 89.3 65.8 75.8 38
advcl adverbial clause modifier 86.6 97.0 91.5 100
det determiner 98.3 99.4 98.9 1024
predet predeterminer 90.0 100.0 94.7 10
preconj preconjunct 83.3 100.0 90.9 5
infmod infinitival modifier 85.7 80.0 82.8 15
partmod participial modifier 93.9 86.1 89.9 36
advmod adverbial modifier 95.2 95.4 95.3 507
neg negation modifier 98.9 100.0 99.5 91
rcmod relative clause modifier 92.9 89.7 91.2 58
quantmod quantifier 87.0 100.0 93.0 20
nn noun compound modifier 92.7 89.9 91.3 300
npadvmod noun phrase adverbial modifier 95.7 91.7 93.6 24
tmod temporal modifier 84.6 87.3 85.9 63
num numeric modifier 96.9 95.4 96.2 132
number element of compound number 100.0 100.0 100.0 10
prep prepositional modifier 95.7 96.3 96.0 975
poss possession modifier 99.1 98.8 99.0 336
prt phrasal verb particle 95.2 100.0 97.5 61
parataxis parataxis 62.5 100.0 76.9 5
3.4. Discussion
The most striking result of this evaluation is the high accuracy of the automated annotation tools, which demonstrates that the tools are robust to learner language. These scores are high even for lower proficiency levels (Table 6). There are a number of explanations for this result.
The first relates to the overall simplicity of learner language in comparison to native productions. For instance, learner sentences tend to be shorter and therefore less demanding for the parser. The average sentence length in this evaluation set is 11 words, whereas that of sentences in the British National Corpus (BNC; Burnard, 1995) is 16 words and that of the WSJ is 21 words. It is possible, then, that the shorter average sentence length of learner productions compensates for the effect of learner errors on the parser.
A second explanation relates to the nature of errors and the finding that automated tools show robustness to at least half of the learner errors. Many errors are semantic and do not affect parser performance, which targets syntactic structure. Consider for instance the examples in (5) from our sample. The parser has correctly identified the relevant grammatical dependencies, despite the learner errors (this is also true for examples (3-b) and (3-c)).
(5) a. ... I play all other kind of music....
b. .... the best convenience of the modern technology ....
c. .... it is a potential large group...
d. ... my older brother is 40 and my littler brother is ...
e. ... my wife is wearing a white and pink pants....
f. ....I try to visualise pleasant and restful place ...
Similarly, a good part of what appear to be subcategorization errors or choice-of-preposition errors do not affect the parser's ability to obtain the correct dependencies; see examples (2-e), (2-f), and (6).
(6) a. I encourage you to continue your support in respect of my presidency
b. I am looking forward for your feedback...
c. The body language in Russia is very similar with American one.
The erroneous patterns we see in (5) and (6) require a systematic investigation to be modelled accurately.
At the same time, the robustness of the parser is welcome since it allows us to obtain the main syntactic
patterns that are used by learners. Consider, for instance, the sentence mentioned earlier with an adverb intervening between the verb and its object, it brings rarely such connotations. The parse of this sentence
is shown in Figure 8. The parser ‘ignores’ the word order violation and correctly analyzes rarely as
an adverbial modifier and such connotations as a nominal direct object. Similarly, the parser identifies
a relative clause in (3-a) and analyzes to reaching in (3-d) as a complement of the main verb. The
robustness of the parser in these cases is important, because it captures syntactic structures that are used
in learner language. In sum, the parser often abstracts away from local morphosyntactic information
and word order violations but succeeds with the underlying syntactic dependencies. While this results
in an incomplete picture of learner language, it nevertheless reveals an important aspect of the grammar
underlying learner language. The current dependency annotations can be used to investigate the range of
word order violations and the way different types of dependents (e.g., adverbs vs. indirect objects) may
intervene between the verb and the direct object.
[Figure 8: parse of "it brings rarely such connotations", tagged it/PRP brings/VBZ rarely/RB such/JJ connotations/NNS. The parser analyses it as the subject (nsubj) of brings, rarely as an adverbial modifier (advmod), such as an adjectival modifier (amod) of connotations, and such connotations as the direct object (dobj) of brings.]
Figure 8: Sentence with a word order error.
Let us briefly consider the mismatches raised by Díaz-Negrillo et al. (2010) and Ragheb and Dickinson (2011). In general, the parser has resolved conflicting evidence in favor of morphological information (and our annotators have accepted morphology-based POS tags). Thus, catched in (4-a) and certificated in (4-c) are tagged as VBN (verb, past participle). Similarly, want in (1-a) is tagged as VBP (verb, non-3rd-person form). Such annotations do not reflect the relevant mismatches, but they enable a systematic investigation of the environments and lexical elements involved in them.
Finally, despite the very high accuracy scores, some patterns have been particularly challenging for the parser. For instance, the absence of critical morphological information may lead to a category error in tagging, which can then result in a wrong parse; change in (3-e), for example, has been analyzed as a noun.
In conclusion, the relatively high accuracy scores of the parser can be explained through a
combination of the overall simplicity of learner productions and the robustness of the parser that
tends to focus on the primary dependency relations ‘ignoring’ local morphosyntactic information, some
word order violations as well as semantic violations. Current NLP tools allow us to obtain reliable
morphosyntactic annotations for learner language that are vital for investigating a wide range of lexical
and morphosyntactic phenomena of L2 grammars.
4. Conclusions
EFCAMDAT is a new L2 English database built at the University of Cambridge, in collaboration with
EF Education First. It consists of scripts submitted to Englishtown, the online school of EF Education
First. It stands out for its size and rich individual longitudinal data from learners with a wide variety of
backgrounds. It is an open access resource available to the research community via a web-based interface
at http://corpus.mml.cam.ac.uk/efcamdat/, subject to a standard user agreement.
EFCAMDAT is tagged and parsed for POS and dependency relations using Penn Treebank tags and the
Stanford parser. The core of this paper presented the results of an evaluation study on parser performance
on a sample of 1,000 sentences from
EFCAMDAT. The results indicate that parsing English L2 with
current NLP tools generally yields performance close to that of parsing well-formed native English
text. This is despite the fact that 33.8% of our sample of 1,000 learner sentences contains at least one
learner error. Our analysis shows that current NLP tools are robust to a significant part of learner errors.
This is because a good part of errors involve local morphosyntactic, subcategorization, word order, and
semantic errors that do not affect the main dependency relations the parser targets. As a result, the parser
can successfully capture syntactic patterns that are used by learners and provide valuable annotations
for the investigation of a wide range of phenomena and SLA hypotheses. At the same time, some form
errors (spelling, morphosyntactic) can lead to category (POS) errors when the distributional information cannot reliably define a category (e.g., co-ordination) or the form is ambiguous between a noun and verb category.
The robust performance of current NLP tools on L2 English demonstrates that they can be
successfully used for automatic annotation of large scale databases like
EFCAMDAT. Moreover, they
provide a critical starting point for development of tools that can accurately model the erroneous and
untypical patterns of learner language.
In this work we have used one particular state-of-the-art parsing algorithm, the Stanford parser, but
various others could be used as well, such as the C&J reranking parser (Charniak & Johnson, 2005) and
the Berkeley parser (Petrov, Barrett, Thibaux, & Klein, 2006). It would be worth exploring other systems,
in isolation and by using ensemble models that combine independently-trained models at parsing time
to exploit complementary strengths (Sagae & Tsujii, 2007).
Appendix: The Penn Treebank POS tagset (excl. punctuation tags)
1. CC Coordinating conjunction 19. PRP$ Possessive pronoun
2. CD Cardinal number 20. RB Adverb
3. DT Determiner 21. RBR Adverb, comparative
4. EX Existential there 22. RBS Adverb, superlative
5. FW Foreign word 23. RP Particle
6. IN Preposition/subordinating conjunction 24. SYM Symbol
7. JJ Adjective 25. TO to
8. JJR Adjective, comparative 26. UH Interjection
9. JJS Adjective, superlative 27. VB Verb, base form
10. LS List item marker 28. VBD Verb, past tense
11. MD Modal 29. VBG Verb, gerund/present participle
12. NN Noun, singular or mass 30. VBN Verb, past participle
13. NNS Noun, plural 31. VBP Verb, non-3rd ps. sing. present
14. NNP Proper noun, singular 32. VBZ Verb, 3rd ps. sing. present
15. NNPS Proper noun, plural 33. WDT wh-determiner
16. PDT Predeterminer 34. WP wh-pronoun
17. POS Possessive ending 35. WP$ Possessive wh-pronoun
18. PRP Personal pronoun 36. WRB wh-adverb
References
Amaral, Luiz, & Meurers, Detmar. (2011). On using intelligent computer-assisted language learning in real-life
foreign language teaching and learning. ReCALL, 23, 4–24.
Barchan, Jonathan, Woodmansee, B., & Yazdani, Masoud. (1986). A PROLOG-based tool for French grammar
analysis. Instructional Science, 15, 21–48.
Bley-Vroman, Robert. (1989). What is the logical problem of foreign language learning? In Susan M. Gass, &
Jacquelyn Schachter (Eds.), Linguistic perspectives on second language acquisition (pp. 41–68). New York:
Cambridge University Press.
Burnard, Lou. (1995, May). Users Reference Guide for the British National Corpus. http://info.ox.ac.uk/bnc/.
Cer, Daniel, Marneffe, Marie-Catherine De, Jurafsky, Daniel, & Manning, Christopher D. (2010). Parsing to
Stanford dependencies: Trade-offs between speed and accuracy. In Proceedings of the Seventh International
Conference on Language Resources and Evaluation (LREC’10) (pp. 1628–1632). Valletta, Malta.
Charniak, Eugene, & Johnson, Mark. (2005). Coarse-to-fine n-best parsing and maxent discriminative reranking.
In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (pp. 173–180).
Stroudsburg, PA, USA: Association for Computational Linguistics.
De Marneffe, Marie-Catherine, & Manning, Christopher D. (2008). The Stanford typed dependencies representation.
In Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation
(pp. 1–8).
Díaz-Negrillo, Ana, Meurers, Detmar, Valera, Salvador, & Wunsch, Holger. (2010). Towards interlanguage POS
annotation for effective learner corpora in SLA and FLT. Language Forum, 36, 139–154. Special Issue on
Corpus Linguistics for Teaching and Learning. In Honour of John Sinclair.
Dickinson, Markus, & Lee, Chong Min. (2009). Modifying corpus annotation to support the analysis of learner
language. CALICO Journal, 26, 545–561.
Dickinson, Markus, & Ragheb, Marwa. (2009). Dependency annotation for learner corpora. In Marco Passarotti,
Adam Przepiórkowski, Savina Raynaud, & Frank Van Eynde (Eds.), Proceedings of the Eighth International
Workshop on Treebanks and Linguistic Theories (pp. 59–70). Milan, Italy.
Dickinson, Markus, & Ragheb, Marwa. (2011). Dependency annotation of coordination for learner language. In Kim
Gerdes, Eva Hajicova, & Leo Wanner (Eds.), Proceedings of the International Conference on Dependency
Linguistics (Depling) (pp. 135–144). Barcelona, Spain.
Education First. (2012). Englishtown. http://www.englishtown.com/.
Feldweg, Helmut. (1991). The European Science Foundation Second Language Database. Nijmegen, Netherlands:
Max Planck Institute for Psycholinguistics.
Granger, Sylviane. (2003). Error-tagged learner corpora and CALL: A promising synergy. CALICO, 20, 465–480.
Granger, Sylviane, Kraif, Olivier, Ponton, Claude, Antoniadis, Georges, & Zampa, Virginie. (2007). Integrating
learner corpora and natural language processing: A crucial step towards reconciling technological sophisti-
cation and pedagogical effectiveness. ReCALL, 19, 252–268.
Jensen, Karen, Heidorn, George E., Miller, Lance A., & Ravin, Yael. (1983). Parse fitting and prose fixing: Getting
a hold on ill-formedness. Computational Linguistics, 9, 147–160.
Klein, Dan, & Manning, Christopher D. (2003). Accurate unlexicalized parsing. In Proceedings of the 41st Annual
Meeting on Association for Computational Linguistics (pp. 423–430). Stroudsburg, PA, USA: Association
for Computational Linguistics.
Krivanek, Julia, & Meurers, Detmar. (2011). Comparing rule-based and data-driven dependency parsing of learner
language. In Proceedings of the International Conference on Dependency Linguistics (Depling 2011)
(pp. 128–132).
Lüdeling, Anke, Walter, Maik, Kroymann, Emil, & Adolphs, Peter. (2005). Multi-level error annotation in learner
corpora. In Proceedings from the Corpus Linguistics Conference Series (Vol. 1). Birmingham, UK.
Marcus, Mitchell P., Marcinkiewicz, Mary Ann, & Santorini, Beatrice. (1993, June). Building a large annotated
corpus of English: The Penn Treebank. Computational Linguistics, 19, 313–330.
Menzel, Wolfgang, & Schröder, Ingo. (1999). Error diagnosis for language learning systems. ReCALL, 11, 20–30.
Meurers, Detmar. (2009). On the automatic analysis of learner language. CALICO Journal, 26, 469–473.
Myles, Florence, & Mitchell, Rosamond. (2007). French learner language oral corpora. http://www.flloc.soton.ac.
uk/. University of Southampton.
Nicholls, Diane. (2003). The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In
Proceedings of the Corpus Linguistics Conference (pp. 572–581). Lancaster University: University Centre
for Computer Corpus Research on Language.
Nivre, Joakim, Hall, Johan, Nilsson, Jens, Chanev, Atanas, Eryigit, Gülsen, Kübler, Sandra, ... & Marsi, Erwin.
(2007). MaltParser: A language-independent system for data-driven dependency parsing. Natural Language
Engineering, 13, 95.
Ott, Niels, & Ziai, Ramon. (2010). Evaluating dependency parsing performance on German learner language.
In Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories (NEALT 2010)
(pp. 175–186).
Petrov, Slav, Barrett, Leon, Thibaux, Romain, & Klein, Dan. (2006). Learning accurate, compact, and interpretable
tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the
44th annual meeting of the Association for Computational Linguistics (pp. 433–440). Stroudsburg, PA, USA:
Association for Computational Linguistics.
Ragheb, Marwa, & Dickinson, Markus. (2011). Avoiding the comparative fallacy in the annotation of learner
corpora. In Gisela Granena, Joel Koeth, Sunyoung Lee-Ellis, Anna Lukyanchenko, Goretti Prieto Botana,
& Elizabeth Rhoades (Eds.), Selected Proceedings of the 2010 Second Language Research Forum:
Reconsidering SLA research, dimensions, and directions (pp. 114–124). Somerville, MA, USA: Cascadilla
Proceedings Project.
Sagae, Kenji, & Tsujii, Jun’ichi. (2007). Dependency parsing and domain adaptation with LR models and parser
ensembles. In Proceedings of the CoNLL shared task session of EMNLP-CoNLL (Vol. 7, pp. 1044–1050).
Tesnière, Lucien, & Fourquet, Jean. (1959). Éléments de syntaxe structurale. Paris: Klincksieck.
Wagner, Joachim, & Foster, Jennifer. (2009). The effect of correcting grammatical errors on parse probabilities. In
Proceedings of the 11th International Conference on Parsing Technologies (pp. 176–179). IWPT ’09. Paris,
France: Association for Computational Linguistics.