corpus.byu.edu

corpora, size, queries = better resources, more insight



Three of the most powerful corpus architectures and interfaces for large online corpora are Sketch Engine and BNCweb / CQPweb (both of which use the Corpus Workbench approach) and the architecture that we use for the corpora from BYU. The table below summarizes some of their features.

There are a few features in our corpus architecture that are not in the other two architectures, but the converse is also undoubtedly true. The bottom line, however, is that any one of these three architectures should work fine for large, heavily-annotated online corpora.

We want to be fair to all three architectures, so if there is incorrect information for Sketch Engine or CQP/BNCweb, or if you are aware of another architecture that allows most of these features (at least basic queries, collocates, and limiting searches by a section of the corpus), please let us know.
 

Feature                                                             BYU (click)¹  Sketch Engine   CQP/BNCweb

Basic queries
   word                                                             Y             Y               Y
   phrase                                                           Y             Y               Y
   wildcard                                                         Y             Y               Y
   lemma                                                            Y             Y               Y
   part of speech                                                   Y             Y               Y
   combine any of above                                             Y             Y               Y

Visualization
   frequency of each matching string                                Y             Y               Y
   frequency of each matching string, in each of several sections   Y             N               N
   overall frequency for all matching forms, in different sections  Y             N               N

Collocates
   basic collocates search                                          Y             Y               Y
   sort by Mutual Information                                       Y             Y               Y
   limit collocate by part of speech                                Y             Y               Y
   find specific collocate(s) near node word(s)                     Y             Y               Y

Feature                                                                             BYU (click)   Sketch Engine   CQP/BNCweb

Word comparisons
   basic (e.g. collocates of small vs. little, or men and women)                    Y             Y               N

Integrated synonyms
   basic: search by synonyms                                                        Y             N               N
   advanced: include synonyms as part of another query                              Y             N               N
   see frequency of synonyms in different sections (e.g. by genre or over time)     Y             N               N
   compare frequency of synonyms in different sections                              Y             N               N
   see all collocates for a much larger list of words (e.g. all synonyms of large)  Y             N               N
   "synonym chains": explore web of related words (click on [S] in the entries)     Y             N               N

Customized / personalized lists
   create lists of words and re-use them as part of query syntax                    Y             N               N

Limiting by sections of corpus
   basic (e.g. collocates of strong in academic journals)                           Y             Y               Y
   compare frequencies in different sections (e.g. ADJ in ACAD-Medicine vs ACAD)    Y             N               N
   compare collocates in different sections (e.g. chair in spoken vs. academic)     Y             N               N

Speed


All three architectures are equally fast for single words, lemmas, and collocates: one second or less for most queries of this type. The difference appears with strings of words, as seen in the following table, which shows the time in seconds for searches in different versions of the 100 million word British National Corpus. For queries of this type, our architecture is typically 1.5-3.0 times as fast as Sketch Engine, and 5-15 times as fast as CQP/BNCweb.
 

String                                                                BYU (click)   Sketch Engine        CQP/BNCweb (note 8)

1.   the [adj] thing (e.g. the best thing)                            0.9           1.8 + 0.9 (note 5)   10.6 + 3.1
2.   [pron] had better [verb] [pron] (e.g. she had better warn him)   1.1           7.5 + 0.8 (note 6)   15.0 + 0.8
3.   [conj] [pron] [be] like ,|' (e.g. and she was like ,)            1.3           5.7 + 0.8 (note 7)   19.0 + 0.8
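
As a rough cross-check, the timings in the table can be totalled and compared in a few lines of Python. This is only a sketch: the names `TIMINGS` and `slowdown` are hypothetical, and adding the two parts of each Sketch Engine and CQP/BNCweb timing into a single total is an assumption about how the figures are meant to be compared. With only three sample queries, the resulting ratios need not match the typical ranges quoted above.

```python
# Timings in seconds, copied from the table above.
# Each entry is (BYU, Sketch Engine parts, CQP/BNCweb parts).
# Assumption: the two parts are the KWIC step and the frequency step.
TIMINGS = {
    "the [adj] thing": (0.9, (1.8, 0.9), (10.6, 3.1)),
    "[pron] had better [verb] [pron]": (1.1, (7.5, 0.8), (15.0, 0.8)),
    "[conj] [pron] [be] like ,|'": (1.3, (5.7, 0.8), (19.0, 0.8)),
}


def slowdown(byu_seconds, parts):
    """Return how many times longer a two-part query takes than BYU.

    Assumption: the two parts are simply added to get a total time.
    """
    return sum(parts) / byu_seconds


for query, (byu, sketch_engine, cqp) in TIMINGS.items():
    print(
        f"{query}: Sketch Engine x{slowdown(byu, sketch_engine):.1f}, "
        f"CQP/BNCweb x{slowdown(byu, cqp):.1f}"
    )
```

Comparing only the first part of each timing instead would give smaller ratios, so the choice of total matters when reading the table.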


Notes:
1. We have reduced the times for Sketch Engine and BNCweb by about 20% to account for network latency.
2. With Sketch Engine and BNCweb, the query is divided into two parts (KWIC and then frequency of strings), whereas in our architecture one goes directly to the frequency of the strings.
3. Sketch Engine and BNCweb cache their queries, so if someone has done the same query in the last few hours, it will be faster than what is shown here.
4. There are just three sample queries here, although we have done many other similar queries and have obtained similar results.
5. CQL: [word = "the"] [tag = "AJ."] [word = "thing"]
6. CQL: [tag = "PN."] [tag = "PN."] [word = "had"] [word = "better"] [tag = "V.."] [tag = "PN."]
7. CQL: [tag="C.."] [tag="PN."] [word ="was|is|are|be|being|been|am"] [word="like"] [word=","] (lemma "be" doesn't work here in SE)
8. The CQL queries in BNCweb are the same as those in notes 5-7 above, with [tag=...] changed to [pos=...]
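
Note 8 describes a mechanical rewrite from the Sketch Engine form of a query to the BNCweb form. A minimal Python sketch of that rewrite follows. The function name `to_bncweb` is hypothetical, and the sketch assumes that renaming the `tag` attribute to `pos` is the only change needed, as the note states.

```python
import re

# Matches the attribute name "tag" at the start of a token, e.g. [tag = "AJ."],
# keeping the opening bracket, the spacing and the "=" in capture groups.
_TAG_ATTRIBUTE = re.compile(r'(\[\s*)tag(\s*=)')


def to_bncweb(cql):
    """Rename the `tag` attribute to `pos` in a CQL query string."""
    return _TAG_ATTRIBUTE.sub(r'\1pos\2', cql)


sketch_engine_query = '[word = "the"] [tag = "AJ."] [word = "thing"]'
print(to_bncweb(sketch_engine_query))
```

Because the pattern is tied to the opening bracket, a quoted word that happens to be "tag" is left untouched.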
 

Scalability


Our architecture is 1.5-3.0 times as fast as Sketch Engine for the queries above, and the advantage grows with very large corpora. For example, query [3] above takes about 1.3 seconds in the 100 million word BYU-BNC, but only about 1.6 seconds in COCA, a corpus more than four times as large (note: if you clicked on #3 in the table above, the search in BYU-BNC will be faster than 1.3 seconds the second time around). At this rate, it would take about 2.0 seconds in our architecture for a 1.8-2.0 billion word corpus like the Oxford English Corpus (OEC). In the Sketch Engine version of the OEC, however, query [3] takes about 46 seconds. In other words, the Corpus Workbench version of Sketch Engine is about 20-25 times as slow as our architecture (the one used for COCA), and this holds for many other queries that we have done with the two corpora as well.
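
The projection to an OEC-sized corpus can be reproduced approximately with a short calculation. The Python sketch below assumes that query time grows by a fixed amount each time the corpus size is multiplied by a constant factor (a logarithmic model). This model, the figure of 400 million words used for COCA ("more than four times as large"), and the function name `projected_seconds` are assumptions for illustration, not a description of how the original estimate was made.

```python
import math


def projected_seconds(size_a, secs_a, size_b, secs_b, target_size):
    """Extrapolate query time to target_size from two measurements.

    Assumption: time increases linearly with the logarithm of corpus size.
    Sizes are in millions of words.
    """
    secs_per_log_unit = (secs_b - secs_a) / math.log(size_b / size_a)
    return secs_a + secs_per_log_unit * math.log(target_size / size_a)


# Query [3]: about 1.3 s in the 100 million word BYU-BNC and about 1.6 s in
# COCA (taken here as 400 million words, which is an assumption).
for oec_size in (1800, 2000):
    estimate = projected_seconds(100, 1.3, 400, 1.6, oec_size)
    ratio = 46 / estimate  # 46 s is the Sketch Engine time reported for the OEC
    print(f"{oec_size}M words: about {estimate:.1f} s, ratio about {ratio:.0f}")
```

Under these assumptions the estimate comes out a little under 2 seconds, in line with the figure of about 2.0 seconds given in the text, and the ratio to 46 seconds falls inside the 20-25 range.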