|
Three of the most powerful corpus architectures and interfaces for large online corpora are the Corpus Workbench approach used by Sketch Engine and BNCweb / CQPWeb, as well as the architecture that we use for the corpora from BYU. The table below summarizes some of their features. There are a few features in our corpus architecture that are not in the other two architectures, but the converse is undoubtedly true as well. The bottom line, however, is that any one of these three architectures should work well for large, heavily annotated online corpora.
We want to be fair to all three architectures, so if there is incorrect information for Sketch Engine or CQP/BNCweb, or if you are aware of another architecture that allows most of these features (at least basic queries, collocates, and limiting searches by a section of the corpus), please let us know.
|
| Feature | BYU (click)1 | Sketch Engine | CQP/BNCweb |
| Basic queries | | | |
| word | Y | Y | Y |
| phrase | Y | Y | Y |
| wildcard | Y | Y | Y |
| lemma | Y | Y | Y |
| part of speech | Y | Y | Y |
| combine any of above | Y | Y | Y |
| Visualization | | | |
| frequency of each matching string | Y | Y | Y |
| frequency of each matching string, in each of several sections | Y | N | N |
| overall frequency for all matching forms, in different sections | Y | N | N |
| Collocates | | | |
| basic collocates search | Y | Y | Y |
| sort by Mutual Information | Y | Y | Y |
| limit collocate by part of speech | Y | Y | Y |
| find specific collocate(s) near node word(s) | Y | Y | Y |
| Word comparisons | | | |
| basic (e.g. collocates of small vs. little, or men and women) | Y | Y | N |
| Integrated synonyms | | | |
| basic: search by synonyms | Y | N | N |
| advanced: include synonyms as part of another query | Y | N | N |
| see frequency of synonyms in different sections (e.g. by genre or over time) | Y | N | N |
| compare frequency of synonyms in different sections | Y | N | N |
| see all collocates for a much larger list of words (e.g. all synonyms of large) | Y | N | N |
| "synonym chains": explore web of related words (click on [S] in the entries) | Y | N | N |
| Customized / personalized lists | | | |
| create lists of words and re-use them as part of query syntax | Y | N | N |
| Limiting by sections of corpus | | | |
| basic (e.g. collocates of strong in academic journals) | Y | Y | Y |
| compare frequencies in different sections (e.g. ADJ in ACAD-Medicine vs ACAD) | Y | N | N |
| compare collocates in different sections (e.g. chair in spoken vs. academic) | Y | N | N |
|
Speed |
|
All three architectures are equally fast for single words, lemmas, and collocates: one second or less for most queries of this type. The difference appears with strings of words, as seen in the following table, which shows the time in seconds for searches in different versions of the 100 million word British National Corpus. For queries of this type, our architecture is typically 1.5-3.0 times as fast as Sketch Engine, and 5-15 times as fast as CQP/BNCweb.
|
|
| STRING | BYU (click) | Sketch Engine | CQP/BNCweb (note 8) |
| 1. the [adj] thing (e.g. the best thing) | 0.9 | 1.8 + 0.9 (note 5) | 10.6 + 3.1 |
| 2. [pron] had better [verb] [pron] (e.g. she had better warn him) | 1.1 | 7.5 + 0.8 (note 6) | 15.0 + 0.8 |
| 3. [conj] [pron] [be] like ,|' (and she was like ,) | 1.3 | 5.7 + 0.8 (note 7) | 19.0 + 0.8 |
|
Notes:
1. We have reduced the time by about 20% with SE and BNCweb to account for network latency.
2. With Sketch Engine and BNCweb, the query is divided into two parts (KWIC and then frequency of strings), whereas in our architecture one goes directly to the frequency of the strings.
3. Sketch Engine and BNCweb cache their queries, so if someone has done the same query in the last few hours, it will be faster than what is shown here.
4. There are just three sample queries here, although we have done many other similar queries and have obtained similar results.
5. CQL: [word = "the"] [tag = "AJ."] [word = "thing"]
6. CQL: [tag = "PN."] [tag = "PN."] [word = "had"] [word = "better"] [tag = "V.."] [tag = "PN."]
7. CQL: [tag="C.."] [tag="PN."] [word ="was|is|are|be|being|been|am"] [word="like"] [word=","] (lemma "be" doesn't work here in SE)
8. The CQL query in BNCweb is the same as in notes 5-7 above, with [tag=...] changed to [pos=...]
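The attribute rename described in note 8 can be sketched in a few lines of Python. This is only an illustration: the helper name `to_bncweb` is hypothetical, and it assumes that the attribute name is the only difference between the two query dialects for these queries, as the note states.

```python
import re

# Sketch Engine form of the first sample query, copied from note 5.
SKETCH_ENGINE_QUERY = '[word = "the"] [tag = "AJ."] [word = "thing"]'

def to_bncweb(cql: str) -> str:
    """Rename the `tag` attribute to `pos` in each token (hypothetical helper).

    Only an attribute name directly after an opening bracket is touched,
    so quoted values such as "AJ." are left exactly as written.
    """
    return re.sub(r'(\[\s*)tag(\s*=)', r'\1pos\2', cql)

print(to_bncweb(SKETCH_ENGINE_QUERY))
```

Tokens that use `word` rather than `tag` pass through unchanged, so the same function can be applied to all three sample queries.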
|
|
Scalability |
|
Although the queries above run 1.5-3.0 times as fast in our architecture as in Sketch Engine, our architecture scales even better to very large corpora. For example, query [3] above takes about 1.3 seconds in the 100 million word BYU-BNC, but only about 1.6 seconds for COCA, a corpus more than four times as large (note: if you clicked on #3 in the table above, the search in BYU-BNC will be faster than 1.3 seconds the second time around). At this rate, it would take about 2.0 seconds in our architecture for a 1.8-2.0 billion word corpus like the Oxford English Corpus (OEC). In the Sketch Engine version of the OEC, however, it takes about 46 seconds for query [3]. In other words, the Corpus Workbench version of Sketch Engine is about 20-25 times as slow as COCA (using our architecture), and this holds for many other queries that we have done with the two corpora as well.
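One way to read "at this rate" is that query time grows with the logarithm of corpus size. The sketch below fits that model to the two timings for query [3] and extrapolates to the size of the OEC. The logarithmic model is an assumption made for illustration, not a description of how the estimate above was obtained, and the COCA size of 400 million words is also assumed (the text says only "more than four times as large").

```python
import math

def fit_log_model(size_a, time_a, size_b, time_b):
    """Fit t = intercept + k * log2(size) through two (size, time) points."""
    k = (time_b - time_a) / math.log2(size_b / size_a)  # seconds per doubling
    intercept = time_a - k * math.log2(size_a)
    return intercept, k

def predict(size, intercept, k):
    """Predicted query time in seconds for a corpus of `size` words."""
    return intercept + k * math.log2(size)

BNC_WORDS = 100e6    # from the text
COCA_WORDS = 400e6   # assumption: "more than four times as large"
OEC_WORDS = 2.0e9    # upper end of the 1.8-2.0 billion range in the text

intercept, k = fit_log_model(BNC_WORDS, 1.3, COCA_WORDS, 1.6)
oec_estimate = predict(OEC_WORDS, intercept, k)
slowdown = 46 / oec_estimate  # 46 s is the Sketch Engine OEC timing

print(f"estimated time at OEC size: {oec_estimate:.2f} s")
print(f"Sketch Engine OEC time relative to estimate: {slowdown:.1f}x")
```

Under these assumptions the estimate comes out a little under 2 seconds, and the ratio to the 46-second Sketch Engine timing falls inside the 20-25 range given above. A linear model through the same two points would give a noticeably higher estimate.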
|
|