The current bigram model was built from a corpus of just over 1 million tokens (words and punctuation). For the production version we will probably need a corpus roughly five times that size. The sparseness of the current model is illustrated by the prediction context "The j", for which both "Jewish" and "Jews" appear in the prediction list, an outcome we would not expect from a larger, more representative English corpus.
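The prediction mechanism described above, choosing completions from bigram counts and a typed prefix, can be sketched as follows. This is a minimal illustration, not the actual implementation; the function names and the toy corpus are invented for the example.

```python
from collections import Counter, defaultdict

def build_bigrams(tokens):
    """Count, for each word, how often each following word occurs."""
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev.lower()][nxt] += 1
    return bigrams

def predict(bigrams, prev_word, prefix, k=5):
    """Rank continuations of `prefix` seen after `prev_word`, most frequent first."""
    candidates = bigrams.get(prev_word.lower(), Counter())
    matches = [(w, c) for w, c in candidates.items()
               if w.lower().startswith(prefix.lower())]
    matches.sort(key=lambda wc: -wc[1])
    return [w for w, _ in matches[:k]]

# Toy corpus: with so little data, rare pairings dominate the
# prediction list -- the same sparseness problem noted above.
tokens = "the jury met and the judge spoke and the jury left".split()
model = build_bigrams(tokens)
print(predict(model, "the", "j"))  # -> ['jury', 'judge']
```

With a larger corpus, the counts would smooth out such artifacts, which is the motivation for expanding the training data.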