The current bigram model was built from a corpus of just over 1 million tokens (words and punctuation). For the production version we will probably need a corpus roughly five times that size. The sparseness of the current model is illustrated by the prediction context "The j", for which both "Jewish" and "Jews" appear in the prediction list, an outcome we would not expect from a larger, more representative English corpus.
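The prediction mechanism described above, choosing completions from bigram counts and a typed prefix, can be sketched as follows. This is a minimal illustration, not the actual implementation; the function names and the toy corpus are invented for the example.

```python
from collections import Counter, defaultdict

def build_bigrams(tokens):
    """Count, for each word, how often each following word occurs."""
    bigrams = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        bigrams[prev.lower()][nxt] += 1
    return bigrams

def predict(bigrams, prev_word, prefix, k=5):
    """Rank continuations of `prefix` seen after `prev_word`, most frequent first."""
    candidates = bigrams.get(prev_word.lower(), Counter())
    matches = [(w, c) for w, c in candidates.items()
               if w.lower().startswith(prefix.lower())]
    matches.sort(key=lambda wc: -wc[1])
    return [w for w, _ in matches[:k]]

# Toy corpus: with so little data, rare pairings dominate the
# prediction list -- the same sparseness problem noted above.
tokens = "the jury met and the judge spoke and the jury left".split()
model = build_bigrams(tokens)
print(predict(model, "the", "j"))  # -> ['jury', 'judge']
```

With a larger corpus, the counts would smooth out such artifacts, which is the motivation for expanding the training data.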