A bigram (pronounced "BY-gram") is a linguistic term for a sequence of two adjacent words.
The Bigram Search feature looks for a specific two-word phrase within a certain decade. If the phrase is not found, it will also look for the hyphenated and single-word versions of that phrase. (Think "tea cup" vs. "tea-cup" vs. "teacup".) This is because English from 1800-1923 tended to use more hyphenated phrases than 2010s English does. For instance, the hyphenated word "dinner-parties" appears in the 1810s corpus, but the bigram "dinner parties" does not.
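The fallback order described above can be sketched roughly as follows. This is a minimal illustration and not Wordsworth's actual code: the function name `find_phrase_variant` and the two set-based lookups (`corpus_bigrams`, `corpus_words`) are assumptions made for the example, and the toy data contains only the "dinner-parties" case mentioned above.

```python
def find_phrase_variant(first, second, corpus_bigrams, corpus_words):
    """Return the first variant of a two-word phrase found in one decade's
    data, or None if no variant is present.

    corpus_bigrams: set of (word, word) tuples seen in the decade (assumed shape)
    corpus_words:   set of single tokens seen in the decade (assumed shape)
    """
    first, second = first.lower(), second.lower()

    # 1. Try the plain two-word phrase first.
    if (first, second) in corpus_bigrams:
        return f"{first} {second}"

    # 2. Fall back to the hyphenated form, then the single-word form.
    for variant in (f"{first}-{second}", f"{first}{second}"):
        if variant in corpus_words:
            return variant

    return None


# Toy data standing in for the 1810s corpus (hypothetical structure).
bigrams_1810s = set()
words_1810s = {"dinner-parties"}

print(find_phrase_variant("dinner", "parties", bigrams_1810s, words_1810s))
```

Checking the two-word form first means that a hyphenated or single-word match is reported only when the exact phrase is missing, which matches the behavior described above.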
The Google Books Ngram Viewer is a Google feature that graphs word and phrase frequencies over time, based on the Google Books corpus. The Ngram charts can be embedded in other websites using iframes. Wordsworth's Google Ngram charts default to pulling from Google's "English Fiction" corpus.
This feature allows you to tell the system to ignore certain words in your passage before analyzing it. It is most useful in situations when you already know that certain words are not found in the comparison corpus and you don't need the system to tell you that again. For instance, Mary Robinette Kowal's fantasy novel Shades of Milk and Honey is heavily influenced by the works of Jane Austen, but it features a hero named "Mr. Vincent" and a magic system known as "glamour". The words "vincent" and "glamour" do not appear in our 1810s corpus, so Kowal might find it useful to tell the system to discount those words.
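As a rough sketch of the idea, the snippet below filters discounted words out of a list of words that are missing from a comparison vocabulary. It is an illustration under stated assumptions, not Wordsworth's implementation: the function name `unknown_words`, the simple regular-expression tokenizer, and the tiny vocabulary set are all hypothetical.

```python
import re


def unknown_words(passage, corpus_vocab, discount_words=()):
    """Return words from the passage that are absent from corpus_vocab,
    skipping any word the user has asked to discount.

    Words are lowercased, and each unknown word is reported once,
    in order of first appearance.
    """
    ignored = {w.lower() for w in discount_words}
    # Naive tokenizer: runs of letters, allowing internal hyphens/apostrophes.
    tokens = re.findall(r"[a-z]+(?:['-][a-z]+)*", passage.lower())

    result = []
    for token in tokens:
        if token in corpus_vocab or token in ignored or token in result:
            continue
        result.append(token)
    return result


# Hypothetical stand-in for the 1810s vocabulary.
vocab = {"mr", "wove", "the", "with", "care"}
passage = "Mr. Vincent wove the glamour with care."

print(unknown_words(passage, vocab))
print(unknown_words(passage, vocab, discount_words=["Vincent", "glamour"]))
```

With the discount list supplied, "vincent" and "glamour" no longer show up among the flagged words, so only genuinely unexpected vocabulary would remain.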
Most of Wordsworth's features are free to use without registering, but if you are frequently using the Discount Words feature, you may wish to create an account. With a free user account, you can maintain a list of words to ignore that Wordsworth will automatically leave out of search results. You will no longer need to input these Discount Words manually every time that you want to analyze a passage of your text.
You can view a list of all books in the corpus here.
All texts come from Project Gutenberg.
All of the books in our corpus are works of fiction written in English between the years of 1800 and 1923. Additionally, the corpus focuses on stories that had a realistic, contemporary setting at the time they were published -- although we've waived that rule in order to include some influential works of genre fiction.
Although Dracula is a fantasy/horror novel and the Sherlock Holmes stories are mystery fiction, they are recognizably set in late-Victorian England and offer much valuable information about language use in that era. They have also proved extremely influential in modern pop culture. Some of our users may be writing mystery or fantasy fiction themselves, and it therefore makes sense for our corpus to include some of the foundational works of the genre.
Ben-Hur is itself a work of historical fiction, taking place in Biblical times. Therefore, its language would likely skew our corpus results in a way that wouldn't be helpful to someone who's using Wordsworth to help them write a story that is set in the 1800s. This is also why we have avoided adding other popular historical novels like Walter Scott's Ivanhoe (written 1820, set in the 1100s), Edith Wharton's The Age of Innocence (written 1920, set in the 1870s), or Charles Dickens' A Tale of Two Cities (written 1859, set in the late 1700s).
As of 2019, works published in 1923 or earlier are in the public domain in the United States. Therefore, to avoid potential legal liability, we have avoided analyzing and storing word-frequency data from novels published in 1924 and beyond. As more 1920s novels enter the public domain in the years to come, we will try to add some of them to our corpus. If you feel that the current 95-year U.S. copyright term is ridiculously long, lobby your Congressional representatives.
We chose 1800 as our start date because that's when the Google Ngram data starts. (Google may have made this choice because English spelling was much less standardized prior to the publication of Samuel Johnson's dictionary in the UK in 1755 and Noah Webster's dictionary in the USA in 1806.) Additionally, realistic contemporary prose fiction—the mainstay of Wordsworth's corpus—wasn't really a thing in the 1500s and 1600s. However, since several users have requested a corpus for earlier centuries and there wouldn't be copyright issues with that, we will continue to think about whether and how to implement this.
Email us at marissa.wordsworth @ gmail.com and suggest it! If it fits the corpus parameters, we'll look into adding it in a subsequent release. However, we do not plan to develop a feature to allow users to update the corpus themselves.
Wordsworth is inspired by the work of Ben Schmidt, a digital-humanities professor who, in the early 2010s, wrote about trying to identify linguistic anachronisms in historical-fiction screenplays by comparing them to texts written in the period in which they were set. Schmidt's program and corpus are not publicly available, so we decided to try to create an approximation of them.
We can't resist a literary pun and we told ourselves that no one else would overthink this.
Email the app's creator at marissa.wordsworth @ gmail.com, or submit an Issue on Wordsworth's GitHub page.