What makes SBD difficult?

Breaking text into sentences is difficult for a number of reasons:

  • Punctuation is frequently ambiguous
  • Abbreviations often contain periods
  • Sentences may be embedded within each other by the use of quotes
  • With more specialized text, such as tweets and chat sessions, we may need to treat newlines or the completion of clauses as sentence boundaries

Punctuation ambiguity is best illustrated by the period. It frequently marks the end of a sentence. However, it also appears in a number of other contexts, including abbreviations, numbers, e-mail addresses, and ellipses. Other punctuation characters, such as question marks and exclamation marks, also appear inside embedded quotes and in specialized text, such as code, that may be in a document. ...
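
The ambiguity of the period can be seen with a deliberately naive splitter. The following is a minimal Python sketch, not a recommended approach: the function name naive_split and the sample text are made up for illustration. It treats every period, question mark, and exclamation mark as a sentence boundary, which is exactly the assumption that fails for abbreviations and numbers.

```python
import re

# Hypothetical sample text: two real sentences, four periods.
SAMPLE = "Mr. Smith paid 3.50 dollars. He left."


def naive_split(text):
    """Split on every '.', '?' or '!', ignoring context (illustrative only)."""
    pieces = re.split(r"[.?!]", text)
    # Drop empty fragments and surrounding whitespace.
    return [piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    for fragment in naive_split(SAMPLE):
        print(repr(fragment))
    # Prints four fragments instead of two sentences:
    # 'Mr'
    # 'Smith paid 3'
    # '50 dollars'
    # 'He left'
```

The abbreviation "Mr." and the number "3.50" each produce a false boundary, so the two sentences come back as four fragments. A usable approach has to look at the context around each punctuation character rather than at the character alone.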

