Counting The Syllables In A Word With PHP

Many years and site revisions ago, I had a very basic haiku page on my site. People could submit their own haiku and bring up random haiku that had been submitted previously.

The only problem was that people either couldn’t count or had a different idea of the classic haiku format (three lines of five, seven and five syllables, respectively). Thinking that there was no practical way to count the syllables in a word via code, I never pursued it any further and just left the page as-is.

For no real reason beyond random inspiration, I revisited this idea and came up with a pretty decent function that will take a word and simply return the number of syllables.

Here’s what the code looks like:

function count_syllables($word) {

    $word = strtolower($word);

    // Regex patterns needed (longer alternatives are listed first)
    $triples = 'dn\'t|eau|iou|ouy|you|bl$';
    $doubles = 'ai|ae|ay|au|ea|ee|ei|eu|ey|ie|ii|io|oa|oe|oi|oo|ou|oy|ue|uy|ya|ye|yi|yo|yu';
    $singles = 'a|e|i|o|u|y';
    $vowels = '/(' . $triples . '|' . $doubles . '|' . $singles . ')/';
    $trailing_e = '/e$/';
    $trailing_s = '/s$/';

    // Cleaning up word endings: drop a trailing "s" first, then a trailing "e"
    $word = preg_replace($trailing_s, '', $word);
    $word = preg_replace($trailing_e, '', $word);

    // Count # of "vowels"
    preg_match_all($vowels, $word, $matches);

    $syl_count = count($matches[0]);
    return $syl_count;
}

It works based on the following assumptions:

  • The number of syllables a word has is equal to the number of “vowel sounds” in the word
  • A “vowel sound” can largely be defined as a run of one or more consecutive vowels, with a few exceptions
  • There are certain instances in which a “vowel sound” doesn’t indicate a new syllable

The letter groupings defined in $triples, $doubles and $singles (which get concatenated into the pattern in $vowels) encode these assumptions. To handle the third point, I remove a trailing “s” and then a trailing “e” from the word. That also strips the “e” from the two-syllable suffix “-able”, leaving “abl” with only one vowel, so I look for the pattern “bl$” and count it as a vowel sound to make up the difference.
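To see the cleanup step in isolation, here’s a minimal sketch. The helper name strip_endings() and the trimmed-down pattern (just “bl$” plus the single vowels) are made up for this illustration and are not part of the function above.

```php
<?php
// Hypothetical helper: the same two-step cleanup the function performs.
function strip_endings(string $word): string
{
    $word = preg_replace('/s$/', '', strtolower($word)); // trailing "s" first
    return preg_replace('/e$/', '', $word);              // then trailing "e"
}

// Trimmed-down pattern for illustration only: "bl$" plus the single vowels.
$pattern = '/(bl$|a|e|i|o|u|y)/';

foreach (['cake', 'cakes', 'table', 'tables'] as $word) {
    $stem  = strip_endings($word);
    $count = preg_match_all($pattern, $stem, $matches);
    echo "$word becomes $stem: $count\n";
}
// cake becomes cak: 1
// cakes becomes cak: 1
// table becomes tabl: 2
// tables becomes tabl: 2
```

Stripping the “s” before the “e” is what lets a plural like “tables” end up as the same stem as “table”.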

Also, to account for contractions, I’ve found that the string “n’t” preceded by the letter “d” typically should count as a vowel sound. The string “n’t” on its own doesn’t necessarily add one. This allows us to properly differentiate between contractions such as “can’t” (one syllable) and “couldn’t” (two).
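A quick way to check the contraction rule is to run a reduced pattern over a few words. This is only a sketch: the pattern below keeps just the contraction rule, one vowel pair and the single vowels, which is enough for these four words.

```php
<?php
// Reduced pattern for illustration: contraction rule, the pair "ou", single vowels.
$pattern = "/(dn't|ou|a|e|i|o|u|y)/";

foreach (["can't", "don't", "couldn't", "didn't"] as $word) {
    $count = preg_match_all($pattern, $word, $matches);
    echo "$word: $count\n";
}
// can't: 1
// don't: 1
// couldn't: 2
// didn't: 2
```

Note that the pattern only matches a straight apostrophe ('). Input containing a typographic apostrophe (’) would need to be normalized first for the rule to apply.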

For the most part, I’ve listed the common two-vowel pairs, with the notable exception of “ia”. This allows us to treat “i” and “a” as separate vowels in words like “pliable”, where they would otherwise be treated as a single vowel sound if the pair “ia” were added to the regex pattern.
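Here’s a small sketch of what leaving out “ia” buys us. Both patterns are cut down for illustration, and $stem is “pliable” after the trailing “e” has been removed.

```php
<?php
$stem = 'pliabl'; // "pliable" with its trailing "e" stripped

// Cut-down patterns for illustration only.
$without_ia = '/(bl$|a|e|i|o|u|y)/';
$with_ia    = '/(bl$|ia|a|e|i|o|u|y)/';

echo preg_match_all($without_ia, $stem, $m), "\n"; // 3: "i", "a", "bl"
echo preg_match_all($with_ia, $stem, $m), "\n";    // 2: "ia", "bl"
```

Without the pair, the count matches the three syllables of pli-a-ble. With it, one syllable goes missing.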

I’m sure there are all sorts of additional edge cases that I’m missing, and any non-English word has a good chance of not abiding by these rules. The good thing is that if there are any glaring holes, you can add new vowel sounds to the patterns above. The regex alternation “short circuits” on a successful match: at each position the alternatives are tried from left to right, the first one that matches wins, and the search resumes at the character after the match, starting again from the first alternative. So be sure to add new patterns at an appropriate spot in the list. This is also why the “larger” patterns should come first.
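The effect of the ordering is easy to demonstrate. In this sketch both patterns contain exactly the same alternatives (a handful picked for illustration) and differ only in the order they are listed.

```php
<?php
$word = 'beau';

// Same alternatives, different order.
$longest_first = '/(eau|ea|au|a|e|u)/';
$singles_first = '/(a|e|u|ea|au|eau)/';

echo preg_match_all($longest_first, $word, $m), "\n"; // 1: "eau"
echo preg_match_all($singles_first, $word, $m), "\n"; // 3: "e", "a", "u"
```

With the single vowels listed first, “e” wins at the first vowel position and the longer “eau” alternative never gets a chance to match.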

All in all, the function is fairly tight and small for what it does. With the minor caveat that it may return the wrong count for weird edge cases, it should provide sufficiently accurate and efficient results for most casual use.
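As a rough sketch of how this could have been applied to the haiku page, the function can be summed over the words of each line. The helpers count_line_syllables() and is_haiku() are hypothetical names, and splitting on whitespace and discarding punctuation are assumptions of this example. The function is repeated so the snippet runs on its own.

```php
<?php
// Copy of the function above, so this sketch is self-contained.
function count_syllables(string $word): int
{
    $word = strtolower($word);
    $triples = 'dn\'t|eau|iou|ouy|you|bl$';
    $doubles = 'ai|ae|ay|au|ea|ee|ei|eu|ey|ie|ii|io|oa|oe|oi|oo|ou|oy|ue|uy|ya|ye|yi|yo|yu';
    $singles = 'a|e|i|o|u|y';
    $vowels  = '/(' . $triples . '|' . $doubles . '|' . $singles . ')/';

    $word = preg_replace('/s$/', '', $word);
    $word = preg_replace('/e$/', '', $word);

    return preg_match_all($vowels, $word, $matches);
}

// Hypothetical helper: syllables in a line = sum over its whitespace-separated words.
function count_line_syllables(string $line): int
{
    $total = 0;
    foreach (preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // Keep only letters and straight apostrophes so punctuation is ignored.
        $total += count_syllables(preg_replace("/[^a-z']/", '', strtolower($word)));
    }
    return $total;
}

// Hypothetical helper: true when the lines follow the 5-7-5 pattern.
function is_haiku(array $lines): bool
{
    return array_map('count_line_syllables', $lines) === [5, 7, 5];
}

var_dump(is_haiku(['An old silent pond', 'A frog jumps into water', 'Splash! Silent again'])); // bool(true)
var_dump(is_haiku(['Roses are red', 'Violets are blue']));                                     // bool(false)
```

The check is only as good as the per-word count. For example, the function as written returns 0 for “the”, because the trailing “e” is stripped before any vowels are counted, so lines containing such words will come up short.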

Oct 11th, 2009 | Posted in Nerd, Programming
  1. Feb 25th, 2011 at 00:02 | #1

    this code is bad calculations…?

  2. Dec 6th, 2012 at 23:49 | #2

    Brilliant! well done
