php - preg_split based on a Sentence

I have the following script to split up sentences. There are a few phrases that I would like to treat as the end of a sentence, in addition to punctuation. This works fine if it is a single character, but not when there is a space in it.

This is the code I have that works:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?:\#*]         # Either an end of sentence punct,
| [.!?:][\'"]       # or end of sentence punct and quote,
| [\r\t\n]          # or a carriage return, tab, or newline,
| HYPERLINK         # or one of these literal tokens.
| \.org
| \.gov
| \.aspx
| \.com
| Date
| Dear
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",    
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| U\.S\.A\.
| U\.S\.
| Sr\.              # or "Sr.",
| T\.V\.A\.         # or "T.V.A.",
| a\.m\.            # or "a.m.",
| p\.m\.            # or "p.m.",
| a€¢\.
| :\.

                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.

/ix';

This is an example phrase I have tried to add: "Total Gross Income"

I have tried formatting it in these ways, but none of them work:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?:\#*]         # Either an end of sentence punct,
| [.!?:][\'"]       # or end of sentence punct and quote,
| [\r\t\n]          # or a carriage return, tab, or newline,
| HYPERLINK         # or one of these literal tokens.
| \.org
| \.gov
| \.aspx
| \.com
| Date
| Dear  
| "Total Gross Income"
| Total[ X]Gross[ X]Income
| Total" "Gross" "Income
)  

For example, if I have the following code:

$block_o_text = "You could receive the wrong amount. If you receive more benefits than you    should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross Income Total ResourcesMedical ProgramsHousehold.";

$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);

for ($i = 0; $i < count($sentences); ++$i) {
    // Print each sentence with its index.
    echo $i . " - " . $sentences[$i] . "<BR>";
}

The results I get are:

77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income Total ResourcesMedical ProgramsHousehold 

What I want to get is :

77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income
83 - Total ResourcesMedical ProgramsHousehold 

What am I doing wrong?

Answer

Solution:

Your problem is with the whitespace pattern (\s+) that follows your lookbehinds: it requires at least one whitespace character in order to split. If you remove it, you end up capturing the preceding letter and breaking the whole thing.
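
As a rough illustration of that whitespace requirement, here is a minimal sketch. The shortened sample strings and variable names are made up for the demonstration. The phrase is placed in the lookbehind, and the split only happens when the phrase is actually followed by whitespace:

```php
<?php
// Split on whitespace that follows sentence punctuation or the phrase.
// No x flag here, so \s is used for the spaces inside the phrase.
$re = '/(?<=[.!?]|Total\sGross\sIncome)\s+/i';

// Hypothetical sample strings: one with a space after the phrase, one without.
$withSpace    = "benefits stop. Total Gross Income Total Resources";
$withoutSpace = "benefits stop. Total Gross IncomeTotal Resources";

$a = preg_split($re, $withSpace, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($re, $withoutSpace, -1, PREG_SPLIT_NO_EMPTY);

print_r($a); // three pieces: the phrase is split off on its own
print_r($b); // two pieces: no whitespace after "Income", so no split there
```

If your real text always has whitespace after the phrase, the lookbehind form may be enough. When the phrase runs straight into the next word, it is not.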

As far as I can tell, you can't do this entirely with lookarounds. You'll still need lookarounds for part of the expression (whitespace preceded by punctuation, etc.), but specific phrases have to be matched directly.

You can also use the PREG_SPLIT_DELIM_CAPTURE flag to capture what you're splitting on, so the phrase itself is kept in the results. Something like this should get you started:

$re = '/((?<=[\.\?\!])\s+|Total\sGross\sIncome)/ix';

$block_o_text = "You could receive the wrong amount. If you receive more benefits than you    should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross IncomeTotal ResourcesMedical ProgramsHousehold.";

$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

for ($i = 0; $i < count($sentences); ++$i) {
    if (!ctype_space($sentences[$i])) {
        echo $i . " - " . $sentences[$i] . "<br>";
    }
}

Output:

0 - You could receive the wrong amount.
2 - If you receive more benefits than you should, you must pay them back.
4 - When will we review your case?
6 - An eligibility review form will be sent before your benefits stop.
8 - Total Gross Income
9 - Total ResourcesMedical ProgramsHousehold.
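
If you have several phrases, the same idea could be wrapped in a small helper. This is only a sketch: the function name split_sentences and its parameters are hypothetical, and it assumes each phrase should come back as its own entry. It builds the alternation from a list, escapes each word with preg_quote, and drops the whitespace-only delimiters so the result is indexed 0, 1, 2, ...

```php
<?php
// Hypothetical helper (not part of the answer above): split on sentence
// punctuation and also treat each given phrase as a sentence of its own.
function split_sentences(string $text, array $phrases): array
{
    // Always split on whitespace that follows ., ? or !
    $alternatives = ['(?<=[.?!])\s+'];

    foreach ($phrases as $phrase) {
        // Escape each word so it is matched literally, and allow any run
        // of whitespace between the words of the phrase.
        $words = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
        $quoted = array_map(function (string $word): string {
            return preg_quote($word, '/');
        }, $words);
        if ($quoted) {
            $alternatives[] = implode('\s+', $quoted);
        }
    }

    $re = '/(' . implode('|', $alternatives) . ')/i';
    $parts = preg_split($re, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    // Remove the captured whitespace delimiters and reindex from 0.
    return array_values(array_filter($parts, function (string $part): bool {
        return !ctype_space($part);
    }));
}

$sentences = split_sentences(
    "Benefits stop. Total Gross IncomeTotal Resources.",
    ["Total Gross Income"]
);
print_r($sentences);
```

Calling it with more phrases, for example ["Total Gross Income", "Total Resources"], adds them to the alternation in the same way.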
