regex: Why does this regular expression return more groups than it should?

Terrylee 2025-02-15 21:57:57

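I was working through a popular book on regular expressions and came across this regex, which is supposed to pick out the values from a line of comma-separated values.

It is meant to handle double quotes, with "" treated as an escaped double quote (the sequence "" is allowed inside another pair of double quotes).

Here is the Perl script I wrote for it: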

$str = "Ten Thousand,10000, 2710 ,,\"10,000\",\"It's \"\"10 Grand\"\", baby\",10K"; 
#$regex = qr"(?:^|,)(?:\"((?:[^\"]|\"\")+)\"|([^\",]+))*"; 
$regex = qr! 
        (?: ^|,) 
        (?:  
            " 
                ( (?: [^"] | "" )+ ) 
            " 
            | 
            ( [^",]+ ) 
        ) 
    !x; 
 
@matches = ($str =~ m#$regex#g); 
print "\nString : $str\n"; 
if (scalar(@matches) > 0 ) { 
    print "\nMatches\n"; 
    print "\nNumber of groups: ", scalar(@matches), "\n"; 
    for ($i=0; $i < scalar(@matches); $i++) { 
        print "\nGroup $i - |$matches[$i]|\n"; 
    } 
} 
else { 
    print "\nDoesnt match\n"; 
} 

This is the output I expected (and, as far as I can tell, what the author expected too):

String : Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K 
   Matches 
   Number of groups: 7 
   Group 0 - |Ten Thousand| 
   Group 1 - |10000| 
   Group 2 - | 2710 | 
   Group 3 - |10,000| 
   Group 4 - || 
   Group 5 - |It's ""10 Grand"", baby| 
   Group 6 - |10K| 

This is the output I actually get:

String : Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K 
   Matches 
   Number of groups: 12 
   Group 0 - || 
   Group 1 - |Ten Thousand| 
   Group 2 - || 
   Group 3 - |10000| 
   Group 4 - || 
   Group 5 - | 2710 | 
   Group 6 - |10,000| 
   Group 7 - || 
   Group 8 - |It's ""10 Grand"", baby| 
   Group 9 - || 
   Group 10 - || 
   Group 11 - |10K| 

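Can someone explain why the actual output contains empty groups (other than the one before 10,000, which is expected)? I copied the regex straight from the book, so am I doing something wrong?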

TIA

Answer:

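The regex has 2 capturing groups and several non-capturing groups. Because you applied it with the g modifier in list context, Perl keeps matching as many times as it can and returns every capturing group for every match. Here the pattern matches 6 times, and each match contributes both capture groups, for 12 elements in total. On any given match only one of the two alternatives takes part, so the other group is undef, which prints as an empty string. The empty field between the two consecutive commas is never matched at all, because both alternatives require at least one character.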
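One way to get a single value per match is to loop over the matches in scalar context and keep whichever group is defined. The following is a minimal sketch rather than the book's own code; the variable names `$str`, `$regex` and `@fields` are just illustrative, and the pattern is the one from the question.

```perl
use strict;
use warnings;

my $str = 'Ten Thousand,10000, 2710 ,,"10,000","It\'s ""10 Grand"", baby",10K';

my $regex = qr{
    (?: ^ | , )
    (?:
        " ( (?: [^"] | "" )+ ) "   # group 1: quoted field
        |
        ( [^",]+ )                 # group 2: unquoted field
    )
}x;

my @fields;
while ($str =~ /$regex/g) {
    # Only one of the two groups took part in this match; the other is undef.
    push @fields, defined($1) ? $1 : $2;
}

print "Number of fields: ", scalar(@fields), "\n";
print "|$_|\n" for @fields;
```

With the string from the question this yields 6 fields, not 7, because the empty field is still skipped by the pattern.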

The regular expression: 
 
(?-imsx:! 
        (?: ^|,) 
 
        (?: 
 
            " 
 
                ( (?: [^"] | "" )+ ) 
 
            " 
 
            | 
 
            ( [^",]+ ) 
        ) 
    !x) 
 
matches as follows: 
 
NODE                     EXPLANATION 
---------------------------------------------------------------------- 
(?-imsx:                 group, but do not capture (case-sensitive) 
                         (with ^ and $ matching normally) (with . not 
                         matching \n) (matching whitespace and # 
                         normally): 
---------------------------------------------------------------------- 
  !                        '!\n        ' 
---------------------------------------------------------------------- 
  (?:                      group, but do not capture: 
---------------------------------------------------------------------- 
                             ' ' 
---------------------------------------------------------------------- 
    ^                        the beginning of the string 
---------------------------------------------------------------------- 
   |                        OR 
---------------------------------------------------------------------- 
    ,                        ',' 
---------------------------------------------------------------------- 
  )                        end of grouping 
---------------------------------------------------------------------- 
                           '\n\n        ' 
---------------------------------------------------------------------- 
  (?:                      group, but do not capture: 
---------------------------------------------------------------------- 
                  "          '\n\n            "\n\n                ' 
---------------------------------------------------------------------- 
    (                        group and capture to \1: 
---------------------------------------------------------------------- 
                               ' ' 
---------------------------------------------------------------------- 
      (?:                      group, but do not capture (1 or more 
                               times (matching the most amount 
                               possible)): 
---------------------------------------------------------------------- 
                                 ' ' 
---------------------------------------------------------------------- 
        [^"]                     any character except: '"' 
---------------------------------------------------------------------- 
                                 ' ' 
---------------------------------------------------------------------- 
       |                        OR 
---------------------------------------------------------------------- 
         ""                      ' "" ' 
---------------------------------------------------------------------- 
      )+                       end of grouping 
---------------------------------------------------------------------- 
                               ' ' 
---------------------------------------------------------------------- 
    )                        end of \1 
---------------------------------------------------------------------- 
                  "          '\n\n            "\n\n            ' 
---------------------------------------------------------------------- 
   |                        OR 
---------------------------------------------------------------------- 
                             '\n\n            ' 
---------------------------------------------------------------------- 
    (                        group and capture to \2: 
---------------------------------------------------------------------- 
                               ' ' 
---------------------------------------------------------------------- 
      [^",]+                   any character except: '"', ',' (1 or 
                               more times (matching the most amount 
                               possible)) 
---------------------------------------------------------------------- 
                               ' ' 
---------------------------------------------------------------------- 
    )                        end of \2 
---------------------------------------------------------------------- 
                             '\n        ' 
---------------------------------------------------------------------- 
  )                        end of grouping 
---------------------------------------------------------------------- 
       !x                  '\n    !x' 
---------------------------------------------------------------------- 
)                        end of grouping 
---------------------------------------------------------------------- 
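Since each match fills only one of the two groups, another option is to make both alternatives share a single group number. The sketch below assumes Perl 5.10 or later, where the branch-reset construct `(?|...)` is available; it is a variation on the pattern from the question, not something taken from the book.

```perl
use strict;
use warnings;

my $str = 'Ten Thousand,10000, 2710 ,,"10,000","It\'s ""10 Grand"", baby",10K';

my $regex = qr{
    (?: ^ | , )
    (?|                            # branch reset: each alternative starts at group 1
        " ( (?: [^"] | "" )+ ) "   # quoted field, captured as group 1
        |
        ( [^",]+ )                 # unquoted field, also captured as group 1
    )
}x;

# The pattern now has a single capture group, so list-context /g
# returns one element per match instead of two.
my @matches = ($str =~ /$regex/g);

print "Number of groups: ", scalar(@matches), "\n";
for my $i (0 .. $#matches) {
    print "Group $i - |$matches[$i]|\n";
}
```

This removes the undef placeholders from the list, although the empty field between the consecutive commas is still not reported because both alternatives use `+`.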

TLP has already mentioned that you can also use the Text::CSV module. Here is an example of that.

#!/usr/bin/perl 
 
use strict; 
use warnings; 
use Text::CSV_XS; 
use Data::Dumper; 
 
my $csv = Text::CSV_XS->new({binary => 1, eol => $/, allow_whitespace => 1}); 
 
while (my $row = $csv->getline (*DATA)) { 
    print Dumper $row; 
} 
 
__DATA__ 
Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K; 

Output:

$VAR1 = [ 
          'Ten Thousand', 
          '10000', 
          '2710', 
          '', 
          '10,000', 
          'It\'s "10 Grand", baby', 
          '10K;' 
        ]; 
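For comparison, a regex-only approach can get close to this output on the sample line with a few changes. The sketch below is an assumption-laden variation, not the book's pattern and not a replacement for a real CSV parser: it uses `*` instead of `+` so empty fields match, anchors each match with `\G` so fields are consumed back to back, unescapes `""` in quoted fields, and trims whitespace around unquoted fields. The names `$line` and `@row` are illustrative.

```perl
use strict;
use warnings;

my $line = 'Ten Thousand,10000, 2710 ,,"10,000","It\'s ""10 Grand"", baby",10K';

my @row;
while ($line =~ /\G(?:^|,)(?:"((?:[^"]|"")*)"|([^",]*))/g) {
    # Copy the captures first: the substitutions below would reset them.
    my ($quoted, $plain) = ($1, $2);
    my $field;
    if (defined $quoted) {
        ($field = $quoted) =~ s/""/"/g;        # unescape doubled quotes
    } else {
        ($field = $plain) =~ s/^\s+|\s+$//g;   # trim surrounding whitespace
    }
    push @row, $field;
}

print scalar(@row), " fields\n";
print "|$_|\n" for @row;
```

On malformed input (for example an unbalanced quote) the loop simply stops at the first position it cannot match, which is one reason a dedicated module such as Text::CSV_XS is usually the safer choice.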

