Implementation of :w in regexes and other regex questions

David Romano

Feb 13, 2006, 7:43:37 PM
to perl6-l...@perl.org
Hello everyone,
This is my first post to the actual mailing list and not to Google Groups
(yeah, took me a bit to figure out they're not the same). I have a few
questions about the rules in Perl 6, and hopefully I'm not repeating stuff
that's already been brought up before. (I searched through the archive a bit,
but didn't see anything.)

==Question 1==
macro rxmodinternal:<x> { ... } # define your own /:x() stuff/
macro rxmodexternal:<x> { ... } # define your own m:x()/stuff/
With this, I can make my own adverbs then? Like :without, or :skip, and
describe what each does? If so, then maybe the rest of my major questions have
a very simple answer: make it yourself. If that is the case, I'll try to
figure out how to do it with pugs, if possible.

==Question 2==
I finished reading E05 and A05, and I really like the idea of the :w modifier
being able to essentially skip over certain parts of the text. Right now A05
states:

> <?ws> can't decide what to do until it sees the data. It still does the
> right thing. If not, define your own <?ws> and :w will use that.

So is :w invoking a rule that just skips whatever it matches? What I'm
wondering about is how I can create a mechanism that acts like :w, but can be
combined for nested rules. For instance, say I'm trying to pull out dates from
(HTML) text:
...
Jan had a great birthday on <B>F e b</B> 5, 2<B>00</B>3.
Her older sister, May, turned 23 on <B>Ma r</B> 5, 19<b>98</b>
Their younger sister, June, will be going home on <B >Apr</B> 5,
2<B>006</B>
April is their mother, and she's buying a car on <B>Feb< / B > 7,
2<B>0</B>06
I don't know when Roger, their father, is going to buy his guitar.
...

The grammar becomes messy when I have to account for things that the rules
don't allow me to just easily skip:
grammar Date {
    rule tag_B_beg:w:i { \<B\> }
    rule tag_B_end:w:i { \<\/B\> }
    rule tag_B:w:i { <tag_B_beg>|<tag_B_end> }

    rule month_english:w:i {
          J<sp>*a<sp>*n | F<sp>*e<sp>*b | M<sp>*a<sp>*r
        | A<sp>*p<sp>*r | M<sp>*a<sp>*y | J<sp>*u<sp>*n<sp>*e
        | J<sp>*u<sp>*l<sp>*y | A<sp>*u<sp>*g | S<sp>*e<sp>*p
        | O<sp>*c<sp>*t | N<sp>*o<sp>*v | D<sp>*e<sp>*c
    }

    rule year:w:i { (\d<tag_B>?\d<tag_B>?\d<tag_B>?\d) }
    rule month:w:i { <after <tag_B_beg> > (<month_english>) <before <tag_B> > }

    rule day {
        <after <month> >
        ( <after <[1..2]> >? <[1..9]> | 3<[0..1]> ) <sp>+
        <before <year> >
    }

    rule date { <month> <day> <year> }
}

I don't want to just skip <B> tags wholly, because they do serve a purpose,
but only in a particular context. (Can <?ws> be changed back to a "default" if
changed to include html tags?) I was thinking about maybe using a closure
at the beginning of the rule (to change the string about to be processed) and
then a closure at the end of the rule (to change it back to its pre-processed
form) to make it work:
grammar Date {
    rule tag_B_beg:w:i { \<B\> }
    rule tag_B_end:w:i { \<\/B\> }

    rule month_english:w:i {
        { $/ ~~ s/<sp>// }
        [ Jan | Feb | Mar | Apr
        | May | June | July | Aug
        | Sep | Oct | Nov | Dec
        ]
        { $/ ~~ $/.pretext }
    }

    rule year:w:i {
        { $/ ~~ s/<tag_B_beg>|<tag_B_end>// }
        (\d{4})
        { $/ ~~ $/.pretext }
    }

    rule month:w:i { <after <tag_B_beg> > (<month_english>) <before <tag_B> > }

    rule day {
        <after <month> >
        ( <after <[1..2]> >? <[1..9]> | 3<[0..1]> ) <sp>+
        <before <year> >
    }

    rule date { <month> <day> <year> }
}

That's okay to do, right? It looks a lot cleaner to me, but I'm wondering if
there's a better way to skip a rule match in another rule (another adverb
like :skip, with :w being a built-in shorthand for :skip(<?ws>)). Or am I
making this too complex when it really isn't? Any pointers on how to do stuff
like this more simply?

==Question 3==
I'm also curious about exclusions. Right now, to do a general exclusion, I'm
thinking I would probably do something like:
rule text_no_date {
    {$/ !~ /<date>/ }
    ^ [.*] $
}

Would something like below be easier to decode for a human reader?
text:without(<date>) {
    ^ [.*] $
}

If that adverb were available, then I could have a rule that doesn't include
two other rules:
line:without(<date>&&<name>) {
    ^^ [.*] $$
}

The rule above would match a line with a <date> or <name>, but not a line with
both. Like I said before, I don't know if this is the best way to do stuff
like this, or if I'm thinking about these problems the wrong way, so *any*
help would be great.

Thanks,
David

Luke Palmer

Feb 14, 2006, 3:34:17 AM
to David Romano, perl6-l...@perl.org
On 2/14/06, David Romano <david....@gmail.com> wrote:
> ==Question 1==
> macro rxmodinternal:<x> { ... } # define your own /:x() stuff/
> macro rxmodexternal:<x> { ... } # define your own m:x()/stuff/
> With this, I can make my own adverbs then? Like :without, or :skip, and
> describe what each does?

Yes, although exactly how is completely unspecified.

Brackets serve as a kind of scoping for modifiers. We're also
considering that :ws take an argument telling it what to consider to
be whitespace. So you could do:

rule Month :w {
    [ :w(&my_ws) J a n ]   # not sure about the &
    # out here we still have the default :w
}


> ==Question 3==
> I'm also curious about exclusions. Right now, to do a general exclusion, I'm
> thinking I would probably do something like:
> rule text_no_date {
> {$/ !~ /<date>/ }
> ^ [.*] $
> }
>
> Would something like below be easier to decode for a human reader?
> text:without(<date>) {
> ^ [.*] $
> }

Well, if you could define exactly what it means, then perhaps. Does
that mean that date appears nowhere within the matched text, or that
it just doesn't appear at the beginning? In either case, you can make
these rules grammar-friendly by including your test at the end:

rule text_no_date {
    (.*)
    { $1 !~ /<date>/ }
}

Or if you just don't want one at the beginning:

rule text_no_date {
    <!before <date>> .*
}
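
To make the difference between the two readings concrete, here is a rough sketch using Python's re module rather than Perl 6 rules, since how closures behave inside rules is still unsettled. DATE is a hypothetical, much-simplified stand-in for the <date> rule, and both function names are made up for illustration.

```python
import re

# Hypothetical stand-in for the <date> rule: "Mon D, YYYY".
DATE = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2}, \d{4}"


def text_no_date_anywhere(text: str) -> bool:
    """Match everything first, then reject if a date occurs anywhere in it.

    This mirrors the "test at the end" form: capture (.*), then check the capture.
    """
    m = re.fullmatch(r"(.*)", text, re.DOTALL)
    return m is not None and re.search(DATE, m.group(1)) is None


def text_no_date_at_start(text: str) -> bool:
    """Reject only when a date begins at the starting position.

    This mirrors the negative-lookahead form: <!before <date>> .*
    """
    return re.match(rf"(?!{DATE}).*", text, re.DOTALL) is not None


if __name__ == "__main__":
    for sample in ("No dates here.",
                   "Her birthday is Feb 5, 2003.",
                   "Feb 5, 2003 was a Wednesday."):
        print(sample, text_no_date_anywhere(sample), text_no_date_at_start(sample))
```

The middle sample is where the two readings disagree: it contains a date, but not at the start.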

>
> If that adverb were available, then I could have a rule that doesn't include
> two other rules:
> line:without(<date>&&<name>) {
> ^^ [.*] $$
> }
>
>
> The rule above would match a line with a <date> or <name>, but not a line with
> both.

Huh. That kind of test really wants a closure. You can't use the
regex & because that requires that they match at the same place. You
can't use the logical &&, because <date> isn't an expression. Of
course, that is unless you include your own parsing rule, but that
isn't recommended.
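
As a sketch of what such a closure would have to compute, here is the test written out in Python (not Perl 6). DATE and NAME are hypothetical stand-ins for the <date> and <name> rules, with the names borrowed from the sample text; the function name is made up. It assumes :without(<date>&&<name>) means "reject only lines containing both", searching for each independently anywhere in the line.

```python
import re

# Hypothetical stand-ins for the <date> and <name> rules.
DATE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2}, \d{4}\b"
)
NAME = re.compile(r"\b(?:Jan|May|June|April|Roger)\b")


def line_without_both(line: str) -> bool:
    """Accept a line unless it contains both a date and a name.

    The two searches run independently over the whole line, which is
    exactly what the regex & cannot express, since & requires both
    sides to match at the same place.
    """
    has_date = DATE.search(line) is not None
    has_name = NAME.search(line) is not None
    return not (has_date and has_name)
```

Under this reading a line with neither a date nor a name is also accepted. If "a <date> or a <name>, but not both" is meant strictly, the last line would be `return has_date != has_name` instead. Note too that names like May and June overlap with month names, so a real <name> rule would need more context than this stand-in has.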

Luke

David Romano

Feb 14, 2006, 2:35:18 PM
to perl6-l...@perl.org
On 2/14/06, Luke Palmer <lrpa...@gmail.com> wrote:
> On 2/14/06, David Romano <david....@gmail.com> wrote:
> > I don't want to just skip <B> tags wholly, because they do serve a purpose,
> > but only in a particular context. (Can <?ws> be changed back to a "default" if
> > changed to include html tags?)
>
> Brackets serve as a kind of scoping for modifiers. We're also
> considering that :ws take an argument telling it what to consider to
> be whitespace. So you could do:
>
> rule Month :w {
> [ :w(&my_ws) J a n ] # not sure about the &
> # out here we still have the default :w
> }
Ahh, okay. So am I to understand that my_ws would just return a set of
individual characters or character sequences that would be considered
whitespace? Or would my_ws do something else?

> > Would something like below be easier to decode for a human reader?
> > text:without(<date>) {
> > ^ [.*] $
> > }
>
> Well, if you could define exactly what it means, then perhaps. Does
> that mean that date appears nowhere within the matched text, or that
> it just doesn't appear at the beginning. In either case, you can make
> these rules grammar friendly by including your test at the end:
>
> rule text_no_date {
> (.*)
> { $1 !~ /<date>/ }
> }

This is what I was thinking: nowhere within the matched text...hadn't
thought about the closure at the end.

> > If that adverb were available, then I could have a rule that doesn't include
> > two other rules:
> > line:without(<date>&&<name>) {
> > ^^ [.*] $$
> > }
> >
> >
> > The rule above would match a line with a <date> or <name>, but not a line with
> > both.
>
> Huh. That kind of test really wants a closure. You can't use the
> regex & because that requires that they match at the same place. You
> can't use the logical &&, because <date> isn't an expression. Of
> course, that is unless you include your own parsing rule, but that
> isn't recommended.

I see. I guess I initially thought that it would be nice to have
important information like exclusions at the beginning, rather than at
the end, of a rule. Thanks again for the explanations and pointers.

David

Patrick R. Michaud

Feb 14, 2006, 3:21:55 PM
to David Romano, perl6-l...@perl.org
On Tue, Feb 14, 2006 at 11:35:18AM -0800, David Romano wrote:
> On 2/14/06, Luke Palmer <lrpa...@gmail.com> wrote:
> > On 2/14/06, David Romano <david....@gmail.com> wrote:
> > > I don't want to just skip <B> tags wholly, because they do
> > > serve a purpose, but only in a particular context. (Can <?ws>
> > > be changed back to a "default" if
> > > changed to include html tags?)
> >
> > Brackets serve as a kind of scoping for modifiers. We're also
> > considering that :ws take an argument telling it what to consider to
> > be whitespace. So you could do:
> >
> > rule Month :w {
> > [ :w(&my_ws) J a n ] # not sure about the &
> > # out here we still have the default :w
> > }
> Ahh, okay. So am I to understand that my_ws would just return a set of
> individual characters or character sequences that would be considered
> whitespace? Or would my_ws do something else?

I would think that my_ws would be a rule of some sort:

rule my_ws { [ \s+ | \< /? b \> ]* }

Also, it wasn't noted in the previous post, but one can call the
"default" ws rule by referring to it explicitly, as in <Rule::ws>.
(Currently PGE has it as <PGE::Rule::ws>.) So, presumably one could do

rule Month :w(&my_ws) { J a n     # my_ws rule here
    [:w(&Rule::ws) . . . ]        # "default" ws rule here
    [:w(0) . . . ]                # no :w here
}

PGE doesn't yet implement rule arguments to the :w modifier, but
I bet we can add it without too much trouble. :-)
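
Until then, the intended effect can be approximated outside of PGE. The following is only an illustration in Python's re module, not how PGE would implement it: MY_WS is a rough counterpart of the my_ws rule above (it additionally tolerates spaces inside a tag, as in "< / B >"), and with_ws is a hypothetical helper that interleaves a whitespace pattern between the characters of a literal, the way `J a n` would behave under :w(&my_ws).

```python
import re

# Rough counterpart of my_ws: runs of whitespace or <b>/</b> tags.
# Assumption: spaces inside the tag are tolerated as well.
MY_WS = r"(?:\s+|<\s*/?\s*b\s*>)*"

# Stand-in for the "default" ws rule: optional plain whitespace only.
DEFAULT_WS = r"\s*"


def with_ws(literal: str, ws: str = MY_WS) -> re.Pattern:
    """Build a pattern matching `literal` with `ws` allowed between characters."""
    return re.compile(ws.join(re.escape(ch) for ch in literal), re.IGNORECASE)


if __name__ == "__main__":
    print(with_ws("Feb").search("on <B>F e b</B> 5"))
    print(with_ws("2003").search("5, 2<B>00</B>3."))
    print(with_ws("Feb", DEFAULT_WS).search("F<B>e</B>b"))
```

Passing the whitespace pattern as an argument is what lets one call site treat tags as skippable while another keeps the default, which is the scoping the bracketed :w(...) forms are meant to give.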

Pm
