some newbie questions about synopsis 5

H. Stelling

Feb 15, 2006, 4:09:05 AM

to perl6-l...@perl.org

Hello,

I've stumbled upon Perl6 a couple of weeks ago and I'm really looking forward
to seeing the finished product. Currently, I'm trying to implement a perl-like
rules module for Python, and I've got some questions which I think aren't
covered in the Synopsis or anywhere else I looked, mostly concerning captures
and aliases:

- Capture numbering:

/(a) [ (b) (c) (d) | (e) & (f) ] (g)/

capture.t suggests something like

 $0    $1  $2  $3    $1    $2    $4

but I'm only guessing about the "&" bit.

In the following,

/ (a) [ (b) (c) | $5 := (d) $0 := (e) ] (f) /

does the first alias have any effect on where the f's will go
(probably not)?

- Which rules apply to repeated captures with the same alias? For example,
the second array aliasing example

m:w/ Mr?s? @<names> := <ident> W\. @<names> := <ident>
| Mr?s? @<names> := <ident>
/;

seems to suggest that by using $<names>, the lower branch would have
resulted in a single Match object instead of an array (like the array we
would have gotten if we hadn't used the aliases in the first place). Is
this right? And could the same effect have been achieved by something
like

/ $<names> := <indent>**{1} / ?

- More array aliasing:

is / mv @<files> := [...]* /
just (slightly) shorter for / mv [$<files> := [...]]* / ?

Likewise, could / @<pairs> := ( (\w+) \: (\N+) )+ /
have also been written / [ $<pairs> := (\w+) \: $<pairs> := (\N+) ]+ / ?

- Array and hash aliasing of quantified subpatterns or subrules: what happens
to the named captures?

/ @<foo> := ( ... $bar := (...) ... )* /

And if the subpattern or subrule ends with an alternation, can the number of
array elements to be appended (or hashed) vary depending on which branch is
taken?

- Which of the following constructs could possibly be ok (I hope, none)?

/ $<foo> := ... & $<foo> := ... /
/ $<foo> := ... %<foo> := ... /
/ $<foo> := ... | %<foo> := ... /
/ $<foo> := $<foo> := ... /

- Do aliases bind right-to-left, as do assignments?

/ $2 := $5 := ... / # next should be $3, not $6

- Which kind of escape sequences are allowed (or required) in enumerated
character classes?

Thanks in advance for any answers!

Patrick R. Michaud

Feb 15, 2006, 8:39:23 AM

to H. Stelling, perl6-l...@perl.org

On Wed, Feb 15, 2006 at 10:09:05AM +0100, H. Stelling wrote:
> - Capture numbering:
>
> /(a) [ (b) (c) (d) | (e) & (f) ] (g)/
>
> capture.t suggests something like
>
>  $0    $1  $2  $3    $1    $2    $4
>
> but I'm only guessing about the "&" bit.

Yes.

> In the following,
>
> / (a) [ (b) (c) | $5 := (d) $0 := (e) ] (f) /
>
> does the first alias have any effect on where the f's will go
> (probably not)?

I'll defer to @Larry on this one, but my initial impression is
that the (f) capture would go into $6.

> - Which rules apply to repeated captures with the same alias? For example,
> the second array aliasing example
>
> m:w/ Mr?s? @<names> := <ident> W\. @<names> := <ident>
> | Mr?s? @<names> := <ident>
> /;
>
> seems to suggest that by using $<names>, the lower branch would have
> resulted in a single Match object instead of an array (like the array we
> would have gotten if we hadn't used the aliases in the first place). Is
> this right?

Yes, that's correct.

> And could the same effect have been achieved by something
> like
>
> / $<names> := <indent>**{1} / ?

Yes, a quantified capturing subrule or subpattern results in an
array of Match objects (even if the quantification is "1").

> - More array aliasing:
>
> is / mv @<files> := [...]* /
> just (slightly) shorter for / mv [$<files> := [...]]* / ?

I think so.

> Likewise, could / @<pairs> := ( (\w+) \: (\N+) )+ /
> have also been written / [ $<pairs> := (\w+) \: $<pairs> := (\N+) ]+ / ?

Seems like it would work.

> - Array and hash aliasing of quantified subpatterns or subrules: what happens
> to the named captures?
>
> / @<foo> := ( ... $bar := (...) ... )* /

Presuming you meant $<bar> there instead of $bar, I have no idea
what would happen. (With $bar it's an external alias and would
capture an array of matches into the scope in which the rule was
declared.)

> And if the subpattern or subrule ends with an alternation, can the number of
> array elements to be appended (or hashed) vary depending on which branch is
> taken?

Again I have to refer this to @Larry, but my initial impression is
"yes, it would vary".

> - Which of the following constructs could possibly be ok (I hope, none)?
>
> / $<foo> := ... & $<foo> := ... /

I think this one is okay. $<foo> is an array of Match objects, and
each Match is likely repeated within the array.

> / $<foo> := ... %<foo> := ... /

I hope this is not okay. It's certainly not going to be okay anytime
soon in the PGE implementation of Perl 6 rules. :-)

> / $<foo> := ... | %<foo> := ... /

Since the two aliases are in separate alternation branches, I think
this is okay. The argument would be similar to

/ $<foo> := ... | @<foo> := .../

in which $<foo> is either a single Match object or an array of
Match objects depending on the branch matched.

> / $<foo> := $<foo> := ... /

While my instinctual reaction is to say that this ought to be okay,
upon thinking about it a bit more I think I'd prefer to say that
it's not. At least initially, if nothing else. In particular, I
wonder about something like

/ @<foo> := $<bar> := [...]+ /

If we say that an alias always requires a subpattern or subrule
(and not another alias), then we avoid a lot of ambiguity, and the
above could be written as

/ @<foo> := [ $<bar> := [...]+ ] /
/ @<foo> := [ $<bar> := [...] ]+ /

depending on what is desired.

> - Do aliases bind right-to-left, as do assignments?
> / $2 := $5 := ... / # next should be $3, not $6

Assuming we allow chained aliases such as this (see above note),
I'd still argue for $6 instead of $3.

> - Which kind of escape sequences are allowed (or required) in enumerated
> character classes?

AFAIK, this hasn't been completely decided or specified yet.

H. Stelling

Feb 17, 2006, 8:33:12 AM

to Patrick R. Michaud, perl6-l...@perl.org

Patrick R. Michaud wrote:

>>In the following,
>>
>>/ (a) [ (b) (c) | $5 := (d) $0 := (e) ] (f) /
>>
>>does the first alias have any effect on where the f's will go
>>(probably not)?
>>
>>
>
>I'll defer to @Larry on this one, but my initial impression is
>that the (f) capture would go into $6.

I think that sequences should behave exactly as single branch
alternations (only that there is no such thing, although we
can write "[foo|<fail>]"). So I would rather opt for $1.

>>- Which rules apply to repeated captures with the same alias? For example,
>>the second array aliasing example
>>
>>m:w/ Mr?s? @<names> := <ident> W\. @<names> := <ident>
>> | Mr?s? @<names> := <ident>
>> /;
>>
>>seems to suggest that by using $<names>, the lower branch would have
>>resulted in a single Match object instead of an array (like the array we
>>would have gotten if we hadn't used the aliases in the first place). Is
>>this right?
>>
>>
>
>Yes, that's correct.

But wouldn't it be nice if the same rules applied to aliases and
subrule invocations, that is, recursion put aside, to think of

/ <foo> /

simply as a shorter way to say

/ $<foo> := ([definition of foo]:) /?

And I've got two more somewhat related questions:

The synopsis says:

* If a subrule appears two (or more) times in the same lexical scope
(i.e. twice within the same subpattern and alternation), or if the
subrule is quantified anywhere within the entire rule, then its
corresponding hash entry is always assigned a reference to an array
of Match objects, rather than a single Match object.

Maybe you're not the right person to ask, but is there a particular
reason for the "entire rule" bit?

/ (<foo>|None) <foo> (<foo>) /

Here we get three Matches $0<foo> (possibly undefined), $<foo>, and
$1<foo>. At least, I think so.

/ (<foo>?) <foo> (<foo>) /

Now, we suddenly get three more or less unrelated arrays with lengths
0..1, 1, and 1. Of course, I admit this example is a bit artificial.

Furthermore, I think "within the same subpattern and alternation" is
not quite correct, at least it wouldn't apply to something like

/ (<foo> [ <foo> | ... ]) /

unless we consider the (...) sequence as a kind of single branch
alternation. And why are alternation branches considered to be
lexical scopes, anyway? Just because of subpattern numbering?

My second question is why adding a "?" or "??" to an unquantified
subrule which would otherwise result in a single Match object should
result in an array, rather than a single (possibly undefined) Match.
That is, why doesn't "<foo>?" rather behave like "[<foo>|<null>]"?
This would save us the trouble of creating all these tiny arrays, or
having to write "[...|<null>]" all the time. Or maybe one could
define one's own quantifiers?

Patrick R. Michaud

Feb 17, 2006, 9:32:18 AM

to H. Stelling, perl6-l...@perl.org

On Fri, Feb 17, 2006 at 02:33:12PM +0100, H. Stelling wrote:
> Patrick R. Michaud wrote:
> >>In the following,
> >>
> >>/ (a) [ (b) (c) | $5 := (d) $0 := (e) ] (f) /
> >>
> >>does the first alias have any effect on where the f's will go
> >>(probably not)?
> >
> >I'll defer to @Larry on this one, but my initial impression is
> >that the (f) capture would go into $6.
>
> I think that sequences should behave exactly as single branch
> alternations (only that there is no such thing, although we
> can write "[foo|<fail>]"). So I would rather opt for $1.

The current implementation is that a capturing subpattern
is indexed based on the largest index in all of the alternation
branches. I'm not sure it makes sense to base it on aliases of
the last alternation branch.

Here are some examples we can chew on:

/ (a) [ (b) (c) | (d) ] (f) / # (f) is $3 or $2? (currently $3)

/ (a) [ (b) (c) | $1 := (d) ] (f) / # (f) is $3 or $2?

Since the second example is essentially saying the same as the first,
the (f) capture ought to go to the same place in each case. If we
say that the existence of the $1 causes the (f) to go into $2, it
also becomes the case that $2 is an array of match objects, which
isn't technically problematic but it might be a bit surprising for
many.

Some other examples to consider:

/ (a) [ (b) (c) | $0 := (d) ] (f) / # (f) is $3 or $1?

/ (a) [ (b) (c) | $0 := (d) (3) ] (f) / # (f) is $3 or $2?

At any rate, I find that having a subpattern capture base its
index on the highest index of all of the previous alternation
branches is easy to understand and works well in practice. It can
also be easily changed with another alias if needed.
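
For anyone following along in Python, the numbering rule described here can be sketched roughly as below. The node classes and the `number` function are hypothetical names made up for illustration; this is not PGE's code. The sketch assumes that every alternation branch starts counting from the same index, that the capture following an alternation takes the highest index used in any branch plus one, and that a numbered alias counts as a "used" index. That last point only follows the initial impression given earlier in this thread and has not been confirmed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Cap:
    """An unaliased capturing subpattern such as (a); its contents are ignored."""
    label: str


@dataclass
class Alias:
    """A numbered alias such as $5 := (d)."""
    index: int
    label: str


@dataclass
class Alt:
    """A non-capturing alternation [ branch | branch ]; each branch is a sequence."""
    branches: List[list]


Node = Union[Cap, Alias, Alt]


def number(seq: List[Node], start: int = 0,
           out: Dict[str, int] = None) -> Tuple[int, int, Dict[str, int]]:
    """Return (next index, highest index used, label -> index) for a sequence."""
    if out is None:
        out = {}
    nxt, hi = start, start - 1
    for node in seq:
        if isinstance(node, Cap):
            out[node.label] = nxt
            hi = max(hi, nxt)
            nxt += 1
        elif isinstance(node, Alias):
            out[node.label] = node.index
            hi = max(hi, node.index)
            nxt = node.index + 1  # assumed: numbering continues after the alias
        else:
            # Every branch restarts at the same index; afterwards, continue
            # past the highest index that any branch used.
            top = max(number(branch, nxt, out)[1] for branch in node.branches)
            hi = max(hi, top)
            nxt = max(nxt, top + 1)
    return nxt, hi, out


# / (a) [ (b) (c) | (d) ] (f) /
plain = [Cap("a"), Alt([[Cap("b"), Cap("c")], [Cap("d")]]), Cap("f")]
# / (a) [ (b) (c) | $5 := (d) $0 := (e) ] (f) /
aliased = [Cap("a"),
           Alt([[Cap("b"), Cap("c")], [Alias(5, "d"), Alias(0, "e")]]),
           Cap("f")]

print(number(plain)[2])
print(number(aliased)[2])
```

Under these assumptions the first pattern puts (f) in $3 and the second puts it in $6. Basing the result on the last branch instead would only require changing how `top` is computed.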

> But wouldn't it be nice if the same rules applied to aliases and
> subrule invocations, that is, recursion put aside, to think of
>
> / <foo> /
>
> simply as a shorter way to say
>
> / $<foo> := ([definition of foo]:) /?

First, is that colon following "[definition of foo]" intentional or
a typo? Currently we can backtrack into subrules -- there's no "cut"
assumed after them.

But secondly, I'm not sure we can casually toss recursion
aside when thinking about this, since it's really a driving force
behind having named subrules. :-) There's also a difference in
that subrules can take arguments, as in <foo('args')>, or can come
from another grammar, as in <Rule::foo>, which seems to argue that
<foo> is really something other than an alias shorthand.

> The synopsis says:
>
> * If a subrule appears two (or more) times in the same lexical scope
> (i.e. twice within the same subpattern and alternation), or if the
> subrule is quantified anywhere within the entire rule, then its
> corresponding hash entry is always assigned a reference to an array
> of Match objects, rather than a single Match object.
>
> Maybe you're not the right person to ask, but is there a particular
> reason for the "entire rule" bit?
>
> / (<foo>|None) <foo> (<foo>) /
>
> Here we get three Matches $0<foo> (possibly undefined), $<foo>, and
> $1<foo>. At least, I think so.
>
> / (<foo>?) <foo> (<foo>) /
>
> Now, we suddenly get three more or less unrelated arrays with lengths
> 0..1, 1, and 1. Of course, I admit this example is a bit artificial.

Oh, I hadn't caught that particular clause (or hadn't read it as
you just did). PGE certainly doesn't implement things that way.
I think the "entire rule" clause was intended to cover cases like

/ [ <foo> ]* /

where <foo> is indirectly quantified and therefore is an array of
match objects. We should probably reword it, or get a clarification
of what is intended. (Damian, @Larry: can you confirm or clarify
this for us?)

> Furthermore, I think "within the same subpattern and alternation" is
> not quite correct, at least it wouldn't apply to something like
>
> / (<foo> [ <foo> | ... ]) /
>
> unless we consider the (...) sequence as a kind of single branch
> alternation. And why are alternation branches considered to be
> lexical scopes, anyway?

In the example you give, $0<foo> is indeed an array of match objects.
The "same alternation" in this case is the subpattern... compare to

/ (<foo> [ <foo> | ... ]) | <foo> /

$0<foo> is an array, $<foo> is a single match object.

Alternation branches don't create new lexical scopes, they just
affect quantification and subpattern numbering. In both of the
following examples

/ abc <foo> def <foo> /

/ ghi <foo> | jkl <foo> /

each <foo> has the same lexical scope ($<foo>), but in the "abc"
example $<foo> is an array of match objects, while in the "ghi"
example $<foo> is a single match object.

> My second question is why adding a "?" or "??" to an unquantified
> subrule which would otherwise result in a single Match object should
> result in an array, rather than a single (possibly undefined) Match.

The specification was originally this way but was later changed
to the current definition. I think people found the idea of
"?" producing a single match object confusing, so for consistency
we ended up with "all quantifiers produce arrays of match objects".

(Note also that even if "?" produced a single Match object instead
of an array, it wouldn't be "undefined" -- it would be a failed Match.)

Larry Wall

Feb 17, 2006, 2:26:18 PM

to perl6-l...@perl.org

On Fri, Feb 17, 2006 at 08:32:18AM -0600, Patrick R. Michaud wrote:
: > The synopsis says:
: >
: > * If a subrule appears two (or more) times in the same lexical scope
: > (i.e. twice within the same subpattern and alternation), or if the
: > subrule is quantified anywhere within the entire rule, then its
: > corresponding hash entry is always assigned a reference to an array
: > of Match objects, rather than a single Match object.
: >
: > Maybe you're not the right person to ask, but is there a particular
: > reason for the "entire rule" bit?
: >
: > / (<foo>|None) <foo> (<foo>) /
: >
: > Here we get three Matches $0<foo> (possibly undefined), $<foo>, and
: > $1<foo>. At least, I think so.
: >
: > / (<foo>?) <foo> (<foo>) /
: >
: > Now, we suddenly get three more or less unrelated arrays with lengths
: > 0..1, 1, and 1. Of course, I admit this example is a bit artificial.
:
: Oh, I hadn't caught that particular clause (or hadn't read it as
: you just did). PGE certainly doesn't implement things that way.
: I think the "entire rule" clause was intended to cover cases like
:
: / [ <foo> ]* /
:
: where <foo> is indirectly quantified and therefore is an array of
: match objects. We should probably reword it, or get a clarification
: of what is intended. (Damian, @Larry: can you confirm or clarify
: this for us?)

I believe that was the intent, but I'll defer to Damian on the wordsmithing
because I'm a bit out of sorts at the moment and it'd probably come out
all sideways.

Larry

Damian Conway

Feb 20, 2006, 9:24:48 PM

to Patrick R. Michaud, perl6-l...@perl.org

Patrick clarified:

> At any rate, I find that having a subpattern capture base its
> index on the highest index of all of the previous alternation
> branches is easy to understand and works well in practice. It can
> also be easily changed with another alias if needed.

I strongly agree, and would be unhappy to see it work any other way.

>>* If a subrule appears two (or more) times in the same lexical scope
>> (i.e. twice within the same subpattern and alternation), or if the
>> subrule is quantified anywhere within the entire rule, then its
>> corresponding hash entry is always assigned a reference to an array
>> of Match objects, rather than a single Match object.
>>
>>Maybe you're not the right person to ask, but is there a particular
>>reason for the "entire rule" bit?
>>
>>/ (<foo>|None) <foo> (<foo>) /
>>
>>Here we get three Matches $0<foo> (possibly undefined), $<foo>, and
>>$1<foo>. At least, I think so.
>>
>>/ (<foo>?) <foo> (<foo>) /
>>
>>Now, we suddenly get three more or less unrelated arrays with lengths
>>0..1, 1, and 1. Of course, I admit this example is a bit artificial.
>
>
> Oh, I hadn't caught that particular clause (or hadn't read it as
> you just did). PGE certainly doesn't implement things that way.
> I think the "entire rule" clause was intended to cover cases like
>
> / [ <foo> ]* /
>
> where <foo> is indirectly quantified and therefore is an array of
> match objects. We should probably reword it, or get a clarification
> of what is intended. (Damian, @Larry: can you confirm or clarify
> this for us?)

Sorry, you're correct that it's not what was intended. I was specifically
trying to address the case where the same subrule appears with different
quantifications in different alternations in the same scope.

That is, the difference between:

m/ bar <foo> | baz <foo> / # $<foo> always contains a scalar

and:

m/ bar <foo> | baz <foo>* / # $<foo> always contains an array ref

Is this clearer:

* If a subrule appears two (or more) times in any branch of a
lexical scope (i.e. twice within the same subpattern and
alternation), or if the subrule is quantified anywhere within a
given scope, then its corresponding hash entry is always assigned
a reference to an array of Match objects, rather than a single
Match object.

???

If so, I'd be happy if someone wanted to update the Synopsis that way.

Note, however, that this question suggests that we need a more overt statement
about what constitutes a scope within a regex. I'll work on providing that
when I take my next pass through the Synopses (probably next week).

>>Furthermore, I think "within the same subpattern and alternation" is
>>not quite correct, at least it wouldn't apply to something like
>>
>>/ (<foo> [ <foo> | ... ]) /
>>
>>unless we consider the (...) sequence as a kind of single branch
>>alternation. And why are alternation branches considered to be
>>lexical scopes, anyway?
>
> In the example you give, $0<foo> is indeed an array of match objects.
> The "same alternation" in this case is the subpattern... compare to
>
> / (<foo> [ <foo> | ... ]) | <foo> /
>
> $0<foo> is an array, $<foo> is a single match object.
>
> Alternation branches don't create new lexical scopes, they just
> affect quantification and subpattern numbering. In both of the
> following examples
>
> / abc <foo> def <foo> /
>
> / ghi <foo> | jkl <foo> /
>
> each <foo> has the same lexical scope ($<foo>), but in the "abc"
> example $<foo> is an array of match objects, while in the "ghi"
> example $<foo> is a single match object.

Patrick is spot-on here.

In simplest terms, the only things that create a scope are the regex
delimiters (which delimit the outermost lexical scope), and any pair of
capturing parentheses (which delimit some nested scope).
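
For implementers, here is a rough Python sketch of how the reworded clause and this scope rule might combine to decide whether a subrule's hash entry holds an array or a single Match. All class and function names are hypothetical, and literal text in the patterns is simply left out of the tree. Counting a quantified occurrence as 2 is just a shortcut for "more than one"; it is an assumption of this sketch, not something the Synopsis prescribes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sub:
    """A subrule call such as <foo>."""
    name: str


@dataclass
class Quant:
    """Any quantified item: x?, x*, x+, x**{n}."""
    body: list


@dataclass
class Alt:
    """Alternation; each branch is a sequence of nodes."""
    branches: List[list]


@dataclass
class Group:
    """Capturing parentheses, which open a nested scope."""
    body: list


def occurrences(seq: list, quantified: bool = False) -> Counter:
    """Count how often each subrule can match along one path through this scope."""
    total = Counter()
    for node in seq:
        if isinstance(node, Sub):
            total[node.name] += 2 if quantified else 1  # quantified means "many"
        elif isinstance(node, Quant):
            total.update(occurrences(node.body, True))
        elif isinstance(node, Alt):
            best = Counter()
            for branch in node.branches:
                for name, n in occurrences(branch, quantified).items():
                    best[name] = max(best[name], n)  # branches are exclusive
            total.update(best)
        # Group: a nested scope with its own hash, so it is skipped here.
    return total


def kinds(seq: list) -> Dict[str, str]:
    """Map each subrule name in this scope to 'array' or 'single'."""
    return {name: "array" if n >= 2 else "single"
            for name, n in occurrences(seq).items()}


# / abc <foo> def <foo> /
sequence = [Sub("foo"), Sub("foo")]
# / ghi <foo> | jkl <foo> /
branches = [Alt([[Sub("foo")], [Sub("foo")]])]
# m/ bar <foo> | baz <foo>* /
mixed = [Alt([[Sub("foo")], [Quant([Sub("foo")])]])]
# / (<foo> [ <foo> | ... ]) | <foo> /
inner = [Sub("foo"), Alt([[Sub("foo")], []])]
outer = [Alt([[Group(inner)], [Sub("foo")]])]

for tree in (sequence, branches, mixed, inner, outer):
    print(kinds(tree))
```

Sequences add counts while alternation branches take the maximum, which is why two `<foo>` calls in a row give an array but one per branch gives a single Match. A quantifier in any branch forces an array for the whole scope.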

>>My second question is why adding a "?" or "??" to an unquantified
>>subrule which would otherwise result in a single Match object should
>>result in an array, rather than a single (possibly undefined) Match.
>
> The specification was originally this way but was later changed
> to the current definition. I think people found the idea of
> "?" producing a single match object confusing, so for consistency
> we ended up with "all quantifiers produce arrays of match objects".

That's my recollection too. And I certainly agree with the decision, even
though I proposed it the other way originally.

Damian
