|
Folks, I'm having trouble writing a grammar. First and foremost, I'm disappointed that there are no samples in my May Oslo SDK, so I can't see any languages other than the ones I can find online, and there are no good tutorials other than the "movies" tutorial at the dev center. I'm looking for a grammar that lets me grab a little bit of metadata, but mainly I want it to organize large blocks of free-form text. So, for instance, my DSL might look something like this:
Content "Hello World" by "Kevin Hoffman"
with tags test sample foo bar
End Content
I have an Mg that parses this... it's the blocks of text that I want to appear inside that I can't seem to get an Mg for. Every attempt at creating a block syntax fails miserably. Ideally I'd like:
Content ....
Normal Block
.. some prose ..
End Block
Code Block
... some code ...
End Block
End Content
So that my MGraph output will have the title and tags of the content (I already have this working), plus a list of blocks of varying types, each containing a bunch of free-form text. Anybody have any idea where to start on this or some samples I can look at? Thanks! The .NET Addict - http://dotnetaddict.dotnetdevelopersjournal.com | | Kevin Hoffman | Yeah, honestly the hard part there is the arbitrary text inside of the blocks. I haven't really found a great way to do this either; the only solution I've found is to make the blocks a token with "Code Block" and "End Block" as part of the token, except of course then that text ends up inside of the token.
I actually think there is a deficiency in the way MGrammar tokenizes: it seems to always prefer the greediest (longest) token. Even if you use a "final token", another token that appears to match a longer span can override it, and the conflict ends up being impossible to resolve. For example:
syntax TextBlock = TextBlockStart TextBody BlockEnd;
final token TextBlockStart = "Text Block";
final token BlockEnd = "End Block";
token TextBody = any*;
To me this should work as it appears: the token TextBody should continue until it finds a final token match, but it does not, which makes arbitrary blocks nearly impossible to solve. However, if you change TextBlock to be a token then I think it will work, except of course that "Text Block" and "End Block" become part of the token. This is how strings and comments can work: they have distinct begin and end characters, but those characters get included in the token.
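As a rough illustration of the greedy-match problem described above, here is a small sketch in Python rather than MGrammar. Regex backtracking is only loosely analogous to how the Mg tokenizer behaves, so this is an analogy under that assumption, not a description of the tokenizer itself. The sample input and the names `greedy` and `bounded` are made up for the example.

```python
import re

# Two blocks in a row, so the difference between the patterns is visible.
SOURCE = (
    "Text Block\n"
    "here is some text\n"
    "End Block\n"
    "Code Block\n"
    "Line1;\n"
    "End Block\n"
)

# Greedy body, similar in spirit to `any*`: it runs on to the LAST
# "End Block" in the input and swallows everything in between.
greedy = re.compile(r"Text Block\n(.*)End Block", re.DOTALL)

# Bounded body: the delimiters and the body are matched as one unit and the
# body stops at the FIRST "End Block", which is what a block needs.
bounded = re.compile(r"Text Block\n(.*?)End Block", re.DOTALL)

greedy_body = greedy.search(SOURCE).group(1)
bounded_body = bounded.search(SOURCE).group(1)

print(repr(greedy_body))
print(repr(bounded_body))
```

The bounded pattern corresponds to the "make the whole block one token" workaround: the start and end markers are matched together with the body, and the body is extracted afterwards.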
You might be able to improve things by breaking the text up into lines and using "\r\n" as the end marker. Something like:
token Text = Line+;
token Line = (' '..'~')* '\r' '\n';
Good luck!
- Proposed as answer by Kraig Brockschmidt (MSFT, Moderator), Friday, July 31, 2009 5:28 PM
- Marked as answer by Kraig Brockschmidt (MSFT, Moderator), Friday, August 07, 2009 2:46 PM
| | justncase80 | Can you provide a sample grammar so we can see what's going on? Also a real sample of the text you want? MGrammar does pretty well at parsing text that has clear begin/end tokens for blocks, so you should be able to achieve what you are trying to do. | | justncase80 | Here's my grammar so far:
module Kevin
{
    @{CaseSensitive[false]}
    language ContentLanguage
    {
        interleave Skippable
            = Whitespace;

        syntax Author
            = By name:Name => name
            | By name:NameVerbatim => name
            | empty => "Ulysses Agenda Team";

        syntax Tags
            = WithTags tags:Name+ => tags
            | empty;

        syntax Main
            = post:Post
              => Post{ valuesof(post) };

        syntax Post
            = PostStart name:PostName author:Author tags:Tags PostEnd
              => { Title {name},
                   Author {author},
                   Tags {tags}
                 };

        syntax PostName
            = name:Name => name
            | name:NameVerbatim => name;

        nest syntax NameVerbatim
            = '"' name:NameWithWhitespace '"' => name;

        token AlphaNumerical
            = 'a'..'z' | 'A'..'Z' | '0'..'9';

        @{Classification["Keyword"]}
        final token PostStart
            = "Post";

        @{Classification["Keyword"]}
        final token By
            = "by";

        @{Classification["Keyword"]}
        final token WithTags
            = "with tags";

        @{Classification["Keyword"]}
        final token PostEnd
            = "End Post";

        token Name
            = AlphaNumerical+;

        token NameWithWhitespace
            = (AlphaNumerical | Whitespace)+;

        token Whitespace
            = '\r'
            | '\n'
            | '\t'
            | ' ';
    }
}
And here's a sample input:
Post "This is a sample post" by kevin
with tags red white blue awesome
End Post
The grammar properly parses all of this, giving me the following MGraph output:
Post{
  Title => "This is a sample post",
  Author => "kevin",
  Tags => [
    "red",
    "white",
    "blue",
    "awesome"
  ]
}
What I can't figure out how to do is allow an arbitrary number of text blocks to appear within the post. I want to be able to support a DSL that looks like this:
Post "This is a sample post" by kevin
with tags red white blue awesome
Text Block
here is some text
and some more text
End Block
Code Block
Line1;
Line2;
End Block
Text Block
And here's some more text
End Block
End Post
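To make the target structure concrete, here is a minimal sketch of the line-by-line approach applied to the sample above. It is written in Python, not MGrammar, and only illustrates the idea of treating "Text Block" / "Code Block" / "End Block" as whole-line markers and collecting everything between them verbatim. The function name `parse_post` and the output keys (`Header`, `Blocks`, `Kind`, `Text`) are invented for the example.

```python
import re

SAMPLE = '''Post "This is a sample post" by kevin
with tags red white blue awesome
Text Block
here is some text
and some more text
End Block
Code Block
Line1;
Line2;
End Block
Text Block
And here's some more text
End Block
End Post
'''

# A block starts on a line that is exactly "Text Block" or "Code Block".
BLOCK_START = re.compile(r"^(Text|Code) Block$")


def parse_post(source):
    """Collect free-form lines between 'X Block' and 'End Block' markers."""
    header = []     # lines outside any block (title, author, tags)
    blocks = []     # finished blocks, in document order
    current = None  # (kind, lines) while inside a block, otherwise None
    for raw in source.splitlines():
        line = raw.strip()
        if current is not None:
            if line == "End Block":
                kind, lines = current
                blocks.append({"Kind": kind, "Text": "\n".join(lines)})
                current = None
            else:
                current[1].append(raw)  # keep the body line untouched
            continue
        start = BLOCK_START.match(line)
        if start:
            current = (start.group(1), [])
        elif line == "End Post":
            break
        elif line:
            header.append(line)
    return {"Header": header, "Blocks": blocks}


post = parse_post(SAMPLE)
for block in post["Blocks"]:
    print(block["Kind"], repr(block["Text"]))
```

One limitation of this sketch is that a body line consisting only of "End Block" always closes the block, so the free-form text cannot contain that marker on a line of its own.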
The .NET Addict - http://dotnetaddict.dotnetdevelopersjournal.com | | Kevin Hoffman | Just to note, in answer to your search for sample grammars, that there is an "M" Language Gallery here on the Dev Center...just isn't always the most visible thing. It's at http://msdn.microsoft.com/en-us/oslo/cc749619.aspx. .Kraig | | Kraig Brockschmidt | Yeah I've seen that. Some of those links don't work at all, one of them doesn't actually publish the Mg file, and none of them include arbitrary blocks of free-form text :) The .NET Addict - http://dotnetaddict.dotnetdevelopersjournal.com | | Kevin Hoffman | Thanks for pointing out that some of the links don't work. Haven't checked those in a while, so I'll put it on the update list.
.Kraig | | Kraig Brockschmidt | Here is a link to a grammar where I use freeform text:
Notice though, no syntax :( all tokens. I'm basically scrubbing the token input in code-behind to get this to work... but at least it works. | | justncase80 | Oh, and to explain what it does in case you want to run it for your own testing: it essentially parses text such as this:
int x = 0;
@for(item in Items):
x += {item.Value};
@end
Meaning, it looks for lines beginning with @ (whitespace ignored before the @) and blocks with { }. Everything else is essentially freeform text. Part of the trick here is not interleaving whitespace or newlines, you have to explicitly factor it into your tokens.
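The scanning rule just described can be sketched as follows. This is Python, not the linked grammar (which is not shown in this thread), so every name here (`scan`, `HOLE`, the token labels) is an assumption made for illustration. It treats lines whose first non-whitespace character is @ as directives, splits the remaining lines into free-form text and { } holes, and keeps line endings inside the text tokens since nothing is interleaved away.

```python
import re

TEMPLATE = (
    "int x = 0;\n"
    "  @for(item in Items):\n"
    "x += {item.Value};\n"
    "@end\n"
)

# A hole is a brace pair with no nested braces inside it.
HOLE = re.compile(r"\{([^{}]*)\}")


def scan(template):
    """Split a template into directive, text and expr tokens."""
    tokens = []
    for line in template.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("@"):
            # Whitespace before the @ is ignored; the rest is the directive.
            tokens.append(("directive", stripped[1:]))
            continue
        pos = 0
        for match in HOLE.finditer(line):
            if match.start() > pos:
                tokens.append(("text", line[pos:match.start()]))
            tokens.append(("expr", match.group(1)))
            pos = match.end()
        if pos < len(line):
            # Whatever is left, including the newline, is free-form text.
            tokens.append(("text", line[pos:]))
    return tokens


tokens = scan(TEMPLATE)
for kind, value in tokens:
    print(kind, repr(value))
```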
Future updates will allow @@ in case you want an @ at the beginning of a line to be considered literal. I hope this helps at least a little. | | justncase80 | Thanks for the link--I'll get that onto the grammar gallery soon. | | Kraig Brockschmidt | I generate LaTeX docs using Mg with requirements similar to yours (mostly verbatim text... in LaTeX). I call it HLaTeX (High-Level LaTeX), but I haven't made it available on my site (http://sixpairs.com) yet. In the meantime, you can have a look at how this grammar works by checking out the first video on the following page: http://www.sixpairs.com/mgplugin.aspx Although this video is about an Intellipad plug-in, it is relevant in this context. (Sorry the page is so heavy; just download the video from the link without waiting for all the others.) Though I have not posted the grammar, I am happy to send it to you if you find it of any use; I generate my internal software documentation with it instead of Word. I can also send you HLaTeX.exe as a whole if you have TeX Live. This will probably not do exactly what you want, but it is all I have at present that is remotely similar. Sorry if it looks too meagre; in any case I have a Pov-Ray generator that uses inline verbatim text, where my parser falls back to Pov-Ray's SDL (as text) wherever I have not implemented something (down to vectors and coordinates, up to whole scenes, lights etc.); you can find it @ http://sixpairs.com/povol.aspx for online use! If you know Pov-Ray, you can try entering, e.g., angle*"sin(smthng)+cos(25+other)" instead of angle*0.123. | | Ceyhun Ciper | Can you provide just a few of the tricks you used for parsing freeform text? Even if not the whole grammar. | | justncase80 | It is not tricky in HLatex; the string processing part is:
@{Classification["String"]}
token VerbatimLatex =
    '"' c:DQuoteChar* '"'
    => VerbatimLatex{c};
token DQuoteChar =
    Grammar.DoubleQuoteTextCharacter
    | Grammar.DoubleQuoteTextVerbatimCharacter
    | '\r' '\n';
And the whole grammar is:
module Ceyhun {
    import Language;

    @{CaseInsensitive}
    language HLatex {
        syntax Main =
            title:LabeledVerbatimLatex("Title")?
            author:LabeledVerbatimLatex("Author")?
            date:LabeledVerbatimLatex("Date")?
            toc:TableOfContents?
            b:Block* => Document {
                title, author, date, toc,
                valuesof(b) };

        syntax LabeledVerbatimLatex(label) =
            l:label t:VerbatimLatex => id(l){valuesof(t)};

        syntax Block =
            v:VerbatimLatex => v
            | d:Drawing => d;

        syntax Drawing = "drawing" "{" s:Shape* "}" => Drawing{valuesof(s)};
        syntax Shape = 'point' x:Double ','? y:Double => Point {x, y};

        token TableOfContents = "toc" => TableOfContents{};

        @{Classification["String"]}
        token VerbatimLatex =
            '"' c:DQuoteChar* '"'
            => VerbatimLatex{c};
        token DQuoteChar =
            Grammar.DoubleQuoteTextCharacter
            | Grammar.DoubleQuoteTextVerbatimCharacter
            | '\r' '\n';

        token Double = (('0'..'9')* '.')* Grammar.Integer;

        interleave Whitespace = Base.Whitespace;
    }
}
But in my PovRay Object Language (PovOL) there is stuff like:
syntax Value =
    e:Expression => Value[e]
    | v:TVerbatim => Value[valuesof(v)];
syntax Point =
    precedence 2: v:Vector => Point{valuesof(v)}
    | precedence 1: v:TVerbatim => Point{valuesof(v)};
syntax Radius = v:Value => Radius{valuesof(v)};
syntax Cone = c:Cone("Cone") => Cone{valuesof(c)};
syntax OpenCone = c:Cone("OpenCone") => OpenCone{valuesof(c)};
syntax Cone(kw) =
    kw => [ BasePoint[0], BaseRadius[1], CapPoint["y"], CapRadius[0] ]
    | kw base_radius:Radius => [ BasePoint[0], BaseRadius[base_radius], CapPoint["y"], CapRadius[0] ]
    | kw base_point:Point cap_point:Point => [ BasePoint[base_point], BaseRadius[1], CapPoint[cap_point], CapRadius[0] ]
    | kw base_point:Point base_radius:Radius cap_point:Point => [ BasePoint[base_point], BaseRadius[base_radius], CapPoint[cap_point], CapRadius[0] ]
    | kw base_point:Point base_radius:Radius cap_point:Point cap_radius:Radius => [ BasePoint[base_point], BaseRadius[base_radius], CapPoint[cap_point], CapRadius[cap_radius] ]
    ;
which is quite tricky; even a Cone's cap point (say) or cap radius can be a verbatim expression (string) like so: "(sin(pi)+cos(clock/2))*<2,3,5>"
Try it online @ http://www.sixpairs.com/PovOL.aspx -Ceyhun
- Edited by Ceyhun Ciper, Tuesday, August 04, 2009 8:52 AM
| | Ceyhun Ciper | This is a big deal for me. I've been trying to write a bunch of grammars for simple textual markup languages, such as Wikipedia markup, Markdown and other trivial variants where most of the text is undefined. Doing this with MGrammar as it stands is just not a good idea, but I really would like to be able to. Let's take a simple sample: Hello *world*, I want that text to be bold, and _this_ to be emphasized (italic). This is a [link|to_some_other_page]. Now that's solvable with a bunch of regular expressions, but that's a bad idea because it would not be well suited to parsing something like _*[link]*_. A regular expression is not recursive in the way a parser is, and when working with free-form text it's essential to be able to express the opposite of a formal grammar. If MGrammar supported a mode where input that does not match any specific token or syntax rule is put into a catch-all token, you would be able to collect the input while benefiting from a fully fledged parser. Normally this is where the parser produces an error, but it seems to me that if you could opt out of that error and just collect the input, you would have a much simpler way of working with free-form text in MGrammar. e.g. syntax Markup = "*" text:collect() Markup "*" => Bold{text};
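The catch-all idea can be sketched outside MGrammar as a small recursive-descent parser in which any character that does not start or close a markup construct is simply collected into a text node. The code below is Python and purely illustrative: `collect()` in the post is a proposed feature, not an existing MGrammar construct, and the function name `parse_inline` and the node labels (`Text`, `Bold`, `Emphasis`, `Link`) are invented here.

```python
def parse_inline(text, pos=0, closer=None):
    """Parse *bold*, _emphasis_ and [label|target]; collect the rest as text.

    Returns (nodes, next_pos). `closer` is the delimiter that ends the
    current nested span, or None at the top level.
    """
    nodes, buf = [], []

    def flush():
        # Turn the characters collected so far into one Text node.
        if buf:
            nodes.append(("Text", "".join(buf)))
            buf.clear()

    while pos < len(text):
        ch = text[pos]
        if ch == closer:
            flush()
            return nodes, pos + 1
        if ch in "*_" and text.find(ch, pos + 1) != -1:
            # Recurse: this is the step a flat regular expression cannot do.
            flush()
            children, pos = parse_inline(text, pos + 1, ch)
            nodes.append(("Bold" if ch == "*" else "Emphasis", children))
            continue
        if ch == "[":
            end = text.find("]", pos + 1)
            if end != -1:
                flush()
                label, _, target = text[pos + 1:end].partition("|")
                nodes.append(("Link", label, target or label))
                pos = end + 1
                continue
        buf.append(ch)  # catch-all: no rule matched, so just collect it
        pos += 1
    flush()
    return nodes, pos


flat, _ = parse_inline("Hello *world*, and _this_ is a [link|to_some_other_page].")
nested, _ = parse_inline("_*[link]*_")
print(flat)
print(nested)
```

The sketch is deliberately lenient: a span whose closing delimiter was consumed by an inner construct simply runs to the end of the input instead of raising an error, which is one possible reading of "opt out of that error and just collect the input".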
| | John Leidegren |
|