Markup spec

There are many documents aiming to create a formal representation of the MediaWiki markup and the parser behaviour. So far, none of them is complete, but there are a number of drafts in different syntaxes such as BNF, EBNF or ANTLR. This document collects, discusses and coordinates all of these efforts.

Goals

Produce a specification of MediaWiki's markup format that is sufficiently complete and consistent for future parser implementations to be built from it. Features that are currently either not possible or very hard (e.g. WYSIWYG editing) could also benefit from such a specification.

The specification might include both a grammar description and the parser behaviour.

For the grammar description, the specification might use a standard notation such as BNF or EBNF.

In order to avoid breaking existing pages, the specification should preserve present parser behaviour that is reasonable and well-defined. Adding requirements for new behaviour must be well considered in that respect. Where the current parser's behaviour is undefined or obviously buggy, the specification may define new, different behaviour. The parser might be described using tools such as ANTLR.

A data model for the parse tree can also be defined. The data model should be representable in XML; an official XML schema for such a representation may or may not be defined. With a data model, round-trip conversion between source code and data should also be possible. There might be a many-to-one relationship between source code and parse trees, but the canonical transformation from parse tree to source code should always parse back to the same parse tree.

Feasibility study

It has been broadly asserted that wiki markup is a context-sensitive language, and therefore that it cannot be expressed with a context-free grammar (such as those defined with BNF or EBNF). To shed some light on the topic, it is useful to first define some concepts:

  • Formal grammar: a set of rules for forming strings in a language. According to the Chomsky hierarchy, a grammar can be:
    • type 3 (or regular): all rules are of the form A → a or A → aB (where A and B are single nonterminal symbols and a is a terminal symbol);
    • type 2 (or context-free): all rules are of the form A → γ (where γ is any combination of terminal and nonterminal symbols);
    • type 1 (or context-sensitive): all rules are of the form αAβ → αγβ (where α and β are any combinations of terminal and nonterminal symbols); or
    • type 0 (or unrestricted): there are no restrictions on the production rules (α → β).
  • Syntactic analysis: the process of analyzing a text to determine its grammatical structure with respect to a given formal grammar.
  • Parser: the program that performs this syntactic analysis.
  • Context-free language: a language that can be generated by a context-free grammar.
  • Context-sensitive language: a language that can be generated by a context-sensitive grammar.
  • Ambiguous grammar: a grammar that can generate some string in more than one way (i.e. with more than one parse tree).
  • Inherently ambiguous language: a language that can only be generated by ambiguous grammars.
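
To make the rule forms concrete, here is one illustrative production of each type (an added example, not part of the original taxonomy; S would be the start symbol, A and B are nonterminals, and a, b, c are terminals):

```latex
% One example production per Chomsky type (illustrative only)
\begin{align*}
\text{Type 3 (regular):}           &\quad A \to a \mid aB\\
\text{Type 2 (context-free):}      &\quad A \to aAb \mid \varepsilon\\
\text{Type 1 (context-sensitive):} &\quad aAb \to acb\\
\text{Type 0 (unrestricted):}      &\quad abA \to ba
\end{align*}
```

Note that the type 2 rule A → aAb | ε already generates arbitrarily deep nesting (aⁿ…bⁿ), which no type 3 grammar can express; this is exactly the property the feasibility argument below turns on.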

Language generation and grammar type

As can be seen from the rule forms of each grammar type, regular grammars have very relaxed rules, whereas unrestricted grammars are allowed to describe very restrictive ones. Relaxed in this context means that a regular grammar generates symbols without taking into account what has been produced so far. By contrast, the rules of context-sensitive and unrestricted grammars may require certain already-generated strings to be present before they can apply.

Contrary to what might be intuitive, context-sensitive grammars can thus have more restrictive rules. Indeed, Chomsky conceived context-sensitive grammars as a way to describe natural language. Natural languages are clearly more restrictive than classical context-free computer languages (such as C or Java), since whether a word is appropriate in a certain place depends on the context.

Wikicode is, like many other computer languages, a context-free language. Wikicode is composed of many tokens for formatting, headings, links and lists. Some tokens must be placed in a certain position (such as those that must start a line), but otherwise tokens may appear anywhere in the document, regardless of context. Considering that, it could even be argued that Wikicode is a regular language. However, a regular grammar can only express nesting if it also defines a known maximum nesting level, and Wikicode does not have one.

In short, Wikicode does not need context-sensitive grammar rules, since almost any token can be placed anywhere (with a few restrictions such as #REDIRECT, include modes and similar structures). It cannot, however, be expressed with regular grammar rules, as nested structures of arbitrary depth cannot be described by a type 3 grammar. A context-free grammar therefore suffices to describe Wikicode.
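
The nesting argument can be made concrete with a few lines of code. The sketch below (my own illustration, not MediaWiki code) measures template-brace nesting depth: a counter or pushdown stack handles it naturally, while a finite automaton, the recognizer of a type 3 grammar, would need a distinct state for every possible depth, and Wikicode imposes no maximum.

```python
def max_nesting(text, open_tok="{{", close_tok="}}"):
    """Return the maximum template-nesting depth found in a wikitext string.

    A single counter suffices to track depth, but matching open/close
    pairs at arbitrary depth is precisely what a finite automaton cannot
    do; a context-free grammar handles it with one recursive rule,
    e.g.  Template -> "{{" Body "}}".
    """
    depth = max_depth = 0
    i = 0
    while i < len(text):
        if text.startswith(open_tok, i):
            depth += 1
            max_depth = max(max_depth, depth)
            i += len(open_tok)
        elif text.startswith(close_tok, i):
            depth = max(0, depth - 1)  # ignore unmatched closers
            i += len(close_tok)
        else:
            i += 1
    return max_depth
```

For example, max_nesting("{{a|{{b|{{c}}}}}}") returns 3, and nothing in Wikicode caps that number.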

Ambiguities in the language

The fact that Wikicode uses the same characters for different tokens leads to strings that can be interpreted in many different ways. This does not mean, however, that the language is not context-free, but rather that many different combinations of grammar rules derive the same final string. Consider the following string of Wikicode:

The '''dog'''s bone

Here the word dog can be read either as enclosed in bold marks (three apostrophes), or as enclosed in italic marks (two apostrophes) with an additional apostrophe indicating a Saxon genitive. This does not necessarily mean that the language is inherently ambiguous either, as there might be a grammar that generates such structures without ambiguity. In terms of language recognition, the problem can easily be avoided by defining a precedence among the rules (much like operator precedence in the mathematical expressions of programming languages).
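
Such a precedence can be sketched as a simple classifier for apostrophe runs. The function below is a hand-written illustration loosely modelled on the precedence MediaWiki applies (run lengths 2, 3 and 5 are the documented tokens; the treatment of other lengths here is a simplification, not the parser's exact behaviour, and the function name is my own):

```python
def classify_apostrophe_run(n):
    """Classify a run of n consecutive apostrophes.

    Returns (literal_apostrophes, markup), where markup is None,
    'italic', 'bold' or 'bold-italic'.  Simplified sketch of the
    precedence MediaWiki's quote handling applies.
    """
    if n < 2:
        return (n, None)        # a lone apostrophe is plain text
    if n == 2:
        return (0, "italic")
    if n == 3:
        return (0, "bold")
    if n == 4:
        return (1, "bold")      # one literal ', then a bold token
    # five or more: extras are literal, the last five are bold+italic
    return (n - 5, "bold-italic")
```

Under this scheme the run ''' in '''dog''' resolves to bold rather than italic-plus-apostrophe simply because the three-apostrophe rule is tried first.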

Language recognition

Parsers normally use grammars to analyze or recognize strings of a certain language. When a mismatch (or error) is found, the parser needs to decide what to do with the unexpected input and how to recover from the error so that it can continue analyzing the string.

Whereas the grammar is fairly easy to describe, the parser behaviour is more complex. This is because every input string should derive the most likely result, even if it contains syntax errors. MediaWiki's parser performs complex error recovery, for instance when dealing with wrongly nested structures such as:

''The '''quick'' brown''' fox

where the opening bold mark is inside an italics structure and the closing bold mark is outside it. To produce valid output, the parser closes the italics structure and opens it again, producing output in which the two structures are properly nested. Likewise, should a tag mismatch occur, as in the following example:

The quick '''brown fox

the parser will add the missing closing tags. The huge number of recovery rules makes the language recognizer hard to describe, for every single rule should be reflected in the specification. Since there is no such thing as right or wrong Wikicode, an extensive list of recovery rules is as important as the grammar itself when aiming to create a complete description of how the wiki language is interpreted.
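
Conceptually, this close-and-reopen recovery can be sketched with a stack of open formatting tokens. The code below is an illustration of the idea, not MediaWiki's actual implementation (tokenization into open/close events is assumed to have happened already, and this sketch reopens the inner element, whereas MediaWiki's choice of which element to split may differ):

```python
def balance(tokens):
    """Re-nest a stream of inline formatting tokens into well-formed HTML.

    Tokens are ("open", name), ("close", name) or ("text", s).  When a
    tag closes out of order, the tags opened after it are closed first
    and then reopened; stray closers are dropped; anything still open at
    the end is closed.
    """
    out, stack = [], []
    for kind, val in tokens:
        if kind == "text":
            out.append(val)
        elif kind == "open":
            stack.append(val)
            out.append(f"<{val}>")
        else:  # "close"
            if val not in stack:
                continue                     # stray closing tag: drop it
            reopen = []
            while stack[-1] != val:          # close tags opened after val...
                inner = stack.pop()
                out.append(f"</{inner}>")
                reopen.append(inner)
            stack.pop()
            out.append(f"</{val}>")
            for inner in reversed(reopen):   # ...then reopen them
                stack.append(inner)
                out.append(f"<{inner}>")
    for leftover in reversed(stack):         # close anything left open
        out.append(f"</{leftover}>")
    return "".join(out)
```

Fed the token stream of the example above (italic open, bold open, italic close, bold close), it emits <i>The <b>quick</b></i><b> brown</b> fox: one element is closed and reopened so that the output nests properly.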

Current efforts

So far, progress has been made on both the grammar definition and the parser behaviour. Although none of the descriptions appears to be complete, some manage to describe a good part of the language.

Current parser descriptions try to follow the MediaWiki parser's behaviour as closely as possible. It is, however, very hard to describe properly all the error recovery and rule precedence performed by the MediaWiki parser, for there are a number of different cases of grammar mismatch (e.g. unclosed tags) and operator precedence, given the ambiguity that symbols such as ' or | bring.

  • Grammar descriptions:
  • Parser description:

Resources

  • User:HappyDog/WikiText parsing - some observations based on 1.3.10 by HappyDog
  • mail:wikitext-l - Wikitext-l maillist
  • JamWiki claims to have Mediawiki compatible syntax - see the code for an attempt to write a parser.
  • MediaCloth Is a Ruby library for parsing Mediawiki compatible syntax to XHTML.
  • WikiCloth is another Ruby library for Mediawiki markup.
  • Infoboxer is a Ruby MediaWiki client and parser, aiming (mostly successfully) to parse and navigate any page of any Wikimedia project, including Wikipedia.
  • Raid Magnus's wiki2xml work for some starting points; examine how his parser works (and how it differs from the main one) and the intermediate XML format he uses
  • Riehle et al.'s work on an EBNF grammar for Wiki Creole (a subset of MediaWiki syntax), XML interchange and XSLT transformations: http://dirkriehle.com/publications/2008/wiki-creole/

The Markup Language

The MediaWiki markup language (commonly referred to within the MediaWiki community as wikitext, though this usage is ambiguous within the larger wiki community) uses non-textual ASCII characters, sometimes in pairs, to indicate to the parser how the editor wishes an item or section of text to be displayed. The parser translates these tokens into (X)HTML as closely as semantically possible.

v1.6 markup tokens

The markup tokens fall into two broad categories: unary tokens (like : or * used at the beginning of a line), which stand alone, and binary tokens (like those for italic or boldface) which must be used in matched pairs. Unary tokens may only be preceded by comments or whitespace; otherwise, they will not be interpreted.

Unary

Start of line only
  • blank line: paragraph break (HTML <p>)
  • Horizontal line: ---- (4 or more hyphens), specified in /BNF/Article#Horizontal rule
  • Pre-formatted text: (space)
  • Lists
    • Bulleted: *
    • Numbered: #
    • Indent with no marking: :
    • Definition list: ;
    Notes:
    • These may be combined at the start of the line to create nested lists, e.g. *** to give a bulleted list three levels deep, or **# to have a numbered list within two levels of bulleted list nesting.
  • Redirects: #redirect or #REDIRECT (followed by wikilink)
  • The whole quagmire that is table formatting: {| ... |}, with |- |+ || | !! ! in between.
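
As a sketch of these start-of-line rules, a classifier might look like the following (a hand-written illustration covering only the simple tokens above; tables, redirects, headings and comments are omitted, and the function name is my own):

```python
import re

def classify_line_start(line):
    """Classify a line of wikitext by its leading unary token(s).

    Returns (kind, list_depth, content); list_depth is None except for
    list lines, where it is the number of leading markers.
    """
    if re.match(r"-{4,}\s*$", line):
        return ("hr", None, "")
    m = re.match(r"[*#:;]+", line)
    if m:
        markers = m.group(0)   # e.g. '**#': a numbered list nested two deep
        return ("list", len(markers), line[len(markers):].strip())
    if line.startswith(" ") and line.strip():
        return ("pre", None, line[1:])
    if line.strip() == "":
        return ("paragraph-break", None, "")
    return ("text", None, line)
```
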
Can be used anywhere
  • 'Magic words', e.g. __FORCETOC__, __NOEDITSECTION__ (see Help:Magic words)
  • Signatures:
    • ~~~ Replaced with your username
    • ~~~~ Replaced with your username and the date
    • ~~~~~ Replaced with the date.
    Notes:
    • These tags are replaced at the point the edit is saved.
  • Magic links: ISBN ..., RFC ..., PMID ... (see BNF/Magic links)
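
The signature substitution described above can be sketched as follows (an illustration only: the username and date are placeholder values, the function name is my own, and real MediaWiki performs this in the save parser):

```python
import re

def expand_signatures(text, username="Example",
                      date="12:00, 1 January 2024 (UTC)"):
    """Expand signature tildes at save time.

    The longest run must win first, otherwise '~~~~' would be eaten as
    '~~~' plus a stray '~'; a greedy {3,5} quantifier guarantees that.
    Runs longer than five are handled naively here (five expanded, the
    rest left as literal tildes).
    """
    def repl(m):
        n = len(m.group(0))
        if n == 5:
            return date
        if n == 4:
            return f"{username} {date}"
        return username          # n == 3
    return re.sub(r"~{3,5}", repl, text)
```
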

Binary

The ellipses (...) are used to indicate where the content goes and are not part of the markup.

Beginning of a line
  • Equals signs are used for headings (must be at start of line)
    • 1st level heading: = ... =
    • 2nd level heading: == ... ==
    • 3rd level heading: === ... ===
    • 4th level heading: ==== ... ====
    • 5th level heading: ===== ... =====
    • 6th level heading: ====== ... ======
    • Specified in /BNF/Article#Heading

Anywhere
  • Square brackets are used for links:
    • Internal/interwiki link + language links + category links + images: [[ ... ]] (see also Namespaces below)
      vertical bars separate optional parameters, which are:
      • link: first parameter: display text (also defaulted using 'pipe trick') (also trailing concatenated text included in display, e.g. s for plural)
      • image: many parameters; see w:Wikipedia:Extended image syntax; may contain nested links (and images!) in caption text.
      • category: first parameter: sort order in category list
      link contents have to be parsed for whether they're dates if $wgUseDynamicDates is on
    • External link: [ ... ]
      space separates optional second parameter, which is display text
    • undecorated URLs are also recognized and hotlinked
    • Specified in /BNF/Links
  • Apostrophes are used for formatting:
    • Italic: '' ... ''
    • Bold: ''' ... '''
    • Bold + Italic: ''''' ... '''''
    • Note that improper nesting of bold and italics is currently permitted.
  • Curly braces are used for transclusion:
    • Include template: {{ ... }} (see also Namespaces below)
      • Unlimited number of optional pipe-delimited parameters, each of which may optionally start with a parameter name preceding an equals sign
    • Include template parameter: {{{ ... }}}
      • Optionally including a pipe followed by the parameter default: {{{1|Bob}}} will use the first passed in parameter, and if none is received, will insert 'Bob' instead.
    • Use a built-in variable: {{PAGENAME}} (see m:Help:Variable)
    • Call a parser function: {{ ... }}
  • Various XML style tags:
    • Inline and text elements:
      • <!-- ... --> HTML-style comments (filtered out)
      • <nowiki> do not interpret wiki markup, do allow newline in list and indent elements (but still flow text, still allow SGML entities)
      • <onlyinclude>, <noinclude>, <includeonly>
      • SGML entities: &...; (converted to Unicode characters where appropriate, otherwise to numeric entities)
      • Parser extension tags, like <ref> (using Cite.php)
      • <math> if $wgUseTeX is set
      • Plus most non-dangerous HTML styling inline elements: 'font' (deprecated), 'bdo', 'b', 'i', 'u', 's', 'strike', 'big', 'small', 'tt', 'span'
      • Plus most non-dangerous HTML semantic inline elements: 'br', 'abbr', 'cite', 'del', 'ins', 'sub', 'sup', 'em', 'strong', 'var', 'code', 'ruby', 'rb', 'rt', 'rp'
        The following non-dangerous semantic inline elements are still not recognized (or were forgotten): 'acronym', 'q', 'address', 'dfn', 'samp', 'kbd'.
        The following potentially dangerous inline elements are forbidden: 'bgsound', 'style', 'a', 'image', 'map', 'area', 'embed', 'noembed', 'applet', 'object', 'param', 'script', 'noscript', 'iframe'
    • Block elements (most of them are semantic):
      • <html> if $wgRawHtml is set
        The following document-structure declarative elements are forbidden in the wiki source code: 'head', 'title', 'base', 'meta', 'link', 'body', 'frameset', 'frame'
      • <pre> do not interpret wiki markup, do not flow text (but still allow SGML entities)
      • <gallery>
        Note: the non-dangerous standard tags for column groups or column definitions ('colgroup', 'col') are still not supported (not even with their HTML/XHTML syntax), even though they would help avoid lots of CSS duplication or complexity in tables and would offer additional usability features, plus added semantics (notably for row groups).
      • Plus most non-dangerous HTML block elements: 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p', 'hr', 'div', 'center' (presentational, deprecated), 'blockquote', 'ol', 'ul', 'li', 'dl', 'dt', 'dd', 'table', 'caption', 'tr', 'td', 'th'
        Form elements are also not accepted in wiki source code: 'form', 'input', 'button', 'textarea', 'label' (as well as old gopher-like menu elements)
        Note: MediaWiki enforces the HTML DOM (and not the XHTML DOM), which restricts the recursive inclusion of some block element types (for example, a 'div' in the middle of a paragraph will break it, independently of its CSS 'display:' property, which may be specified in a separate stylesheet and not in the wiki source code). It will then close some elements implicitly and convert XHTML syntax to valid HTML/SGML, also recognizing elements that cannot have any content in the HTML DOM (such as 'br').
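
The {{{name|default}}} parameter behaviour described above can be sketched like this (a simplified illustration: it ignores nested braces inside defaults, positional parameters are keyed by the strings '1', '2', ..., and the function name is my own):

```python
import re

def substitute_params(body, params):
    """Expand {{{name|default}}} parameter references in a template body.

    If the parameter was passed, its value is used; otherwise the default
    after '|' is used; with no default, the reference is left verbatim
    (as MediaWiki does).
    """
    def repl(m):
        name, default = m.group(1), m.group(3)
        if name in params:
            return params[name]
        return default if default is not None else m.group(0)
    return re.sub(r"\{\{\{([^|{}]+)(\|([^{}]*))?\}\}\}", repl, body)
```

For example, substituting into "Hello {{{1|Bob}}}!" yields "Hello Alice!" when the first parameter is Alice, and "Hello Bob!" when no parameter is passed.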

Namespaces

In wikilinks and template inclusions, colons set off namespaces and other modifiers:

  • proper namespaces: Talk:, User:, Project:, etc.
  • 'special' namespaces: File: (was Image:), Category:, Template:
  • pseudo-namespaces: Special:, Media:
  • lone/leading :
    • lone : forces main namespace
    • leading : allows link to image page rather than inline image, or similarly to category or template page
  • interwiki links:
    • same project, different language: code of two or more letters
    • different project, same language: w: for Wikipedia, wikt: for Wiktionary, m: for Meta, etc. -- see m:Help:Interwiki_linking for more information (especially when using in templates; transwiki transclusion, iw_trans)
  • subst: force one-time template substitution upon edit, rather than dynamic expansion on each view
  • int:, msg:, msgnw:, raw: -- see m:Help:Magic words#Template modifiers
  • MediaWiki: magically access mediawiki formatting and boilerplate text (e.g. MediaWiki:copyrightwarning)
  • Standard parser functions: UC:, LC:, etc. (see m:Help:Parser function)
  • Additional parser functions: #expr:, #if:, #switch:, etc.
  • other extensions?

Several combinations of the above are possible, e.g. m:Help:Variable -- help namespace within Meta project.
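
A minimal sketch of the colon-splitting logic above (illustrative only: a real title resolver is case-insensitive, knows localized names and aliases, handles 'subst:' and parser functions, and consults the interwiki table rather than a hard-coded set):

```python
KNOWN_NAMESPACES = {"Talk", "User", "Project", "File", "Image",
                    "Category", "Template", "Special", "Media", "MediaWiki"}

def split_link_target(target):
    """Split a wikilink target into (force_main, namespace, title).

    A leading ':' escapes the special behaviour of e.g. [[:Category:X]]
    (link to the page rather than categorize); an unknown prefix is left
    as part of the title.
    """
    force_main = target.startswith(":")
    if force_main:
        target = target[1:]
    prefix, sep, rest = target.partition(":")
    if sep and prefix in KNOWN_NAMESPACES:
        return (force_main, prefix, rest)
    return (force_main, None, target)
```
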

MetaWiki markup description

The following text was at m:Wikitext Metasyntax and needs to be merged with the description above.

Document element declaration

Basic Markup

Define markups

Either

Parser outline

Another way to check whether we've covered everything in the grammar is to look at the steps the parser actually goes through:

The preprocessor does
  1. Strip (hooks before/after)
  2. Remove HTML-like comments
  3. Replace variables
    1. Subst
    2. MSG, MSGNW, RAW
    3. Parser functions
    4. Templates
The parser does
  1. Strip (hooks before, after)
    1. treats nowiki, pre, math and possibly others with 'userfunc tag hooks' (e.g. hiero)
    2. Removes HTML-like comments
      • HTML comments are removed. (this text by HappyDog)
      • Any tags that are not allowed by the software (e.g. <script> tags) are replaced by HTML entities, so they display as literals and are not treated as HTML by the browser.
      • Any badly formed tags (e.g. nested tags that shouldn't be nested, <tr> tags outside a <table> tag, etc.) are also replaced by HTML entities so they are not treated as HTML.
      • Any attributes that are not allowed by the software (e.g. onMouseOver) are removed from otherwise valid tags.
      • A small amount of minor source formatting is applied (basically, the removal of unnecessary whitespace).
      • A closing tag is added at the end for all tags that are not closed properly. Note that some tags (e.g. <br>) don't need to be closed.
  2. Internal parse
    1. Noinclude/onlyinclude/includeonly sections
    2. Remove HTML tags
    3. Replace variables
      1. Hooks: Internalparsebeforelinks
    4. Tables
    5. Magic words
      1. Strip TOC (__NOTOC__, __TOC__)
      2. Strip no gallery (__NOGALLERY__)
    6. do headings
    7. Do dynamic dates
    8. Do quotes ('' and ''')
    9. Replace internal links
      1. Process images (do the caption recursively as it might contain links, or even other images...)
      2. Process categories
    10. Replace external links
    11. Re-replace masked internal links
    12. Do magic links (ISBN, RFC...)
    13. Format headings (__NEWSECTIONLINK__, __FORCETOC__...)
  3. Unstrip general
  4. Fix tags (French spaces, guillemets)
  5. Blocks (lists etc)
  6. Replace link holders
  7. Language converter:
    1. Normal text converted on a word-by-word basis (?) if autoconvert is enabled
    2. Text in -{code1:text1;code2:text2;...}- blocks converted manually
    3. Text in -{...}- not converted at all.
  8. Unstrip no wiki
  9. Extra tags and params
  10. User funcs?
  11. Un strip general
  12. Normalise char references
  13. Tidy + hook

The save parser does
  1. Convert newlines
  2. Strips
  3. Pass 2
    1. Substs
    2. Strip again? gallery something.
    3. Signatures
    4. Trim trailing whitespace
  4. Unstrips
Retrieved from 'https://www.mediawiki.org/w/index.php?title=Markup_spec&oldid=4441973'
