Tuesday, 1 March 2011

Regexes for XML parsing

Limitations of the presented solution:
- regexes should be used only for "uncertain" data (e.g. XML that is not well formed);
for "real" XML use a real parser (e.g. expat)
- nested elements where the child has the same name as the parent are not supported, e.g.
<a><a></a></a>
- empty-element tags are not recognized, e.g. <a/>

XML declaration - search for encoding
"<\\?xml(\\s+(?:[^\\?<>]*?\\s+)*encoding\\s*=\\s*(['\"])((?:(?!\\2).)*)\\2[^\\?<>]*)\\?>"
Result groups:
1 - attributes
2 - quote character around the encoding value
3 - encoding attribute value
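
A minimal Python sketch of how the declaration pattern might be used. The pattern is copied from above; the helper name find_encoding and the sample string are made up for illustration.

```python
import re

# Pattern copied verbatim from the text; group 3 holds the encoding value.
XML_DECL = re.compile(
    "<\\?xml(\\s+(?:[^\\?<>]*?\\s+)*encoding\\s*=\\s*(['\"])((?:(?!\\2).)*)\\2[^\\?<>]*)\\?>"
)

def find_encoding(text, default=None):
    """Return the declared encoding, or `default` when no declaration matches."""
    m = XML_DECL.search(text)
    return m.group(3) if m else default

# Hypothetical input: the encoding attribute sits between two other attributes.
sample = "<?xml version='1.0' encoding=\"ISO-8859-2\" standalone='yes'?><root></root>"
print(find_encoding(sample))
```

A declaration without an encoding attribute does not match at all, so the helper falls back to the default.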

Element with arbitrary name
"<([^\\s<>]+)(?:(\\s[^<>]*)?>(.*?)</\\1)?\\s*>"
Result groups:
1 - element name
2 - attributes
3 - element value
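
A short Python sketch that lists every element the pattern finds in a string. The function name list_elements and the input are illustrative assumptions, not part of the original recipe.

```python
import re

# Pattern copied verbatim from the text.
ANY_ELEMENT = re.compile("<([^\\s<>]+)(?:(\\s[^<>]*)?>(.*?)</\\1)?\\s*>")

def list_elements(text):
    """Return (name, attributes, value) for each match, in document order."""
    return [m.group(1, 2, 3) for m in ANY_ELEMENT.finditer(text)]

# Hypothetical input with two sibling elements on one line.
elements = list_elements("<a>1</a><b x='y'>2</b>")
for name, attributes, value in elements:
    print(name, attributes, value)
```

Group 2 is None when the element has no attributes. Because the closing part of the pattern is optional, an unclosed tag such as <br> still matches as a bare opening tag.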

Element with specified name
"<(" + elem_name + ")(\\s[^<>]*)?>(.*?)</" + elem_name + "\\s*>"
Result groups:
1 - element name
2 - attributes
3 - element value
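
One way to build this pattern in Python, as a sketch. Two things here are my assumptions rather than part of the recipe: re.escape guards against regex metacharacters in the element name, and re.DOTALL lets the element value span several lines (without it "." stops at a newline).

```python
import re

def element_pattern(elem_name):
    """Compile the 'element with specified name' pattern for elem_name."""
    name = re.escape(elem_name)  # assumption: the name is literal text, not a regex
    return re.compile(
        "<(" + name + ")(\\s[^<>]*)?>(.*?)</" + name + "\\s*>",
        re.DOTALL,  # assumption: values may contain newlines
    )

# Hypothetical input: <titles> must not be mistaken for <title>.
sample = "<titles>x</titles><title lang='en'>Regex\nnotes</title>"
pattern = element_pattern("title")
m = pattern.search(sample)
print(m.group(1), repr(m.group(2)), repr(m.group(3)))
```

The name must be followed by whitespace or ">", so an element whose name merely starts with elem_name is skipped.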

Element with specified name and required attribute
"<(" + elem_name + ")(\\s+(?:[^<>]*?\\s+)*" + attr_name + "\\s*=\\s*(['\"])((?:(?!\\3).)*)\\3[^<>]*)>(.*?)</" + elem_name + "\\s*>"
Result groups:
1 - element name
2 - attributes
3 - quote character around the attribute value
4 - required attribute value
5 - element value
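
A Python sketch of the required-attribute pattern applied to link extraction. The helper name element_with_attr, the use of re.escape and the sample markup are assumptions made for this example.

```python
import re

def element_with_attr(elem_name, attr_name):
    """Compile the pattern for elem_name elements that carry attr_name."""
    e, a = re.escape(elem_name), re.escape(attr_name)
    return re.compile(
        "<(" + e + ")(\\s+(?:[^<>]*?\\s+)*" + a
        + "\\s*=\\s*(['\"])((?:(?!\\3).)*)\\3[^<>]*)>(.*?)</" + e + "\\s*>"
    )

# Hypothetical input: the first <a> has no href, the second quotes it with ' and
# contains " inside the value.
sample = "<a name=\"top\">skip</a><a class=\"ext\" href='http://example.com/?q=\"x\"'>site</a>"
links = [(m.group(4), m.group(5))
         for m in element_with_attr("a", "href").finditer(sample)]
print(links)
```

The backreference to the opening quote is what allows the other quote character to appear inside the attribute value.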

Element with specified name and optional attribute
"<(" + elem_name + ")(\\s+(?:[^<>]*?\\s+)*?" + attr_name + "\\s*=\\s*(['\"])((?:(?!\\3).)*)\\3[^<>]*|\\s[^<>]*)?>(.*?)</" + elem_name + "\\s*>"
Result groups:
1 - element name
2 - attributes
3 - quote character around the attribute value
4 - optional attribute value
5 - element value
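
A Python sketch of the optional-attribute case. The pattern in the code tries the attribute-bearing alternative first and only then falls back to "any attributes or none", so the attribute is captured wherever it appears in the tag; treat that ordering, the helper name and the sample as assumptions of this sketch.

```python
import re

def element_with_optional_attr(elem_name, attr_name):
    """Compile a pattern where attr_name may or may not be present."""
    e, a = re.escape(elem_name), re.escape(attr_name)
    return re.compile(
        "<(" + e + ")(\\s+(?:[^<>]*?\\s+)*?" + a
        + "\\s*=\\s*(['\"])((?:(?!\\3).)*)\\3[^<>]*|\\s[^<>]*)?>(.*?)</" + e + "\\s*>"
    )

# Hypothetical input: no attributes, id first, id last, and id missing.
sample = ("<p>plain</p>"
          "<p id=\"x\" class=\"c\">both</p>"
          "<p class=\"c\" id='y' >last</p>"
          "<p class=\"c\">none</p>")
found = [(m.group(4), m.group(5))
         for m in element_with_optional_attr("p", "id").finditer(sample)]
print(found)
```

Group 4 is None whenever the attribute is absent, so the caller can tell "missing" apart from an empty value.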

Search for an attribute within the attributes group returned by element parsing
"\\s+" + attr_name + "\\s*=\\s*(['\"])(.*?)\\1"
Result group 2 - attribute value
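
A Python sketch of this second step, run against an attributes string such as the one captured in group 2 of the element patterns. The function name attribute_value and the sample string are hypothetical.

```python
import re

def attribute_value(attributes, attr_name):
    """Return the value of attr_name inside an attributes string, or None."""
    if not attributes:  # group 2 is None when the element has no attributes
        return None
    m = re.search(
        "\\s+" + re.escape(attr_name) + "\\s*=\\s*(['\"])(.*?)\\1",
        attributes,
    )
    return m.group(2) if m else None

# Hypothetical attributes string, including the leading whitespace.
attrs = " id=\"7\" title='He said \"hi\"'"
print(attribute_value(attrs, "id"))
print(attribute_value(attrs, "title"))
```

The leading "\\s+" matters: it keeps a search for "itle" from matching inside "title".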

Here is a discussion on Stack Overflow about these regexes for XML:
http://stackoverflow.com/questions/5204022/regex-for-xml-parsing