Limitations of the presented solution:
- regexes should be used only for "uncertain" data (e.g. XML that is not well-formed);
for "real" XML a real parser should be used (e.g. expat)
- an element structure in which a child element has the same name as its parent is not supported, e.g.
<a><a></a></a>
- empty-element tags are not recognized, e.g. <a/>
XML declaration - search for encoding
"<\\?xml(\\s+(?:[^\\?<>]*?\\s+)*encoding\\s*=\\s*(['\"])((?:(?!\\2).)*)\\2[^\\?<>]*)\\?>"
Result groups:
1 - attributes
3 - encoding attribute value
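A minimal sketch of how the declaration pattern could be used, here in Python. The pattern is the one above, written as a raw string so the backslashes are not doubled; the helper name find_encoding and the sample strings are made up for illustration.

```python
import re

# Same pattern as above, as a Python raw string (no doubled backslashes).
XML_DECL = re.compile(
    r"""<\?xml(\s+(?:[^\?<>]*?\s+)*encoding\s*=\s*(['"])((?:(?!\2).)*)\2[^\?<>]*)\?>"""
)


def find_encoding(text):
    """Return the declared encoding, or None when no encoding is declared."""
    match = XML_DECL.search(text)
    # Group 3 holds the value of the encoding attribute.
    return match.group(3) if match else None


sample = '<?xml version="1.0" encoding="UTF-8"?><root></root>'
print(find_encoding(sample))
print(find_encoding('<?xml version="1.0"?><root></root>'))
```

A declaration without an encoding attribute does not match at all, so the helper returns None in that case.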
Element with arbitrary name
"<([^\\s<>]+)(?:(\\s[^<>]*)?>(.*?)</\\1)?\\s*>"
Result groups:
1 - element name
2 - attributes
3 - element value
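As an illustration, the pattern for an arbitrary element might be applied like this in Python. The re.DOTALL flag is an assumption added here so that an element value can span several lines; the function name first_element is hypothetical.

```python
import re

# Pattern from above as a raw string. DOTALL lets "." match newlines,
# which the original pattern does not do on its own.
ANY_ELEMENT = re.compile(r"<([^\s<>]+)(?:(\s[^<>]*)?>(.*?)</\1)?\s*>", re.DOTALL)


def first_element(text):
    """Return (name, attributes, value) of the first element found, or None."""
    match = ANY_ELEMENT.search(text)
    if match is None:
        return None
    # Group 2 is None when the element has no attributes; when present,
    # it keeps the leading whitespace.
    return match.group(1), match.group(2), match.group(3)


print(first_element('<item id="1">hello</item>'))
print(first_element("<note>\nline\n</note>"))
```

Note that the attributes group keeps its leading whitespace, which matters for the attribute search pattern shown later.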
Element with specified name
"<(" + elem_name + ")(\\s[^<>]*)?>(.*?)</" + elem_name + "\\s*>"
Result groups:
1 - element name
2 - attributes
3 - element value
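A short Python sketch of the named-element pattern, which also shows the two limitations listed at the top. The helper element_pattern is hypothetical; re.escape and re.DOTALL are additions not present in the original pattern (the first guards against special characters in the name, the second allows multi-line values).

```python
import re


def element_pattern(elem_name):
    """Build the named-element pattern for the given element name."""
    name = re.escape(elem_name)  # assumption: treat the name literally
    return re.compile(
        r"<(" + name + r")(\s[^<>]*)?>(.*?)</" + name + r"\s*>", re.DOTALL
    )


doc = '<doc><title lang="en">Regex</title><title>Second</title></doc>'
values = [m.group(3) for m in element_pattern("title").finditer(doc)]
print(values)

# Limitation: a child with the same name ends the match too early.
nested = element_pattern("a").search("<a><a></a></a>")
print(nested.group(0))

# Limitation: an empty-element tag is not matched at all.
print(element_pattern("a").search("<a/>"))
```

The lazy (.*?) stops at the first closing tag it finds, which is why the nested case yields a truncated match.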
Element with specified name and required attribute
"<(" + elem_name + ")(\\s+(?:[^<>]*?\\s+)*" + attr_name + "\\s*=\\s*(['\"])((?:(?!\\3).)*)\\3[^<>]*)>(.*?)</" + elem_name + "\\s*>"
Result groups:
1 - element name
2 - attributes
4 - required attribute value
5 - element value
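The required-attribute pattern could be exercised as below (Python sketch). The function name element_with_attr, the sample document and its URL are invented for the example; re.escape and re.DOTALL are assumptions added on top of the original pattern.

```python
import re


def element_with_attr(elem_name, attr_name):
    """Build the pattern for an element that must carry the given attribute."""
    name = re.escape(elem_name)
    attr = re.escape(attr_name)
    return re.compile(
        r"<(" + name + r")(\s+(?:[^<>]*?\s+)*" + attr
        + r"""\s*=\s*(['"])((?:(?!\3).)*)\3[^<>]*)>(.*?)</""" + name + r"\s*>",
        re.DOTALL,
    )


doc = (
    '<links><link rel="x">skip</link>'
    "<link rel=\"y\" href='http://example.com/a'>text</link></links>"
)
match = element_with_attr("link", "href").search(doc)
# Group 4 is the attribute value, group 5 the element value.
print(match.group(4), match.group(5))
```

The first link element is skipped because it has no href attribute; only the second one matches.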
Element with specified name and optional attribute
"<(" + elem_name + ")(\\s*|\\s+(?:(?:[^<>]*?\\s+)*?" + attr_name + "\\s*=\\s*(['\"])((?:(?!\\3).)*)\\3)?[^<>]*)>(.*?)</" + elem_name + "\\s*>"
Result groups:
1 - element name
2 - attributes
4 - optional attribute value
5 - element value
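A Python sketch for the optional-attribute case. The pattern used here writes the first alternative as \s* and places the lazy skip over preceding attributes inside the optional group, so that the closing > is consumed only once and the attribute is found wherever it appears in the tag. The helper name and sample inputs are hypothetical; re.escape and re.DOTALL are assumptions.

```python
import re


def element_with_optional_attr(elem_name, attr_name):
    """Build the pattern for an element that may carry the given attribute."""
    name = re.escape(elem_name)
    attr = re.escape(attr_name)
    return re.compile(
        r"<(" + name + r")(\s*|\s+(?:(?:[^<>]*?\s+)*?" + attr
        + r"""\s*=\s*(['"])((?:(?!\3).)*)\3)?[^<>]*)>(.*?)</""" + name + r"\s*>",
        re.DOTALL,
    )


pattern = element_with_optional_attr("a", "id")
results = []
for text in ('<a id="7" k="v">x</a>', '<a k="v" id="8">w</a>', '<a k="v">y</a>', "<a>z</a>"):
    m = pattern.search(text)
    # Group 4 is None when the attribute is absent.
    results.append((m.group(4), m.group(5)))
print(results)
```

All four inputs match; group 4 is filled only for the elements that actually have the id attribute.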
Search for an attribute within the attributes group returned by element parsing
"\\s+" + attr_name + "\\s*=\\s*(['\"])(.*?)\\1"
Result group 2 - attribute value
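A small Python sketch of this second step, assuming the attributes string comes from group 2 of one of the element patterns (and may therefore be None when the element has no attributes). The function name attribute_value is hypothetical, and re.escape and re.DOTALL are additions.

```python
import re


def attribute_value(attributes, attr_name):
    """Look up one attribute in the attributes string of a matched element."""
    if attributes is None:  # element had no attributes at all
        return None
    match = re.search(
        r"\s+" + re.escape(attr_name) + r"""\s*=\s*(['"])(.*?)\1""",
        attributes,
        re.DOTALL,
    )
    return match.group(2) if match else None


attrs = ' version="1.0" encoding="UTF-8"'
print(attribute_value(attrs, "encoding"))
print(attribute_value(attrs, "standalone"))
```

The pattern requires whitespace before the attribute name, so the leading whitespace kept in the attributes group must not be stripped before the search.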
There is a discussion on Stack Overflow about using regexes for XML:
http://stackoverflow.com/questions/5204022/regex-for-xml-parsing
Tuesday, 1 March 2011