Tuesday, 1 March 2011

Regexes for XML parsing

Limitations of the presented solution:
- regexes should be used only for "uncertain" data (e.g. XML that is not well formed);
for "real" XML a real parser should be used (e.g. expat)
- nested elements where a child has the same name as its parent are not supported, e.g.
<a><a></a></a>
- empty-element tags are not recognized, e.g. <a/>

XML declaration - search for encoding
"<\\?xml(\\s+(?:[^\\?<>]*?\\s+)*encoding\\s*=\\s*(['\"])((?:(?!\\2).)*)\\2[^\\?<>]*)\\?>"
Result groups:
1 - attributes
3 - encoding attribute value
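
As a rough illustration, the declaration regex can be tried out in Python. This is only a sketch: the helper name find_encoding and the sample string are made up for the example, and the pattern is the one above written as a raw string (so with single backslashes).

```python
import re

# The XML declaration pattern from above as a Python raw string.
# Group 1 = attributes, group 2 = quote character, group 3 = encoding value.
XML_DECL_RE = re.compile(
    r"""<\?xml(\s+(?:[^\?<>]*?\s+)*encoding\s*=\s*(['"])((?:(?!\2).)*)\2[^\?<>]*)\?>"""
)


def find_encoding(text):
    """Return the declared encoding, or None when no encoding is declared."""
    m = XML_DECL_RE.search(text)
    return m.group(3) if m else None


sample = '<?xml version="1.0" encoding="GB2312"?><root/>'
print(find_encoding(sample))  # GB2312
```

Because search() is used, the declaration does not have to sit at the very start of the data, which suits input that is already known to be sloppy.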

Element with arbitrary name
"<([^\\s<>]+)(?:(\\s[^<>]*)?>(.*?)</\\1)?\\s*>"
Result groups:
1 - element name
2 - attributes
3 - element value

Element with specified name
"<(" + elem_name + ")(\\s[^<>]*)?>(.*?)</" + elem_name + "\\s*>"
Result groups:
1 - element name
2 - attributes
3 - element value
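
A minimal Python sketch of the regex above. The function names and the sample document are hypothetical. Two things here are my assumptions rather than part of the pattern: re.escape() protects element names containing regex metacharacters, and re.DOTALL lets the element value span several lines (without it "." stops at a newline).

```python
import re


def element_regex(elem_name):
    """Build the 'element with specified name' pattern for elem_name."""
    name = re.escape(elem_name)  # assumption: treat the name as literal text
    return re.compile(
        r"<(" + name + r")(\s[^<>]*)?>(.*?)</" + name + r"\s*>",
        re.DOTALL,  # assumption: element values may contain newlines
    )


def find_elements(doc, elem_name):
    """Return (attributes, value) for every non-nested match in doc."""
    return [(m.group(2), m.group(3)) for m in element_regex(elem_name).finditer(doc)]


doc = '<list><item id="1">first</item>\n<item>second\nline</item></list>'
for attributes, value in find_elements(doc, "item"):
    print(repr(attributes), repr(value))
```

Group 2 is None when the start tag has no attributes, so callers should check it before searching it further.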

Element with specified name and required attribute
"<(" + elem_name + ")(\\s+(?:[^<>]*?\\s+)*" + attr_name + "\\s*=\\s*(['\"])((?:(?!\\3).)*)\\3[^<>]*)>(.*?)</" + elem_name + "\\s*>"
Result groups:
1 - element name
2 - attributes
4 - required attribute value
5 - element value
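
The required-attribute variant can be sketched the same way. Again the helper names and the sample data are invented for the example, and re.escape() and re.DOTALL are my additions, not part of the original pattern.

```python
import re


def element_with_attr_regex(elem_name, attr_name):
    """Build the 'specified name and required attribute' pattern."""
    elem = re.escape(elem_name)
    attr = re.escape(attr_name)
    return re.compile(
        r"<(" + elem + r")(\s+(?:[^<>]*?\s+)*" + attr
        + r"""\s*=\s*(['"])((?:(?!\3).)*)\3[^<>]*)>(.*?)</""" + elem + r"\s*>",
        re.DOTALL,  # assumption: element values may contain newlines
    )


def find_with_attr(doc, elem_name, attr_name):
    """Return (attribute value, element value) pairs; group numbers as listed above."""
    rx = element_with_attr_regex(elem_name, attr_name)
    return [(m.group(4), m.group(5)) for m in rx.finditer(doc)]


doc = "<ch name=\"CCTV-1\" id='7'>news</ch><ch name=\"x\">skip</ch>"
print(find_with_attr(doc, "ch", "id"))
```

The second element in the sample has no id attribute, so it is skipped rather than returned with an empty value.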

Element with specified name and optional attribute
"<(" + elem_name + ")(\\s*|\\s+(?:[^<>]*?\\s+)*(?:" + attr_name + "\\s*=\\s*(['\"])((?:(?!\\3).)*)\\3)?[^<>]*)>(.*?)</" + elem_name + "\\s*>"
Result groups:
1 - element name
2 - attributes
4 - optional attribute value (captured only when the attribute is the last one in the tag; otherwise search group 2 with the attribute regex below)
5 - element value

Search for an attribute within the attributes group returned by element parsing
"\\s+" + attr_name + "\\s*=\\s*(['\"])(.*?)\\1"
Result group 2 - attribute value
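
A small Python sketch of this second step: take the attributes group produced by one of the element regexes and look up a single attribute in it. The function name and the sample string are hypothetical. Since the lookup runs on the whole attributes string, it does not depend on where the attribute sits in the tag.

```python
import re


def attribute_value(attributes, attr_name):
    """Find attr_name inside an attributes string (group 2 of an element match)."""
    if not attributes:  # group 2 is None or empty when the tag has no attributes
        return None
    m = re.search(
        r"\s+" + re.escape(attr_name) + r"""\s*=\s*(['"])(.*?)\1""",
        attributes,
        re.DOTALL,  # assumption: attribute values may contain newlines
    )
    return m.group(2) if m else None


attrs = " lang=\"zh\" id='42' title=\"a 'quoted' word\""
print(attribute_value(attrs, "id"))
print(attribute_value(attrs, "title"))
```

The backreference \1 makes the value end at the same kind of quote that opened it, so a single quote inside a double-quoted value is kept.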

Here is a discussion on Stack Overflow regarding regexes for XML:
http://stackoverflow.com/questions/5204022/regex-for-xml-parsing

Thursday, 24 February 2011

CMMB and not well-formed XML

The Chinese mobile TV standard CMMB carries data in XML format.
Unfortunately, broadcasters send files that are not well-formed XML.
Commonly the ampersand sign '&' is not written in its entity form '&amp;'.
Who knows what else we can find there...
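
One possible repair for the ampersand case, sketched in Python. This is an assumption about how such data could be pre-cleaned, not what any CMMB receiver actually does: every '&' that does not already start an entity or character reference is rewritten to '&amp;'. The names and the sample text are made up.

```python
import re
import xml.etree.ElementTree as ET

# '&' that is NOT followed by a named, decimal or hexadecimal reference.
BARE_AMP_RE = re.compile(r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)")


def escape_bare_ampersands(text):
    """Replace stray '&' characters with '&amp;' and leave existing references alone."""
    return BARE_AMP_RE.sub("&amp;", text)


broken = "<title>Tom & Jerry &amp; Co &#38; R&D</title>"
fixed = escape_bare_ampersands(broken)
print(fixed)

# After the repair a real parser accepts the fragment.
print(ET.fromstring(fixed).text)
```

The heuristic can still guess wrong, e.g. for text that happens to look like an entity ("AT&T;"), so it belongs in front of the "uncertain data" path only.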

Now I know that there is a lot more:
- time is crazy, especially the offset from UTC: sometimes it is +8h, sometimes -8h, sometimes 0; it differs across the country, with special "cases" in Hong Kong and Macau,
- moreover, time in DTMB seems to lag behind CMMB (and the correct time) by ~15 min in Shanghai,
- EPG data is not updated properly and is sometimes delayed,
- ...

Sunday, 9 August 2009

Priority inversion interview

Interviews are a great opportunity to evaluate one's own memory and composure during a conversation. They are also good for refreshing some basic terms and problems.
Recently I had to describe the priority inversion problem. Basic stuff :) a thread with lower priority is executed in place of a higher-priority thread. But why? Wait, ..., well, ..., shit, I do not remember.
Why isn't Wikipedia connected to my brain: 3 threads, 2 competing for a mutex, the third executing (the low-priority thread holds the mutex that the high-priority thread needs, while the medium-priority thread keeps the low-priority one off the CPU), and so on (Mars Pathfinder problem, priority inheritance, priority ceiling, disabling interrupts).
OK, but what if I want to simulate such a problem in a Windows environment?
After a quick search I found Priority Inversion and Windows NT Scheduler. I realized that:
1. the real-time priority class must be set for the process, to stop the kernel from altering thread priorities,
2. the example must run on one core (in the simple case of 3 threads),
3. on a single-core machine the system will hang (real-time priority), so the example can be run only on a multi-core machine (but the threads will use only one of the cores).
Example code for priority inversion:


using System;
using System.Threading;

class PriorityInversion
{
    // Lock shared by the low-priority (t1) and high-priority (t2) threads.
    private static readonly object o = new object();

    // Enters the critical section and holds it for 5 seconds.
    static void tf(object p)
    {
        string n = (string)p;
        Console.WriteLine(n + " critical section needed");
        lock (o)
        {
            Console.WriteLine(n + " critical section entered");
            Thread.Sleep(5000);
            Console.WriteLine(n + " after sleep");
        }
        Console.WriteLine(n + " critical section left");
    }

    // CPU-bound work that never touches the lock (medium-priority thread t3).
    static void tf2(object p)
    {
        string n = (string)p;
        Console.WriteLine(n + " start");
        for (int i = 0; i < 1000000; ++i)
            for (int j = 0; j < 1000000; ++j)
                ;
        Console.WriteLine(n + " stop");
    }

    static void Main(string[] args)
    {
        // Wait here so that affinity and priority class can be set first.
        Console.ReadLine();

        Thread t1 = new Thread(tf);
        t1.Priority = ThreadPriority.BelowNormal;
        Thread t2 = new Thread(tf);
        t2.Priority = ThreadPriority.AboveNormal;
        Thread t3 = new Thread(tf2);
        t3.Priority = ThreadPriority.Normal;

        // t1 takes the lock first; t2 then blocks on it while t3 hogs the core.
        t1.Start("t1");
        Thread.Sleep(10);
        t2.Start("t2");
        t3.Start("t3");
    }
}


The program calls Console.ReadLine() at the beginning to let the user restrict the process affinity to a single core and set the priority class of the process to real-time. If these conditions are not met, priority inversion will not appear.
Windows Task Manager can be used to change the affinity and priority. If you want to see the threads inside the process, use Process Explorer from Sysinternals (now on the Microsoft site).

Wednesday, 8 July 2009

Domain Specific Language for WWW with Irony - Part 2

In the last post I presented the use of Irony for a file download DSL. A library and a console application were prepared and presented on CodeProject. This time I have added a GUI (WinForms) and multithreading to make the application really useful. More in the CodeProject article.

Saturday, 6 June 2009

Domain Specific Language for WWW with Irony

Recently I published an article on CodeProject. I was influenced by the idea of Domain Specific Languages for specific tasks. In the article I present a DSL used to automate some WWW operations (GET, POST, file download). To solve the problem I used Irony as the DSL interpreter. More is available here. More about the great Irony project here.

Saturday, 30 May 2009

C# 4.0 (dynamic types, optional parameters, named arguments, covariance and contravariance handling)

The new features of C# 4.0 are presented in several places.
Ironically, the most visible is Doug Holland's post on Intel's blog site :) - a nice and very brief overview.
More info is in the Channel9 video by the C# GoF :). The most striking facts are:
  1. dynamic typing is for better Office interaction,
  2. optional parameters and named arguments are evil, provided for VB developers to ease Office development in C#,
  3. covariance and contravariance are the only interesting (good) changes in the language, but they should have appeared in previous versions of C#.
Additionally, there is more about covariance and contravariance on Eric Lippert's blog and more about dynamic programming on Chris Burrows' blog.

Thursday, 21 May 2009

volatile - what does it mean? (in C++, C, .NET and Java)

Well, it is a good feeling that my interests are also Herb Sutter's interests (in the programming area, of course :). In his article on Dr. Dobb's he presents the differences between the meanings of volatile in different worlds (C++/C vs. .NET/Java).

Main points:
  • volatile in C++ is about optimization of access to the variable: no optimization is allowed. Whether operations on nearby non-volatile variables can be moved before or after a volatile operation depends on the compiler.
  • operations on volatile in C++ do not guarantee atomicity; the solution is atomic in C++ (or e.g. atomic_int in C), available in Boost (but I cannot find it) and coming in C++0x.
  • volatile in managed environments (.NET, Java) does not allow some operations on nearby non-volatile variables to be moved - "ordinary reads and writes can't move upward across (from after to before) an ordered atomic read, and can't move downward across (from before to after) an ordered atomic write. In brief, that could move them out of a critical section of code, and you can write programs that can tell the difference.".
volatile in .NET/Java:
  • keeps order - suitable for lock-free programming,
  • allows some code optimizations - not good for interaction with hardware; but BTW managed environments do not let the programmer treat memory as a resource to write to wherever they like - unmanaged code has to be used (e.g. C++ with its volatile :).
volatile in C++/C:
  • might not keep order - depends on compiler,
  • does not allow code optimization - good for interactions with hardware.
Generally speaking, volatile in C++ means an unoptimizable variable (optimizations of access are not allowed), and in the managed worlds it means ordered atomics (other operations cannot move before/after, and operations on volatile are atomic - sometimes?).

Not sure points:
  • atomicity of volatile operations in .NET/Java - for which variables, e.g. what about 32-bit and 64-bit architectures?,
  • "These (volatile) are suitable for nearly all lock-free code uses, except for rare examples similar to Dekker's algorithm. .NET is fixing these remaining corner cases in Visual Studio 2010" - what are the problems?