Blog

Link locator and regular expressions

Submitted by mimec on 2013-03-07

Remember the old joke about solving problems using regular expressions? It turns out it never gets out of date. I'm just putting together the markup processor for WebIssues, and since it also uses the link locator, I decided to take a closer look at it. The "link locator" is basically a small utility function which takes a piece of plain text, detects any URLs which appear in it and converts everything to HTML with links.

The heart of the link locator is the call to preg_split with an appropriate regular expression which matches any valid links. I've been using the simplest thing that I could come up with. It recognizes emails, URLs and issue identifiers. An identifier is straightforward; it consists of a "#" and one or more digits. But what makes up an email address or a URL is much more difficult to define.
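The split-and-convert idea can be sketched roughly as follows. This is a minimal illustration in Python rather than the actual PHP code, and it only handles issue identifiers; the function name `locate_links` and the `issue.php?id=` URL template are assumptions made for the example. Python's `re.split` with a capturing group plays the role that preg_split with the PREG_SPLIT_DELIM_CAPTURE flag would play in PHP.

```python
import html
import re

# Deliberately simple pattern: only issue identifiers ("#" followed by digits).
# The capturing group makes re.split() keep the matched text in the result.
ISSUE_ID = re.compile(r"(#\d+)")


def locate_links(text, issue_url="issue.php?id={}"):
    """Convert plain text to HTML, turning issue identifiers into links.

    The issue_url template is a hypothetical placeholder.
    """
    parts = ISSUE_ID.split(text)
    out = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Odd indexes hold the captured matches (the links).
            href = issue_url.format(part[1:])
            out.append('<a href="{}">{}</a>'.format(html.escape(href), html.escape(part)))
        else:
            # Even indexes hold the plain text between links.
            out.append(html.escape(part))
    return "".join(out)
```

Escaping the plain-text fragments separately from the links is what makes it safe to emit the result as HTML.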

Initially I defined an email address as a sequence of non-whitespace characters starting and ending with a letter or digit and containing exactly one "@". It works, but gives false positives for meaningless strings like "a!@#$%^b". Looking for a better alternative I found this article. I decided to use a slightly modified version of the first regex, which allows the mailto: prefix and non-ASCII characters:

\b(?:mailto:)?[\w.%+-]+@[\w.-]+\.[a-z]{2,4}\b
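To see how this expression behaves, here is a small sketch in Python (not PHP). It assumes case-insensitive matching, which the post does not state explicitly, and relies on the fact that `\w` matches non-ASCII letters by default for Python text strings. The name `find_emails` is made up for the example.

```python
import re

# The email pattern from above; IGNORECASE is an assumption of this sketch.
EMAIL = re.compile(r"\b(?:mailto:)?[\w.%+-]+@[\w.-]+\.[a-z]{2,4}\b", re.IGNORECASE)


def find_emails(text):
    """Return all substrings of text that the pattern accepts as emails."""
    return EMAIL.findall(text)
```

With this pattern, `find_emails("Write to mailto:jan.kowalski@example.org or a!@#$%^b.")` finds only the real address, and the meaningless string is no longer reported.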

Finding the start of a URL is easy if we assume that it can only start with one of the following prefixes: http://, https://, ftp://, www. or ftp. The last two make it possible to skip the protocol for common addresses like www.mimec.org. But where exactly does the URL end? In the previous sentence, the final dot is clearly punctuation, not part of the URL, even though a dot can also be part of a URL. My original regex assumed that the URL must end with a letter, digit, or slash.

This also works in most cases, but it's not perfect. We can allow more characters at the end of the URL, but the really interesting case is handling parentheses. Consider these two examples:

  • Visit my website (www.mimec.org).
  • For more information, visit http://en.wikipedia.org/wiki/Tool_(band).

In the first sentence, the closing parenthesis is not part of the URL, but in the second it is. That's obvious to a human reader, but what about a machine? Fortunately someone already invented a regex which solves this problem. The final regular expression which I'm going to use looks like this (split into three lines for readability):

(?:\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)|\\\\)
(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$?!:,.-])*
(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$])

I added file:// and \\ prefixes (the latter is for UNC paths, like \\server\folder\file.doc) and added backslash as a valid character. They are already recognized by the Desktop Client as requested by one of the users. There is no reason not to handle them in the Web Client as well. Even though most browsers block access to such URLs, they can still be copied and pasted more easily.
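The three-part expression can be checked against the two example sentences with a short Python sketch (the original code is PHP; `find_urls` is a name invented for this illustration). The three lines are simply concatenated into one pattern: a prefix, any number of "middle" characters or balanced parenthesized groups, and a final character or group that may not be trailing punctuation.

```python
import re

# The three lines of the expression above, joined into a single pattern.
URL = re.compile(
    r"(?:\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)|\\\\)"
    r"(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$?!:,.-])*"
    r"(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$])"
)


def find_urls(text):
    """Return all substrings of text that the pattern accepts as URLs."""
    return URL.findall(text)
```

The closing parenthesis is only accepted when a matching opening parenthesis was seen inside the URL, which is why "(www.mimec.org)" yields www.mimec.org while the Wikipedia link keeps its "(band)" suffix.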

While testing the regular expressions I made another interesting observation. When using character classes such as "\w" to match against a UTF-8 string, make sure to include the "u" modifier in the expression, for example "/(\w+)/u". Otherwise the result may break the UTF-8 encoding. For example, the Polish letter "ć" is represented in UTF-8 as two bytes, 0xC4 and 0x87, which a single-byte encoding such as Windows-1252 displays as "Ä‡". The first one is a "word" character, and the second is not, so a regular expression running without the "u" modifier would break the string in the middle of the multi-byte character. Even the innocent "\s" pattern matches the "\xA0" character, which can be part of a multi-byte character, so be careful.

Note that it took a bit of googling until I found information about that "u" modifier. The PHP manual should be more specific about it. What's worse, it seems that it's not always supported, even in recent versions of PHP. Just search for "this version of PCRE is not compiled with PCRE_UTF8 support" and you will see what I mean. Well, nothing is perfect, and PHP certainly isn't...
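The broken multi-byte character can be reproduced outside PHP. The following Python sketch does not call PCRE at all; it only simulates what a byte-oriented engine sees by reinterpreting the UTF-8 bytes one character per byte. The choice of Windows-1252 as the single-byte encoding and both function names are assumptions of the example.

```python
import re


def words_as_unicode(text):
    """Match word characters on the decoded text (the "u" modifier case)."""
    return re.findall(r"\w+", text)


def words_as_single_bytes(text, encoding="cp1252"):
    """Simulate matching without UTF-8 awareness.

    The UTF-8 bytes are reinterpreted as one character per byte, so a
    two-byte letter turns into two unrelated characters.
    """
    misread = text.encode("utf-8").decode(encoding)
    return re.findall(r"\w+", misread)
```

For the Polish word "ćma", the Unicode-aware version returns the whole word, while the simulated single-byte version splits it after the first byte of "ć".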

Text formatting in WebIssues

Submitted by mimec on 2013-02-18

Recently I wrote about version 1.1 of WebIssues and my plans to introduce issue descriptions and formatting of comments and descriptions. I also listed various markup languages which I considered for using in WebIssues. But first, let's look at how text is handled in the current version of WebIssues.

WebIssues currently uses plain text for comments. All whitespace characters, including indentation and line breaks, are preserved, making it easy to paste fragments of code directly into comments without breaking their formatting. At the same time, WebIssues wraps long lines, making it possible to write long paragraphs of text which are displayed correctly regardless of the width of the window. This basically corresponds to the "white-space: pre-wrap" CSS style. In addition, external URLs and issue identifiers are automatically converted to links.

The key idea behind adding extended formatting options is to preserve compatibility with this "plain text" mode. It should be possible to edit an existing comment, enable formatting and add some markup to the existing text without breaking existing formatting. Now the problem is that most existing markup languages either ignore whitespace (for example HTML, unless wrapped in a <pre> block) or handle it in a specific way (for example, double line breaks are converted to paragraphs, indentation indicates a block of code, etc.). I'm not saying that they are wrong; this often makes sense when copying text from text files or plain text emails. However, I don't want to break habits of existing users of WebIssues. I would like to treat spaces and line breaks identically, whether formatting is enabled or not. Thanks to this, pieces of code will not break, even if they are not marked using special tags. These tags will only be used for decorative purposes, for example, by using different background color and enabling syntax highlighting.

There are generally two kinds of markup used in existing languages: punctuation (brackets, quotes, asterisks, etc.) and tags. Punctuation is used for inline formatting in various flavors of Wiki and languages such as Markdown or Textile, for example to indicate bold or italic text, though there is no single standard. It is also used for block formatting, for example a leading '>' may indicate a quote. HTML tags are commonly used by various languages, in addition to other format specifiers, though they are useful mostly for advanced formatting. Finally, various flavors of BBCode use custom tags, which are similar to HTML tags, but simpler. I decided to use a combination of punctuation for inline formatting and custom tags for block formatting. It's questionable whether yet another language should be invented when there are so many already, but I think it's going to be intuitive for everyone, and thanks to the embedded markItUp editor, there will be no need to remember it.

The following inline formatting tags will be supported: **bold**, __italic__, `monospace text` and [URL custom links]. The * and _ characters appear commonly in technical text, so they need to be doubled to avoid false positives. Link syntax is quite similar to Wiki external links, however internal links can be created the same way, for example: [#123 some issue]. In the future it will be possible to introduce real Wiki functionality (where names can be used instead of numeric identifiers).
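A converter for these four inline constructs could look roughly like the sketch below. It is written in Python for illustration, not taken from WebIssues; the pattern, the function names and the `issue.php?id=` URL template are all assumptions, and nesting of inline tags is not handled.

```python
import html
import re

# Hypothetical pattern for the four inline constructs described above.
INLINE = re.compile(
    r"\*\*(?P<bold>.+?)\*\*"
    r"|__(?P<italic>.+?)__"
    r"|`(?P<mono>[^`]+)`"
    r"|\[(?P<target>[^\s\]]+)\s+(?P<label>[^\]]+)\]"
)


def render_match(match, issue_url):
    """Render a single inline construct as HTML."""
    if match.group("bold") is not None:
        return "<strong>{}</strong>".format(html.escape(match.group("bold")))
    if match.group("italic") is not None:
        return "<em>{}</em>".format(html.escape(match.group("italic")))
    if match.group("mono") is not None:
        return "<code>{}</code>".format(html.escape(match.group("mono")))
    target = match.group("target")
    if re.fullmatch(r"#\d+", target):
        # Internal link: "#123" points at an issue (URL template is made up).
        target = issue_url.format(target[1:])
    return '<a href="{}">{}</a>'.format(html.escape(target), html.escape(match.group("label")))


def render_inline(text, issue_url="issue.php?id={}"):
    """Escape plain text and convert inline markup to HTML."""
    out = []
    pos = 0
    for match in INLINE.finditer(text):
        out.append(html.escape(text[pos:match.start()]))
        out.append(render_match(match, issue_url))
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)
```

Because the markers are doubled, single * and _ characters in text such as a_b*c pass through untouched. A real implementation would also need to validate the link target before emitting it.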

Three different block level formatting tags will be supported. A [code][/code] pair will indicate a block of code, with optional syntax highlighting based on Google's prettify. A [quote][/quote] pair will indicate a quote with an optional header. Finally, a [list][/list] pair will indicate a bullet list, where each item starts with one or more * (multiple asterisks indicate nested levels). Unlike automatic lists used by many markup languages, explicit tags will make it easy to clearly indicate where the list starts and ends. Also, it will be possible to freely mix and nest all three kinds of tags.
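As an illustration of the nested list syntax, here is a minimal Python sketch that turns the lines found between [list] and [/list] into nested HTML lists. The function name and the decision to clamp an item to at most one level deeper than the previous one are assumptions of this example, not part of the planned syntax.

```python
import html


def render_list(lines):
    """Convert lines such as "* a" and "** b" into nested <ul> markup."""
    out = []
    depth = 0
    for line in lines:
        stripped = line.strip()
        stars = len(stripped) - len(stripped.lstrip("*"))
        if stars == 0:
            continue  # Not a list item; ignored in this sketch.
        # An item can be at most one level deeper than the previous one.
        level = min(stars, depth + 1)
        if level > depth:
            out.append("<ul>")
        else:
            # Close the previous item and any nested lists that have ended.
            out.append("</li>" + "</ul></li>" * (depth - level))
        out.append("<li>" + html.escape(stripped[stars:].strip()))
        depth = level
    if depth:
        out.append("</li>" + "</ul></li>" * (depth - 1) + "</ul>")
    return "".join(out)
```

Since the list boundaries are given explicitly by the tags, the converter never has to guess where a list ends, which keeps the logic this short.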

I'm now playing with the prototype of the converter, and I may still make some minor changes, but so far I'm rather satisfied with the result. So when is version 1.1 going to be released? By the end of this year - that's all I can promise for now. I will probably release some beta version in a few months. But I still have my book and various other things to do, so don't expect miracles.

WebIssues 1.1

Submitted by mimec on 2013-02-08

Before I get to the main topic, just a short update on my novel :). I wrote a few more chapters and I have some new ideas, but I feel that I need a break. When I first started writing back in 2011, I was so involved that I could write all night, but now I have to get up earlier and generally I have too many things to do to be able to fully concentrate on this. So it's becoming a somewhat tedious process, quite like the programming that I was trying to escape from. However, the result is still quite good and I'm certainly not going to give it up.

Anyway, I can't ignore the fact that the idea of WebIssues 1.1 is growing in me. I've already had some items on the roadmap, but there are so many of them that I will have to split them into two releases. I'm probably going to postpone all improvements related to users, groups and permissions until version 1.2. The main improvement in version 1.1 will be issue descriptions. It's something that's clearly missing compared to other bug tracking systems. Of course, the first comment can act as a description, so the change is a bit cosmetic, but the ability to provide the description directly when creating an issue will certainly be an improvement. During the upgrade from version 1.0, the first comment will be automatically converted to a description.

Also projects will now have a description. There will be a project summary page, which in the future may contain other useful information, such as statistics, recent issues, etc.

The last (but not least) improvement in this area will be the ability to use simple formatting in both comments and descriptions. And that's an interesting problem, because there are lots of different markup languages that can be used to add formatting to a piece of text. Each existing standard has its advantages and disadvantages:

HTML
Powerful and good for CMS, blogs, etc., but it's difficult for non-geeks to use. And it's even more difficult to display it correctly. A naive implementation opens the possibility for XSS attacks. Simple tools like kses still won't ensure that the markup is valid (e.g. check for unbalanced tags). More advanced tools like htmlpurifier are simply monstrous.
Textile / Markdown
They are quite different, but based on a similar idea: make the source text look as natural as possible. I prefer the latter, although they both seem to make more sense for writing longer articles (especially technical) than simple descriptions and comments.
Wiki markup
The main problem is that there is simply no such thing as a standard Wiki syntax. Although there are many similarities, each implementation has its own flavor. Also note that adding true Wiki support to WebIssues (i.e. being able to create cross-links based on titles, not just issue IDs) would be an entirely different story.
BBCode
Simple and widely used (also with many different flavors). On the other hand, square brackets don't seem more intuitive than angle brackets used in HTML.

This is a broader topic and I will write more about it in a separate post. So far I'm leaning towards a subset of Wiki syntax with some modifications, but I have to think more about it. And don't even get me started on the so called "WYSIWYG" editors. They are bloated and/or buggy and not 100% portable. I think I'm going to create something based on markItUp which is small, simple and easy to customize.

Yet another area of functionality that sooner or later must be (and will be) added to WebIssues is support for inbound emails. Some thoughts have been circling around and a few different people have offered to help me implement this. If something gets done then I will include it in one of the next releases, but for now I can't promise anything.

Writing

Submitted by mimec on 2013-01-12

I want to write. Well, of course, I do; but I don't mean programs and technical documentation, but novels. My New Year's resolution is to finish the book that I started some time ago and get it published. Why this sudden change of mind? Just a few months ago I wanted to start a business based on WebIssues. I even managed to briefly bring the attention of the management of the company I work for to it. But their idea of investing very little in order to hopefully get some profit wouldn't make too much sense. My own vision wasn't downright rejected, but considering all the political aspects that rule a corporation like this, and my complete lack of influence on these things, I can't realistically expect that this is ever going to happen.

Obviously, making a living from writing is an even more insane idea. It's a very demanding market, and in Poland also quite a narrow one. It also requires a lot of pure luck, probably even more than running a successful business. Not to mention that writing a novel requires huge amounts of time. But the real problem is that I'm really starting to hate programming. Commercial or open source, it's tedious, repetitive and rarely creative. And writing is not a new idea. I wrote some stories as a child. In high school I started writing a book with two friends; it didn't last long, but it was a lot of fun. But this time it's different, because I already have most of the plot in my mind, so the ideas are there waiting to be put on paper.

I already mentioned the novel I'm writing once or twice, but perhaps this time I will shed some more light on it. The idea came to my mind in summer 2011, while I was reading Neal Stephenson's Snow Crash, but it was also influenced by Lev Grossman's The Magicians which I read shortly before. It's basically a cyberpunk story, taking place largely in two different virtual worlds, but it also has some elements of contemporary fantasy and techno-thriller. The main characters are a few students of a school for young hackers, which is called the Academy of Magic, because in a virtual world, the boundaries between hacking and magic are blurred for the uninitiated. As my younger brother described it, when I told him about the novel today, it's like a "rolled pancake" :). I admit that mixing genres is risky, but if I do it well, maybe something interesting will come out of this.

And by the way, yesterday was my son's first birthday :). I must publish some new photos soon because I haven't done that in a while.

Habits and standards

Submitted by mimec on 2012-12-16

Yesterday was the seventh anniversary of mimec.org, but I will not elaborate on that. It suffices to say that the last year was very different from the previous ones. My son Adam changed from a blurry ultrasonographic image to a little boy who runs around the house. There is no time for anything. I can hardly keep up with my paid job, not to mention the open source projects, but I still managed to make four minor releases of WebIssues, and one release each of Saladin (with another one pending), Fraqtive and Descend.

A few days ago I finally got a new laptop. It has a 15" Full HD display, which for some reason is very rare these days, a powerful CPU and GPU and plenty of RAM. Minecraft runs at about 50 FPS at full screen with far viewing distance :). The bad news, though, is that my company ran out of Windows 7 licenses, and I was forced to install Windows 8. I'm not going to rant about it, because enough has been said about it already. After installing the English language pack and removing the metro-garbage from the start menu, I'm getting used to it without having to change my habits too much. It's just hilarious that the now so called "desktop" applications suddenly became legacy and are only temporarily supported for backward compatibility. It reminds me of how all existing applications suddenly became "unmanaged" when .NET was created, as if they were crippled in some way. Microsoft suggested that in a few years all applications would become "managed", and finally support for those "unmanaged" ones would be dropped. Of course I don't mind .NET; it's just the kind of marketing speech that makes me laugh.

But when I saw Office 2013 with the black and white UI and icons designed for displays that support only 8 colors, it actually made me a bit upset. For a long time Office was setting the user interface design standards for a lot of Windows applications, especially regarding toolbars and menus, because the default ones always had a very plain look. Obviously I also always tried to keep up with the trends. Over ten years ago, in Grape3D, I used third party menu and toolbar classes for MFC which mimicked the flat, semi-transparent highlighting style known from Office XP. Later I wrote my own set of classes which broke out of the Office trends for a while and looked more like IE 6. But soon after that Microsoft released Office 2003 with the spectacular bright blue and orange UI which automatically changed its colors to match the Windows XP theme. Whether it looked good or not, it became a long-time standard. Just take a look at version 0.9 of the WebIssues Client, or the so called "modern" Qt style which I wrote in 2008, and you will know what I mean.

The so called "ribbon" introduced in Office 2007 was something that people complained and ranted about nearly as much as the Metro UI in Windows 8, but it eventually turned out to be a very good idea. It was not just a cosmetic change, but something entirely new. Currently all my programs use a similar concept, which is available as part of the XmlUi component. At the same time the bright colors were toned down and the whole thing looked equally good with classic Windows style as with Luna and Aero. But now that I'm getting more and more used to Windows 8 and Office 2013, even the soft gradients and slightly rounded corners of XmlUi are beginning to look a bit odd. So what is the next logical step? Should we, developers, all turn to creating rectangular, black and white UI? How soon will Microsoft change its mind and what will be the next "standard"? Or perhaps it's time to stop bothering?