QTextBrowser vs. QWebView

Submitted by mimec on 2013-07-12

There are two classes in Qt that can display HTML content: QTextBrowser and QWebView. They seem similar, but in fact they are quite different. The most obvious difference is that QTextBrowser is part of QtGui, while QWebView belongs to the QtWebKit module. QtWebKit is a pretty big library, about 12.5 MB (32-bit Windows DLL), which is more than QtCore and QtGui combined. I'm going to try to explain why there's such a difference.

Although QTextBrowser can display a piece of HTML content, technically it's a rich text viewer, not an HTML viewer. It may seem like the same thing, and in fact there are many similarities. A rich text document consists of blocks of text (similar to HTML paragraphs) and frames (similar to HTML divs). It also supports tables, lists and images. However, the layout of the rich text document is much simpler than that of HTML. The text of the document simply flows from top to bottom. There is no concept of absolute positioning, floating frames, etc. You can forget about tableless layouts and most CSS styles. Even the margin and padding settings are not always respected and it takes some experimentation to get the spacing right.

The QTextBrowser can export the rich text in HTML format, and import it back without losing information. However, this is not fully standards-compliant HTML, and when opened in a regular browser, it may not necessarily look the same as in the QTextBrowser. What's worse, although HTML can be imported from an external source, only a limited subset of HTML tags, attributes and CSS properties will be recognized, and the document will almost surely look very different than in a browser.

Side note: The difference between the rich text model and the HTML model is not specific to Qt. Many common word processors, including MS Word, have a similar limitation. You can import HTML to Word, but the layout of a web page will not be strictly preserved. And because MS Outlook uses the same engine as Word to render HTML emails, it also only supports a limited subset of HTML and CSS. This makes it difficult to create HTML emails which look good in all email clients including Outlook.

This obviously doesn't mean that you shouldn't use QTextBrowser at all. First of all, it's got an excellent cursor-based API which lets you create rich documents very efficiently without writing a single piece of HTML markup. This is excellent for creating various reports, etc. Just make sure that the document is not connected to the browser while you make a lot of changes to it, otherwise updating it will be slow. Also don't forget that rich text is supported out of the box by many other widgets, for example QLabel or QToolTip. Not to mention that you can also edit rich text using the QTextEdit widget (which is, in fact, inherited by QTextBrowser).

QWebView, on the other hand, is a full-blown, standards-compliant HTML browser. Actually it's based on the same code which powers the Chrome and Safari browsers. It works natively with HTML and supports all tags, attributes and CSS properties. It also has a built-in JavaScript interpreter. You should definitely use QWebView when you need to display external web content, create complex layouts or use dynamic, scripted content. This comes at the cost of the extra 12.5 MB linked library and slightly higher resource usage. It's hard to measure the difference in performance. Simple documents work very fast in both controls, and complex documents can only be handled well by QWebView.

Despite many advantages, QtWebKit is far from being perfect. In the next part I will write about various bugs and problems I've encountered so far while porting the WebIssues Desktop Client from QTextBrowser to QWebView.

Positioning multiple windows in Qt

Submitted by mimec on 2013-05-28

It's been a while since the last post, but I've been quite busy with WebIssues. Now that the first beta of version 1.1 is released, it's time to catch up with other things. Actually I started writing this post some time ago, but then my HDD crashed, so I had to do it once again.

QWidget provides the saveGeometry/restoreGeometry pair of methods for a convenient way of storing and restoring the position of a window. This works fine when our application has just one top level window. But what if the application can open multiple windows of a given type? Consider a chat or email application, where each conversation or message can be opened in a separate window. It should be possible to store the position and size of the windows, but we also want to make sure that when the user opens multiple windows, they are displayed at slightly different positions, so that they don't cover one another entirely.

Doing this manually is not an easy task. Besides handling the maximized state of the window, restoreGeometry also ensures that the window is not displayed off screen when a display device is disconnected or its resolution has changed since the geometry was saved. Instead of implementing a custom mechanism, we can take advantage of saveGeometry/restoreGeometry, with a slight modification which makes multiple windows behave correctly.

The idea is that in addition to the geometry (which is serialized as a QByteArray) we also store a boolean flag called "offset". Whenever a new window is opened, we remember its position and set the offset to true, meaning that when another window is opened, an offset should be added to the stored position. When a window is closed, we remember its final position and set the offset to false. Another window will then be opened at the exact position of the old one. One way of doing this is handling the show and hide events:

void MyWindow::showEvent( QShowEvent* e )
{
    if ( !e->spontaneous() )
        storeGeometry( true );
}

void MyWindow::hideEvent( QHideEvent* e )
{
    if ( !e->spontaneous() )
        storeGeometry( false );
}

void MyWindow::storeGeometry( bool offset )
{
    QSettings settings;
    settings.setValue( "MyWindowGeometry", saveGeometry() );
    settings.setValue( "MyWindowOffset", offset );
}

Note that we ignore spontaneous events, which are fired when the window is minimized or restored by the user. All that's left is restoring the window's geometry. This can be done, for example, in the constructor of the window's class:

QSettings settings;
restoreGeometry( settings.value( "MyWindowGeometry" ).toByteArray() );
if ( settings.value( "MyWindowOffset" ).toBool() ) {
    QPoint position = pos() + QPoint( 40, 40 );
    QRect available = QApplication::desktop()->availableGeometry( this );
    QRect frame = frameGeometry();
    if ( position.x() + frame.width() > available.right() )
        position.rx() = available.left();
    if ( position.y() + frame.height() > available.bottom() - 20 )
        position.ry() = available.top();
    move( position );
}

We add an arbitrary offset of 40 pixels to the window's X and Y position. Then we check whether the window would extend past the right or bottom edge of the screen; if it would, we move it to the left or top edge, respectively.
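
The wrap-around rule can be separated from Qt, which makes it easy to check with plain numbers. The following is only a sketch: Point, Rect and cascadePosition are hypothetical stand-ins (not Qt classes), and right() and bottom() are defined the way QRect defines them, as the last pixel inside the rectangle.

```cpp
// Hypothetical stand-ins for QPoint and QRect so the sketch compiles without Qt.
struct Point { int x; int y; };

struct Rect {
    int left, top, width, height;
    // Like QRect, right() and bottom() are the last pixel inside the rectangle.
    int right() const { return left + width - 1; }
    int bottom() const { return top + height - 1; }
};

// Shift a restored window position by `step` pixels, then wrap to the left or
// top edge of the available area if the frame would cross the right or bottom edge.
Point cascadePosition(Point restored, Rect frame, Rect available, int step = 40)
{
    Point position{ restored.x + step, restored.y + step };
    if (position.x + frame.width > available.right())
        position.x = available.left;
    // The 20-pixel bottom margin is taken from the snippet above.
    if (position.y + frame.height > available.bottom() - 20)
        position.y = available.top;
    return position;
}
```

With a 1920x1080 available area and an 800x600 frame, a window restored at (100, 100) ends up at (140, 140), while one restored at (1100, 100) wraps horizontally to (0, 140).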

Note that this is not a perfect solution. If the user opens and closes the windows in a certain order, they will overlap (I leave it as an exercise to the reader to come up with such a scenario). But it nicely handles the two most common scenarios: reopening a closed window in the same location and ensuring that multiple windows, opened one after another, have slightly different positions.
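
The two common scenarios can be traced with a small model of the stored state. This is a sketch under simplifying assumptions: StoredGeometry is a hypothetical in-memory replacement for the two QSettings keys, only the position is tracked, and the screen edge check is left out.

```cpp
// Hypothetical in-memory stand-in for the two QSettings keys used above.
struct StoredGeometry { int x; int y; bool offset; };
struct Position { int x; int y; };

// Mirrors the constructor followed by showEvent(): restore the position,
// shift it if the flag is set, then store it again with offset = true.
Position openWindow(StoredGeometry& stored, int step = 40)
{
    Position p{ stored.x, stored.y };
    if (stored.offset) {
        p.x += step;
        p.y += step;
    }
    stored = StoredGeometry{ p.x, p.y, true };
    return p;
}

// Mirrors hideEvent(): remember the final position with offset = false.
void closeWindow(StoredGeometry& stored, Position finalPosition)
{
    stored = StoredGeometry{ finalPosition.x, finalPosition.y, false };
}
```

Starting from a stored position of (100, 100) with the flag cleared, two consecutive openWindow calls give (100, 100) and (140, 140). Closing the second window and opening a new one gives (140, 140) again.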

The markItUp text editor

Submitted by mimec on 2013-04-03

As I already explained, version 1.1 of WebIssues will allow using a special markup language when editing comments and descriptions. The syntax will be a hybrid of BBCode, Wiki and various other markup languages. Obviously it's hard to expect that the users will remember Yet Another Markup Language. Instead, they will use a familiar toolbar and key shortcuts to make selected text bold or italic, create a hyperlink or insert a block of code. I decided to use markItUp because it's simple, lightweight (the original uncompressed script is just 20 KB long) and fully customizable. Unlike some other editors, it's not designed for any particular markup language (like markdown or wiki syntax), but lets you design a custom toolbar with whatever markup you need.

After playing with markItUp for a while, I decided to customize it a bit more by modifying the script. It could already generate a preview using AJAX and has multiple ways of showing it - in a popup window, embedded iframe or a custom HTML element. I decided to use the custom element, but I wanted it to be shown dynamically when the preview is first invoked, just like in the two other modes. I also integrated it with prettify, about which I wrote last time, so that syntax highlighting works in the preview.

I also slightly changed the way the markup is added. First, in my version, the openWith/closeWith text is not added to empty lines (or lines with nothing but whitespace). Second, the closeBlockWith text is added before any trailing newlines and other whitespace. It works better this way, especially if you want to apply bold or italic to multiple lines (each line is treated as a separate block, so it must be wrapped in separate bold/italic tags). Finally I removed the special handling for Ctrl and Shift keys when clicking on the toolbar. It's hard to remember and can be confusing, so I decided to simply remove it.
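
The line-by-line wrapping rules described above can be sketched as follows. This is not the modified markItUp script (which is JavaScript), just an illustration of the behavior in C++. The function name wrapLines and the choice of spaces, tabs and carriage returns as whitespace are assumptions.

```cpp
#include <string>

// Wrap every non-blank line of `text` in the open/close markers.
// The closing marker is placed before any trailing whitespace on the line.
std::string wrapLines(const std::string& text,
                      const std::string& open, const std::string& close)
{
    const std::string whitespace = " \t\r";
    std::string result;
    std::size_t start = 0;
    while (start <= text.size()) {
        std::size_t end = text.find('\n', start);
        const bool lastLine = (end == std::string::npos);
        if (lastLine)
            end = text.size();
        const std::string line = text.substr(start, end - start);
        const std::size_t lastChar = line.find_last_not_of(whitespace);
        if (lastChar == std::string::npos) {
            result += line; // blank line: leave it untouched
        } else {
            result += open + line.substr(0, lastChar + 1) + close
                    + line.substr(lastChar + 1); // keep trailing whitespace
        }
        if (lastLine)
            break;
        result += '\n';
        start = end + 1;
    }
    return result;
}
```

For example, wrapLines("one\n\ntwo  \n", "**", "**") leaves the empty line alone and puts the closing marker before the two trailing spaces on the last line.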

Just like with Prettify, I minified the whole thing using the Closure Compiler, this time in simple mode, because advanced mode doesn't work too well with jQuery plug-ins. I also had to replace an eval() with a direct call to the preview() function, because eval() wouldn't work with minified code. The final script is just 10 KB long. The unminified version of both this script and my version of Prettify are available in the trunk/tools subdirectory of the WebIssues SVN repository, in case you're interested.

Note that obviously I had to implement a similar editor in the Desktop Client, which uses a native QPlainTextEdit to edit the comments and descriptions. JavaScript was not an option here, but adding a few toolbuttons and reimplementing the relevant function in C++ was very straightforward. By the way, I recently also rewrote the entire markup processor, which converts the markup to HTML and also exists in two versions, PHP and C++. It was the third or fourth time I wrote it from scratch, but this time I decided to do it "the right way". Instead of a very long, convoluted loop with a state machine, I decided to use a recursive descent parser, which is very simple, because the markup language has an LL(1) grammar with just a few production rules. The code is now slightly longer, but incomparably easier to understand, plus I fixed a few remaining bugs.
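
To show what a recursive descent parser looks like, here is a minimal sketch for a made-up mini-markup. It is not the WebIssues markup language or its parser. The grammar, the "*" and "_" delimiters and the names MarkupParser and markupToHtml are all assumptions chosen for illustration.

```cpp
#include <string>

// Recursive descent parser for a hypothetical mini-markup:
//   text := ( '*' text '*' | '_' text '_' | CHAR )*
// One character of lookahead is enough to pick the rule to apply.
class MarkupParser
{
public:
    explicit MarkupParser(const std::string& input) : m_input(input), m_pos(0) {}

    std::string parse() { return parseText(NoTerminator); }

private:
    enum { NoTerminator = 256 }; // outside the range of any char value

    // Parse characters until the terminator or the end of input is reached.
    std::string parseText(int terminator)
    {
        std::string html;
        while (m_pos < m_input.size() && m_input[m_pos] != terminator) {
            const char c = m_input[m_pos];
            if (c == '*') {
                html += parseSpan('*', "b");
            } else if (c == '_') {
                html += parseSpan('_', "i");
            } else {
                html += escape(c);
                ++m_pos;
            }
        }
        return html;
    }

    // Parse a delimited span; an unclosed span simply ends at the end of input.
    std::string parseSpan(char delimiter, const std::string& tag)
    {
        ++m_pos; // consume the opening delimiter
        const std::string inner = parseText(delimiter);
        if (m_pos < m_input.size())
            ++m_pos; // consume the closing delimiter
        return "<" + tag + ">" + inner + "</" + tag + ">";
    }

    // Escape the characters that are special in HTML text.
    static std::string escape(char c)
    {
        switch (c) {
        case '&': return "&amp;";
        case '<': return "&lt;";
        case '>': return "&gt;";
        default:  return std::string(1, c);
        }
    }

    std::string m_input;
    std::size_t m_pos;
};

std::string markupToHtml(const std::string& markup)
{
    return MarkupParser(markup).parse();
}
```

Each grammar rule maps to one function, and nesting falls out of the recursion: markupToHtml("a *b _c_* <d>") produces bold text containing an italic span, with the angle brackets escaped.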

Some of you might wonder why I didn't decide to implement a WYSIWYG editor in WebIssues. They are large, complex beasts, which may be useful for large CMS systems designed to be used by non-technical people who like to edit their articles as if they were using MS Word. For a relatively small project like WebIssues, it would be overkill to include a word processing package in it. Besides, these editors don't always produce valid HTML, and they don't work consistently across various browsers (not to mention the Desktop Client). What's worse, despite the tempting naive implementation (which uses htmlspecialchars_decode to circumvent WebIssues' built-in XSS protection!), it's actually very difficult to sanitize and validate the resulting HTML. Instead, WebIssues will still support the old style plain text format, with no special processing (except for turning URLs into links), which indeed is truly WYSIWYG. Depending on the level of technical skills of the majority of your users, you will be able to choose either plain text or text with markup as the default format.

Syntax highlighting with Prettify

Submitted by mimec on 2013-03-20

In version 1.1 of WebIssues it will be possible to use the [code] tag in comments and descriptions. Text included in this tag will be displayed using a monospace font, with all formatting disabled. This is useful for including fragments of output, log files, etc., but it can also be used for code snippets; after all it's an issue tracking software. Developers generally like their code colored, so all kinds of editors and other development tools support syntax highlighting for various languages.

Of course creating a syntax highlighter is a very complex task, especially given the vast number of programming languages with very different syntax. No wonder that one of the popular tools, SyntaxHighlighter, contains about 100 kilobytes of (partially minified) JavaScript code, slightly more than jQuery. Another example is GeSHi, a 200 kilobyte PHP class with 3 megabytes (!) of language definitions. But syntax highlighting is just decoration, not a key feature, so I want to avoid having to download tons of .js and .css files just to achieve this.

The problem is that these tools try to be much too thorough. I don't care if every single PHP function is highlighted, as long as the most important keywords are, along with comments, strings and fragments of HTML that are embedded into the PHP file (which in turn can contain embedded CSS and JavaScript). That's exactly what Google Code Prettify does. It is actively maintained by folks from Google, and it's used by Google Code itself and Stack Overflow, among others. I decided to use it as well.

The version which is included in the current development version of WebIssues is just 16 kilobytes of code. I removed a few unnecessary features and incorporated some of the additional languages into the main file. Currently supported languages include HTML and XML, C and C++, C#, Java, Bash, Python, Perl, Ruby, JavaScript, CSS, SQL, Visual Basic and PHP. I also packed the final script using the Closure Compiler (also from Google), which made the file almost four times smaller.

When I was looking for a syntax highlighter, initially I was thinking about doing it server side, using PHP code. It didn't occur to me that this can be done on the client using JavaScript. At first the idea seemed strange to me. However it's actually great and can significantly reduce the server load. From the user's perspective it doesn't really matter. After all, have you ever noticed that Stack Overflow highlights code snippets on the fly using JavaScript?

There is yet another benefit of using client script instead of PHP: it is possible to highlight code also in the Desktop Client. Otherwise the entire mechanism would have to be reimplemented in C++. Version 1.0 of the Desktop Client displays issue details using QTextBrowser, which doesn't support JavaScript and has very limited support for HTML and CSS. But version 1.1 will use QtWebKit, the Qt port of the same engine which powers Chrome and Safari. The advantage is that issue details will have the same look and feel in both the Web Client and the Desktop Client, and obviously it's possible to embed Prettify. I found some minor issues with QtWebKit, probably worth a separate post, but generally, everything works very well.

Link locator and regular expressions

Submitted by mimec on 2013-03-07

Remember the old joke about solving problems using regular expressions? It turns out it never gets out of date. I'm just putting together the markup processor for WebIssues, and since it also uses the link locator, I decided to take a closer look at it. The "link locator" is basically a small utility function which takes a piece of plain text, detects any URLs which appear in it and converts everything to HTML with links.

The heart of the link locator is the call to preg_split with an appropriate regular expression which matches any valid links. I've been using the simplest thing that I could come up with. It recognizes emails, URLs and issue identifiers. An identifier is straightforward; it consists of a "#" and one or more digits. But what makes an email address or a URL is much more difficult to define.
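
The splitting step can be sketched in C++ with std::regex. This is only an illustration, similar in spirit to preg_split with captured delimiters, not a port of the actual PHP function. Piece and splitLinks are hypothetical names, and only the issue identifier pattern is used in the example.

```cpp
#include <regex>
#include <string>
#include <vector>

// One fragment of the input: either plain text or something the pattern matched.
struct Piece { std::string text; bool isLink; };

// Split `text` into plain pieces and pieces matched by `pattern`, in order.
std::vector<Piece> splitLinks(const std::string& text, const std::regex& pattern)
{
    std::vector<Piece> pieces;
    std::size_t last = 0;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        const std::size_t start = static_cast<std::size_t>(it->position());
        if (start > last)
            pieces.push_back({ text.substr(last, start - last), false });
        pieces.push_back({ it->str(), true });
        last = start + static_cast<std::size_t>(it->length());
    }
    if (last < text.size())
        pieces.push_back({ text.substr(last), false });
    return pieces;
}
```

Calling splitLinks("See #12 and #345.", std::regex("#\\d+")) yields five pieces, of which "#12" and "#345" are marked as links. Each piece can then be escaped or turned into an anchor separately.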

Initially I defined an email address as a sequence of non-whitespace characters starting and ending with a letter or digit and containing exactly one "@". It works, but gives false positives for meaningless strings like "a!@#$%^b". Looking for a better alternative I found this article. I decided to use a slightly modified version of the first regex, which allows the mailto: prefix and non-ASCII characters:

\b(?:mailto:)?[\w.%+-]+@[\w.-]+\.[a-z]{2,4}\b
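
The same pattern can be tried out in C++ with std::regex. Two things here are assumptions: the post doesn't show the modifiers used in PHP, so the case-insensitive flag is a guess, and containsEmail is a hypothetical helper name. Also note that with std::regex on char strings, \w covers ASCII only, so the non-ASCII part is not reproduced.

```cpp
#include <regex>
#include <string>

// Returns true if `text` contains something the email pattern accepts.
bool containsEmail(const std::string& text)
{
    // The pattern from the post; the icase flag is an assumption.
    static const std::regex email(
        R"(\b(?:mailto:)?[\w.%+-]+@[\w.-]+\.[a-z]{2,4}\b)",
        std::regex::ECMAScript | std::regex::icase);
    return std::regex_search(text, email);
}
```

An ordinary address such as john.doe@example.com is accepted, with or without the mailto: prefix, while the meaningless "a!@#$%^b" is rejected because the character before the "@" is not allowed in the local part.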

Finding the start of a URL is easy if we assume that it can only start with one of the following prefixes: http://, https://, ftp://, www. or ftp. The last two make it possible to skip the protocol for common addresses like www.mimec.org. But where exactly does the URL end? In the previous sentence, the final dot is clearly punctuation, not part of the URL, even though a dot can also be part of the URL. My original regex assumed that the URL must end with a letter, digit, or slash.

This also works in most cases, but it's not perfect. We can allow more characters at the end of the URL, but the really interesting case is handling parentheses. Consider these two examples:

  • Visit my website (www.mimec.org).
  • For more information, visit http://en.wikipedia.org/wiki/Tool_(band).

In the first sentence, the closing parenthesis is not part of the URL, but in the second it is. That's obvious to a human reader, but what about a machine? Fortunately someone already invented a regex which solves this problem. The final regular expression which I'm going to use looks like this (split into three lines for readability):

(?:\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)|\\\\)
(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$?!:,.-])*
(?:\([\w+&@#\/\\%=~|$?!:,.-]*\)|[\w+&@#\/\\%=~|$])
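
To check how the expression treats the two example sentences, it can be compiled with std::regex. This is a sketch, not the PHP code: firstUrl is a hypothetical helper, no modifiers are applied because the post doesn't show any, and the slashes are left unescaped because a C++ pattern has no delimiters.

```cpp
#include <regex>
#include <string>

// Return the first URL found in `text`, or an empty string if there is none.
std::string firstUrl(const std::string& text)
{
    // The three parts of the expression above, joined into one pattern.
    static const std::regex url(
        R"re((?:\b(?:(?:https?|ftp|file)://|www\.|ftp\.)|\\\\))re"
        R"re((?:\([\w+&@#/\\%=~|$?!:,.-]*\)|[\w+&@#/\\%=~|$?!:,.-])*)re"
        R"re((?:\([\w+&@#/\\%=~|$?!:,.-]*\)|[\w+&@#/\\%=~|$]))re");
    std::smatch match;
    return std::regex_search(text, match, url) ? match.str() : std::string();
}
```

The last part of the pattern is what decides where the URL ends: it must be either a balanced pair of parentheses or a character that is not punctuation. For the first example sentence this gives www.mimec.org without the closing parenthesis, and for the second it keeps the "(band)" suffix but drops the final dot.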

I added file:// and \\ prefixes (the latter is for UNC paths, like \\server\folder\file.doc) and added backslash as a valid character. They are already recognized by the Desktop Client as requested by one of the users. There is no reason not to handle them in the Web Client as well. Even though most browsers block access to such URLs, they can still be copied and pasted more easily.

While testing the regular expressions I made another interesting observation. When using character classes such as "\w" to match against a UTF-8 string, make sure to include the "u" modifier in the expression, for example "/(\w+)/u". Otherwise the result may break the UTF-8 encoding. For example, the Polish letter "ć" is represented in UTF-8 encoding as two bytes, 0xC4 and 0x87, which look like the two characters "Ä‡" when read as a single-byte encoding. The first one is a "word" character, and the second is not, so the regular expression running in single-byte mode would break the string in the middle of the multi-byte character. Even the innocent "\s" pattern matches the "\xA0" character which can be part of a multi-byte character, so be careful.
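
The byte-level picture can be made concrete with a few lines of C++. This sketch has nothing to do with PCRE itself; it only shows how to tell whether a byte offset falls inside a multi-byte UTF-8 character. The helper names are hypothetical. In UTF-8, every byte of the form 10xxxxxx is a continuation byte, which includes both 0x87 (the second byte of "ć") and 0xA0 (for example the second byte of "à", encoded as 0xC3 0xA0).

```cpp
#include <string>

// True if `byte` is a UTF-8 continuation byte (binary 10xxxxxx).
bool isContinuationByte(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// True if cutting `utf8` at byte offset `pos` does not land inside a
// multi-byte character. Assumes `utf8` holds valid UTF-8.
bool isSafeSplit(const std::string& utf8, std::size_t pos)
{
    if (pos >= utf8.size())
        return true;
    return !isContinuationByte(static_cast<unsigned char>(utf8[pos]));
}
```

For the string "\xC4\x87" (the letter "ć"), a split at offset 1 is reported as unsafe, which is exactly the cut that a byte-oriented "\w" match would make.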

Note that it took a bit of googling until I found information about that "u" modifier. The PHP manual should be more specific about it. What's worse, it seems that it's not always supported, even in recent versions of PHP. Just search for "this version of PCRE is not compiled with PCRE_UTF8 support" and you will see what I mean. Well, nothing is perfect, and PHP certainly isn't...