New Adventures in Software

The Perils of Web Development, the Importance of Testing and Why 95% of the World Couldn’t See My Page

Posted in Software Development by Dan on December 17th, 2007

Write the web page. Check how it looks in browser of choice. Validate the XHTML. Validate the CSS. Double-check in alternative browser. Job done. Surely the worst that can happen in other browsers is that some things are a few pixels out of alignment?

Not quite. Because, despite being well-formed and perfectly valid (it even has a nice W3C badge on the bottom), most of the content was completely invisible in the two most popular web browsers.

The problem is Internet Explorer’s handling of the <script> tag. IE will not let you use a “self-closed” script tag (e.g. <script src="myscript.js" />). It’s well-formed XML, and perfectly valid XHTML, just like the other self-closed tags IE allows, but IE assumes it is unclosed and therefore ignores the remainder of the content (at least until it hits a </script> tag belonging to another script further down the page). Brilliant. I’ve no idea what the justification for this is and, to be honest, I don’t care. I’d like to curse the ineptitude of Microsoft, change the page and that would be that. But no, I can’t just ignore it because it’s a bloody conspiracy. Firefox behaves in exactly the same way. So the lesson to take from this is: Opera and Safari – sensible; IE and Firefox – insane.
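A check for this is easy to automate when the editor offers no warning. The following is a minimal sketch, not a real XHTML parser: the class and method names are made up for illustration, and it assumes that no attribute value inside the tag contains a ">" character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: flags self-closed <script ... /> tags in a page.
final class SelfClosedScriptCheck {

    // [^>]* cannot cross a '>', so a normal <script ...></script> pair never matches.
    private static final Pattern SELF_CLOSED =
        Pattern.compile("<script\\b[^>]*/>", Pattern.CASE_INSENSITIVE);

    // Returns every self-closed script tag found in the markup, in document order.
    static List<String> findSelfClosedScripts(String markup) {
        List<String> hits = new ArrayList<String>();
        Matcher matcher = SELF_CLOSED.matcher(markup);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String page = "<head><script src=\"myscript.js\" /></head>";
        for (String tag : findSelfClosedScripts(page)) {
            System.out.println("Self-closed script tag: " + tag);
        }
    }
}
```

Run against each page as part of a build, something like this would have caught the problem before any visitor did.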

To be honest, I was already aware of this bug, in IE at least, because I had come across it once before (this makes it doubly frustrating). On that occasion I was editing the XHTML with IntelliJ IDEA, which very helpfully highlighted the problematic tag with a warning that it would not work with IE. For this page I used Vim, so no such help.

What is most annoying, and embarrassing, is that this page has been like this for ages (I can’t remember how long), displaying nothing but the Feedburner view of this blog, and a big white space, to most visitors (and sometimes I get literally several hits a week). I admit that the rest of the content is very minimal and not particularly interesting, but I’d prefer it if it was the visitors that chose to ignore it rather than their browsers.

Yes, yes… I know I was asking for trouble by neglecting to test in either of the most popular browsers. I have learned my lesson (until next time at least). Assumption truly is the mother of all fuck-ups.

Google takes on Wikipedia

Posted in The Internet by Dan on December 15th, 2007

The BBC brings news of Google’s plans for an online encyclopedia to rival Wikipedia.

The new project, called Knol, attempts to address some of Wikipedia’s shortcomings by putting more emphasis on respected authors and peer-reviewed content. In exchange for contributing, authors will receive a share of the ad revenue for their pages. Meanwhile, Wikipedia steadfastly refuses to display adverts, and instead relies on charitable donations to cover its costs.

This sounds a lot like the Scholarpedia project that I wrote about previously. But Scholarpedia lacks the considerable backing of the Google machine or the financial incentives of AdSense.

So will Google crush Wikipedia? Will Wikipedia have to adapt to survive? Or is it too entrenched already for Google’s efforts to have any real impact?

Wikipedia’s ad hoc editing certainly results in some interesting articles. During this year’s World Cup I found 3 separate pages detailing rugby player Jonny Wilkinson’s international points-scoring record, each with a wildly different number (including one that put him hundreds of points ahead of all-time record holder Neil Jenkins). Other things Wikipedia has taught me in the last year are that Clash frontman Joe Strummer was in favour of AIDS and global warming (or perhaps it was just a poorly constructed sentence), and that billionaire Chelsea Football Club owner Roman Abramovich is in fact a dustman.

Watchmaker Framework for Evolutionary Computation 0.4.3

Posted in Evolutionary Computation, Java by Dan on December 14th, 2007

This is mostly a maintenance release. Uncommons Maths is now a separate project so the Watchmaker Framework has been modified to use the official version of that library. There are a few other minor tweaks (a couple of classes have been moved around, but nothing in the core framework).

Version 0.4.3 also introduces an experimental EvolutionMonitor component. This is a Swing view that gives you some insight into the current state of the population while your evolutionary algorithm is running. In this first version all it does is graph the mean and peak fitness scores (using JFreeChart). Future versions will hopefully display more information (perhaps I will add an API to enable data to be extracted from the population while running). The EvolutionMonitor implements the EvolutionObserver interface so you can hook it up easily by calling the addEvolutionObserver method of your EvolutionEngine.

The other new feature is a termination condition that stops the algorithm when the population fitness begins to stagnate. If this condition is used and there is no fitness improvement within a specified number of generations, the evolution engine will assume that no further improvement can be made and will return the fittest individual found so far. This is often a more practical approach than specifying a maximum total number of generations or a fixed time limit in advance.
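The logic behind a stagnation check can be sketched in a few lines. This is an illustration only and not the Watchmaker API: the StagnationTracker class, its constructor parameters and its method names are all hypothetical.

```java
// Hypothetical sketch of a stagnation check; not part of the Watchmaker Framework.
final class StagnationTracker {

    private final int generationLimit;
    private final boolean higherIsBetter;
    private double bestFitness;
    private boolean seenFirstGeneration = false;
    private int generationsWithoutImprovement = 0;

    StagnationTracker(int generationLimit, boolean higherIsBetter) {
        if (generationLimit <= 0) {
            throw new IllegalArgumentException("generationLimit must be positive");
        }
        this.generationLimit = generationLimit;
        this.higherIsBetter = higherIsBetter;
    }

    // Call once per generation with that generation's best fitness score.
    // Returns true once generationLimit generations in a row have failed to
    // beat the best score seen so far.
    boolean shouldTerminate(double generationBest) {
        boolean improved = !seenFirstGeneration
            || (higherIsBetter ? generationBest > bestFitness
                               : generationBest < bestFitness);
        if (improved) {
            bestFitness = generationBest;
            seenFirstGeneration = true;
            generationsWithoutImprovement = 0;
        } else {
            generationsWithoutImprovement++;
        }
        return generationsWithoutImprovement >= generationLimit;
    }

    // The best fitness score recorded so far.
    double bestFitness() {
        return bestFitness;
    }
}
```

The higherIsBetter flag is there because some fitness functions are minimised rather than maximised, so "improvement" has to be defined in the right direction.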

Book Review: Programming Collective Intelligence

Posted in Evolutionary Computation, Python, Software Development by Dan on December 13th, 2007

It’s called “Programming Collective Intelligence” and is presented as a book for building “Smart Web 2.0 Applications” but it is essentially an extremely accessible explanation of a wide array of machine learning and data-mining algorithms. How do sites like Amazon and Last.FM make recommendations? How do search engines work? How does Google News manage to categorise and present the most important news articles without human intervention? How do you build a useful spam filter?

All of these questions are answered and compelling example applications are built step-by-step to demonstrate the power of the ideas presented here. Decision trees, genetic algorithms, neural networks, support vector machines, genetic programming, Bayesian classifiers and non-negative matrix factorisation are some of the techniques covered and all without the dry, maths-heavy text that normally fills books on these topics.

The examples throughout are exclusively in Python, which may have put me off had I realised this when I ordered it. I have nothing against Python except for my complete lack of experience with it. However, the examples are easy enough to understand for anybody familiar with other high-level languages. As a result of reading the book, I may actually try my hand at a bit of Python hacking now.

How well do these techniques work? Well I’d never have found out about this book but for Amazon’s automated recommendations system. I’d thoroughly recommend this book to anyone looking to learn about interesting AI techniques without wading through opaque academic papers.

(If you find the genetic algorithms and genetic programming topics interesting, check out the Watchmaker Framework for Evolutionary Computation and some of the books recommended there.)

Uncommons Maths – The Name Game

Posted in Java by Dan on November 26th, 2007

Just to clear up a possible misconception, Uncommons Maths is not intended as a slight against Apache Commons Math (regardless of the incidental mention Commons Math received in my previous rant against Commons Lang). Commons Math provides a lot of functionality that Uncommons Maths does not and probably never will.

The domain name that I use for my software is uncommons.org. The library does maths-type stuff. The name was obvious. Why “Maths” not “Math”? Because that’s what we call it here.

I looked at Commons Math once and it didn’t meet my requirements (I can’t even remember what they were), which is why I chose not to use it, not because it was defective in any way. The only reason I ever mentioned Commons Math is because it was at that moment, after previous disappointments with some of the Apache Commons libraries, that I exclaimed “We don’t need common software, what we need is uncommon software!”. It was a statement of intent. Hence the domain name.

Announcing Uncommons Maths

Posted in Evolutionary Computation, Java by Dan on November 19th, 2007

Uncommons Maths is a Java library consisting of a comprehensive random numbers package and other useful mathematical utility classes. It was originally part of the Watchmaker Framework for Evolutionary Computation but, due to its usefulness in other domains, it has now been converted into a standalone project (Apache Licence).

Download
API Documentation

This article briefly describes what’s available in this first public release. I am most definitely not a mathematician and, as such, this library is written by a programmer for programmers (but if mathematicians find it useful that’s good too). It includes classes that are useful in real world programs and is not intended to ever cover the full spectrum of mathematics. However, I hope that it will expand in scope over time in this spirit of pragmatism. To that end, suggestions and contributions are actively encouraged.

Random Number Generators

The Uncommons Maths library provides three easy-to-use, statistically-sound, high-performance pseudorandom number generators (RNGs). They are:

MersenneTwisterRNG: A Java port of the fast and reliable Mersenne Twister RNG originally developed by Makoto Matsumoto and Takuji Nishimura. This is faster¹ than java.util.Random and does not have the statistical flaws² of that RNG.
CellularAutomatonRNG: A Java port of Tony Pasqualoni’s ultra-fast Cellular Automaton RNG. It uses a 256-cell automaton to generate random values. To the best of my knowledge, this is the fastest¹ available pure Java RNG that completes the Diehard test suite without any problems.
AESCounterRNG: This is a cryptographically-strong³ non-linear RNG that is around 10x faster¹ than java.security.SecureRandom. Reverse-engineering the generator state from observations of its output would involve cracking the AES block cipher.

¹ A benchmark comparing the performance of these three RNGs and the two JDK RNGs can be found here (under the title “RNG Performance”).
² This applet demonstrates the non-randomness of java.util.Random.
³ The algorithm is not the only security consideration for RNGs. The source, secrecy and integrity of the seed data are also vital. For highly sensitive applications, consider using something like Fortuna.

Probability Distributions

Using the included probability distribution wrappers, these RNGs (and the standard JDK ones) can be used to generate values from Uniform, Normal, Binomial, Poisson and Exponential distributions.
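To give a flavour of what wrapping an RNG in a distribution involves, here is a minimal sketch of an exponential generator built on any java.util.Random using inverse transform sampling. It is not the Uncommons Maths code: the ExponentialSampler class and its method names are hypothetical, and the library may well use a different technique.

```java
import java.util.Random;

// Hypothetical wrapper that turns uniform values from any java.util.Random
// into exponentially-distributed values. Not the Uncommons Maths API.
final class ExponentialSampler {

    private final double rate;
    private final Random rng;

    ExponentialSampler(double rate, Random rng) {
        if (rate <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        this.rate = rate;
        this.rng = rng;
    }

    // Inverse transform sampling: if U is uniform on (0, 1] then
    // -ln(U) / rate follows an exponential distribution with the given rate.
    double nextValue() {
        // nextDouble() is in [0, 1), so 1 - u is in (0, 1] and the log is finite.
        double u = 1.0 - rng.nextDouble();
        return -Math.log(u) / rate;
    }

    public static void main(String[] args) {
        ExponentialSampler sampler = new ExponentialSampler(2.0, new Random());
        System.out.println(sampler.nextValue());
    }
}
```

Because the wrapper only depends on java.util.Random, any generator that extends that class can be passed in without changing the sampler.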

Permutations & Combinations

Uncommons Maths also includes generics-enabled combination and permutation generators. These are based on Java classes originally written by Michael Gilleland.
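As an illustration of what a generics-enabled permutation generator does, the sketch below enumerates every ordering of a list by recursive swapping. It is an assumption-laden example rather than the library's implementation: the Permutations class is hypothetical, it covers permutations only, and it builds the whole result in memory instead of generating orderings one at a time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example; not the Uncommons Maths permutation generator.
final class Permutations {

    // Returns all n! orderings of the given items (the input list is not modified).
    static <T> List<List<T>> of(List<T> items) {
        List<List<T>> result = new ArrayList<List<T>>();
        permute(new ArrayList<T>(items), 0, result);
        return result;
    }

    // Fixes each remaining element in position 'start' in turn, then recurses
    // on the rest of the list. Each swap is undone after the recursive call.
    private static <T> void permute(List<T> work, int start, List<List<T>> out) {
        if (start >= work.size() - 1) {
            out.add(new ArrayList<T>(work));
            return;
        }
        for (int i = start; i < work.size(); i++) {
            Collections.swap(work, start, i);
            permute(work, start + 1, out);
            Collections.swap(work, start, i);
        }
    }

    public static void main(String[] args) {
        List<String> letters = new ArrayList<String>();
        letters.add("a");
        letters.add("b");
        letters.add("c");
        for (List<String> ordering : of(letters)) {
            System.out.println(ordering);
        }
    }
}
```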

Statistics

Uncommons Maths provides a statistical data set class that can calculate a variety of descriptive statistics (variance, median, standard deviation, arithmetic and geometric means, etc.) for a set of values.
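For readers who want to see what those statistics amount to, here is a minimal sketch of such a data set class. The DescriptiveStats name and its methods are hypothetical and are not the Uncommons Maths API. It assumes population (not sample) variance, and the geometric mean assumes all values are positive.

```java
import java.util.Arrays;

// Hypothetical data set class; not the Uncommons Maths API.
final class DescriptiveStats {

    private final double[] sorted;

    DescriptiveStats(double... values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("At least one value is required");
        }
        sorted = values.clone();
        Arrays.sort(sorted); // Sorted once so that the median is a simple lookup.
    }

    double arithmeticMean() {
        double sum = 0;
        for (double v : sorted) {
            sum += v;
        }
        return sum / sorted.length;
    }

    // Computed via logarithms to avoid overflow from multiplying many values.
    // Only meaningful when every value is positive.
    double geometricMean() {
        double logSum = 0;
        for (double v : sorted) {
            logSum += Math.log(v);
        }
        return Math.exp(logSum / sorted.length);
    }

    double median() {
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    // Population variance: the mean squared deviation from the arithmetic mean.
    double variance() {
        double mean = arithmeticMean();
        double squaredDiffs = 0;
        for (double v : sorted) {
            squaredDiffs += (v - mean) * (v - mean);
        }
        return squaredDiffs / sorted.length;
    }

    double standardDeviation() {
        return Math.sqrt(variance());
    }
}
```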

Other

Other useful features in Uncommons Maths include utility methods to complement those in java.lang.Math, and utility classes for manipulating binary data.

15 Tips for Better Ant Builds

Posted in Ant, Java by Dan on October 25th, 2007

1. Automate everything.

You’re in charge, make the computer do the work. Ant can insert configuration into properties files, run SQL scripts, change file permissions, upload to FTP sites, and lots more. RTFM and make sure you are aware of everything that Ant can do for you.

2. Make builds Atomic.

Once you’ve automated everything, make sure you can run it all with a single command.

Most software engineering endeavours are part of the eternal struggle to limit the consequences of human incompetence. Unlike a computer, the human mind is not well suited to performing a long sequence of instructions without error. People get distracted, they forget things. My own experience tells me that the longest sequence of actions that a human can reliably reproduce consists of at most one step. Anything more than that and there is a tendency to miss steps out or do things in the wrong order. A slow build is much more bearable if it is atomic because you don’t need to get involved. You can leave it to complete while you do other important stuff.

3. Make it right first, then make it fast.

A build that is fast but wrong is not fast at all; it’s just wrong. A target that does not correctly ensure that its dependencies are rebuilt (when appropriate) will save you 3 seconds on each of the 30 rebuilds that it takes to debug the problem.

A useful tip for improving the runtime of slow builds is to use the ANT_OPTS environment variable to tweak the JVM settings used by Ant. For long-running builds you may find that the -server switch provides a noticeable improvement. Alternatively, increasing the heap size (via the -Xmx switch) may help.

4. If you use Ant, use only Ant.

The build.xml is the one true build script. Don’t disrespect the build.xml by using your IDE’s build process in its place. Even if you are really diligent about configuring your IDE, you are very likely to end up with a slightly different build process. Better still, when your colleague checks in a change to build.xml, he or she won’t tell you and your build will be broken. If you’re lucky this will become obvious sometime after you start complaining loudly about others checking in broken code. In the worst case your build will be subtly broken and you won’t notice. Avoid the contempt of your co-workers and make your IDE use Ant. Don’t let the old UNIX hacker in the corner use make just because he doesn’t like XML (he doesn’t like Java either, he’s just doing it because the company hasn’t had any C work in the last ~~6 months~~ 10 years). If he uses make, you all have to use make (good luck), since there can be only one true build script.

5. Follow conventions.

Ant expects your build file to be called build.xml (unless you tell it otherwise). Developers expect the build file to be in the root of the project. Other than the understandable desire to confuse and irritate the rest of your team, including the lucky souls that get to maintain your code long after you get head-hunted by Google, there’s no reason not to follow these conventions. Another less well-known convention is to prefix all “internal” targets (those that should not be called directly) with a hyphen. This is suggested in the Ant manual. It has the advantage that it is not possible to invoke targets that follow this naming scheme from the command line.

6. Provide a clean-up target.

It is usual for makefiles and Ant scripts to provide a target called “clean” that removes all generated files and returns the project to its initial state. Without this functionality failed builds can be a real problem. Providing a clean target is made easier if source files (those that are under version control) and generated files (those that are derived from versioned files) are kept separate.

7. Compile test code at the same time as production code.

Test code, by definition, depends on production code (if it didn’t, your test coverage scores would not be very impressive). Therefore, changes to production code can potentially break test code. For this reason it is a good idea to enforce the rule that test classes are always compiled at the same time as production classes (you don’t have to run the tests at this point). Since you are firing up the compiler anyway, it won’t add too much overhead and it avoids breaking windows.

8. Make builds self-contained.

A build that has external dependencies is a build that is difficult to configure. When your hard drive fails or a new developer joins the team, you want to be able to configure a new machine for building ASAP. If your build depends on tools like Checkstyle or TestNG, add these to the project repository. This makes your build simpler because you know exactly where to find the files and you don’t have to worry about version mismatches. Other than Ant itself and the JDK, external dependencies should be eliminated wherever possible (Maven has a different approach to this kind of dependency).

9. Parameterise essential configuration.

It’s not always practicable to eliminate all external dependencies. For example, you may need to configure a database connection as part of your build. Don’t hard-code this kind of configuration in the build.xml; use Ant’s support for properties files to allow individual developers to set configuration parameters. Don’t complicate matters by making everything configurable. Only parameterise those settings that really need to be configurable.

10. Provide sensible defaults.

Where parameters are required for building, you can make things much simpler by providing sensible default values rather than leaving everything to be configured by the individual developer. For example, if one of the parameters is a JDBC URL, a sensible default might be to point to localhost (most developers will probably be developing against their local database). If one of the required properties is the file system path to a particular tool, provide the default installation directory. By providing sensible defaults you can eliminate most of the configuration hassles.

11. Visualise dependencies.

The build script for a large project can have complex, often ad-hoc dependencies. Tools such as yWorks Ant Explorer can provide insight into dependencies by providing a graphical representation of the structure of the build. If the graph looks like one of these, consider refactoring.

12. Constrain classpaths.

Avoid the temptation to use a single global classpath for compilation. Always ensure that code is built with a classpath that contains only those classes that will be available to it at runtime. For example, if your build creates an EAR file and a Swing client, make sure that the Swing client is compiled with access only to those libraries that will be deployed with it. Do the same for the EAR file and any other modules. This approach minimises the potential for embarrassing NoClassDefFoundErrors at runtime and also helps to detect inappropriate dependencies.

13. Modularise your build.

Arrange your project into coherent, self-contained modules. For example, you may have a Swing GUI module and web module among others. Lay out your project files to support this arrangement. The goal is to make modules as self-contained as possible and to minimise dependencies between them. The following example shows one possible way of arranging your project files:

myproject
|__modules
|  |__gui
|  |  |__lib
|  |  |__src
|  |     |__java
|  |        |__main
|  |        |__test
|  |__web
|     |__lib
|     |__src
|        |__html
|        |__java
|           |__main
|           |__test
|__build.xml

14. Favour convention over configuration.

If the source code for module1 is in module1/src/java, where would you expect to find the source code for module2? If files are in consistent predictable locations within each module, it’s not necessary to explicitly configure each location. By always following the same conventions about where things go in each module, you minimise complexity and make it easier to implement the next recommendation…

15. Eliminate duplication with macros.

If all of your modules are laid out identically, the common tasks (such as compiling source trees and creating JAR files) are essentially identical for each module. Ant 1.6 introduced macros to enable these operations to be defined just once and applied repeatedly with different parameters. Familiarise yourself with the Macrodef task and consider how it can be applied in your build files. The example macros below are for compiling generic modules. The first macro defines how a single source tree is compiled. Because we have followed the advice from the rest of this article, all we need to provide to run the macro is the name of the module (by convention modules are in directories of the same name, all Java source is in the src/java directory of the module and below this are the main and test source trees). The second macro simplifies things further by combining the two separate calls to the first macro. The main classes are built, then the test classes are built with the main classes added to the compiler’s classpath. Using this macro we can compile an entire module with a single line:

<compilemodule name="mymodule" />

<!-- This macro compiles one source tree (i.e. the main
     source tree or the unit test source tree) of a given
     module. -->
<macrodef name="compiletree">
  <attribute name="module"/>
  <attribute name="tree"/>
  <element name="treeclasspath" optional="true"/>
  <sequential>
    <mkdir dir="./@{module}/build/classes/@{tree}" />
    <javac destdir="./@{module}/build/classes/@{tree}"
           debug="on"
           deprecation="on"
           optimize="on"
           source="1.5"
           target="1.5"
           srcdir="./@{module}/src/java/@{tree}">
      <classpath>
        <treeclasspath/>
        <path refid="base.path"/>
      </classpath>
      <compilerarg value="-Xlint:unchecked" />
    </javac>
  </sequential>
</macrodef>
 
<!-- This macro compiles all source (including unit tests)
     for a single module.  -->
<macrodef name="compilemodule">
  <attribute name="name"/>
  <element name="moduleclasspath" optional="true"/>
  <sequential>
    <compiletree module="@{name}" tree="main">
      <treeclasspath>
        <moduleclasspath />
      </treeclasspath>
    </compiletree>
    <compiletree module="@{name}" tree="test">
      <treeclasspath>
        <!-- Add the main classes to the classpath for unit
             test compilation. -->
        <path location="./@{name}/build/classes/main" />
        <moduleclasspath />
      </treeclasspath>
    </compiletree>
  </sequential>
</macrodef>

These macro definitions are somewhat verbose (due to Ant’s XML syntax), but the benefit is that they ensure that all modules are built in the same way. Rules for compilation only have to be defined once. Also, keep in mind that once you have written macros for one project, you can reuse them elsewhere. You may choose to put all of your macros in a separate file and import this wherever it is required.

ReportNG – Simple HTML reporting for TestNG

Posted in Java by Dan on October 21st, 2007

The default HTML reports generated by TestNG, though comprehensive, are a little bit horrid. Since Cedric introduced a reporting API to TestNG a little while ago, I have been using my own hacked-together, Velocity-powered reporter.

It’s very simple and doesn’t have all of the information from the default report but it is easier to work out what’s going on with your test results at a high level. I’ve had it in a Java.net repository for a while and now I’ve just produced a slightly improved version for download by anybody that might find it useful.

A real-world example (unit tests for the Watchmaker Framework)
A very simple sample report showing some failures

It’s not particularly slick and there is room for improvement. Suggestions are welcome.

Useful Haskell Links

Posted in Haskell by Dan on August 16th, 2007

Compilers and Interpreters

Monads

Software Transactional Memory

Beautiful Concurrency (PDF) by Simon Peyton Jones (from the book Beautiful Code).
Transactional Memory for Concurrent Programming (16 minute video) – An OSCON presentation by Simon Peyton Jones (PDF slides)

The Unbeatable Draughts (Checkers) Player

Posted in Evolutionary Computation, Software Development by Dan on July 21st, 2007

While some of us can just about muddle through a Sudoku, others are aiming higher. The BBC has the story of how a team from the University of Alberta has solved Checkers (or “Draughts” as we like to call it in this part of the world). The Chinook program was already well capable of beating the best human players; now it’s not even worth trying, since it can’t be beaten. Earlier incarnations of Chinook made use of evolutionary approaches. The latest release is the result of a phenomenal amount of CPU time dedicated to analysing every possible game position and determining the best move for each. A solution for Chess is still some way off.


New Adventures in Software by Dan Dyer