While working on my recent projects, I once again found myself cursing the flat structure of my source code files. Modern source code editors commonly let you “fold” blocks of code to improve readability and overview, but nobody seems to have cared about comments here: in almost all cases, the editor only lets you collapse the comment block itself, not the code it refers to. Thinking about this, I came to the conclusion that source code is not as “flat” as the plain text file might suggest. Just as you collapse a block in a source code editor, you often do the same when browsing the content of an XML file. That led me to the idea of structuring source code with the help of XML.
srcML
Using XML to store an AST-like representation is not new to research. For example, Maletic, Collard and Marcus discussed this approach and its disadvantages in their paper Source Code Files as Structured Documents, which was presented at the IWPC in 2002. A dedicated project grew out of this work; its progress is documented on the website of the SDML srcML Project. Their pages also provide the DTD file that defines the structure of such a srcML file (for the programming languages C, C++ and Java).
In my opinion their approach is too lax in some regards. For example, they argue that it should be left to the organization, the coding guidelines, or at least the developer to decide how a comment is associated with the “productive” code. However, the “scope” of a comment in particular can reveal a lot about the structure of the code. Comments and their ranges of validity often carry vital information about how the statements are partitioned from a semantic point of view. The attributes provided for the comment tag in their DTD merely differentiate between the types of comment used in the original source code, which is too little to capture this information. The root cause of this issue most likely lies in the fact that their toolkit tries to generate a srcML file from a “canonical” source code file.
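To illustrate what an explicit comment scope could look like, here is a minimal sketch in Python. It is not taken from the srcML DTD: the element names `unit`, `comment` and `stmt` are only loosely modeled on srcML, and the `scope` attribute (the number of following statements a comment applies to) is an assumption invented for this example, as is the helper `comment_scopes`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: "scope" is NOT part of the srcML DTD; it is an
# assumption made for this sketch and counts the statements that follow
# the comment and belong to it.
SNIPPET = """<unit language="C++"><comment type="line" scope="3">// swap a and b</comment>
<stmt>int t = a;</stmt>
<stmt>a = b;</stmt>
<stmt>b = t;</stmt>
<stmt>return a;</stmt>
</unit>"""


def comment_scopes(xml_text):
    """Return (comment text, covered statements) for each top-level comment."""
    root = ET.fromstring(xml_text)
    children = list(root)
    result = []
    for i, node in enumerate(children):
        if node.tag != "comment":
            continue
        # A missing scope attribute means the comment covers nothing.
        span = int(node.get("scope", "0"))
        covered = [c.text for c in children[i + 1 : i + 1 + span]]
        result.append((node.text, covered))
    return result


for text, covered in comment_scopes(SNIPPET):
    print(text, "->", covered)
```

With such information an editor could fold the comment together with the three statements it describes, while the trailing `return` statement stays outside the folded region.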
Better use srcML as Original
Moreover, I believe the toolkit approach described above is heading in the wrong direction. Instead of trying to “augment” and enrich existing code with XML tags to make it “more valuable” by making it easier to analyse and transform, the better approach would be to declare the XML-enabled file to be the source! Let’s face the truth: source code intended for compilation is no longer written with simple editors such as vi(m), jed, leafpad or Windows Notepad. Gone are the days when developers were forced to use primitive means to squeeze out their thoughts and write them down letter by letter in a text format. The best example of this evolution from “writing down statements” towards a tool for development is Emacs, which provides a rich set of plugins that allow you to write source code in almost any programming language you can imagine. Moreover, today’s reality is that the “standard developer” uses a source code editor that is part of an Integrated Development Environment (IDE) such as Eclipse, NetBeans, Visual Studio or similar to implement the next cutting-edge product. Last but not least, those editors are already entire applications in themselves and offer their users advantages such as:
- Automatic Code Completion
- Online Parsing with Structure Analysis
- Bracket Matching
- Syntax Highlighting
So the most natural thing in the world is to enhance those feature-rich editors so that they do not just store “the text files” but also allow the developer to, for example, state the scope of a comment (in terms of lines), which is then stored in an “enhanced srcML” file on the local disk. For the sake of compatibility, a redundant copy of the source code in plain text format (which then lacks this information) can be stored side by side. The compiler can then use the plain text format for its processing, although it might be worth a thought whether using the srcML file would provide additional benefit anyway. For example, the validity of the syntax is partially assured, because the well-formedness of an XML file can be checked much more easily than the consistency of a C++ source file; additionally, the AST is already partly present and does not need to be derived again via costly string operations.
Although the idea of structuring code with the help of XML is not new to the world, I was still shocked when I googled for the search terms “structured source code” editor: the number of hits is quite low, and looking into the details of the links makes it obvious that storing structured source code as the original has reached neither research nor industry yet. This isn’t rocket science, is it?
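To make the points about well-formedness and the redundant plain text copy a little more concrete, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. The markup (including the `scope` attribute) and the helper names `is_well_formed` and `to_plain_text` are assumptions made for illustration; this is not how the srcML toolkit works.

```python
import xml.etree.ElementTree as ET

# Hypothetical "enhanced srcML" content; the element names and the
# "scope" attribute are assumptions made for this sketch.
STRUCTURED = """<unit><comment scope="1">// increment</comment>
<stmt>x = x + 1;</stmt>
</unit>"""

# The <stmt> element is never closed, so this is not well-formed XML.
BROKEN = "<unit><stmt>x = x + 1;</unit>"


def is_well_formed(xml_text):
    """Cheap structural check: does the text parse as XML at all?"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def to_plain_text(xml_text):
    """Derive the plain text copy by dropping all tags and attributes."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


print(is_well_formed(STRUCTURED))
print(is_well_formed(BROKEN))
print(to_plain_text(STRUCTURED))
```

Note that the well-formedness check only guarantees that the XML nesting is intact, which is why the syntax is only partially assured; it says nothing about whether the statements themselves are valid C++. Also, characters such as `<` and `&` in the code have to be escaped in the XML file, and the parser turns them back into plain characters when the text copy is derived.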