Friday 15 August 2008

Convert C++ to syntax colored HTML code with line numbers

Today I made a small C++ program which converts C/C++ code to syntax-highlighted HTML. The code snippets in the previous post have already been updated with it, and you can also see the program in action in this post. Like it? :)
The code is not that small, so I will not explain it in detail. To compile it, the Boost C++ libraries, including the Boost regular expression library, must be installed on the system. Boost can be either built from source or installed via a package manager; your choice. When compiling a program that uses Boost regex (like this one), pass the -lboost_regex flag when linking with GCC, for example: g++ ccode2html.cpp -o ccode2html -lboost_regex (assuming the source is saved as ccode2html.cpp).
OK, here's the code:

  1. #include <iostream>
  2. #include <fstream>
  3. #include <string>
  4. using namespace std;
  5.
  6. #include <boost/regex.hpp>
  7.
  8. const string help =
  9.     "Usage:"
 10.     "\n\tccode2html [input file] [output file]"
 11.     "\n\tIf output filename is omitted it will be saved as [input file].html\n";
 12.
 13. const string pre_expression = "(>)|(<)|(&)";
 14. const string pre_format = "(?1\\&gt;)(?2\\&lt;)(?3\\&amp;)";
 15.
 16. const string line_expression = "^.*?$";
 17. const string line_format = "<li>$&</li>";
 18.
 19. const string whole_code_expression = "^.*$";
 20. const string whole_code_format = "<pre><ol>$&</ol></pre>";
 21.
 22. const string expressions =
 23.     // single line comments
 24.     "(//.*?(?=</li>))|"
 25.     // multi-line comments
 26.     "(/\\*.*?\\*/)|"
 27.     // string literals
 28.     "(\"(?:[^\\\\\"]|\\\\.)*\"|'(?:[^\\\\']|\\\\.)*')|"
 29.     // precompile directives
 30.     "(#.*?(?=</li>))|"
 31.     // floating point numbers
 32.     "(\\<[[:digit:]]+\\.[[:digit:]]+)|"
 33.     // integer numbers
 34.     "(\\<[[:digit:]]+\\>)|"
 35.     // boolean literals
 36.     "((?:\\<true\\>)|(?:\\<false\\>))|"
 37.     // keywords
 38.     "(\\<(__asm|__cdecl|__declspec|__export|__far16|__fastcall|__fortran|__import"
 39.     "|__pascal|__rtti|__stdcall|_asm|_cdecl|__except|_export|_far16|_fastcall"
 40.     "|__finally|_fortran|_import|_pascal|_stdcall|__thread|__try|asm|auto|bool"
 41.     "|break|case|catch|cdecl|char|class|const|const_cast|continue|default|delete"
 42.     "|do|double|dynamic_cast|else|enum|explicit|extern|false|float|for|friend|goto"
 43.     "|if|inline|int|long|mutable|namespace|new|operator|pascal|private|protected"
 44.     "|public|register|reinterpret_cast|return|short|signed|sizeof|static|static_cast"
 45.     "|struct|switch|template|this|throw|true|try|typedef|typeid|typename|union|unsigned"
 46.     "|using|virtual|void|volatile|wchar_t|while|NULL)\\>)";
 47. const string formats =
 48.     "(?1<font color = \"#999999\"><i>$&</i></font>)"
 49.     "(?2<font color = \"#D3D3D3\">$&</font>)"
 50.     "(?3<font color = \"#009900\">$&</font>)"
 51.     "(?4<font color = \"#006699\">$&</font>)"
 52.     "(?5<font color = \"#996600\">$&</font>)"
 53.     "(?6<font color = \"#993366\">$&</font>)"
 54.     "(?7<font color = \"#990000\"><b>$&</b></font>)"
 55.     "(?8<font color = \"#003399\"><b>$&</b></font>)";
 56.
 57. int main (int argc, char* argv[]) {
 58.     string input_filenm;
 59.     string output_filenm;
 60.     if ( argc > 3 || argc == 1 ) {
 61.         cout << help;
 62.         return -1;  // same exit status as exit(-1), without needing <cstdlib>
 63.     }
 64.     else if ( argc == 2 ) {
 65.         input_filenm = argv[1];
 66.         output_filenm = input_filenm + ".html";
 67.     }
 68.     else {
 69.         input_filenm = argv[1];   // assign the outer variables instead of shadowing them
 70.         output_filenm = argv[2];
 71.     }
 72.     ifstream in ( input_filenm.c_str() );
 73.     if (!in.is_open()) {
 74.         cerr << "Failed to open: " << input_filenm << '\n';
 75.         return -1;  // nothing to convert, so stop here
 76.     }
 77.     // the input file is open; report progress
 78.     cout << input_filenm << " opened successfully\nProcessing...\n";
 79.     ofstream out ( output_filenm.c_str() );
 80.     string in_string;
 81.     char c;
 82.     while (in.get(c)) {
 83.         in_string.append(1,c);
 84.     }
 85.     boost::regex reg;
 86.
 87.     // replace <, > and & signs with appropriate html escape characters
 88.     reg.assign(pre_expression);
 89.     in_string = boost::regex_replace(in_string, reg, pre_format, boost::match_default | boost::format_all);
 90.
 91.     // add <li> ... </li> tags on each line
 92.     reg.assign(line_expression);
 93.     in_string = boost::regex_replace(in_string, reg, line_format, boost::match_default |
 94.         boost::format_all | boost::match_not_dot_newline);
 95.
 96.     // format and color code syntax
 97.     reg.assign(expressions);
 98.     in_string = boost::regex_replace(in_string, reg, formats, boost::match_default |
 99.         boost::format_all);
100.
101.     // add <pre><ol> on start and </ol></pre> at the end of the file
102.     reg.assign(whole_code_expression);
103.     in_string = boost::regex_replace(in_string, reg, whole_code_format);
104.
105.     in.close();
106.     out << in_string;
107.     out.close();
108.     return 0;
109. }


The first three lines include the standard header files the program needs. The second include (fstream) is required for the file I/O operations on lines 72, 73 and 79. On line 6 the file boost/regex.hpp is included, which is required for the regular expression functions and objects; I used only the regex class and the regex_replace function in this code. The string help is declared on lines 8 to 11. It explains the usage of the program and is printed if no arguments or more than two command-line arguments are passed when the program is run.

The heart of the program is in lines 13-55 and 85-103. See the Boost Regex documentation for more information about the regular expressions used in C++. Many regular expression tutorials and explanations can be found online as well. There is even a book dedicated to the topic, Mastering Regular Expressions, which I strongly recommend.

I'll just briefly explain how I used regexes in this program. Two strings are declared on lines 13 and 14. The first string, pre_expression, is passed to the assign function of the regex class (see line 88). It describes the search pattern: in this case the document is searched for the '<', '>' and '&' signs, which need to be replaced with HTML escape sequences. Those escape sequences are given in the pre_format string, where ?1, ?2 and ?3 refer to the sub-expression indexes: the text after ?1 is used when the first parenthesized sub-expression matched, and so on. After the pre_expression string is assigned to the regex reg on line 88, the function regex_replace, which takes four arguments, is called.
The first argument, in_string, is the string holding the whole code from the file. That string is searched for the pattern held in reg, the second argument. When a match is found it is replaced according to the pre_format string, the third argument. The last parameter is used for Boost-specific flags; format_all is what enables the conditional (?N...) syntax in the format string.
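
To make the role of the sub-expression indexes more concrete, here is a minimal sketch of the same escaping step written with std::regex from the standard library instead of Boost. This is a swapped-in technique, not the code the program uses: std::regex has no (?N...) conditional format, so the choice of replacement is made by hand for each match. The function name escape_html is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of the program above): escapes '>', '<'
// and '&' in a single pass. The pattern is the same as pre_expression;
// the if/else chain plays the role of the (?1...)(?2...)(?3...) format.
std::string escape_html(const std::string& text) {
    static const std::regex special("(>)|(<)|(&)");
    std::string result;
    auto copied_up_to = text.cbegin();
    for (std::sregex_iterator it(text.cbegin(), text.cend(), special), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        result.append(copied_up_to, m[0].first);    // text before the match
        if (m[1].matched)      result += "&gt;";    // sub-expression 1: '>'
        else if (m[2].matched) result += "&lt;";    // sub-expression 2: '<'
        else                   result += "&amp;";   // sub-expression 3: '&'
        copied_up_to = m[0].second;
    }
    result.append(copied_up_to, text.cend());       // text after the last match
    return result;
}
```

Calling escape_html("a < b") gives "a &lt; b". Because every character of the input is visited only once, an '&' produced by an earlier replacement is never escaped a second time, which is also why the original program does all three replacements in one regex_replace call.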

First, the '<', '>' and '&' signs are replaced with their HTML escape sequences. Next, the <li> and </li> tags are added at the start and end of each line for numbering. Then the code syntax is highlighted with the appropriate HTML <font> tags declared in the string formats on lines 47-55. Finally, <pre> and <ol> and their closing tags are put at the start and the end of the file.
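
The two wrapping steps (the <li> tags per line and the <pre><ol> around everything) do not strictly need regular expressions. As a rough sketch, assuming the input has already been escaped, they could be done with std::getline; the function name wrap_lines is hypothetical and this is an alternative to, not a copy of, what the program does with line_expression and whole_code_expression.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: wraps every line of already-escaped code in
// <li>...</li> and the whole listing in <pre><ol>...</ol></pre>.
std::string wrap_lines(const std::string& escaped_code) {
    std::istringstream lines(escaped_code);
    std::string line;
    std::string result = "<pre><ol>";
    while (std::getline(lines, line)) {
        // one list item per source line; the browser numbers them
        result += "<li>" + line + "</li>\n";
    }
    result += "</ol></pre>";
    return result;
}
```

In this variant the highlighting step would have to run on the text between the two wrappers, since the single-line comment and directive patterns above rely on the closing </li> tag being present.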

The only problem in this code is the regex for multi-line comments. The expression '/\\*.*?\\*/' matches all text between '/*' and '*/', including the <li> and </li> tags inside, so a <font> tag opened in one list item is closed in another. I never figured out how to exclude those tags from formatting. If some regex professional is reading this, please help out!
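
One possible way around this, sketched here without having been tried in the program above, is to leave the regex alone and post-process each multi-line match instead: close the highlighting tag before every </li> inside the match and reopen it after the following <li>. The helper name wrap_across_items and its parameters are hypothetical; it would have to be called from code that visits the matches one by one (for example a loop over the matches) rather than from a plain format string.

```cpp
#include <string>

// Hypothetical helper: wraps `matched` (the text of one highlighted match,
// possibly spanning several <li> items) in open_tag/close_tag, closing the
// tag before each "</li>" and reopening it after the following "<li>".
std::string wrap_across_items(const std::string& matched,
                              const std::string& open_tag,
                              const std::string& close_tag) {
    static const std::string item_end = "</li>";
    static const std::string item_start = "<li>";
    std::string result = open_tag;
    std::size_t pos = 0;
    while (true) {
        std::size_t end_item = matched.find(item_end, pos);
        if (end_item == std::string::npos) break;
        std::size_t next_item = matched.find(item_start, end_item);
        if (next_item == std::string::npos) break;
        result.append(matched, pos, end_item - pos);   // comment text on this line
        result += close_tag;
        // copy "</li>", the line break and "<li>" unchanged
        std::size_t after = next_item + item_start.size();
        result.append(matched, end_item, after - end_item);
        result += open_tag;
        pos = after;
    }
    result.append(matched, pos, std::string::npos);    // rest of the match
    result += close_tag;
    return result;
}
```

With open_tag set to the <font> tag for comments and close_tag to </font>, every list item ends up with its own properly nested <font> element, and a match that stays on one line is wrapped exactly as before.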
