The Art of File Format

Sunday, September 4, 2011

How to avoid ‘automatic’ cell formula conversion in NPOI

In all versions of NPOI including 1.2.4 alpha, someone may find that there is a weird issue which is different from the default behavior of Excel 2003/2007. If the xls is created by NPOI, no matter what you inputed, for example, 1-1, the cell will be converted to a formula cell. However, when you create a empty xls using Office 2003/Office 2007, this issue never happen.

After some research, I found that the issue is caused by two property in WSBoolRecord. They are AlternativeExpression and AlternativeFormula. To find the corresponding option in Office 2007, click Office icon – > Excel Options –> Advanced –> Lotus Compatibility settings:

Transition Formula Entry and Transition Formula Expression.

By default, both should not be checked. However, if the file is created by NPOI, both are checked. That’s why 1-1 is recognized as =1-1 instead of Jan 1.

We will fix this issue in NPOI 1.2.4 final release. However, to avoid this issue in previous releases, you can set AlternativeExpression and AlternativeFormula to false for each sheet. Here is the code example to modify sheet 0

((HSSFSheet)hssfworkbook.GetSheetAt(0)).AlternativeFormula = false;
((HSSFSheet)hssfworkbook.GetSheetAt(0)).AlternativeExpression = false;

Wednesday, August 17, 2011

Weird Office Excel Date System

Before coding NPOI, I always thought Excel use standard file date system and no doubt on its design. However, several days ago, I found that I had to convert date number read from raw stream to REAL date system because Excel use a very different date system, which are called 1900 and 1904 date system.

Here is a description from KB180162:

The 1900 Date System

In the 1900 date system, the first day that is supported is January 1, 1900. When you enter a date, the date is converted into a serial number that represents the number of elapsed days since January 1, 1900. For example, if you enter July 5, 1998, Microsoft Excel converts the date to the serial number 35981.
By default, Microsoft Excel for Windows and Microsoft Excel for Windows NT use the 1900 date system. The 1900 date system allows greater compatibility between Microsoft Excel and other spreadsheet programs, such as Lotus 1-2-3, that are designed to run under MS-DOS or Microsoft Windows.

The 1904 Date System

In the 1904 date system, the first day that is supported is January 1, 1904. When you enter a date, the date is converted into a serial number that represents the number of elapsed days since January 1, 1904. For example, if you enter July 5, 1998, Microsoft Excel converts the date to the serial number 34519.
By default, Microsoft Excel for the Macintosh uses the 1904 date system. Because of the design of early Macintosh computers, dates before January 1, 1904 were not supported; this design was intended to prevent problems related to the fact that 1900 was not a leap year. Note that if you switch to the 1900 date system, Microsoft Excel for the Macintosh does support dates as early as January 1, 1900.

As above says, there are two kinds of date system in Excel! What a hell!

1904 date system is always 1,462 days bigger than 1900 date system, which sounds simple but actually it’s not as simple as we think!

As we know, four years and one day=365*4+1=1461. So where is another day from? In fact, it is a Date bug in Excel according to KB214326. Excel treats 1900 as a leap year which means Feb. 29, 1900 is a valid day. That’s why 1904 date system is 1462 days bigger than 1900 date system. As KB214326 says, Microsoft is not willing to fix this bug as this bug will make Excel date system compatible with Lotus 1-2-3 date system.

So if you read a date serial number from raw Excel file, don’t forget to decrease 1 day.

By the way, to determine whether your Excel file uses 1904 date system, you can read 1904 Date System record(22h) from raw xls stream.

Offset	Field Name	Size	Contents
4	f1904	2	=1 if the 1904 date system is used

Reference: Office Excel 97-2007 Binary File Format Specification, page 34/349

How is Sheet Name handled in Excel?

While fixing bugs of SheetNameFormatter class, I found some rules of handling the sheet names in Excel.

Rules are as followes:

a. If there are spaces among the letters of the sheet name, single quotes will be added.

my sheet!A1 => ‘my sheet’!A1

b. If the sheet name starts with a digit instead of letter, single quotes should be added.

3rdTimeLucky!A1 => ‘3rdTimeLucky’!A1

c. If single quotes exists in the sheet name, single quotes should be added on both sides and the single quote in the middle should be replaced with two single quotes.

0’Hero!A1 => ‘0’’Hero’!A1

d. If the sheet name looks like cell reference like A1, B2, CC5, it should be braced with single quotes.

A12000!A1 => ‘A12000’!A1

e. If there are underscores in the sheet name, it’s NOT necessary to put single quotes at both sides.

my_sheet!A1 => my_sheet!A1

Hyperlink stream in Excel File

After I reviewed the code in HyperLink Record in POI, I notice that a small bug exists while reading the url. To fix the bug, I tried to find some document about it but faild. The Hyperlink stream is undocumented in the Microsoft Excel binary format document (though I don’t know the reason).

To help others understand it well, I’d like to summarize it in this post.

//link to http://www.lakings.com/ 
byte[] data1 = { 
0x02, 0x00, //First row of the hyperlink 
0x02, 0x00, //Last row of the hyperlink 
0x00, 0x00, //First column of the hyperlink 
0x00, 0x00, //Last column of the hyperlink 
//16-byte GUID. Seems to be always the same. Does not depend on the hyperlink type 
(byte)0xD0, (byte)0xC9, (byte)0xEA, 0x79, (byte)0xF9, (byte)0xBA, (byte)0xCE, 0x11, (byte)0x8C, (byte)0x82, 0x00, (byte)0xAA, 0x00, 0x4B, (byte)0xA9, 0x0B, 
0x02, 0x00, 0x00, 0x00, //integer, always 2 
// flags. Define the type of the hyperlink: // HyperlinkRecord.HLINK_URL | HyperlinkRecord.HLINK_ABS | HyperlinkRecord.HLINK_LABEL 
0x17, 0x00, 0x00, 0x00, 
0x08, 0x00, 0x00, 0x00, //length of the label including the trailing '' //label: 
0x4D, 0x00, 0x79, 0x00, 0x20, 0x00, 0x4C, 0x00, 0x69, 0x00, 0x6E, 0x00, 0x6B, 0x00, 0x00, 0x00, //16-byte link moniker: HyperlinkRecord.URL_MONIKER 
(byte)0xE0, (byte)0xC9, (byte)0xEA, 0x79, (byte)0xF9, (byte)0xBA, (byte)0xCE, 0x11, (byte)0x8C, (byte)0x82, 0x00, (byte)0xAA, 0x00, 0x4B, (byte)0xA9, 0x0B, //count of bytes in the address including the tail 
0x48, 0x00, 0x00, 0x00, //integer //the actual link, terminated by '\u0000' 
0x68, 0x00, 0x74, 0x00, 0x74, 0x00, 0x70, 0x00, 0x3A, 0x00, 0x2F, 0x00, 0x2F, 0x00, 0x77, 0x00, 0x77, 0x00, 0x77, 0x00, 0x2E, 0x00, 0x6C, 0x00, 0x61, 0x00, 0x6B, 0x00, 0x69, 0x00, 0x6E, 0x00, 0x67, 0x00, 0x73, 0x00, 0x2E, 0x00, 0x63, 0x00, 0x6F, 0x00, 0x6D, 0x00, 0x2F, 0x00, 0x00, 0x00, //standard 24-byte tail of a URL link. Seems to always be the same for all URL HLINKs 
0x79, 0x58, (byte)0x81, (byte)0xF4, 0x3B, 0x1D, 0x7F, 0x48, (byte)0xAF, 0x2C, (byte)0x82, 0x5D, (byte)0xC4, (byte)0x85, 0x27, 0x63, 0x00, 0x00, 0x00, 0x00, (byte)0xA5, (byte)0xAB, 0x00, 0x00};

I copied the above data of a URL-type hyperlink from TestHyperlinkRecord.cs in NPOI.

begin offset	length (bytes)	Description
0	2	First row number
2	2	Last row number
4	2	First column number
6	2	Last column number
8	16	GUID
24	4	unknown field, always equals 2
28	4	type of the hyperlink, HyperlinkRecord.HLINK_URL \| HyperlinkRecord.HLINK_ABS \| HyperlinkRecord.HLINK_LABEL
32	4	length of label including the trailing ‘’
36	variable1	label string
36+variable1	16	link moniker
52+variable1	4	count of bytes in the address including the tail
56+variable1	variable2	the actual link terminated by ‘\u0000’
56+variable1+variable2	24	standard 24-byte tail of a URL link. it seems to be always same

The tail of a url link is always as follows:

public static byte[] URL_TAIL = {0x79, 0x58, (byte)0x81, (byte)0xF4, 0x3B, 0x1D, 0x7F, 0x48, (byte)0xAF, 0x2C, (byte)0x82, 0x5D, (byte)0xC4, (byte)0x85, 0x27, 0x63, 0x00, 0x00, 0x00, 0x00, (byte)0xA5, (byte)0xAB, 0x00, 0x00};

The tail of a file link is always as follows:

public static byte[] FILE_TAIL = {(byte)0xFF, (byte)0xFF, (byte)0xAD, (byte)0xDE, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};

The link moniker are always same, too.

public static byte[] STD_MONIKER = {(byte)0xD0, (byte)0xC9, (byte)0xEA, 0x79, (byte)0xF9, (byte)0xBA, (byte)0xCE, 0x11, (byte)0x8C, (byte)0x82, 0x00, (byte)0xAA, 0x00, 0x4B, (byte)0xA9, 0x0B };public static byte[] URL_MONIKER = {(byte)0xE0, (byte)0xC9, (byte)0xEA, 0x79, (byte)0xF9, (byte)0xBA, (byte)0xCE, 0x11, (byte)0x8C, (byte)0x82, 0x00, (byte)0xAA, 0x00, 0x4B, (byte)0xA9, 0x0B };public static byte[] FILE_MONIKER = { 0x03, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, (byte)0xC0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46 };

The bug exists in the HyperlinkRecord constructor with only one parameter RecordInputStream. Here is the code:

if (Arrays.Equals(URL_MONIKER, moniker)){ int len = in1.ReadInt();  address = in1.ReadUnicodeLEString( len /2);  tail = in1.ReadRemainder();}

The code reads “tail data” after reading the address. However, the length readed with in1.ReadInt() is the length including the tail data. So the correct length should be (len-tail_length)/2. The correct code is as follows:

if (Arrays.Equals(URL_MONIKER, moniker)){ int len = in1.ReadInt();  address = in1.ReadUnicodeLEString( (len - URL_TAIL.Length)/2); //minus the length of tail  tail = in1.ReadRemainder();}