Wednesday, August 17, 2011

Hyperlink stream in Excel File

After I reviewed the code in HyperLink Record in POI, I notice that a small bug exists while reading the url. To fix the bug, I tried to find some document about it but faild. The Hyperlink stream is undocumented in the Microsoft Excel binary format document (though I don’t know the reason).

To help others understand it well, I’d like to summarize it in this post.

//link to http://www.lakings.com/ 
byte[] data1 = { 
0x02, 0x00, //First row of the hyperlink 0x02, 0x00, //Last row of the hyperlink 0x00, 0x00, //First column of the hyperlink 0x00, 0x00, //Last column of the hyperlink //16-byte GUID. Seems to be always the same. Does not depend on the hyperlink type (byte)0xD0, (byte)0xC9, (byte)0xEA, 0x79, (byte)0xF9, (byte)0xBA, (byte)0xCE, 0x11, (byte)0x8C, (byte)0x82, 0x00, (byte)0xAA, 0x00, 0x4B, (byte)0xA9, 0x0B,
0x02, 0x00, 0x00, 0x00, //integer, always 2 // flags. Define the type of the hyperlink: // HyperlinkRecord.HLINK_URL | HyperlinkRecord.HLINK_ABS | HyperlinkRecord.HLINK_LABEL 0x17, 0x00, 0x00, 0x00,
0x08, 0x00, 0x00, 0x00, //length of the label including the trailing '' //label: 0x4D, 0x00, 0x79, 0x00, 0x20, 0x00, 0x4C, 0x00, 0x69, 0x00, 0x6E, 0x00, 0x6B, 0x00, 0x00, 0x00, //16-byte link moniker: HyperlinkRecord.URL_MONIKER (byte)0xE0, (byte)0xC9, (byte)0xEA, 0x79, (byte)0xF9, (byte)0xBA, (byte)0xCE, 0x11, (byte)0x8C, (byte)0x82, 0x00, (byte)0xAA, 0x00, 0x4B, (byte)0xA9, 0x0B, //count of bytes in the address including the tail 0x48, 0x00, 0x00, 0x00, //integer //the actual link, terminated by '\u0000' 0x68, 0x00, 0x74, 0x00, 0x74, 0x00, 0x70, 0x00, 0x3A, 0x00, 0x2F, 0x00, 0x2F, 0x00, 0x77, 0x00, 0x77, 0x00, 0x77, 0x00, 0x2E, 0x00, 0x6C, 0x00, 0x61, 0x00, 0x6B, 0x00, 0x69, 0x00, 0x6E, 0x00, 0x67, 0x00, 0x73, 0x00, 0x2E, 0x00, 0x63, 0x00, 0x6F, 0x00, 0x6D, 0x00, 0x2F, 0x00, 0x00, 0x00, //standard 24-byte tail of a URL link. Seems to always be the same for all URL HLINKs 0x79, 0x58, (byte)0x81, (byte)0xF4, 0x3B, 0x1D, 0x7F, 0x48, (byte)0xAF, 0x2C, (byte)0x82, 0x5D, (byte)0xC4, (byte)0x85, 0x27, 0x63, 0x00, 0x00, 0x00, 0x00, (byte)0xA5, (byte)0xAB, 0x00, 0x00};

I copied the above data of a URL-type hyperlink from TestHyperlinkRecord.cs in NPOI.

begin offset length (bytes) Description
0 2 First row number
2 2 Last row number
4 2 First column number
6 2 Last column number
8 16 GUID
24 4 unknown field, always equals 2
28 4 type of the hyperlink,

HyperlinkRecord.HLINK_URL | HyperlinkRecord.HLINK_ABS | HyperlinkRecord.HLINK_LABEL

32 4 length of label including the trailing ‘’
36 variable1 label string
36+variable1 16 link moniker
52+variable1 4 count of bytes in the address including the tail
56+variable1 variable2 the actual link terminated by ‘\u0000’
56+variable1+variable2 24 standard 24-byte tail of a URL link. it seems to be always same

The tail of a url link is always as follows:

public static byte[] URL_TAIL = {0x79, 0x58, (byte)0x81, (byte)0xF4, 0x3B, 0x1D, 0x7F, 0x48, (byte)0xAF, 0x2C,
 (byte)0x82, 0x5D, (byte)0xC4, (byte)0x85, 0x27, 0x63, 0x00, 0x00, 0x00,
 0x00, (byte)0xA5, (byte)0xAB, 0x00, 0x00};

The tail of a file link is always as follows:

public static byte[] FILE_TAIL = {(byte)0xFF, (byte)0xFF, (byte)0xAD, (byte)0xDE, 0x00, 0x00, 0x00, 0x00,
 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};

The link moniker are always same, too.

public static byte[] STD_MONIKER = {(byte)0xD0, (byte)0xC9, (byte)0xEA, 0x79, (byte)0xF9, (byte)0xBA, (byte)0xCE, 0x11,
 (byte)0x8C, (byte)0x82, 0x00, (byte)0xAA, 0x00, 0x4B, (byte)0xA9, 0x0B };
public static byte[] URL_MONIKER = {(byte)0xE0, (byte)0xC9, (byte)0xEA, 0x79, (byte)0xF9, (byte)0xBA, (byte)0xCE, 0x11,
 (byte)0x8C, (byte)0x82, 0x00, (byte)0xAA, 0x00, 0x4B, (byte)0xA9, 0x0B };
public static byte[] FILE_MONIKER = { 0x03, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, (byte)0xC0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46 };

The bug exists in the HyperlinkRecord constructor with only one parameter RecordInputStream. Here is the code:

if (Arrays.Equals(URL_MONIKER, moniker))
{
 int len = in1.ReadInt();
 
 address = in1.ReadUnicodeLEString( len /2);
 
 tail = in1.ReadRemainder();
}

The code reads “tail data” after reading the address. However, the length readed with in1.ReadInt() is the length including the tail data. So the correct length should be (len-tail_length)/2. The correct code is as follows:

if (Arrays.Equals(URL_MONIKER, moniker))
{
 int len = in1.ReadInt();
 
 address = in1.ReadUnicodeLEString( (len - URL_TAIL.Length)/2); //minus the length of tail
 
 tail = in1.ReadRemainder();
}

No comments:

Post a Comment