I came across an interesting problem recently while writing something of a web spider. The data it had to produce was pretty comprehensive and involved cross-page statistics, and the way the information was structured unfortunately meant I had to load pretty much the entire HTML content of a website into memory. That worked fine for a lot of sites, since it's hardly a lot of data, but when we came to some of the massive sites (90,000+ pages) there was a severe memory problem, with the spider sometimes ballooning up to 4GB of RAM before a crash occurred.
There were a limited number of solutions, and in the end the only viable one seemed to be in-memory compression of the text content. Processor usage was not a huge concern: we were barely spiking to 50% usage on 3 cores of a quad Xeon processor.
After some digging around for zip components, I came across the GZip functions built into .NET 3.5+ itself (GZipStream, in the System.IO.Compression namespace), and the decision was an easy one. Just for reference, at its most extreme our service was running up to 4GB of RAM and dying after processing the complete HTML for about 72,000 webpages. After using this compression functionality on the strings, we went down to just over 500MB of RAM used to complete the processing of all 90,000+ pages.
Now, for some code.
First, a function to take a string and compress it into a byte array:
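' Needs Imports System.IO and Imports System.IO.Compression at the top of the file.
Public Shared Function Compress(ByVal data As String) As Byte()
    ' Empty input is stored as a one-element array, which Decompress treats as "empty".
    If String.IsNullOrEmpty(data) Then
        Return New Byte(0) {}
    End If
    Dim strByteArray As Byte() = StrToByteArray(data)
    Dim output As New MemoryStream()
    ' leaveOpen = True keeps the MemoryStream usable after the GZipStream is closed.
    Dim gzip As New GZipStream(output, CompressionMode.Compress, True)
    gzip.Write(strByteArray, 0, strByteArray.Length)
    ' Closing flushes the remaining compressed bytes and the GZip footer into output.
    gzip.Close()
    Return output.ToArray()
End Function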
There's nothing too amazing in there: just a standard conversion from a string to a byte array, and then a memory stream to receive the output of the GZip stream. There is a helper function in there to do the string to byte array conversion, which I'll post at the bottom. Note that an empty string comes back as a one-element array, which the decompress function treats as "empty". Next, decompress it.
Public Shared Function Decompress(ByVal data As Byte()) As String
    ' OrElse short-circuits, so a Nothing array never reaches the Length check.
    If data Is Nothing OrElse data.Length <= 1 Then
        Return ""
    End If
    Dim input As New MemoryStream(data)
    Dim gzip As New GZipStream(input, CompressionMode.Decompress, True)
    Dim output As New MemoryStream()
    ' Copy the decompressed bytes across in 4KB chunks.
    Dim buff As Byte() = New Byte(4095) {}
    Dim read As Integer = gzip.Read(buff, 0, buff.Length)
    While read > 0
        output.Write(buff, 0, read)
        read = gzip.Read(buff, 0, buff.Length)
    End While
    gzip.Close()
    ' Decode as UTF-8 to match StrToByteArray.
    output.Position = 0
    Dim strRead As New StreamReader(output, System.Text.Encoding.UTF8)
    Return strRead.ReadToEnd()
End Function
This is a little more fiddly, because we have to read the decompressed stream out through a buffer, but it's still all simple stuff. It's worth noting that this function does little validation of its input, and bytes that aren't valid GZip data will make it throw, so check your inputs before you pass them in. And last but not least, the string to byte array function.
Public Shared Function StrToByteArray(ByVal str As String) As Byte()
    ' UTF-8 keeps plain ASCII markup at one byte per character before compression.
    Return System.Text.Encoding.UTF8.GetBytes(str)
End Function
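All done. If you want to integrate this into an existing bit of code, you can wrap the functions in a property, like this. Bear in mind that every read of the property decompresses the content again, so copy it into a local variable if you need it more than once.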
Private pageContentCompressed As Byte() = New Byte(0) {}
Public Property pageContent() As String
    Get
        Return CommonFunctions.Decompress(pageContentCompressed)
    End Get
    Set(ByVal value As String)
        pageContentCompressed = CommonFunctions.Compress(value)
    End Set
End Property
The C# code follows below.
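// Needs using System.IO; and using System.IO.Compression; at the top of the file.
public static byte[] Compress(string data)
{
    // Empty input is stored as a one-element array, which Decompress treats as "empty".
    if (string.IsNullOrEmpty(data)) {
        return new byte[1];
    }
    byte[] strByteArray = StrToByteArray(data);
    MemoryStream output = new MemoryStream();
    // leaveOpen = true keeps the MemoryStream usable after the GZipStream is closed.
    GZipStream gzip = new GZipStream(output, CompressionMode.Compress, true);
    gzip.Write(strByteArray, 0, strByteArray.Length);
    // Closing flushes the remaining compressed bytes and the GZip footer into output.
    gzip.Close();
    return output.ToArray();
}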
public static string Decompress(byte[] data)
{
    // || short-circuits, so a null array never reaches the Length check.
    if (data == null || data.Length <= 1) {
        return "";
    }
    MemoryStream input = new MemoryStream(data);
    GZipStream gzip = new GZipStream(input, CompressionMode.Decompress, true);
    MemoryStream output = new MemoryStream();
    // Copy the decompressed bytes across in 4KB chunks.
    byte[] buff = new byte[4096];
    int read = gzip.Read(buff, 0, buff.Length);
    while (read > 0) {
        output.Write(buff, 0, read);
        read = gzip.Read(buff, 0, buff.Length);
    }
    gzip.Close();
    // Decode as UTF-8 to match StrToByteArray.
    output.Position = 0;
    StreamReader strRead = new StreamReader(output, System.Text.Encoding.UTF8);
    return strRead.ReadToEnd();
}
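The caveat about bad input mentioned earlier can be handled with a small wrapper. What follows is only a sketch: `SafeGzip` and `TryDecompress` are hypothetical names that are not part of the original code, and it assumes you would rather get `false` back than an exception when the bytes turn out not to be GZip data. It keeps the one-element "empty" sentinel used by the functions in this post.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class SafeGzip
{
    // Hypothetical helper, not part of the original CommonFunctions class.
    // Returns false instead of throwing when the bytes are not usable GZip data.
    public static bool TryDecompress(byte[] data, out string text)
    {
        text = "";
        if (data == null)
        {
            return false;
        }
        if (data.Length <= 1)
        {
            // The one-element sentinel that Compress returns for empty input.
            return true;
        }
        try
        {
            using (MemoryStream input = new MemoryStream(data))
            using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
            using (StreamReader reader = new StreamReader(gzip, Encoding.UTF8))
            {
                text = reader.ReadToEnd();
                return true;
            }
        }
        catch (InvalidDataException)
        {
            // GZipStream raises this when the header or payload is not valid GZip.
            text = "";
            return false;
        }
    }
}
```

This only catches `InvalidDataException`; anything else (running out of memory on a huge page, for instance) still surfaces as normal.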
public static byte[] StrToByteArray(string str)
{
    // UTF-8 keeps plain ASCII markup at one byte per character before compression.
    return System.Text.Encoding.UTF8.GetBytes(str);
}
private byte[] pageContentCompressed = new byte[1];
public string pageContent {
    get { return CommonFunctions.Decompress(pageContentCompressed); }
    set { pageContentCompressed = CommonFunctions.Compress(value); }
}
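If you want to see what sort of saving to expect before wiring this into your own spider, a quick round-trip check is an easy way to find out. The sketch below is self-contained and uses the same UTF-8 plus GZipStream approach as the functions above. The `CompressionDemo` class, `BuildSamplePage` and `Describe` are made up for illustration, and the generated page is a very repetitive stand-in for real HTML, so treat the ratio it reports as optimistic.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class CompressionDemo
{
    // Same idea as Compress above: UTF-8 encode, then GZip into a MemoryStream.
    public static byte[] Compress(string data)
    {
        byte[] raw = Encoding.UTF8.GetBytes(data);
        using (MemoryStream output = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // The GZipStream is closed here, so output holds the complete GZip data.
            return output.ToArray();
        }
    }

    // Reverse of Compress: un-GZip and decode the bytes as UTF-8.
    public static string Decompress(byte[] data)
    {
        using (MemoryStream input = new MemoryStream(data))
        using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Hypothetical stand-in for a crawled page: a table with many similar rows.
    public static string BuildSamplePage(int rows)
    {
        StringBuilder sb = new StringBuilder("<html><body><table>");
        for (int i = 0; i < rows; i++)
        {
            sb.Append("<tr><td class=\"cell\">Row ").Append(i).Append("</td></tr>");
        }
        sb.Append("</table></body></html>");
        return sb.ToString();
    }

    // Summarises the sizes before and after, and whether the round trip was lossless.
    public static string Describe(string page)
    {
        byte[] packed = Compress(page);
        int utf8Bytes = Encoding.UTF8.GetByteCount(page);
        bool roundTripOk = Decompress(packed) == page;
        return "utf8 bytes: " + utf8Bytes + ", gzip bytes: " + packed.Length
            + ", round trip ok: " + roundTripOk;
    }
}
```

Calling `CompressionDemo.Describe(CompressionDemo.BuildSamplePage(2000))` from a console app gives you the two sizes side by side. Remember that .NET strings are UTF-16 in memory, so a live string costs roughly two bytes per character, and the saving against the original string is larger than the UTF-8 figure suggests.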