Mar 2 2011

Wasps (actually bees) and Asteroids

Category: Coding | tonyenkiducx @ 08:17

So I saw the brilliant JS browser code by Eric that adds an Asteroids-type game to a web page, and I thought I would make my own and add to his a little. Click the link below to start adding some wasps (actually bees) to my site. You can also right-click the link and add it as a bookmark, and it can then be used on any website the same way as Eric's. When I'd finished it, I dumped a load of wasps (actually bees) on a website and fired up Eric's code as well, and you can actually hunt down and kill the bees! Awesome coincidence.

 

Bookmark me


Feb 12 2011

Compressing strings in memory - .NET

Category: Coding | tonyenkiducx @ 08:50

I came across an interesting problem recently while writing something of a web spider. The data it had to produce was pretty comprehensive and involved cross-page statistics, and the way the information was structured unfortunately meant I had to load pretty much the entire HTML content of a website into memory. That worked fine for a lot of sites, since it's hardly a lot of data, but when we came to some of the massive sites (90,000+ pages) there was a severe memory problem, with the spider sometimes ballooning up to 4GB of RAM before a crash occurred.

So there were a limited number of solutions, and in the end the only viable one seemed to be in-memory compression of the text content. Processor usage was not a huge concern: we were barely spiking to 50% usage on 3 cores of a quad Xeon processor.

After some digging around for zip components, I came across the gzip support built into .NET 3.5+ itself (the GZipStream class in System.IO.Compression), and the decision was an easy one. Just for reference, at its most extreme our service was running up to 4GB of RAM and dying after processing the complete HTML for about 72,000 web pages. After using this compression on the strings, we went down to just over 500MB of RAM to complete the processing of all 90,000+ pages.

Now, for some code.

First, a function to take a string and compress it into a byte array.

 

 

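        ' Requires: Imports System.IO and Imports System.IO.Compression
        Public Shared Function Compress(ByVal data As String) As Byte()
            ' Nothing or empty input is stored as a one-byte placeholder,
            ' which Decompress recognises and turns back into "".
            If String.IsNullOrEmpty(data) Then
                Return New Byte(0) {}
            End If
            Dim strByteArray As Byte() = StrToByteArray(data)
            Using output As New MemoryStream()
                ' leaveOpen = True keeps the MemoryStream usable after the GZipStream is closed.
                Using gzip As New GZipStream(output, CompressionMode.Compress, True)
                    gzip.Write(strByteArray, 0, strByteArray.Length)
                End Using ' closing the GZipStream flushes the final compressed block
                Return output.ToArray()
            End Using
        End Function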

 

 

There's nothing too amazing in there, just a standard conversion from a string to a byte array, and then a memory stream to receive the output of the gzip stream. It uses a helper function to convert the string to a byte array, which I'll post at the bottom. Next, decompress it.

 

 

        Public Shared Function Decompress(ByVal data As Byte()) As String
            ' OrElse short-circuits, so data.Length is never read when data is Nothing.
            ' A one-byte array is the placeholder Compress returns for empty input.
            If data Is Nothing OrElse data.Length <= 1 Then
                Return ""
            End If
            Using input As New MemoryStream(data)
                Using gzip As New GZipStream(input, CompressionMode.Decompress)
                    Using output As New MemoryStream()
                        ' Copy the decompressed bytes out in chunks until the stream is exhausted.
                        Dim buff As Byte() = New Byte(4095) {}
                        Dim read As Integer = gzip.Read(buff, 0, buff.Length)
                        While read > 0
                            output.Write(buff, 0, read)
                            read = gzip.Read(buff, 0, buff.Length)
                        End While
                        ' Decode with the same encoding StrToByteArray used.
                        Return System.Text.Encoding.UTF8.GetString(output.ToArray())
                    End Using
                End Using
            End Using
        End Function

 

 

This is a little more fiddly, because we have to read the decompressed stream out through a buffer, but it's still all simple stuff. It's worth noting that this function does not protect against bad data, so only pass it byte arrays that came out of Compress. And last but not least, the string to byte array function.

 

 

        Public Shared Function StrToByteArray(ByVal str As String) As Byte()
            Dim encoding As New System.Text.UTF8Encoding()
            Return encoding.GetBytes(str)
        End Function

 

 

All done. If you want to integrate this into an existing bit of code, you can wrap the functions in a property, like this.

 

 

Private pageContentCompressed As Byte() = New Byte(0) {}

Public Property pageContent() As String
  Get
    Return CommonFunctions.Decompress(pageContentCompressed)
  End Get
  Set(ByVal value As String)
    pageContentCompressed = CommonFunctions.Compress(value)
  End Set
End Property

 

 

 

C# code follows below.

 

 

// Requires: using System.IO; using System.IO.Compression;
public static byte[] Compress(string data)
{
    // Null or empty input is stored as a one-byte placeholder,
    // which Decompress recognises and turns back into "".
    if (string.IsNullOrEmpty(data)) {
        return new byte[1];
    }
    byte[] strByteArray = StrToByteArray(data);
    using (MemoryStream output = new MemoryStream()) {
        // leaveOpen = true keeps the MemoryStream usable after the GZipStream is closed.
        using (GZipStream gzip = new GZipStream(output, CompressionMode.Compress, true)) {
            gzip.Write(strByteArray, 0, strByteArray.Length);
        } // disposing the GZipStream flushes the final compressed block
        return output.ToArray();
    }
}

public static string Decompress(byte[] data)
{
    // || short-circuits, so data.Length is never read when data is null.
    if (data == null || data.Length <= 1) {
        return "";
    }
    using (MemoryStream input = new MemoryStream(data))
    using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
    using (MemoryStream output = new MemoryStream()) {
        // Copy the decompressed bytes out in chunks until the stream is exhausted.
        byte[] buff = new byte[4096];
        int read = gzip.Read(buff, 0, buff.Length);
        while (read > 0) {
            output.Write(buff, 0, read);
            read = gzip.Read(buff, 0, buff.Length);
        }
        // Decode with the same encoding StrToByteArray used.
        return System.Text.Encoding.UTF8.GetString(output.ToArray());
    }
}

public static byte[] StrToByteArray(string str)
{
    return System.Text.Encoding.UTF8.GetBytes(str);
}

// Wrapping the functions in a property (CommonFunctions is the class that holds them).
private byte[] pageContentCompressed = new byte[1];

public string pageContent {
    get { return CommonFunctions.Decompress(pageContentCompressed); }
    set { pageContentCompressed = CommonFunctions.Compress(value); }
}
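
If you want to see the effect for yourself, here is a rough, self-contained sketch (not the spider's actual code) that round-trips a string through gzip, prints the before and after sizes, and shows one way to guard against bad input. The class name GzipTextSketch, the TryDecompress helper and the sample markup are all made up for illustration, and the ratio you get will depend entirely on your own content.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipTextSketch
{
    // Compress a string to gzip bytes, encoding it as UTF-8 first.
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text ?? string.Empty);
        using (var output = new MemoryStream())
        {
            // leaveOpen: true so the MemoryStream survives disposing the GZipStream;
            // disposing the GZipStream is what writes the final compressed block.
            using (var gzip = new GZipStream(output, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }

    // Hypothetical helper: returns false instead of throwing when the
    // bytes are null, empty, or rejected by GZipStream as invalid data.
    public static bool TryDecompress(byte[] data, out string text)
    {
        text = string.Empty;
        if (data == null || data.Length == 0) return false;
        try
        {
            using (var input = new MemoryStream(data))
            using (var gzip = new GZipStream(input, CompressionMode.Decompress))
            using (var reader = new StreamReader(gzip, Encoding.UTF8))
            {
                text = reader.ReadToEnd();
                return true;
            }
        }
        catch (InvalidDataException)
        {
            text = string.Empty;
            return false;
        }
    }

    public static void Main()
    {
        // Repetitive markup standing in for a crawled page (made-up sample data).
        var sb = new StringBuilder();
        for (int i = 0; i < 1000; i++)
        {
            sb.Append("<li><a href=\"/page/").Append(i).Append("\">Page</a></li>\n");
        }
        string html = sb.ToString();

        byte[] packed = Compress(html);
        string restored;
        bool ok = TryDecompress(packed, out restored);

        Console.WriteLine("UTF-8 bytes:   " + Encoding.UTF8.GetByteCount(html));
        Console.WriteLine("gzip bytes:    " + packed.Length);
        Console.WriteLine("round trip ok: " + (ok && restored == html));

        // Bytes that were never gzip data should be reported, not crash the caller.
        bool garbageOk = TryDecompress(new byte[] { 1, 2, 3, 4 }, out restored);
        Console.WriteLine("garbage accepted: " + garbageOk);
    }
}
```

Bear in mind that .NET holds strings as UTF-16 in memory, so for mostly-ASCII HTML the live string takes roughly twice the UTF-8 byte count printed here, which makes the real saving bigger than the two numbers suggest.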

 

 

 


Feb 12 2011

Regex for URL grabbing from HTML content

Category: Coding | tonyenkiducx @ 08:46

This is something I had to work on recently, which I won't go into, but the project had to grab the href attributes from A tags inside a mass of HTML, with a 100% success rate no matter how poorly structured the HTML was. I think this pretty much covers it, except for one small case: if you don't use quotes on your href, AND you make up your own attributes for the A tag, AND you have actual spaces in the href, it may end up missing the end of the URL off. Otherwise, bullet-proof.

 "href+ ?=+ ?(?:(?:(?:""|')(.+?)(?:""|'))|(.+?)(?: class ?=| onclick ?=| id ?=| accesskey ?=| dir ?=| ltr ?=| lang ?=| style ?=| tabindex ?=| title ?=| onblur ?=| ondblclick ?=| onfocus ?=| onmousedown ?=| onmousemove ?=| onmouseout ?=| onmouseover ?=| onmouseup ?=| onkeydown ?=| onkeypress ?=| onkeyup ?=|>))"

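For anyone wanting to try the pattern out, here is a minimal C# sketch of how it might be applied. The pattern is the one above, pasted as a verbatim string (the doubled "" is just an escaped quote) and split across lines only for readability. Capture group 1 holds a quoted href and group 2 an unquoted one. The HrefExtractor class, the ExtractHrefs method and the choice of RegexOptions.IgnoreCase (so that HREF in capitals is picked up too) are my assumptions for the example, not part of the original project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // The pattern from the post, as a C# verbatim string; "" is an escaped quote.
    private static readonly Regex HrefPattern = new Regex(
        @"href+ ?=+ ?(?:(?:(?:""|')(.+?)(?:""|'))|(.+?)(?: class ?=| onclick ?=| id ?=" +
        @"| accesskey ?=| dir ?=| ltr ?=| lang ?=| style ?=| tabindex ?=| title ?=" +
        @"| onblur ?=| ondblclick ?=| onfocus ?=| onmousedown ?=| onmousemove ?=" +
        @"| onmouseout ?=| onmouseover ?=| onmouseup ?=| onkeydown ?=| onkeypress ?=" +
        @"| onkeyup ?=|>))",
        RegexOptions.IgnoreCase);

    // Returns every href value found in the HTML, in document order.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            // Group 1 captured a quoted value, group 2 an unquoted one.
            string url = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
            urls.Add(url.Trim());
        }
        return urls;
    }

    public static void Main()
    {
        // Made-up sample markup covering double quotes, single quotes,
        // no quotes followed by another attribute, and upper-case tags.
        string html =
            "<a href=\"http://example.com/a\">A</a> " +
            "<a href='/b' class=\"x\">B</a> " +
            "<a href=/c/d.html class=x>C</a> " +
            "<A HREF=/e>E</A>";

        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Run against that sample, it should list http://example.com/a, /b, /c/d.html and /e, one per line.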