Capping IT Off

MegaUpload, an Information Management cloud

This post is not about the legal aspects of the recent MegaUpload events. Its goal is to present a high-level, technical view of how MegaUpload was designed.

As you know, MegaUpload was a collection of websites allowing users to upload and download any file. Its information system was a purely cloud-based solution: offering services, storing petabytes of data and billing were all done using cloud services. As Information Management was the heart of these cloud services, I think it is interesting to investigate it from a technical point of view.

Based on the MegaUpload indictment, some figures and some technical facts are really interesting.

Hosting capacity:

Carpathia Hosting is a North American cloud provider based in Virginia. MegaUpload was a Carpathia customer, renting about 25 petabytes of storage space and about a thousand servers (half of them physically located in the USA). That works out to an average of 25 terabytes of storage per server.

Leaseweb is based in the Netherlands and provided more than 600 servers to MegaUpload, spread across the Netherlands, Belgium, Germany and the US. Cogent Communications provided 36 servers in the US and in France. I have no information about the storage capacity they provided.

Compression techniques:

As most of the stored files were already highly compressed (video or audio formats), every time a new file was uploaded to MegaUpload, a unique identifier was generated through an MD5 hash calculation to determine whether the same file had been previously stored, so the same file was never stored twice (or more). Note that MegaUpload gave the new upload a different URL from the one given to the initial upload, so this compression by avoiding twin files was fully transparent to the users.
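To make the idea concrete, here is a minimal sketch of this kind of content-addressed de-duplication in Python. It uses simple in-memory dictionaries for the file store and the URL index; the function names, URL format and storage paths are my own illustration under those assumptions, not MegaUpload's actual implementation.

```python
import hashlib
import secrets

# Hypothetical in-memory stores (a real system would use databases and object storage).
stored_files = {}   # content hash -> storage path of the single stored copy
url_index = {}      # public download URL -> content hash

def upload(file_bytes: bytes) -> str:
    """Store the content only once, but hand every uploader a fresh URL."""
    digest = hashlib.md5(file_bytes).hexdigest()
    if digest not in stored_files:
        # First time this content is seen: persist one copy only.
        stored_files[digest] = f"/storage/{digest}"
    # Every upload gets its own public URL, even when the content is a duplicate,
    # so the de-duplication stays invisible to the user.
    url = f"https://example.invalid/files/{secrets.token_urlsafe(8)}"
    url_index[url] = digest
    return url

def download(url: str) -> str:
    """Resolve any public URL to the single stored copy."""
    return stored_files[url_index[url]]

# Usage: two uploads of the same bytes yield different URLs but one stored copy.
u1 = upload(b"same video bytes")
u2 = upload(b"same video bytes")
assert u1 != u2 and download(u1) == download(u2)
```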

Information Access:

To allow internet users to find the stored files, websites were created describing the content (name, kind of data, photos, names of the actors or singers …). In Business Intelligence we call this generating metadata, and it is part of the presentation layer (like a Business Objects universe or the metadata description in Documentum).

Internet search engines like Google, Bing or Yahoo automatically indexed all this public metadata, allowing users to find the right website and, from there, the file location through a MegaUpload URL.
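As a purely illustrative example, a metadata record published by one of these linking websites and crawled by a search engine might look something like the following; the field names and values are my own assumptions, not MegaUpload's actual data model.

```python
# Hypothetical metadata record exposed on a linking website (illustrative only).
metadata_record = {
    "title": "Example concert recording",          # name of the content
    "content_type": "video",                       # kind of data
    "description": "Live performance, 90 minutes, 720p",
    "performers": ["Example Artist"],              # names of actors or singers
    "download_url": "https://example.invalid/files/AbC123xY",  # the hosted file's URL
}
```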

Billing:

As with all commercial internet websites, billing is a mandatory component. There was indirect billing through advertisements on the MegaUpload websites, and direct billing offered to end users, who could improve their upload and download capacity (bandwidth and number of simultaneous downloads) with a subscription.

Apparently subscriptions represented 75% of the revenue, whilst advertisement was about 25%.

The final billing relationship (payment) with customers, providers and advertisers was once again handled purely in the cloud, managed through PayPal, Moneybookers, AdBrite or PartyGaming.

High level architecture design:

Megaupload Information Management Architecture

If you look at this simple high-level architecture design, it looks like a public cloud-based information management solution, with an integration layer, a storage layer, a presentation layer, a usage layer and, as it was an e-commerce website, a billing layer.

As you can see, MegaUpload was a rather primitive cloud information system. I'm sure there were also BI tools for internal usage (analyzing click streams and customer activity), but I have no precise information about that; maybe future posts will cover this area.

About the author

Manuel Sevilla
4 Comments
I'm not quite sure if compression is really the right term. What they did sounds more like de-duplication.
I also don't see how supplying two different URLs for the same content would lead to more transparency.
Supplying the same URL with the message that this content had already been shared would do this somewhat more.
But of course individual URLs would enable more tracking possibilities.

Please note that I have no personal experience with MegaUpload, so I'm basing this purely on your content.
msevilla:
Technically, I agree with you; it is much more de-duplication than compression. But in the end, whether you do zip compression, columnar compression or de-duplication, the goals are the same: lowering storage costs and/or reducing I/O. I usually regroup all of that under a single term: compression.
About transparency, I mean that whether you are uploading a new file or a file which has previously been uploaded is transparent to you: you always get your own URL (I was not referring to how transparent MegaUpload itself was, only to the fact that the user is not aware of the de-duplication technique; it is transparent to him). And of course, multiple URLs for the same content improve tracking possibilities but are also good for Search Engine Optimization: as Google doesn't know it is the same file, it thinks these are different contents, so the same file may appear multiple times in the Google index, decreasing the visibility of your competitors.
Thank you for the clarification. I now understand what you meant with transparency and I fully agree with the points made.
I wonder though whether MegaUpload just grew and a design was formed based on best-practices and decisions made at that precise moment, or if it was really well-designed. I'd like to compare this to Facebook where it was initially intended for Harvard University, and later on additions were made.
However, they are great examples of how an idea and solution can grow, adjust to the situation at hand (flexibility), and still be (or work towards being) a well-designed system.
msevilla:
Very good question :)
I guess their first version was already in the cloud and designed to be scalable, even if some functions (like maybe the de-duplication and the collection of websites) were implemented later.
Scalability has to be a key architecture driver when building an Information Management solution. If the system is not designed to be scalable, then there comes a point where you may need to rebuild it (partially or fully). As you've noticed, only unusual growth and budget (like Facebook's) may allow a full rebuild :)
