Friday 8th November, 2013
12:45pm to 1:30pm
Ad Networks act as the middlemen between advertisers and publishers on the Internet. The advertiser is the agent that wants to place a particular ad in different media. The publisher is the agent who owns those media, which are usually web pages or mobile applications.
Each time an ad is shown on a web page or in a mobile application, an impression event is generated. These impressions and other events feed the analytical panels that the agents (advertisers and publishers) use to analyze the performance of their campaigns or web pages.
Presenting these panels to the agents is a technical challenge because Ad Networks have to deal with billions of events each day and have to present interactive panels to thousands of agents. The scale of the problem requires the use of distributed tools. Hadoop is an obvious candidate, with its storage and computing capacity: it can be used to precompute several statistics that are later presented in the panels.
But that is not enough for the agents. To perform exploratory analytics, they need an interactive panel that lets them filter by a particular web page, country and device within a particular time frame, or by any other ad-hoc filter.
Therefore, something more is needed than Hadoop, which only stores the data and performs some statistical precomputations. At Datasalt, we have addressed this problem for some clients and we have found a solution that will be presented in the talk.
The solution includes two modules: the off-line and the on-line.
The off-line module is in charge of storing the received events and performing the most costly operations: cleaning the dataset, performing some aggregations in order to reduce the size of the data, and creating the file structures that will be used later to serve the on-line analytics. All these tasks are handled well by Hadoop. The most innovative part of this process is the last step, where the file structures are created and exported to the on-line part in order to serve the analytical panels.
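To make the cleaning and aggregation steps more concrete, here is a minimal single-machine sketch in Python. It is only an illustration of the idea, not the Hadoop job used in the real system: the event fields (`type`, `publisher`, `page`, `country`, `device`, `ts`) and the hourly granularity are assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(ts: float) -> str:
    """Truncate a Unix timestamp to the start of its hour (UTC)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:00")


def rollup(events):
    """Collapse raw events into one impression count per
    (publisher, page, country, device, hour) combination."""
    counts = Counter()
    for e in events:
        # Cleaning step (assumed rule): keep impression events only.
        if e.get("type") != "impression":
            continue
        key = (e["publisher"], e["page"], e["country"], e["device"],
               hour_bucket(e["ts"]))
        counts[key] += 1
    return counts


# Hypothetical raw events; field names are illustrative only.
events = [
    {"type": "impression", "publisher": "pub1", "page": "/home",
     "country": "ES", "device": "mobile", "ts": 1383914700},
    {"type": "impression", "publisher": "pub1", "page": "/home",
     "country": "ES", "device": "mobile", "ts": 1383915000},
    {"type": "click", "publisher": "pub1", "page": "/home",
     "country": "ES", "device": "mobile", "ts": 1383915010},
    {"type": "impression", "publisher": "pub1", "page": "/home",
     "country": "ES", "device": "mobile", "ts": 1383918400},
]

aggregated = rollup(events)
for key, n in sorted(aggregated.items()):
    print(key, n)
```

The aggregated rows are much smaller than the raw event log, yet they keep every dimension that the panels may later filter on.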
The on-line module is in charge of serving the analytical queries received from the agents' panel webapp. The queries are basic statistics (count, count distinct, stdev, sum, etc.) run over a subset of the input dataset selected by an ad-hoc filter. The challenge here is that the system has to serve statistics for filters “on the fly”. That makes it impossible to precalculate everything on the off-line side, so part of the calculations must be done on demand. That would not be a problem if the scale of the data were not so large. Some kind of scalable database is needed for this task.
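As an illustration of such an on-demand query, the sketch below runs a sum and a count distinct over pre-aggregated rows restricted by an ad-hoc filter. It uses Python's built-in `sqlite3` module as a stand-in for the serving database; the `stats` table, its columns and the `query_stats` helper are hypothetical names chosen for the example.

```python
import sqlite3

# Columns that a panel user is allowed to filter on (assumed schema).
ALLOWED_FILTERS = {"page", "country", "device"}


def query_stats(conn, filters, start, end):
    """Return (total impressions, distinct pages) for rows matching the
    ad-hoc filters within the half-open time range [start, end)."""
    clauses = ["hour >= ?", "hour < ?"]
    params = [start, end]
    for column, value in filters.items():
        # Column names cannot be bound as parameters, so whitelist them.
        if column not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter column: {column}")
        clauses.append(f"{column} = ?")
        params.append(value)
    sql = (
        "SELECT COALESCE(SUM(impressions), 0), COUNT(DISTINCT page) "
        "FROM stats WHERE " + " AND ".join(clauses)
    )
    return conn.execute(sql, params).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stats (page TEXT, country TEXT, device TEXT, "
    "hour TEXT, impressions INTEGER)"
)
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?, ?, ?)",
    [
        ("/home", "ES", "mobile", "2013-11-08 12:00", 120),
        ("/home", "ES", "desktop", "2013-11-08 12:00", 80),
        ("/news", "FR", "mobile", "2013-11-08 13:00", 50),
        ("/home", "ES", "mobile", "2013-11-09 09:00", 30),
    ],
)

print(query_stats(conn, {"country": "ES", "device": "mobile"},
                  "2013-11-08 00:00", "2013-11-09 00:00"))
```

Because the filter is only known at request time, the aggregation over the matching rows has to happen when the query arrives, which is the part that cannot be moved to the off-line side.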
Datasalt has developed the open-source distributed database Splout SQL (http://sploutsql.com/), which makes it possible to create SQL materialized views over Hadoop data. It scales through partitioning.
Splout SQL is the perfect tool for serving on-line statistics at scale for Ad Networks, as the data can be partitioned by agent. In this way, the queries of an individual agent can be confined to a single partition.
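The idea of partitioning by agent can be sketched as follows. This is not Splout SQL's API: the hash-based routing, the number of partitions and the helper names are assumptions made for illustration, with in-memory SQLite databases standing in for the partitions.

```python
import sqlite3
import zlib

NUM_PARTITIONS = 4  # illustrative value


def partition_for(agent_id: str) -> int:
    """Map an agent to a partition with a hash that is stable across runs."""
    return zlib.crc32(agent_id.encode("utf-8")) % NUM_PARTITIONS


# One independent database per partition.
partitions = [sqlite3.connect(":memory:") for _ in range(NUM_PARTITIONS)]
for db in partitions:
    db.execute("CREATE TABLE stats (agent TEXT, page TEXT, impressions INTEGER)")


def load(rows):
    """Write every row into the partition that owns its agent."""
    for agent, page, impressions in rows:
        partitions[partition_for(agent)].execute(
            "INSERT INTO stats VALUES (?, ?, ?)", (agent, page, impressions)
        )


def total_impressions(agent: str) -> int:
    """Answer a per-agent query by touching only that agent's partition."""
    db = partitions[partition_for(agent)]
    row = db.execute(
        "SELECT COALESCE(SUM(impressions), 0) FROM stats WHERE agent = ?",
        (agent,),
    ).fetchone()
    return row[0]


load([
    ("pub1", "/home", 120),
    ("pub1", "/news", 80),
    ("pub2", "/home", 50),
])

print(total_impressions("pub1"))
```

Since all the rows of an agent live in one partition, a query never needs to fan out across the cluster, and capacity can grow by adding partitions.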
In the talk, we will sketch the architecture of the whole system. Specifically, we will talk about how Splout SQL was designed and why it is useful for the case of Ad Networks. Some other techniques that are needed for the problem, like sampling and in-memory storage, will be discussed as well.
- Learning the difficulties of building “Google Analytics” clones.
- Understanding the problems that Ad Networks face.
- Discovering new ways of serving Hadoop results.