
Posts tagged ‘mapreduce’


August 3rd, 2011

Meetup – Big Data #3

by Amrinder

OK, here at Big Data #3, the third meetup of the big data meetup group, and I will be live blogging.  There are two main presentations: Joey Echevarria from Cloudera presenting on HBase, and Ted Dunning presenting on MapR.

Given the explosion in analytical requirements and the limitations of traditional RDBMS-based solutions, most systems are moving toward big data, and HBase and MapReduce are key components to grasp.

Key points from Joey’s presentation

  • Column families
  • Table regions

Reasons for using HBase – variable schema in each record, and row access to each column family.
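To make the variable-schema point a bit more concrete, here is a minimal Python sketch of the data model as I understand it: column families are fixed per table, while the columns (qualifiers) inside a family can differ from row to row. ToyTable and its methods are names I made up for illustration; this is an in-memory toy, not the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for an HBase-style table (illustration only)."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        # row key -> column family -> qualifier -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        # Qualifiers are free-form, so each row can carry different columns.
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get_family(self, row_key, family):
        """Return every qualifier stored for one row within one family."""
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.get(row_key)
        if row is None:
            return {}
        return dict(row.get(family, {}))


# Hypothetical usage: two rows with different columns in the same family.
table = ToyTable(["profile", "events"])
table.put("user1", "profile", "name", "Ann")
table.put("user1", "profile", "city", "Oslo")
table.put("user2", "profile", "name", "Bob")
table.put("user2", "events", "last_click", "2011-08-03")
```

Reading one family for one row, for example table.get_family("user1", "profile"), only touches that family's data, which is the "row access to each column family" idea in miniature.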

HBase Applications

  1. LILY, OpenTSDB
  2. Real-time ad optimizations – capturing impressions and serving ads.  HBase front-end, and HBase back-end. User model is about 40 attributes.
  3. Click stream sessionization
  4. Mozilla Socorro – to gather Firefox crashes (which, going by my recent experience, happen a lot ;-) )
  5. Navteq – Location based content serving
  6. Cloudera – Gathers data about customer clusters, where each customer node is a key with Avro values

Key Points from Ted Dunning’s talk

Some Motivations for MapR system
  • Read-only files assumption, which doesn’t hold in enterprise setting
  • Shuffle was based on HTTP
MapR Improvements (Changes to things that exist in Hadoop)
  • Faster file system, with fewer copies, multiple NICs, NO file descriptor or page-buf competition
  • Faster map reduce – Direct RPC to receiver, and very wide merges
MapR Innovations (Things that don’t exist in Hadoop)
  • Volumes
  • Read/write random access file system that allows distributed meta-data
  • Application (framework) level NIC bonding, instead of switch level  (Q/A: I asked what is really the benefit, considering that performance is not likely to be changed.  As per Ted, the benefit here is on the virtualization of RPC receivers.  So, essentially, the main innovation here is the abstraction.  This idea of abstraction is very similar to how NTELX’s RTS transaction handler engines scale in the PREDICT system.)
  • MapR Containers - Containers are about 16-32 GB.  Each container can hold up to 1B files and directories.  100 M containers = ~ 2 Exabytes.  25GB to cache all containers for 2EB cluster
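As a quick sanity check on the container numbers above, the arithmetic can be worked through in a few lines of Python. I am assuming decimal units here (1 GB = 10^9 bytes, 1 EB = 10^18 bytes); the per-container cache figure is just my own division of the quoted numbers, not something stated in the talk.

```python
GB = 10**9   # assumption: decimal gigabytes
EB = 10**18  # assumption: decimal exabytes

containers = 100 * 10**6  # 100 M containers

# Total capacity at the low and high end of the 16-32 GB container size.
low_total = containers * 16 * GB    # 1.6 * 10**18 bytes
high_total = containers * 32 * GB   # 3.2 * 10**18 bytes

# Cache needed per container if 25 GB covers all of them.
cache_bytes_per_container = (25 * GB) // containers

print(low_total / EB, high_total / EB, cache_bytes_per_container)
```

So the quoted ~2 EB sits inside the 1.6 to 3.2 EB range, and the 25 GB cache works out to roughly 250 bytes of location metadata per container under these assumptions.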
MapR’s Streaming Performance
MapR seems to be about twice as fast as Hadoop at reading and writing, and about twice as fast on Terasort.



August 24th, 2010

Google’s universal search gives non-deterministic answers? (Perhaps due to MapReduce?)

by Amrinder

One of the innovations at Google was the launch of the universal search a couple of years ago.  While it was considered a drastic change by outsiders (something that can fundamentally change the user experience), the search giant was able to roll it out without much fuss, and pretty much all users are very familiar with it by now.  You search for “Elvis” and you can see books about Elvis, blog posts, images, regular web pages, all interspersed using that magical ranking that made the search engine the king (no pun intended).

However, sometimes the universal search gives different results just a few seconds apart.  Here, consider the first try for RYN:

Google Search RYN - Try 1


Now, let us try the same search again:

Google Search RYN - Try 2


So, sometimes the stock results are shown, and sometimes not.  You can try this behavior yourself, by clicking on this search a few times: RYN.  I can’t say how many times you might have to try it, but chances are, you will be able to replicate this behavior easily.

Now, the universal search likely uses the MapReduce paradigm (I am entering purely speculative mode here, so be forewarned).  Say the map function for a search term returns a list of search categories, which are then farmed out to worker machines to process.  Some worker machines may not return to the master in time, and in the reduce phase, the master may only be putting together the results that it received in time (ordering them using the search rank and such).

So, in case the worker does not return the results for the search term from the “finance” search category in time, the master is not able to include those results.

Also, by repeatedly trying out the search, I observe that this non-deterministic behavior manifests mostly for the finance and image search categories.  One would imagine that Google has a check in place so that if the “core webpage” search category worker has not returned, the results are not considered valid, and the master must wait for it.

That is enough speculation for a day, so I guess I will just wait until someone with REAL knowledge of how Google’s universal search works can enlighten everyone on this matter.
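To illustrate the kind of behavior I am speculating about, here is a minimal Python sketch of a master that fans a search term out to per-category workers, collects whatever arrives before a deadline, and always waits for one required category. Everything here (search_category, universal_search, the category names, the simulated delays) is hypothetical and only mirrors the guess above; it says nothing about how Google actually does it.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def search_category(category, term, delay):
    """Simulated worker: pretend to search one category, taking `delay` seconds."""
    time.sleep(delay)
    return f"{category} results for {term}"


def universal_search(term, delays, deadline, required="web"):
    """Fan out to one worker per category and merge what returns in time.

    `delays` maps each category to a simulated latency in seconds.
    Categories that miss the deadline are dropped, except `required`,
    which the master always waits for.
    """
    pool = ThreadPoolExecutor(max_workers=len(delays))
    futures = {
        pool.submit(search_category, category, term, delay): category
        for category, delay in delays.items()
    }
    # Collect only the workers that finished before the deadline.
    done, _ = wait(futures, timeout=deadline)
    results = {futures[f]: f.result() for f in done}
    # The required category is never skipped: block until it is available.
    if required not in results:
        required_future = next(f for f, c in futures.items() if c == required)
        results[required] = required_future.result()
    pool.shutdown(wait=False)  # do not block on stragglers
    return results


# Hypothetical run: the finance worker is slow, so it misses the deadline.
results = universal_search(
    "RYN", {"web": 0.01, "images": 0.01, "finance": 1.0}, deadline=0.2
)
print(sorted(results))
```

With real workers the latencies vary from request to request, so the same query could include the finance results on one try and omit them on the next, which is the behavior seen in the two screenshots.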

Oh, and btw, the original MapReduce paper does address non-determinism, but only as:

When the map and/or reduce operators are nondeterministic, we provide weaker but still reasonable semantics. In the presence of non-deterministic operators, the output of a particular reduce task R1 is equivalent to the output for R1 produced by a sequential execution of the non-deterministic program.

Clearly, this non-determinism is not directly related to the end-user non-determinism that we refer to here.


