Monday, December 26, 2011

Towards an Auction based internet


The post below was quoted and discussed extensively in GigaOM, 14 Jan 2012 – Software Defined Networks could create an auction-based bazaar.

Published in Telecom Asia, Jan 13, 2012 - Towards an auction-based internet
Are we headed towards an auction-based internet? This train of thought (no pun intended), which struck me while I was travelling from Chennai to Bangalore last evening, is a synthesis of different ideas and technologies that I have read about in the recent past.

The current state of technology and the technology trends do seem to indicate such a possibility. An auction-based internet would be a business model in which bandwidth is allocated to different data traffic on the internet based on dynamic bidding by different network elements. Such an eventuality is a distinct possibility considering the economics and latencies involved in data transfer, the evolution of the smart grid concept and the emergence of the promising technology known as the OpenFlow protocol. This is elaborated below.

Firstly, in the book “Grids, Clouds and Virtualization”, Massimo Cafaro and Giovanni Aloisio highlight a typical problem of today's computing infrastructure. The authors contend that a key issue in large scale computing is data affinity, which is the result of the dual issues of data latency and the economics of data transfer. They quote Jim Gray (Turing Award, 1998), whose paper “Distributed Computing Economics” states that programs need to be migrated to the data on which they operate rather than transferring large amounts of data to the programs. This is in fact what the Hadoop paradigm does: the principle of locality is maintained by keeping the programs close to the data on which they operate.

The book highlights another interesting fact. It says that the “cheapest and fastest way to move a Terabyte cross country is sneakernet” (i.e. the transfer of electronic information, especially computer files, by physically carrying removable media such as magnetic tape, compact discs, DVDs, USB flash drives, or external drives from one computer to another). Google used sneakernet to transfer 120 TB of data. The SETI@home project also used sneakernet to transfer data recorded by the telescope in Arecibo, Puerto Rico, and stored on magnetic tapes, to Berkeley, California.
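To get a feel for why sneakernet can win, here is a rough back-of-the-envelope sketch in Python. The link speeds and the one-day courier time are purely my own assumptions for illustration; they do not come from the book or from Google.

```python
def transfer_hours(terabytes: float, link_mbps: float) -> float:
    """Hours needed to push `terabytes` of data over a link of `link_mbps` megabits/s."""
    bits = terabytes * 1e12 * 8          # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6)   # megabits/s -> bits/s
    return seconds / 3600

# Hypothetical figures, for illustration only.
COURIER_HOURS = 24.0                     # assumed overnight shipment of disks

for mbps in (100, 1000):
    hours = transfer_hours(120, mbps)
    winner = "sneakernet" if COURIER_HOURS < hours else "network"
    print(f"120 TB at {mbps} Mbps: {hours:.0f} hours over the wire -> {winner} wins")
```

Even on a dedicated gigabit link, the network transfer in this sketch takes days, while a box of disks arrives the next morning.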

It is now well known that mobile and fixed-line data traffic has exploded, clogging the internet. YouTube, video downloads and other streaming data choke the data pipes of the internet, and Service Providers have not found a good way to monetize this data explosion. While there has been tremendous advancement in CPU processing power (CPU horsepower in the range of petaflops) and enormous increases in storage capacity (of the order of petabytes), coupled with dropping prices, there has been no corresponding drop in bandwidth prices relative to bandwidth capacity.

Secondly, in the book “Hot, Flat, and Crowded”, Thomas L. Friedman describes the “Smart Homes” of the future in which all the home appliances will have sensors and will participate in the energy auction in real time as a part of the Smart Grid. The price of energy in the Energy Grid fluctuates like stock prices since enterprises are bidding for energy during the day. In his Smart Home, Friedman envisions a situation in which the washing machine will turn on during off-peak hours when the price of energy in the grid is low. In this way all the appliances in the homes of the future will minimize energy costs by adjusting their cycles accordingly.

Why could the internet not behave in a similar fashion? The internet pipes get crowded at different periods of the day, during certain seasons and during popular sporting events. Why can we not have an intelligent network in place in which the price of different data transfer rates varies depending on the time of the day, the type of traffic and the quality of service required? Could the internet be based on an auction mechanism in which different devices bid for bandwidth based on the urgency, speed and quality of service required? Is this possible with the routers and switches of today?

The answer is yes. This can be achieved by the new, path-breaking innovation known as Software Defined Networks (SDNs) based on the OpenFlow protocol. SDN is the result of pioneering effort by Stanford University and the University of California, Berkeley, and represents a paradigm shift in the way networking elements operate. Do read my post Software Defined Networks : A glimpse of tomorrow for a more detailed look at SDNs.
SDNs can be made to dynamically route traffic flows based on decisions in real time. The flow of data packets through the network can be controlled in a programmatic manner through the OpenFlow protocol. In order to dynamically allocate smaller or fatter pipes for different flows, it is necessary for the logic in the Flow Controller to be updated dynamically based on the bid price.

For example, we could assume that a corporate has 3 different flows, namely immediate, ASAP, and “when the price is below $x”. For the immediate traffic, the OpenFlow controller will allocate a flow right away, subject to the upper ceiling the corporate has set for its bid price. For the ASAP flow, the corporate would have requested that the flow be arranged when the bid price falls within a range $a – $b, and the OpenFlow controller will arrange the flow as soon as the price enters that range. The last type of traffic, which is not important, will be sent during non-peak hours when the price drops below $x. This requires that the OpenFlow controller be able to allocate different flows dynamically based on which bids win the auction. The current protocols of the internet, namely RSVP and DiffServ, allocate pipes based on the traffic type and class, and the allocation is static once made. With the auction scheme, OpenFlow can instead adjust the traffic flows dynamically based on the bid price currently prevailing in that part of the network.
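The bidding idea above can be sketched in a few lines of Python. This is only a toy model of the decision logic a controller might run; it does not use any real OpenFlow API, and the names (FlowRequest, allocate, reserve_price) and all the numbers are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class FlowRequest:
    name: str
    max_bid: float       # highest price per Mbps the sender is willing to pay
    demand_mbps: float   # bandwidth requested

def allocate(requests, capacity_mbps, reserve_price):
    """Grant bandwidth to the highest bidders first; return {flow name: granted Mbps}."""
    grants = {}
    remaining = capacity_mbps
    for req in sorted(requests, key=lambda r: r.max_bid, reverse=True):
        if req.max_bid < reserve_price or remaining <= 0:
            grants[req.name] = 0.0       # flow waits for a cheaper, quieter period
            continue
        granted = min(req.demand_mbps, remaining)
        grants[req.name] = granted
        remaining -= granted
    return grants

# Hypothetical bids for the three kinds of corporate traffic.
requests = [
    FlowRequest("immediate", max_bid=10.0, demand_mbps=60),
    FlowRequest("asap", max_bid=5.0, demand_mbps=60),
    FlowRequest("bulk", max_bid=1.0, demand_mbps=60),
]

peak = allocate(requests, capacity_mbps=100, reserve_price=2.0)
off_peak = allocate([requests[2]], capacity_mbps=100, reserve_price=0.5)
print(peak)      # bulk traffic is priced out at peak time
print(off_peak)  # bulk traffic gets its pipe when the price falls
```

In this sketch the reserve price stands in for the bid price prevailing in that part of the network: as it rises and falls, the same set of requests yields different allocations, which is exactly the dynamic behaviour that static class-based schemes lack.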

The ability of the OpenFlow protocol to dynamically allocate different flows could solve the problem of monetizing mobile and fixed-line data. Users can decide the type of service they are interested in and choose appropriately. This will be a win-win for both the Service Providers and the consumer. The Service Provider will be able to get an ROI on the infrastructure based on the traffic flowing through its network. The consumer, rather than paying a fixed access charge, could pay a smaller charge when bandwidth usage is low.

An auction-based internet is not just a possibility but would also be a worthwhile business model to pursue. The ability to route traffic dynamically based on an auction mechanism enables the internet infrastructure to be utilized optimally. It serves the dual purpose of easing traffic congestion, since the highest bidders get the pipe, and of monetizing data traffic based on its importance to the end user.

An auction-based internet is a very distinct possibility in our future given the promise of the OpenFlow protocol.

All  thoughts, ideas or counter opinions are welcome!

Thursday, December 15, 2011

The story of virtualization

The journey from the early days of batch processing to these days of virtualized computing has been a truly exciting march of progress. The innovations and ideas have transformed the computing landscape as we know it, and they promise still more breathtaking changes to come.

Batch processing: Programs written for the computers of those days used punch cards, also known as Hollerith cards. A separate terminal would be used to create and edit the program, which would result in a stack of punched cards. The different stacks of user programs would be loaded into a card reader, which would then queue the programs for processing by the computer. Each program would be executed in sequential order.

Imagine if our days were structured sequentially, where we would need a particular task to complete fully before we could start another one. That would be a true waste of time. While each task progresses we could focus on other tasks.

The inefficiencies of batch processing soon became obvious and led to the development of multi-tasked systems in which each user's application is granted a slice of the CPU cycles. The Operating System (OS) would cycle through the list of processes, granting each of them a specific number of cycles to compute every time. Soon this development led to different operating systems including Windows, Unix, Linux and so on.
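The idea of cycling through processes and granting each a slice of CPU time can be illustrated with a toy round-robin scheduler in Python. This is a simplified sketch, not how any real OS is implemented; the job names and work units are made up for the example.

```python
from collections import deque

def round_robin(jobs, time_slice):
    """Run jobs in turn, `time_slice` units at a time; return the order in which they finish."""
    queue = deque(jobs.items())          # (name, remaining work) pairs
    finished = []
    while queue:
        name, remaining = queue.popleft()
        remaining -= time_slice          # the job uses its slice of the CPU
        if remaining > 0:
            queue.append((name, remaining))  # not done yet: back of the queue
        else:
            finished.append(name)
    return finished

# Hypothetical jobs with the units of CPU work each one needs.
order = round_robin({"payroll": 5, "report": 2, "backup": 8}, time_slice=3)
print(order)
```

Notice that the short job finishes first even though it was submitted second. In a batch system it would have had to wait for the whole payroll run to complete.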

Multitasking: Multitasking evolved because designers realized that the Central Processing Unit (CPU) cycles were wasted when programs waited for input/output to arrive or complete. Hence the computer's operating system (OS), its central nervous system, would swap the waiting user's program out of the CPU and grant the CPU to other user applications. This way the CPU is utilized efficiently.

The pen analogy : For this analogy let us consider a fountain pen to be the CPU. While Joe is writing a document, he uses the fountain pen. Now, let's assume that Joe needs to print a document. While Joe saunters off to pick up his printout, the fountain pen is given to Kartik who needs to write his tax report. Kartik soon gets tired and takes a coffee break. Now the pen is given to Jane who needs to fill up a form. When Jane completes her form the pen is handed back to Joe who has just returned with his printout. The pen (CPU) is thus used efficiently among the many users.

While multi-tasking was a major breakthrough, it did lead to an organization's applications being developed in different OS flavors. Hence a large organization would be left with software silos, each with its own unique OS. This was a problem when the organization wanted to consolidate all its relevant software under a common umbrella. For example, a telecom operator may have payroll applications that run on Windows, accounting on Linux and human resources on Unix. It thus became difficult for the organization to get a holistic view of what happened in the Finance department as a whole. Enter 'virtualization'. Virtualization enables applications created for different OSes to run over a layer known as the “hypervisor” that abstracts the raw hardware.

Virtualization: Virtualization in essence abstracts the raw hardware through a software layer called the Hypervisor. The Hypervisor runs directly on the bare metal of the machine. Applications that run over the Hypervisor can use the operating system of their choice, namely Windows, Linux, Unix etc. The Hypervisor effectively translates the different OS instructions to the machine instructions of the underlying processor.

The car analogy: Imagine that you got into a car. Once inside the car you had a button which, when pressed, would convert the car into either a roaring Ferrari or Lamborghini, or a smooth Mercedes or BMW. The dashboard, the seats and the engine all magically transform into the car of your dreams. This is exactly what virtualization tries to achieve.

However, virtualization went further than just enabling applications created on different OSes to run on a single server loaded with the hypervisor. Virtualization also enabled consolidation of server farms. Virtualization brings together the different elements of an enterprise, namely the servers each with its memory and processors, the different storage options (direct attached storage (DAS), Fibre Channel storage area network (FC SAN), network attached storage (NAS)) and the networking elements. Virtualization consolidates the compute, storage and networking elements and provides an illusion in which appropriate compute, storage and network are provided to applications on demand. The applications are provided with virtual machines with the necessary computing, storage and network units as required. Virtualization also takes care of providing high availability (HA), mobility and security to the applications besides enabling an illusion of shared resources. Moreover, if any of the servers on which an application is executing goes down for any reason, the application is migrated seamlessly to another server.

The train analogy: Assume that there is a train with 'n' wagons. Commuters can get on and get off at any station. When they get on the train they are automatically allocated a seat, a berth and so on. The train keeps track of how occupied it is and provides the appropriate seating dynamically. If the wheels of any wagon get stuck, the passengers are lifted and shifted, seamlessly, to another wagon while the stuck wagon is safely de-linked from the train.

Virtualization has many applications. It is the dominant technology used in the creation of public, private or hybrid clouds, providing an on-demand, scalable computing environment. Virtualization is also used in the consolidation of server farms, enabling optimum usage of the servers.

Wednesday, December 7, 2011

Big Data - Getting bigger!


 Published in Telecom Asia - Big Data is getting bigger
There are two very significant ways that our world has changed in the past decade. Firstly, we are more “connected”. Secondly, we are “awash with data”. On a planet with 7 billion people there are now 2 billion PCs and upward of 6 billion mobile connections. Besides the connections that we as human beings have, there are now numerous connections to the internet from devices, sensors and actuators. In other words the world is getting more and more instrumented. There are in excess of 30 billion RFID tags which enable tracking of goods as they move from warehouse to retail store, sensors on cars and bridges, and cardiac implants in the human body, all constantly sending a stream of data to the network (do look at my post “The Internet of Things”). In addition we have the emergence of the Smart Grid with its millions and millions of smart meters that are capable of sensing power loads, redistributing power appropriately and drawing less power during peak hours.

All these devices, be they laptops, cell phones, sensors, RFIDs or smart meters, are sending enormous amounts of data to the network. In other words there is an enormous data overload happening in the networks of today. According to a Cisco report the projected increase in data traffic between 2014 and 2015 is of the order of 200 exabytes (1 exabyte = 10^18 bytes). In addition the report states that the total number of devices connected to the network will be twice the world population, or around 15 billion.

Fortunately the explosion in data has been accompanied by falling prices in storage and extraordinary increases in processing capacity. The data generated by the devices, cell phones, PCs etc. is by itself useless. However, if processed, it can provide insights into trends and patterns which can be used to make key decisions. For example, the data exhaust that comes from a user's browsing trail or click stream provides important insight into user behavior which can be mined to make important decisions. Similarly, inputs from social media like Twitter and Facebook provide businesses with key inputs which can be used for making business decisions. Call Detail Records that are created for mobile calls can also be a source of insight into user behavior. Data from retail stores provides insights into consumer choices. For all this to happen, the enormous amounts of data have to be analyzed using algorithms to determine statistical trends, patterns and tendencies in the data.

It is here that Big Data enters the picture. Big Data enables the management of the 3 V's of data, namely volume, velocity and variety. As mentioned above, the volume of data is growing at an exponential rate and should exceed 200 exabytes by 2015. The rate at which the data is generated, or the velocity, is also growing phenomenally given the variety and the number of devices that are connected to the network. Besides, there is a tremendous variety to the data. Data can be structured, semi-structured or unstructured. Logs could be in plain text, CSV, XML, JSON and so on. Big Data techniques are designed around these 3 V's, which makes them well suited for crunching this enormous proliferation of data at the velocity at which it is generated.

Big Data : Big Data or Analytics (see my post “The Rise of Analytics”) deals with the algorithms that analyze petabytes (1 petabyte = 10^15 bytes) of data and identify key patterns in them. The patterns so identified can be used to make important predictions about the future. For example, Big Data has been used by energy companies in identifying key locations for positioning their wind turbines. To identify the precise location requires that petabytes of data be crunched rapidly and appropriate patterns be identified. There are several applications of Big Data, ranging from identifying brand sentiment in social media, to customer behavior from click exhaust, to identifying optimal power usage by consumers.

The key differences between Big Data and traditional processing methods are the volume of data that has to be processed and the speed with which it has to be processed. As mentioned before, the 3 V's of volume, velocity and variety make traditional methods unsuitable for handling this data. In this context, besides the key algorithms of analytics, another player is extremely important in Big Data: Hadoop. Hadoop is a processing technique that involves tremendous parallelization of the task (for details look at To Hadoop, or not to Hadoop).

The Hadoop Ecosystem – Hadoop has its origins in Google's work on the Google File System (GFS) and the Map Reduce programming paradigm.

HDFS and Map-Reduce : Hadoop in essence is the Hadoop Distributed File System (HDFS) and the Map Reduce paradigm. The Hadoop System is made up of thousands of distributed commodity servers. The data is stored in the HDFS in blocks of 64 MB or 128 MB. The data is replicated among two or more servers to maintain redundancy. Since Hadoop is made of regular commodity servers which are prone to failures, fault tolerance is included by design. The Map Reduce paradigm essentially breaks a job into multiple tasks which are executed in parallel. Initially the “Map” part processes the input data and outputs a set of key-value pairs. The “Reduce” part then scans these pairs and generates a consolidated output. For example, the “map” part could count the number of occurrences of different words in different sets of files and output the words and their counts as pairs. The “reduce” part would then sum up the counts of each word from the individual “map” parts and provide the total occurrences of the words across multiple files.
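The word count example can be mimicked in plain Python to show the shape of the two phases. This is a single-machine sketch of the idea only; it does not use Hadoop or any of its APIs, and the function names and sample text are my own.

```python
from collections import defaultdict

def map_words(text):
    """Map step: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Two hypothetical "files"; in Hadoop each would be handled by a separate map task.
files = ["big data big insights", "data beats opinion"]
mapped = [pair for text in files for pair in map_words(text)]
totals = reduce_counts(mapped)
print(totals)
```

In a real cluster the map calls would run in parallel on the servers that hold the data blocks, and the framework would group the pairs by word before handing them to the reducers.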

Pig and PigLatin : Pig is a programming platform developed at Yahoo to relieve programmers of the intricacies of programming Map-Reduce and assigning tasks to the individual parts. Pig is made up of two parts, namely PigLatin, the language, and the environment in which PigLatin programs execute.

Hive: Hive is a Hadoop run-time support structure that was developed by Facebook. Hive has a distinct SQL flavor to it and also simplifies the task of Hadoop programming.

JAQL : JAQL is a declarative query language developed by IBM for handling JSON objects. JAQL is another programming paradigm that is used to program Hadoop.

Conclusion: It is a foregone conclusion that Big Data and Hadoop will take center stage in the not too distant future, given the explosion of data and the pressing need to glean useful business insights from it. Big Data and its algorithms provide the way to identify useful pearls of wisdom in otherwise useless data. Big Data is bound to become mission critical in the enterprises of the future.