Thursday, 28 August 2014

Distributed Crawling

Around 3 months ago, I have posted one article explaining our approach and consideration to build Cloud Application. From this article, I will gradually share our practical design to solve this challenge.

As mentioned before, our final goal is to build a Saas big data analysis application, which will deployed in AWS servers. In order to fulfill this goal, we need to build distributed crawling, indexing and distributed training systems.

The focus of this article is how to build the distributed crawling system. The fancy name for this system will be Black Widow.


As usual, let start with the business requirement for the system. Our goal is to build a scalable crawling system that can be deployed on the cloud. The system should be able to function in an unreliable, high-latency network and can recover automatically from a partial hardware or network failure.

For the first release, the system can crawl from 3 kind of sources, Datasift, Twitter API and Rss feeds. The data crawled back are called Comment. The Rss crawlers suppose to read public sources like website or blog. It is free of charge. DataSift and Twitter both provide proprietary APIs to access their streaming service. Datasift charges its users by comment count and the complexity of CSLD (Curated Stream Definition Language, their own query language). Twitter, in the other hand, offers free Twitter Sampler streaming.

In order to do cost control, we need to implement mechanism to limit the amount of comments crawled from commercial source like Datasift. As Datasift provided Twitter comment, it is possible to have single comment coming from different sources. At the moment, we did not try to eliminate and accept it as data duplication. However, this problem can be eliminated manually by user configuration (avoid choosing both Twitter and Datasift Twitter together).

For future extension, the system should be able to link up related comments to from a conversation.

Food for Thought

Centralized Architecture

Our first thought when getting requirement is to build the crawling on the nodes, which we called Spawn and let the hub, which we called Black Widow to manage the collaboration of effort among nodes. This idea was quickly accepted by team members as it allows the system to scale well with the hub doing limited work.

As any other centralized system, Black Widow suffers from single point of failure problem. To help easing this problem, we allow the node to function independently for a short period after losing connection to Black Widow. This will give the support team a breathing room to bring up backup server.

Another bottle neck in the system is data storage. For the volume of data being crawled (easily reach few thousands records per seconds), NoSQL is clearly the choice for storing the crawled comments. We have experiences working with Lucene and MongoDB. However, after research and some minor experiments, we choose Cassandra as the NoSQL database.

With that few thoughts, we visualize the distributed crawling system to be build following this prototype:

In the diagram above, Black Widow, or the hub is the only server that has access to the SQL database system. This is where we store the configuration for crawling. Therefore, all the Spawns, or crawling nodes are fully stateless. It simply wakes up, registers itself to Black Widow and does the assigned jobs. After getting the comments, the Spawn stores it to Cassandra cluster and also push it to some queues for further processing.

Brainstorming of possible issues

To explain the design to non-technical people, we like to relate the business requirement to a similar problem in real life so that it can be easier to understand. The similar problem we choose would be collaborating of efforts among volunteers.

Imagine if we need to do a lot of preparation work for the upcoming Olympic and decide to recruit volunteers all around the world to help. We do not know volunteers but the volunteers know our email, so they can contact us to register. Only then, we know their emails and may send tasks to them through email. We would not want to send one task to two volunteers or left some tasks unattended. We want to distribute the tasks evenly so that no volunteers are suffering too much.

Due to cost issue, we would not contact them through mobile phone. However, because email is less reliable, when sending out tasks to volunteers, we would request a confirmation. The task is consider assigned only when the volunteer replied with confirmation.

With above example, the volunteers represent Spawn nodes while email communication represent unreliable and high latency network. Here are some problems that we need to solve:

1/ Node failure

For this problems, the best way is to check regularly. If a volunteer stop responding to the regular progress check email, the task should be re-assign to someone else.

2/ Optimization of tasks assigning

Some tasks are related. Therefore assigning related tasks to the same person can help to reduce total effort. This happen with our crawling as well because some crawling configurations have similar search terms, grouping  them together to share the streaming channel will help to reduce final bill.

Another concern is the fairness or ability to distribute the amount of works evenly among volunteers. The simplest strategy we can think of is Round Robin but with a minor tweak by remembering earlier assignments. Therefore, if a task is pretty similar to the tasks we assigned before, the task can be skipped from Round Robin selection and directly assign to the same volunteer.

3/ The hub is not working

If due to some reasons, our email server is down and we cannot contact volunteer any more, it is better to let the volunteers stop working on the assigning tasks. The main concern here is over-running of cost or wasted efforts. However, stopping working immediately is too hasty as temporary infrastructure issue may cause the communication problem.

Hence, we need to find a reasonable amount of time for the node to continue functioning after being detached from the hub.

4/ Cost control

Due to business requirement, there are two kinds of cost control that we need to implement. First is the total of comments being crawled per crawler and second is the total of comments crawled by all crawlers belong to the same user.

This is where we have a debate about the best approach to implement cost control. It is very straight forward to implement the limit for each crawler. We can simply pass this limit to the Spawn node and it will automatically stop the crawler when the limit is reached.

However, for the limit per user, it is not so straight forward and we have two possible approaches. For the simpler choice, we can send all the crawlers of one user to the same node. Then, similar to the earlier problem, the Spawn node knows  the amount of comments collected and stops all crawlers when limit reached. This approach is simple but it limits the ability to distribute jobs evenly among nodes. The alternative approach is to let all the nodes retrieve and update a global counter. This approach creates huge network traffic internally and add considerable delay to comment processing time.

At this point, we temporarily choose the global counter approach. This can be considered again if the performance become a huge concern.

5/ Deploy on the cloud

As any other Cloud application, we can not put too much trust in the network or infrastructure. Here is how we make our application conform to the check-list mentioned in last article:
  • Stateless: Our spawn node is stateless but the hub is not. Therefore, in our design, the nodes do actual work and the hub only collaborates efforts.
  • Idempotence: We implement hashCode and equal methods for every crawler configuration. We store the crawler configurations in the Map or Set. Therefore, the crawler configuration can be sent multiple times without any other side effect. Moreover, our node selection approach ensure that the job will be sent to the same node.
  • Data Access Object: We apply the JsonIgnore filter on every model objects to make sure no confidential data flying around in the network.
  • Play Safe: We implement health-check API for each node and the hub itself. The first level of support will get notified immediately when anything wrong happened.
6/ Recovery

We try our best to make the system heal itself from partial failure. There are some type of failure that we can recover from:
  • Hub failure: Node register itself to the hub when it start up. From then, it is the one way communication when only the hub send jobs to node and also poll for status update. The node is consider detached if it failed to get any contact from Hub for a pre-defined period. If a node is detached, it will clear all the job configurations and start registering itself to the hub again. If the incident is caused by hub failure, a new hub will fetch crawling configurations from database and start distributing jobs again. All the existing jobs on Spawn nodes will be cleared when the Spawn node go to detached mode.
  • Node failure: When hub fail to poll a node, it will do a hard reset by removing all working jobs and re-distribute from beginning again to the working nodes. This re-distribution process help to ensure optimized distribution.
  • Job failure: There are two kind of failures happened when the hub do sending and polling jobs. If a job is failed in the polling process but the Spawn node is still working well, Black Widow can re-assign the job to the same node again. The same thing can be done if the job sending failed. 


Data Source and Subscriber

In the initial thought, each crawler can open it own channel to retrieve data but this does not make sense any more when inspecting further. For Rss, we can scan all URLs once and find out the keywords that may belong to multiple crawlers. For Twitter, it supports up to 200 search terms for one single query. Therefore, it is possible for us to open single channel that serve multiple crawlers. For Datasift, it is quite rare, but due to human mistake or luck, it is possible to have crawlers with identical search terms.

This situation lead us to split out crawler to two entities: subscriber and data source. Subscriber is in charge of consuming the comments while data source is in charge of crawling the comments. With this design, if there are two crawlers with similar keywords, a single data source will be created to serve two subscribers, each processing the comments their own ways.

Data source will be created when and only when no similar data source exist. It starts working when having the first subscriber subscribe to it and retire when the last subscriber unsubscribe from it. With the help of Black Widow to send similar subscribers to the same node, we can minimize the amount of data sources created and indirectly, minimize the crawling cost.

Data Structure

The biggest concern of data structure is Thread Safe issue. In the Spawn node, we must store all running subscribers and data sources in memory. There are a few scenarios that we need to modify or access these data:

  • When a subscriber hit the limit, it automatically unsubscribe from data source, which may lead to deactivation of data source.
  • When Black Widow send a new subscriber to Spawn nodes. 
  • When Black Widow send a request to unsubscribe an existing subscriber. 
  • Health check API expose all running subscribers and data sources. 
  • Black Widow regularly polls the status of each assigned subscriber.
  • The Spawn node regularly checks and disables orphan subscribers (subscriber which is not polled by Black Widow).
Another concern of data structure is idempotence of operations. Any of operation above can be missing or being duplicated. To handle this problem, here is our approach
  • Implement hashCode and equals method for every subscriber and data source. 
  • We choose the Set or Map to store collection of subscribers and data sources. For records with identical hash code, Map will replace the record when there is new insertion but Set will skip the new record. Therefore, if we use Set, we need to ensure new records can replace old record. 
  • We use synchronized in data access code.
  • If Spawn node receive a new subscriber that similar to existing subscriber, it will compare and prefer to update existing subscriber instead of replacing. This avoid the process of unsubscribing and subscribing identical subscribers, which may interrupt data source streaming.

As mentioned before, we need to find a routing mechanism that serve two purposes:
  • Distribute the jobs evenly among Spawn nodes.
  • Route similar jobs to the same nodes.
We solved this problem by generating an unique representation of each query  named uuid. After that, we can use a simple modular function to find out the note to route:

int size = activeBwsNodes.size();
int hashCode = uuid.hashCode();
int index = hashCode % size;
assignedNode = activeBwsNodes.get(index);

With this implementation, subscribers with similar uuid will always be sent to the same node and each node has equals chance of being selected to serve a subscriber. 

This whole practice can be screwed up when there is change to the collection of active Spawn nodes. Therefore, Black Widow must clear up all running jobs and reassign from beginning whenever there is a node change. However, node change should be quite rare in production environment.


Below is the sequence diagram of Black Widow and Node collaboration

Black Widow does not know Spawn node. It wait for the Spawn node to register itself to the Black Widow. From there, Black Widow has the responsibility to poll the node to maintain connectivity. If Black Widow fail to poll a node, it will remove the node from the its container. The orphan node will eventually go to detached mode because it is not being polled any more. In this mode, Spawn node will clear existing jobs and try to register itself again.

The next diagram is the subscriber life-cycle

Similar to above, Black Widow has the responsibility of polling the subscribers it send to Spawn node. If a subscriber is not being polled by Black Widow anymore, Spawn node will treat the subscriber as orphan and remove it. This practice help to eliminate the threat of Spawn node running obsoleted subscriber.

On Black Widow, when a subscriber polling fails, it will try to get a new node to assign the job. If the Spawn node of the subscriber still available, it is likely that the same job will go to the same node again due to our routing mechanism we used.


In a happy scenario, all the subscribers are running, Black Widow is polling and nothing else happen. However, this is not likely to happen in real life. There will be changes in Black Widow and Spawn nodes from time to time, triggered by various events.

For Black Widow, there will be changes under following circumstances:

  • Subscriber hit limit
  • Found new subscriber
  • Existing subscriber disabled by user
  • Polling of subscriber fails
  • Polling of Spawn node fails
To handle changes, Black Widow monitoring tool offers two services: hard reload and soft reload. Hard Reload happen on node change while Soft Reload happen on subscriber change. Hard Reload process takes back all running jobs, redistribute from beginning over available nodes. Soft Reload process removes obsoleted jobs, assigns new jobs and re-assigns failed jobs.

Compare to Black Widow, the monitoring of Spawn node is simpler. The two main concerns are maintaining connectivity to Black Widow and removing orphan subscribers.

Deployment Strategy

The deployment strategy is straight forward. We need to bring up Black Widow and at least one Spawn node. The Spawn node should know the URL of Black Widow. From then, the Health Check API will give use the amount of subscribers per node. We can integrate Health Check with AWS API to automatically bring up new Spawn node if existing nodes are overloaded. The Spawn node image will need to have Spawn application running as service. Similarly, when the nodes are not utilized, we can bring down redundant Spawn nodes.

Black Widow need special treatment due to its importance. If Black Widow fails, we can restart the application. This will cause all existing jobs on Spawn nodes to become orphan and all the Spawn nodes go to detached mode. Slowly, all the nodes will clean up itself and try to register again. Under default configuration, the whole restarting process will happen within 15 minutes.

Threats and possible improvement

When choosing centralized architecture, we know that Black Widow is the biggest risk to the system. While Spawn node failure only causes a minor interruption in the affected subscribers, Black Widow failure finally lead to Spawn nodes restart, which will take much longer time to recover. 

Moreover, even the system can recover from partial, there still be interruption of service in recovery process. Therefore, if the polling requests failed too often due to unstable infrastructure, the operation will be greatly hampered. 

Scalability is another concern for centralized architecture. We have not had a concrete amount of maximum Spawn nodes that the Black Widow can handle. Theoretically, this should be very high because Black Widow only do minor processing, most of its effort are on sending out HTTP requests. It is possible that network is the main limit factor for this architecture. Because of this, we let the Black Widow polling the nodes rather than the nodes polling Black Widow (other people do this, like Hadoop). With this approach, Black Widow may work at its own pace, not under pressure of Spawn nodes.

One of the first question we got is whether it is a Map Reduce problem and the answer is No. Each subscriber in our Distributed Crawling System processes its own comments and does not reporting result back to Black Widow. That why we do not use any Map Reduce product like Hadoop. Our monitor is business logic aware rather than purely infrastructure monitoring, that why we choose to build ourselves over using monitoring tools like Zoo Keeper or AKKA

For future improvement, it is better to walk away from Centralized Architecture by having multiple hubs collaborating with each other. This should not be too difficult provided that the only time Black Widow accessing database is loading subscriber. Therefore, we can slice the data and let each Black Widow load a portion of it. 

Another point that make me feel pretty unsatisfied is the checking of global counter for user limit. As the check happened on every comment crawled, this greatly increases internal network traffic and limit the scalability of system. The better strategy should be divide of quota based on processing speed. Black Widow can regulate and redistribute quota for each subscriber (on different nodes).

Wednesday, 20 August 2014

The Emergence of DevOps and the Fall of the Old Order

Software Engineering has always been dependent on IT operations to take care of the deployment of software to a production environment. In the various roles that I have been in, the role of IT operations has come in various monikers from “Data Center” to “Web Services”. An organisation delivering software used to be able to separate these roles cleanly. Software Engineering and IT Operations were able to work in a somewhat isolated manner, with neither having the need to really know the knowledge that the other hold in their respective domains. Software Engineering would communicate with IT operations through “Deployment Requests”. This is usually done after ensuring that adequate tests have been conducted on their software.
However, the traditional way of organising departments in a software delivery organisation is starting to seem obsolete. The reason is that software infrastructure have moved towards the direction of being “agile”. The same buzzword that had gripped the software development world has started to exert its effect on IT infrastructure. The evidence of this seismic shift is seen in the fastest growing (and disruptive) companies today. Companies like Netflix, Whatsapp and many tech companies have gone into what we would call “cloud” infrastructure that is dominated by Amazon Web Services.
There is huge progress in the virtualization technologies of hardware resources. This have in turn allowed companies like AWS and Rackspace to convert their server farms into discrete units of computing resources that can be diced and parcelled and redistributed as a service to their customers in an efficient manner. It is inevitable that all this configurable “hardware” resources will eventually be some form of “software” resource that can be maximally utilized by businesses. This has in turn bred a whole new genre of skillset that is required to manage, control and deploy these Infrastructure As A Service (IaaS). Some of the tools used by these services include provisioning tools like Chef or Puppet. Together with the software apis provided by the IaaS vendors, infrastructure can be brought up or down as required.
The availability of large quantities of computing resources without all the upfront costs associated with capital expenditures on hardware have led to an explosion in the number of startups trying to solve problems of all kinds imaginable and coupled with the prevalence of powerful mobile devices have led to a digital renaissance for many industries. However, this renaissance has also led to the demand for a different kind of software organisation. As someone who has been part of software engineering and development, I am witness to the rapid evolution of profession.
The increasing scale of data and processing needs requires a complete shift in paradigm from the old software delivery organisation to a new one that melds software engineering and IT operations together. This is where the role of a “DevOps” come into the picture. Recruiting DevOps in an organisation and restructuring the IT operations around such roles enable businesses to be Agile. Some businesses whose survival depends on the availability of their software on the Internet will find it imperative to model their software delivery organisation around DevOps. Having the ability to capitalise on software automation to deploy infrastructure within minutes allows a business to scale up quickly. Being able to practise continuous delivery of software allow features to get into the market quickly and allows a feedback loop in which a business can improve itself.
We are witness to a new world order and software delivery organisations that cannot successfully transition to this Brave New World will find themselves falling behind quickly especially when a competitor is able to scale and deliver software faster, reliably and with less personnel.

Sunday, 3 August 2014

Information is money

When people ask me what am I doing, my immediate response is IT. Even though, the answer is not very specific, it is the easiest to understand and it still helps to describe what we are doing. In fact, it doesn't matter what programming languages we use, our responsibility is to build the information system, which deliver information to end-user. Therefore, we should value information more than anyone else. However, in reality, I feel there are so much wasted information in modern information system.

In this article, I would like to discuss the opportunity to collect user behaviour and measure user happiness when building information system. I also want to share my idea on how to improve user experience based on data collected.

How important is user's behaviour information

Let begin with a story that happened in my earlier career. We need to implement an online betting system for customer, which function similarly to a stock market. In this system, there is no traditional bookmakers like William Hill. Instead, each user can people offer and accept the bets from another. Because it is a mass market with big pool of users, the rate offered is quite accurate and the commission is pretty small. However, the betting system is not our focus today. What capture my intention most is not the  technical aspect of the project, even though it is quite challenging. In stead, I feel interested with the way the system silently but legally make huge amount of profit based on the information it collected.

The system captured the bet history of every user, through that, identify top winners and top losers of each month. Based on that information, the system automatically place the bet follow the winners and against the losers. Can you imagine that you are the only person in the world who know Warren Buffett's activities in real time? Then, it should be quite simple to simulate his performance, even without his knowledge? Needless to say, this hidden feature generated profit at level of hundred thousands dollars every single day.

In the open market, information is everything and we see why the law punish insider trading or any other attempt to gain advantage of information so strictly like that. However, there is no such kind of law for online gambling activity yet and this practice is still legal. That early experience gave me a deep impression on how important is information.

Later, I have interest in applying psychology when dealing with customer. In order to persuade one person or making sale happened, one guy need to observe and understand his client. Relate what I have learnt to the information system that I built before, I feel that it is not so nice to implement a system just only serve as information provider or selling tool. Actually, we do have chance to do much better if we really want.

Website authors know the importance of user experience and they did try their best to collect user information using online survey. However, personally, I feel this approach will never work. I have never answer any survey myself. Any time I saw a popup, it doesn't matter how polite is the words or how beautiful is the design, I will just click on close button.

We should not forget that no matter how important is the user feedback, it is not the user's benefit to answer our survey. In fact, no sale person approach client to ask them to do customer experience survey, unless there is incentive to do it.

Hence, the information is still need to be collected, but in a way that user does not notice it (remember how Google silently monitor anyone using their services?)

How should we use the information?

We should not waste effort collecting information if we even don't know what to do with it. However, this is not something new. Whenever I go to a professional selling site like Amazon, I find it is quite cool that they have managed to use every single piece of information they have to push sale. One time, I went there searching for helmet, next time I saw all the items for a rider like me. They remembers every single item that users have viewed or bought and regularly offer new things based on the data they collected.

Google and Facebook also do similar things. They will try to guess what you like or care about before delivering any ads to you. The million dollars question is can we do any better than this?

I vote yes. It does not means that I do not appreciate the talent and the profession of the product teams in Amazon, Google for Facebook. However, I feel that there is still a distance between these products and an experience sale person. Let imagine there is a real person that sharing desktop view with customer, seeing every mouse-click, movement and keys entered. Given this guy can pause user for a while, so that he can think, analyse and decide what user will see next, what will he do?

It is apparently that the information we collect from user screen cannot compare to the information from a face to face communication, but we have not used up this information yet. Most of the system automatically make the guess that any product that customer clicked on is what he like. A person can do better than that. If an user open the phone in 3 seconds and immediately move to other phones, he may accidentally click on the phone rather than by intention. Moreover, if he spend more time on a phone, keep coming back to it and even open the specification, we can be very sure that this is what he is looking for.

How should we collect the information?

As mentioned above, it will never work if we interrupt users to ask questions. The right mechanism for collecting information must be observation. For all the available solution in the market, I think what is missing in the ability to measure time stamp of events and connecting individual events to form an user journey. Without connecting the dot, there will be no line, without connecting events, there will be no user journey. Without the time stamp, it will be very hard to measure user satisfaction and concern.

Capturing user actions is not very challenging provided that we are the owner of website. Google Analytic can help to capture user actions but it is a bit hard to use in our case because of the limited information that it carry (HTTP GET request). We should understand that this is the only choice that Google Analytic team have because any other kind of requests will be blocked by cross-site scripting prevention.

The better way to carry this information is through HTTP POST request, which can carry the full event object, serialized in JSON format. This is perfectly eligible as the events is sent back to the same domain. To link up the events together, it is best to assign an unique but temporary id for user. We do not need to remember or identify user, therefore, this information may not need to be stored as a persisted cookie on browser. With the temporary id, two separated visits to website by the same user will be logged to 2 different journeys. While it is not optimal, it is still offer some benefits over normal kind of tracking.

If you can persist the cookie on browser or if user login, things will bet more interesting as we can link individual journeys to one.

After this, there come the biggest and most challenging part of the system where you need to figure out one mechanism to optimize customer experiences based on his journey. Unfortunately, this part is too specific for each system that our experience and methods may not be very useful for you at all. However, in general, we can measure user satisfaction and happiness based on the time users spend at each step. We also can figure out user interest by measuring the time spend for each product. From there, please build and optimize your own analysing tool. This is a very challenging but interesting task.

Monday, 14 July 2014

From framework to platform

When I started my career as a Java developer close to 10 years ago, the industry is going through a revolutionary change. Spring framework, which was released in 2003, was quickly gaining ground and became a serious challenger to the bulky J2EE platform. Having gone through the transition time, I quickly found myself in favour of Spring framework instead of J2EE platform, even the earlier versions of Spring are very tedious to declare beans.

What happened next is the revamping of J2EE standard, which was later renamed to Java EE. Still, dominating of this era is the use of opensource framework over the platform proposed by Sun. This practice gives developers full control over the technologies they used but inflating the deployment size. Slowly, when cloud application become the norm for modern applications, I observed the trend of moving the infrastructure service from framework to platform again. However, this time, it is not motivated by Cloud application.

Framework vs Platform

I have never heard of or had to used any framework in school. However, after joining the industry, it is tough to build scalable and configurable software without the help of any framework.

From my understanding, any application is consist of codes that implement business logic and some other codes that are helpers, utilities or to setup infrastructure. The codes that are not related to business logic, being used repetitively in many projects, can be generalised and extracted for reuse.  The output of this extraction process is framework.

To make it shorter, framework is any codes that is not related to business logic but helps to dress common concerns in applications and fit to be reused.

If following this definition then MVC, Dependency Injection, Caching, JDBC Template, ORM are all consider frameworks.

Platform is similar to framework as it also helps to dress common concerns in applications but in contrast to framework, the service is provided outside the application. Therefore, a common service endpoint can serve multiple applications at the same time. The services provided by JEE application server or Amazon Web Services are sample of platforms.

Compare the two approaches, platform is more scalable, easier to use than framework but it also offers less control. Because of these advantage, platform seem to be the better approach to use when we build Cloud Application.

When should we use platform over framework

Moving toward platform does not guarantee that developers will get rid of framework. Rather, platform only complements framework in building applications. However, one some special occasions we have a choice to use platform or framework to achieve final goal.  From my personal opinion, platform is greater that framework when following conditions are matched:
  • Framework is tedious to use and maintain
  • The service has some common information to be shared among instances.
  • Can utilize additional hardware to improve performance.
In office, we still uses Spring framework, Play framework or RoR in our applications and this will not change any time soon. However, to move to Cloud era, we migrated some of our existing products from internal hosting to Amazon EC2 servers. In order to make the best use of Amazon infrastructure and improve software quality, we have done some major refactoring to our current software architecture. 

Here are some platforms that we are integrating our product to:

Amazon Simple Storage Service (Amazon S3) &  Amazon Cloud Front

We found that Amazon Cloud Front is pretty useful to boost average response time for our applications. Previously, we host most of the applications in our internal server farms, which located in UK and US. This lead to noticeable increase in response time for customers in other continents. Fortunately, Amazon has much greater infrastructure with server farms built all around the worlds. That helps to guarantee a constant delivery time for package, no matter customer locations.

Currently, due to manual effort to setup new instance for applications, we feel that the best use for Amazon Cloud Front is with static contents, which we host separately from application in Amazon S3. This practice give us double benefit in performance with more consistent delivery time offered by the CDN plus the separate connection count in browser for the static content.

Amazon Elastic Cache

Caching has never been easy on cluster environment. The word "cluster" means that your object will not be stored and retrieve from system memory. Rather, it was sent and retrieved over the network. This task was quite tricky in the past because developers need to sync the records from one node to another node. Unfortunately, not all caching framework support this feature automatically. Our best framework for distributed caching was Terracotta.

Now, we turned to Amazon Elastic Cache because it is cheap, reliable and save us the huge effort for setting up and maintain distributed cache. It is worth to highlight that distributed caching is never mean to replace local cache. The difference in performance suggest that we should only use distributed caching over local caching when user need to access real-time temporary data.

Event Logging for Data Analytics

In the past, we used Google Analytics for analysing user behaviour but later decided to build internal data warehouse. One of the motivation is the ability to track events from both browsers and servers. The Event Tracking system uses MongoDB as the database as it allow us to quickly store huge amount of events.

To simplify the creation and retrieval of events, we choose JSON as the format for events. We cannot simply send this event directly to event tracking server due to browser prevention of cross-domain attack. For this reason, Google Analytic send the events to server under the form of a GET request for static resource. As we have the full control over how the application was built, we choose to let the events send back to application server first and route to event tracking server later. This approach is much more convenient and powerful.

Knowledge Portal

In the past, applications access data from database or internal file repository. However, to be able to scale better, we gathered all knowledge to build a knowledge portal. We also built query language to retrieve knowledge from this portal. This approach add one additional layer to the knowledge retrieval process but fortunately for us, our system does not need to serve real time data. Therefore, we can utilize caching to improve performance.


Above is some of our experience on transforming software architecture when moving to the Cloud. Please share with us your experience and opinion.

Saturday, 5 July 2014

Common mistakes when using Spring MVC

When I started my career around 10 years ago, Struts MVC is the norm in the market. However, over the years, I observed the Spring MVC slowly gaining popularity. This is not a surprise to me, given the seamless integration of Spring MVC with Spring container and the flexibility and extensibility that it offers.

From my journey with Spring so far, I usually saw people making some common mistakes when configuring Spring framework. This happened more often compare to the time people still used Struts framework. I guess it is the trade off between flexibility and usability. Plus, Spring documentation is full of samples but lack of explanation. To help filling up this gap, this article will try to elaborate and explain 3 common issues that I often see people encounter.

Declare beans in Servlet context definition file

So, everyone of us know that Spring use ContextLoaderListener to load Spring application context. Still, when declaring the DispatcherServlet, we need to create the servlet context definition file with the name "${}-context.xml". Ever wonder why?

Application Context Hierarchy

Not all developers know that Spring application context has hierarchy. Let look at this method


It tells us that Spring Application Context has parent. So, what is this parent for?

If you download the source code and do a quick references search, you should find that Spring Application Context treat parent as its extension. If you do not mind to read code, let I show you one example of the usage in method BeanFactoryUtils.beansOfTypeIncludingAncestors():

if (lbf instanceof HierarchicalBeanFactory) {
    HierarchicalBeanFactory hbf = (HierarchicalBeanFactory) lbf;
    if (hbf.getParentBeanFactory() instanceof ListableBeanFactory) {
 Map parentResult = 
              beansOfTypeIncludingAncestors((ListableBeanFactory) hbf.getParentBeanFactory(), type);
return result;

If you go through the whole method, you will find that Spring Application Context scan to find beans in internal context before searching parent context. With this strategy, effectively, Spring Application Context will do a reverse breadth first search to look for beans.


This is a well known class that every developers should know. It helps to load the Spring application context from a pre-defined context definition file. As it implements ServletContextListener, the Spring application context will be loaded as soon as the web application is loaded. This bring indisputable benefit when loading the Spring container  that contain beans with @PostContruct annotation or batch jobs.

In contrast, any bean define in the servlet context definition file will not be constructed until the servlet is initialized. When does the servlet be initialized? It is indeterministic. In worst case, you may need to wait until users make the first hit to the servlet mapping URL to get the spring context loaded.

With the above information, where should you declare all your precious beans? I feel the best place to do so is the context definition file loaded by ContextLoaderListener and no where else. The trick here is the storage of ApplicationContext as a servlet attribute under the key


Later, DispatcherServlet will load this context from ServletContext and assign it as the parent application context.

protected WebApplicationContext initWebApplicationContext() {
   WebApplicationContext rootContext =

Because of this behaviour, it is highly recommended to create an empty servlet application context definition file and define your beans in the parent context. This will help to avoid duplicating the bean creation when web application is loaded and guarantee that batch jobs are executed immediately.

Theoretically, defining the bean in servlet application context definition file make the bean unique and visible to that servlet only. However, in my 8 years of using Spring, I hardly found any use for this feature except defining Web Service end point.

Declare Log4jConfigListener after ContextLoaderListener

This is a minor bug but it catch you when you do not pay attention to it. Log4jConfigListener is my preferred solution over -Dlog4j.configuration as we can control the log4j loading without altering server bootstrap process.

Obviously, this should be the first listener to be declared in your web.xml. Otherwise, all of your effort to declare proper logging configuration will be wasted.

Duplicated Beans due to mismanagement of bean exploration

In the early day of Spring, developers spent more time typing on xml files than Java classes. For every new bean, we need to declare and wiring the dependencies ourselves, which is clean, neat but very painful. No surprise that later versions of Spring framework evolved toward greater usability. Now a day, developers may only need to declare transaction manager, data source, property source, web service endpoint and leave the rest to component scan and auto-wiring.

I like these new features but this great power need to come with great responsibility; otherwise, thing will be messy quickly. Component Scan and bean declaration in XML files are totally independent. Therefore, it is perfectly possible to have identical beans of the same class in the bean container if the bean are annotated for component scan and declare manually as well. Fortunately, this kind of mistake should only happen with beginners.

The situation get more complicated when we need to integrate some embedded components into the final product. Then we really need a strategy to avoid duplicated bean declaration.

The above diagram show a realistic sample of the kind of problems we face in daily life. Most of the time, a system is composed from multiple components and often, one component serves multiple product. Each application and component has it own beans. In this case, what should be the best way to declare to avoid duplicated bean declaration?

Here is my proposed strategy:

  • Ensure that each component need to start with a dedicated package name. It makes our life easier when we need to do component scan.
  • Don't dictate the team that develop the component on the approach to declare the bean in the component itself (annotation versus xml declaration). It is the responsibility of the developer whom packs the components to final product to ensure no duplicated bean declaration.
  • If there is context definition file packed within the component, give it a package rather than in the root of classpath. It is even better to give it a specific name. For example src/main/resources/spring-core/spring-core-context.xml is way better than src/main/resource/application-context.xml. Imagine what can we do if we pack few components that contains the same file application-context.xml on the identical package!
  • Don't provide any annotation for component scan (@Component, @Service or @Repository) if you already declare the bean in one context file.
  • Split the environment specific bean like data-source, property-source to a separate file and reuse.
  • Do not do component scan on the general package. For example, instead of scanning org.springframework package, it is easier to manage if we scan several sub-packages like org.springframework.core, org.springframework.context, org.springframework.ui,...


I hope you found the above tips useful for your daily usage. If there is any doubt or any other ideas, please help to feedback.

Wednesday, 18 June 2014

How to increase productivity

Unlock productivity is one of the bigger concerns for any person taking management role. However, people rarely agree on the best approaches to improve performance. Over the years, I have observed different managers using the opposite practices to churn out best performance of the team they are managing. Unfortunately, some works and other don't. To be more accurate, what does not increase performance, actually reduce performance.

In this article, I would like to review what I have seen and learnt over the years and share personal view on the best approaches to unlock productivity.

What factors define teams performance?

Let start with analysing what compose a team. Obviously, a team is composed from team members, each has own expertise, strength and weakness. However, the total productivity of the team is not necessarily the total sum of individual productivity. Other factors like team work, process and environment also have major impact to total performance, which can be both positive or negative.

To sum up, the 3 major factors discussed in this article will be technical skills, working process and culture.

Technical Skills

In a factory, we can count the total productivity as sum of individual productivity of each worker, but this simplicity does not apply to IT field. The differences lie in natural of work. Programming until today is still an innovative work, which cannot be automated. In IT industry, nothing is more valuable than innovation and vision. That explains why Japan may be well known for producing high quality car but US is much more famous for producing well known IT company.

Contradict to factory environment, in a software team, developers does not necessarily do or good at the same things. Even if they have graduated from the same school, taking the same job, personal preference and the self studying quickly make developer's skills different again. For the sake of increasing total productivity, this may be a good thing. There is no use for all of member to be competent on the same kind of tasks. As it is too difficult to good at everything, life will be much easier if members of the team can compensate for each other weakness.

This is not easy to improve on technical skills of the team as it take many years for a developer to build up his/her skill set. The fastest way to pump up the team skill sets is to recruit new talent that offer what the team is lack of. That why the popular practice in the industry is to let the team recruit new member themselves. Because of this, the team, which is slowly built over the years normally normally offers a more balance skills set.

While recruitment is a quick and short term solution, the long term solution is to keep the team up to date with latest trends of technology. In this field, if you do not go forward, you go backward. There is no skill set that can be useful forever. One of my colleague even emphasize that upgrading developers's skills is beneficial to the company in the long run. Even if we do not count inflation, it is quite common that the company will offer pay rise after each annual review to retain staffs. If the staff do not acquire new skills, effectively, the company is paying higher price every year for a depreciating asset. It may be a good suggestion for the company to use monetary prize like KPI to motivate self-study and upgrading.

There are a lot of training courses in the industry but it is not necessarily the best method for upgrading skills. Personally, I feel most of the coursework offer more branding value than real life usage. If a developer is keen to learn, there should be quite sufficient knowledge on internet to pick up anything. Therefore, unless for commercial API or product, spending money on monetary prize should be more worthy than on training course.

Another well-known challenge for self-studying is the human natural laziness. There is nothing surprise about it. However, the best way to fight laziness is to find fun in learning new things. This only can be achieved if developers take programming as his hobby more than professional. Even not, it is quite reasonable that one should re-invest effort on his bread and butter tool. One of my friend even argue that if singer/musician take own responsibility in training, programmer should do the same.

Sometimes, we may feel lost due to the huge amount of technologies exposed to us every year. I myself feel that too. My approach for self studying is adding a delay in absorbing concepts and ideas. I try to understand but do not invest too much until the new concepts and ideas are reasonable accepted by the market.

Working Process

Working process can contribute greatly to team performance, positively or negatively. Great developer write great code but he will not be able to do so if wastes to much effort on something not essential. Obviously, when the process is wrong, developers may feel uncomfortable about their daily life. Unhappy developer may not perform his best.

There is no clear guideline to judge if the working process is well defined but people in the environment will feel it right a way if something is wrong. However, it is not as easy to get it right as people who have the right to make decision not necessarily the guys who suffer from bad process. We need an environment with effective feedback channels to improve on working process.

The common pitfall for working process is the lack of result oriented nature. The process is less effective if it is too reporting oriented, attitude oriented or based on some unreal assumptions. To define the process, it may be good if the executive can decide whether he want to build an innovative company or operation oriented company. The samples for former kind is Google, Facebook, Twitter while the latter may be GM, Ford, Toyota. It is not that operation-oriented company cannot innovate but the process was not built with the first priority for innovation. Therefore, the metric for measuring performance may be slightly different, which causes different results in long term. Not all companies in IT fields are innovative company. One counter example is the outsourcing companies or software house in Asia. To encourage innovation, the working process need to focus on people, minimize hassle, maximize collaboration and sharing.

Through my years in the industry with Water Fall, not so Agile and Agile companies, I feel that Agile work quite well for IT fields. It was built based on the right assumptions that software development is innovation work and less predictable compare to other kinds of engineering.

Company Culture

When Steve Job passed away in 2011, I bought his authorized biography by Walter Isaacson. The book clearly explains why Sony failed to keep its competitive edge because of inner competition amongst its departments. Microsoft suffer similar problem due to the controversy stack ranking system that enforce inner competition. I think that IT fields is getting more complicated and we need more collaboration than in the past to implement new ideas.

It is tough to maintain collaboration when your company grow to become an multi-culture MNC. However, it still can be done if management got the right mindset and continuously communicate their visions to the team. As above, the management need to be clear if they want to build an innovative company as it requires a distinct culture, which is more open, and highly motivated.

In silicon valley, office life end up quite late as most of developers are geeks and they love nothing more than coding. However, it is not necessary a good practice as all of us have a family to take care of. It is up to individual to define his/her own work life balance but the requirement is employee fully charged and feel exited whenever he come to office. He must feel that his work is appreciated and he has the support when he need it.


To makes it short, here are the kind of things that management can apply to increase productivity of the team:

  • Let the team involve in the recruitment. Recruit the person who takes programming as hobby.
  • Monetary prize or other kind of encouragements for self-study, self-upgrading.
  • Save money for company sponsored course unless for commercial products.
  • Make sure that the working process result oriented.
  • Apply Agile practices
  • Encourage collaboration, eliminate inner competition.
  • Encourage sharing
  • Encourage feedback.
  • Maintain employee work-life balance and motivation.
  • Make sure employee can find support when he need it.

Saturday, 24 May 2014

Testing effectively

Recently, there is a heaty debate regarding TDD which started by DHH when he claimed that TDD is dead.
This ongoing debate managed to capture the attention of developers world, including us.

Some mini debates have happened in our office regarding the right practices to do testing.

In this article, I will represent my own view.

How many kinds of tests have you seen?

From the time I joined industry, here are the kinds of tests that I have worked on:

  • Unit Test
  • System/Integration/Functional Test
  • Regression Test
  • Test Harness/Load Test
  • Smoke Test/Spider Test
The above test categories are not necessarily mutually exclusive. For example, you can crate a set of automated functional tests or Smoke tests to be used as regression test. For the benefit of newbie, let do a quick review for these old concepts. 

Unit Test

Unit Test aim to test the functional of a unit of code/component. For Java world, unit of code is the class and each Java class suppose to have an unit test. The philosophy of Unit Test is simple. When all the components are working, the system as a whole should work.

A component rarely work alone. Rather, it normally interacts with other components. Therefore, in order to write Unit Test, developers need to mock other components. This is the problem that DHH and James O Coplien criticize Unit Test for huge effort that gain little benefit. 

System/Integration/Functional Test

There is no concrete naming as people often use different terms to describe similar things. Contradict to Unit Test, for functional test, developers aim to test a system function as a whole, which may involve multiple components. 

Normally, for functional test, the data is retrieved and store to the test database. Of course, there should be a pre-step to set-up test data before running. DHH likes this kind of test. It helps developers test all the functions of the system without huge effort to set-up mock object.

Functional test may involve asserting web output. In the past, it is mostly done with htmlUnit but with recent improvement of Selenium Grid, Selenium became the preferred choice.

Regression Test

In this industry, you may end up spend more time maintaining system than developing new one. Software changes all the time and it is hard to avoid risk whenever making changes. Regression Test supposes to capture any defect that caused by changes. 

In the past, software house did have one army of testers but the current trend is automated testing. It means that developers will deliver software with full set of tests that suppose to be broken whenever a function is spoiled. 

Whenever a bug is detected, a new test case should be added to cover new bug. Developers create the test, let it fail, and fix the bug to make it pass. This practice is called Test Driven Development.

Test Harness/Load Test

Normal test case does not capture system performance. Therefore, we need to develop another set of tests for this purpose. In the simplest form, we can set the time out for the functional test that run in continuous integration server. The tricky part is this kind of test is very system dependant and may fail if the system is overloaded. 

The more popular solution is to run load test manually by using profiling tool like JMeter or create our own load test app. 

Smoke Test/Spider Test

Smoke Test and Spider Test are two special kinds of tests that may be more relevant to us. WDS provides KAAS (Knowledge as a Service) for wireless industry. Therefore, our applications are refreshed everyday with data changes rather than business logic changes. It is specific to us that system failure may come from data change rather than business logic. 

Smoke Test are set of pre-defined test cases run on integration server with production data. It helps us to find out any potential issues for the daily LIVE deployment.

Similar to Smoke Test, Spider Test runs with real data but it work like a crawler that randomly click on any link or button available. One of our system contains so many combination of inputs that it is not possible to be tested by human (closed to 100.000 combinations of inputs). 

Our Smoke Test randomly choose some combination of data to test. If it manage to run for a few hours without any defect, we will proceed with our daily/weekly deployment.

The Test Culture in our environment

To make it short, WDS is a TDD temple. If you create the implementation before writing test cases, better be quiet about it. If you look at WDS self introduction, TDD is mentioned only after Agile and XP

"We are:- agile & XP, TDD & pairing, Java & JavaScript, git & continuous deployment, Linux & AWS, Jeans & T-shirts, Tea & cake"

Many high level executives in WDS start their career as developers. That helps to fostering our culture as an engineering-oriented company. Requesting resources to improve test coverage or infrastructure are common here. 

We do not have QA. In worst case, Product Owner or customers detect bugs. In best case, we detect bugs by test cases or by team mates during peer review stage.  

Regarding Singapore office, most of our team members grow up absorbing Ken Beck and Martin Fowler books and philosophy. That why most of them are hardcore TDD worshipers. 

The focus of testing in our working environment did bear fruits. WDS production defects rate is relatively low.

My own experience and personal view with testing

That is enough about self appraisal. Now, let me share my experience about testing.

Generally, Automated Testing works better than QA 

Comparing the output of traditional software house that packed with an army of QA with modern Agile team that deliver fully test coverage products, the latter normally outperform in term of quality and even cost effectiveness. Should QA jobs be extinct soon?

Over monitoring may hint lack of quality

It sounds strange but over the years, I developed insecure feeling whenever I saw a project that have too many layer of monitoring. Over monitoring may hint lack of confidence and in deed, these systems crash very often with unknown reasons. 

Writing test cases takes more time that developing features

DDH is definitely right on this. Writing Test Cases mean that you need to mock input and assert lots of things. Unless you keep writing spaghetti code, developing features take much less times compare to writing tests.

UI Testing with javascript is painful

You know it when you did it. Life is much better if you only need to test Restful API or static html pages. Unfortunately, the trend of modern web application development involve lots of javascripts on client side. For UI Testing, Asynchronous is evil. 

Whether you want to go with full control testing framework like htmlUnit or using a more practical, generic one like Selenium, it will be a great surprise for me if you never encounter random failures. 

I guess every developer know the feeling of failing to get the build pass at the end of the week due to random failure test cases.

Developers always over-estimate their software quality

It is applicable to me as well because I am an optimistic person. We tend to think that our implementation is perfect until the tests failed or someone help to point out a bug.

Sometimes, we change our code to make writing test cases easier

Want it or not, we must agree with DHH on this point. Pertaining to Java world, I have seen people exposing internal variable, creating dummy wrapper for framework object (like HttpSession, HttpRequest,...) so that it is easier to write Unit Test. DHH find it so uncomfortable that he chose to walk way from Unit Test.

On this part, I half agree and half disagree with him. From my own view, altering design, implementation for the sake of testing is not favourable. It is better if developers can write the code without any concern of mocking input.

However, aborting Unit Testing for the sake of having a simple and convenient life is too extreme. The right solution should be designing the system is such a way that business logic is not so tight-coupling with framework or infrastructure. 

This is what called Domain Driven Design.

Domain Driven Design

For newbie, Domain Driven Design give us a system with following layers.

If you notice, the above diagram has more abstract layers than Rails or the Java adoption of Rails, Play framework. I understand that creating more abstract layers can cause bloated system but for DDD, it is a reasonable compromise.  

Let elaborate further on the content of each layer:


This layer is where you store your repository implementation or any other environment specific concerns. For infrastructure, keep the API as simple, dummy as possible and avoid having any business logic implemented here. 

For this layer, Unit Test is a joke. If there is any thing to write, it should be integration test, which working with real database.


Domain layer is the most important layer. It contains all system business logics without any framework, infrastructure, environment concern. Your implementation should look like a direct translation of user requirements. Any input, output, parameter are POJO only. 

Domain layer should be the first layer to be implemented. To fully complete the logic, you may need interface/API of the infrastructure layer. It is best practice to keep the API in Domain Layer and concrete implementation in Infrastructure layer. 

The best kind of test cases for Domain layer is Unit Test as your concern is not the system UI or environment. Therefore, it helps developers to avoid doing dirty works of mocking framework object. 

For mocking internal state of object, my preferred choice is using Reflection utility to setup object rather than exposing internal variables through setters.

Application Layer/User Interface

Application Layer is where you start thinking about how to represent your business logic to customer. If the logic is complex or involving many consecutive requests, it is possible to create Facades.

Reaching this point, developers should think more about clients than the system. The major concerns should be customer's devices, UI responsiveness, load balance, stateless or stateful session, Restful API. This is the place for developers to showcase framework talent and knowledge.

For this layer, the better kind of test cases is functional/integration test. 

Similar as above, try your best to avoid having any business logic in Application Layer.

Why it is hard to write Unit Test in Rails?

Now, if you look back to Rails or Play framework, there is no clear separation of layers like above. The Controllers render inputs, outputs and may contains business logic as well. Similar behaviours applied if you use the ServletAPI without adding any additional layer. 

The Domain object in Rails is an active record and has a tight-coupling with database schema. 

Hence, for whatever unit of code that developers want to write test cases, the inputs and output are nots POJO. This make writing Unit Test tough.

We should not blame DHH for this design as he follow another philosophy of software development with many benefits like simple design, low development effort and quick feedback. However, I myself do not follow and adopt all of his ideas for developing enterprise applications. 

Some of his ideas like convention over configuration are great and did cause a major mindset change in developers world but other ideas end up as trade off. Being able to quickly bring up a website may later turn to troubles implementing features that Rails/Play do not support. 

  • Unit Test is hard to write if you business logic is tight-coupling to framework.
  • Focusing and developing business logic first may help you create better design.
  • Each kinds of components suit different kinds of test cases.
This is my own view of Testing. If you have any other opinions, please feedback.