From humble beginnings as a kitchen-table start-up in 1992, JoJo Maman Bébé offers gorgeous, quirky yet practical designs and excellent customer service while putting people and the planet before profit, and now has nearly 100 stores across the UK, Ireland and the US. JoJo Maman Bébé needed its website optimised to deal more efficiently with spikes in traffic. A queuing system was being used to take the load off the server, and an advanced analysis and testing strategy was required to identify key bottlenecks and optimise the database and infrastructure for maximum efficiency during peak traffic.
When spikes in traffic occurred, a queuing system was used to take load off the server: customers were held in a queue once CPU load reached a set threshold, and were only allowed to make their transaction once the load had dropped again. This self-regulating system allowed the site to keep operating even under extremely high numbers of users, but as server load continued to rise, so did the amount of time users had to wait.
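The admission logic described above can be sketched as follows. This is a minimal illustration, not the actual queuing product used on the site; the threshold value and the load cost per admission are hypothetical:

```python
from collections import deque

CPU_THRESHOLD = 0.80  # hypothetical cut-off; the real set point was tuned per server


class QueueGate:
    """Hold visitors in a FIFO queue while CPU load is above a threshold."""

    def __init__(self, threshold=CPU_THRESHOLD):
        self.threshold = threshold
        self.waiting = deque()

    def arrive(self, visitor, cpu_load):
        """Admit a visitor directly, or hold them if the server is busy."""
        if cpu_load < self.threshold and not self.waiting:
            return visitor  # admitted straight through
        self.waiting.append(visitor)
        return None  # held in the queue

    def tick(self, cpu_load):
        """Called periodically; releases queued visitors once load drops."""
        admitted = []
        while self.waiting and cpu_load < self.threshold:
            admitted.append(self.waiting.popleft())
            cpu_load += 0.05  # assume each admission adds some load
        return admitted


gate = QueueGate()
assert gate.arrive("alice", cpu_load=0.95) is None  # held: server busy
assert gate.arrive("bob", cpu_load=0.95) is None
released = gate.tick(cpu_load=0.70)  # load has dropped: drain the queue
```

The self-regulating behaviour falls out of the loop in `tick`: each admission nudges the simulated load upward, so the gate stops releasing visitors as soon as the threshold is reached again.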
CPU load issues are a common problem. Caching was in use to keep the site fast, i.e. pages were served from a cache rather than being generated afresh on each request. But for security reasons, checkout pages cannot be cached because they contain sensitive customer data. So with a good conversion rate (which JoJo Maman Bébé had), a large share of requests hit these uncacheable pages, and issues can arise.
It was of course imperative that every page that could be served from the cache was being served from it - an uncached page adds load to the database as well as the web server.
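One simple way to express that policy is per-route cache headers. The sketch below is a hypothetical illustration - the route prefixes and max-age are assumptions, not the site's actual configuration:

```python
# hypothetical route prefixes; a real store has more nuanced rules
UNCACHEABLE_PREFIXES = ("/checkout", "/customer", "/cart")


def cache_headers(path):
    """Return HTTP cache headers for a given request path."""
    if path.startswith(UNCACHEABLE_PREFIXES):
        # checkout and account pages carry sensitive data: never cache them
        return {"Cache-Control": "no-store"}
    # everything else may be served from cache
    return {"Cache-Control": "public, max-age=3600"}


assert cache_headers("/checkout/payment") == {"Cache-Control": "no-store"}
assert cache_headers("/girls/dresses")["Cache-Control"].startswith("public")
```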
To benchmark the site's performance - so we could show what effect any optimisations we made were having - we needed to simulate the high levels of traffic the website was receiving.
During quiet periods we could push the live site, up to a point, to simulate where it ground to a halt. This allowed us to rule out various issues, such as the hardware not functioning properly. To monitor and break down exactly how long pages took to load, we used a third-party software analytics system called New Relic.
Benchmarking the site involved making small, incremental changes and observing their effect on page load speeds. Bottlenecks were discovered in the MySQL database, where certain queries were taking too long to run.
These were optimised, improving MySQL performance by 100% (a query that took 2 seconds was reduced to 1 second). This was a good first step, but the problem persisted on the live site, and at a higher level, with double the amount of traffic.
Customer journey simulations
BlazeMeter (a load-testing and benchmarking platform) allowed us to model customer journeys through the site (what users looked at, when they left the site, and so on) and simulate those journeys; we could then replay these sessions at random to load-test the site.
This turned out to be a critical element in the problem-solving process: the tests had been looking only at the cached version of each page, without taking hole punching into account - the technique by which dynamic blocks (such as the cart) are served uncached inside otherwise cached pages - which dramatically distorted the test data being generated. The tests were altered to include the hole-punched requests, making them far more realistic.
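A simulated journey of that kind can be sketched as below. This is a minimal illustration only: the page URLs and the hole-punched block's query string are invented, and a real run would issue HTTP requests rather than collect paths in a list:

```python
import random

# hypothetical journey steps, as recorded from real customer sessions
JOURNEYS = [
    ["/", "/girls/dresses", "/product/summer-dress", "/cart/add"],
    ["/", "/sale", "/product/baby-grow"],
]


def simulate_journey(journey):
    """Replay one journey, including the uncached hole-punched requests."""
    requests = []
    for page in journey:
        requests.append(page)
        # every page view also triggers the hole-punched cart block, which
        # bypasses the page cache and hits the application directly - the
        # part the original tests were missing
        requests.append(page + "?block=cart&cache=bypass")
    return requests


session = simulate_journey(random.choice(JOURNEYS))
```

Interleaving the cache-bypassing requests is what makes the load profile realistic: the simulated traffic now exercises the application and database, not just the cache.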
Using the now-realistic test data, we used NetData, a server monitoring system that gives real-time statistics on server performance, to look at the way memory and hard drives were being used, ruling them out as bottlenecks. We also created a simulated (staging) environment identical to the live site, which allowed us to test uninhibited, without impacting the working site.
MySQL deadlock and query cache locking errors
The new testing allowed us to identify that the MySQL deadlock and query cache locking errors were occurring within the core database.
Database locking prevents simultaneous access to data in a database by locking tables - when a table is being written to, it is locked until that action has completed, to stop data being overwritten and becoming inconsistent. A problem that can result from this is deadlocking - when two or more transactions are each waiting for the other to give up a lock before they can complete, and they become gridlocked.
The same principle applies to query caching in MySQL - when a query reaches the database, the system checks to see if there is a cached result for the exact same query. If there is, it reuses it, saving time and resources - a valuable tool. But these queries also have to wait their turn: while one query holds the query cache lock, others are blocked until it finishes.
By editing the core database files, we debugged and identified the database locking errors as they occurred. Once an error was caught, the database was set up to run a loop and retry processing the order, up to five times. This fixed nearly all cases of database locking errors, making a huge difference to performance.
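The retry-on-deadlock loop is, in essence, the following. This is a sketch only: `DeadlockError` stands in for MySQL's deadlock error (error 1213), and the five-attempt limit matches the behaviour described above:

```python
MAX_ATTEMPTS = 5


class DeadlockError(Exception):
    """Stands in for MySQL error 1213 (ER_LOCK_DEADLOCK)."""


def retry_on_deadlock(operation, attempts=MAX_ATTEMPTS):
    """Run operation; on deadlock, retry up to `attempts` times in total."""
    for attempt in range(attempts):
        try:
            return operation()
        except DeadlockError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            # in the real system the transaction is rolled back here
            # before the order is retried


# demo: an order that deadlocks twice, then goes through on the third try
calls = {"n": 0}


def process_order():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("transactions gridlocked")
    return "order placed"


result = retry_on_deadlock(process_order)
```

Retrying works because a deadlock is transient: once one of the gridlocked transactions is rolled back, its locks are released and a fresh attempt can usually complete.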
By reducing the amount of time each query spent in the query cache, many more users could get through. Certain offending queries were also hitting the cache at around three-quarters of a million queries per minute. Disabling the query cache for those specific queries made the site 40x faster; and by writing a Magento extension that altered these top offending queries - the top five highest-volume queries - we got the Add to Cart page to load in a quarter of the time it took before (the slowest-loading page took 6 seconds before, and 2 seconds after).
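Bypassing the query cache for specific statements can be done with MySQL's `SQL_NO_CACHE` hint. Below is a hypothetical sketch of rewriting only the flagged high-volume queries - the query patterns are invented, and the real fix was a Magento extension rather than this helper:

```python
# hypothetical patterns matching the top-offending high-volume queries
HOT_QUERY_PATTERNS = ("FROM cataloginventory_stock", "FROM core_url_rewrite")


def bypass_query_cache(sql):
    """Add SQL_NO_CACHE to SELECTs flagged as high-volume offenders."""
    stripped = sql.lstrip()
    is_select = stripped.upper().startswith("SELECT ")
    is_hot = any(pattern in sql for pattern in HOT_QUERY_PATTERNS)
    if is_select and is_hot:
        # the hint tells MySQL not to store this result in the query cache
        # or look it up there, so the query skips the cache's lock entirely
        return stripped.replace("SELECT", "SELECT SQL_NO_CACHE", 1)
    return sql


rewritten = bypass_query_cache(
    "SELECT qty FROM cataloginventory_stock WHERE product_id = 1"
)
```

The win comes from contention, not caching: a query running hundreds of thousands of times per minute spends more time queuing for the query cache lock than it saves by reusing results, so exempting it helps everything else get through.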
The above changes led to some of the most significant increases in speed and performance we had seen throughout the problem-solving exercise: a 100% increase in performance. In total, after all of our alterations and problem-solving strategies, the website was able to run 5x faster, with page load times dropping significantly and the checkout process running smoothly, without hindrance to the customer, even during peak periods of traffic.