2011-05-09

Optimized websites

This article is about how to set up and develop a website in PHP.
At our company we develop a Learning Management System (LMS), and a couple of years ago we decided to rewrite the whole LMS from scratch and, at the same time, put all our experience to the best possible use.
We took everything into consideration: choice of web-server, scripting language, database backend, proxy caching on the client side (which has caused us problems before), and so on.
This is a guide based on our experiences going from nothing to version 2.0 of our LMS.

Web-server:
Our servers run Linux, so some years ago the choice of web-server was always Apache. But since about 75% of all the content that our servers send is static files, we decided to give nginx a try. Nginx is an asynchronous web-server that handles many connections in a small, fixed number of worker processes and serves files with a very small footprint and load on the system. Nginx cannot execute php-files itself the way Apache can with mod_php; instead you use fastcgi to execute php-files. The downside of Apache with mod_php is that every concurrent request is handled by its own process, and every process loads mod_php even when it only serves a static file, so each request uses a lot of memory to no advantage. Nginx will (when configured right) only pass a request on to fastcgi when it needs to execute php code.

PHP:
So the choice of nginx required us to execute php through fastcgi, and fortunately php now includes php-fpm, a fastcgi process manager for php.
Setting up php-fpm is basically like setting up Apache's process handling: you need to set the number of startup/maximum/minimum processes.
At first I was a bit worried that executing php through nginx/fastcgi/php-fpm would hurt performance, but after setting it up and benchmarking some php scripts, we found that the performance was about the same as with Apache/mod_php.
We also chose to use xcache, set up and optimized for the 4 cores we have on the webserver. Tuning xcache for several cores with the xcache.count parameter affects the locking of shared memory quite a bit.

Problems with web-server/PHP:
When we started to benchmark our LMS we ran into two important problems.

  1. The first problem was that PHP leaks memory. After running the web-server for a couple of hours we noticed that the php-fpm processes were getting bigger and bigger, and after a while the server started swapping, which affected the whole system.
    The solution was easy: just set the php-fpm parameter "pm.max_requests" to something like 20, which means that each process will only serve 20 requests and then shut down. Php-fpm handles this nicely and starts a new process if there are too few.
  2. When benchmarking under high load, we started to get a lot of php errors that turned out to be problems with the database. PHP could no longer connect to the database and gave us some undocumented error code in return.
    We use PHP's PDO to connect to a MySQL-server running on a different server, and the error turned out to mean that there were no available ports left.
    The solution again was very easy: just turn on persistent connections to the database. If you do not use persistent connections, the PHP script will connect to the database, make its requests and then close the connection. But a normal system will still keep that TCP port unavailable for another 60 seconds (the TCP TIME_WAIT state), and since there are only some thousand ports dedicated to this, you will eventually run out of ports.
    When we turn on persistent connections, the connection to the database is not closed unless the process is killed, and with our configuration that only happens every 20 requests.
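
To get a feel for how quickly the ports run out, here is a rough back-of-the-envelope calculation written as a small C++ sketch. The port range 32768-61000 and the 60 second wait are assumptions (common Linux defaults), not numbers measured on our servers, and the function name is made up for this example.

```cpp
#include <cassert>

// Rough estimate: if every closed connection blocks its local port for
// time_wait_seconds, the number of usable ports divided by that time is
// the highest rate of new connections that can be sustained.
constexpr int sustainable_connections_per_second(int first_port,
                                                 int last_port,
                                                 int time_wait_seconds) {
    return (last_port - first_port + 1) / time_wait_seconds;
}

// Assumed values: ephemeral port range 32768-61000, 60 seconds of waiting.
constexpr int assumed_limit =
    sustainable_connections_per_second(32768, 61000, 60);
```

Under those assumptions a web-server that opens one new database connection per request can only sustain a few hundred requests per second before connects start to fail, which is easy to reach in a benchmark.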


Static files:
When we developed the new version of our LMS we made some very good decisions based on previous experience.
We divide our static files into two kinds:

  1. Shared files: These are files used everywhere, basically javascript (like jquery, json, uploadify, etc), stylesheets and images.
  2. Uploaded files: In an LMS the administrators upload all the content, like documents and SCORM-packages.

Ground rule: NEVER UPDATE STATIC FILES!!
This means that if we update any of the shared files, we make a new revision in a new path. For instance we go from http://example.com/shared/r100/jquery.js to http://example.com/shared/r101/jquery.js if we upgrade jquery.
If the administrators upload any new content, we always update the path to the content, like: http://example.com/data/documents/1/mydoc.pdf to http://example.com/data/documents/2/mydoc.pdf
If we follow this rule, we can set up nginx with the setting "expires max" on all our static files. This basically means that once a browser has downloaded a file, it will not even check whether the file has been updated. If we update a file, the path changes and the browser sees it as a new file to download.
This also takes care of any misconfigured proxy server that doesn't check for updated files as it should. Proxy servers work very well with this approach, since they can take over the load of serving static files.
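
The rule is easy to enforce if every link to a static file is built by one helper that puts the revision into the path. Below is a minimal sketch in C++; our LMS is written in PHP, so the function names are invented for this example and only the path layout is taken from the URLs above.

```cpp
#include <cassert>
#include <string>

// Build the path to a shared file for a given revision number,
// e.g. revision 101 and "jquery.js" give "/shared/r101/jquery.js".
std::string shared_file_path(int revision, const std::string& file) {
    return "/shared/r" + std::to_string(revision) + "/" + file;
}

// Build the path to an uploaded document. Every new upload of the same
// document gets a new version number and therefore a new path.
std::string document_path(int version, const std::string& file) {
    return "/data/documents/" + std::to_string(version) + "/" + file;
}
```

Since the revision is part of the path, upgrading a file never touches a URL that a browser or proxy may already have cached.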

Rule: Compress appropriate static files.
For our shared files, we have made a script that goes through all the javascript and stylesheets, minifies them and then gzips them. So in the "shared" folder there is always both a myscript.js and a myscript.js.gz. Now we can turn on the nginx setting "gzip_static on", which makes nginx check whether the browser can handle compressed files and, if so, serve the gzipped file instead.

With all this set up I compared the output before and after. Our front page was initially about 850 KB and included a lot of images, stylesheets and javascript. After all these settings the front page went down to 120 KB, and since the static files aren't served again thanks to "expires max", clicking on a link is very fast because only the page itself needs to be served.
We have also turned on nginx's "gzip on" for the output of our php files.

Bottlenecks:
Our web application is very fast. We have a well normalized and indexed database running on one backend server, and our webapp is developed using Zend Framework. When we load test the LMS, the bottleneck actually isn't the database as you would normally expect; instead PHP is the main bottleneck. At maximum load on the web-server (4 cores), the database server (1 core) only has about 1/4 of the load of the web-server. We are looking at different ways to improve this, like optimizing the autoloader or maybe even using the HipHop for PHP compiler. The problem is not memory on the web-server; there is plenty of memory left, and the maximum number of php-fpm processes is never reached under maximum load.

Conclusion:
After optimizing our application, we score 91/100 in the "Page Speed" Firebug plugin, and we have no problems at all with high load on our servers. But it's always fun to optimize web applications further, since I'm a bit allergic to slow webapps :)
Any suggestions on how to optimize this further are appreciated.

2011-02-08

Howto design a webapp C++ MVC API part 2

Doing an MVC implementation that is intuitive in C++ isn't very easy. I'm thinking about how to implement the controllers. Should the implementations inherit from the controller-class, or should it provide some kind of callback/signal-system? For some reason I seem to be against inheriting from the controller, but compared to a signal/slot or similar callback system, inheritance is probably the best and most intuitive option.

So I'm thinking something like this:
-----
// Create a webapp
forumApp myforum;
http_server.add_webapp("forum",myforum); // add the forumApp to the "/forum/"-path
-----
// Looking at the forumApp-class, it inherits the webapp_controller
class forumApp : public webapp_controller{
public:
    // Set up the "sub" controllers in a static method
    static void add_controllers(dispatcher::ptr dispatcher){
        // Add the controller "forumAdministration" to the subpath "/forum/admin/"
        dispatcher->add_controller<forumAdministration>("admin");
        // Add the controller "forumThread" to the subpath "/forum/thread/"
        dispatcher->add_controller<forumThread>("thread");
        // Also add actions related to this controller
        dispatcher->add_action(&forumApp::about,"about");
    }

    // Action reachable at "/forum/about"
    void about(connection &con){
        con.send_response("this is the best forumapp ever");
    }
};
-------
So, these are just initial thoughts. Writing about this helps me figure out what the code would look like, and who knows, maybe I can catch someone's attention.
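
One way the add_action registration could work behind the scenes is to store pointers to member functions in a map keyed by the action name. The following is only a minimal sketch under that assumption: the connection struct and the action_dispatcher template are simplified stand-ins invented for this example, not the real classes from my project, and the forumApp here leaves out the webapp_controller base.

```cpp
#include <cassert>
#include <map>
#include <string>

// Simplified stand-in for the connection class: it just collects the response.
struct connection {
    std::string body;
    void send_response(const std::string& content) { body += content; }
};

// Hypothetical dispatcher that maps action names to member functions
// of one controller type.
template <typename Controller>
class action_dispatcher {
public:
    typedef void (Controller::*action)(connection&);

    void add_action(action method, const std::string& name) {
        actions_[name] = method;
    }

    // Runs the action registered under name on the given controller.
    // Returns false if no such action exists.
    bool dispatch(Controller& controller, const std::string& name,
                  connection& con) const {
        typename std::map<std::string, action>::const_iterator it =
            actions_.find(name);
        if (it == actions_.end()) return false;
        (controller.*(it->second))(con);
        return true;
    }

private:
    std::map<std::string, action> actions_;
};

class forumApp {
public:
    static void add_controllers(action_dispatcher<forumApp>& dispatcher) {
        dispatcher.add_action(&forumApp::about, "about");
    }

    void about(connection& con) {
        con.send_response("this is the best forumapp ever");
    }
};
```

With something like this, the server would only need to look up the last part of the path in the map and call dispatch with the controller and the current connection.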

Next up is how views should work and what type of template-system to use, or not to use.

2011-02-06

Howto design a webapp C++ MVC API?

So I decided to start on my webspeed-project by writing a http/fastcgi server. So far I'm just implementing the http-server itself, but I'm designing it so that a typedef will turn it into a fastcgi server instead.
The usage is very simple: just create a doeplus::webapp::http_server, set the callback and then send content to the doeplus::webapp::connection provided in the callback.
It looks like this:

int new_connection(doeplus::webapp::connection::ptr connection){
    std::string content("Send this to the browser 200 times. ");
    for(int i(0);i!=200;++i){
        connection->send_response(content,false);
    }
    connection->send_response(" Finished",true);
    return 0;
}

int main(int argc, char *argv[]) {
    boost::asio::io_service io_service;
    doeplus::webapp::http_server myserver(io_service,"127.0.0.1",9010);
    myserver.connection_callback() = &new_connection;
    io_service.run();
    return 0;
}

Pretty easy, right? But it's not ready for production yet, because I still have the major task in front of me: I will implement an MVC design for webapps.

What does a good MVC implementation look like in C++?
Not an easy question. Let's see what we are trying to accomplish:

Classic webapps map the address to the filesystem and associate a file extension with something being executed. Instead of this approach I'd like to map the path to different controllers. Modern webapps use addresses like this one: http://example.com/myapp/path/to/what/im/doing/dosomething
Breaking down the path, we have one "root" path to the webapp, "myapp". Next we have a path to something related to what we are doing, and last we have the path to the action we would like to take.
So with this design we can break down the path into different kinds of controllers. First we have a WebappController "myapp". Next we have several PathControllers "path", "to", "what", "im", "doing". Finally we have an "Action" called "dosomething" in the last controller, "doing".
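
The breakdown above can be sketched in a few lines of C++. This is just an illustration of the idea; the route struct and the function names are invented for this example and are not part of the server yet.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split "/a/b/c" into {"a", "b", "c"}, ignoring empty segments.
std::vector<std::string> split_path(const std::string& path) {
    std::vector<std::string> segments;
    std::string current;
    for (std::string::size_type i = 0; i != path.size(); ++i) {
        if (path[i] == '/') {
            if (!current.empty()) {
                segments.push_back(current);
                current.clear();
            }
        } else {
            current += path[i];
        }
    }
    if (!current.empty()) segments.push_back(current);
    return segments;
}

// Hypothetical result of resolving a path.
struct route {
    std::string webapp;                        // first segment
    std::vector<std::string> path_controllers; // everything in between
    std::string action;                        // last segment
};

// Returns false if the path is too short to contain a webapp and an action.
bool resolve(const std::string& path, route& out) {
    std::vector<std::string> segments = split_path(path);
    if (segments.size() < 2) return false;
    out.webapp = segments.front();
    out.action = segments.back();
    out.path_controllers.assign(segments.begin() + 1, segments.end() - 1);
    return true;
}
```

For "/myapp/path/to/what/im/doing/dosomething" this gives the webapp "myapp", the five path controllers from "path" to "doing", and the action "dosomething".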

I can see other kinds of information that can be hidden in the path itself. Take this example:
We have a forum-webapp with an administration section where the administrators can edit users' information. The path to a page like that could look like this:
http://example.com/forum/admin/users/35/edit
Using a path like this we can easily go from the "edit" action to the "editpermission" action of the same user with a simple relative link to "editpermission", without having to provide the user's id (35) again.
With this in mind, we probably need to provide a type of controller that can handle more dynamic paths - maybe a RegexpPathController.
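
To show what I mean by dynamic paths, here is a minimal sketch using std::regex. The pattern, the user_route struct and the function names are made up for this example; a real RegexpPathController would of course take the pattern as a parameter.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical result of matching the users path.
struct user_route {
    int user_id;
    std::string action;
};

// Match paths like "/forum/admin/users/35/edit" and pull out the
// user id and the action. The id is limited to 9 digits so that
// std::stoi cannot overflow.
bool match_user_route(const std::string& path, user_route& out) {
    static const std::regex pattern(
        "^/forum/admin/users/([0-9]{1,9})/([a-z]+)$");
    std::smatch match;
    if (!std::regex_match(path, match, pattern)) return false;
    out.user_id = std::stoi(match[1].str());
    out.action = match[2].str();
    return true;
}

// Resolve a relative link to another action on the same user by
// replacing the last segment of the path.
std::string sibling_action(const std::string& path,
                           const std::string& action) {
    return path.substr(0, path.rfind('/') + 1) + action;
}
```

The second function shows why the link to "editpermission" does not need the user id: the id is already part of the path that the link is relative to.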

So far we have 3 different controller types:
  • WebappController: This controller needs to define the path to the webapp itself and maybe even which virtual host to use. The path to the webapp should also be allowed to be just the root "/".
  • PathController: This should be a subpath to another controller, so that you can add the PathController to your WebappController or to another PathController.
  • RegexpPathController: Just like the PathController, but can also contain dynamic content.
There are probably other controllers to think of (like an ExtensionController), but these are the ones I can come up with right now.
The next thing is how an "action" will work. The first design that comes to mind is that an "action" is a method (function) in the controller, but there is a lot more to think about.

What exactly will happen when a user of the webapp directs the browser to http://example.com/forum/admin/users/35/edit?

This is the big question that I've been asking myself. What is the best design for this?
Let's assume that we have somehow built a hierarchy of the controllers and actions when the webapp was started.
We could just look up in the controller-hierarchy which action to execute, provide the necessary information to the action and execute it. This assumes, though, that each action is threadsafe, since several simultaneous calls can be made to the same action.
Another option is to look up the path in the controller-hierarchy and create a copy of the controller-object for that path. This approach may be better, since the controllers can then hold user-specific information like a User-object, and the controllers can contain helper-methods for authentication and things like that.
Maybe I should implement different kinds of controllers, depending on whether I want a unique controller-object for each call or a "static" object for the whole server?
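
The two options could live side by side if the hierarchy stores a factory for each controller instead of the controller itself. This is only a sketch of that idea; the controller interface, profile_controller and the two factory helpers are all invented for this example.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Hypothetical minimal controller interface.
struct controller {
    virtual ~controller() = default;
    virtual std::string handle(const std::string& user) = 0;
};

// A controller that keeps user-specific state. This is safe with one
// object per call, but would need locking if one object were shared
// between simultaneous calls.
struct profile_controller : controller {
    std::string current_user;
    std::string handle(const std::string& user) override {
        current_user = user;
        return "profile of " + current_user;
    }
};

typedef std::function<std::shared_ptr<controller>()> controller_factory;

// Option 1: a fresh controller-object for every call.
template <typename T>
controller_factory per_request() {
    return [] { return std::make_shared<T>(); };
}

// Option 2: one "static" object for the whole server.
template <typename T>
controller_factory shared_instance() {
    std::shared_ptr<T> instance = std::make_shared<T>();
    return [instance] { return instance; };
}
```

The dispatcher would then call the factory for every request and would not need to know which of the two kinds it got.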

I'm not done yet, and I still have several design questions to decide on.
More to come... soon...