Now and then, I talk about our usage of HBase and MapReduce. This post is about a recent piece of research that tries to increase IO performance for our MapReduce jobs. Although I am not able to discuss details beyond what is written on my LinkedIn profile, I will talk about general findings which may help others trying to achieve similar goals.

The HBase documentation and several posts on the hbase-user mailing list suggest that using some form of compression for storing data may lead to an increase in IO performance. Considering that Hadoop clusters almost always run on commodity machines, the reason is simple: disks are slow. The Hadoop workloads I know about are generally data-intensive, which makes data reads a bottleneck in overall application performance. By using some sort of compression we reduce the size of our data, achieving faster reads. On the other hand, we now need to uncompress that data, so we spend some CPU cycles. It is simply trading IO load for CPU load.

If the infrastructure starves on disk capacity but has no performance problems, it may make sense to use an algorithm that gives huge compression ratios, losing some time on CPU, but that is usually not the case. Large-capacity disks are far cheaper than fast storage solutions (think SSDs), so it is better for a compression algorithm to be fast than to achieve higher compression ratios. Because of that, Hadoop applications prefer LZO, a fast real-time compression library, over ZLIB variants. Of course, these are general observations; to see real performance changes and compression ratios, one has to try these algorithms with one's own data.
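As a concrete illustration of the LZO-over-ZLIB choice, here is roughly how LZO compression of intermediate map output can be enabled cluster-wide. This is a sketch, not our actual setup: the property names are the old-style `mapred.*` variants (newer Hadoop versions use `mapreduce.*` equivalents), and the `LzoCodec` class comes from the separately installed hadoop-lzo library.

```xml
<!-- mapred-site.xml: compress intermediate map output with LZO.
     Assumes the hadoop-lzo native libraries are installed on every node. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

For HBase, compression is instead configured per column family, e.g. `create 'mytable', {NAME => 'cf', COMPRESSION => 'LZO'}` in the hbase shell (the table and family names here are placeholders).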