Two ways to improve HadoopDB
- Integrate the file system for local database with HDFS. In the design of HadoopDB, the file system and the database are separated, which means that the more space one part is preallocated and the less space the other part will occupy. This design is not scalable because the system administrator has to partition the space by hand. We could investigate a way to let the database run on the HDFS directly. The database's file descriptor can be adapted to HDFS directly. Or, even better, we can add an extra virtualized layer between the database and the HDFS.
- Set up a better scheduler. It is mentioned in the paper that the original version of hadoop fair scheduler is not so good. It is queue-based and locally greedy and simply does not consider the big picture much. "Quincy: Fair Scheduling for Distributed Computing Clusters" advances a flow-based algorithm, which has a very good performance by increasing the throughput up to 40%.
Large analytical data management with commodity clusters
- Performance and efficiency
- is not for structured data analysis
- scale well
- open source, without cost
- scalability, fault-tolerance, and flexibility
- Previous parallel db fit into tens of nodes, not thousand of nodes.
- not scale well for failures, heterogeneous machines, performance untested
- MapReduce (Hadoop) + parallel db (or single-node DBs) = HadoopDB
Translation layer: Hive
Communication layer: Hadoop
Database layer: PostgreSQL (or MySQL, ... JDBC)
- Database Connector
- Interface among dbs on nodes
- extends InputFormat class -> InputFormat Implementation lib
- metainformation as XML = connection para + metadata
- Data Loader
- Global Hasher: MR, repartition raw data from HDFS to NO. of nodes
- Local Hasher: copy from HDFS, repartition the partition into chunks
- Better load balance than Hadoop
- SQL - MR - SQL (SMS) Planner
- extend Hive for
- open and low cost
- each table stored separately in HDFS, low performance in multi-table trans.
- intercept normal Hive flow in
- update MetaStore before query execution
- between query plan generation and MR jobs
retrieve fields, determine partition keys
traverse DAG bottom-up
only support filter, select, aggregation
- Extend performance  with fault tolerance and heterogeneous node exp
- TODO: modify the current task scheduler, connect not straggler node but the replicas.
- Exp on EC2
- HadoopDB's performance < parallel db for
PostgreSQL (not column-store, not compression)
Hadoop and Hive are young
- performance, heterogeneous environ., fault tolerance, flexibility
 A Comparison of Approaches to Large Scale Data Analysis
 Scope  Hive  C-store  What is the right way to measure scale?